You want to migrate an Apache Spark 3 batch job from on-premises to Google Cloud. You need to minimally change the job so that the job reads from Cloud Storage and writes the result to BigQuery. Your job is optimized for Spark, where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings. You want to minimize installation and management effort to run your job. What should you do?
Answer : A
You migrated a data backend for an application that serves 10 PB of historical product data for analytics. Only the last known state for a product, which is about 10 GB of data, needs to be served through an API to the other applications. You need to choose a cost-effective persistent storage solution that can accommodate the analytics requirements and the API performance of up to 1000 queries per second (QPS) with less than 1 second latency. What should you do?
Answer : D
You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage. The data will be used to create consumer analyses. You need to follow data privacy requirements, including protecting certain sensitive data elements, while also retaining all of the data for potential future use cases. What should you do?
Answer : A
You are designing a messaging system by using Pub/Sub to process clickstream data with an event-driven consumer app that relies on a push subscription. You need to configure a messaging system that is reliable enough to handle temporary downtime of the consumer app. You also need the messaging system to store the input messages that cannot be consumed by the subscriber. The system needs to retry failed messages gradually, avoiding overloading the consumer app, and store the failed messages after a maximum of 10 retries in a topic. How should you configure the Pub/Sub subscription?
Answer : D
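The requirements in the question (gradual retries, then parking the message in a topic after 10 failed attempts) can be sketched as a small simulation. This is only an illustration in plain Python, not the Pub/Sub client API: the function names, the doubling rule, and the 10 s and 600 s bounds are assumptions chosen for the example.

```python
def backoff_delays(min_backoff_s, max_backoff_s, count):
    """Return `count` wait times that double each time, capped at max_backoff_s.

    The doubling rule is an illustrative assumption, not Pub/Sub's exact schedule.
    """
    delays = []
    delay = min_backoff_s
    for _ in range(count):
        delays.append(min(delay, max_backoff_s))
        delay *= 2
    return delays


def deliver_with_dead_letter(message, handler, max_delivery_attempts=10,
                             min_backoff_s=10, max_backoff_s=600):
    """Try to deliver `message`; after the last failed attempt, dead-letter it.

    `handler` is a hypothetical stand-in for the push endpoint: it returns
    True when the consumer app acknowledges the message.
    """
    delays = backoff_delays(min_backoff_s, max_backoff_s, max_delivery_attempts - 1)
    dead_letter_topic = []  # stands in for the dead-letter topic
    waited_s = 0
    for attempt in range(1, max_delivery_attempts + 1):
        if handler(message):
            return {"delivered": True, "attempts": attempt,
                    "waited_s": waited_s, "dead_letter": dead_letter_topic}
        if attempt < max_delivery_attempts:
            waited_s += delays[attempt - 1]  # back off before the next attempt
    dead_letter_topic.append(message)
    return {"delivered": False, "attempts": max_delivery_attempts,
            "waited_s": waited_s, "dead_letter": dead_letter_topic}


if __name__ == "__main__":
    # Consumer app is down the whole time: the message ends up dead-lettered.
    print(deliver_with_dead_letter("click-1", lambda m: False))
```

The growing delays are what keep a recovering consumer app from being flooded, and the attempt limit is what bounds how long a bad message keeps circulating.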
You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?
Answer : C
To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here's why:
Compatibility with Hive:
Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.
This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.
Cost-Effective Storage:
Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.
Cloud Storage allows you to store large datasets at a lower cost compared to other storage options.
Hive Integration:
Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.
This setup enables the use of existing Hive queries and scripts without significant modifications.
Steps to Implement:
Copy ORC Files to Cloud Storage:
Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.
Deploy Dataproc Cluster:
Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.
Configure Hive:
Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.
Provide Access to Data Scientists:
Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.
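The "Configure Hive" step above can be sketched as a helper that builds the external table statement. This is a minimal sketch: the table name, column list, and the gs://example-bucket path are hypothetical placeholders, and the statement would still need to be run through Hive on the cluster.

```python
def hive_external_table_ddl(table, columns, gcs_location):
    """Build a Hive DDL statement for an external ORC table.

    `columns` is a list of (name, hive_type) pairs. All names and the
    Cloud Storage path passed in are caller-supplied examples.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{gcs_location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket, for illustration only.
    print(hive_external_table_ddl(
        "sales.orders",
        [("order_id", "BIGINT"), ("amount", "DOUBLE")],
        "gs://example-bucket/warehouse/orders/",
    ))
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the ORC files in Cloud Storage untouched.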
A web server sends click events to a Pub/Sub topic as messages. The web server includes an event Timestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages shows no more than 1 second lag between their event Timestamp and publish Time. What is the problem and what should you do?
Answer : B
To ensure that the advertising department receives messages within 30 seconds of the click occurrence, and given the current system lag and data freshness metrics, the issue likely lies in the processing capacity of the Dataflow job. Here's why option B is the best choice:
System Lag and Data Freshness:
The system lag of 5 seconds indicates that Dataflow itself is processing messages relatively quickly.
However, the data freshness of 40 seconds suggests a significant delay before processing begins, indicating a backlog.
Backlog in Pub/Sub Subscription:
A backlog occurs when the rate of incoming messages exceeds the rate at which the Dataflow job can process them, causing delays.
Optimizing the Dataflow Job:
To handle the incoming message rate, the Dataflow job needs to be optimized or scaled up by increasing the number of workers, ensuring it can keep up with the message inflow.
Steps to Implement:
Analyze the Dataflow Job:
Inspect the Dataflow job metrics to identify bottlenecks and inefficiencies.
Optimize Processing Logic:
Optimize the transformations and operations within the Dataflow pipeline to improve processing efficiency.
Increase Number of Workers:
Scale the Dataflow job by increasing the number of workers to handle the higher load, reducing the backlog.
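The sizing reasoning above can be sketched as simple arithmetic. This is a rough estimate only: the message rates, the per-worker throughput, and the drain window are hypothetical inputs that would come from the job's own metrics, and the function names are illustrative.

```python
import math


def meets_deadline(data_freshness_s, deadline_s=30):
    """True when the oldest unprocessed data is within the delivery deadline."""
    return data_freshness_s <= deadline_s


def workers_needed(incoming_msgs_per_s, per_worker_msgs_per_s,
                   backlog_msgs=0, drain_seconds=60):
    """Estimate workers needed to keep up with input and drain a backlog.

    Assumes throughput scales linearly with worker count, which is a
    simplification of how a real pipeline behaves.
    """
    required_rate = incoming_msgs_per_s + backlog_msgs / drain_seconds
    return math.ceil(required_rate / per_worker_msgs_per_s)


if __name__ == "__main__":
    # Figures from the question: 40 s data freshness against a 30 s deadline.
    print(meets_deadline(40))  # False
    # Hypothetical rates: 5000 msg/s in, 600 msg/s per worker, 60000 backlog.
    print(workers_needed(5000, 600, backlog_msgs=60000, drain_seconds=60))
```

The second term in the required rate matters: matching the incoming rate alone stops the backlog from growing but never clears the messages already waiting.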
You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?
Answer : D
https://cloud.google.com/blog/products/data-analytics/manage-bigquery-costs-with-custom-quotas.
Here's why creating a BigQuery reservation for the project is the most suitable solution:
Project-Level Reservation: A BigQuery reservation takes effect through an assignment, and an assignment targets a project, folder, or organization. Assigning the reservation to the project means that the reserved slots (processing capacity) are shared across all jobs and queries running within that project. Since your CDC process is a significant contributor to your BigQuery usage, reserving slots for the entire project ensures that your CDC process always has access to the necessary resources, regardless of other activities in the project.
Predictable Cost Model: Reservations provide a predictable cost model. Instead of paying the on-demand price for the bytes each query scans, you pay for the slot capacity you reserve. This removes the variability of on-demand billing, making it easier to budget and forecast your BigQuery expenses.
BigQuery Monitoring: You can use BigQuery Monitoring to analyze the historical usage patterns of your CDC process and other queries within your project. This information helps you determine the appropriate amount of slots to reserve, ensuring that you have enough capacity to handle your workload while optimizing costs.
Why other options are not suitable:
A. Create a BigQuery reservation for the job: BigQuery does not support reservations at the individual job level. Reservations are assigned to projects, folders, or organizations.
B. Create a BigQuery reservation for the service account running the job: Reservation assignments target a project, folder, or organization, not an individual user or service account. A project-level assignment covers all jobs within the project, regardless of the service account used.
C. Create a BigQuery reservation for the dataset: BigQuery does not support reservations at the dataset level.
By creating a BigQuery reservation for your project based on your utilization analysis, you can achieve a predictable cost model while ensuring that your CDC process and other queries have the necessary resources to run smoothly.
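The scale of the scan volume behind this reasoning can be sketched with back-of-the-envelope arithmetic. The 10 TB target and 10-minute interval come from the question. The assumption that every merge scans the full target table is a worst case, and both prices are hypothetical parameters to be replaced with current list prices.

```python
def merges_per_day(interval_minutes):
    """Number of CDC merges per day at a fixed interval."""
    return 24 * 60 // interval_minutes


def tb_scanned_per_month(target_tb, interval_minutes, days=30):
    """TB scanned per month, assuming each merge scans the whole target table."""
    return target_tb * merges_per_day(interval_minutes) * days


def on_demand_cost(tb_scanned, price_per_tb):
    """On-demand cost grows with bytes scanned (price is a caller-supplied assumption)."""
    return tb_scanned * price_per_tb


def reservation_cost(slots, slot_hour_price, days=30):
    """Reservation cost depends only on reserved capacity and time, not on scans."""
    return slots * 24 * days * slot_hour_price


if __name__ == "__main__":
    scanned = tb_scanned_per_month(target_tb=10, interval_minutes=10)
    print(merges_per_day(10), scanned)  # 144 merges/day, 43200 TB/month
```

The reservation cost has no term for scanned bytes, which is why it stays predictable as the CDC volume changes, while the slot count itself should be taken from the utilization data in BigQuery Monitoring.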