One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?
Answer : C
To re-encrypt all of your CMEK-protected Cloud Storage data after a key has been exposed, and to ensure future writes are protected with a new key, creating a new Cloud KMS key and a new Cloud Storage bucket is the best approach. Here's why option C is the best choice:
Re-encryption of Data:
By creating a new Cloud Storage bucket and copying all objects from the old bucket to the new bucket while specifying the new Cloud KMS key, you ensure that all data is re-encrypted with the new key.
This process effectively re-encrypts the data, removing any dependency on the compromised key.
Ensuring CMEK Protection:
Creating a new bucket and setting the new CMEK as the default ensures that all future objects written to the bucket are automatically protected with the new key.
This reduces the risk of objects being written without CMEK protection.
Deletion of Compromised Key:
Once the data has been copied and re-encrypted, the old key can be safely deleted from Cloud KMS, eliminating the risk associated with the compromised key.
Steps to Implement:
Create a New Cloud KMS Key:
Create a new encryption key in Cloud KMS to replace the compromised key.
Create a New Cloud Storage Bucket:
Create a new Cloud Storage bucket and set the default CMEK to the new key.
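As a sketch, this step can be scripted with gsutil; the bucket name, location, key ring, and key names below are placeholders for your own resources:
# Create the new bucket and set the new key as its default CMEK.
gsutil mb -l us-central1 gs://new-bucket
gsutil kms encryption -k projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/new-key gs://new-bucket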
Copy and Re-encrypt Data:
Use the gsutil tool to copy data from the old bucket to the new bucket; because the new key is the new bucket's default CMEK, the copied objects are encrypted with it on write:
gsutil cp -r gs://old-bucket/* gs://new-bucket/
Delete the Old Key:
After ensuring all data is copied and re-encrypted, delete the compromised key from Cloud KMS.
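As a sketch, the destruction can be initiated with gcloud once you have verified that no objects still reference the old key; the key, key ring, location, and version number below are placeholders, and destruction is scheduled rather than immediate:
# Schedule destruction of the compromised key version.
gcloud kms keys versions destroy 1 --key=old-key --keyring=my-keyring --location=us-central1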
Cloud KMS Documentation
Cloud Storage Encryption
Re-encrypting Data in Cloud Storage
You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?
Answer : D
You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?
Answer : A
To centralize unstructured data from various sources into Cloud Storage using a GUI-based solution while allowing the use of your own encryption keys, Cloud Data Fusion is the most suitable option. Here's why:
Cloud Data Fusion:
Cloud Data Fusion is a fully managed, cloud-native data integration service for building and managing ETL pipelines through a visual interface.
It supports a wide range of data sources and formats, including Apache Parquet and CSV, and provides a user-friendly GUI for pipeline creation and management.
Custom Encryption Keys:
Cloud Data Fusion allows the use of customer-managed encryption keys (CMEK) for data encryption, ensuring that your data is securely stored according to your encryption policies.
Centralizing Data:
Cloud Data Fusion simplifies the process of moving data from on-premises and cloud sources into Cloud Storage, providing a centralized repository for your unstructured data.
Steps to Implement:
Set Up Cloud Data Fusion:
Deploy a Cloud Data Fusion instance and configure it to connect to your various data sources.
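If you prefer to script the provisioning step, a minimal sketch is shown below, assuming the gcloud beta data-fusion command surface; the instance name and region are placeholders, and the CMEK for the instance is configured as part of instance creation:
# Provision a Cloud Data Fusion instance in the chosen region.
gcloud beta data-fusion instances create my-datafusion-instance --location=us-central1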
Create ETL Pipelines:
Use the GUI to create data pipelines that extract data from your sources and load it into Cloud Storage. Configure the pipelines to use your custom encryption keys.
Run and Monitor Pipelines:
Execute the pipelines and monitor their performance and data movement through the Cloud Data Fusion dashboard.
Cloud Data Fusion Documentation
Using Customer-Managed Encryption Keys (CMEK)
The data analyst team at your company uses BigQuery for ad-hoc queries and scheduled SQL pipelines in a Google Cloud project with a slot reservation of 2000 slots. However, with the recent introduction of hundreds of new non time-sensitive SQL pipelines, the team is encountering frequent quota errors. You examine the logs and notice that approximately 1500 queries are being triggered concurrently during peak time. You need to resolve the concurrency issue. What should you do?
Answer : C
To resolve the concurrency issue in BigQuery caused by the introduction of hundreds of non-time-sensitive SQL pipelines, the best approach is to differentiate the types of queries based on their urgency and resource requirements. Here's why option C is the best choice:
SQL Pipelines as Batch Queries:
Batch queries in BigQuery are designed for non time-sensitive operations. They are queued and start only when idle slots become available, which shifts their slot consumption away from peak times instead of competing with interactive workloads.
By converting non-time-sensitive SQL pipelines to batch queries, you can significantly alleviate the pressure on slot reservations.
Ad-Hoc Queries as Interactive Queries:
Interactive queries are prioritized to run immediately and are suitable for ad-hoc analysis where users expect quick results.
Running ad-hoc queries as interactive jobs ensures that analysts can get their results without delay, improving productivity and user satisfaction.
Concurrency Management:
This approach helps balance the workload by leveraging BigQuery's ability to handle different types of queries efficiently, reducing the likelihood of encountering quota errors due to slot exhaustion.
Steps to Implement:
Identify Non-Time-Sensitive Pipelines:
Review and identify SQL pipelines that are not time-critical and can be executed as batch jobs.
Update Pipelines to Batch Queries:
Modify these pipelines to run as batch queries. This can be done by setting the priority of the query job to BATCH.
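As a sketch, a non time-sensitive pipeline query can be submitted at batch priority with the bq tool; the SQL and table name are placeholders, and omitting --batch submits the query at the default interactive priority:
# Submit the pipeline query at batch priority so it waits for idle slots.
bq query --batch --nouse_legacy_sql 'SELECT COUNT(*) FROM `my-project.my_dataset.my_table`'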
Ensure Ad-Hoc Queries are Interactive:
Ensure that all ad-hoc queries are submitted as interactive jobs, allowing them to run with higher priority and immediate slot allocation.
BigQuery Batch Queries
BigQuery Slot Allocation and Management
You are migrating your on-premises data warehouse to BigQuery. One of the upstream data sources resides on a MySQL database that runs in your on-premises data center with no public IP addresses. You want to ensure that the data ingestion into BigQuery is done securely and does not go through the public internet. What should you do?
Answer : D
To securely ingest data from an on-premises MySQL database into BigQuery without routing through the public internet, using Datastream with Private connectivity over Cloud Interconnect is the best approach. Here's why:
Datastream for Data Replication:
Datastream provides a managed service for data replication from various sources, including on-premises databases, to Google Cloud services like BigQuery.
Cloud Interconnect:
Cloud Interconnect establishes a private connection between your on-premises data center and Google Cloud, ensuring that data transfer occurs over a secure, private network rather than the public internet.
Private Connectivity:
Using Private connectivity with Datastream leverages the established Cloud Interconnect to securely connect your on-premises MySQL database with Google Cloud. This method ensures that the data does not traverse the public internet.
Encryption:
Selecting Server-only encryption in the MySQL connection profile ensures that data is encrypted in transit between Datastream and the source database, adding an extra layer of security.
Steps to Implement:
Set Up Cloud Interconnect:
Establish a Cloud Interconnect between your on-premises data center and Google Cloud to create a private connection.
Configure Datastream:
Set up Datastream to use Private connectivity as the connection method and allocate an IP address range within your VPC network.
Use Server-only encryption to ensure secure data transfer.
Create Connection Profile:
Create a connection profile in Datastream to define the connection parameters, including the use of Cloud Interconnect and Private connectivity.
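A rough sketch of creating the private connectivity configuration referenced in these steps, assuming the gcloud datastream private-connections surface; the connection name, region, VPC name, and /29 range are placeholders for your own network:
# Create a Datastream private connection that peers with your VPC, which in turn
# reaches the on-premises network over Cloud Interconnect.
gcloud datastream private-connections create my-private-connection --location=us-central1 --display-name=my-private-connection --vpc=my-vpc --subnet=10.10.0.0/29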
Datastream Documentation
Cloud Interconnect Documentation
Setting Up Private Connectivity in Datastream
You migrated your on-premises Apache Hadoop Distributed File System (HDFS) data lake to Cloud Storage. The data scientist team needs to process the data by using Apache Spark and SQL. Security policies need to be enforced at the column level. You need a cost-effective solution that can scale into a data mesh. What should you do?
Answer : D
You are creating the CI/CD cycle for the code of the directed acyclic graphs (DAGs) running in Cloud Composer. Your team has two Cloud Composer instances: one instance for development and another instance for production. Your team is using a Git repository to maintain and develop the code of the DAGs. You want to deploy the DAGs automatically to Cloud Composer when a certain tag is pushed to the Git repository. What should you do?
Answer : C
To automate the CI/CD pipeline for the DAGs running in Cloud Composer, the following approach ensures that DAGs are tested and deployed in a streamlined and efficient manner.
Use Cloud Build for Development Instance Testing:
Use Cloud Build to automate the process of copying the DAG code to the Cloud Storage bucket of the development instance.
Cloud Composer automatically picks up DAGs from this bucket, so the new code can be exercised and tested in the development environment.
Testing and Validation:
Ensure that the DAGs run successfully in the development environment.
Validate the functionality and correctness of the DAGs before promoting them to production.
Deploy to Production:
If the DAGs pass all tests in the development environment, use Cloud Build to copy the tested DAG code to the Cloud Storage bucket of the production instance.
This ensures that only validated and tested DAGs are deployed to production, maintaining the stability and reliability of the production environment.
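A minimal sketch of the copy commands that the Cloud Build steps would run, using gcloud; the environment names, location, and dags/ folder are placeholders for your own setup:
# Deploy the DAGs from the repository checkout to the development environment.
gcloud composer environments storage dags import --environment=composer-dev --location=us-central1 --source=dags/
# After the DAGs pass testing, promote the same code to the production environment.
gcloud composer environments storage dags import --environment=composer-prod --location=us-central1 --source=dags/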
Simplicity and Reliability:
This approach leverages Cloud Build's capabilities for automation and integrates seamlessly with Cloud Composer's reliance on Cloud Storage for DAG storage.
By using Cloud Storage for both development and production deployments, the process remains simple and robust.
Google Data Engineer Reference:
Cloud Composer Documentation
Using Cloud Build
Deploying DAGs to Cloud Composer
Automating DAG Deployment with Cloud Build
By implementing this CI/CD pipeline, you ensure that DAGs are thoroughly tested in the development environment before being automatically deployed to the production environment, maintaining high quality and reliability.