Databricks-Certified-Professional-Data-Engineer Exam Practice Test Instant Access

Question 1

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

A. The total count of rows is calculated by scanning all data files

B. The total count of rows will be returned from cached results unless REFRESH is run

C. The total count of records is calculated from the Delta transaction logs

D. The total count of records is calculated from the parquet file metadata

E. The total count of records is calculated from the Hive metastore

Answer : C

https://delta.io/blog/2023-04-19-faster-aggregations-metadata/#:~:text=You%20can%20get%20the%20number,a%20given%20Delta%20table%20version.
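
As a rough illustration of why the count can come from the transaction log: each `add` action in a Delta log can carry a `stats` field with a `numRecords` value, so a reader can sum those values over the files that are still live instead of scanning the data. The sketch below models that idea in plain Python. The log entries, file names, and helper functions are hypothetical, and a real reader would also handle checkpoints and files written without statistics.

```python
import json


def add_action(path, num_records):
    """Build a simplified, hypothetical 'add' log line with per-file stats."""
    stats = json.dumps({"numRecords": num_records})
    return json.dumps({"add": {"path": path, "stats": stats}})


def remove_action(path):
    """Build a simplified, hypothetical 'remove' log line."""
    return json.dumps({"remove": {"path": path}})


def count_rows_from_log(log_lines):
    """Sum numRecords over files that were added and not later removed."""
    live_files = {}
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            # 'stats' is itself a JSON string inside the action.
            live_files[add["path"]] = json.loads(add["stats"])["numRecords"]
        elif "remove" in action:
            live_files.pop(action["remove"]["path"], None)
    return sum(live_files.values())


# Illustrative log: three files added, one of them removed again.
log = [
    add_action("part-0000.parquet", 100),
    add_action("part-0001.parquet", 250),
    remove_action("part-0000.parquet"),
    add_action("part-0002.parquet", 75),
]
total = count_rows_from_log(log)
print(total)
```

No data file is opened at any point; the count depends only on the metadata recorded in the log.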

Question 2

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

A. "Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C. "Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D. "Manage" permission should be set on a secret scope containing only those credentials that will be used by a given team.

Answer : C

In Databricks, the Secrets module allows for secure management of sensitive information such as database credentials. Secret access control is applied at the scope level, not to individual keys, so granting "Read" permission on a secret scope that contains only one team's credentials ensures that members of that team can read those credentials and nothing else. This approach aligns with the principle of least privilege, granting users the minimum level of access required to perform their jobs.

Databricks Documentation on Secret Management: Secrets
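
A minimal sketch of scope-level access, assuming one secret scope per team: the Databricks Secrets REST API accepts a request at `/api/2.0/secrets/acls/put` with `scope`, `principal`, and `permission` fields. The code below only builds those request bodies and does not send them. The scope and group names are hypothetical.

```python
import json

# Hypothetical mapping of Databricks group name -> secret scope holding
# only that group's external-database credentials.
TEAM_SCOPES = {
    "finance": "finance-db-creds",
    "marketing": "marketing-db-creds",
}


def read_acl_payload(scope, group):
    """Request body for POST /api/2.0/secrets/acls/put granting READ on a scope."""
    return {"scope": scope, "principal": group, "permission": "READ"}


payloads = [read_acl_payload(scope, group) for group, scope in TEAM_SCOPES.items()]
for payload in payloads:
    print(json.dumps(payload))
```

Each group receives READ on its own scope only, which is the narrowest grant that still lets team members fetch their credentials.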

Question 3

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

A. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: Unlimited

B. Cluster: New Job Cluster;
Retries: None;
Maximum Concurrent Runs: 1

C. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1

D. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1

E. Cluster: Existing All-Purpose Cluster;
Retries: None;
Maximum Concurrent Runs: 1

Answer : D

The configuration that automatically recovers from query failures and keeps costs low is to use a new job cluster, set retries to unlimited, and set maximum concurrent runs to 1. This configuration has the following advantages:

A new job cluster is a cluster that is created and terminated for each job run. This means that the cluster resources are only used when the job is running, and no idle costs are incurred. This also ensures that the cluster is always in a clean state and has the latest configuration and libraries for the job.

Setting retries to unlimited means that the job will automatically restart the query in case of any failure, such as network issues, node failures, or transient errors. This improves the reliability and availability of the streaming job, and avoids data loss or inconsistency.

Setting maximum concurrent runs to 1 means that only one instance of the job can run at a time. This prevents multiple queries from competing for the same resources or writing to the same output location, which can cause performance degradation or data corruption.

Therefore, this configuration is the best practice for scheduling Structured Streaming jobs for production, as it ensures that the job is resilient, efficient, and consistent.
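
The three settings described above could be expressed in a Jobs API style payload roughly as follows. This is a sketch, not a complete job definition: `max_concurrent_runs`, `max_retries` (where -1 means retry indefinitely), and `new_cluster` are Jobs API fields, while the job name, notebook path, runtime version, node type, and worker count are hypothetical placeholders.

```python
import json

job_settings = {
    "name": "streaming-ingest",  # hypothetical job name
    "max_concurrent_runs": 1,    # never run two copies of the stream at once
    "tasks": [
        {
            "task_key": "run_stream",
            "notebook_task": {
                "notebook_path": "/Jobs/streaming_ingest"  # hypothetical path
            },
            # A new job cluster is created for the run and terminated afterwards.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # hypothetical runtime
                "node_type_id": "i3.xlarge",          # hypothetical node type
                "num_workers": 2,                     # hypothetical size
            },
            "max_retries": -1,  # -1 = retry indefinitely after a failure
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```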

Question 4

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

A) preds.write.mode("append").saveAsTable("churn_preds")

B) preds.write.format("delta").save("/preds/churn_preds")

A. Option A

B. Option B

C. Option C

D. Option D

E. Option E

Answer : A

Question 5

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

A. Whenever a database is being created, make sure that the location keyword is used.

B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C. Whenever a table is being created, make sure that the location keyword is used.

D. When tables are created, make sure that the external keyword is used in the create table statement.

E. When the workspace is being configured, make sure that external cloud object storage has been mounted.

Answer : C

Option C is correct because specifying the LOCATION keyword when a table is created is what makes it an external (unmanaged) table. An external table stores its data files at a path outside the metastore's managed storage location; Databricks still tracks the table's metadata but does not manage the lifecycle of the underlying files. The path typically points to a directory in cloud object storage such as S3, possibly through a DBFS mount. Because dropping an external table removes only the metadata, the data files survive, and existing data can be registered as a table without moving or copying it. Verified Reference: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Create an external table" section.
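
To make the role of the LOCATION keyword concrete, here is a small sketch that renders a `CREATE TABLE ... USING DELTA LOCATION ...` statement. The helper function, table name, columns, and path are all hypothetical and serve only to show where the keyword sits in the statement.

```python
def external_table_ddl(table, columns, location):
    """Render a CREATE TABLE statement that pins the table to an explicit path."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({cols}) USING DELTA LOCATION '{location}'"


ddl = external_table_ddl(
    "sales.orders",                                   # hypothetical table
    [("order_id", "BIGINT"), ("amount", "DOUBLE")],   # hypothetical columns
    "/mnt/example_bucket/orders",                     # hypothetical path
)
print(ddl)
```

On a Databricks cluster, a statement like this would typically be run through `spark.sql(ddl)` or directly in a SQL cell.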

Question 6

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A. A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B. An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

C. A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D. A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

E. A managed table will be created in the DBFS root storage container.

Answer : D

https://docs.databricks.com/en/lakehouse/data-objects.html

Question 7

Which Python variable contains a list of directories to be searched when trying to locate required modules?

A. importlib.resource_path

B. sys.path

C. os.path

D. pypi.path

E. pylib.source

Answer : B
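
A short sketch of how `sys.path` behaves: it is an ordinary list of directory strings that the import system searches in order, and it can be extended at runtime. The directory added here is a hypothetical example path.

```python
import sys

# sys.path is a plain Python list of strings.
assert isinstance(sys.path, list)

extra_dir = "/tmp/example_modules"  # hypothetical directory holding modules
if extra_dir not in sys.path:
    # Appending makes the directory the last place searched on import.
    sys.path.append(extra_dir)

for entry in sys.path:
    print(entry)
```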
