Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
Answer : A
Reshaping Data - Long vs Wide Format | Databricks on AWS
TRANSFORM | Databricks on AWS
SUM | Databricks on AWS
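The lettered options for this question are not reproduced here, so the sketch below does not assert which keyword the exam intends; it only illustrates the long-to-wide idea in PySpark with groupBy().pivot(), using hypothetical column and table names. Spark SQL's PIVOT clause expresses the same reshaping in pure SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format data: one row per (id, metric) pair.
long_df = spark.createDataFrame(
    [(1, "height", 180), (1, "weight", 75),
     (2, "height", 165), (2, "weight", 60)],
    ["id", "metric", "value"],
)

# pivot() turns the distinct values of "metric" into columns, producing a
# wide format with one row per id.
wide_df = long_df.groupBy("id").pivot("metric").agg(F.first("value"))
wide_df.show()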
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
Answer : C
Option C is correct because Parquet files embed a well-defined schema within the data itself, so column names and data types are detected and preserved automatically when an external table is created from them. CSV files, by contrast, carry no embedded schema, so creating a table from them requires declaring the schema explicitly or inferring it from the data, which can introduce errors or inconsistencies in column names and data types and adds processing time and complexity.
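As a minimal sketch of the difference (hypothetical paths and table names, assuming a workspace that permits direct path queries), the Parquet source needs no schema declaration, while the CSV source does:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet embeds column names and types, so CTAS picks the schema up
# directly from the files.
spark.sql("""
    CREATE TABLE sales_from_parquet
    AS SELECT * FROM parquet.`/mnt/raw/sales_parquet/`
""")

# CSV carries no schema, so a usable table needs the columns declared (or
# inferred via explicit options) before the data can be queried reliably.
spark.sql("""
    CREATE TABLE sales_from_csv (id INT, amount DOUBLE, sold_at TIMESTAMP)
    USING CSV
    OPTIONS (path '/mnt/raw/sales_csv/', header 'true')
""")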
Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?
Answer : B
To send the Databricks Job owner an email in the case that the Job fails, the best approach is to set up an Alert in the Job page. This way, the Job owner can configure the email address and the notification type for the Job failure event. The other options are either not feasible, not reliable, or not relevant for this task. Manually programming an alert system in each cell of the Notebook is tedious and error-prone. Setting up an Alert in the Notebook is not possible, as Alerts are only available for Jobs and Clusters. There is a way to notify the Job owner in the case of Job failure, so option D is incorrect. MLflow Model Registry Webhooks are used for model lifecycle events, not Job events, so option E is not applicable. Reference:
Add email and system notifications for job events
MLflow Model Registry Webhooks
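Outside the UI, the same failure notification can be attached when the Job is defined through the Jobs REST API. The following is only a hedged sketch: the host, token, email address, notebook path, and job name are placeholders, and the payload is trimmed to the parts relevant to the notification.
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

# Trimmed job specification: email_notifications.on_failure mirrors the
# failure alert the Job owner configures on the Job page.
job_spec = {
    "name": "nightly-etl",
    "email_notifications": {"on_failure": ["job.owner@example.com"]},
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/etl/main"},
            # cluster configuration omitted for brevity
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success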
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
Answer : B
The expected behavior when a batch containing data that violates the expectation is processed is that the job will fail. This is because the expectation clause uses the ON VIOLATION FAIL UPDATE option, which means that if any record in the batch does not meet the expectation, the entire batch is rejected and the job fails. This option is useful for enforcing strict data quality rules and preventing invalid data from entering the target dataset.
Option A is not correct, as the ON VIOLATION FAIL UPDATE option does not drop the records that violate the expectation; it fails the entire update. Dropping violating records while recording them as invalid in the event log is the behavior of the ON VIOLATION DROP ROW option.
Option C is not correct for the same reason. Delta Live Tables has no built-in quarantine clause; loading violating records into a quarantine table requires a separate pattern, such as a second table defined with the inverted expectation.
Option D is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records to the target dataset; failing the update prevents any of the batch from being written. Keeping violating records while recording them as invalid in the event log is the default behavior of an expectation declared without an ON VIOLATION clause.
Option E is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records or flag them in a field added to the target dataset; the update simply fails, and Delta Live Tables provides no flag-record clause.
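For comparison, here is a minimal sketch of the three expectation behaviors Delta Live Tables does support, written with the Python decorators. The table and source names are hypothetical, and spark is the session provided by the pipeline runtime.
import dlt

# Equivalent of ON VIOLATION FAIL UPDATE: one violating record aborts the
# whole update, as described for the constraint in the question.
@dlt.table
@dlt.expect_or_fail("valid_timestamp", "timestamp > '2020-01-01'")
def events_strict():
    return spark.read.table("raw_events")  # hypothetical source table

# Equivalent of ON VIOLATION DROP ROW: violating records are dropped and
# counted in the event log; the rest of the batch is written.
@dlt.table
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def events_dropped():
    return spark.read.table("raw_events")

# Default behavior (no ON VIOLATION clause): violating records are kept in
# the target and the violation counts are recorded in the event log.
@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2020-01-01'")
def events_warned():
    return spark.read.table("raw_events")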
Delta Live Tables Expectations
Databricks Data Engineer Professional Exam Guide
A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.
Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?
In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
Answer : E
A data engineer can create a multi-task job in Databricks that consists of multiple tasks that run in a specific order. Each task can have one or more dependencies, which are other tasks that must run before the current task. The Depends On field of a new Databricks Job Task allows the data engineer to specify the dependencies of the task. The data engineer should select a task in the Depends On field when they want the new task to run only after the selected task has successfully completed. This can help the data engineer to create a logical sequence of tasks that depend on each other's outputs or results. For example, a data engineer can create a multi-task job that consists of the following tasks:
Task A: Ingest data from a source using Auto Loader
Task B: Transform the data using Spark SQL
Task C: Write the data to a Delta Lake table
Task D: Analyze the data using Spark ML
Task E: Visualize the data using Databricks SQL
In this case, the data engineer can set the dependencies of each task as follows:
Task A: No dependencies
Task B: Depends on Task A
Task C: Depends on Task B
Task D: Depends on Task C
Task E: Depends on Task D
This way, the data engineer can ensure that each task runs only after the previous task has successfully completed, and the data flows smoothly from ingestion to visualization.
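As a hedged sketch, the same chain could also be submitted programmatically as the tasks list of a Jobs API 2.1 job specification, where each depends_on entry plays the role of the Depends On field; the task keys and notebook paths below are hypothetical.
# Task chain A -> B -> C -> D -> E expressed for the Jobs API; each
# "depends_on" entry is the programmatic counterpart of the Depends On field.
tasks = [
    {"task_key": "ingest",
     "notebook_task": {"notebook_path": "/Workspace/etl/ingest"}},
    {"task_key": "transform",
     "notebook_task": {"notebook_path": "/Workspace/etl/transform"},
     "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "write_delta",
     "notebook_task": {"notebook_path": "/Workspace/etl/write_delta"},
     "depends_on": [{"task_key": "transform"}]},
    {"task_key": "analyze",
     "notebook_task": {"notebook_path": "/Workspace/etl/analyze"},
     "depends_on": [{"task_key": "write_delta"}]},
    {"task_key": "visualize",
     "notebook_task": {"notebook_path": "/Workspace/etl/visualize"},
     "depends_on": [{"task_key": "analyze"}]},
]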
The other options are incorrect because they do not describe valid reasons for selecting a task in the Depends On field. The Depends On field does not determine any of the following:
Whether the task needs to be replaced by another task
Whether the task needs to fail before another task begins
Whether the task has the same dependency libraries as another task