Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
Answer : A
Reshaping Data - Long vs Wide Format | Databricks on AWS
TRANSFORM | Databricks on AWS
SUM | Databricks on AWS
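The lettered options for this question are not reproduced here, so the sketch below does not assert which keyword the exam intends; it only illustrates the long-to-wide idea in PySpark with groupBy().pivot(), using hypothetical column and table names. Spark SQL's PIVOT clause expresses the same reshaping in pure SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format data: one row per (id, metric) pair.
long_df = spark.createDataFrame(
    [(1, "height", 180), (1, "weight", 75),
     (2, "height", 165), (2, "weight", 60)],
    ["id", "metric", "value"],
)

# pivot() turns the distinct values of "metric" into columns, producing a
# wide format with one row per id.
wide_df = long_df.groupBy("id").pivot("metric").agg(F.first("value"))
wide_df.show()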
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
Answer : C
Option C is correct because Parquet files embed a well-defined schema within the data itself, so column names and data types are detected and preserved automatically when an external table is created from them. CSV files, by contrast, carry no embedded schema, so creating a table from them requires declaring the schema explicitly or inferring it from the data, which can introduce errors or inconsistencies in column names and data types and adds processing time and complexity.
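As a minimal sketch of the difference (hypothetical paths and table names, assuming a workspace that permits direct path queries), the Parquet source needs no schema declaration, while the CSV source does:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet embeds column names and types, so CTAS picks the schema up
# directly from the files.
spark.sql("""
    CREATE TABLE sales_from_parquet
    AS SELECT * FROM parquet.`/mnt/raw/sales_parquet/`
""")

# CSV carries no schema, so a usable table needs the columns declared (or
# inferred via explicit options) before the data can be queried reliably.
spark.sql("""
    CREATE TABLE sales_from_csv (id INT, amount DOUBLE, sold_at TIMESTAMP)
    USING CSV
    OPTIONS (path '/mnt/raw/sales_csv/', header 'true')
""")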
Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?
Answer : B
To send the Databricks Job owner an email in the case that the Job fails, the best approach is to set up an Alert in the Job page. This way, the Job owner can configure the email address and the notification type for the Job failure event. The other options are either not feasible, not reliable, or not relevant for this task. Manually programming an alert system in each cell of the Notebook is tedious and error-prone. Setting up an Alert in the Notebook is not possible, as Alerts are only available for Jobs and Clusters. There is a way to notify the Job owner in the case of Job failure, so option D is incorrect. MLflow Model Registry Webhooks are used for model lifecycle events, not Job events, so option E is not applicable. Reference:
Add email and system notifications for job events
MLflow Model Registry Webhooks
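Outside the UI, the same failure notification can be attached when the Job is defined through the Jobs REST API. The following is only a hedged sketch: the host, token, email address, notebook path, and job name are placeholders, and the payload is trimmed to the parts relevant to the notification.
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

# Trimmed job specification: email_notifications.on_failure mirrors the
# failure alert the Job owner configures on the Job page.
job_spec = {
    "name": "nightly-etl",
    "email_notifications": {"on_failure": ["job.owner@example.com"]},
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/etl/main"},
            # cluster configuration omitted for brevity
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success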
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
Answer : B
The expected behavior when a batch containing data that violates the expectation is processed is that the job will fail. This is because the expectation clause uses the ON VIOLATION FAIL UPDATE option, which means that if any record in the batch does not meet the expectation, the entire batch is rejected and the job fails. This option is useful for enforcing strict data quality rules and preventing invalid data from entering the target dataset.
Option A is not correct, as the ON VIOLATION FAIL UPDATE option does not drop the records that violate the expectation; it fails the entire update. Dropping violating records while recording them as invalid in the event log is the behavior of the ON VIOLATION DROP ROW option.
Option C is not correct for the same reason. Delta Live Tables has no built-in quarantine clause; loading violating records into a quarantine table requires a separate pattern, such as a second table defined with the inverted expectation.
Option D is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records to the target dataset; failing the update prevents any of the batch from being written. Keeping violating records while recording them as invalid in the event log is the default behavior of an expectation declared without an ON VIOLATION clause.
Option E is not correct, as the ON VIOLATION FAIL UPDATE option does not add the violating records or flag them in a field added to the target dataset; the update simply fails, and Delta Live Tables provides no flag-record clause.
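For comparison, here is a minimal sketch of the three expectation behaviors Delta Live Tables does support, written with the Python decorators. The table and source names are hypothetical, and spark is the session provided by the pipeline runtime.
import dlt

# Equivalent of ON VIOLATION FAIL UPDATE: one violating record aborts the
# whole update, as described for the constraint in the question.
@dlt.table
@dlt.expect_or_fail("valid_timestamp", "timestamp > '2020-01-01'")
def events_strict():
    return spark.read.table("raw_events")  # hypothetical source table

# Equivalent of ON VIOLATION DROP ROW: violating records are dropped and
# counted in the event log; the rest of the batch is written.
@dlt.table
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def events_dropped():
    return spark.read.table("raw_events")

# Default behavior (no ON VIOLATION clause): violating records are kept in
# the target and the violation counts are recorded in the event log.
@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2020-01-01'")
def events_warned():
    return spark.read.table("raw_events")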
Delta Live Tables Expectations
Databricks Data Engineer Professional Exam Guide
A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.
Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?
In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
Answer : E
A data engineer can create a multi-task job in Databricks that consists of multiple tasks that run in a specific order. Each task can have one or more dependencies, which are other tasks that must run before the current task. The Depends On field of a new Databricks Job Task allows the data engineer to specify the dependencies of the task. The data engineer should select a task in the Depends On field when they want the new task to run only after the selected task has successfully completed. This can help the data engineer to create a logical sequence of tasks that depend on each other's outputs or results. For example, a data engineer can create a multi-task job that consists of the following tasks:
Task A: Ingest data from a source using Auto Loader
Task B: Transform the data using Spark SQL
Task C: Write the data to a Delta Lake table
Task D: Analyze the data using Spark ML
Task E: Visualize the data using Databricks SQL
In this case, the data engineer can set the dependencies of each task as follows:
Task A: No dependencies
Task B: Depends on Task A
Task C: Depends on Task B
Task D: Depends on Task C
Task E: Depends on Task D
This way, the data engineer can ensure that each task runs only after the previous task has successfully completed, and the data flows smoothly from ingestion to visualization.
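As a hedged sketch, the same chain could also be submitted programmatically as the tasks list of a Jobs API 2.1 job specification, where each depends_on entry plays the role of the Depends On field; the task keys and notebook paths below are hypothetical.
# Task chain A -> B -> C -> D -> E expressed for the Jobs API; each
# "depends_on" entry is the programmatic counterpart of the Depends On field.
tasks = [
    {"task_key": "ingest",
     "notebook_task": {"notebook_path": "/Workspace/etl/ingest"}},
    {"task_key": "transform",
     "notebook_task": {"notebook_path": "/Workspace/etl/transform"},
     "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "write_delta",
     "notebook_task": {"notebook_path": "/Workspace/etl/write_delta"},
     "depends_on": [{"task_key": "transform"}]},
    {"task_key": "analyze",
     "notebook_task": {"notebook_path": "/Workspace/etl/analyze"},
     "depends_on": [{"task_key": "write_delta"}]},
    {"task_key": "visualize",
     "notebook_task": {"notebook_path": "/Workspace/etl/visualize"},
     "depends_on": [{"task_key": "analyze"}]},
]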
The other options are incorrect because they do not describe valid reasons for selecting a task in the Depends On field. The Depends On field does not determine any of the following:
Whether the task needs to be replaced by another task
Whether the task needs to fail before another task begins
Whether the task has the same dependency libraries as another task