Pass4cram Databricks-Certified-Professional-Data-Engineer Dumps Real Exam Questions Test Engine Dumps Training
Databricks Databricks-Certified-Professional-Data-Engineer exam dumps and online Test Engine
The Databricks Certified Professional Data Engineer certification exam is a computer-based test that consists of multiple-choice questions. Candidates have two hours to complete the exam, and they must achieve a minimum score of 70% to pass. The exam is proctored, and candidates must have a reliable internet connection and a computer with a webcam and microphone to take the test.
Databricks is a cloud-based data analytics platform that enables businesses to extract valuable insights from large volumes of data. The platform is designed to automate data processing, making it easier for organizations to derive insights from data. Databricks has become an essential tool for data professionals globally, and as such, the demand for Databricks-certified professionals is on the rise.
NEW QUESTION # 54
What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
- A. Use %pip install in a notebook cell
- B. Use %sh install in a notebook cell
- C. Install libraries from PyPi using the cluster UI
- D. Run source env/bin/activate in a notebook setup script
Answer: A
Explanation:
Running %pip install in a notebook cell installs a notebook-scoped library: the package is installed on the driver and on all worker nodes of the currently active cluster, but it is visible only to the current notebook session. Installing a library from PyPI through the Libraries tab of the cluster UI also reaches every node, but that library is cluster-scoped and available to all notebooks attached to the cluster, so it does not meet the notebook-level requirement.
References:
* Databricks Documentation on Libraries: Libraries
NEW QUESTION # 55
You have written a notebook to generate a summary data set for reporting. The notebook was scheduled using a job cluster, but you realized it takes 8 minutes to start the cluster. What feature can be used to start the cluster in a timely fashion so your job can run immediately?
- A. Use Databricks Premium edition instead of Databricks standard edition
- B. Set up an additional job to run ahead of the actual job so the cluster is running when the second job starts
- C. Pin the cluster in the cluster UI page so it is always available to the jobs
- D. Disable auto termination so the cluster is always running
- E. Use the Databricks cluster pools feature to reduce the startup time
Answer: E
Explanation:
Cluster pools allow us to reserve VMs ahead of time; when a new job cluster is created, VMs are taken from the pool. Note: while the VMs are idle in the pool waiting to be used by a cluster, the only cost incurred is the Azure VM cost. Databricks runtime cost is billed only once a VM is allocated to a cluster.
[Figure: demo of how to set up a pool following some best practices]
NEW QUESTION # 56
How do you check the location of an existing schema in Delta Lake?
- A. Use Data explorer
- B. Run SQL command DESCRIBE SCHEMA EXTENDED schema_name
- C. Check Unity Catalog UI
- D. Run SQL command SHOW LOCATION schema_name
- E. Schemas are stored internally in external Hive metastores such as MySQL or SQL Server
Answer: B
Explanation:
DESCRIBE SCHEMA EXTENDED schema_name returns the schema's metadata, including its storage location.
[Figure: example output of the command]
NEW QUESTION # 57
A new data engineer [email protected] has been assigned to an ELT project. The new data
engineer will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data engineer?
- A. GRANT ALL PRIVILEGES ON TABLE sales TO [email protected];
- B. GRANT USAGE ON TABLE sales TO [email protected];
- C. GRANT SELECT ON TABLE sales TO [email protected];
- D. GRANT SELECT CREATE MODIFY ON TABLE sales TO [email protected];
- E. GRANT ALL PRIVILEGES ON TABLE [email protected] TO sales;
Answer: A
NEW QUESTION # 58
Create a schema called bronze using location '/mnt/delta/bronze', and check if the schema exists before creating.
- A. Schema creation is not available in metastore, it can only be done in Unity catalog UI
- B. CREATE SCHEMA IF NOT EXISTS bronze LOCATION '/mnt/delta/bronze'
- C. CREATE SCHEMA bronze IF NOT EXISTS LOCATION '/mnt/delta/bronze'
- D. if IS_SCHEMA('bronze'): CREATE SCHEMA bronze LOCATION '/mnt/delta/bronze'
- E. Cannot create schema without a database
Answer: B
Explanation:
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ LOCATION schema_directory ]
NEW QUESTION # 59
You are working on a process to load external CSV files into a Delta table by leveraging the COPY INTO command, but after running the command for the second time no data was loaded into the table. Why is that?
COPY INTO table_name
FROM 'dbfs:/mnt/raw/*.csv'
FILEFORMAT = CSV
- A. COPY INTO only works one time data load
- B. Use incremental = TRUE option to load new files
- C. COPY INTO did not detect new files after the last load
- D. COPY INTO does not support incremental load, use AUTO LOADER
- E. Run REFRESH TABLE sales before running COPY INTO
Answer: C
Explanation:
The answer is that COPY INTO did not detect new files after the last load.
COPY INTO keeps track of files that were successfully loaded into the table; the next time COPY INTO runs, it skips them.
You can change this behavior by using COPY_OPTIONS ('force' = 'true'); when this option is enabled, all files in the path/pattern are loaded again.
COPY INTO table_identifier
  FROM [ file_location | (SELECT identifier_list FROM file_location) ]
  FILEFORMAT = data_source
  [ FILES = (file_name, ...) | PATTERN = 'regex_pattern' ]
  [ FORMAT_OPTIONS ('data_source_reader_option' = 'value', ...) ]
  [ COPY_OPTIONS ('force' = 'false' | 'true') ]
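The skip-already-loaded behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not how Databricks implements COPY INTO; the function name copy_into, the in-memory table list, and the file names are all hypothetical.

```python
# Toy model only: this is NOT the Databricks implementation of COPY INTO.
# It shows the idea of remembering loaded files so that a second run
# skips them unless a force option is set.

def copy_into(table, loaded_files, source_files, force=False):
    """Append rows from source files to `table`, skipping files already loaded."""
    loaded_now = []
    for name, rows in sorted(source_files.items()):
        if name in loaded_files and not force:
            continue  # this file was ingested by an earlier run
        table.extend(rows)
        loaded_files.add(name)
        loaded_now.append(name)
    return loaded_now


table, loaded = [], set()
files = {"a.csv": [1, 2], "b.csv": [3]}  # hypothetical file contents

first_run = copy_into(table, loaded, files)               # loads both files
second_run = copy_into(table, loaded, files)              # nothing new to load
forced_run = copy_into(table, loaded, files, force=True)  # reloads everything
```

In this model the second run returns an empty list because no new file arrived, which mirrors the situation in the question; the forced run loads the same rows again and so duplicates them.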
NEW QUESTION # 60
A SQL Dashboard was built for the supply chain team to monitor the inventory and product orders, but all of the timestamps displayed on the dashboards are showing in UTC format, so they requested to change the time zone to the location of New York. How would you approach resolving this issue?
- A. Add SET Timezone = America/New_York to each of the SQL queries in the dashboard.
- B. Under SQL Admin Console, set the SQL configuration parameter time zone to America/New_York
- C. Change the timestamp on the delta tables to America/New_York format
- D. Move the workspace from Central US zone to East US Zone
- E. Change the spark configuration of SQL endpoint to format the timestamp to America/New_York
Answer: B
Explanation:
The answer is: under SQL Admin Console, set the SQL configuration parameter time zone to America/New_York.
Here are the steps to configure this so that the entire dashboard is changed without changing individual queries.
Configure SQL parameters
To configure all warehouses with SQL parameters:
1. Click Settings at the bottom of the sidebar and select SQL Admin Console.
2. Click the SQL Warehouse Settings tab.
3. In the SQL Configuration Parameters textbox, specify one key-value pair per line. Separate the name of the parameter from its value using a space. For example, to enable ANSI_MODE:
ANSI_MODE true
Similarly, we can add a line in the SQL Configuration parameters
timezone America/New_York
SQL configuration parameters | Databricks on AWS
NEW QUESTION # 61
Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful with which credentials are stored here and which users have access to these secrets.
Which statement describes a limitation of Databricks Secrets?
- A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.
- B. The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
- C. Iterating through a stored secret and printing each character will display secret contents in plain text.
- D. Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.
- E. Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.
Answer: C
Explanation:
Databricks Secrets is a module that provides tools to store sensitive credentials and avoid accidentally displaying them in plain text.
Secrets are organized into secret scopes, which are collections of secrets that can be accessed by users or groups, and they can be created and managed using the Databricks CLI or the Databricks REST API. When a secret value read with dbutils.secrets.get is printed in a notebook, the output is redacted. The redaction only matches the literal secret value, however, so iterating through a stored secret and printing each character displays its contents in plain text. The REST API and CLI list only secret metadata such as key names, not secret values. Therefore, users should still be careful with which credentials are stored in Databricks Secrets and which users have access to these secrets. Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Workspace" section; Databricks Documentation, under "List secrets" section.
NEW QUESTION # 62
Let A denote the event 'student is female' and let B denote the event 'student is French'. In a class of 100 students
suppose 60 are French, and suppose that 10 of the French students are females. Find the probability that if I
pick a French student, it will be a girl, that is, find P(A|B).
- A. 2/3
- B. 1/3
- C. 2/6
- D. 1/6
Answer: D
Explanation:
Since 10 out of 100 students are both French and female,
P(A and B) = 10/100
Also, 60 out of the 100 students are French, so
P(B) = 60/100
So the required probability is:
P(A|B) = P(A and B) / P(B) = (10/100) / (60/100) = 1/6
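As a quick check of the arithmetic above, the same calculation can be done with exact fractions. The variable names are chosen here for illustration; the numbers come from the question.

```python
from fractions import Fraction

total_students = 100
french = 60             # students for whom event B holds
french_and_female = 10  # students for whom both A and B hold

p_b = Fraction(french, total_students)
p_a_and_b = Fraction(french_and_female, total_students)

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/6
```

The class size cancels out, so the result equals 10/60: of the 60 French students, 10 are female.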
NEW QUESTION # 63
A nightly job ingests data into a Delta Lake table using the following code:
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
- A. return spark.read.option("readChangeFeed", "true").table ("bronze")
- B. return spark.readStream.load("bronze")

- C. return spark.readStream.table("bronze")
- D.

Answer: D
Explanation:
The function must return an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. This explanation refers to reading the table's change data feed: when change data feed is enabled on a Delta Lake table, setting the readChangeFeed option to true returns a DataFrame of change events, and the table argument names the table to read changes from. The result contains the table's data columns plus the metadata columns _change_type, _commit_version, and _commit_timestamp. The _change_type column indicates the type of change event: insert, update_preimage, update_postimage, or delete. The _commit_version and _commit_timestamp columns identify the table version and the time at which the change was committed. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Read changes in batch queries" section.
NEW QUESTION # 64
How can the VACUUM and OPTIMIZE commands be used to manage Delta Lake?
- A. The VACUUM command can be used to compress the parquet files to reduce the size of the table; the OPTIMIZE command can be used to cache frequently used delta tables for better performance.
- B. The VACUUM command can be used to delete empty/blank parquet files in a delta table; the OPTIMIZE command can be used to update stale statistics on a delta table.
- C. The VACUUM command can be used to delete empty/blank parquet files in a delta table; the OPTIMIZE command can be used to cache frequently used delta tables for better performance.
- D. The VACUUM command can be used to compact small parquet files, and the OPTIMIZE command can be used to delete parquet files that are marked for deletion/unused.
- E. The OPTIMIZE command can be used to compact small parquet files, and the VACUUM command can be used to delete parquet files that are marked for deletion/unused.
Answer: E
Explanation:
VACUUM:
You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. VACUUM is not triggered automatically. The default retention threshold for the files is 7 days. To change this behavior, see Configure data retention for time travel.
OPTIMIZE:
Using OPTIMIZE, you can compact data files on Delta Lake, which can improve the speed of read queries on the table. Too many small files can significantly degrade query performance.
NEW QUESTION # 65
Which Python variable contains a list of directories to be searched when trying to locate required modules?
- A. pypi.path
- B. pylib.source
- C. os.path
- D. importlib.resource_path
- E. sys.path
Answer: E
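A short sketch of how sys.path behaves may help here. It is an ordinary Python list of directory strings that the import system searches in order; the directory /tmp/my_modules below is a hypothetical path used only for illustration.

```python
import sys

# sys.path is a plain list of directory strings searched in order on import.
print(type(sys.path).__name__)

# Appending a directory makes modules stored there importable for the
# rest of this interpreter session.
extra_dir = "/tmp/my_modules"  # hypothetical directory, assumed for this example
if extra_dir not in sys.path:
    sys.path.append(extra_dir)

# Entries earlier in the list take precedence over later ones.
for position, directory in enumerate(sys.path[:3]):
    print(position, directory)
```

Because the list is searched front to back, appending is the less intrusive choice; inserting at index 0 would let modules in the new directory shadow installed packages of the same name.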
NEW QUESTION # 66
Which of the following is true of Delta Lake and the Lakehouse?
- A. Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
- B. Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
- C. Z-order can only be applied to numeric values stored in Delta Lake tables
- D. Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.
- E. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
Answer: D
Explanation:
https://docs.delta.io/2.0.0/table-properties.html
Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters. Data skipping is a performance optimization technique that aims to avoid reading irrelevant data from the storage layer. By collecting statistics such as min/max values and null counts, Delta Lake can efficiently prune unnecessary files from the query plan. This can significantly improve the query performance and reduce the I/O cost.
The other options are false because:
Parquet compresses data column by column, not row by row. This allows for better compression ratios, especially for repeated or similar values within a column.
Views in the Lakehouse do not maintain a valid cache of the most recent versions of source tables at all times. Views are logical constructs that are defined by a SQL query on one or more base tables. Views are not materialized by default, which means they do not store any data, but only the query definition. Therefore, views always reflect the latest state of the source tables when queried. However, the result of a view can be cached manually with CACHE TABLE or materialized with CREATE TABLE AS SELECT.
Primary and foreign key constraints can not be leveraged to ensure duplicate values are never entered into a dimension table. Delta Lake does not support enforcing primary and foreign key constraints on tables. Constraints are logical rules that define the integrity and validity of the data in a table. Delta Lake relies on the application logic or the user to ensure the data quality and consistency.
Z-order can be applied to any values stored in Delta Lake tables, not only numeric values. Z-order is a technique to optimize the layout of the data files by sorting them on one or more columns. Z-order can improve the query performance by clustering related values together and enabling more efficient data skipping. Z-order can be applied to any column that has a defined ordering, such as numeric, string, date, or boolean values.
References: Data Skipping, Parquet Format, Views, [Caching], [Constraints], [Z-Ordering]
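The idea behind data skipping with min/max statistics can be sketched with a toy model. This is not Delta Lake code: the file names, the statistics, and the function files_to_scan are hypothetical, and real Delta Lake keeps these statistics per column in its transaction log.

```python
# Toy model of data skipping: each data file carries min/max statistics
# for one column, and a range filter only needs files whose range overlaps it.
file_stats = [
    {"file": "part-0.parquet", "min": 1, "max": 100},
    {"file": "part-1.parquet", "min": 101, "max": 200},
    {"file": "part-2.parquet", "min": 201, "max": 300},
]


def files_to_scan(stats, lower, upper):
    """Return files whose [min, max] range overlaps the filter range [lower, upper]."""
    return [s["file"] for s in stats if s["max"] >= lower and s["min"] <= upper]


# A filter such as `WHERE value BETWEEN 150 AND 180` touches only one file.
selected = files_to_scan(file_stats, 150, 180)
print(selected)  # ['part-1.parquet']
```

The more tightly each file's values are clustered (for example after Z-ordering), the narrower these ranges become and the more files a filter can skip.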
NEW QUESTION # 67
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
- A. configure
- B. workspace
- C. libraries
- D. jobs
- E. fs
Answer: E
Explanation:
The fs command group interacts with the Databricks File System (DBFS), including object storage mounted to it. Its cp subcommand copies a local file to a DBFS path, so it can be used to upload a custom Python Wheel. For example, to upload a wheel named mylib-0.1-py3-none-any.whl to a mounted location:
databricks fs cp mylib-0.1-py3-none-any.whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
The libraries command group does not upload files; it installs, uninstalls, and lists libraries on Databricks clusters. Once the wheel is in DBFS, the legacy CLI can install it on a cluster, for example on a cluster with the id 1234-567890-abcde123:
databricks libraries install --cluster-id 1234-567890-abcde123 --whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
You can also use the libraries uninstall command to uninstall a library from a cluster, and the libraries list command to list the libraries installed on a cluster.
References:
* Libraries CLI (legacy): https://docs.databricks.com/en/archive/dev-tools/cli/libraries-cli.html
* Library operations: https://docs.databricks.com/en/dev-tools/cli/commands.html#library-operations
* Install or update the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/install.html
NEW QUESTION # 68
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
- A. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
- B. Records that violate the expectation cause the job to fail.
- C. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- D. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
- E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
Answer: D
Explanation:
The answer is: records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Delta Live Tables supports three types of expectations for handling bad data in DLT pipelines: EXPECT on its own keeps invalid records and reports them, ON VIOLATION DROP ROW drops them, and ON VIOLATION FAIL UPDATE fails the update.
[Figure: example code for the three expectation types]
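The three expectation behaviors can be illustrated with a small simulation. This is a toy model in plain Python, not the Delta Live Tables API: the function apply_expectation, the mode names "warn", "drop", and "fail", and the sample rows are all assumptions made for this sketch.

```python
# Toy simulation of expectation handling; not the Delta Live Tables API.
# "warn" keeps invalid rows, "drop" removes them, "fail" aborts processing.

def apply_expectation(rows, predicate, on_violation="warn"):
    """Return (rows written to the target, number of violating rows)."""
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1  # stands in for the metric recorded in the event log
        if on_violation == "fail":
            raise ValueError(f"expectation violated by {row!r}")
        if on_violation == "warn":
            kept.append(row)  # invalid row is still written
    return kept, violations


rows = [{"timestamp": "2019-06-01"}, {"timestamp": "2021-03-15"}]


def is_valid(row):
    # ISO-formatted date strings compare correctly as plain strings.
    return row["timestamp"] > "2020-01-01"


kept_drop, bad_drop = apply_expectation(rows, is_valid, "drop")
kept_warn, bad_warn = apply_expectation(rows, is_valid, "warn")
```

With "drop", only the 2021 row reaches the target while the violation is still counted, which corresponds to answer D in the question above.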
NEW QUESTION # 69
What could be the expected output of query SELECT COUNT (DISTINCT *) FROM user on this table
- A. 0
- B. NULL
- C. 1
- D. 2
- E. 2
Answer: D
Explanation:
The answer is 2.
COUNT(DISTINCT *) ignores rows in which any column is NULL and counts the distinct rows that remain.
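The table for this question is not reproduced in the text, so the following sketch uses a hypothetical user table to show the rule in plain Python. None stands in for SQL NULL, and the function count_distinct_all is a name chosen for this example, not a Spark API.

```python
# Toy model of COUNT(DISTINCT *): rows containing any NULL are ignored,
# then duplicates among the remaining rows are collapsed.

def count_distinct_all(rows):
    """Count distinct rows that have no None (NULL) in any column."""
    complete_rows = {row for row in rows if None not in row}
    return len(complete_rows)


# Hypothetical table contents (name, age); the original table is not shown.
user = [
    ("alice", 30),
    ("alice", 30),  # duplicate of the first row
    ("bob", None),  # ignored: NULL age
    (None, 25),     # ignored: NULL name
    ("carol", 41),
]

result = count_distinct_all(user)
print(result)  # 2
```

Only ("alice", 30) and ("carol", 41) survive both steps in this made-up data, which is why the count is 2 here.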
NEW QUESTION # 70
Consider flipping a coin for which the probability of heads is p, where p is unknown, and our goal is to
estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total
number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to
estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get heads both times.
Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit
rash to conclude that the coin will always come up heads, and ____________ is a way of avoiding such rash
conclusions.
- A. Logistic Regression
- B. Naive Bayes
- C. Linear Regression
- D. Laplace Smoothing
Answer: D
Explanation:
Smooth the estimates: consider flipping a coin for which the probability of heads is p, where p is unknown, and
our goal is to estimate p. The obvious approach is to count how many times the coin came up heads and divide
by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very
reasonable to estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get
heads both times. Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it
seems a bit rash to conclude that the coin will always come up heads, and smoothing is a way of avoiding such
rash conclusions. A simple smoothing method, called Laplace smoothing (or Laplace's law of succession or
add-one smoothing in R&N), is to estimate p by (one plus the number of heads) / (two plus the total number of
flips). Said differently, if we are keeping count of the number of heads and the number of tails, this rule is
equivalent to starting each of our counts at one, rather than zero. Another advantage of Laplace smoothing is
that it avoids estimating any probabilities to be zero, even for events never observed in the data. A known drawback is that Laplace
add-one smoothing can assign too much probability to unseen events, such as unseen words in a text model.
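The add-one rule described above, (one plus the number of heads) / (two plus the total number of flips), can be written as a short function. The function name laplace_estimate is chosen here for illustration.

```python
from fractions import Fraction


def laplace_estimate(heads, flips):
    """Laplace (add-one) estimate of P(heads): (heads + 1) / (flips + 2)."""
    return Fraction(heads + 1, flips + 2)


two_flips = laplace_estimate(2, 2)        # 3/4 instead of the raw estimate 1.0
many_flips = laplace_estimate(367, 1000)  # stays close to the raw estimate 0.367
no_data = laplace_estimate(0, 0)          # 1/2 before any flips are observed

print(two_flips, float(many_flips), no_data)
```

With only two flips the smoothed estimate is pulled strongly toward 1/2, while with 1000 flips the added counts barely change the result.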
NEW QUESTION # 71
A Delta Lake table was created with the below query:
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
- A. Data will be marked as deleted but still recoverable with Time Travel.
- B. The table will be removed from the catalog but the data will remain in storage.
- C. Nothing will occur until a COMMIT command is executed.
- D. An error will occur because Delta Lake prevents the deletion of production data.
- E. The table will be removed from the catalog and the data will be deleted.
Answer: E
Explanation:
When a managed table is dropped, the table is removed from the catalog and its underlying data files are deleted. A table created without an explicit LOCATION is a managed table, which is the case this answer assumes. For an external table, DROP TABLE removes only the catalog entry and leaves the data files in storage. References:
https://docs.databricks.com/delta/quick-start.html#drop-a-table
https://docs.databricks.com/delta/delta-batch.html#drop-table
NEW QUESTION # 72
......
Databricks-Certified-Professional-Data-Engineer exam is a comprehensive assessment that evaluates a candidate's ability to design, implement, and manage data pipelines, as well as leverage advanced analytics and machine learning techniques on the Databricks platform. Databricks-Certified-Professional-Data-Engineer exam consists of multiple-choice questions and requires candidates to complete a hands-on project that demonstrates their ability to build a data solution on the Databricks platform.
Databricks Databricks-Certified-Professional-Data-Engineer: Selling Databricks Certification Products and Solutions: https://tesking.pass4cram.com/Databricks-Certified-Professional-Data-Engineer-dumps-torrent.html