Databricks-Certified-Professional-Data-Scientist Exam Practice Test Instant Access

Question 1

You are working as a data scientist at a data analytics company. You have been given data on the various types of pizza available across premium food centers in a country. The data is given as numeric values such as calories, size, and sales per day. You need to group the pizzas with similar properties. Which of the following techniques would you use?

A. Association Rules

B. Naive Bayes Classifier

C. K-means Clustering

D. Linear Regression

E. Grouping

Using K-means clustering you can create groups of objects based on their properties, where K is the number of groups. For each group you determine the center of the group and then find how far each object's characteristics are from that center. An object becomes part of the group whose center it is nearest to. Suppose we have 100 objects and we need to determine 4 groups, so K = 4. We determine 4 center values and, based on those values, compute the distance of each object from each center.

Answer : C
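
The grouping step described in the explanation for Question 1 can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the pizza values, the function names, and the hand-picked starting centers are all assumptions made for the example, and K is 2 rather than 4 to keep the data small. The loop that recomputes the centers and reassigns objects is the standard K-means refinement; the explanation above only describes the first distance calculation.

```python
import math

def assign_to_nearest(points, centers):
    """For each point, return the index of the nearest center (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels

def recompute_centers(points, labels, k):
    """Return the mean of the points assigned to each of the k groups."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumes every group keeps at least one member (true for this toy data).
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers

def k_means(points, centers, max_iter=100):
    """Alternate assignment and center updates until the labels stop changing."""
    labels = assign_to_nearest(points, centers)
    for _ in range(max_iter):
        centers = recompute_centers(points, labels, len(centers))
        new_labels = assign_to_nearest(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical pizzas as (calories, size in inches); values are made up.
pizzas = [(250, 8), (260, 9), (270, 8), (800, 14), (820, 15), (790, 14)]
initial_centers = [pizzas[0], pizzas[3]]  # K = 2, starting centers picked by hand

labels, centers = k_means(pizzas, initial_centers)
print(labels)   # group index for each pizza
print(centers)  # final center of each group
```

Note that calories dominate the distance here because they are on a much larger scale than size, which is the issue Question 3 deals with.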

Question 2

You are designing a recommendation engine for a website. The engine generates more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to have similar taste to that user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user. What kind of recommendation engine is this?

A. Naive Bayes classifier

B. Collaborative filtering

C. Logistic Regression

D. Content-based filtering

Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to have similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.

Answer : B
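
The idea of recommending from "other users deemed to be of similar taste" in Question 2 can be illustrated with a small user-based sketch. Everything here is an assumption made for the example: the users, items, and ratings are invented, cosine similarity over co-rated items is just one common way to measure similar taste, and real systems use far more data and more robust scoring.

```python
import math

# Hypothetical ratings (user -> {item: rating}); all names and values are made up.
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 5, "film_b": 5, "film_d": 4},
    "carol": {"film_a": 1, "film_c": 5, "film_e": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(target, ratings):
    """Suggest items rated by the most similar other user that the target has not rated."""
    others = [name for name in ratings if name != target]
    nearest = max(
        others,
        key=lambda name: cosine_similarity(ratings[target], ratings[name]),
    )
    return sorted(item for item in ratings[nearest] if item not in ratings[target])

print(recommend("alice", ratings))  # ['film_d']
```

Alice and Bob agree on the films they have both rated, so Bob's remaining film is suggested to Alice. As more ratings are added for a user, the similarity estimates have more shared items to work with, which is why the recommendations improve with use.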

Question 3

You have data on 1000 patients, with age in years and height in meters. You want to create clusters using these two attributes, and you want age and height to have nearly equal effect when the clusters are created. What can you do?

A. You will add the numeric value 100 to each height

B. You will convert each height value to centimeters

C. You will divide both age and height by their respective standard deviations

D. You will take the square root of height

Age in years takes values such as 50, 60, 70, or 90. When calculating the distance from a centroid, the maximum possible difference is 90 - 0 = 90, and its square is 8100.
Height in meters ranges from about 0.5 to 2, a difference of 1.5 meters, and its square is only 2.25. So age has far more effect on the distance than height. To bring height to a comparable level you can convert it to centimeters. The values then go up to about 200, so squared differences in height become large enough to count alongside age.
Another approach is to divide each value by the standard deviation of its attribute, for example age / (SD of age), which gives a value without a unit. This also reduces the effect of the units.

Answer : B, C
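
The arithmetic in the explanation for Question 3 can be checked with a short sketch. The patient values and function names below are assumptions made for illustration. The code compares the largest squared difference for each attribute in its raw units, after converting height to centimeters, and after dividing by the sample standard deviation.

```python
import statistics

# Hypothetical patients: ages in years, heights in meters (values made up).
ages = [30, 50, 60, 70, 90]
heights_m = [1.5, 1.6, 1.7, 1.8, 2.0]
heights_cm = [h * 100 for h in heights_m]

def max_squared_gap(values):
    """Largest squared difference between any two values of one attribute."""
    return (max(values) - min(values)) ** 2

def scale_by_sd(values):
    """Divide each value by the attribute's sample standard deviation (unit-free result)."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]

print("age, raw:", max_squared_gap(ages))
print("height in m, raw:", max_squared_gap(heights_m))
print("height in cm, raw:", max_squared_gap(heights_cm))
print("age, scaled by SD:", max_squared_gap(scale_by_sd(ages)))
print("height, scaled by SD:", max_squared_gap(scale_by_sd(heights_m)))
```

With raw units, age contributes thousands of times more to a squared distance than height in meters. After scaling by the standard deviation the two attributes contribute on a similar scale, and the scaled heights come out the same whether the input was in meters or centimeters.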

Question 4

You are working on a clustering solution for customer datasets. Almost 40 variables are available for each customer, and data for almost 1.00,0000 customers is available. You want to reduce the number of variables used for clustering. What would you do?

A. You will randomly reduce the number of variables

B. You will find the correlation among the variables and discard the variables that are not correlated.

C. You will find the correlation among the variables and, from each set of highly correlated variables, keep only one or two.

D. You cannot discard any variable when creating clusters.

E. You can combine several variables into one variable

When you apply a clustering technique and find that a very large number of variables is available, it is better to find the correlation among the variables and keep only one or two from each set of highly correlated variables, because highly correlated variables have the same effect when the clusters are created. A scatter plot matrix of the variables can be used to find the correlation.
You can also combine several variables into a single variable. For example, if the dataset has two values such as Asset and Debt, you can combine them into a Debt to Asset ratio and use that when creating the clusters.

Answer : C, E
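
Both ideas in the explanation for Question 4 can be sketched briefly. The variable names, values, and the 0.9 threshold below are assumptions chosen for illustration, and a correlation coefficient is used here in place of the scatter plot matrix mentioned above. The first part keeps one variable from each highly correlated set; the second part combines Debt and Asset into a single ratio.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(data, threshold=0.9):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name in data:
        if all(abs(pearson(data[name], data[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical customer variables; names and values are made up.
customers = {
    "income": [30, 45, 60, 75, 90],
    "spend": [16, 22, 31, 37, 46],  # rises almost in step with income
    "visits": [4, 1, 5, 2, 3],
}
print(drop_correlated(customers))  # ['income', 'visits']

# Combining two variables into one, as in the Debt to Asset example.
debt = [20, 50, 10]
asset = [100, 100, 40]
debt_to_asset = [d / a for d, a in zip(debt, asset)]
print(debt_to_asset)  # [0.2, 0.5, 0.25]
```

Which variable survives from a correlated pair depends on the order in which the variables are visited, so in practice the choice should be guided by which variable is easier to interpret or measure.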

Question 5

You have data on 10,000 people who make purchases from a specific grocery store. The data also includes their income details. You have created 5 clusters using this data, but in one of the clusters you see that only 30 people fall, as below: 30, 2400, 2600, 2700, 2270, etc.

What would you do in this case?

A. You will increase the number of clusters.

B. You will decrease the number of clusters.

C. You will remove those 30 people from the dataset

D. You will multiply the standard deviation by 100

Decreasing the number of clusters will help this outlier cluster get absorbed into another cluster.

Answer : B

Question 6

Refer to exhibit

You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made:
1. Multicollinearity is not an issue among the variables.
2. Only three variables (A, B, and C) have significant correlation with sales.
You build a linear regression model on the dependent variable of sales with the independent variables A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R² of the model without artificially inflating it?

A. Create clusters based on the data and use them as model inputs

B. Force all 15 variables into the model as independent variables

C. Create interaction variables based only on variables A, B, and C

D. Break variables A, B, and C into their own univariate models

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.) In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Answer : A
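
To make the terms in the explanation for Question 6 concrete, the sketch below fits a simple linear regression (one explanatory variable) by ordinary least squares and computes R². It does not reproduce the exhibit, which is not available here: the data points and function names are invented for illustration only.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares intercept and slope for the model y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: one explanatory variable x and a sales-like response y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear_regression(x, y)
r2 = r_squared(x, y, intercept, slope)
print(intercept, slope, r2)
```

The fitted line is the affine function of x referred to in the explanation, and R² is the quantity the question asks to increase.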

Question 7

In which phase of the analytic lifecycle would you expect to spend most of the project time?

A. Discovery

B. Data preparation

C. Communicate Results

D. Operationalize

In the data preparation phase of the Data Analytics Lifecycle, the data range and distribution can be obtained. If the data is skewed, viewing the logarithm of the data (if it's all positive) can help detect structures that might otherwise be overlooked in a graph with a regular, nonlogarithmic scale.
When preparing the data, one should look for signs of dirty data, as explained in the previous section. Examining whether the data is unimodal or multimodal will give an idea of how many distinct populations with different behavior patterns might be mixed into the overall population. Many modeling techniques assume that the data follows a normal distribution. Therefore, it is important to know whether the available dataset can match that assumption before applying any of those modeling techniques.

Answer : B
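
The remark about taking logarithms of skewed data in the explanation for Question 7 can be illustrated as follows. The values are invented for the example and the function name is an assumption. The sketch compares the mean and median before and after a base-10 log transform, as a rough numeric stand-in for looking at the two graphs.

```python
import math
import statistics

def log_transform(values):
    """Return log10 of each value; only valid when every value is positive."""
    if min(values) <= 0:
        raise ValueError("log transform requires strictly positive values")
    return [math.log10(v) for v in values]

# Hypothetical positive, right-skewed measurements (values made up).
values = [1, 2, 3, 5, 8, 12, 20, 150, 900, 10000]
logged = log_transform(values)

raw_ratio = statistics.mean(values) / statistics.median(values)
log_ratio = statistics.mean(logged) / statistics.median(logged)
print(raw_ratio)  # mean is far above the median: strong right skew
print(log_ratio)  # mean and median are much closer after the transform
```

On the raw scale a few very large values pull the mean far from the median and would compress the rest of the data into one corner of a graph. On the log scale the values are spread more evenly, which makes structure among the smaller values easier to see.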
