Which of the following metrics are used to evaluate classification models?
Answer : D
Evaluation metrics are tied to machine learning tasks: classification and regression each have their own metrics, and some, like precision-recall, are useful for multiple tasks. Classification and regression are examples of supervised learning, which constitutes the majority of machine learning applications. By evaluating a model with several different metrics, we can improve its overall predictive power before rolling it out for production on unseen data. Relying only on accuracy, without a proper evaluation of the machine learning model using other evaluation metrics, can cause problems when the model is deployed on unseen data and may end in poor predictions.
Classification metrics are evaluation measures used to assess the performance of a classification model. Common metrics include accuracy (proportion of correct predictions), precision (true positives over total predicted positives), recall (true positives over total actual positives), F1 score (harmonic mean of precision and recall), and area under the receiver operating characteristic curve (AUC-ROC).
Confusion Matrix
A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the combinations of predicted and actual values.
It is extremely useful for computing recall, precision, accuracy, and the AUC-ROC curve.
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions (precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) / (precision + recall))).
These metrics help assess the classifier's effectiveness in correctly classifying instances of different classes.
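As a quick illustration (a minimal sketch using scikit-learn, with made-up labels that are not part of the original question), these metrics can be computed directly from the actual and predicted classes:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # counts of actual vs. predicted combinations
print(accuracy_score(y_true, y_pred))     # correct predictions / total predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # 2 * (precision * recall) / (precision + recall)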
Understanding how well a machine learning model will perform on unseen data is the main purpose of working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models on balanced datasets, but if the data is imbalanced, measures such as ROC/AUC do a better job of evaluating model performance.
The ROC curve is not a single number but a whole curve that provides nuanced detail about the classifier's behavior. It is also hard to quickly compare many ROC curves to each other, which is why the curve is often summarized by the area under it (AUC).
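A minimal sketch of the same idea with scikit-learn, again on made-up labels and predicted probabilities, showing how the full curve and its single-number AUC summary are obtained:
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical actual labels and predicted probabilities for the positive class
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.8, 0.6, 0.1, 0.95, 0.35]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points describing the whole ROC curve
print(roc_auc_score(y_true, y_score))              # area under that curve as one number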
Which of the following process best covers all of the following characteristics?
* Collecting descriptive statistics like min, max, count and sum.
* Collecting data types, length and recurring patterns.
* Tagging data with keywords, descriptions or categories.
* Performing data quality assessment, risk of performing joins on the data.
* Discovering metadata and assessing its accuracy.
* Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
Answer : C
Data processing and analysis cannot happen without data profiling---reviewing source data for content and quality. As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important.
What is data profiling?
Data profiling is the process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.
Data profiling is a crucial part of:
* Data warehouse and business intelligence (DW/BI) projects---data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
* Data conversion and migration projects---data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
* Source system data quality projects---data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).
Data profiling involves:
* Collecting descriptive statistics like min, max, count and sum.
* Collecting data types, length and recurring patterns.
* Tagging data with keywords, descriptions or categories.
* Performing data quality assessment, risk of performing joins on the data.
* Discovering metadata and assessing its accuracy.
* Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
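As a rough illustration only (a minimal pandas sketch; the file name and column names are placeholders, and real profiling tools go much further), several of these checks can be started directly on a data frame:
import pandas as pd

df = pd.read_csv("customers.csv")         # hypothetical source file

print(df.dtypes)                          # data types of each column
print(df.describe(include="all"))         # min, max, count and other descriptive statistics
print(df.isna().sum())                    # missing values, a basic data quality check
print(df.nunique())                       # distinct counts, useful for spotting key candidates
print(df["customer_id"].is_unique)        # does this column qualify as a key candidate?
print(df["country"].value_counts())       # value distribution of a categorical column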
Which command is used to install Jupyter Notebook?
Answer : A
Jupyter Notebook is a web-based interactive computational environment.
The command used to install Jupyter Notebook is pip install jupyter.
The command used to start Jupyter Notebook is jupyter notebook.
Consider a data frame df with 10 rows and index [ 'r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10']. What does the expression g = df.groupby(df.index.str.len()) do?
Answer : D
df.index.str.len() computes the length of each index label (2 for 'r1', 4 for 'row4', 5 for 'row10'), and groupby accepts such an array-like of grouping keys, so the expression groups the rows of df by the length of their index labels.
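A quick way to check this behavior (a minimal sketch; the single value column is arbitrary and not part of the original question):
import pandas as pd

idx = ['r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10']
df = pd.DataFrame({'value': range(10)}, index=idx)

g = df.groupby(df.index.str.len())   # group rows by the length of their index labels
print(g.size())                      # label lengths 2, 4 and 5 -> group sizes 6, 3 and 1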
Consider a data frame df with columns ['A', 'B', 'C', 'D'] and rows ['r1', 'r2', 'r3']. What does the expression df[lambda x : x.index.str.endswith('3')] do?
Answer : D
It filters the rows whose index label ends with '3', which here selects the single row labelled r3. When a callable is passed to df[...], pandas calls it with the data frame and uses the returned boolean mask for selection.
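A small sketch illustrating this; the numeric cell values are arbitrary, since the question does not specify the data frame's contents:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'C', 'D'],
                  index=['r1', 'r2', 'r3'])

# The callable receives the data frame and returns a boolean mask over its index
print(df[lambda x: x.index.str.endswith('3')])   # only row r3 is returned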
As a Data Scientist looking to use a Reader account, which of the following are correct considerations about Reader Accounts for Third-Party Access?
Answer : D
Data sharing is only supported between Snowflake accounts. As a data provider, you might want to share data with a consumer who does not already have a Snowflake account or is not ready to become a licensed Snowflake customer.
To facilitate sharing data with these consumers, you can create reader accounts. Reader accounts (formerly known as "read-only accounts") provide a quick, easy, and cost-effective way to share data without requiring the consumer to become a Snowflake customer.
Each reader account belongs to the provider account that created it. As a provider, you use shares to share databases with reader accounts; however, a reader account can only consume data from the provider account that created it.
So, data sharing is possible between Snowflake and non-Snowflake accounts via a reader account.
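For orientation only, a hedged sketch of how a provider might set this up from Python with the Snowflake connector; the account, share, database, table and credential names are all placeholders, and the statements must be run with sufficient privileges (for example ACCOUNTADMIN):
import snowflake.connector

# Placeholder credentials for the provider account
conn = snowflake.connector.connect(account="provider_account", user="admin_user", password="...")
cur = conn.cursor()

# Create a reader account owned by this provider (placeholder names and password)
cur.execute("CREATE MANAGED ACCOUNT reader_acct "
            "ADMIN_NAME = reader_admin, ADMIN_PASSWORD = 'Str0ngPassword!', TYPE = READER")

# Share a database with the reader account (placeholder object names;
# ADD ACCOUNTS expects the reader account identifier returned by the statement above)
cur.execute("CREATE SHARE my_share")
cur.execute("GRANT USAGE ON DATABASE my_db TO SHARE my_share")
cur.execute("GRANT USAGE ON SCHEMA my_db.public TO SHARE my_share")
cur.execute("GRANT SELECT ON TABLE my_db.public.my_table TO SHARE my_share")
cur.execute("ALTER SHARE my_share ADD ACCOUNTS = reader_acct")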
Which one is an incorrect understanding about Providers of a Direct Share?
Answer : D
If you want to provide a share to many accounts, you might want to use a listing or a data exchange.