Which of the following metrics are used to evaluate classification models?
Answer : D
Evaluation metrics are tied to the machine learning task. There are different metrics for classification and regression, and some, like precision-recall, are useful for multiple tasks. Classification and regression are examples of supervised learning, which constitutes a majority of machine learning applications. By using different metrics for performance evaluation, we can improve a model's overall predictive power before we roll it out for production on unseen data. Evaluating a machine learning model with accuracy alone, rather than with a range of evaluation metrics, can lead to problems when the model is deployed on unseen data and may end in poor predictions.
Classification metrics are evaluation measures used to assess the performance of a classification model. Common metrics include accuracy (proportion of correct predictions), precision (true positives over total predicted positives), recall (true positives over total actual positives), F1 score (harmonic mean of precision and recall), and area under the receiver operating characteristic curve (AUC-ROC).
Confusion Matrix
A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the combinations of predicted and actual values.
It is extremely useful for measuring Recall, Precision, Accuracy, and the AUC-ROC curve.
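For illustration, here is a minimal sketch of building a confusion matrix with scikit-learn (the labels below are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # predicted classes

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# [[4 1]
#  [1 4]]
```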
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions (precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) / (precision + recall))).
These metrics help assess the classifier's effectiveness in correctly classifying instances of different classes.
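As a sanity check, the four metrics can be computed directly from raw confusion-matrix counts; a minimal sketch with made-up counts:

```python
tp, fp, fn, tn = 4, 2, 1, 3  # made-up confusion-matrix counts

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.70
precision = tp / (tp + fp)                            # ~0.67
recall = tp / (tp + fn)                               # 0.80
f1 = 2 * (precision * recall) / (precision + recall)  # ~0.73

print(accuracy, precision, recall, f1)
```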
Understanding how well a machine learning model will perform on unseen data is the main purpose of working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models on balanced datasets, but if the data is imbalanced, methods like ROC/AUC do a better job of evaluating model performance.
The ROC curve isn't just a single number but a whole curve that provides nuanced detail about the behavior of the classifier. That also makes it hard to quickly compare many ROC curves to each other; the AUC condenses each curve into a single number for exactly that purpose.
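A minimal sketch, assuming scikit-learn, with made-up scores (the model's predicted probabilities for the positive class):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the full curve
print(roc_auc_score(y_true, y_score))              # the curve as one number, ~0.889
```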
Which of the following processes best covers all of the following characteristics?
* Collecting descriptive statistics like min, max, count and sum.
* Collecting data types, length and recurring patterns.
* Tagging data with keywords, descriptions or categories.
* Performing data quality assessment and assessing the risk of performing joins on the data.
* Discovering metadata and assessing its accuracy.
* Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
Answer : C
Data processing and analysis cannot happen without data profiling---reviewing source data for content and quality. As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important.
What is data profiling?
Data profiling is the process of reviewing source data, understanding structure, content and interrelationships, and identifying potential for data projects.
Data profiling is a crucial part of:
* Data warehouse and business intelligence (DW/BI) projects---data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
* Data conversion and migration projects---data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
* Source system data quality projects---data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).
Data profiling involves:
* Collecting descriptive statistics like min, max, count and sum.
* Collecting data types, length and recurring patterns.
* Tagging data with keywords, descriptions or categories.
* Performing data quality assessment and assessing the risk of performing joins on the data.
* Discovering metadata and assessing its accuracy.
* Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies, and performing inter-table analysis.
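For illustration, several of these profiling steps map directly onto pandas calls; a minimal sketch (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("source_data.csv")  # hypothetical source file

print(df.describe())       # descriptive statistics: count, min, max, mean, ...
print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # missing values per column (basic quality assessment)
print(df["id"].is_unique)  # key-candidate check: are the values unique?
```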
Which command is used to install Jupyter Notebook?
Answer : A
Jupyter Notebook is a web-based interactive computational environment.
The command used to install Jupyter Notebook is `pip install jupyter`.
The command used to start Jupyter Notebook is `jupyter notebook`.
Consider a data frame df with 10 rows and index [ 'r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10']. What does the expression g = df.groupby(df.index.str.len()) do?
Answer : D
The expression groups the rows of df by the length of their index labels: df.index.str.len() returns the label lengths [2, 2, 2, 4, 4, 4, 2, 2, 2, 5], and groupby uses those values as group keys, so g contains groups for lengths 2, 4, and 5.
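A quick sketch to verify the behavior (the column values are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": range(10)},
                  index=["r1", "r2", "r3", "row4", "row5",
                         "row6", "r7", "r8", "r9", "row10"])

# Group keys are the index-label lengths: 2, 4 and 5.
g = df.groupby(df.index.str.len())
print(g.size())  # length 2 -> 6 rows, length 4 -> 3 rows, length 5 -> 1 row
```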
Consider a data frame df with columns ['A', 'B', 'C', 'D'] and rows ['r1', 'r2', 'r3']. What does the expression df[lambda x : x.index.str.endswith('3')] do?
Answer : D
It filters the data frame down to the row labelled r3: the callable is applied to df, and the resulting boolean mask selects only the index labels ending in '3'.
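A quick sketch to verify (the cell values are made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=["A", "B", "C", "D"],
                  index=["r1", "r2", "r3"])

# The callable receives df and returns a boolean mask over the index.
print(df[lambda x: x.index.str.endswith("3")])
#     A   B   C   D
# r3  9  10  11  12
```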
Which object records data manipulation language (DML) changes made to tables, including inserts, updates, and deletes, as well as metadata about each change, so that actions can be taken using the changed data in Data Science pipelines?
Answer : C
A stream object records data manipulation language (DML) changes made to tables, including inserts, updates, and deletes, as well as metadata about each change, so that actions can be taken using the changed data. This process is referred to as change data capture (CDC). An individual table stream tracks the changes made to rows in a source table. A table stream (also referred to as simply a "stream") makes a "change table" available of what changed, at the row level, between two transactional points of time in a table. This allows querying and consuming a sequence of change records in a transactional fashion.
Streams can be created to query change data on the following objects:
* Standard tables, including shared tables.
* Views, including secure views
* Directory tables
* Event tables
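For illustration, a stream can be created and consumed from Python through snowflake-connector-python; a minimal sketch in which the connection parameters and object names are all hypothetical:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    database="my_db", schema="my_schema",
)
cur = conn.cursor()

# Create a stream on a source table; it records row-level DML changes.
cur.execute("CREATE OR REPLACE STREAM my_stream ON TABLE my_table")

# ... inserts, updates, and deletes happen on my_table ...

# Consume the change records; metadata columns such as METADATA$ACTION
# indicate what kind of change each row represents.
cur.execute("SELECT * FROM my_stream")
for row in cur:
    print(row)

conn.close()
```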
A Data Scientist, as a data provider, needs to allow consumers to access all databases and database objects in a share by granting a single privilege on shared databases. Which of the following SnowSQL commands used by her while doing this task is incorrect?
Assuming:
A database named product_db exists with a schema named product_agg and a table named Item_agg.
The database, schema, and table will be shared with two accounts named xy12345 and yz23456.
1. USE ROLE accountadmin;
2. CREATE DIRECT SHARE product_s;
3. GRANT USAGE ON DATABASE product_db TO SHARE product_s;
4. GRANT USAGE ON SCHEMA product_db.product_agg TO SHARE product_s;
5. GRANT SELECT ON TABLE product_db.product_agg.Item_agg TO SHARE product_s;
6. SHOW GRANTS TO SHARE product_s;
7. ALTER SHARE product_s ADD ACCOUNTS=xy12345, yz23456;
8. SHOW GRANTS OF SHARE product_s;
Answer : C
CREATE SHARE product_s is the correct SnowSQL command to create the share object; CREATE DIRECT SHARE is not valid syntax. The rest of the commands are correct.
https://docs.snowflake.com/en/user-guide/data-sharing-provider#creating-a-share-using-sql