Amazon MLS-C01 AWS ML Specialty Exam Practice Test Instant Access

Question 1

[Modeling]

A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.

Which storage scheme is MOST adapted to this scenario?

AStore datasets as files in Amazon S3.

BStore datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.

CStore datasets as tables in a multi-node Amazon Redshift cluster.

DStore datasets as global tables in Amazon DynamoDB.

Answer : A

Question 2

[Data Engineering]

A machine learning specialist stores IoT soil sensor data in Amazon DynamoDB table and stores weather event data as JSON files in Amazon S3. The dataset in DynamoDB is 10 GB in size and the dataset in Amazon S3 is 5 GB in size. The specialist wants to train a model on this data to help predict soil moisture levels as a function of weather events using Amazon SageMaker.

Which solution will accomplish the necessary transformation to train the Amazon SageMaker model with the LEAST amount of administrative overhead?

ALaunch an Amazon EMR cluster. Create an Apache Hive external table for the DynamoDB table and S3 data. Join the Hive tables and write the results out to Amazon S3.

BCrawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output to an Amazon Redshift cluster.

CEnable Amazon DynamoDB Streams on the sensor table. Write an AWS Lambda function that consumes the stream and appends the results to the existing weather files in Amazon S3.

DCrawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output in CSV format to Amazon S3.

Answer : D

Question 3

[Exploratory Data Analysis]

A company wants to segment a large group of customers into subgroups based on shared characteristics. The company's data scientist is planning to use the Amazon SageMaker built-in k-means clustering algorithm for this task. The data scientist needs to determine the optimal number of subgroups (k) to use.

Which data visualization approach will MOST accurately determine the optimal value of k?

ACalculate the principal component analysis (PCA) components. Run the k-means clustering algorithm for a range of k by using only the first two PCA components. For each value of k, create a scatter plot with a different color for each cluster. The optimal value of k is the value where the clusters start to look reasonably separated.

BCalculate the principal component analysis (PCA) components. Create a line plot of the number of components against the explained variance. The optimal value of k is the number of PCA components after which the curve starts decreasing in a linear fashion.

CCreate a t-distributed stochastic neighbor embedding (t-SNE) plot for a range of perplexity values. The optimal value of k is the value of perplexity, where the clusters start to look reasonably separated.

DRun the k-means clustering algorithm for a range of k. For each value of k, calculate the sum of squared errors (SSE). Plot a line chart of the SSE for each value of k. The optimal value of k is the point after which the curve starts decreasing in a linear fashion.

Answer : D

Question 4

[Data Engineering]

A data scientist needs to create a model for predictive maintenance. The model will be based on historical data to identify rare anomalies in the data.

The historical data is stored in an Amazon S3 bucket. The data scientist needs to use Amazon SageMaker Data Wrangler to ingest the dat

a. The data scientists also needs to perform exploratory data analysis (EDA) to understand the statistical properties of the data.

Which solution will meet these requirements with the LEAST amount of compute resources?

AImport the data by using the None option.

BImport the data by using the Stratified option.

CImport the data by using the First K option. Infer the value of K from domain knowledge.

DImport the data by using the Randomized option. Infer the random size from domain knowledge.

Answer : C

Question 5

[Modeling]

An aircraft engine manufacturing company is measuring 200 performance metrics in a time-series. Engineers

want to detect critical manufacturing defects in near-real time during testing. All of the data needs to be stored

for offline analysis.

What approach would be the MOST effective to perform near-real time defect detection?

AUse AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from withinAWS IoT Analytics to carry out analysis for anomalies.

BUse Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry outApache Spark ML k-means clustering to determine anomalies.

CUse Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random CutForest (RCF) algorithm to determine anomalies.

DUse Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest(RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for furtheranalysis.

Answer : D

Question 6

[Modeling]

A machine learning (ML) engineer is creating a binary classification model. The ML engineer will use the model in a highly sensitive environment.

There is no cost associated with missing a positive label. However, the cost of making a false positive inference is extremely high.

What is the most important metric to optimize the model for in this scenario?

AAccuracy

BPrecision

CRecall

DF1

Answer : B

Question 7

[Modeling]

An insurance company developed a new experimental machine learning (ML) model to replace an existing model that is in production. The company must validate the quality of predictions from the new experimental model in a production environment before the company uses the new experimental model to serve general user requests.

Which one model can serve user requests at a time. The company must measure the performance of the new experimental model without affecting the current live traffic

Which solution will meet these requirements?

AA/B testing

BCanary release

CShadow deployment

DBlue/green deployment

Answer : C

Amazon MLS-C01 AWS Certified Machine Learning - Specialty AWS ML Specialty Exam Practice Test