A company stores its documents in Amazon S3 with no predefined product categories. A data scientist needs to build a machine learning model to categorize the documents for all the company's products.
Which solution will meet these requirements with the MOST operational efficiency?
Answer : C
Amazon SageMaker's Neural Topic Model (NTM) is designed to uncover underlying topics within text data by clustering documents based on topic similarity. For document categorization, NTM can identify product categories by analyzing and grouping the documents, making it an efficient choice for unsupervised learning where predefined categories do not exist.
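For illustration, a minimal sketch of launching the built-in NTM algorithm with the SageMaker Python SDK; the bucket, role, and hyperparameter values below are placeholder assumptions, not part of the question:

```python
# Sketch: train SageMaker's built-in Neural Topic Model (NTM) on documents that
# have been preprocessed into bag-of-words vectors. All names/paths are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# Resolve the NTM container image for the current region
ntm_image = image_uris.retrieve("ntm", session.boto_region_name)

ntm = Estimator(
    image_uri=ntm_image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://example-bucket/ntm-output/",  # placeholder
    sagemaker_session=session,
)

# num_topics sets how many latent "categories" NTM discovers;
# feature_dim must equal the vocabulary size of the bag-of-words vectors.
ntm.set_hyperparameters(num_topics=20, feature_dim=5000, mini_batch_size=128)

ntm.fit({
    "train": TrainingInput(
        "s3://example-bucket/ntm-input/train/",  # placeholder
        content_type="application/x-recordio-protobuf",
    )
})
```

Each document then receives a topic mixture, and the dominant topic can serve as its product category.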
A company's machine learning (ML) specialist is designing a scalable data storage solution for Amazon SageMaker. The company has an existing TensorFlow-based model that uses a train.py script. The model relies on static training data that is currently stored in TFRecord format.
What should the ML specialist do to provide the training data to SageMaker with the LEAST development overhead?
Answer : D
Amazon SageMaker script mode allows users to bring custom training scripts (such as train.py) without needing extensive modifications for specific data formats like TFRecord. By storing the TFRecord data in an Amazon S3 bucket and pointing the SageMaker training job to this bucket, the model can directly access the data, allowing the ML specialist to train the model without additional reformatting or data processing steps.
This approach minimizes development overhead and leverages SageMaker's built-in support for custom training scripts and S3 integration, making it the most efficient choice.
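A minimal sketch of this setup, assuming placeholder names for the role, framework versions, and S3 path:

```python
# Sketch: run the existing train.py with SageMaker script mode, pointing the
# training channel at the TFRecord files in S3. Role, versions, and paths are
# placeholders, not values taken from the question.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",  # the company's existing script, unmodified
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.13",
    py_version="py310",
)

# SageMaker copies the channel contents to /opt/ml/input/data/training inside
# the container, where train.py can open them with tf.data.TFRecordDataset
estimator.fit({"training": "s3://example-bucket/tfrecords/"})  # placeholder path
```

Inside train.py, the script can read the channel directory from the SM_CHANNEL_TRAINING environment variable, so the TFRecord format is consumed unchanged.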
A finance company has collected stock return data for 5,000 publicly traded companies. A financial analyst has a dataset that contains 2,000 attributes for each company. The financial analyst wants to use Amazon SageMaker to identify the top 15 attributes that are most valuable to predict future stock returns.
Which solution will meet these requirements with the LEAST operational overhead?
Answer : D
Amazon SageMaker Autopilot is a fully managed solution that automatically explores different ML models and selects the most effective ones for a given prediction task. After model training, Amazon SageMaker Clarify can generate feature importance scores, identifying the top features in a straightforward, automated manner with minimal manual intervention.
By using SageMaker Autopilot, the financial analyst can obtain the desired feature importance ranking with minimal setup and low operational overhead, as opposed to manually configuring and comparing models in SageMaker.
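A hedged sketch of this workflow with the SageMaker Python SDK; the role, paths, and label column are placeholder assumptions:

```python
# Sketch: run SageMaker Autopilot on the tabular stock dataset, then read the
# feature-importance (explainability) artifacts that Autopilot generates with
# SageMaker Clarify. Role, paths, and the label column are placeholders.
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    target_attribute_name="future_return",  # placeholder label column
    max_candidates=10,
    output_path="s3://example-bucket/autopilot-output/",  # placeholder
)

# Autopilot profiles the data, engineers features, and trains candidate models
automl.fit(inputs="s3://example-bucket/stock-returns.csv", wait=True)  # placeholder

# The best candidate ships with a Clarify explainability report that contains
# global feature attributions; the top 15 attributes can be read from it.
desc = automl.describe_auto_ml_job()
artifacts = desc["BestCandidate"]["CandidateProperties"]["CandidateArtifactLocations"]
print(artifacts["Explainability"])  # S3 prefix of the feature-importance report
```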
A company has a podcast platform that has thousands of users. The company implemented an algorithm to detect low podcast engagement based on a 10-minute running window of user events such as listening to, pausing, and closing the podcast. A machine learning (ML) specialist is designing the ingestion process for these events. The ML specialist needs to transform the data to prepare it for inference.
How should the ML specialist design the transformation step to meet these requirements with the LEAST operational effort?
Answer : C
In this scenario, Kinesis Data Streams efficiently ingests real-time event data, while Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics) is ideal for transforming and analyzing data in a continuous stream. Apache Flink allows processing of time-based windows, such as the 10-minute sliding window required here, with low operational overhead.
This combination provides an effective solution for low-latency data processing and transformation, meeting the requirements for preparing data for inference with minimal setup and serverless scalability.
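A sketch of what the transformation could look like in PyFlink SQL on the managed service; the stream name, schema, and engagement features are illustrative assumptions, and the Kinesis connector jar must be available in the Flink environment:

```python
# Sketch: a 10-minute window sliding every minute over podcast events in PyFlink
# SQL, roughly as it could run on Amazon Managed Service for Apache Flink.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by the Kinesis data stream of raw user events
t_env.execute_sql("""
    CREATE TABLE podcast_events (
        user_id    STRING,
        event_type STRING,              -- e.g. 'play', 'pause', 'close'
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'podcast-events',    -- placeholder stream name
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Hopping (sliding) window: 10-minute windows that advance every minute
features = t_env.sql_query("""
    SELECT
        user_id,
        HOP_END(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE) AS window_end,
        COUNT(*) AS events_in_window,
        SUM(CASE WHEN event_type = 'pause' THEN 1 ELSE 0 END) AS pause_count
    FROM podcast_events
    GROUP BY user_id, HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE)
""")

features.execute().print()  # in practice, sink these features to the inference input
```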
A business-to-business (B2B) ecommerce company wants to develop a fair and equitable risk mitigation strategy to reject potentially fraudulent transactions. The company wants to reject fraudulent transactions despite the possibility of losing some profitable transactions or customers.
Which solution will meet these requirements with the LEAST operational effort?
Answer : C
Amazon Fraud Detector is a managed service designed to detect potentially fraudulent online activities, such as transactions. It uses machine learning and business rules to classify activities as fraudulent or legitimate, minimizing the need for custom model training. By using the Amazon Fraud Detector prediction API, the company can automatically approve or reject transactions flagged as fraudulent, implementing an efficient risk mitigation strategy without extensive operational effort.
This approach requires minimal setup and effectively allows the company to block fraudulent transactions with high confidence, addressing the business's need to balance risk mitigation and customer impact.
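A minimal sketch of the prediction call with boto3; the detector, event type, variables, and outcome names are placeholders that would be configured in Fraud Detector beforehand, not values taken from the question:

```python
# Sketch: score a transaction with the Amazon Fraud Detector prediction API and
# reject it when a detector rule returns a "fraud" outcome.
import boto3

client = boto3.client("frauddetector")

response = client.get_event_prediction(
    detectorId="transaction_fraud_detector",   # placeholder detector
    eventId="txn-0001",
    eventTypeName="transaction",               # placeholder event type
    entities=[{"entityType": "customer", "entityId": "cust-42"}],
    eventTimestamp="2024-01-01T12:00:00Z",
    eventVariables={"amount": "250.00", "ip_address": "203.0.113.7"},
)

# Each matched rule contributes the outcomes configured in the detector version
outcomes = [o for rule in response["ruleResults"] for o in rule["outcomes"]]
decision = "reject" if "fraud" in outcomes else "approve"
print(decision)
```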
A company is building a new supervised classification model in an AWS environment. The company's data science team notices that the dataset has a large quantity of variables. All the variables are numeric. The model accuracy for training and validation is low. The model's processing time is affected by high latency. The data science team needs to increase the accuracy of the model and decrease the processing time.
What should the data science team do to meet these requirements?
Answer : B
Answer B points to principal component analysis (PCA), which reduces the large numeric feature set to a smaller number of uncorrelated components; this lowers processing latency and can improve accuracy by removing noisy, redundant dimensions (see the sketch after the drawbacks below). The other options are not effective or appropriate, because they have the following drawbacks:
C: Applying normalization on the feature set can improve model accuracy by scaling the features to a common range so that no feature dominates the others, and it can shorten processing time by reducing numerical instability and improving convergence. However, normalization alone does not address the high dimensionality or the latency issue, because it does not reduce the number of features or the variance in the data.
D: Using a multiple correspondence analysis (MCA) model is not suitable for numeric variables, as it is a technique that reduces the dimensionality of the dataset by transforming the original categorical variables into a smaller set of new variables, called factors, that capture most of the inertia and information in the data. MCA is similar to PCA, but it is designed for nominal or ordinal variables, not for continuous or interval variables.
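As a hedged illustration of the PCA approach the references point to, here is a minimal scikit-learn sketch; the data shape and the 95% variance threshold are arbitrary illustrations, not values from the question:

```python
# Sketch: PCA-style dimensionality reduction on an all-numeric feature set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 500)  # placeholder: many numeric variables

# Standardize first so high-variance features do not dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components that explain ~95% of the variance;
# fewer features mean lower latency and often better generalization
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```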
References:
Principal Component Analysis - Amazon SageMaker
Dimensionality Reduction and Its Applications | by Aniruddha Bhandari | Towards Data Science
Principal Component Analysis (PCA) in Python | by Susan Li | Towards Data Science
Feature Engineering for Machine Learning | by Dipanjan (DJ) Sarkar | Towards Data Science
Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization | by Benjamin Obi Tayo Ph.D. | Towards Data Science
Why, How and When to Scale your Features | by George Seif | Towards Data Science
Normalization vs Dimensionality Reduction | by Saurabh Annadate | Towards Data Science
Multiple Correspondence Analysis - Amazon SageMaker
Multiple Correspondence Analysis (MCA) | by Raul Eulogio | Towards Data Science
A company builds computer-vision models that use deep learning for the autonomous vehicle industry. A machine learning (ML) specialist uses an Amazon EC2 instance that has a CPU:GPU ratio of 12:1 to train the models.
The ML specialist examines the instance metric logs and notices that the GPU is idle half of the time. The ML specialist must reduce training costs without increasing the duration of the training jobs.
Which solution will meet these requirements?
Answer : D
Switching to an instance type that has a CPU:GPU ratio of 6:1 reduces training costs by provisioning fewer CPUs per GPU while keeping the same GPU capacity, so the duration of the training jobs is unchanged. The GPU idle time indicates an imbalance between CPU and GPU resources on the current instance; lowering the CPU:GPU ratio rebalances the workload and improves GPU utilization, so the company stops paying for capacity that sits idle. A lower CPU:GPU ratio also means less overhead for inter-process communication and synchronization between the CPU and GPU processes.
References:
Optimizing GPU utilization for AI/ML workloads on Amazon EC2
Analyze CPU vs. GPU Performance for AWS Machine Learning
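Before resizing, the utilization claim can be verified from metrics. A hedged sketch with boto3, assuming the CloudWatch agent publishes NVIDIA GPU metrics under its default namespace; the namespace, metric name, and instance ID are assumptions that vary by setup:

```python
# Sketch: pull average GPU utilization for the instance from CloudWatch to
# confirm the idle time before switching instance types.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

resp = cw.get_metric_statistics(
    Namespace="CWAgent",                       # assumed agent namespace
    MetricName="nvidia_smi_utilization_gpu",   # assumed agent metric name
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}% GPU utilization")
```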