A business-to-business (B2B) ecommerce company wants to develop a fair and equitable risk mitigation strategy to reject potentially fraudulent transactions. The company wants to reject fraudulent transactions despite the possibility of losing some profitable transactions or customers.
Which solution will meet these requirements with the LEAST operational effort?
Answer : C
Amazon Fraud Detector is a managed service designed to detect potentially fraudulent online activities, such as transactions. It uses machine learning and business rules to classify activities as fraudulent or legitimate, minimizing the need for custom model training. By using the Amazon Fraud Detector prediction API, the company can automatically approve or reject transactions flagged as fraudulent, implementing an efficient risk mitigation strategy without extensive operational effort.
This approach requires minimal setup and effectively allows the company to block fraudulent transactions with high confidence, addressing the business's need to balance risk mitigation and customer impact.
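As a rough illustration of how the prediction API is used, the boto3 sketch below scores a single transaction with GetEventPrediction and rejects it when a matched rule returns a block outcome. The detector name, event type, entity, and event variables are placeholders for illustration, not values from the question.

```python
import boto3
from datetime import datetime, timezone

client = boto3.client("frauddetector")

# All identifiers below are assumed example values.
response = client.get_event_prediction(
    detectorId="transaction_fraud_detector",
    eventId="txn-0001",
    eventTypeName="b2b_transaction",
    eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    entities=[{"entityType": "customer", "entityId": "cust-42"}],
    eventVariables={
        "order_amount": "1875.00",
        "billing_country": "DE",
        "payment_method": "invoice",
    },
)

# Collect the outcomes returned by the detector's matched rules and
# reject the transaction when any rule maps to a "block" outcome.
outcomes = [o for rule in response["ruleResults"] for o in rule["outcomes"]]
if "block" in outcomes:
    print("Transaction rejected as potentially fraudulent")
else:
    print("Transaction approved")
```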
A cybersecurity company is collecting on-premises server logs, mobile app logs, and IoT sensor data. The company backs up the ingested data in an Amazon S3 bucket and sends the ingested data to Amazon OpenSearch Service for further analysis. Currently, the company has a custom ingestion pipeline that runs on Amazon EC2 instances. The company needs to implement a new serverless ingestion pipeline that can automatically scale to handle sudden changes in the data flow.
Which solution will meet these requirements MOST cost-effectively?
Answer : B
To build a scalable, serverless, and cost-effective data ingestion pipeline, this solution uses a Kinesis data stream to handle fluctuations in data flow, buffering and distributing incoming data in real time. By connecting two Amazon Kinesis Data Firehose delivery streams to the Kinesis data stream, the company can simultaneously route data to Amazon S3 for backup and Amazon OpenSearch Service for analysis.
This approach meets all requirements by providing automatic scaling, reducing operational overhead, and ensuring data storage and analysis without duplicating efforts or needing additional infrastructure.
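A minimal producer-side sketch, assuming a stream named log-ingestion-stream and ad hoc event fields: applications write each log event once to the Kinesis data stream, and the two Firehose delivery streams attached to it fan the same records out to Amazon S3 and OpenSearch Service without any custom consumer code.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Assumed payload shape; real logs would carry whatever fields the sources emit.
log_event = {
    "source": "iot-sensor-17",
    "timestamp": "2024-05-01T12:00:00Z",
    "message": "temperature threshold exceeded",
}

# One write to the stream is enough: both Firehose delivery streams read
# from it independently, one delivering to S3 and one to OpenSearch Service.
kinesis.put_record(
    StreamName="log-ingestion-stream",  # assumed stream name
    Data=json.dumps(log_event).encode("utf-8"),
    PartitionKey=log_event["source"],
)
```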
A data scientist uses Amazon SageMaker Data Wrangler to analyze and visualize data. The data scientist wants to refine a training dataset by selecting predictor variables that are strongly predictive of the target variable. The target variable correlates with other predictor variables.
The data scientist wants to understand the variance in the data along various directions in the feature space.
Which solution will meet these requirements?
Answer : C
Principal Component Analysis (PCA) is a dimensionality reduction technique that captures the variance within the feature space, helping to understand the directions in which data varies most. In SageMaker Data Wrangler, the multicollinearity measurement and PCA features allow the data scientist to analyze interdependencies between predictor variables while reducing redundancy. PCA transforms correlated features into a set of uncorrelated components, helping to simplify the dataset without significant loss of information, making it ideal for refining features based on variance.
Options A and D offer methods to understand feature relevance but are less effective for managing multicollinearity and variance representation in the data.
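Outside the Data Wrangler UI, the same idea can be sketched with scikit-learn on a small synthetic set of correlated features; the explained variance ratio shows how much of the data's variance each principal direction captures. The data here is fabricated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Build four strongly correlated columns from one latent signal plus noise,
# standing in for correlated predictor variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + rng.normal(scale=0.1, size=(500, 1)) for _ in range(4)])

# Standardize, then project onto principal components to see how the
# variance is distributed across directions of the feature space.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=4).fit(X_scaled)
print(pca.explained_variance_ratio_)  # the first component dominates
```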
An ecommerce company has observed that customers who use the company's website rarely view items that the website recommends to customers. The company wants to recommend items to customers that customers are more likely to want to purchase.
Which solution will meet this requirement in the SHORTEST amount of time?
Answer : C
Amazon Personalize is a managed AWS service specifically designed to deliver personalized recommendations with minimal development time. It uses machine learning algorithms tailored for recommendation systems, making it highly suitable for applications where quick integration is essential. By using Amazon Personalize, the company can leverage existing customer data to generate real-time, personalized product recommendations that align better with customer preferences, enhancing the likelihood of customer engagement with recommended items.
Options involving EC2 instances with GPU or accelerated computing primarily enhance computational performance but do not inherently improve recommendation relevance, while Amazon SageMaker would require more development effort to achieve similar results.
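Once a Personalize solution has been trained and deployed to a campaign, retrieving recommendations is a single runtime call, as the hedged sketch below shows. The campaign ARN and user ID are placeholders, not resources from the question.

```python
import boto3

personalize_rt = boto3.client("personalize-runtime")

# Assumed campaign ARN produced by a previously trained Personalize solution.
response = personalize_rt.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/product-recs",
    userId="user-1001",
    numResults=10,
)

# Each entry carries an itemId (and optionally a score) to render on the site.
for item in response["itemList"]:
    print(item["itemId"])
```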
A company is building a predictive maintenance model for its warehouse equipment. The model must predict the probability of failure of all machines in the warehouse. The company has collected 10,000 event samples within 3 months. The event samples include 100 failure cases that are evenly distributed across 50 different machine types.
How should the company prepare the data for the model to improve the model's accuracy?
Answer : B
In predictive maintenance, when a dataset is imbalanced (with far fewer failure cases than non-failure cases), oversampling the minority class helps the model learn from the minority class effectively. The Synthetic Minority Oversampling Technique (SMOTE) generates synthetic samples for the minority class by creating data points between existing minority class instances. This can enhance the model's ability to recognize failure patterns, particularly in imbalanced datasets.
SMOTE increases the effective presence of failure cases in the dataset, providing a balanced learning environment for the model. This is more effective than undersampling, which would risk losing important non-failure data.
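A short sketch with the imbalanced-learn library shows the effect on a synthetic dataset that mirrors the roughly 1% failure rate in the question; the class counts before and after resampling illustrate how SMOTE balances the training data. The dataset is generated, not the company's actual data.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the maintenance data: ~1% positive (failure) cases.
X, y = make_classification(
    n_samples=10_000,
    weights=[0.99, 0.01],
    random_state=42,
)
print("before:", Counter(y))

# SMOTE interpolates between existing minority samples to create synthetic
# failure cases until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```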
A company wants to use machine learning (ML) to improve its customer churn prediction model. The company stores data in an Amazon Redshift data warehouse.
A data science team wants to use Amazon Redshift machine learning (Amazon Redshift ML) to build a model and run predictions for new data directly within the data warehouse.
Which combination of steps should the company take to use Amazon Redshift ML to meet these requirements? (Select THREE.)
Answer : A, C, F
Amazon Redshift ML enables in-database machine learning model creation and predictions, allowing data scientists to leverage Redshift for model training without needing to export data.
To create and run a model for customer churn prediction in Amazon Redshift ML:
Define the feature variables and target variable: Identify the columns to use as features (predictors) and the target variable (outcome) for the churn prediction model.
Create the model: Write a CREATE MODEL SQL statement, which trains the model using Amazon Redshift's integration with Amazon SageMaker and stores the model directly in Redshift.
Run predictions: Call the prediction function defined in the CREATE MODEL statement from a SQL SELECT query to score new data directly within Redshift (see the sketch below).
Options B, D, and E are not required as Redshift ML handles model creation and prediction without manual data export to Amazon S3 or additional Spectrum integration.
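For illustration, the sketch below runs the two key SQL statements through the redshift_connector Python driver. The connection details, table, column, function, and bucket names are assumptions, not part of the question, and CREATE MODEL trains asynchronously, so the prediction query is only valid once training has completed.

```python
import redshift_connector

# Assumed connection details for an existing Redshift cluster.
conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="example-password",
)
cursor = conn.cursor()

# Step 1: CREATE MODEL hands training off to the SageMaker integration and
# registers a prediction function in Redshift. Table and column names are
# placeholders for the churn features and target.
cursor.execute("""
    CREATE MODEL customer_churn_model
    FROM (SELECT age, tenure_months, monthly_spend, churned
          FROM customer_activity)
    TARGET churned
    FUNCTION predict_churn
    IAM_ROLE default
    SETTINGS (S3_BUCKET 'redshift-ml-artifacts-example');
""")

# Step 2 (after training finishes): score new rows in place with the
# generated prediction function.
cursor.execute("""
    SELECT customer_id,
           predict_churn(age, tenure_months, monthly_spend) AS churn_prediction
    FROM new_customer_activity;
""")
print(cursor.fetchall())
```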
A data scientist needs to create a model for predictive maintenance. The model will be based on historical data to identify rare anomalies in the data.
The historical data is stored in an Amazon S3 bucket. The data scientist needs to use Amazon SageMaker Data Wrangler to ingest the data. The data scientist also needs to perform exploratory data analysis (EDA) to understand the statistical properties of the data.
Which solution will meet these requirements with the LEAST amount of compute resources?
Answer : C
To perform efficient exploratory data analysis (EDA) on a large dataset for anomaly detection, using the First K option in SageMaker Data Wrangler is an optimal choice. This option allows the data scientist to select the first K rows, limiting the data loaded into memory, which conserves compute resources.
Given that the First K option allows the data scientist to determine K based on domain knowledge, this approach provides a representative sample without requiring extensive compute resources. Other options like randomized sampling may not provide data samples that are as useful for initial analysis in a time-series or sequential dataset context.
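Data Wrangler's First K sampling is configured in the console rather than in code, but the same resource-saving idea can be sketched with pandas: read only the first K rows of the S3 object into memory and run summary statistics on that slice. The bucket path and K value are placeholders.

```python
import pandas as pd

# Assumed S3 path and K value; reading with an s3:// URL requires the s3fs
# package. Only the first K rows are loaded, keeping memory and compute small.
K = 50_000
df = pd.read_csv("s3://example-bucket/maintenance/history.csv", nrows=K)

# Lightweight EDA on the sample: distributions, summary statistics,
# and missing-value rates per column.
print(df.describe(include="all"))
print(df.isna().mean())
```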