CLD120 Module 15 Knowledge Check
Question 1
Which scenario describes a challenge related to data velocity?
- A sales department wants to use a data source but does not have information about its lineage or the quality of the data.
- A shopping website collects clickstream data to make personalized recommendations while a user is shopping. When the website is very busy, there is a delay in returning results to customers.
- A pipeline ingests data from regional sales sites, and the overnight batch job fails because it runs out of disk space.
- Regional offices send data in different file formats to an organization’s head office.
Question 2
Which statement describes the goal of a modern data architecture?
- Give users the ability to access all of an organization’s data through a highly structured data warehouse that provides for fast SQL queries.
- Give users the ability to access all of an organization’s data by integrating a data lake, a data warehouse, and other purpose-built data stores.
- Select a single ingestion service that can support the data formats, structures, and velocity requirements of all the data sources that will be collected.
- Select purpose-built streaming services to make all of the organization’s data available for real-time analysis.
Question 3
Which analytics workload scenario is a use case for batch ingestion?
- Populate a dashboard with real-time error rates of sensors in a factory.
- Send small bits of clickstream data at a continuous pace from a retailer's website for immediate analysis.
- Produce real-time alerts based on log data to identify potential fraud as soon as it occurs.
- Send sales transaction data from a retailer's website to a central location periodically. Analyze the data overnight, and deliver reports to branches in the morning.
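For reference, batch ingestion collects records over a period and moves them in bulk on a schedule, rather than record by record. A minimal sketch of a nightly batch upload to Amazon S3, assuming hypothetical bucket and file names:

```python
import boto3
from datetime import date

s3 = boto3.client("s3")

# Upload the day's accumulated sales transactions as one batch object.
# A scheduler (for example, a nightly cron job) would invoke this script.
s3.upload_file(
    Filename="/var/exports/sales_transactions.csv",
    Bucket="example-sales-data-lake",
    Key=f"raw/sales/{date.today().isoformat()}/transactions.csv",
)
```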
Question 4
A medical research company has a ribonucleic acid (RNA) sequencing machine that stores its private results on the lab’s on-premises network-attached storage. The company’s data science team wants to ingest these results into its AWS account. How should the team ingest this data?
- Use AWS Database Migration Service (AWS DMS) to sync data from the on-premises file store to Amazon S3.
- Use Amazon AppFlow to connect to the on-premises data and move the data into the pipeline.
- Use AWS Data Exchange to subscribe to the RNA-sequencing data.
- Use AWS DataSync to transfer data from the on-premises file store to an Amazon S3 bucket in the data lake.
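For reference, AWS DataSync is purpose-built for moving files from on-premises storage into Amazon S3. A minimal boto3 sketch, assuming a DataSync agent is already deployed on premises; every hostname, ARN, and bucket name below is a placeholder:

```python
import boto3

datasync = boto3.client("datasync")

# Source: the lab's on-premises NAS, exposed over NFS through a deployed agent.
nfs_location = datasync.create_location_nfs(
    ServerHostname="nas.lab.example.com",
    Subdirectory="/rna-sequencing/results",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE"]},
)

# Destination: an S3 bucket in the data lake.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-genomics-data-lake",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Create the transfer task and run it.
task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```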
Question 5
A data engineer has ingested a new JSON file into an Amazon S3 bucket in their data lake. The AWS Glue Data Catalog maintains metadata about data in the lake. Which feature of AWS Glue can the data engineer use to discover the JSON data schema with the fewest steps in a code-free way?
- Run an AWS Glue crawler on the S3 bucket.
- Set up an AWS Glue workflow to orchestrate a set of jobs that transforms the data into an open columnar format.
- Use AWS Glue Studio to write a script that converts the JSON data to Apache Parquet format.
- Use AWS Glue Studio to transform the data and move it into the data warehouse.
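The crawler approach is code-free in practice: it can be created and run entirely from the AWS Glue console, and it infers the JSON schema and registers it in the Data Catalog automatically. For reference only, the equivalent boto3 calls look roughly like this, with placeholder names:

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at the S3 prefix holding the new JSON file.
glue.create_crawler(
    Name="json-schema-discovery",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="data_lake_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/json/"}]},
)

# The crawler infers the JSON schema and writes it to the Data Catalog.
glue.start_crawler(Name="json-schema-discovery")
```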
Question 6
A data pipeline will ingest clickstream data from a shopping website. The data engineer must transform the data as it arrives to feed a real-time Amazon OpenSearch Service analytics dashboard. They must also generate a monthly report based on the dashboard data. Which configuration meets these needs?
- Use Amazon Data Firehose to capture the data and send the data to Amazon S3. Run an AWS Glue crawler to support querying the data for the dashboard.
- Use Amazon Kinesis Data Streams to capture the data. Use Amazon Managed Service for Apache Flink to consume and transform data from the stream. Use Amazon Data Firehose to deliver transformed data to OpenSearch Service.
- Use Amazon Kinesis Data Streams to capture the data. Use Amazon Managed Service for Apache Flink as a consumer to deliver data to OpenSearch Service. Use Amazon Redshift Spectrum to run SQL queries on the data stream for the monthly report.
- Use two data pipelines: one to ingest data into Amazon Managed Service for Apache Flink and send it to OpenSearch Service and one to send the streaming data directly from Amazon Kinesis Data Streams to Amazon S3.
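For reference, clickstream producers in pipelines like this typically write events to a Kinesis data stream, which downstream consumers such as Apache Flink applications then read and transform. A minimal producer sketch with placeholder stream and field names:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# One clickstream event; the partition key spreads traffic across shards.
event = {"user_id": "u-123", "page": "/cart", "action": "add_item"}
kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```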
Question 7
Which statement accurately describes a consideration for designing pipeline storage?
- Archive data out of relational databases into a more cost-efficient storage option.
- Store raw data that will be used for analytics in a data warehouse.
- Choose the lowest-cost storage option regardless of the intended use case.
- Choose the storage option that provides the fastest queries regardless of the use case.
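For reference, once archived data lands in Amazon S3, moving it to a cheaper storage class is commonly automated with a lifecycle rule. A minimal sketch, assuming a hypothetical bucket and prefix for the exported database records:

```python
import boto3

s3 = boto3.client("s3")

# Transition exported database records to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-db-exports",
                "Filter": {"Prefix": "db-exports/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```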
Question 8
A data engineer is designing low-cost infrastructure to store both structured and unstructured data in a central repository. Which option meets the data engineer’s needs?
- AWS Database Migration Service (AWS DMS)
- Amazon Quantum Ledger Database (Amazon QLDB)
- Amazon Redshift
- Amazon S3
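For reference, a single Amazon S3 bucket accepts objects of any format, which is what lets it serve as one repository for both kinds of data. A minimal sketch with placeholder bucket, keys, and file name:

```python
import boto3

s3 = boto3.client("s3")

# Structured data: a CSV of sales records.
s3.put_object(
    Bucket="example-central-repository",
    Key="structured/sales.csv",
    Body=b"order_id,amount\n1001,59.99\n",
)

# Unstructured data: a raw image, stored in the same bucket.
with open("scan.png", "rb") as image:
    s3.put_object(
        Bucket="example-central-repository",
        Key="unstructured/scan.png",
        Body=image,
    )
```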
Question 9
A DevOps engineer is migrating an on-premises Apache Hadoop cluster to AWS. The cluster runs scheduled jobs by using parallel processing. Which AWS service is the MOST appropriate choice?
- Amazon EMR
- Amazon Managed Service for Apache Flink
- AWS Glue
- AWS Glue DataBrew
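For reference, migrated Hadoop workloads typically run on Amazon EMR as steps submitted to a cluster, and the cluster parallelizes the work across its nodes. A minimal sketch that submits a Spark job to an existing cluster; the cluster ID and script location are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Submit a Spark job as a step to an existing EMR cluster.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",
    Steps=[
        {
            "Name": "nightly-aggregation",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-jobs/aggregate.py"],
            },
        }
    ],
)
```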
Question 10
A marketing manager needs quick, one-time insights about the number of leads and closed deals across multiple postal codes. Which service is the MOST cost-effective way to query daily aggregates of sales data stored in Amazon S3?
- Amazon Redshift
- Amazon Athena
- Amazon QuickSight
- Amazon OpenSearch Service
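For reference, Amazon Athena queries data in place in S3 with standard SQL and charges per query by data scanned, with no infrastructure to provision, which suits one-time, ad hoc analysis. A minimal sketch with placeholder database, table, and result-bucket names:

```python
import boto3

athena = boto3.client("athena")

# Ad hoc aggregate over sales data already sitting in S3.
athena.start_query_execution(
    QueryString="""
        SELECT postal_code,
               COUNT(*) AS leads,
               SUM(CASE WHEN status = 'closed' THEN 1 ELSE 0 END) AS closed_deals
        FROM daily_leads
        GROUP BY postal_code
    """,
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```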