Tutorial: Getting started with Amazon EMR 

Posted by taufik

August 27, 2024

With Amazon EMR you can set up a cluster to process and analyze data with big data frameworks in just a few minutes. This tutorial shows you how to launch a sample cluster using Spark, and how to run a simple PySpark script stored in an Amazon S3 bucket. It covers essential Amazon EMR tasks in three main workflow categories: Plan and Configure, Manage, and Clean Up. 

You’ll find links to more detailed topics as you work through the tutorial, and ideas for additional steps in the Next steps section. If you have questions or get stuck, contact the Amazon EMR team on their Discussion forum. 

 

Prerequisites 

 

Cost 

  • The sample cluster that you create runs in a live environment. The cluster accrues minimal charges. To avoid additional charges, make sure you complete the cleanup tasks in the last step of this tutorial. Charges accrue at the per-second rate according to Amazon EMR pricing. Charges also vary by Region. For more information, see Amazon EMR pricing. 
  • Minimal charges might accrue for small files that you store in Amazon S3. Some or all of the charges for Amazon S3 might be waived if you are within the usage limits of the AWS Free Tier. For more information, see Amazon S3 pricing and AWS Free Tier. 

 

Step 1: Plan and configure an Amazon EMR cluster 

Prepare storage for Amazon EMR 

When you use Amazon EMR, you can choose from a variety of file systems to store input data, output data, and log files. In this tutorial, you use EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that lets you read and write regular files to Amazon S3. For more information, see Work with storage and file systems. 

To create a bucket for this tutorial, follow the instructions in How do I create an S3 bucket? in the Amazon Simple Storage Service Console User Guide. Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster. For example, US West (Oregon) us-west-2. 

Buckets and folders that you use with Amazon EMR have the following limitations: 

  • Names can consist of lowercase letters, numbers, periods (.), and hyphens (-). 
  • Names cannot end in numbers. 
  • A bucket name must be unique across all AWS accounts. 
  • An output folder must be empty. 
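
The naming limitations above can be checked before you create a bucket. The following is a minimal sketch that encodes only the character and trailing-number rules listed in this tutorial. The function name is hypothetical, and Amazon S3 enforces additional rules (such as global uniqueness and length limits) that are not checked here.

```python
import re

# Only the characters this tutorial allows: lowercase letters, numbers,
# periods, and hyphens. Amazon S3 has further rules not checked here.
_ALLOWED_CHARACTERS = re.compile(r"[a-z0-9.-]+")


def is_valid_emr_bucket_name(name):
    """Return True if `name` satisfies the naming limitations listed above."""
    if not _ALLOWED_CHARACTERS.fullmatch(name):
        return False
    # Names cannot end in a number.
    return not name[-1].isdigit()
```

Note that the placeholder DOC-EXAMPLE-BUCKET used throughout this tutorial would fail this check because it is uppercase; it only marks where your own bucket name goes.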

 

Prepare an application with input data for Amazon EMR 

The most common way to prepare an application for Amazon EMR is to upload the application and its input data to Amazon S3. Then, when you submit work to your cluster you specify the Amazon S3 locations for your script and data. 

In this step, you upload a sample PySpark script to your Amazon S3 bucket. We’ve provided a PySpark script for you to use. The script processes food establishment inspection data and returns a results file in your S3 bucket. The results file lists the top ten establishments with the most “Red” type violations. 

You also upload sample input data to Amazon S3 for the PySpark script to process. The input data is a modified version of Health Department inspection results in King County, Washington, from 2006 to 2020. For more information, see King County Open Data: Food Establishment Inspection Data. Following are sample rows from the dataset. 

name, inspection_result, inspection_closed_business, violation_type, violation_points
100 LB CLAM, Unsatisfactory, FALSE, BLUE, 5
100 PERCENT NUTRICION, Unsatisfactory, FALSE, BLUE, 5
7-ELEVEN #2361-39423A, Complete, FALSE, , 0
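
To get a feel for the columns before running anything on a cluster, you can inspect rows like these locally. The sketch below uses only the Python standard library and the three sample rows shown above. It assumes the fields are separated by a comma followed by a space, as displayed here; the downloaded file may be formatted differently. The function name is hypothetical.

```python
import csv
import io
from collections import Counter

# The sample rows shown above, copied verbatim.
SAMPLE_ROWS = """\
name, inspection_result, inspection_closed_business, violation_type, violation_points
100 LB CLAM, Unsatisfactory, FALSE, BLUE, 5
100 PERCENT NUTRICION, Unsatisfactory, FALSE, BLUE, 5
7-ELEVEN #2361-39423A, Complete, FALSE, , 0
"""


def count_violation_types(csv_text):
    """Count how many rows carry each violation_type value."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return Counter(row["violation_type"] for row in reader)


counts = count_violation_types(SAMPLE_ROWS)
print(counts["BLUE"], counts["RED"])
```

The third sample row has an empty violation_type, so it is counted under the empty string rather than under BLUE or RED.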

 

To prepare the example PySpark script for EMR 

1. Copy the example code below into a new file in your editor of choice. 

import argparse

from pyspark.sql import SparkSession


def calculate_red_violations(data_source, output_uri):
    """
    Processes sample food establishment inspection data and queries the data to find the top 10 establishments
    with the most Red violations from 2006 to 2020.

    :param data_source: The URI of your food establishment data CSV, such as 's3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv'.
    :param output_uri: The URI where output is written, such as 's3://DOC-EXAMPLE-BUCKET/restaurant_violation_results'.
    """
    with SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate() as spark:
        # Load the restaurant violation CSV data, treating the first row as column names
        restaurants_df = spark.read.option("header", "true").csv(data_source)

        # Register the DataFrame as a temporary view so it can be queried with SQL
        restaurants_df.createOrReplaceTempView("restaurant_violations")

        # Create a DataFrame of the top 10 restaurants with the most Red violations
        top_red_violation_restaurants = spark.sql("""SELECT name, count(*) AS total_red_violations
          FROM restaurant_violations
          WHERE violation_type = 'RED'
          GROUP BY name
          ORDER BY total_red_violations DESC LIMIT 10""")

        # Write the results to the specified output URI
        top_red_violation_restaurants.write.option("header", "true").mode("overwrite").csv(output_uri)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Both arguments are required; the job cannot run without an input and an output location
    parser.add_argument(
        '--data_source', required=True, help="The URI for your CSV restaurant data, like an S3 bucket location.")
    parser.add_argument(
        '--output_uri', required=True, help="The URI where output is saved, like an S3 bucket location.")
    args = parser.parse_args()
    calculate_red_violations(args.data_source, args.output_uri)

 

2. Save the file as health_violations.py. 

3. Upload health_violations.py to Amazon S3 into the bucket you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide. 

 

To prepare the sample input data for EMR 

1. Download the zip file, food_establishment_data.zip. 

2. Unzip food_establishment_data.zip and save the extracted file as food_establishment_data.csv on your machine. 

3. Upload the CSV file to the S3 bucket that you created for this tutorial. For instructions, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide. 

For more information about setting up data for EMR, see Prepare input data. 
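
If you prefer to script the two uploads instead of using the console, the following is a minimal sketch using boto3, the AWS SDK for Python. boto3 is not part of the original tutorial, so treat this as one possible approach. The function names are hypothetical, and the client is passed in as a parameter so that nothing contacts AWS until you call the function yourself.

```python
import os


def s3_uri(bucket, key):
    """Build the s3:// URI that later steps of the tutorial refer to."""
    return f"s3://{bucket}/{key}"


def upload_tutorial_files(s3_client, bucket, local_paths):
    """Upload each local file to the top level of the bucket.

    `s3_client` is expected to be a boto3 S3 client. Returns the S3 URI
    of every uploaded object, in the same order as `local_paths`.
    """
    uris = []
    for path in local_paths:
        key = os.path.basename(path)  # keep the file name as the object key
        s3_client.upload_file(Filename=path, Bucket=bucket, Key=key)
        uris.append(s3_uri(bucket, key))
    return uris


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-west-2")
#   upload_tutorial_files(
#       s3, "DOC-EXAMPLE-BUCKET",
#       ["health_violations.py", "food_establishment_data.csv"])
```

The returned URIs are the values you supply later for the application location and the data source argument.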

 

Launch an Amazon EMR cluster 

After you prepare a storage location and your application, you can launch a sample Amazon EMR cluster. In this step, you launch an Apache Spark cluster using the latest Amazon EMR release version. 

1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr. 

2. Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster. 

3. On the Create Cluster page, note the default values for Release, Instance type, Number of instances, and Permissions. These fields automatically populate with values that work for general-purpose clusters. 

4. In the Cluster name field, enter a unique cluster name to help you identify your cluster, such as My first cluster. Your cluster name can’t contain the characters <, >, $, |, or ` (backtick). 

5. Under Applications, choose the Spark option to install Spark on your cluster. 

6. Under Cluster logs, select the Publish cluster-specific logs to Amazon S3 check box. Replace the Amazon S3 location value with the Amazon S3 bucket you created, followed by /logs. For example, s3://DOC-EXAMPLE-BUCKET/logs. Adding /logs creates a new folder called ‘logs’ in your bucket, where Amazon EMR can copy the log files of your cluster. 

7. Under Security configuration and permissions, choose your EC2 key pair. In the same section, select the Service role for Amazon EMR dropdown menu and choose EMR_DefaultRole. Then, select the IAM role for instance profile dropdown menu and choose EMR_EC2_DefaultRole. 

8. Choose Create cluster to launch the cluster and open the cluster details page. 

9. Find the cluster Status next to the cluster name. The status changes from Starting to Running to Waiting as Amazon EMR provisions the cluster. You may need to choose the refresh icon on the right or refresh your browser to see status updates. 

Your cluster status changes to Waiting when the cluster is up, running, and ready to accept work. For more information about reading the cluster summary, see View cluster status and details. For information about cluster status, see Understanding the cluster lifecycle. 
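
The console choices above map onto a single API request. The following sketch shows roughly how the same cluster could be described for the boto3 `run_job_flow` call. It is an illustration rather than the tutorial's method: the function names are hypothetical, the instance type and instance count defaults are assumptions (the console fills in its own defaults), and the release label is left for you to supply.

```python
# Characters the tutorial says a cluster name cannot contain.
FORBIDDEN_NAME_CHARACTERS = set("<>$|`")


def build_cluster_request(name, log_uri, key_pair, release_label,
                          instance_type="m5.xlarge", instance_count=3):
    """Assemble keyword arguments for the boto3 EMR client's run_job_flow.

    instance_type and instance_count are assumed defaults, not values
    taken from this tutorial.
    """
    if FORBIDDEN_NAME_CHARACTERS & set(name):
        raise ValueError("cluster name contains a forbidden character")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_pair,
            # Keep the cluster in the Waiting state after it starts.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def launch_cluster(emr_client, request):
    """Start the cluster and return its ID (a string beginning with j-)."""
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-west-2")
#   request = build_cluster_request(
#       "My first cluster", "s3://DOC-EXAMPLE-BUCKET/logs",
#       "my-key-pair", "<an Amazon EMR release label>")
#   cluster_id = launch_cluster(emr, request)
```

Separating the request dictionary from the call makes it easy to review exactly what will be launched before any charges accrue.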

 

Step 2: Manage your Amazon EMR cluster 

Submit work to Amazon EMR 

After you launch a cluster, you can submit work to the running cluster to process and analyze data. You submit work to an Amazon EMR cluster as a step. A step is a unit of work made up of one or more actions. For example, you might submit a step to compute values, or to transfer and process data. You can submit steps when you create a cluster, or to a running cluster. In this part of the tutorial, you submit health_violations.py as a step to your running cluster. To learn more about steps, see Submit work to a cluster. 

 

To submit a Spark application as a step with the console 

1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr. 

2. Under EMR on EC2 in the left navigation pane, choose Clusters, and then select the cluster where you want to submit work. The cluster state must be Waiting. 

3. Choose the Steps tab, and then choose Add step. 

4. Configure the step according to the following guidelines: 

  • For Type, choose Spark application. You should see additional fields for Deploy mode, Application location, and Spark-submit options. 
  • For Name, enter a new name. If you have many steps in a cluster, naming each step helps you keep track of them. 
  • For Deploy mode, leave the default value Cluster mode. For more information on Spark deployment modes, see Cluster mode overview in the Apache Spark documentation. 
  • For Application location, enter the location of your health_violations.py script in Amazon S3, such as s3://DOC-EXAMPLE-BUCKET/health_violations.py. 
  • In the Arguments field, enter the following arguments and values: 

--data_source s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv
--output_uri s3://DOC-EXAMPLE-BUCKET/myOutputFolder

Replace s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv with the S3 URI of the input data you prepared in Prepare an application with input data for Amazon EMR. 

Replace DOC-EXAMPLE-BUCKET with the name of the bucket that you created for this tutorial, and replace myOutputFolder with a name for your cluster output folder. 

  • For Action if step fails, accept the default option Continue. This way, if the step fails, the cluster continues to run. 

5. Choose Add to submit the step. The step should appear in the console with a status of Pending. 

6. Monitor the step status. It should change from Pending to Running to Completed. To refresh the status in the console, choose the refresh icon to the right of Filter. The script takes about one minute to run. When the status changes to Completed, the step has completed successfully. 

For more information about the step lifecycle, see Running steps to process data. 
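
The step configured in the console above can also be expressed as a boto3 request. The sketch below is one way to do that, under the assumption that the step runs `spark-submit` through `command-runner.jar` in cluster deploy mode. The function names are hypothetical and the EMR client is passed in, so the code does nothing until you call it with a real client.

```python
def build_spark_step(name, script_uri, data_source, output_uri):
    """Describe a Spark application step for add_job_flow_steps."""
    return {
        "Name": name,
        # Same as the console default: keep the cluster running on failure.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_uri,
                "--data_source", data_source,
                "--output_uri", output_uri,
            ],
        },
    }


def submit_and_wait(emr_client, cluster_id, step):
    """Submit one step, block until it completes, and return the step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    step_id = response["StepIds"][0]
    # The step_complete waiter polls the step status and raises if it fails.
    waiter = emr_client.get_waiter("step_complete")
    waiter.wait(ClusterId=cluster_id, StepId=step_id)
    return step_id


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-west-2")
#   step = build_spark_step(
#       "Red violations",
#       "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
#       "s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv",
#       "s3://DOC-EXAMPLE-BUCKET/myOutputFolder")
#   submit_and_wait(emr, "<your cluster ID>", step)
```

Note that the script arguments use two ASCII hyphens (`--data_source`, `--output_uri`); a dash character pasted from a formatted web page is not recognized by the script's argument parser.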

 

View results 

After a step runs successfully, you can view its output results in your Amazon S3 output folder. 

To view the results of health_violations.py 

1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/. 

2. Choose the Bucket name and then the output folder that you specified when you submitted the step. For example, DOC-EXAMPLE-BUCKET and then myOutputFolder. 

3. Verify that the following items appear in your output folder: 

  • A small-sized object called _SUCCESS. 
  • A CSV file starting with the prefix part- that contains your results. 

4. Choose the object with your results, then choose Download to save the results to your local file system. 

5. Open the results in your editor of choice. The output file lists the top ten food establishments with the most red violations. The output file also shows the total number of red violations for each establishment. 

The following is an example of health_violations.py results. 

name, total_red_violations
SUBWAY, 322
T-MOBILE PARK, 315
WHOLE FOODS MARKET, 299
PCC COMMUNITY MARKETS, 251
TACO TIME, 240
MCDONALD'S, 177
THAI GINGER, 153
SAFEWAY INC #1508, 143
TAQUERIA EL RINCONSITO, 134
HIMITSU TERIYAKI, 128

For more information about Amazon EMR cluster output, see Configure an output location. 
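
The check described above (a _SUCCESS marker plus a CSV object whose name starts with part-) can be scripted as well. The following is a sketch, not part of the original tutorial: the function names are hypothetical, and the object names in the usage comment are placeholders, because the exact part- file name is generated by Spark at run time.

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix`, using a boto3 S3 client."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def summarize_output(keys, output_folder):
    """Report whether the step wrote _SUCCESS and which result files exist."""
    prefix = output_folder.rstrip("/") + "/"
    names = [key[len(prefix):] for key in keys if key.startswith(prefix)]
    return {
        "succeeded": "_SUCCESS" in names,
        "result_files": sorted(
            name for name in names
            if name.startswith("part-") and name.endswith(".csv")),
    }


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   keys = list_keys(s3, "DOC-EXAMPLE-BUCKET", "myOutputFolder/")
#   print(summarize_output(keys, "myOutputFolder"))
```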

 

Step 3: Clean up your Amazon EMR resources 

Terminate your cluster 

Now that you’ve submitted work to your cluster and viewed the results of your PySpark application, you can terminate the cluster. Terminating a cluster stops all of the cluster’s associated Amazon EMR charges and Amazon EC2 instances. 

When you terminate a cluster, Amazon EMR retains metadata about the cluster for two months at no charge. Archived metadata helps you clone the cluster for a new job or revisit the cluster configuration for reference purposes. Metadata does not include data that the cluster writes to S3, or data stored in HDFS on the cluster. 

 

To terminate the cluster with the console 

1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr. 

2. Choose Clusters, and then choose the cluster you want to terminate. 

3. Under the Actions dropdown menu, choose Terminate cluster. 

4. Choose Terminate in the dialog box. Depending on the cluster configuration, termination may take 5 to 10 minutes. For more information on how to terminate Amazon EMR clusters, see Terminate a cluster. 
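
Termination can also be requested programmatically. The sketch below, which assumes a boto3 EMR client is passed in, sends the termination request and then waits until the cluster reports that it has terminated. The function name is hypothetical.

```python
def terminate_cluster(emr_client, cluster_id):
    """Request termination of one cluster and block until it has stopped."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # The cluster_terminated waiter polls the cluster state until it
    # reaches a terminated state, or raises if it gives up first.
    waiter = emr_client.get_waiter("cluster_terminated")
    waiter.wait(ClusterId=cluster_id)


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-west-2")
#   terminate_cluster(emr, "<your cluster ID>")
```

Waiting for confirmation is worthwhile here because charges continue until the cluster has actually shut down.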

 

Delete S3 resources 

To avoid additional charges, you should delete your Amazon S3 bucket. Deleting the bucket removes all of the Amazon S3 resources for this tutorial. Your bucket should contain: 

  • The PySpark script 
  • The input dataset 
  • Your output results folder 
  • Your log files folder 

You might need to take extra steps to delete stored files if you saved your PySpark script or output in a different location. 

To delete your bucket, follow the instructions in How do I delete an S3 bucket? in the Amazon Simple Storage Service User Guide. 
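
A bucket must be empty before it can be deleted, so a scripted cleanup has two parts: remove the objects, then remove the bucket. The following is a minimal sketch using the boto3 resource interface, which is an assumption on top of this tutorial. The function name is hypothetical, and the sketch assumes versioning is not enabled on the bucket.

```python
def empty_and_delete_bucket(bucket):
    """Delete every object in the bucket, then delete the bucket itself.

    `bucket` is expected to be a boto3 S3 Bucket resource. This is
    irreversible: the script, input data, output, and logs are all removed.
    """
    # Remove all current objects first; S3 refuses to delete a non-empty bucket.
    bucket.objects.all().delete()
    bucket.delete()


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   bucket = boto3.resource("s3").Bucket("DOC-EXAMPLE-BUCKET")
#   empty_and_delete_bucket(bucket)
```

If versioning is enabled on the bucket, older object versions also have to be removed (for example with `bucket.object_versions.all().delete()`) before the final delete succeeds.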

 

Credit to: AWS Documentation 
