Lab: Hadoop MapReduce Implementation in AWS using Amazon EMR
Objective:
In this lab, students will:
Set up a Hadoop cluster using Amazon EMR.
Write and deploy a simple MapReduce program in Java or Python.
Run the MapReduce job on the EMR cluster and analyze the output.
Step 1: Setting Up an Amazon EMR Cluster
Log in to the AWS Console:
Go to the AWS Management Console at https://console.aws.amazon.com/.
Log in with your credentials.
Navigate to Amazon EMR:
In the search bar, type EMR and select Amazon EMR from the services list.
Create a Cluster:
Click on the Create cluster button.
Select Go to advanced options.
Configure Software and Steps:
Under Software Configuration, select Release: emr-6.x.x (latest).
For Applications, ensure that Hadoop is selected. Other applications like Hive and Spark can be unchecked for simplicity.
Click Next.
Configure Hardware:
Set the Instance Type for both Master and Core nodes to m5.xlarge (or another instance type depending on your resource needs and budget).
Keep the Number of Core Instances as 1.
Click Next.
General Cluster Settings:
Give the cluster a meaningful name, like `Hadoop-MapReduce-Lab`.
Leave the default settings for networking and permissions.
Click Next.
Security Settings:
Create a new EC2 key pair if you don’t have one, or select an existing key pair.
Click Create cluster.
Wait for the Cluster to Launch:
The cluster will take a few minutes to launch. You can monitor the status in the Cluster List.
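As an alternative to the console walkthrough above, a comparable cluster can be created from the AWS CLI. This is a sketch: the release label `emr-6.15.0` and the key-pair name are example values, so substitute your own.

```shell
aws emr create-cluster \
    --name "Hadoop-MapReduce-Lab" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 2 \
    --ec2-attributes KeyName=<your-key-pair> \
    --use-default-roles
```

The command prints the new cluster ID, which you will need later to terminate the cluster from the CLI.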
Step 2: Writing the MapReduce Program
You can write a basic word count MapReduce program. Here’s an example in Python:
Mapper (mapper.py):
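A minimal streaming mapper might look like the following sketch. The per-line logic is factored into a `map_line` helper (an illustration choice, not required by Hadoop) so it can be tested without a cluster.

```python
#!/usr/bin/env python3
"""mapper.py: Hadoop streaming mapper for word count."""
import sys


def map_line(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    return [(word, 1) for word in line.strip().split()]


if __name__ == "__main__":
    # Hadoop streaming feeds each input split to the mapper on stdin.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```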
Reducer (reducer.py):
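A matching reducer sketch is below. Hadoop streaming delivers the mapper output to the reducer sorted by key, so summing consecutive counts per word is sufficient; the logic lives in a `reduce_counts` helper (again an illustration choice) for local testing.

```python
#!/usr/bin/env python3
"""reducer.py: Hadoop streaming reducer for word count."""
import sys


def reduce_counts(lines):
    # Input lines ("word\tcount") arrive sorted by word, so we can
    # total consecutive counts and flush whenever the word changes.
    totals = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals


if __name__ == "__main__":
    for word, total in reduce_counts(sys.stdin):
        print(f"{word}\t{total}")
```

You can simulate the whole pipeline locally before deploying, using `sort` in place of Hadoop's shuffle phase: `cat sample.txt | python3 mapper.py | sort | python3 reducer.py`.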
Program Structure:
The Mapper splits input text into words and outputs each word with a count of 1.
The Reducer sums the counts for each word and outputs the total.
Step 3: Deploying the Program on Amazon EMR
Upload the Python Scripts to S3:
Navigate to the S3 service in the AWS Console.
Create a bucket (e.g., `hadoop-mapreduce-lab-bucket`) and upload the `mapper.py` and `reducer.py` files.
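If you prefer the command line, the same bucket creation and upload can be done with the AWS CLI (using the example bucket name from this lab):

```shell
aws s3 mb s3://hadoop-mapreduce-lab-bucket
aws s3 cp mapper.py s3://hadoop-mapreduce-lab-bucket/
aws s3 cp reducer.py s3://hadoop-mapreduce-lab-bucket/
```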
Submit the Job to the EMR Cluster:
Go back to your EMR cluster page.
Click on the Steps tab.
Select Add Step.
Choose Custom JAR as the step type.
For the JAR location, enter `command-runner.jar` (EMR's built-in utility for running commands on the cluster, including the Hadoop streaming tool).
For Arguments, enter:
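A sketch of the arguments for a streaming word-count step, assuming the bucket layout used in this lab (scripts at the bucket root, input under `input/`, results written to `output/`):

```shell
hadoop-streaming \
  -files s3://<your-bucket-name>/mapper.py,s3://<your-bucket-name>/reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input s3://<your-bucket-name>/input \
  -output s3://<your-bucket-name>/output
```

Note that the job will fail if the `output/` folder already exists; Hadoop refuses to overwrite an existing output directory.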
Replace `<your-bucket-name>` with your actual S3 bucket name.
Add Input Data:
Upload a sample text file to the `input/` folder in your S3 bucket (e.g., a text file containing a few paragraphs).
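For example, with the AWS CLI (assuming a local file named `sample.txt`):

```shell
aws s3 cp sample.txt s3://hadoop-mapreduce-lab-bucket/input/
```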
Run the Job:
Once the step is added, the job starts running; you can monitor its progress and status in the Steps tab.
Step 4: Analyzing the Output
View the Output:
Once the job is complete, navigate to the `output/` folder in your S3 bucket.
Download the output files (`part-00000`, etc.) and view them to see the word count results.
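The output files can also be fetched and inspected from the command line (using the example bucket name from this lab):

```shell
aws s3 cp s3://hadoop-mapreduce-lab-bucket/output/ ./output/ --recursive
cat output/part-*
```

Each line of the output is a word and its total count, separated by a tab.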
Clean Up Resources:
Terminate the EMR cluster to avoid additional charges.
Delete the S3 bucket.
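Both cleanup steps can be done from the AWS CLI as well (substitute your actual cluster ID):

```shell
aws emr terminate-clusters --cluster-ids <your-cluster-id>
aws s3 rb s3://hadoop-mapreduce-lab-bucket --force   # removes the bucket and its contents
```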