Automate HDFS and MR cluster on AWS using Ansible

Urvishtalaviya
4 min read · Apr 23, 2021

Hey There!!!

Here I am with another article on Ansible and Hadoop. This time I will show how to automate a Hadoop HDFS and MR cluster on AWS using Red Hat Ansible. Before we start, you should have an AWS account and Ansible installed on your computer, because without them you won't be able to follow along 😜

HDFS Cluster

Hadoop is an Apache product for solving Big Data problems. The HDFS cluster helps us create a Data Lake (where we store raw data at PB or ZB scale) across multiple computers. An HDFS cluster has a Name Node (the master node) and multiple Data Nodes (which store the data). With the help of a client node, we can store data on the HDFS cluster.
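Under the hood, every Data Node just needs to know where the Name Node lives. As a rough sketch (not the exact tasks from my repo; the file path and port are assumptions), an Ansible task that wires a Data Node to the Name Node could look like this:

- name: Point core-site.xml at the Name Node (sketch)
  blockinfile:
    path: /etc/hadoop/core-site.xml
    marker: "<!-- {mark} ANSIBLE MANAGED -->"
    insertafter: "<configuration>"
    block: |
      <property>
        <name>fs.default.name</name>
        <value>hdfs://{{ namenode_ip }}:9001</value>
      </property>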

MR Cluster

MR stands for MapReduce. An MR cluster follows the same master/worker structure as HDFS: it has a Job Tracker (the master node) and multiple Task Trackers (worker nodes). In this structure, the Task Trackers provide RAM and CPU to the Job Tracker for performing different operations on the Big Data.

The MR cluster does not have any storage of its own; it reads and writes data through the HDFS cluster.
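Similarly, each Task Tracker only has to know the Job Tracker's address. A hedged sketch (again, not necessarily how the repo's playbooks do it; the port is an assumption):

- name: Point mapred-site.xml at the Job Tracker (sketch)
  blockinfile:
    path: /etc/hadoop/mapred-site.xml
    marker: "<!-- {mark} ANSIBLE MANAGED -->"
    insertafter: "<configuration>"
    block: |
      <property>
        <name>mapred.job.tracker</name>
        <value>{{ jobtracker_ip }}:9002</value>
      </property>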

We will use a dynamic inventory to get the IP addresses of the newly provisioned instances with Ansible. Follow these steps to set up the clusters on AWS.

For this practical, I have created an image with the JDK and Hadoop pre-installed. The image ID is ami-06932ae9961fdb800; feel free to use it.

First, clone the repository I created for the playbooks:

https://github.com/urvish667/hadoop-aws-ansible

Step 1: Provision AWS EC2 instance

For provisioning, Ansible provides the ec2 module. It requires an AWS IAM user's access key and secret key, which we can get from the IAM service if we have an AWS account. After getting the credentials, we can either store them as environment variables or encrypt them with Ansible Vault. In this example, I have used Ansible Vault for extra security. Here are the steps to create it:

Ansible vault

ansible-vault create secret.yml
New Vault password:
Confirm New Vault password:

After running the above command, it will ask you to create a password and then open a vi editor to edit the file. Just enter the access key and secret key in YAML format, like this:

access_key: <access_key>
secret_key: <secret_key>

Now our Ansible Vault is ready to be used.
Reference the vault's variables in the playbook like this:

- hosts: localhost
  vars_files:
    - secret.yml
  tasks:
    - ec2:
        aws_access_key: "{{ access_key }}"
        aws_secret_key: "{{ secret_key }}"

I have already created a YAML file called ec2_provisioning.yml that launches the instances. To increase or decrease the number of instances, edit count: inside the playbook, then run it with ansible-playbook --vault-id @prompt ec2_provisioning.yml and enter the password of your vault.
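For context, the provisioning task inside such a playbook roughly follows this shape (a sketch only; the key pair, security group, region, and tag are placeholders you would swap for your own):

- name: Launch EC2 instances for the cluster
  ec2:
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    image: ami-06932ae9961fdb800
    instance_type: t2.micro
    count: 3
    region: ap-south-1            # placeholder region
    key_name: my-key              # placeholder key pair
    group: hadoop-sg              # placeholder security group
    wait: yes
    instance_tags:
      Name: datanode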

Step 2: Dynamic Inventory

I have already uploaded the files for creating the dynamic inventory inside the folder called hosts, and we point Ansible at it with inventory=hosts in the ansible.cfg file. In the hosts folder, we have the ec2.py and ec2.ini files, which help us find the IPs of the instances on the AWS cloud.
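For reference, the relevant part of ansible.cfg might look like this (a sketch; everything beyond the inventory line is an assumption, not a requirement):

[defaults]
inventory = hosts
host_key_checking = False
remote_user = ec2-user
private_key_file = ~/.ssh/my-key.pem   # placeholder key path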

export AWS_ACCESS_KEY_ID=<'access-key'>
export AWS_SECRET_ACCESS_KEY=<'secret-key'>
export ANSIBLE_HOSTS=<path/of/ec2.py>
export EC2_INI_PATH=<path/of/ec2.ini>

The above commands export the environment variables that the ec2.py file needs in order to run.

./ec2.py --list

The above command will list all the instances, grouped by their tags.
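Those tag-based groups are what the playbooks can target. For example (a hedged sketch; tag_Name_namenode assumes the instance was tagged Name=namenode):

- hosts: tag_Name_namenode
  tasks:
    - name: Check connectivity to the Name Node instance
      ping: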

Step 3: Launch the instances on AWS with Ansible

In the repo, I have created a file called ec2-instance-provision.yml. Run this file and it will launch the instances; change the numbers inside the file to match your requirements.

Step 4: Setting up the HDFS cluster

First, we'll configure the Name Node. Just run the hadoop-name-node.yml file; at the end, it will print the private IP of the Name Node instance. Copy that IP and save it. Next, run the hadoop-data-node.yml file. It will ask for the IP address of the Name Node; enter the IP address and you are good to go.
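The prompt-for-IP pattern can be done with vars_prompt. A minimal sketch (not the repo's exact playbook; the host group is the hypothetical tag-based group from above, and hadoop-daemon.sh is assumed to be on the PATH):

- hosts: tag_Name_datanode
  vars_prompt:
    - name: namenode_ip
      prompt: "Enter the Name Node's private IP"
      private: no
  tasks:
    # ...template core-site.xml with {{ namenode_ip }}, as sketched earlier...
    - name: Start the Data Node daemon
      command: hadoop-daemon.sh start datanode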

Step 5: Setting up the MR cluster

First, we'll configure the Job Tracker. Just run the hadoop-job-tracker.yml file; at the end, it will print the private IP of the Job Tracker instance. Copy that IP and save it. Next, run the hadoop-task-tracker.yml file, and you are good to go.
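Printing the private IP at the end of a playbook is just a debug task over a gathered fact; roughly (a sketch, not necessarily the repo's exact wording, and it relies on fact gathering being enabled):

- name: Show the Job Tracker's private IP
  debug:
    msg: "Job Tracker private IP: {{ ansible_default_ipv4.address }}"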

Now both clusters have been set up. Next, we'll configure the client node to upload data to the HDFS cluster and run analyses on the MR cluster.

Step 6: Setting up the client

Run the file called hadoop-client.yml and enter the IPs of both the Job Tracker and the Name Node when prompted.

That's it! You can now use both clusters from the client node.
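As a quick usage sketch from the client node (the sample file name and the examples jar path are assumptions — adjust them for your setup):

- hosts: localhost        # i.e. run these on the client node itself
  tasks:
    - name: Upload a file into the HDFS cluster
      command: hadoop fs -put /root/sample.txt /sample.txt
    - name: Run the stock wordcount job on the MR cluster (jar path is an assumption)
      shell: hadoop jar /usr/share/hadoop/hadoop-examples-*.jar wordcount /sample.txt /wcout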
