Using Ansible for Hadoop Automation

Introduction


 Hello guys, here I am with my latest blog, which will clear up your thinking and give you a new perspective on Hadoop automation using Ansible. Since terms like Hadoop and Ansible may be new to some of you, let me explain those first, and then we'll get back to Hadoop automation.


What is Hadoop?



The Hadoop Distributed File System (HDFS) follows a distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.

HDFS holds very large amounts of data and provides easy access. To store such huge data, files are spread across multiple machines and stored in redundant fashion, so the system can recover from data loss when a node fails. HDFS also makes data available to applications for parallel processing.

The features of the Hadoop Distributed File System are given below:

      • It is suitable for distributed storage and processing.
      • Hadoop provides a command-line interface to interact with HDFS (see the example after this list).
      • The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
      • It provides streaming access to file system data.
      • HDFS provides file permissions and authentication.
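
As a quick illustration of that command-line interface, here are a few common HDFS shell commands (the file paths are just hypothetical examples):

hadoop fs -ls /                   # list the HDFS root directory
hadoop fs -put /root/file.txt /   # upload a local file into HDFS
hadoop fs -cat /file.txt          # print the contents of a file stored in HDFS
hadoop fs -rm /file.txt           # remove a file from HDFS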

 

Some Basic Terminology about Hadoop



  • Hadoop cluster :- A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computation on big data sets.
  • Name node :- The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
  • Data node :- A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with data replicated across them. On start-up, a DataNode connects to the NameNode, waiting until that service comes up. It then responds to requests from the NameNode for file system operations.
  • Client node :- Client nodes are in charge of loading data into the cluster. A client node first submits MapReduce jobs describing how the data needs to be processed, and then fetches the results once the processing is finished.

What is Ansible?

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems and Microsoft Windows. It includes its own declarative language to describe system configuration. Ansible was written by Michael DeHaan and acquired by Red Hat in 2015. Ansible is agentless, temporarily connecting remotely via SSH or Windows Remote Management (allowing remote PowerShell execution) to do its tasks.

How does Ansible Work?

Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs. Designed for multi-tier deployments since day one, Ansible models your IT infrastructure by describing how all of your systems inter-relate, rather than just managing one system at a time.

It uses no agents and no additional custom security infrastructure, so it's easy to deploy. Most importantly, it uses a very simple language (YAML, in the form of Ansible Playbooks) that allows you to describe your automation jobs in a way that approaches plain English. This post gives you a quick overview so you can see things in context; for more detail, hop over to docs.ansible.com.
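
Before any playbook runs, Ansible needs an inventory that maps group names to managed hosts. Below is a minimal sketch of the inventory this setup assumes; the IP addresses are hypothetical, but the group names namenode, datanode, and client are the ones used by the playbooks later in this post:

# /etc/ansible/hosts (example inventory; IP addresses are hypothetical)
[namenode]
192.168.1.10

[datanode]
192.168.1.11
192.168.1.12

[client]
192.168.1.13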

Ansible Playbook

While modules provide the means of accomplishing a task, the way you use them is through an Ansible playbook. A playbook is a configuration file written in YAML that provides instructions for what needs to be done in order to bring a managed node into the desired state. Playbooks are meant to be simple, human-readable, and self-documenting. They are also idempotent, meaning that a playbook can be run on a system at any time without having a negative effect upon it. If a playbook is run on a system that's already properly configured and in its desired state, then that system should still be properly configured after a playbook runs.
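
As a tiny illustration of that idea, here is a minimal sketch of a playbook (the host group comes from the inventory above; the path is hypothetical). Because the file module is idempotent, running this twice changes nothing on the second run:

- hosts: namenode
  tasks:
  - name: "Ensure a data directory exists"
    file:
      state: directory        # creates the directory only if it is missing
      path: "/example_dir"    # hypothetical path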

Ansible Modules

Modules (also referred to as “task plugins” or “library plugins”) are discrete units of code that can be used from the command line or in a playbook task. Ansible executes each module, usually on the remote managed node, and collects return values.
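
For example, the same module can be invoked ad hoc from the command line or as a playbook task. This sketch uses the real ping and copy modules; the file paths are just placeholders:

# Ad hoc, straight from the command line:
ansible namenode -m ping
ansible namenode -m copy -a "src=/root/example.txt dest=/root/"

# The same copy module as a playbook task:
- name: "Copy a file to the managed nodes"
  copy:
    src: "/root/example.txt"
    dest: "/root/"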

Ansible Variables

Ansible uses variables to manage differences between systems. With Ansible, you can execute tasks and playbooks on multiple different systems with a single command. You can define these variables in your playbooks, in your inventory, in re-usable files or roles, or at the command line.
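
As a quick sketch of how this works in the playbooks below, a variable defined in a vars file is referenced in tasks with the Jinja2 {{ }} syntax:

# In a vars file (e.g. namenode_var.yml):
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"

# Referenced later in a task:
- name: "Installing Hadoop"
  shell: "rpm -ivh {{ hadoop_software }}"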
To set up the cluster, we write one playbook per node type and run each against its target group. We need to configure the NameNode first and start it; its variables file and playbook are written below.

NameNode Variables


# namenode_var.yml
hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
core_site: "/root/namenode_files/core-site.xml"
hdfs_site: "/root/namenode_files/hdfs-site.xml"
directory_path: "/nn"
start_namenode: "hadoop-daemon.sh start namenode"
run_jps: "jps"
directory_delete: "rm -rf /nn"
stop_namenode: "hadoop-daemon.sh stop namenode"
hadoop_report: "hadoop dfsadmin -report"
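
The playbook below copies core-site.xml and hdfs-site.xml from /root/namenode_files/ onto the NameNode. Those files are not shown here, but for Hadoop 1.x they would look roughly like this minimal sketch (the IP address and port are assumptions; adjust them to your cluster):

<!-- core-site.xml (sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>  <!-- hypothetical NameNode address -->
  </property>
</configuration>

<!-- hdfs-site.xml (sketch) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>  <!-- matches directory_path above -->
  </property>
</configuration>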

NameNode Playbook


- hosts: namenode
  vars_files:
     - namenode_var.yml
  tasks:
  - name: "Copying the hadoop File"
    copy:
     src: "{{ hadoop_path }}"
     dest: "/root/"
  - name: "Copying the JDK File"
    copy:
     src: "{{ hadoop_software }}"
     dest: "/root/"
  - name: "Installing Jdk"
    shell: "rpm -ivh {{ jdk_software }}"
    register: Java
    ignore_errors: yes
  - name: "Java Installation"
    debug:
      var: Java.stdout
  - name: "Installing Hadoop"
    shell: "rpm -ivh {{ hadoop_software }}  --force"
    register: Hadoop
    ignore_errors: yes
  - name: "Hadoop Installation"
    debug:
      var: Hadoop.stdout
  - name: "Copying the core-site.xml file"
    copy:
      src: "{{ core_site }}"
      dest: "/etc/hadoop/"
  - name: "Copying the hdfs-site.xml file"
    copy:
      src: "{{ hdfs_site }}"
      dest: "/etc/hadoop/"
   - name: "Deleting the directory"
    shell: "{{ directory_delete }}"
    ignore_errors: yes
  - name: "Creating a directory"
    file:
      state: directory
      path: "{{ directory_path }}"
  - name: "Formatting the directory"
    shell: "echo Y |  hadoop namenode -format"
    register: format
  - name: "Formatting NameNode"
    debug:
      var: format.stdout
  - name: "Stopping the namenode"
    shell: "{{ stop_namenode }}"
    ignore_errors: yes
    register: hadoop_stopped
  - name: "Stopping hadoop"
    debug:
     var: hadoop_stopped.stdout
  - name: "Starting the namenode"
    shell: "{{ start_namenode }}"
    ignore_errors: yes
    register: hadoop_started
  - name: "Started hadoop"
    debug:
     var: hadoop_started.stdout
  - name: "Java Process"
    shell: "{{ run_jps }}"
    register: jps
  - name: "Java Process"
    debug:
     var: jps.stdout

The command to run the NameNode playbook is:

    ansible-playbook namenode.yml

Running NameNode Playbook




DataNode Variables

# datanode_var.yml
hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
core_site: "/root/datanode_files/core-site.xml"
hdfs_site: "/root/datanode_files/hdfs-site.xml"
directory_path: "/dn1"
start_datanode: "hadoop-daemon.sh start datanode"
run_jps: "jps"
directory_delete: "rm -rf /dn1"
stop_datanode: "hadoop-daemon.sh stop datanode"
hadoop_report: "hadoop dfsadmin -report"
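
Again, the core-site.xml and hdfs-site.xml copied from /root/datanode_files/ are not shown here. For Hadoop 1.x they would roughly contain the following (the NameNode address is an assumption):

<!-- core-site.xml (sketch): points the DataNode at the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>  <!-- hypothetical NameNode address -->
  </property>
</configuration>

<!-- hdfs-site.xml (sketch) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn1</value>  <!-- matches directory_path above -->
  </property>
</configuration>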

DataNode Playbook


- hosts: datanode
  vars_files:
     - datanode_var.yml
  tasks:
  - name: "Copying the hadoop File"
    copy:
     src: "{{ hadoop_path }}"
     dest: "/root/"
  - name: "Copying the JDK File"
    copy:
     src: "{{ hadoop_software }}"
     dest: "/root/"
  - name: "Installing Jdk"
    shell: "rpm -ivh {{ jdk_software }}"
    register: Java
    ignore_errors: yes
  - name: "Java Installation"
    debug:
      var: Java.stdout
  - name: "Installing Hadoop"
    shell: "rpm -ivh {{ hadoop_software }}  --force"
    register: Hadoop
    ignore_errors: yes
  - name: "Hadop Installation"
    debug:
      var: Hadoop.stdout
  - name: "Copying the core-site.xml file"
    copy:
      src: "{{ core_site }}"
      dest: "/etc/hadoop/"
  - name: "Copying the hdfs-site.xml file"
    copy:
      src: "{{ hdfs_site }}"
      dest: "/etc/hadoop/"
  - name: "Deleting the directory"
    shell: "{{ directory_delete }}"
    ignore_errors: yes
  - name: "Creating a directory"
    file:
      state: directory
      path: "{{ directory_path }}"
  - name: "Formatting the directory"
    shell: "echo Y |  hadoop namenode -format"
    ignore_errors: yes
    register: format
  - name: "Formating NameNode"
    debug:
      var: format.stdout
  - name: "Stoping the namenode"
    shell: "{{ stop_datanode }}"
    ignore_errors: yes
    register: hadoop_stopped
  - name: "Stopping hadoop"
    debug:
     var: hadoop_stopped.stdout
  - name: "Starting the datanode"
    shell: "{{ start_datanode }}"
    ignore_errors: yes
    register: hadoop_started
  - name: "Started hadoop"
    debug:
     var: hadoop_started.stdout
  - name: "Java Process"
    shell: "{{ run_jps }}"
    register: jps
  - name: "Java Process"
    debug:
     var: jps.stdout
  - name: "Running Hadoop Report"
    shell: "{{ hadoop_report }}"
    register: hadoop_report
  - name: "Showing Hadoop Report"
    debug:
     var: hadoop_report.stdout

The command to run the DataNode playbook is:

    ansible-playbook datanode.yml

Running DataNode Playbook 



Client Node Variables


# client_var.yml
hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
core_site: "/root/client_files/core-site.xml"
hadoop_report: "hadoop dfsadmin -report"
client_report: "hadoop fs -ls /"
file_name: "file.txt"
put_file: "hadoop fs -put /root/{{ file_name }} /"
client_file_src: "/root/client_files/{{ file_name }}"
remove_file: "hadoop fs -rm /{{ file_name }}"
client_file_dest: "/root/"

Client Node Playbook 


- hosts: client
  vars_files:
     - client_var.yml
  tasks:
  - name: "Copying the hadoop File"
    copy:
     src: "{{ hadoop_software }}"
     dest: "/root/"
  - name: "Copying the JDK File"
    copy:
     src: "{{ jdk_path }}"
     dest: "/root/"
  - name: "Installing Jdk"
    shell: "rpm -ivh {{ jdk_software }}"
    ignore_errors: yes
    register: Java
    ignore_errors: yes
  - name: "Java Installation"
    debug:
      var: Java.stdout
  - name: "Installing Hadoop"
    shell: "rpm -ivh {{ hadoop_software }}  --force"
    register: Hadoop
    ignore_errors: yes
  - name: "Hadoop Installation"
    debug:
      var: Hadoop.stdout
  - name: "Copying the core-site.xml file"
    copy:
      src: "{{ core_site }}"
      dest: "/etc/hadoop/"
  - name: "Files Available"
    shell: "{{ client_report }}"
    register: files
  - name: "Showing Files"
    debug:
     var: files.stdout
  - name: "Deleting Previous Files"
    shell: "{{ remove_file }}"
    ignore_errors: yes
  - name: "Copying the files to client node"
    copy:
      src: "{{ client_file_src }}"
      dest: "{{ client_file_dest }}"
  - name: "Uploading the Files by client"
    shell: "{{ put_file }}"
  - name: "Files Available"
    shell: "{{ client_report }}"
    register: files
  - name: "Showing Files"
    debug:
     var: files.stdout
  - name: "Running Hadoop Report"
    shell: "{{ hadoop_report }}"
    register: hadoop_report
  - name: "Showing Hadoop Report"
    debug:
     var: hadoop_report.stdout


The command to run the client node playbook is:

    ansible-playbook clientnode.yml

Running Client Node Playbook:



Once the client node playbook has run successfully, we can open the NameNode dashboard (for Hadoop 1.x, the web UI at http://<namenode-ip>:50070) to verify that the file uploaded by the client was stored successfully.


That's all for now, guys. Keep learning and keep sharing!
