Monday, September 14, 2015

Setting up Hadoop in Eclipse / Eclipse plugin for Hadoop

Step 2: Install the jar file. Simply copy the plugin jar into the Eclipse plugins folder (e.g. /eclipse/plugins), then restart Eclipse.
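On the command line the copy might look like the following; the plugin jar name and the Eclipse install path are assumptions, so substitute the jar you actually built or downloaded and your own Eclipse location:
# example paths only - adjust to your plugin jar and Eclipse install
$ cp hadoop-eclipse-plugin-2.7.1.jar /Applications/eclipse/plugins/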
Step 3: Switch to the MapReduce perspective. In the upper-right corner of the workbench, click the “Open Perspective” button, as shown in Figure 3.4:
[Figure 3.4: The Open Perspective button]

Select “Other,” followed by “Map/Reduce” in the window that opens up. At first, nothing may appear to change. In the menu, choose Window > Show View > Other. Under “MapReduce Tools,” select “Map/Reduce Locations.” This should make a new panel visible at the bottom of the screen, next to Problems and Tasks.
Step 4: Add the Server. In the Map/Reduce Locations panel, click on the elephant logo in the upper-right corner to add a new server to Eclipse.
[Figure: Adding a new Map/Reduce server location]

You will now be asked to fill in a number of parameters identifying the server. To connect to the VMware image, the values are:





When you are done, click “Finish.” Your server will now appear in the Map/Reduce Locations panel. If you look in the Project Explorer (upper-left corner of Eclipse), you will see that the MapReduce plugin has added the ability to browse HDFS. Click the [+] buttons to expand the directory tree to see any files already there. If you inserted files into HDFS yourself, they will be visible in this tree.
Figure 4: Files Visible in the HDFS Viewer
Now that your system is configured, the following sections will introduce you to the basic features and verify that they work correctly.

Setting Hadoop Classpath in Mac OS X

Hadoop Classpath Setting


Modify your ~/.bash_profile file, or whichever shell profile file you prefer:
  $ vim ~/.bash_profile
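What you add depends on where Hadoop lives on your machine. As a minimal sketch, assuming the distribution is unpacked at /usr/local/hadoop (an example path, not a requirement), the profile might gain lines like:
# example install location - point this at your own Hadoop directory
export HADOOP_HOME=/usr/local/hadoop
# put the hadoop/hdfs/yarn commands on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# let the hadoop script compute the full classpath for you
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
Run source ~/.bash_profile (or open a new terminal) for the changes to take effect.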

Friday, September 11, 2015

Setting up Hadoop 2.7.1 on Mac OS X Yosemite & Hadoop Eclipse Setup

The latest Hadoop can be set up on Mac OS X in several ways, for example with Homebrew. If you prefer not to use Homebrew, this guide walks through setting up pseudo-distributed mode on a single-node cluster.

1. Required software

1) Java

Run the following command in a terminal:
$ java -version
If Java is already installed, you will see output similar to:
$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
If not, the terminal will prompt you to install it, or you can download the Java JDK from Oracle's download page.

2) SSH

First, enable Remote Login in System Preferences -> Sharing.
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
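Then try ssh localhost once more; it should log you in without asking for a passphrase (type exit to return to your local shell). If it still asks for a password, check the permissions on the key files, for example:
# authorized_keys must not be writable by anyone but you
$ chmod 0600 ~/.ssh/authorized_keys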

2. Get a Hadoop distribution


You can download a stable release (this guide uses 2.7.1) from an Apache Download Mirror.
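For example, the 2.7.1 tarball used in this guide can be fetched from the Apache archive with curl; any mirror carries the same file:
$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz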

3. Prepare to start the Hadoop cluster


1) Unpack the downloaded Hadoop distribution.
2) Run the following command to find your Java home directory:
$ /usr/libexec/java_home
You can see a result like:
$ /usr/libexec/java_home
/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
3) In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows (a filled-in sketch appears after step 4):
# set to the root of your Java installation
export JAVA_HOME={your java home directory}
# set to the root of your Hadoop installation
export HADOOP_PREFIX={your hadoop distribution directory}
4) Try the following command:
$ cd {your hadoop distribution directory}
$ bin/hadoop
This will display the usage documentation for the hadoop script.
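As a concrete sketch of step 3, using the Java home printed in step 2 and assuming the distribution was unpacked to $HOME/hadoop-2.7.1 (adjust both paths to your machine), hadoop-env.sh would contain:
# set to the root of your Java installation (the value from /usr/libexec/java_home)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home
# set to the root of your Hadoop installation
export HADOOP_PREFIX=$HOME/hadoop-2.7.1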
Now you are ready to start your Hadoop cluster in one of the three supported modes:
  • Standalone mode
  • Pseudo-distributed mode
  • Fully-distributed mode
We will go through pseudo-distributed mode and run a MapReduce job on YARN here. In this mode, Hadoop runs on a single node and each Hadoop daemon runs in a separate Java process.

4. Configuration

Edit the following configuration files in your Hadoop directory:
1) etc/hadoop/core-site.xml:


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
2) etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
3) etc/hadoop/mapred-site.xml (see the note after this list if the file does not exist yet):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
4) etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
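One note before moving on: the 2.7.1 tarball ships mapred-site.xml only as a template, so if etc/hadoop/mapred-site.xml does not exist yet, copy the template first and then add the mapreduce.framework.name property shown in 3) above:
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml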

5. Execution


1) Format and start HDFS and YARN
$ cd {your hadoop distribution directory}
Format the filesystem:
$ bin/hdfs namenode -format
Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
Now you can browse the NameNode web interface at http://localhost:50070/
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/{username}  # make sure you use your actual username here
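If you prefer, let the shell fill in the username for you:
$ bin/hdfs dfs -mkdir /user/$(whoami)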
Start ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh
Browse the ResourceManager web interface at http://localhost:8088/
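If you want to double-check that everything started, the JDK's jps tool lists the running Java processes; in this pseudo-distributed setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (alongside Jps itself), with process IDs that will differ on your machine:
$ jps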
2) Test the example code that ships with the Hadoop distribution
Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put etc/hadoop input
Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
This example greps the input for strings matching the regular expression 'dfs[a-z.]+' and counts how often each one occurs.
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
Or view the output files directly on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
You should see a result like:
4 dfs.class
4 dfs.audit.logger
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
3) Stop YARN and HDFS
When you're done, stop the daemons with:
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh

You are all done.