How to install Hadoop|Spark on macOS High Sierra

2018-01-19

Note: I recommend using the Hortonworks Sandbox instead.

Install Hadoop with brew

First, check whether Homebrew has been installed; you can try:

brew -v

I will skip how to install Homebrew here.

Then search for and install Hadoop with brew:

brew search hadoop
brew install hadoop

The installation location of Hadoop is /usr/local/Cellar/hadoop.
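
To double-check that the installation worked, you can print the version (a quick sanity check; your version number may differ from the 2.8.2 used in this post):

hadoop version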

Configuration

Go to /usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop. Under this folder, you will need to modify four files.

1. hadoop-env.sh

Change

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export JAVA_HOME="$(/usr/libexec/java_home)"

to

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home"
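
The JDK path above is the one on my machine; yours will almost certainly differ. You can look up the correct path for your installed JDK with:

/usr/libexec/java_home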

2. core-site.xml

Next, configure the HDFS address and port number: open core-site.xml and put the following content inside the <configuration></configuration> tags.

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

3. mapred-site.xml

Configure the JobTracker address and port number for MapReduce. First run sudo cp mapred-site.xml.template mapred-site.xml to create mapred-site.xml from its template, then open mapred-site.xml and add:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

4. hdfs-site.xml

Set the HDFS replication factor. The default value is 3, but for a single-node setup we should change it to 1. Open hdfs-site.xml and add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Enable SSH

Check for the files ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub to verify whether SSH to localhost is enabled.

If these files do not exist, run the following command to generate them (press Enter at the prompts to accept the defaults and an empty passphrase, which password-less login needs):

ssh-keygen -t rsa

Enable Remote Login in System Preferences -> Sharing: just tick “Remote Login”.

Then authorize the generated SSH keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test ssh at localhost:

ssh localhost

If it succeeds, you will see something like:

HanslendeMacBook-Pro:~ Hanslen$ ssh localhost
Last login: Fri Jan 19 17:59:36 2018

If it fails, you will see:

HanslendeMacBook-Pro:~ Hanslen$ ssh localhost
ssh: connect to host localhost port 22: Connection refused

Format HDFS

Format the distributed file system with the command below before starting the Hadoop daemons, so that we can put our data sources into HDFS when performing MapReduce jobs.

hdfs namenode -format

Alias to start and stop Hadoop Daemons

Right now, we need to go to /usr/local/Cellar/hadoop/2.8.2/sbin/ every time we want to start or stop the Hadoop services. That is quite inconvenient, so to make it easier we can create aliases:

Edit ~/.bash_profile and add

alias hstart="/usr/local/Cellar/hadoop/2.8.2/sbin/start-all.sh"
alias hstop="/usr/local/Cellar/hadoop/2.8.2/sbin/stop-all.sh"
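
Note that start-all.sh and stop-all.sh are deprecated in Hadoop 2.x in favour of per-service scripts. If you prefer to avoid the deprecation warning, an equivalent pair of aliases (same paths, just different scripts) is:

alias hstart="/usr/local/Cellar/hadoop/2.8.2/sbin/start-dfs.sh && /usr/local/Cellar/hadoop/2.8.2/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.8.2/sbin/stop-yarn.sh && /usr/local/Cellar/hadoop/2.8.2/sbin/stop-dfs.sh"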

Then run

source ~/.bash_profile

Run Hadoop

Start hadoop with

hstart
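
Once the daemons have started, you can check that they are all running with jps (it ships with the JDK; the process IDs will differ on your machine):

jps

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager in the list.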

In the browser, go to http://localhost:50070 and you will see the following page:

localhost:50070

You can check the cluster and node status as well: go to http://localhost:8088 for the resource manager and http://localhost:8042 for specific node information, and you will see:

localhost:8088

localhost:8042
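
With everything running, you can also try HDFS itself, for example by creating a home directory and copying a file into it (the file name here is just an example; use any local file you have):

hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -put some-local-file.txt /user/$(whoami)/
hdfs dfs -ls /user/$(whoami)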

To stop it, just run

hstop

Thanks to these two articles, which helped me figure all of this out:

https://isaacchanghau.github.io/2017/06/27/Hadoop-Installation-on-Mac-OS-X/

https://www.slideshare.net/SunilkumarMohanty3/install-apache-hadoop-on-mac-os-sierra-76275019

Install Spark

It is much more convenient to install Spark on a Mac.

Firstly, download the latest Spark from http://spark.apache.org/downloads.html

Then you are done! :-) Just kidding. Extract the archive and move it to a directory you like; I put it in my home directory, as shown below.
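
For example, assuming the downloaded archive is spark-2.3.0-bin-hadoop2.7.tgz sitting in ~/Downloads (adjust the file name to whichever version you actually fetched):

cd ~/Downloads
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
mv spark-2.3.0-bin-hadoop2.7 ~/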

Configure Jupyter Notebook

Open your .bash_profile and add the following lines:

# Setting the path for Spark
export SPARK_PATH=~/spark-2.3.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'

Then run source ~/.bash_profile. You are done! I am not kidding. :P

You can now start to program with Spark in Jupyter Notebook. :-)
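
If you want to smoke-test the Spark installation from the terminal first, Spark ships with example jobs you can run directly; this one computes an approximation of pi:

$SPARK_PATH/bin/run-example SparkPi 10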

Useful link: http://spark.apache.org/docs/latest/index.html

