Hadoop Sandbox on Google Cloud

Ibrezm
Jun 21, 2020

Well, Google Cloud already has Cloud Dataproc, which provides native support for PySpark, Hive, and Pig, all accessible from the master node. That is more than enough for our needs most of the time. So you would be asking: why on earth do I want to do this?

Well, you are right, but at times you want to study all the features in the Cloudera sandbox (Ambari and Zookeeper, to name a few). I could do that locally with VirtualBox, but I do not have a spare 16 GB machine. Also, it is mostly a one-time effort, so why go through so much hassle? Shouldn't there be an easier way out?

I could think of GCP quickly coming to the rescue. I wanted something fast to prototype and test, and GCP seemed to fulfill all the needs.

Let's give it a whirl.

Here is what I wanted to install and check quickly: the Docker version of the HDP 2.6.5 sandbox, along with Hive, Ambari, Zookeeper, and the other features.

So I knew I would need at least 16 GB of RAM and a 40+ GB hard disk. Into the console we go.

The estimate was OK-ish, but I wanted to reduce it further, so I marked it as a preemptible VM to decrease the cost (remember, if you are planning to use it for longer, make a judicious decision). Also take care that you will be downloading a lot of data (~15 GB over the internet), which will add to the cost.

Well, if we mark this preemptible, the cost drops to just $0.043/hr.
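If you prefer the gcloud CLI to the console, a roughly equivalent create command looks like the sketch below. The instance name hdp-sandbox, the e2-standard-4 machine type, and the 50 GB boot disk are my assumptions to match the 16 GB RAM / 40+ GB disk sizing above, so adjust to taste.

# Create a preemptible Debian VM sized for the HDP sandbox
# (hdp-sandbox, machine type, and disk size are placeholders/assumptions)
gcloud compute instances create hdp-sandbox \
    --project=<project> --zone=<zone> \
    --machine-type=e2-standard-4 \
    --boot-disk-size=50GB \
    --image-family=debian-10 --image-project=debian-cloud \
    --preemptible

Once the VM is up, SSH in: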

gcloud compute ssh --project=<project> --zone=<zone>  <instancename>

Once connected, you can quickly install Docker with the commands below:

# Remove any old Docker packages
sudo apt-get remove docker docker-engine docker.io containerd runc
# Prerequisites for Docker's apt repository
sudo apt-get -y install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
# Add Docker's official GPG key and verify its fingerprint
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
# Add the stable Docker repository and install Docker
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get -y install docker-ce docker-ce-cli containerd.io
# Verify the install, and grab unzip/wget for later
sudo docker run hello-world
sudo apt -y install unzip
sudo apt -y install wget

Better yet, put these commands in the VM's startup script to avoid entering them manually. I ended up adding the items below to the startup script:

#! /bin/bash
# Runs as root at boot, so sudo is not needed;
# -y keeps apt from prompting in a non-interactive script
apt remove -y docker docker-engine docker.io containerd runc
apt install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common unzip wget git
curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -
apt-key fingerprint 0EBFCD88
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
apt update
apt install -y docker-ce docker-ce-cli containerd.io
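To attach this to an existing instance (the startup script runs on every boot, which suits a preemptible VM that keeps getting restarted), something like the following should work, assuming you saved the script locally as startup.sh:

# Attach the startup script as instance metadata
# (startup.sh is an assumed local filename)
gcloud compute instances add-metadata <instancename> --zone=<zone> \
    --metadata-from-file startup-script=startup.sh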

Once that is done, a little more code fetches the Docker deploy scripts:

wget https://archive.cloudera.com/hwx-sandbox/hdp/hdp-2.6.5/HDP_2.6.5_deploy-scripts_180624d542a25.zip
sudo unzip HDP_2.6.5_deploy-scripts_180624d542a25.zip
sudo bash docker-deploy-hdp265.sh

Remember to run the script with sudo bash and not sh (the Cloudera tutorials mention this), as sh causes an operator error.

Sit back and relax, as it will take some time (20–30 mins). Once done, validate that the containers are running with:

sudo docker ps
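You should see two containers up (with the version of the deploy script I used they are named sandbox-hdp and sandbox-proxy, but treat those names as an assumption). For a quick glance at just names and status:

# Show only container names and their status
sudo docker ps --format '{{.Names}}\t{{.Status}}'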

We will also have to open the firewall, and a good way is to create a firewall rule that opens the ports, as below. Note that --rules=all opens every port, so keep --source-ranges restricted to your own public IP.

gcloud compute --project=<project> firewall-rules create test123 --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=all --source-ranges=<your public IP>
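If you are not sure what your public IP is, an echo service such as ifconfig.me will tell you:

# Print the public IP that the firewall rule should allow
curl -s https://ifconfig.me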

All that is needed next is to log in to the UI:

http://<GCP VM IP>:1080/
Username : maria_dev
Password : maria_dev
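If you would rather poke around from a shell than the web UI, the sandbox also exposes SSH on a mapped port (2222 with root/hadoop as the initial password is what the HDP sandbox tutorials describe; you are prompted to change it on first login):

# SSH straight into the sandbox container
ssh root@<GCP VM IP> -p 2222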

That's all, folks. You will need some patience, I tell you that. Until next time.
