AHA

The codebase of the AHA project on adaptive Hadoop for ad-hoc environments.

Project Setup

A customized Apache Hadoop 3.3.0 cluster can be set up and deployed automatically using the bash scripts found in the setup directory by following the steps below (tested on GCP and AWS).

  1. Create a project on either GCP or AWS and create some VM instances (resources per node, firewall rules, etc.)
    • e.g. 1 namenode and 3 datanodes, each with 2 vCPUs and 4 GB of memory
    • Make sure to set up passwordless SSH for all instances!
    • Create the file ./setup/ssh_credentials with SSH credentials adhering to the format <path to ssh key> <ssh username> (see the example sketch after this list)
  2. Create ./setup/config/namenodes with the corresponding IP addresses of the namenode, adhering to the format below (space delimited). This file needs to be updated whenever the VMs are rebooted!
    • For GCP: <External IP> <Internal IP> <vCPU for instance> <Memory for instance(in GB)>
    • For AWS: <Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance(in GB)>
  3. Create ./setup/config/datanodes with the corresponding IP addresses of each of the datanodes, adhering to the format below (space delimited). This file needs to be updated whenever the VMs are rebooted!
    • For GCP: <External IP> <Internal IP> <vCPU for instance> <Memory for instance(in GB)>
    • For AWS: <Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance(in GB)>
  4. cd ./setup
  5. Set up the cluster using
    • ./install_jfuzzylogic.sh (First time only!)
    • ./build_hadoop.sh
    • ./setup_cluster.sh
  6. Start the cluster using ./start_cluster.sh.
    • Stop the cluster using ./stop_cluster.sh
  7. (Optional) Test the cluster out with a simple word count job
    • ./run_wc.sh
  8. (Optional) Enable the GCP stackdriver-agent to set up monitoring via GCP
    • ./enable_gcp_monitoring_agent.sh
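
For illustration, a minimal GCP setup with one namenode and three datanodes could be prepared roughly as sketched below. The key path, username, and IP addresses are placeholders, not values the scripts ship with; substitute your own instances.

    # Hypothetical SSH credentials: <path to ssh key> <ssh username>
    echo "$HOME/.ssh/gcp_key hadoopuser" > ./setup/ssh_credentials

    # Hypothetical namenode entry (GCP format): <External IP> <Internal IP> <vCPU> <Memory in GB>
    echo "34.68.10.1 10.128.0.2 2 4" > ./setup/config/namenodes

    # Hypothetical datanode entries, one line per datanode (same format)
    printf '%s\n' \
      "34.68.10.2 10.128.0.3 2 4" \
      "34.68.10.3 10.128.0.4 2 4" \
      "34.68.10.4 10.128.0.5 2 4" > ./setup/config/datanodes

    # Build, deploy, and start the cluster, then run the optional word count smoke test
    cd ./setup
    ./install_jfuzzylogic.sh   # first time only
    ./build_hadoop.sh
    ./setup_cluster.sh
    ./start_cluster.sh
    ./run_wc.sh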

Tunable Configurations

The following is a list of tunable configurations in this project. A sample configuration sketch follows the list.

  • yarn.resourcemanager.supernode-threshold
    • Configured in: yarn-site.xml
    • Description: Specifies the supernode threshold for the entire cluster (default: 60.0). This value is passed to each AM inside allocate RPC calls.
  • yarn.resourcemanager.supernode-threshold-min
    • Configured in: yarn-site.xml
    • Description: Specifies minimum supernode threshold for entire cluster (default: 0.0)
  • yarn.resourcemanager.supernode-threshold-max
    • Configured in: yarn-site.xml
    • Description: Specifies maximum supernode threshold for entire cluster (default: 100.0)
  • yarn.app.mapreduce.am.job.supernode-threshold
    • Configured in: JobConfiguration for MR jobs
    • Description: Specifies the supernode threshold used for this job (overrides the value set in yarn.resourcemanager.supernode-threshold). Set to -1 to use the cluster-wide threshold.
  • yarn.resourcemanager.tuning.enabled
    • Configured in: yarn-site.xml
    • Description: Specifies whether RM tuning service is enabled.
  • yarn.resourcemanager.tuning.fcl
    • Configured in: yarn-site.xml
    • Description: Specifies the fuzzy control logic rule definition file used by the fuzzy inference system during RM tuning (default: hadoop.fcl, default directory: $HOME/config)
  • yarn.resourcemanager.tuning.knob-increment
    • Configured in: yarn-site.xml
    • Description: Specifies the amount to increase/decrease the knob when the tuning system indicates a change is needed.
  • yarn.resourcemanager.tuning.resource-input
    • Configured in: yarn-site.xml
    • Description: Specifies which resource is used as input during tuning decision making. (default: "cpu", also supports "memory")
  • yarn.resourcemanager.tuning.time-step
    • Configured in: yarn-site.xml
    • Description: Specifies how frequently the tuning logic is evaluated, in seconds (default: 30)
  • yarn.resourcemanager.tuning.under-alloc-limit
    • Configured in: yarn-site.xml
    • Description: Specifies the percentage boundary for when resource usage is considered under-allocated (default: 0.75)
  • yarn.resourcemanager.tuning.over-alloc-limit
    • Configured in: yarn-site.xml
    • Description: Specifies the percentage boundary for when resource usage is considered over-allocated (default: 1.25)
  • yarn.nodemanager.node-score-cpu-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for CPU when NM calculates node score. (default: 5)
  • yarn.nodemanager.node-score-memory-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for Memory when NM calculates node score. (default: 4)
  • yarn.nodemanager.node-score-disk-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for Disk I/O when NM calculates node score. (default: 1)
  • yarn.nodemanager.node-score-history-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for prior (historical) node scores when NM calculates node score. (default: 1)
  • yarn.nodemanager.history-time-unit
    • Configured in: yarn-site.xml
    • Description: Specifies, in minutes, the granularity of each history unit stored to the local data store. (default: 10)
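
The cluster-wide properties above are set as standard Hadoop <property> entries in yarn-site.xml. The sketch below is illustrative only: the values are either the documented defaults or arbitrary example choices (not recommendations), the fragment file name is made up for this example, and the jar path assumes a standard Hadoop 3.3.0 layout under $HADOOP_HOME. The per-job threshold can be passed on the command line of a MapReduce job via the standard -D mechanism.

    # Sketch: write a scratch fragment with a few of the tunables, then merge it
    # by hand into the <configuration> element of etc/hadoop/yarn-site.xml on the
    # resourcemanager/nodemanager hosts. Values are illustrative, not recommendations.
    fragment='
      <property>
        <name>yarn.resourcemanager.supernode-threshold</name>
        <value>60.0</value>
      </property>
      <property>
        <name>yarn.resourcemanager.tuning.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.resourcemanager.tuning.resource-input</name>
        <value>cpu</value>
      </property>
      <property>
        <name>yarn.nodemanager.node-score-cpu-weight</name>
        <value>5</value>
      </property>'
    printf '%s\n' "$fragment" > supernode-tuning-fragment.xml

    # Per-job override for a MapReduce job; setting it to -1 falls back to the
    # cluster-wide yarn.resourcemanager.supernode-threshold value.
    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
      wordcount -D yarn.app.mapreduce.am.job.supernode-threshold=75 \
      /input /output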

Evaluation Workloads

The project currently supports the following evaluation workloads. All evaluation scripts can be found in the eval directory; example invocations follow the list below.

  • TeraGen/Sort/Validate: from the eval directory use

    • ./eval_tera.sh <number of GB> [<speculator classname> <supernode threshold [0-100]>] (the two bracketed arguments are optional but must be supplied together)
    • <speculator classname> defaults to org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator (see Speculator Options for more)
    • <supernode threshold> defaults to 70
  • RandomTextWriter + WordCount: from the eval directory use

    • ./eval_wordcount.sh <maps per host> <bytes per map> [<speculator classname> <supernode threshold [0-100]>] (the two bracketed arguments are optional but must be supplied together)
    • <speculator classname> defaults to org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator (see Speculator Options for more)
    • <supernode threshold> defaults to 70
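
For illustration, the invocations below exercise both workloads, first with the defaults and then with an explicit speculator and threshold; the job sizes are arbitrary examples rather than recommended settings.

    cd eval

    # TeraGen/Sort/Validate over 10 GB with the default speculator and threshold (70)
    ./eval_tera.sh 10

    # Same workload, explicitly passing the built-in speculator and a threshold of 80
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator 80

    # RandomTextWriter + WordCount: 2 maps per host, 1073741824 bytes (1 GB) per map
    ./eval_wordcount.sh 2 1073741824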

Ad-hoc Environment Simulation

All ad-hoc simulation scripts can be found in the eval directory; a usage sketch follows the list below.

  • To start ad-hoc simulation: from the eval directory use
    • ./start_adhoc.sh
  • To stop ad-hoc simulation: use Ctrl+C
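
A typical session might interleave the simulation with an evaluation workload in two terminals; the pairing below is only an illustrative sketch, not a prescribed procedure.

    cd eval

    # Terminal 1: start the ad-hoc environment simulation (stop it with Ctrl+C when done)
    ./start_adhoc.sh

    # Terminal 2: run a workload while the simulation is active
    ./eval_tera.sh 10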

Speculator Options

  • DefaultSpeculator (default/hadoop built-in)
  • HistoryAwareSpeculator (most different from DefaultSpeculator)
  • RACSpeculator (closest to original Adoop)
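
A speculator is selected per run through the <speculator classname> argument of the evaluation scripts. Only DefaultSpeculator's fully qualified name is given above; the sketch below assumes the custom speculators live in the same org.apache.hadoop.mapreduce.v2.app.speculate package, so check the source tree and adjust the package name if it differs.

    cd eval

    # Built-in Hadoop speculator (the scripts' default)
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator 70

    # Assumed fully qualified name for the history-aware speculator; verify against the codebase
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.HistoryAwareSpeculator 70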