# AHA
The codebase for a project on adaptive Hadoop for ad-hoc environments.
## Project Setup
A customized Apache Hadoop 3.3.0 cluster can be set up and deployed automatically using the bash scripts found in the `setup` directory by following the steps below (tested on GCP and AWS).
- Create a project on either GCP or AWS and create some VM instances (resources per node, firewalls, etc.)
  - e.g. 1 namenode and 3 datanodes, each with 2 vCPUs and 4 GB of memory
- Make sure to set up passwordless SSH for all instances!
- Create the file `./setup/ssh_credentials` with SSH credentials adhering to the format `<path to ssh key> <ssh username>`
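  A hypothetical `ssh_credentials` file (the key path and username below are placeholders):

  ```
  ~/.ssh/id_rsa hadoop-user
  ```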
- Create `./setup/config/namenodes` with the corresponding IP address of the namenode adhering to the format below (space delimited). (This file needs to be updated whenever VMs are rebooted!)
  - For GCP: `<External IP> <Internal IP> <vCPU for instance> <Memory for instance (in GB)>`
  - For AWS: `<Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance (in GB)>`
- Create `./setup/config/datanodes` with the corresponding IP address for each of the datanodes adhering to the format below (space delimited). (This file needs to be updated whenever VMs are rebooted!)
  - For GCP: `<External IP> <Internal IP> <vCPU for instance> <Memory for instance (in GB)>`
  - For AWS: `<Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance (in GB)>`
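  For example, a hypothetical GCP `datanodes` file for three nodes (all addresses below are placeholders):

  ```
  35.202.10.11 10.128.0.2 2 4
  35.202.10.12 10.128.0.3 2 4
  35.202.10.13 10.128.0.4 2 4
  ```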
- `cd ./setup`
- Set up the cluster using
  - `./install_jfuzzylogic.sh` (First time only!)
  - `./build_hadoop.sh`
  - `./setup_cluster.sh`
- Start the cluster using `./start_cluster.sh`
- Stop the cluster using `./stop_cluster.sh`
- (Optional) Test the cluster with a simple word count job: `./run_wc.sh`
- (Optional) Enable the GCP stackdriver-agent to set up monitoring via GCP: `./enable_gcp_monitoring_agent.sh`
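The steps above can be summarized as a single shell session (a sketch only; it assumes the `ssh_credentials`, `namenodes`, and `datanodes` files are already in place):

```shell
cd ./setup
./install_jfuzzylogic.sh   # first time only
./build_hadoop.sh          # build the customized Hadoop distribution
./setup_cluster.sh         # deploy Hadoop to all nodes
./start_cluster.sh         # start the cluster daemons
./run_wc.sh                # optional: smoke-test with a word count job
./stop_cluster.sh          # shut the cluster down
```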
## Tunable Configurations
The following is a list of tunable configurations in this project.
- `yarn.resourcemanager.supernode-threshold`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the supernode threshold for the entire cluster (default: 60.0). This value is passed to each AM inside allocate RPC calls.
- `yarn.resourcemanager.supernode-threshold-min`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the minimum supernode threshold for the entire cluster (default: 0.0)
- `yarn.resourcemanager.supernode-threshold-max`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the maximum supernode threshold for the entire cluster (default: 100.0)
- `yarn.app.mapreduce.am.job.supernode-threshold`
  - Configured in: JobConfiguration for MR jobs
  - Description: Specifies the supernode threshold used for this job (overrides the value set in `yarn.resourcemanager.supernode-threshold`). Set to -1 to use the cluster-wide threshold.
- `yarn.resourcemanager.tuning.enabled`
  - Configured in: `yarn-site.xml`
  - Description: Specifies whether the RM tuning service is enabled.
- `yarn.resourcemanager.tuning.knob-increment`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the amount to increase/decrease the knob when the tuning system indicates a change is needed.
- `yarn.resourcemanager.tuning.resource-input`
  - Configured in: `yarn-site.xml`
  - Description: Specifies which resource is used as input during tuning decision making (default: "cpu"; also supports "memory")
- `yarn.resourcemanager.tuning.time-step`
  - Configured in: `yarn-site.xml`
  - Description: Specifies how frequently the tuning logic is evaluated, in seconds (default: 30)
- `yarn.resourcemanager.tuning.fcl`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the fuzzy control logic (FCL) rule definition file used by the fuzzy inference system during RM tuning (default: `hadoop.fcl`, default directory: `$HOME/config`)
- `yarn.resourcemanager.tuning.under-alloc-limit`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the percentage boundary below which resource usage is considered under-allocated (default: 0.75)
- `yarn.resourcemanager.tuning.over-alloc-limit`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the percentage boundary above which resource usage is considered over-allocated (default: 1.25)
- `yarn.nodemanager.node-score-cpu-weight`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the weight for CPU when the NM calculates the node score (default: 5)
- `yarn.nodemanager.node-score-memory-weight`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the weight for memory when the NM calculates the node score (default: 4)
- `yarn.nodemanager.node-score-disk-weight`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the weight for disk I/O when the NM calculates the node score (default: 1)
- `yarn.nodemanager.node-score-history-weight`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the weight for prior (historical) node scores when the NM calculates the node score (default: 1)
- `yarn.nodemanager.history-time-unit`
  - Configured in: `yarn-site.xml`
  - Description: Specifies the granularity, in minutes, of each history unit stored to the local data store (default: 10)
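As a concrete illustration, a hypothetical `yarn-site.xml` fragment enabling the RM tuning service with some of the properties above (the values shown are examples, not recommendations):

```xml
<property>
  <name>yarn.resourcemanager.tuning.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.tuning.resource-input</name>
  <value>cpu</value>
</property>
<property>
  <name>yarn.resourcemanager.tuning.time-step</name>
  <value>30</value>
</property>
<property>
  <name>yarn.resourcemanager.supernode-threshold</name>
  <value>60.0</value>
</property>
```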
## Evaluation Workloads
The project currently supports the following evaluation workloads. All evaluation scripts can be found in the `eval` directory.
- TeraGen/Sort/Validate: from the `eval` directory use `./eval_tera.sh <number of GB> [<speculator classname> <supernode threshold [0-100]>]` (the two bracketed arguments are optional, but must be supplied together)
  - `<speculator classname>` defaults to `org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator` (see Speculator Options for more)
  - `<supernode threshold>` defaults to `70`
- RandomTextWriter + WordCount: from the `eval` directory use `./eval_wordcount.sh <maps per host> <bytes per map> [<speculator classname> <supernode threshold [0-100]>]` (the two bracketed arguments are optional, but must be supplied together)
  - `<speculator classname>` defaults to `org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator` (see Speculator Options for more)
  - `<supernode threshold>` defaults to `70`
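For example, hypothetical invocations (the sizes and threshold below are illustrative values, not recommendations):

```shell
# 10 GB TeraGen/Sort/Validate with the default speculator and a threshold of 80
./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator 80

# RandomTextWriter + WordCount: 2 maps per host, 1 GB per map, all defaults
./eval_wordcount.sh 2 1073741824
```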
## Ad-hoc Environment Simulation
All ad-hoc simulation scripts can be found in the `eval` directory.

- To start the ad-hoc simulation: from the `eval` directory use `./start_adhoc.sh`
- To stop the ad-hoc simulation: use Ctrl+C
## Speculator Options
- `DefaultSpeculator` (default/Hadoop built-in)
- `HistoryAwareSpeculator` (most different from DefaultSpeculator)
- `RACSpeculator` (closest to the original Adoop)