AHA

The codebase of the AHA project on adaptive Hadoop for ad-hoc environments.

Project Setup

A customized Apache Hadoop 3.3.0 cluster can be set up and deployed automatically using the bash scripts found in the setup directory by following the steps below (tested on GCP and AWS).

  1. Create a project on either GCP or AWS and create some VM instances (resources per node, firewall rules, etc.)
    • e.g. 1 namenode and 3 datanodes, each with 2 vCPUs and 4 GB of memory
    • Make sure to set up passwordless SSH for all instances!
    • Create the file ./setup/ssh_credentials with SSH credentials adhering to the format <path to ssh key> <ssh username> (see the example sketch after this list)
  2. Create ./setup/config/namenodes with the corresponding IP addresses of the namenode, adhering to the format below (space delimited). This file needs to be updated whenever the VMs are rebooted!
    • For GCP: <External IP> <Internal IP> <vCPU for instance> <Memory for instance(in GB)>
    • For AWS: <Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance(in GB)>
  3. Create ./setup/config/datanodes with the corresponding IP addresses of each of the datanodes, adhering to the format below (space delimited). This file needs to be updated whenever the VMs are rebooted!
    • For GCP: <External IP> <Internal IP> <vCPU for instance> <Memory for instance(in GB)>
    • For AWS: <Public IPv4 address> <Public IPv4 DNS> <vCPU for instance> <Memory for instance(in GB)>
  4. cd ./setup
  5. Set up the cluster using
    • ./install_jfuzzylogic.sh (First time only!)
    • ./build_hadoop.sh
    • ./setup_cluster.sh
  6. Start the cluster using ./start_cluster.sh.
    • Stop the cluster using ./stop_cluster.sh
  7. (Optional) Test the cluster out with a simple word count job
    • ./run_wc.sh
  8. (Optional) Enable the GCP stackdriver-agent to set up monitoring via GCP
    • ./enable_gcp_monitoring_agent.sh
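
For illustration, a minimal GCP setup with one namenode and three datanodes could be prepared roughly as sketched below. The key path, username, and IP addresses are placeholders, not values the scripts ship with; substitute your own instances.

    # Hypothetical SSH credentials: <path to ssh key> <ssh username>
    echo "$HOME/.ssh/gcp_key hadoopuser" > ./setup/ssh_credentials

    # Hypothetical namenode entry (GCP format): <External IP> <Internal IP> <vCPU> <Memory in GB>
    echo "34.68.10.1 10.128.0.2 2 4" > ./setup/config/namenodes

    # Hypothetical datanode entries, one line per datanode (same format)
    printf '%s\n' \
      "34.68.10.2 10.128.0.3 2 4" \
      "34.68.10.3 10.128.0.4 2 4" \
      "34.68.10.4 10.128.0.5 2 4" > ./setup/config/datanodes

    # Build, deploy, and start the cluster, then run the optional word count smoke test
    cd ./setup
    ./install_jfuzzylogic.sh   # first time only
    ./build_hadoop.sh
    ./setup_cluster.sh
    ./start_cluster.sh
    ./run_wc.sh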

Tunable Configurations

The following is a list of tunable configurations in this project. A sample configuration sketch follows the list.

  • yarn.resourcemanager.supernode-threshold
    • Configured in: yarn-site.xml
    • Description: Specifies the supernode threshold for the entire cluster (default: 60.0). This value is passed to each AM inside allocate RPC calls.
  • yarn.resourcemanager.supernode-threshold-min
    • Configured in: yarn-site.xml
    • Description: Specifies minimum supernode threshold for entire cluster (default: 0.0)
  • yarn.resourcemanager.supernode-threshold-max
    • Configured in: yarn-site.xml
    • Description: Specifies maximum supernode threshold for entire cluster (default: 100.0)
  • yarn.app.mapreduce.am.job.supernode-threshold
    • Configured in: JobConfiguration for MR jobs
    • Description: Specifies the supernode threshold used for this job (overrides the value set in yarn.resourcemanager.supernode-threshold). Set to -1 to use the cluster-wide threshold.
  • yarn.resourcemanager.tuning.enabled
    • Configured in: yarn-site.xml
    • Description: Specifies whether RM tuning service is enabled.
  • yarn.resourcemanager.tuning.fcl
    • Configured in: yarn-site.xml
    • Description: Specifies the fuzzy control logic rule definition file used by the fuzzy inference system during RM tuning (default: hadoop.fcl, default directory: $HOME/config)
  • yarn.resourcemanager.tuning.knob-increment
    • Configured in: yarn-site.xml
    • Description: Specifies the amount to increase/decrease the knob when the tuning system indicates a change is needed.
  • yarn.resourcemanager.tuning.resource-input
    • Configured in: yarn-site.xml
    • Description: Specifies which resource is used as input during tuning decision making. (default: "cpu", also supports "memory")
  • yarn.resourcemanager.tuning.time-step
    • Configured in: yarn-site.xml
    • Description: Specifies how frequently the tuning logic is evaluated, in seconds (default: 30)
  • yarn.resourcemanager.tuning.under-alloc-limit
    • Configured in: yarn-site.xml
    • Description: Specifies the percentage boundary for when resource usage is considered under-allocated (default: 0.75)
  • yarn.resourcemanager.tuning.over-alloc-limit
    • Configured in: yarn-site.xml
    • Description: Specifies the percentage boundary for when resource usage is considered over-allocated (default: 1.25)
  • yarn.nodemanager.node-score-cpu-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for CPU when NM calculates node score. (default: 5)
  • yarn.nodemanager.node-score-memory-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for Memory when NM calculates node score. (default: 4)
  • yarn.nodemanager.node-score-disk-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for Disk I/O when NM calculates node score. (default: 1)
  • yarn.nodemanager.node-score-history-weight
    • Configured in: yarn-site.xml
    • Description: Specifies weight for prior (historical) node scores when NM calculates node score. (default: 1)
  • yarn.nodemanager.history-time-unit
    • Configured in: yarn-site.xml
    • Description: Specifies, in minutes, the granularity of each history unit stored to the local data store. (default: 10)
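
The cluster-wide properties above are set as standard Hadoop <property> entries in yarn-site.xml. The sketch below is illustrative only: the values are either the documented defaults or arbitrary example choices (not recommendations), the fragment file name is made up for this example, and the jar path assumes a standard Hadoop 3.3.0 layout under $HADOOP_HOME. The per-job threshold can be passed on the command line of a MapReduce job via the standard -D mechanism.

    # Sketch: write a scratch fragment with a few of the tunables, then merge it
    # by hand into the <configuration> element of etc/hadoop/yarn-site.xml on the
    # resourcemanager/nodemanager hosts. Values are illustrative, not recommendations.
    fragment='
      <property>
        <name>yarn.resourcemanager.supernode-threshold</name>
        <value>60.0</value>
      </property>
      <property>
        <name>yarn.resourcemanager.tuning.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.resourcemanager.tuning.resource-input</name>
        <value>cpu</value>
      </property>
      <property>
        <name>yarn.nodemanager.node-score-cpu-weight</name>
        <value>5</value>
      </property>'
    printf '%s\n' "$fragment" > supernode-tuning-fragment.xml

    # Per-job override for a MapReduce job; setting it to -1 falls back to the
    # cluster-wide yarn.resourcemanager.supernode-threshold value.
    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
      wordcount -D yarn.app.mapreduce.am.job.supernode-threshold=75 \
      /input /output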

Evaluation Workloads

The project currently supports the following evaluation workloads. All evaluation scripts can be found in the eval directory; example invocations follow the list below.

  • TeraGen/Sort/Validate: from the eval directory use

    • ./eval_tera.sh <number of GB> [<speculator classname> <supernode threshold [0-100]>] (the two bracketed arguments are optional but must be supplied together)
    • <speculator classname> defaults to org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator (see Speculator Options for more)
    • <supernode threshold> defaults to 70
  • RandomTextWriter + WordCount: from the eval directory use

    • ./eval_wordcount.sh <maps per host> <bytes per map> [<speculator classname> <supernode threshold [0-100]>] (the two bracketed arguments are optional but must be supplied together)
    • <speculator classname> defaults to org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator (see Speculator Options for more)
    • <supernode threshold> defaults to 70
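
For illustration, the invocations below exercise both workloads, first with the defaults and then with an explicit speculator and threshold; the job sizes are arbitrary examples rather than recommended settings.

    cd eval

    # TeraGen/Sort/Validate over 10 GB with the default speculator and threshold (70)
    ./eval_tera.sh 10

    # Same workload, explicitly passing the built-in speculator and a threshold of 80
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator 80

    # RandomTextWriter + WordCount: 2 maps per host, 1073741824 bytes (1 GB) per map
    ./eval_wordcount.sh 2 1073741824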

Ad-hoc Environment Simulation

All ad-hoc simulation scripts can be found in the eval directory; a usage sketch follows the list below.

  • To start ad-hoc simulation: from the eval directory use
    • ./start_adhoc.sh
  • To stop ad-hoc simulation: use Ctrl+C
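
A typical session might interleave the simulation with an evaluation workload in two terminals; the pairing below is only an illustrative sketch, not a prescribed procedure.

    cd eval

    # Terminal 1: start the ad-hoc environment simulation (stop it with Ctrl+C when done)
    ./start_adhoc.sh

    # Terminal 2: run a workload while the simulation is active
    ./eval_tera.sh 10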

Speculator Options

  • DefaultSpeculator (default/hadoop built-in)
  • HistoryAwareSpeculator (most different from DefaultSpeculator)
  • RACSpeculator (closest to original Adoop)
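
A speculator is selected per run through the <speculator classname> argument of the evaluation scripts. Only DefaultSpeculator's fully qualified name is given above; the sketch below assumes the custom speculators live in the same org.apache.hadoop.mapreduce.v2.app.speculate package, so check the source tree and adjust the package name if it differs.

    cd eval

    # Built-in Hadoop speculator (the scripts' default)
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator 70

    # Assumed fully qualified name for the history-aware speculator; verify against the codebase
    ./eval_tera.sh 10 org.apache.hadoop.mapreduce.v2.app.speculate.HistoryAwareSpeculator 70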