Commit 61b8db68 authored by Jae Young Lee's avatar Jae Young Lee

Merge branch 'wise-move-tomac-init' into 'master'

WiseMove Release (TOMAC-init).

See merge request !6
parents 6b50d1d1 878581c7
# SET BASE IMAGE:
FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
# Setup basic commands
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
x11-apps \
build-essential \
curl \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
rsync \
software-properties-common \
unzip \
libcupti-dev
# Setup environment variables
ENV LD_LIBRARY_PATH /usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
ENV CUDA_HOME /usr/local/cuda-9.0
ENV DEBIAN_FRONTEND=noninteractive
# Setup python 3.6
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update
RUN apt-get install -y python3.6 \
python3-tk
RUN apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Setup pip
RUN curl -O https://bootstrap.pypa.io/get-pip.py \
&& python3.6 get-pip.py \
&& rm get-pip.py
# set symlinks to python3.6
RUN rm /usr/bin/python3
RUN ln -s /usr/bin/python3.6 /usr/bin/python
RUN ln -s /usr/bin/python3.6 /usr/bin/python3
# setup pip packages
RUN python -m pip install --no-cache-dir -U ipython pip setuptools
RUN python -m pip install --no-cache-dir tensorflow-gpu==1.9.0
ENV PYTHON_PACKAGES="\
matplotlib \
keras==2.2.4 \
keras-rl \
h5py \
gym \
tqdm \
"
RUN pip install --upgrade pip
RUN pip install --no-cache-dir $PYTHON_PACKAGES
# setup user
ARG build_uid=1000
ARG build_gid=1000
ARG build_username=devuser
RUN groupadd -g ${build_gid} ${build_username} && \
useradd -m -u ${build_uid} -g ${build_gid} ${build_username}
RUN usermod -a -G sudo ${build_username}
RUN echo "${build_username} ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
# set user for container login
USER ${build_username}
\ No newline at end of file
# WiseMove
WiseMove is a hierarchical framework for investigating safe reinforcement learning, using incremental "learntime" verification of temporal logic constraints.
<br/>
<div align="center">
<div align="center">
<img src="documentation/figures/highlevel.gif"/>
</div>
<br/>
Requirements
------------
* Python 3.6
* Sphinx
* Please check `requirements.txt` for the list of required Python packages.
Installation
------------
* Run the dependency installation script `./scripts/install_dependencies.sh` to install pip3 and the required Python packages.
Note: The script checks whether the dependencies folder exists in the project root. If it does, it installs the packages locally from that folder; otherwise, it downloads the required packages from the internet.
If you do not have an internet connection and the dependencies folder does not exist, first run `./scripts/download_dependencies.sh` on a machine with an internet connection, then transfer that folder.
<img src="documentation/markdown/logo.png" height="70" />
<img src="documentation/markdown/motto.png" height="30" /><br>
Documentation
-------------
<img src="documentation/figures/highlevel.gif" width=400/><br><br>
* Open `./documentation/index.html` to view the documentation.
* If the file does not exist, run `./scripts/generate_doc.sh build` to generate the documentation first. Note that this requires Sphinx to be installed.
Replicate Results
-----------------
Given below are the minimum steps required to replicate the results for the simple_intersection environment. For a detailed user guide, please refer to the documentation.
* Open terminal and navigate to the root of the project directory.
* Low-level policies:
* Use `python3 low_level_policy_main.py --help` to see all available commands.
* You can choose to test the provided pre-trained options:
* To visually inspect all pre-trained options: `python3 low_level_policy_main.py --test`
* To evaluate all pre-trained options: `python3 low_level_policy_main.py --evaluate`
* To visually inspect a specific pre-trained policy: `python3 low_level_policy_main.py --option=wait --test`.
* To evaluate a specific pre-trained policy: `python3 low_level_policy_main.py --option=wait --evaluate`.
* Available options are: wait, changelane, stop, keeplane, follow
* Or, you can train and test all the options, noting that this may take some time. Newly trained policies are saved to the root folder by default.
* To train all low-level policies from scratch (~40 minutes): `python3 low_level_policy_main.py --train`.
* To visually inspect all the new low-level policies: `python3 low_level_policy_main.py --test --saved_policy_in_root`.
* To evaluate all the new low-level policies: `python3 low_level_policy_main.py --evaluate --saved_policy_in_root`.
* Make sure the training is fully complete before running the above test/evaluation.
* It is faster to verify the training of a few options using the commands below (**Recommended**):
* To train a single low-level policy, e.g., *changelane* (~6 minutes): `python3 low_level_policy_main.py --option=changelane --train`. This is saved to the root folder.
* To evaluate the new *changelane*: `python3 low_level_policy_main.py --option=changelane --evaluate --saved_policy_in_root`
* Available options are: wait, changelane, stop, keeplane, follow
* **To replicate the experiments without additional properties:**
* Note that we have not provided a pre-trained policy that was trained without the additional LTL properties.
* You will need to train it by adding the argument `--without_additional_ltl_properties` to the above *training* procedures. For example, `python3 low_level_policy_main.py --option=changelane --train --without_additional_ltl_properties`
* Now, use `--evaluate` to evaluate this new policy: `python3 low_level_policy_main.py --option=changelane --evaluate --saved_policy_in_root`
* **The result of `--evaluate` here is a single trial.** In the experiments reported in the paper, we conduct multiple such trials.
* High-level policy:
* Use `python3 high_level_policy_main.py --help` to see all available commands.
* You can use the provided pre-trained high-level policy:
* To visually inspect this policy: `python3 high_level_policy_main.py --test`
* To **replicate the experiment** used for reported results (~5 minutes): `python3 high_level_policy_main.py --evaluate`
* Or, you can train the high-level policy from scratch. Note that this takes some time:
* To train using pre-trained low-level policies for 0.2 million steps (~50 minutes): `python3 high_level_policy_main.py --train`
* To visually inspect this new policy: `python3 high_level_policy_main.py --test --saved_policy_in_root`
* To **replicate the experiment** used for reported results (~5 minutes): `python3 high_level_policy_main.py --evaluate --saved_policy_in_root`.
* Since the above training takes a long time, you can instead verify using a smaller number of steps:
* To train for 0.1 million steps (~25 minutes): `python3 high_level_policy_main.py --train --nb_steps=100000`
* Note that this has a much lower success rate of ~75%. Using this for MCTS will not reproduce the reported results.
* The average success rate and standard deviation reported by the evaluation correspond to the results of the high-level policy experiments.
* MCTS:
* Use `python3 mcts.py --help` to see all available commands.
* You can run MCTS on the provided pre-trained high-level policy:
* To visually inspect MCTS on the pre-trained policy: `python3 mcts.py --test --nb_episodes=10`
* To **replicate the experiment** used for reported results: `python3 mcts.py --evaluate`. Note that this takes a very long time (~16 hours).
* For a shorter version of the experiment: `python3 mcts.py --evaluate --nb_trials=2 --nb_episodes=10` (~20 minutes)
* Or, if you have trained a high-level policy from scratch, you can run MCTS on it:
* To visually inspect MCTS on the new policy: `python3 mcts.py --test --highlevel_policy_in_root --nb_episodes=10`
* To **replicate the experiment** used for reported results: `python3 mcts.py --evaluate --highlevel_policy_in_root`. Note that this takes a very long time (~16 hours).
* For a shorter version of the experiment: `python3 mcts.py --evaluate --highlevel_policy_in_root --nb_trials=2 --nb_episodes=10` (~20 minutes)
* You can use the arguments `--depth` and `--nb_traversals` to vary the depth of the MCTS tree (default is 5) and number of traversals done (default is 50).
* The average success rate and standard deviation reported by the evaluation correspond to the results of the MCTS experiments.
The time taken to execute the above scripts may vary depending on your configuration. The reported results were obtained using a system with the following specifications:
* Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
* 16GB memory
* Nvidia GeForce GTX 1080 Ti
* Ubuntu 16.04
Coding Standards
----------------
We follow the PEP8 style guidelines for code and PEP257 for documentation.
You do not need to keep these in mind while coding, but before
submitting a pull request, run the following two steps on each Python file you
have modified:
1. `yapf -i YOUR_MODIFIED_FILE.py`
2. `docformatter --in-place YOUR_MODIFIED_FILE.py`
<a href=documentation/markdown/features.md><img src="documentation/markdown/features_button.png" height="30" /></a>
<a href=documentation/markdown/installation.md><img src="documentation/markdown/installation_button.png" height="30" /></a>
<a href=documentation/markdown/repeatability.md><img src="documentation/markdown/repeatability_button.png" height="30" /></a>
<a href=documentation/markdown/contributing.md><img src="documentation/markdown/contributing_button.png" height="30" /></a>
<a href=documentation/markdown/license.md><img src="documentation/markdown/license_button.png" height="30" /></a>
<a href=documentation/markdown/about.md><img src="documentation/markdown/about_button.png" height="30" /></a>
</div>
`yapf` formats the code and `docformatter` formats the docstrings.
from .manual_policy import ManualPolicy
from .mcts_learner import MCTSLearner
from .rl_controller import RLController
from .kerasrl_learner import DDPGLearner, DQNLearner
from .mcts_controller import MCTSController
\ No newline at end of file
from .kerasrl.learners import DDPGLearner, DQNLearner
from .mcts_controller import MCTSController
from .learner_base import LearnerBase
# TODO: make sure that the package for PPO2 is installed.
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.policies import MlpPolicy
import numpy as np
class PPO2Agent(LearnerBase):
def __init__(self,
input_shape,
nb_actions,
env,
policy=None,
tensorboard=False,
log_path="./logs",
**kwargs):
"""The constructor which sets the properties of the class.
Args:
input_shape: Shape of observation space, e.g (10,);
nb_actions: number of values in action space;
env: env on which the agent learns
policy: stable_baselines Policy object. default is MlpPolicy
tensorboard: whether to integrate tensorboard or not
log_path="./logs",
**kwargs: other optional key-value arguments with defaults defined in property_defaults
"""
super(PPO2Agent, self).__init__(input_shape, nb_actions, **kwargs)
if policy is None:
policy = self.get_default_policy()
self.log_path = log_path
self.env = DummyVecEnv([
lambda: env
])  # PPO2 requires a vectorized environment for parallel training
self.agent_model = self.create_agent(policy, tensorboard)
def get_default_policy(self):
"""Creates the default policy.
Returns: stable_baselines Policy object. default is MlpPolicy
"""
return MlpPolicy
def create_agent(self, policy, tensorboard):
"""Creates a PPO agent.
Returns: stable_baselines PPO2 object
"""
if tensorboard:
return PPO2(
policy, self.env, verbose=1, tensorboard_log=self.log_path)
else:
return PPO2(policy, self.env, verbose=1)
def fit(self,
env=None,
nb_steps=1000000,
visualize=False,
nb_max_episode_steps=200):
# The PPO2 callback is only invoked once per episode (not per step), so the whole episode cannot be rendered here.
# To render each step, add self.env.render() to the Runner.run() method in stable_baselines' ppo2.py.
callback = self.__render_env_while_learning if visualize else None
self.agent_model.learn(total_timesteps=nb_steps, callback=callback)
@staticmethod
def __render_env_while_learning(_locals, _globals):
_locals['self'].env.render()
def save_weights(self, file_name="test_weights.h5f", overwrite=True):
self.agent_model.save(file_name)
def test_model(self,
env=None,
nb_episodes=50,
visualize=True,
nb_max_episode_steps=200):
episode_rewards = [0.0]
obs = self.env.reset()
current_episode = 1
current_step = 0
while current_episode <= nb_episodes:
# _states are only useful when using LSTM policies
action, _states = self.agent_model.predict(obs)
# here, action, rewards and dones are arrays
# because we are using vectorized env
obs, rewards, dones, info = self.env.step(action)
current_step += 1
if visualize:
self.env.render()
# Stats
episode_rewards[-1] += rewards[0]
if dones[0] or current_step > nb_max_episode_steps:
obs = self.env.reset()
print("Episode ", current_episode, "reward: ",
episode_rewards[-1])
episode_rewards.append(0.0)
current_episode += 1
current_step = 0
# Compute mean reward for the last 100 episodes
mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
print("Mean reward over last 100 episodes:", mean_100ep_reward)
def load_weights(self, file_name="test_weights.h5f"):
self.agent_model = PPO2.load(file_name)
def forward(self, observation):
return self.agent_model.predict(observation)
def set_environment(self, env):
self.env = DummyVecEnv([lambda: env])
self.agent_model.set_env(self.env)
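# Usage sketch (illustrative only): one way the PPO2Agent above might be trained and
# evaluated, assuming a discrete-action Gym environment such as CartPole-v1, and assuming
# LearnerBase accepts the (input_shape, nb_actions) arguments as used in __init__ above.
# The step counts and the weights file name below are placeholder values.
import gym

env = gym.make("CartPole-v1")
agent = PPO2Agent(input_shape=env.observation_space.shape,
                  nb_actions=env.action_space.n,
                  env=env,
                  tensorboard=False)
agent.fit(nb_steps=10000)                         # train PPO2 on the wrapped (vectorized) env
agent.save_weights("ppo2_cartpole_weights")       # delegates to stable_baselines' PPO2.save()
agent.test_model(nb_episodes=5, visualize=False)  # prints per-episode and mean rewards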
from .policy_base import PolicyBase
class ControllerBase(PolicyBase):
class ControllerBase(object):
"""Abstract class for controllers."""
def __init__(self, env, low_level_policies, start_node_alias):
@@ -9,7 +7,7 @@ class ControllerBase(PolicyBase):
self.low_level_policies = low_level_policies
# TODO: Move an intermediate class so that base class can be clean
self.current_node = self.low_level_policies[start_node_alias]
self.current_node = None if start_node_alias is None else self.low_level_policies[start_node_alias]
self.node_terminal_state_reached = False
self.controller_args_defaults = {}
@@ -23,8 +21,8 @@ class ControllerBase(PolicyBase):
To be implemented in subclass.
"""
raise NotImplemented(self.__class__.__name__ + \
"can_transition is not implemented.")
raise NotImplementedError(self.__class__.__name__ + \
"can_transition is not implemented.")
def do_transition(self, observation):
"""Do a transition, assuming we can transition. To be implemented in
@@ -34,8 +32,8 @@ class ControllerBase(PolicyBase):
observation: final observation from episodic step
"""
raise NotImplemented(self.__class__.__name__ + \
"do_transition is not implemented.")
raise NotImplementedError(self.__class__.__name__ + \
"do_transition is not implemented.")
def set_current_node(self, node_alias):
"""Sets the current node which is being executed.
@@ -43,8 +41,8 @@ class ControllerBase(PolicyBase):
Args:
node: node alias of the node to be set
"""
raise NotImplemented(self.__class__.__name__ + \
"set_current_node is not implemented.")
raise NotImplementedError(self.__class__.__name__ + \
"set_current_node is not implemented.")
# TODO: Looks generic. Move to an intermediate class/highlevel manager so that base class can be clean
''' Executes the current node until node termination condition is reached
@@ -56,16 +54,24 @@ class ControllerBase(PolicyBase):
# methods with and without MCTS.
def step_current_node(self, visualize_low_level_steps=False):
total_reward = 0
discount_rate = 1
self.node_terminal_state_reached = False
self.current_node.reset()
while not self.node_terminal_state_reached:
observation, reward, terminal, info = self.low_level_step_current_node()
if visualize_low_level_steps:
self.env.render()
# TODO: make the total_reward discounted....
total_reward += reward
# TODO: make the discount factor as a parameter.
total_reward += discount_rate * reward
discount_rate *= 0.9985
total_reward += self.current_node.high_level_extra_reward
# import time
# if (self.env.step_count > self.env.startup_delay):
# time.sleep(1)
# TODO for info
return observation, total_reward, terminal, info
@@ -77,8 +83,17 @@ class ControllerBase(PolicyBase):
'''
def low_level_step_current_node(self):
u_ego = self.current_node.low_level_policy(self.current_node.get_reduced_features_tuple())
u_ego = self.current_node.policy(self.current_node.get_reduced_features_tuple())
feature, R, terminal, info = self.current_node.step(u_ego)
self.node_terminal_state_reached = terminal
return self.env.get_features_tuple(), R, self.env.termination_condition, info
return self.env.get_features_tuple(), R, self.env.termination_condition(), info
def execute_and_get_terminal_reward(self, visualize_low_level_steps=False):
self.node_terminal_state_reached = False
terminal = False
while not terminal:
observation, reward, terminal, info = self.step_current_node(
visualize_low_level_steps=visualize_low_level_steps)
# TODO for info
return observation, reward, terminal, info
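# Illustration (hypothetical subclass, not from this package): the minimal contract that
# ControllerBase expects from its subclasses, i.e. can_transition / do_transition /
# set_current_node. The "always switch to a fixed next option" rule is invented purely to
# show where each hook fits; the real controllers (e.g. the MCTS and manual controllers
# imported above) implement richer transition logic.
class AlwaysSwitchController(ControllerBase):
    def __init__(self, env, low_level_policies, start_node_alias, next_node_alias):
        super().__init__(env, low_level_policies, start_node_alias)
        self.next_node_alias = next_node_alias

    def can_transition(self):
        # A transition is allowed once the current option reaches its terminal state.
        return self.node_terminal_state_reached

    def do_transition(self, observation):
        # Ignore the observation and always switch to the fixed next option.
        self.set_current_node(self.next_node_alias)

    def set_current_node(self, node_alias):
        self.current_node = self.low_level_policies[node_alias]
        self.current_node.reset()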
import numpy as np
from rl.agents import DDPGAgent, DQNAgent
class DQNAgentOverOptions(DQNAgent):
def __init__(self,
model,
low_level_policies,
policy=None,
test_policy=None,
enable_double_dqn=True,
enable_dueling_network=False,
dueling_type='avg',
*args,
**kwargs):
super(DQNAgentOverOptions, self).__init__(
model, policy, test_policy, enable_double_dqn,
enable_dueling_network, dueling_type, *args, **kwargs)
# TODO: Rename `low_level_policies` just to `policies`.
self.low_level_policies = low_level_policies
if low_level_policies is not None:
self.low_level_policy_aliases = list(
self.low_level_policies.keys())
def __get_invalid_node_indices(self):
"""Returns a list of option indices that are invalid according to
initiation conditions."""
invalid_node_indices = list()
for index, option_alias in enumerate(self.low_level_policy_aliases):
# TODO: Locate reset to another place as this is a "get" function.
self.low_level_policies[option_alias].reset()
if not self.low_level_policies[option_alias].initiation_condition:
invalid_node_indices.append(index)
return invalid_node_indices
def forward(self, observation):
q_values = self.get_modified_q_values(observation)
if self.training:
action = self.policy.select_action(q_values=q_values)
else:
action = self.test_policy.select_action(q_values=q_values)
# Book-keeping.
self.recent_observation = observation
self.recent_action = action
# print('forward gives %s from %s' % (action, dict(zip(self.low_level_policy_aliases, q_values))))
return action
def get_modified_q_values(self, observation):
state = self.memory.get_recent_state(observation)
q_values = self.compute_q_values(state)
if self.low_level_policies is not None:
invalid_node_indices = self.__get_invalid_node_indices()
for node_index in invalid_node_indices:
q_values[node_index] = -np.inf
return q_values
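# Illustration (made-up numbers): the effect of get_modified_q_values. Options whose
# initiation condition does not hold are masked with -inf, so neither the greedy argmax
# nor the restricted epsilon-greedy policy defined below can ever select them.
import numpy as np

q_values = np.array([0.7, 1.2, -0.3, 0.9])   # raw Q-values for four options
invalid_node_indices = [1, 3]                # options whose initiation condition failed
for node_index in invalid_node_indices:
    q_values[node_index] = -np.inf

assert int(np.argmax(q_values)) == 0         # the greedy choice falls back to a valid option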
import numpy as np
from keras.callbacks import TensorBoard
from rl.callbacks import Callback, ModelIntervalCheckpoint
from math import log, floor
class OptionsDiscounter(Callback):
def __init__(self, agent, env, gamma):
self.agent = agent
self.env = env
self.gamma = gamma
def on_action_end(self, option, logs={}):
self.agent.gamma = self.gamma ** self.env.current_node.time.num_steps
class ModelIntervalSavepoint(ModelIntervalCheckpoint):
def __init__(self, filepath, interval, verbose=0):
super(ModelIntervalSavepoint, self).__init__(filepath, interval, verbose)
def on_train_begin(self, logs={}):
self.save_model(logs)
def on_step_end(self, step, logs={}):
""" Save weights at interval steps during training """
self.total_steps += 1
if self.total_steps % self.interval == 0:
self.save_model(logs)
def save_model(self, logs={}):
filepath = self.filepath.format(step=str(int(self.total_steps / 1000)) + 'k', **logs)
if self.verbose > 0:
print('Step {}: saving model to {}'.format(self.total_steps, filepath))
self.model.save_weights(filepath, overwrite=True)
class EpisodicDataCollector(Callback):
def __init__(self, env, nb_episodes):
super().__init__()
#: the statistics of the test result, of the form (episode, avg, std, min, max)
self.nb_episodes = nb_episodes
self.episode_rewards = np.zeros(nb_episodes)
self.step_rewards = np.zeros(nb_episodes)
self.termination_rewards = np.zeros(nb_episodes)
self.nb_steps = np.zeros(nb_episodes)
self.env = env
def on_episode_end(self, episode, logs={}):
self.episode_rewards[episode] = logs['episode_reward']
self.termination_rewards[episode] = self.env.terminal_reward()
self.step_rewards[episode] = logs['episode_reward'] - self.termination_rewards[episode]
self.nb_steps[episode] = logs['nb_steps']
def on_train_end(self, logs=None):
print("\nThe statistics:")
print(f"\n\tEpisodic reward (mean, std, min, max): {self.reward_statistics}")
print(f"\n\tThe number of Steps (mean, std, min, max): {self.nb_steps_statistics}")
@staticmethod
def statistics(data):
avg = np.average(data)
std = np.std(data)
return [avg, std, np.min(data), np.max(data)]
@property
def reward_statistics(self): return self.statistics(self.episode_rewards)
@property
def nb_steps_statistics(self): return self.statistics(self.nb_steps)
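# Usage sketch (illustrative only): how these callbacks might be attached to a keras-rl
# agent, assuming `agent` is an already-constructed DQN-style learner and `env` is the
# options environment used elsewhere in this merge request (OptionsDiscounter assumes
# env.current_node.time.num_steps exists, and EpisodicDataCollector assumes
# env.terminal_reward() exists). The path, interval and step count are placeholders;
# gamma=0.9985 mirrors the per-step discount used in ControllerBase.step_current_node.
train_callbacks = [
    OptionsDiscounter(agent, env, gamma=0.9985),   # semi-MDP discount per executed option
    ModelIntervalSavepoint(filepath='checkpoints/weights_{step}.h5f', interval=10000),
]
agent.fit(env, nb_steps=200000, callbacks=train_callbacks)

# Collect per-episode statistics during evaluation.
collector = EpisodicDataCollector(env, nb_episodes=100)
agent.test(env, nb_episodes=100, callbacks=[collector])
print(collector.reward_statistics)   # [mean, std, min, max] of the episodic rewards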
from rl.memory import SequentialMemory
import numpy as np
from rl.policy import GreedyQPolicy, EpsGreedyQPolicy
class RestrictedEpsGreedyQPolicy(EpsGreedyQPolicy):
"""Implement the epsilon greedy policy
Restricted Eps Greedy policy.
This policy ensures that it never chooses the action whose value is -inf
"""
def __init__(self, eps=.1):
super(RestrictedEpsGreedyQPolicy, self).__init__(eps)
def select_action(self, q_values):
"""Return the selected action
# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selection action
"""
assert q_values.ndim == 1
nb_actions = q_values.shape[0]
index = list()
for i in range(0, nb_actions):
if q_values[i] != -np.inf:
index.append(i)
# It can happen that every q_value is -np.inf. This sometimes inevitably occurs within the
# fit and test functions of keras-rl at the terminal stage, since they force a call to
# forward() in the keras-rl learner, which in turn calls this function.
# TODO: proper exception handling, or a better way of choosing an action in this exceptional case.
if len(index) < 1:
# every q_value is -np.inf, so fall back to action = 0
action = 0
print("Warning: no action satisfies the initiation condition; action = 0 is chosen by default.")
elif np.random.uniform() <= self.eps:
action = index[np.random.randint(0, len(index))]  # uniformly pick one of the valid actions
else:
action = np.argmax(q_values)
return action
class RestrictedGreedyQPolicy(GreedyQPolicy):
"""Implement the greedy policy
Restricted Greedy policy.
This policy ensures that it never chooses the action whose value is -inf
"""
def select_action(self, q_values):
"""Return the selected action
# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selection action
"""
assert q_values.ndim == 1
# TODO: proper exception handling, or a better way of choosing an action in this exceptional case.
if np.max(q_values) == -np.inf:
# every q_value is -np.inf, so fall back to action = 0
action = 0
print("Warning: no action satisfies the initiation condition; action = 0 is chosen by default.")
else:
action = np.argmax(q_values)
return action
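# Illustration (made-up values): behaviour of the restricted policies on a masked Q-vector.
# Actions whose Q-value is -inf are never selected, except for the fallback to action 0
# when every value is -inf (handled above).
import numpy as np

q = np.array([0.2, -np.inf, 1.5, -np.inf])

greedy = RestrictedGreedyQPolicy()
assert greedy.select_action(q) == 2          # argmax over the finite entries only

eps_greedy = RestrictedEpsGreedyQPolicy(eps=0.3)
action = eps_greedy.select_action(q)
assert action in (0, 2)                      # exploration is restricted to the valid actions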
from rl.random import AnnealedGaussianProcess, RandomProcess