diff --git a/README.md b/README.md
index 68891b1136fdb0d967c0d9b1966bc00f4e806a36..9725f2749d654a0c0b4354bb25258bd3fb4d4231 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 Assignment code for course ECE 493 T25 at the University of Waterloo in Spring 2020.
 (*Code designed and created by Sriram Ganapathi Subramanian and Mark Crowley, 2020*)
 
-**Due Date:** TBD: submitted as PDF and code to LEARN dropbox.
+**Due Date:** July 5, 2020 by 11:50pm, submitted as a PDF report and code to the LEARN dropbox.
 
 **Collaboration:** You can discuss solutions and help to work out the code. But each person *must do their own work*. All code and writing will be cross-checked against each other and against internet databases for cheating.
 
@@ -12,7 +12,7 @@ Updates to code which will be useful for all or bugs in the provided code will b
 
 The domain consists of a 10x10 grid of cells. The agent being controlled is represented as a red square. The goal is a yellow oval and you receive a reward of 1 for reaching it, this ends and resets the episode. Blue squares are **pits** which yield a penalty of -10 and end the episode. Black squares are **walls** which cannot be passed through. If the agent tries to walk into a wall they will remain in their current position and receive a penalty of -.3.
 
-Their are **three tasks** defined in `run_main.py` which can be commented out to try each. They include a combination of pillars, rooms, pits and obstacles. The aim is to learn a policy that maximizes expected reward and reaches the goal as quickly as possible.
+There are **three tasks** defined in `run_main.py` which can be commented out to try each. They include a combination of pillars, rooms, pits and obstacles. The aim is to learn a policy that maximizes expected reward and reaches the goal as quickly as possible.
 
 # <img src="task1.png" width="300"/><img src="task2.png" width="300"/><img src="task3.png" width="300"/>
 
@@ -21,20 +21,14 @@ Their are **three tasks** defined in `run_main.py` which can be commented out to
 
 This assignment will have a written component and a programming component. Clone the mazeworld environment locally and run the code looking at the implemtation of the sample algorithm. Your task is to implement three other algortihms on this domain.
 
-- **(20%)** Implement SARSA
-- **(20%)** Implement QLearning
-- **(20%)** At least one other algorithm of your choice or own design.
-Suggestions to try:
-  - Policy Iteration (easy)
-  - Expected SARSA (less easy)
-  - Double Q-Learning (less easy)
-  - n-step TD or TD(Lambda) with eligibility traces (harder)
-  - Policy Gradients (harderer)
-- **(10%) bonus** Implement four algorithms in total (you can do more but we'll only look at four, you need to tell us which).
-- **(40%)** Report : Write a short report on the problem and the results of your three algorithms. The report should be submited on LEARN as a pdf.
-  - Describing each algorithm you used, define the states, actions, dynamics. Define the mathematical formulation of your algorithm, show the Bellman updates for you use.
-  - Some quantitative analysis of the results, a default plot for comparing all algorithms is given. You can do more than that.
-  - Some qualitative analysis of why one algorithm works well in each case, what you noticed along the way.
+- **(15%)** Implement Value Iteration
+- **(15%)** Implement Policy Iteration
+- **(15%)** Implement SARSA
+- **(15%)** Implement QLearning
+- **(40%)** Report: Write a short report on the problem and the results of your four algorithms. The report should be submitted on LEARN as a PDF.
+  - Describe each algorithm you used: define the states, actions, and dynamics, give the mathematical formulation of your algorithm, and show the Bellman updates you use.
+  - Some quantitative analysis of the results. A default plot for comparing all algorithms is given; you can add more plots.
+  - Some qualitative analysis of your observations: which algorithms work well in each case, what you noticed along the way, and an explanation of the differences in performance between the algorithms.
 
 ### Evaluation
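For the SARSA and Q-learning items added above, a minimal sketch of the tabular updates is shown below. This is an illustration only, not the provided mazeworld code: the `env` object, its `reset()`/`step()` interface, the action encoding, and the hyperparameter defaults (`alpha`, `gamma`, `epsilon`) are all assumptions, so adapt them to whatever classes `run_main.py` actually exposes.

```python
# Illustrative tabular SARSA / Q-learning updates. The episodic interface
# (env.reset() -> state, env.step(action) -> (next_state, reward, done)) is an
# assumed, hypothetical API -- not the provided mazeworld classes.
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Off-policy update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    state, done = env.reset(), False
    while not done:
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """On-policy update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    state, done = env.reset(), False
    action = epsilon_greedy(Q, state, actions, epsilon)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = None if done else epsilon_greedy(Q, next_state, actions, epsilon)
        target = reward if done else reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action

# Hypothetical usage, once an environment with the assumed interface exists:
# Q = defaultdict(float)
# for _ in range(1000):
#     q_learning_episode(env, Q, actions=[0, 1, 2, 3])
```

Value Iteration and Policy Iteration, by contrast, work directly from the known transition model rather than sampled episodes, so they would sweep over the maze's states and dynamics instead of using an `env.step()` loop like the one sketched here.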