OpenAI Gym: MountainCar-v0
This notebook shows how grammar-guided genetic programming (G3P) can be used to solve the MountainCar-v0 problem from OpenAI Gym. It searches for a small program that defines an agent, which uses an algebraic expression of the observed variables to decide which action to take at each time step.
Caution: This notebook was run with gym v0.20.0 (pip install gym==0.20.0) and pyglet v1.5.27 (pip install pyglet==1.5.27). Gym deprecated “Pendulum-v0” from v0.20.0 to v0.21.0. Gym changed its API from v0.25.2 to v0.26.0. Pyglet changed its API from 1.5.27 to 2.0.0.
References
OpenAI Gym website
    Classic problems from control theory: an overview of environments
    MountainCar-v0: the environment solved here
GitHub
    MountainCar-v0: details on the environment solved here
    Leaderboard: community wiki to track user-provided solutions
    Example solution: a fixed policy written by Zhiqing Xiao
[1]:
import time
import warnings
import alogos as al
import gym
import unified_map as um
[2]:
warnings.filterwarnings('ignore')
Preparation
1) Environment
MountainCar-v0: The aim is to drive a car up the right hill, but its engine is not strong enough, so it needs to build up momentum first. The agent observes the current position and velocity of the car. It can act by pushing the car to the left (value 0), applying no push (value 1), or pushing it to the right (value 2).
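To see why momentum is needed, the following is a minimal pure-Python sketch of the car's dynamics. The constants (force 0.001, gravity 0.0025, velocity limit 0.07, position range -1.2 to 0.6, goal at 0.5, 200-step limit, reward -1 per step) are assumptions recalled from Gym's classic-control implementation and should be checked against the installed version. The fixed start at -0.5 is also a simplification, since the real environment randomizes the start position.

```python
import math

GOAL_POSITION = 0.5  # assumed position of the flag on the right hill
MAX_STEPS = 200      # assumed episode time limit


def step(position, velocity, action):
    """Advance the simplified car by one time step (action is 0, 1 or 2)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    return position, velocity


def run_episode(policy, position=-0.5, velocity=0.0):
    """Return the episode reward: -1 per step until the goal or the time limit."""
    total = 0.0
    for _ in range(MAX_STEPS):
        action = policy(position, velocity)
        position, velocity = step(position, velocity, action)
        total -= 1.0
        if position >= GOAL_POSITION:
            break
    return total


def always_right(position, velocity):
    return 2  # push right regardless of the state


def with_motion(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction of travel


reward_always_right = run_episode(always_right)
reward_with_motion = run_episode(with_motion)
print(reward_always_right, reward_with_motion)
```

Under these assumptions, pushing right all the time never reaches the goal, because the engine force is weaker than gravity on the slope, whereas pushing in the direction of travel adds energy on every swing.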
[3]:
env = gym.make('MountainCar-v0')
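2) Functions to run single or multiple simulations
The first function lets an agent act in an environment and collect rewards until the environment signals that the episode is done. The second function repeats this for several episodes and returns the mean reward.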
[4]:
def simulate_single_run(env, agent, render=False):
    observation = env.reset()
    episode_reward = 0.0
    while True:
        action = agent.decide(observation)
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if render:
            time.sleep(0.025)
            env.render()
        if done:
            break
    env.close()
    return episode_reward
[5]:
def simulate_multiple_runs(env, agent, n):
    total_reward = sum(simulate_single_run(env, agent) for _ in range(n))
    mean_reward = total_reward / n
    return mean_reward
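The functions above rely on the older Gym API mentioned in the caution at the top: reset() returns only the observation and step() returns four values. As a hedged sketch, the two helpers below (reset_compat and step_compat are hypothetical names, not part of Gym) accept both conventions, assuming that newer versions return (observation, info) from reset() and (observation, reward, terminated, truncated, info) from step().

```python
def reset_compat(env):
    """Return only the observation for both old and new reset() results."""
    result = env.reset()
    # Assumed newer convention: a 2-tuple (observation, info) with a dict as info.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_compat(env, action):
    """Return (observation, reward, done, info) for both step() conventions."""
    result = env.step(action)
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        # An episode is over if it reached a terminal state or hit the time limit.
        return observation, reward, terminated or truncated, info
    return result
```

With these helpers, simulate_single_run could call reset_compat(env) and step_compat(env, action) instead of the environment methods directly. Rendering also changed between versions and is not covered by this sketch.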
Example solutions
[6]:
num_sim = 200
1) By Zhiqing Xiao
[7]:
class Agent:
    def decide(self, observation):
        position, velocity = observation
        lb = min(-0.09 * (position + 0.25) ** 2 + 0.03,
                 0.3 * (position + 0.9) ** 4 - 0.008)
        ub = -0.07 * (position + 0.38) ** 2 + 0.07
        if lb < velocity < ub:
            action = 2  # push right
        else:
            action = 0  # push left
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)
[7]:
-107.605
2) By previous runs of evolutionary optimization
[8]:
class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (7.83**velocity)
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)
[8]:
-119.285
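The evolved expression above can be read as a simple rule: for any base greater than 1, the power base**velocity is below 1 exactly when velocity is negative. The agent therefore pushes left while the car moves left and pushes right otherwise. The short check below compares both formulations on a few hand-picked velocities (sign_rule_action is a name introduced here for illustration).

```python
def evolved_action(velocity):
    """Decision rule of the evolved agent shown above."""
    output = 7.83 ** velocity
    return 0 if output < 1.0 else 2


def sign_rule_action(velocity):
    """Equivalent rule: push left only while moving left."""
    return 0 if velocity < 0 else 2


velocities = [-0.07, -0.01, -0.0001, 0.0, 0.0001, 0.01, 0.07]
agreement = all(evolved_action(v) == sign_rule_action(v) for v in velocities)
print(agreement)
```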
[9]:
class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (4.59/(3.35*velocity))
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)
[9]:
-119.75
[10]:
class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (((((2.18/velocity)-velocity)-((((velocity/7.27)/position)+(velocity*(position-position)))*(((2.64+(8.48*velocity))+(5.86*position))*9.40)))+((5.59*position)+(((0.19*(((velocity-(velocity*4.62))+1.42)+(((0.09-position)**6.40)*5.21)))**((4.09/(7.32/6.71))/5.33))**((position*(position*(1.69+(3.20-3.13))))**8.44))))/(velocity/velocity))
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)
[10]:
-115.635
Definition of search space and goal
1) Grammar
This grammar defines the search space: every string of its language is a Python program that defines an Agent class, which uses an algebraic expression of the observed variables to decide how to act in each situation.
[11]:
ebnf_text = """
PROGRAM = L0 NL L1 NL L2 NL L3 NL L4 NL L5
L0 = "class Agent:"
L1 = "    def decide(self, observation):"
L2 = "        position, velocity = observation"
L3 = "        output = " EXPR
L4 = "        action = 0 if output < 1.0 else 2"
L5 = "        return action"
NL = "\n"
EXPR = VAR | CONST | "(" EXPR OP EXPR ")"
VAR = "position" | "velocity"
CONST = DIGIT "." DIGIT DIGIT
OP = "+" | "-" | "*" | "/" | "**"
DIGIT = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
"""

grammar = al.Grammar(ebnf_text=ebnf_text)
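To illustrate what the language of this grammar looks like, here is a minimal hand-written sketch that derives a random program by following the same rules. It does not use alogos and is not how the library generates strings. The depth limit is an assumption added here so that the recursion in EXPR always terminates, and the method body is assumed to be indented by four and eight spaces.

```python
import random

OPERATORS = ["+", "-", "*", "/", "**"]
VARIABLES = ["position", "velocity"]


def random_const(rng):
    """CONST = DIGIT "." DIGIT DIGIT"""
    digits = [str(rng.randrange(10)) for _ in range(3)]
    return f"{digits[0]}.{digits[1]}{digits[2]}"


def random_expr(rng, max_depth=3):
    """EXPR = VAR | CONST | "(" EXPR OP EXPR ")" with a depth limit."""
    choices = ["var", "const"] if max_depth == 0 else ["var", "const", "binary"]
    choice = rng.choice(choices)
    if choice == "var":
        return rng.choice(VARIABLES)
    if choice == "const":
        return random_const(rng)
    left = random_expr(rng, max_depth - 1)
    right = random_expr(rng, max_depth - 1)
    return f"({left}{rng.choice(OPERATORS)}{right})"


def random_program(rng, max_depth=3):
    """PROGRAM = L0 NL L1 NL L2 NL L3 NL L4 NL L5"""
    lines = [
        "class Agent:",
        "    def decide(self, observation):",
        "        position, velocity = observation",
        "        output = " + random_expr(rng, max_depth),
        "        action = 0 if output < 1.0 else 2",
        "        return action",
    ]
    return "\n".join(lines)


program = random_program(random.Random(0))
print(program)
```

Every program produced this way is syntactically valid Python, although evaluating its expression may still fail at run time, for example through a division by zero.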
2) Objective function
The objective function takes a candidate solution (a string of the grammar’s language) and returns a fitness value for it. This is done by 1) executing the string as a Python program, which defines an Agent class that is then instantiated, and 2) using the agent in multiple simulations to see how well it handles different situations: the higher the mean reward, the better the candidate.
[12]:
def string_to_agent(string):
    local_vars = dict()
    exec(string, None, local_vars)
    Agent = local_vars['Agent']
    return Agent()


def objective_function(string):
    agent = string_to_agent(string)
    avg_reward = simulate_multiple_runs(env, agent, 10)
    return avg_reward
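The two steps can be tried without an environment. The sketch below turns a hand-written candidate into an agent with exec and queries it once. It also shows one possible safeguard that the notebook itself does not use: returning a worst-case fitness when the candidate's expression raises an arithmetic error. The names build_agent, guarded_fitness and WORST_REWARD are hypothetical, and the value -200 assumes the 200-step limit with a reward of -1 per step. Whether such errors surface at all depends on the numeric types in the observation.

```python
PROGRAM = """class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (velocity/position)
        action = 0 if output < 1.0 else 2
        return action"""

WORST_REWARD = -200.0  # assumed worst case: 200 steps at -1 each


def build_agent(program):
    """Execute the program text and instantiate the Agent class it defines."""
    namespace = {}
    exec(program, None, namespace)
    return namespace["Agent"]()


def guarded_fitness(program, evaluate):
    """Evaluate a candidate, mapping run-time failures to the worst fitness."""
    try:
        return evaluate(build_agent(program))
    except (ArithmeticError, TypeError, ValueError):
        return WORST_REWARD


# Stand-in evaluations on single observations instead of real simulations.
action_ok = build_agent(PROGRAM).decide((-0.5, 0.01))
fitness_failed = guarded_fitness(PROGRAM, lambda agent: agent.decide((0.0, 0.01)))
print(action_ok, fitness_failed)
```

Here the second observation has position 0.0, so with plain Python floats the division fails and the guard returns the worst-case value.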
Generation of a random solution
Check if grammar and objective function work as intended.
[13]:
random_string = grammar.generate_string()
print(random_string)
class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (9.65*(6.93/velocity))
        action = 0 if output < 1.0 else 2
        return action
[14]:
objective_function(random_string)
[14]:
-116.8
Search for an optimal solution
Evolutionary optimization with random variation and non-random selection is used to find increasingly better candidate solutions.
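The principle can be sketched on a toy problem. The loop below is a generic evolutionary algorithm on a single real number, not the algorithm implemented in alogos, which works on grammar-based individuals. All parameter names and values are chosen here for illustration, with target_fitness playing a role similar to a fitness threshold that stops the search early.

```python
import random


def evolve(fitness, rng, population_size=20, offspring_size=20,
           sigma=0.5, target_fitness=-1e-4, max_generations=200):
    """Maximize fitness(x) over real x with mutation and truncation selection."""
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    for generation in range(1, max_generations + 1):
        # Random variation: each offspring is a mutated copy of a random parent.
        offspring = [rng.choice(population) + rng.gauss(0.0, sigma)
                     for _ in range(offspring_size)]
        # Non-random selection: keep the fittest of parents plus offspring.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:population_size]
        # Stop early once the best individual is good enough.
        if fitness(population[0]) >= target_fitness:
            break
    return population[0], generation


best, generations = evolve(lambda x: -(x - 3.0) ** 2, random.Random(1))
print(best, generations)
```

Because parents compete with their offspring, the best fitness in this sketch can never decrease from one generation to the next.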
1) Parameterization
[15]:
ea = al.EvolutionaryAlgorithm(
    grammar, objective_function, 'max',
    max_or_min_fitness=-100, population_size=200, offspring_size=200,
    evaluator=um.univariate.parallel.futures, verbose=True)
2) Run
[16]:
best_ind = ea.run()
Progress         Generations      Evaluations      Runtime (sec)    Best fitness
..... .....      10               1941             70.5             -114.7
..... .....      20               3831             125.6            -114.6
..... .....      30               5707             173.9            -114.6
..... .....      40               7524             216.8            -114.5
..... .....      50               9333             257.8            -114.5
..... .....      60               11179            303.2            -112.2
..... .....      70               13071            355.0            -111.6
..... .....      80               14957            400.4            -111.3
..... .....      90               16838            453.7            -107.0
..... .....      100              18732            508.5            -107.0
..... ..
Finished         107              20044            545.0            -98.8
3) Result
[17]:
string = best_ind.phenotype
print(string)
class Agent:
    def decide(self, observation):
        position, velocity = observation
        output = (((velocity/(((position-2.19)*velocity)**(((((8.60-(velocity-2.19))**(position/(velocity-2.19)))**(6.80+velocity))+velocity)-(4.97*6.80))))/(8.98+(3.95*3.36)))-(((2.97+velocity)/2.88)**(velocity/((2.97+velocity)/position))))
        action = 0 if output < 1.0 else 2
        return action
[18]:
agent = string_to_agent(string)
simulate_multiple_runs(env, agent, 100)
[18]:
-108.02
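The best fitness reported at the end of the run (-98.8) is higher than the 100-run average of the returned agent (-108.02). One plausible contributor, not verified here, is that the objective function averages only 10 episodes with random start positions, so the maximum over many such noisy estimates tends to be optimistic. The sketch below illustrates this effect with purely synthetic numbers: TRUE_MEAN and RUN_STD are hypothetical, and Gaussian noise is only a stand-in for real episode rewards.

```python
import random
import statistics

rng = random.Random(0)
TRUE_MEAN = -108.0  # hypothetical true mean reward of one fixed agent
RUN_STD = 8.0       # hypothetical spread of single-episode rewards


def noisy_average(n_runs):
    """Mean of n_runs synthetic episode rewards for the same agent."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, RUN_STD) for _ in range(n_runs))


# Many 10-run evaluations of the same agent, as in the objective function.
estimates = [noisy_average(10) for _ in range(1000)]
best_estimate = max(estimates)

# A single re-evaluation with 100 runs is much closer to the true mean.
recheck = noisy_average(100)
print(best_estimate, recheck)
```

Re-evaluating the final agent with more episodes, as done in the cell above, gives a more reliable estimate of its quality than the fitness value seen during the search.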
[19]:
simulate_single_run(env, agent, render=True)
[19]:
-114.0