OpenAI Gym: Acrobot-v1¶

This notebook shows how grammar-guided genetic programming (G3P) can be used to solve the Acrobot-v1 problem from OpenAI Gym. This is achieved by searching for a small program that defines an agent, which uses an algebraic expression of the observed variables to decide which action to take at each time step.

References¶

Wikipedia
- Closed-form expression
OpenAI Gym website
- Classic problems from control theory: an overview of environments
- Acrobot-v1: the environment solved here
Book by original author Richard Sutton
- The Acrobot
GitHub
- Leaderboard: community wiki to track user-provided solutions
- Example solution: a fixed policy written by Zhiqing Xiao

import time
import warnings

import alogos as al
import gym
import unified_map as um

warnings.filterwarnings('ignore')

Preparation¶

1) Environment¶

Acrobot-v1: The aim is to swing the free end of a two-link robot above a given height, much like a gymnast. The agent observes six values: the cosine and sine of each of the two joint angles (named x0, y0, x1, y1 in the code below) and the two angular velocities (v0, v1). It can act by applying negative torque (value 0), no torque (value 1), or positive torque (value 2), and only to the joint between the two links.
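The observation layout and the success condition can be sketched in plain Python. This is a minimal sketch, assuming the usual Gym conventions for Acrobot: unit-length links, angles measured from the downward-hanging position, and an episode that ends once the tip rises more than one link length above the pivot. The names `TORQUE_BY_ACTION`, `to_observation`, `tip_height`, and `goal_reached` are illustrative and not part of Gym.

```python
import math

# Torque applied to the joint between the two links for each discrete action
# (assumed to follow the Gym implementation of Acrobot).
TORQUE_BY_ACTION = {0: -1.0, 1: 0.0, 2: +1.0}


def to_observation(theta1, theta2, omega1, omega2):
    """Build the six-value observation (x0, y0, x1, y1, v0, v1)."""
    return (math.cos(theta1), math.sin(theta1),
            math.cos(theta2), math.sin(theta2),
            omega1, omega2)


def tip_height(theta1, theta2):
    """Height of the free end relative to the pivot, for unit-length links."""
    return -math.cos(theta1) - math.cos(theta1 + theta2)


def goal_reached(theta1, theta2):
    """True once the tip is more than one link length above the pivot."""
    return tip_height(theta1, theta2) > 1.0


hanging = to_observation(0.0, 0.0, 0.0, 0.0)  # both links pointing straight down
upright = goal_reached(math.pi, 0.0)          # both links pointing straight up
```

With both links hanging down the tip sits two link lengths below the pivot, so the agent has to pump energy into the system before the goal condition can become true.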

env = gym.make('Acrobot-v1')

2) Functions to run single or multiple simulations¶

These functions let an agent act in an environment and collect rewards until the environment signals that the episode is done.

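def simulate_single_run(env, agent, render=False):
    # Uses the classic Gym API (gym < 0.26): reset() returns only the
    # observation and step() returns a 4-tuple.
    observation = env.reset()
    episode_reward = 0.0
    while True:
        # Ask the agent for an action and advance the simulation by one step
        action = agent.decide(observation)
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if render:
            # Slow down so that the animation can be followed by eye
            time.sleep(0.03)
            env.render()
        if done:
            break
    env.close()
    return episode_reward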

def simulate_multiple_runs(env, agent, n):
    total_reward = sum(simulate_single_run(env, agent) for _ in range(n))
    mean_reward = total_reward / n
    return mean_reward
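The reward bookkeeping of this loop can be checked without Gym by using a stub environment. The following is a minimal sketch, assuming the reward scheme of Acrobot-v1 (a reward of -1 for every step and a limit of 500 steps per episode). `NeverSolvedEnv`, `ConstantAgent`, and `run_episode` are hypothetical names introduced only for this illustration.

```python
class NeverSolvedEnv:
    """Hypothetical stand-in for an environment whose goal is never reached."""

    def __init__(self, max_steps=500):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return (1.0, 0.0, 1.0, 0.0, 0.0, 0.0)

    def step(self, action):
        # Every step costs -1; the episode only ends at the step limit.
        self.steps += 1
        done = self.steps >= self.max_steps
        return (1.0, 0.0, 1.0, 0.0, 0.0, 0.0), -1.0, done, {}


class ConstantAgent:
    """Agent that always applies no torque (action 1)."""

    def decide(self, observation):
        return 1


def run_episode(env, agent):
    # Same loop structure as the single-run function, without rendering.
    observation = env.reset()
    episode_reward = 0.0
    done = False
    while not done:
        action = agent.decide(observation)
        observation, reward, done, info = env.step(action)
        episode_reward += reward
    return episode_reward


worst_case = run_episode(NeverSolvedEnv(), ConstantAgent())
```

Under these assumptions an agent that never reaches the goal collects -500.0 in an episode, which is the lower bound for any mean reward reported in this notebook.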

Example solutions¶

num_sim = 200

1) By Zhiqing Xiao¶

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        if v1 < -0.3:
            action = 0
        elif v1 > 0.3:
            action = 2
        else:
            y = y1 + x0 * y1 + x1 * y0
            if y > 0.:
                action = 0
            else:
                action = 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)

-89.24

2) By previous runs of evolutionary optimization¶

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = (6.72*((v1*((5.01-0.80)**(5.94*(0.91*(v1+1.32)))))*((y0/v1)+5.87)))
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)

-85.195

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = ((((7.79*((v0**3.36)**1.25))*x0)/((x1+2.54)-8.97))+((((((((v0+v0)/7.58)*(((8.04**v0)+8.23)+y0))-(x1-((v0+v0)/v1)))+x1)-y1)/(7.57/(5.20+y0)))-(((1.07+(4.97**6.10))-((x1+9.28)-8.64))**(2.01*((x0-((4.57/3.52)-4.46))-(0.54**y0))))))
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)

-81.575

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = ((((((y0-y0)*4.98)*(v0-3.67))/y0)+(((v1-v1)*(((6.21/3.46)+7.92)*2.15))+(((v1/(3.44*x1))*((v0/v1)+1.08))*((v1+(3.69/(8.43+y1)))/((y1-((9.55*1.51)-((1.95**((7.15/x0)+v1))+((x0**8.25)-(((y1/8.41)-4.74)*9.79)))))*(((0.35*x1)/x0)/(((y1*(x1*x0))+(v0**1.24))**(((v0/v1)+1.08)*(((7.76**v1)/8.67)+(2.92/(9.24-3.81)))))))))))*(((5.77-y1)-(y1-(((7.05/((1.49/0.38)/(4.79**v0)))*y0)/v1)))/(v0*((v0**(y0*(7.96+x1)))+((v1+x0)/1.05)))))
        action = 0 if output < 1.0 else 2
        return action

agent = Agent()
simulate_multiple_runs(env, agent, num_sim)

-88.825

Definition of search space and goal¶

1) Grammar¶

This grammar defines the search space: every string in its language is a Python program that defines an Agent class, whose instances use an algebraic expression of the observed variables to decide how to act in each situation.

ebnf_text = """
PROGRAM = L0 NL L1 NL L2 NL L3 NL L4 NL L5

L0 = "class Agent:"
L1 = "    def decide(self, observation):"
L2 = "        x0, y0, x1, y1, v0, v1 = observation"
L3 = "        output = " EXPR
L4 = "        action = 0 if output < 1.0 else 2"
L5 = "        return action"

NL = "\n"

EXPR = VAR | CONST | "(" EXPR OP EXPR ")"
VAR = "x0" | "y0" | "x1" | "y1" | "v0" | "v1"
CONST = DIGIT "." DIGIT DIGIT
OP = "+" | "-" | "*" | "/" | "**"
DIGIT = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
"""

grammar = al.Grammar(ebnf_text=ebnf_text)
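To get a feeling for the strings that the EXPR rule can produce, the rule can be expanded at random in plain Python. This is a minimal sketch and not how alogos derives strings: `random_const`, `random_expr`, and the `max_depth` limit are assumptions made for this illustration, and the depth limit is needed only because the grammar itself allows arbitrarily deep nesting.

```python
import ast
import random

VARS = ["x0", "y0", "x1", "y1", "v0", "v1"]
OPS = ["+", "-", "*", "/", "**"]


def random_const(rng):
    # CONST = DIGIT "." DIGIT DIGIT
    return f"{rng.randint(0, 9)}.{rng.randint(0, 9)}{rng.randint(0, 9)}"


def random_expr(rng, max_depth=3):
    # EXPR = VAR | CONST | "(" EXPR OP EXPR ")"
    # At depth 0 only the two non-recursive alternatives remain available.
    choices = ["var", "const"] if max_depth == 0 else ["var", "const", "binary"]
    kind = rng.choice(choices)
    if kind == "var":
        return rng.choice(VARS)
    if kind == "const":
        return random_const(rng)
    left = random_expr(rng, max_depth - 1)
    right = random_expr(rng, max_depth - 1)
    return f"({left}{rng.choice(OPS)}{right})"


rng = random.Random(0)
samples = [random_expr(rng) for _ in range(5)]
for sample in samples:
    ast.parse(sample, mode="eval")  # raises SyntaxError if not a valid expression
```

Every sample is a syntactically valid Python expression, but not every one can be evaluated safely: division by zero and powers of negative numbers are both part of the language.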

2) Objective function¶

The objective function gets a candidate solution (a string of the grammar's language) and returns a fitness value for it. This is done by 1) executing the string as a Python program, so that it creates an agent object, and then 2) using the agent in multiple simulations to see how well it handles different situations: the higher the mean reward, the better the candidate.

def string_to_agent(string):
    local_vars = dict()
    exec(string, None, local_vars)
    Agent = local_vars['Agent']
    return Agent()


def objective_function(string):
    agent = string_to_agent(string)
    avg_reward = simulate_multiple_runs(env, agent, 10)
    return avg_reward
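The first step, turning a string into an agent, can be tried on a hand-written candidate without running any simulation. The sketch below uses a candidate that belongs to the grammar's language and was chosen to show one way an expression can fail at run time. `CANDIDATE` and `build_agent` are names introduced for this illustration, and the sketch makes no claim about how such failures are treated during the search.

```python
CANDIDATE = """class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = (v1**0.50)
        action = 0 if output < 1.0 else 2
        return action"""


def build_agent(source):
    # Execute the program text and fetch the class it defines.
    namespace = {}
    exec(source, None, namespace)
    return namespace["Agent"]()


agent = build_agent(CANDIDATE)

# Positive velocity: 4.0**0.5 == 2.0, which is not below 1.0, so action 2.
action_ok = agent.decide((1.0, 0.0, 1.0, 0.0, 0.0, 4.0))

# Negative velocity: a negative float raised to a fractional power yields a
# complex number in Python 3, and comparing it with "<" raises TypeError.
try:
    agent.decide((1.0, 0.0, 1.0, 0.0, 0.0, -4.0))
    failed = False
except TypeError:
    failed = True
```

Candidates like this one are valid strings of the language but not usable agents, so an evaluation has to cope with exceptions such as TypeError, ZeroDivisionError, or OverflowError.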

Generation of a random solution¶

Check that the grammar and the objective function work as intended.

random_string = grammar.generate_string()
print(random_string)

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = y0
        action = 0 if output < 1.0 else 2
        return action

objective_function(random_string)

-500.0

Search for an optimal solution¶

Evolutionary optimization with random variation and non-random selection is used to find increasingly better candidate solutions.
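The interplay of random variation and non-random selection can be illustrated on a toy problem. This is a minimal sketch of a generic evolutionary loop over single floats, not the algorithm implemented in alogos: the function `evolve`, its parameters (including the early-stopping threshold `target_fitness`), and the toy objective are all assumptions made for this illustration.

```python
import random


def evolve(objective, population_size=20, offspring_size=20,
           generations=50, target_fitness=None, seed=0):
    """Toy evolutionary search over single floats (maximization)."""
    rng = random.Random(seed)
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    for _ in range(generations):
        # Random variation: each offspring is a mutated copy of a random parent.
        offspring = [rng.choice(population) + rng.gauss(0.0, 0.5)
                     for _ in range(offspring_size)]
        # Non-random selection: keep the fittest of parents and offspring.
        population = sorted(population + offspring, key=objective,
                            reverse=True)[:population_size]
        # Optional early stop once the best individual is good enough.
        if target_fitness is not None and objective(population[0]) >= target_fitness:
            break
    return population[0]


def toy_objective(x):
    # Single maximum at x = 3 with fitness 0.
    return -(x - 3.0) ** 2


best = evolve(toy_objective, target_fitness=-0.01)
```

In the notebook the individuals are program strings rather than floats and variation acts on their derivations, but the loop has the same shape: vary, evaluate, select, and stop once a fitness threshold or a budget is reached.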

1) Parameterization¶

ea = al.EvolutionaryAlgorithm(
    grammar, objective_function, 'max', max_or_min_fitness=-100,
    population_size=20, offspring_size=20, evaluator=um.univariate.parallel.futures, verbose=True)

2) Run¶

best_ind = ea.run()

Progress         Generations      Evaluations      Runtime (sec)    Best fitness    
..... .....      10               196              31.1             -339.6
..... ..

Finished         17               302              56.2             -83.2

3) Result¶

string = best_ind.phenotype
print(string)

class Agent:
    def decide(self, observation):
        x0, y0, x1, y1, v0, v1 = observation
        output = (2.67-((7.93-5.95)-v1))
        action = 0 if output < 1.0 else 2
        return action

agent = string_to_agent(string)
simulate_multiple_runs(env, agent, 100)

-94.22
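The expression found in this run can be simplified by hand: 2.67 - ((7.93 - 5.95) - v1) = 0.69 + v1, so the condition output < 1.0 is equivalent to v1 < 0.31. The short check below compares both forms on a few sample velocities. The function names and the sample values are chosen only for this illustration.

```python
def evolved_output(v1):
    # Expression exactly as it appears in the evolved program.
    return (2.67 - ((7.93 - 5.95) - v1))


def evolved_action(v1):
    return 0 if evolved_output(v1) < 1.0 else 2


def simplified_action(v1):
    # 2.67 - (1.98 - v1) = 0.69 + v1, so "output < 1.0" means "v1 < 0.31".
    return 0 if v1 < 0.31 else 2


test_velocities = [-5.0, -1.0, 0.0, 0.3, 0.32, 1.0, 5.0]
agree = all(evolved_action(v) == simplified_action(v) for v in test_velocities)
```

The evolved agent therefore ignores five of the six observed values and switches the direction of the torque based only on the angular velocity of the second joint.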

simulate_single_run(env, agent, render=True)

-74.0