Symbolic regression with Python code

This notebook is an example of symbolic regression, i.e. the search for an algebraic expression that fits given data points well. It evolves a Python expression that implements a function taking two values (x, y) as input and returning one value (z) as output. The goal is to minimize the deviation of the function's output from given data points (x, y, z).

References

- David R. White et al.: Better GP benchmarks: community survey results and proposals. Genetic Programming and Evolvable Machines 14, 3-29 (2013)

In [1]:
import time
import warnings

import alogos as al
import numpy as np
import unified_map as um
import unified_plotting as up
from numpy import sin, cos, exp, log, sqrt
In [2]:
warnings.filterwarnings('ignore')  # ignore numerical warnings like division by zero

Preparation

1) Generate data points

A function f with two input variables x and y and one output variable z is used to generate a list of data points (x, y, z). This is achieved by evaluating the function z = f(x, y) on a list of (x, y) pairs, which are chosen to lie on a regular grid.

The Pagie-1 polynomial was chosen as function f, because it was suggested as a benchmark problem in the article Better GP benchmarks: community survey results and proposals.
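Written out, the Pagie-1 polynomial is $f(x, y) = \frac{1}{1 + x^{-4}} + \frac{1}{1 + y^{-4}}$, which is exactly what the next cell implements.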

In [3]:
def pagie1_polynomial(x, y):
    return 1.0/(1.0+x**-4) + 1.0/(1.0+y**-4)

n = 30
a, b = -5, +5

xy_grid = np.array([(xi, yj) for xi in np.linspace(a, b, n) for yj in np.linspace(a, b, n)])
xa = np.array([el[0] for el in xy_grid])
ya = np.array([el[1] for el in xy_grid])
za = np.array([pagie1_polynomial(xi, yi) for xi, yi in xy_grid])
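
As an aside, the same grid and target values can be built more idiomatically with np.meshgrid; a minimal equivalent sketch (za2 is an illustrative name):

X, Y = np.meshgrid(np.linspace(a, b, n), np.linspace(a, b, n), indexing='ij')
za2 = pagie1_polynomial(X.ravel(), Y.ravel())
assert np.allclose(za2, za)  # same values as the list-comprehension version
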
In [4]:
print('Generated {} data points in the form of (x, y, z) triplets.'.format(len(za)))

up.plotly.scatter_3d(xa, ya, za, marker_color='black', marker_size=3)
Generated 900 data points in the form of (x, y, z) triplets.
Out[4]: (interactive 3D scatter plot of the 900 data points)

Definition of search space and goal

1) Grammar

This grammar defines the search space: the language of Python expressions in the variables x and y that can be evaluated with a given (x, y) input to give a z output.

In [5]:
ebnf_text_1 = """
EXPR = "(" EXPR OP EXPR ")" | PREOP "(" EXPR ")" | VAR | CONST
OP = "+" | "-" | "*" | "/"
PREOP = "sin" | "cos" | "exp" | "log" | "sqrt"
VAR = "x" | "y"
CONST = DIGIT "." DIGIT
DIGIT = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
"""

ebnf_text_2 = """
EXPR = "(" EXPR OP EXPR ")" | VAR | CONST
OP = "+" | "-" | "*" | "/"
VAR = "x" | "y"
CONST = "1.0"
"""

grammar = al.Grammar(ebnf_text=ebnf_text_2)
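
To get a feel for the search space, a few random strings can be drawn from the grammar's language (the output varies from run to run):

for _ in range(3):
    print(grammar.generate_string())
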

2) Objective function

The objective function gets a candidate solution (= a string of the grammar's language) and returns a fitness value for it. This is done by 1) evaluating the string as a Python expression within a function definition and then 2) evaluating that function for each (x, y) pair in the given grid and comparing the resulting z values with the target z values: the smaller the deviation, the better the candidate.

In [6]:
def string_to_function(string):
    def func(x, y):
        return eval(string)
    return func


def objective_function(string):
    if len(string) > 500:
        return float('inf')

    func = string_to_function(string)
    zf = func(xa, ya)  # note: invalid operations such as division by zero can yield inf or nan values
    sse = np.sum((zf - za)**2)
    if np.isnan(sse):
        return float('inf')  # treat a nan result as worst possible fitness
    return sse
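
As a quick illustration of the whole pipeline (string -> function -> fitness), any hand-written string of the grammar's language can be scored directly; the candidate below is an arbitrary example:

print(objective_function('((x*x)+(y*y))'))
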

Generation of a random solution

Check if grammar and objective function work as intended.

In [7]:
random_string = grammar.generate_string()
print(random_string)
1.0
In [8]:
objective_function(random_string)
Out[8]:
501.15063896163883

Search for an optimal solution

Evolutionary optimization with random variation and non-random selection is used to find increasingly better candidate solutions.
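
Conceptually, each generation creates offspring by random variation and then keeps the fitter individuals. The toy sketch below illustrates this loop in plain Python; it is purely schematic and not how alogos is implemented (toy_evolution, vary and its arguments are hypothetical names):

import random

def toy_evolution(population, objective, vary, num_generations):
    for _ in range(num_generations):
        # Random variation: derive offspring from randomly picked parents
        offspring = [vary(random.choice(population)) for _ in population]
        # Non-random selection: keep the individuals with the best fitness
        candidates = population + offspring
        candidates.sort(key=objective)
        population = candidates[:len(population)]
    return population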

1) Parameterization

In [9]:
report_time = True
display_results = False

num_gen = 300
num_ind = 600

ea = al.EvolutionaryAlgorithm(
    grammar, objective_function, 'min',
    max_generations=num_gen,
    max_or_min_fitness=0.001,
    max_runtime_in_seconds=300,
    population_size=num_ind, offspring_size=num_ind,
    evaluator=um.univariate.parallel.futures,
    verbose=True,
    max_nodes=500,
)

2) Run

The search is performed one generation after another, and intermediate results are reported to show how the solutions gradually improve.

In [10]:
best_ind = ea.run()
Progress         Generations      Evaluations      Runtime (sec)    Best fitness    
..... .....      10               5048             9.9              372.5355030568117
..... .....      20               9040             11.8             207.98155741874257
..... .....      30               13526            13.8             128.8446004950123
..... .....      40               18042            15.9             115.75819905573331
..... .....      50               22905            18.2             111.76596596229132
..... .....      60               27823            20.5             110.5245483085387
..... .....      70               32855            23.0             88.84833624869765
..... .....      80               38014            25.7             28.819597040823833
..... .....      90               43304            28.5             24.96120749648166
..... .....      100              48042            31.4             18.408307674685446
..... .....      110              52770            34.1             16.20136255479693
..... .....      120              57562            36.9             12.18172167716684
..... .....      130              62431            39.7             10.189942694179642
..... .....      140              67318            42.7             8.256614540753905
..... .....      150              72030            45.4             6.138793671428651
..... .....      160              76688            48.2             6.078267415761433
..... .....      170              81106            50.9             5.461639749430844
..... .....      180              85782            53.7             2.4632192147035483
..... .....      190              90469            56.5             1.9427671298390243
..... .....      200              94709            59.3             1.9427671298390243
..... .....      210              98873            61.9             1.9427671298390243
..... .....      220              103299           64.6             1.9427671298390243
..... .....      230              107986           67.4             1.9427671298390243
..... .....      240              112751           70.2             1.9427671298390243
..... .....      250              117628           73.1             0.8489957977210127
..... .....      260              122638           76.0             0.6823928146154141
..... .....      270              127433           78.8             0.6823928146154141
..... .....      280              131887           81.7             0.6351751297242041
..... .....      290              136437           84.5             0.627981680776138
..... .....      300              140912           87.3             0.6258111596495319


Finished         300              140912           87.5             0.6258111596495319

3) Result

Show the phenotype of the best individual found so far. If more computing time is provided, increasingly better solutions may be discovered.

In [11]:
string = best_ind.phenotype
func = string_to_function(string)
print(string)
(1.0-(((1.0-((1.0/(1.0+(((1.0+(x/x))/((x*x)+(y*(y/1.0))))/((y*y)+(x*x)))))-1.0))/(1.0+(((x*x)+((1.0/(1.0+((x/x)/1.0)))/(1.0+((x*x)/(1.0+(((y*y)/(1.0+1.0))-(1.0/1.0)))))))/((1.0/(1.0/((1.0*(1.0/(1.0/(1.0+(1.0/1.0)))))/((((x*x)/1.0)/1.0)*1.0))))-((1.0/(1.0*(1.0-((1.0-(x/(1.0/x)))-1.0))))-(((x*x)/(y*y))/(y*y)))))))-1.0))
In [12]:
zf = func(xa, ya)

up.plotly.scatter_3d(xa, ya, za, marker_color='black', marker_size=3) + \
up.plotly.surface(xa, ya, zf, show_colormap=False)
Out[12]: (interactive 3D plot of the data points together with the fitted surface)
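
The evolved expression fits well but is unwieldy. As an optional post-processing step, it can be simplified symbolically; the sketch below assumes SymPy is available, which is not used elsewhere in this notebook:

import sympy

# The grammar only produces numbers, the variables x and y, parentheses
# and the operators + - * /, so SymPy can parse the string directly.
print(sympy.simplify(string))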