alogos._grammar.data_structures

Data structures for a grammar, derivation tree and their parts.

Classes

Grammar

Context-free grammar (CFG) for defining a formal language.

DerivationTree

Derivation tree for representing the structure of a string.

Node

Node inside a derivation tree.

Symbol

Symbol inside a grammar or derivation tree.

NonterminalSymbol

Nonterminal symbol inside a grammar or derivation tree.

TerminalSymbol

Terminal symbol inside a grammar or derivation tree.


Detailed object descriptions

class alogos._grammar.data_structures.Grammar(bnf_text=None, bnf_file=None, ebnf_text=None, ebnf_file=None, **kwargs)[source]

Context-free grammar (CFG) for defining a formal language.

The role of this class is explained by first introducing basic concepts from formal language theory and then describing the main tasks a grammar can be used for.

  1. Technical terms from formal language theory

    • A grammar is a mathematical device to define a formal language.

    • A language is a finite or infinite set of strings.

    • A string is a finite sequence of symbols that all belong to a single alphabet.

    • An alphabet is a finite set of symbols.

    • In mathematical terms, a grammar is usually defined as a tuple (N, T, P, S) where

      • N is a set of nonterminal symbols, also known as variables.

      • T is a set of terminal symbols, which has no overlap with N.

      • S is the start symbol, which is a nontermminal symbol and therefore an element of N.

      • P is a set of production rules, also known as rewrite rules. Beginning from the start symbol, these rules can be applied until only terminal symbols are left, which then form a string of the grammar’s language. Different kinds of grammars can be distinguished by putting different restrictions on the form of these rules. This influences how expressive the grammars are and hence what kind of languages can be defined by them.

    • A context-free grammar (CFG) is a type of formal grammar that is frequently used in computer science and programming language design. The production rules need to fulfill following criteria:

      • The left-hand side of each rule is a single nonterminal symbol. This means a nonterminal symbol can be rewritten into the right-hand side of the rule, no matter which context the nonterminal is surrounded with.

      • The right-hand side is a sequence of symbols that can consist of a combination of terminal symbols, nonterminal symbols and the empty string "" (often denoted by the greek letter ɛ or less often λ).

  2. Tasks a grammar can be used for

    • Define a grammar with a text in a suitable format such as Backus-Naur form (BNF) or Extended Backus-Naur form (EBNF). Both of these formats can be recognized by this class’s initialization method __init__.

    • Visualize a grammar in form of a syntax diagram with the method plot.

    • Recognize whether a given string belongs to the grammar’s language or not with the method recognize_string.

    • Parse a given string to see how it can be derived from the start symbol by a specific sequence of rule applications with the method parse_string. The result is a parse tree or derivation tree, an instance of the class DerivationTree. A derivation tree represents a single string by reading its leaf nodes from left to right, but many possible derivations that lead to it, since the sequence of rule applications is not fixed in the tree.

    • Generate a random derivation, derivation tree or string with the methods generate_derivation, generate_derivation_tree, generate_string.

    • Generate the grammar’s language or a finite subset of it with the method generate_language.

    • Convert a grammar into a specific normal form or check if it is in one with the methods to_cnf, is_cnf, to_gnf, is_gnf, to_bnf, is_bnf. Normal forms put further restrictions on the form of the production rules, but importantly, converting a grammar into it does not change its language.

Notes

The concepts and notation mentioned here are based on classical textbooks in formal language theory such as [1]. Similar discussions can be found on the web [2].

References

__init__(self, bnf_text=None, bnf_file=None, ebnf_text=None, ebnf_file=None, **kwargs)[source]

Create a grammar from a string or file in BNF or EBNF notation.

Backus-Naur form (BNF) and Extended Backus-Naur form (EBNF) [3] are two well-known and often used formats for defining context-free grammars.

Parameters:
  • bnf_text (str, optional) – String that contains a grammar in BNF.

  • bnf_file (str, optional) – Filepath of a text file that contains a grammar in BNF.

  • ebnf_text (str, optional) – String that contains a grammar in EBNF.

  • ebnf_file (str, optional) – Filepath of a text file that contains a grammar in EBNF.

  • **kwargs (dict, optional) – Further keyword arguments are forwarded to the function that reads and creates a grammar from text in BNF or EBNF notation. The available arguments and further details can be found in following methods:

Raises:
  • TypeError – If an argument gets a value of unexpected type.

  • FileNotFoundError – If a filepath does not point to an existing file.

  • GrammarError – If a newly generated grammar does not pass all validity checks. For example, each nonterminal that appears on the right-hand side must also appear on the left-hand side.

Warns:

GrammarWarning – If a newly generated grammar contains a production rule more than once.

References

Examples

Use text in Backus-Naur form (BNF) to define a grammar:

>>> import alogos as al
>>> bnf_text = """
<S> ::= <Greeting> _ <Object> !
<Greeting> ::= Hello | Hi | Hola
<Object> ::= World | Universe | Cosmos
"""
>>> grammar = al.Grammar(bnf_text=bnf_text)
>>> grammar.generate_string()
Hello_World!

Use text in or Extended Backus-Naur form (EBNF) to define a grammar:

>>> import alogos as al
>>> ebnf_text = """
S = Greeting " " Object "!"
Greeting = "Hello" | "Hi" | "Hola"
Object = "World" | "Universe" | "Cosmos"
"""
>>> grammar = al.Grammar(ebnf_text=ebnf_text)
>>> grammar.generate_string()
Hola Universe!

Use text in an unusual form to define a grammar by passing custom symbol definitions as keyword arguments:

>>> import alogos as al
>>> bnf_text = """
[English Sentence] = [Simple Sentence]
[Simple Sentence] = [Declarative Sentence]
[Declarative Sentence] = [subject] [predicate]
[subject] = [simple subject]
[simple subject] = [nominative personal pronoun]
[nominative personal pronoun] = "I" | "you" | "he" | "she" | "it" | "we" | "they"
[predicate] = [verb]
[verb] = [linking verb]
[linking verb] = "am" |"are" |"is" | "was"| "were"
"""
>>> grammar = al.Grammar(
    bnf_text=bnf_text,
    defining_symbol="=",
    start_nonterminal_symbol="[",
    end_nonterminal_symbol="]",
    start_terminal_symbol='"',
    end_terminal_symbol='"',
)
>>> grammar.generate_string(separator=' ')
I am
copy(self)[source]

Create a deep copy of the grammar.

The new object is entirely independent of the original object. No parts, such as symbols or rules, are shared between the objects.

from_bnf_text(self, bnf_text, defining_symbol='::=', rule_separator_symbol='|', start_nonterminal_symbol='<', end_nonterminal_symbol='>', start_terminal_symbol='', end_terminal_symbol='', start_terminal_symbol2='', end_terminal_symbol2='', verbose=False)[source]

Read a grammar specification in BNF notation from a string.

This method resets the grammar object and then uses the provided information.

Parameters:

bnf_text (str) – String with grammar specification in BNF notation.

Other Parameters:
  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

  • verbose (bool, optional) – If True, messages will be printed during processing the input text that show which rules and symbols are found one after another. This can be useful to see what went wrong when the generated grammar does not look or behave as expected.

from_bnf_file(self, filepath, defining_symbol='::=', rule_separator_symbol='|', start_nonterminal_symbol='<', end_nonterminal_symbol='>', start_terminal_symbol='', end_terminal_symbol='', start_terminal_symbol2='', end_terminal_symbol2='', verbose=False)[source]

Read a grammar specification in BNF notation from a file.

This method resets the grammar object and then uses the provided information.

Parameters:

filepath (str) – Text file with grammar specification in BNF notation.

Other Parameters:
  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

  • verbose (bool, optional) – If True, messages will be printed during processing the input text that show which rules and symbols are found one after another. This can be useful to see what went wrong when the generated grammar does not look or behave as expected.

from_ebnf_text(self, ebnf_text, defining_symbol='=', rule_separator_symbol='|', start_nonterminal_symbol='', end_nonterminal_symbol='', start_terminal_symbol='"', end_terminal_symbol='"', start_terminal_symbol2="'", end_terminal_symbol2="'", verbose=False)[source]

Read a grammar specification in EBNF notation from a string.

This method resets the grammar object and then uses the provided information.

Parameters:

ebnf_text (str) – String with grammar specification in EBNF notation.

Other Parameters:
  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

  • verbose (bool, optional) – If True, messages will be printed during processing the input text that show which rules and symbols are found one after another. This can be useful to see what went wrong when the generated grammar does not look or behave as expected.

from_ebnf_file(self, filepath, defining_symbol='=', rule_separator_symbol='|', start_nonterminal_symbol='', end_nonterminal_symbol='', start_terminal_symbol='"', end_terminal_symbol='"', start_terminal_symbol2="'", end_terminal_symbol2="'", verbose=False)[source]

Read a grammar specification in EBNF notation from a file.

This method resets the grammar object and then uses the provided information.

Parameters:

filepath (str) – Text file with grammar specification in EBNF notation.

Other Parameters:
  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

  • verbose (bool, optional) – If True, messages will be printed during processing the input text that show which rules and symbols are found one after another. This can be useful to see what went wrong when the generated grammar does not look or behave as expected.

to_bnf_text(self, rules_on_separate_lines=True, defining_symbol='::=', rule_separator_symbol='|', start_nonterminal_symbol='<', end_nonterminal_symbol='>', start_terminal_symbol='', end_terminal_symbol='', start_terminal_symbol2='', end_terminal_symbol2='')[source]

Write the grammar in BNF notation to a string.

Parameters:
  • rule_on_separate_lines (bool, optional) – If True, each rule for a nonterminal is put onto a separate line. If False, all rules for a nonterminal are grouped onto one line.

  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

to_bnf_file(self, filepath, rules_on_separate_lines=True, defining_symbol='::=', rule_separator_symbol='|', start_nonterminal_symbol='<', end_nonterminal_symbol='>', start_terminal_symbol='', end_terminal_symbol='', start_terminal_symbol2='', end_terminal_symbol2='')[source]

Write the grammar in BNF notation to a text file.

Parameters:

filepath (str) – Filepath of the text file that shall be generated.

Other Parameters:
  • rule_on_separate_lines (bool, optional) – If True, each rule for a nonterminal is put onto a separate line. If False, all rules for a nonterminal are grouped onto one line.

  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str) – Alternative symbol indicating the end of a terminal.

to_ebnf_text(self, rules_on_separate_lines=True, defining_symbol='=', rule_separator_symbol='|', start_nonterminal_symbol='', end_nonterminal_symbol='', start_terminal_symbol='"', end_terminal_symbol='"', start_terminal_symbol2='"', end_terminal_symbol2='"')[source]

Write the grammar in EBNF notation to a string.

Parameters:
  • rule_on_separate_lines (bool) – If True, each rule for a nonterminal is put onto a separate line. If False, all rules for a nonterminal are grouped onto one line.

  • defining_symbol (str) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str) – Alternative symbol indicating the end of a terminal.

to_ebnf_file(self, filepath, rules_on_separate_lines=True, defining_symbol='=', rule_separator_symbol='|', start_nonterminal_symbol='', end_nonterminal_symbol='', start_terminal_symbol='"', end_terminal_symbol='"', start_terminal_symbol2='"', end_terminal_symbol2='"')[source]

Write the grammar in EBNF notation to a text file.

Parameters:

filepath (str) – Filepath of the text file that shall be generated.

Other Parameters:
  • rule_on_separate_lines (bool, optional) – If True, each rule for a nonterminal is put onto a separate line. If False, all rules for a nonterminal are grouped onto one line.

  • defining_symbol (str, optional) – Symbol between left-hand side and right-hand side of a rule.

  • rule_separator_symbol (str, optional) – Symbol between alternative productions of a rule.

  • start_nonterminal_symbol (str, optional) – Symbol indicating the start of a nonterminal.

  • end_nonterminal_symbol (str, optional) – Symbol indicating the end of a nonterminal.

  • start_terminal_symbol (str, optional) – Symbol indicating the start of a terminal.

  • end_terminal_symbol (str, optional) – Symbol indicating the end of a terminal.

  • start_terminal_symbol2 (str, optional) – Alternative symbol indicating the start of a terminal.

  • end_terminal_symbol2 (str, optional) – Alternative symbol indicating the end of a terminal.

generate_derivation_tree(self, method='weighted', *args, **kwargs)[source]

Generate a derivation tree with a chosen method.

One derivation tree represents one string but many derivations [4].

  • The leaf nodes of the tree read from left to right form a string of the language defined by the grammar.

  • A depth-first walk over the tree that always chooses the leftmost item gives the leftmost derivation, while one that always chooses the rightmost item gives the rightmost derivation. In most cases, there can be many more ways to expand one nonterminal after another, leading to a plethora of possible derivations, all represented by the same derivation tree.

Parameters:
  • method (str, optional) – Name of the method that is used for creating a derivation tree.

    Possible values:

  • **kwargs (dict, optional) – Further keyword arguments are forwarded to the chosen method. This is especially relevant for controlling the behavior of the probabilistic methods.

Returns:

derivation_tree (DerivationTree)

Raises:

MappingError – If the method fails to generate a derivation tree, e.g. because a limit like maximum number of expansions was reached before all leaves contained terminal symbols

References

generate_derivation(self, method='weighted', *args, derivation_order='leftmost', newline=True, **kwargs)[source]

Generate a derivation with a chosen method.

Parameters:
  • method (str, optional) – Name of the method that is used for creating a derivation.

    Possible values: See generate_derivation_tree.

  • derivation_order (str, optional) – Order in which nonterminals are expanded during the step-by-step derivation.

    Possible values:

    • "leftmost": Expand the leftmost nonterminal in each step.

    • "rightmost": Expand the rightmost nonterminal in each step.

    • "random": Expand a random nonterminal in each step.

  • newline (bool, optional) – If True, the derivation steps are placed on separate lines by adding newline characters between them.

  • **kwargs (dict, optional) – Further keyword arguments are forwarded to the chosen method.

Returns:

derivation (str) – The derivation generated with the chosen method. It is a text that shows each step of the derivation as it is commonly represented in textbooks.

Notes

This method uses generate_derivation_tree to create a derivation tree and then reads the leftmost derivation contained in it with derivation.

generate_string(self, method='weighted', *args, separator='', **kwargs)[source]

Generate a string with a chosen method.

Each string [5] is part of the language L(G) defined by grammar G.

Parameters:
  • method (str, optional) – Name of the method that is used for creating a string.

    Possible values: See generate_derivation_tree.

  • separator (str, optional) – A short string that is inserted between each of the terminal symbols that make up the string of the grammar’s language.

  • **kwargs (dict, optional) – Further keyword arguments are forwarded to the chosen method.

Returns:

string (str) – The string generated with the chosen method. It is a text that consists only of terminal symbols and optionally a separator symbol between them.

Notes

This method uses generate_derivation_tree to create a derivation tree and then reads the string contained in it from the leaf nodes with string.

References

generate_language(self, max_steps=None, sort_order='discovered', verbose=None, return_details=None)[source]

Generate the formal language defined by the grammar.

This algorithm recursively constructs the formal language [6] defined by the grammar, i.e. the set of strings it can generate or recognize.

The algorithm can be stopped prematurely to get only a subset of the entire language. By default, only the language of the start symbol is returned, which is the language of the grammar, but it is also possible to return the languages of each other nonterminal symbol, which can be thought of as sublanguages. For example, many programming languages are defined by a CFG and have a nonterminal symbol that represents all valid integer literals in that language.

Parameters:
  • max_steps (int, optional) – The maximum number of recursive steps during language generation. It can be used to stop the algorithm before it can construct all strings of the language. Instead a list of valid strings found so far will be returned, which is a subset of the entire language. This is necessary to get a result if the grammar defines a very large or infinite language.

    Note that each recursive step uses the strings known so far and inserts them into the right-hand sides of production rules to see if any new strings can be discovered. Therefore simpler strings are found before more complex ones that require more expansions. If the number of steps is too little to form a single string belonging to the language of the start symbol, the result will be an empty list.

  • sort_order (str, optional) – The language is returned as a list of strings, which can be sorted in different ways.

    Possible values:

    • "discovered": Strings are returned in the order they were discovered.

    • "lex": Lexicographic, i.e. the order used in lexicons, which means the alphabetic order extended to non-alphabet characters like numbers. Python’s built-in sort() function delivers it by default.

    • "shortlex": Strings are sorted primarily by their length. Those with the same length are further sorted in lexicographic order.

  • verbose (bool, optional) – If True, detailed messages are printed during language generation. If False, no output is generated.

  • return_details (bool, optional) – If True, the return value is a dict with nonterminals as keys and their languages as values. The language of the start symbol is the language of the grammar, but each nonterminal has its own sub-language that can be of interest too.

Returns:

language (list of str, or dict with list of str as values) – The formal language L(G) defined by the grammar G. If the argument return_details is set to True, the return value is a dict where each key is a nonterminal of the grammar and each value the language (set of strings) of the nonterminal.

Warns:

GrammarWarning – If no value is provided for the argument max_steps, internally an unrealistically large value of is assigned to it. In the unlikely case this is ever reached, a warning will be raised if the language generation did not generate all strings of the language.

References

recognize_string(self, string, parser='earley')[source]

Test if a string belongs to the language of the grammar.

Parameters:
  • string (str) – Candidate string which can be recognized to be a member of the grammar’s language.

  • parser (str, optional) – Parsing algorithm used to analyze the string.

    Possible values: See parse_string

Returns:

recognized (bool) – True if the string belongs to the grammar’s language, False if it does not.

parse_string(self, string, parser='earley', get_multiple_trees=False, max_num_trees=None)[source]

Try to parse a given string with a chosen parsing algorithm.

Parsing means the analysis of a string into its parts or constituents [7].

Parameters:
  • string (str) – Candidate string which can only be parsed successfully if it is a member of the grammar’s language.

  • parser (str, optional) – Parsing algorithm used to analyze the string. This package uses parsers from the package Lark [8].

    Possible values:

    • "earley": Can parse any context-free grammar.

      Performance: The algorithm has a time complexity of O(n^3), but if the grammar is unambiguous it is reduced to O(n^2) and for most LR grammars it is O(n).

    • "lalr": Can parse only a subset of context-free grammars, which have a form that allows very efficient parsing with a LALR(1) parser.

      Performance: The algorithm has a time complexity of O(n).

  • get_multiple_trees (bool, optional) – If True, a list of parse trees, also known as parse forest, is returned instead of a single parse tree object.

  • max_num_trees (int, optional) – An upper limit on how many parse trees will be returned at maximum.

Returns:

parse_tree (DerivationTree, or list of DerivationTree) – If the argument get_multiple_trees is set to True, a list of derivation trees is returned instead of a single derivation tree object. The list can contain one or more trees, dependening on how many ways there are to parse the given string. If a grammar is unambiguous, there is only one way. If it is ambuguous, there can be multiple ways to derive the same string in a leftmost derivation, which is captured by different derivation trees.

Raises:

ParserError – If the string does not belong to the language and therefore no parse tree exists for it.

Notes

If the grammar is ambiguous, there can be more than one way to parse a given string, which means that there are multiple parse trees for it. By default, only one of these trees is returned, but the argument get_multiple_trees allows to get all of them. Caution: This feature is currently only supported with Lark’s Earley parser.

References

plot(self)[source]

Generate a figure containing a syntax diagram of the grammar.

Returns:

fig (Figure) – Figure object containing the plot, allowing to display or export it.

Notes

Syntax diagrams (a.k.a. railroad diagrams) [9] are especially useful for representing EBNF specifications of a grammar, because they capture nicely the extended notations that are introduced by EBNF, e.g. optional or repeated items.

This package supports reading a grammar specification from EBNF text. Internally, however, EBNF is automatically converted to a simpler form during the reading process, which is done by removing any occurrence of extended notation and expressing it with newly introduced symbols and rules instead. Only the final version of the grammar can be visualized, which is essentially BNF with new helper rules and nonterminals. Therefore the expressive power of syntax diagrams is unfortunately not fully used here.

There are many websites with explanations [10] or examples [11] of syntax diagrams. Likewise, there are many libraries [12] for generating syntax diagrams. This package uses the library Railroad-Diagram Generator [13].

References

class alogos._grammar.data_structures.DerivationTree(grammar, root_symbol=None)[source]

Derivation tree for representing the structure of a string.

A derivation tree [22] is the result from generating or parsing a string with a grammar. For the latter reason, it is also known as parse tree [23].

One derivation tree represents a single string of the grammar’s language, but multiple derivations by which this string can be generated. These derivations differ only in the order in which the nonterminals are expanded, which is not captured by a derivation tree.

Notes

Structure:

  • The tree consists of linked Node objects, each of which contains either a TerminalSymbol or NonterminalSymbol belonging to the grammar.

  • A rule of a context-free grammar can transform a nonterminal symbol into a sequence of symbols. In a derivation tree this is represented by a node, which contains the nonterminal symbol. This node is connected to one or more child nodes, which are ordered and contain the symbols resulting from the rule application. The parent node is said to be expanded.

  • The root node of the derivation tree contains the starting symbol of the grammar, which is a NonterminalSymbol.

  • Each internal node of the derivation tree contains a NonterminalSymbol.

  • In a fully expanded derivation tree, where no nonterminal symbol is left that could be further expanded, each leaf node contains a TerminalSymbol. Putting the terminal symbols into a sequence results in the string that the derivation tree represents.

Behaviour:

  • A node with a nonterminal can be expanded using a suitable production rule of the grammar.

  • Depth first traversal allows to get all terminals in correct order. This allows to retrieve the string represented by the derivation tree.

References

__init__(self, grammar, root_symbol=None)[source]

Create an empty derivation tree with reference to a grammar.

Parameters:
copy(self)[source]

Create a deep copy of the derivation tree.

The new object is entirely independent of the original object. No parts, such as nodes or symbols, are shared between the objects.

to_parenthesis_notation(self)[source]

Represent the tree as string in single-line parenthesis notation.

This notation can be found in the Natural Language Toolkit (NLTK) book under the name “bracketed structures” [24]. There are various similar representations of trees or nested structures [25].

Returns:

text (str) – Tree in parenthesis notation.

References

to_tree_notation(self)[source]

Represent the tree as string in multi-line tree notation.

This notation is a way to represent both code and data in a minimalistic format [26] based on newlines and different indentation levels. As such it is also suitable to represent derivation trees.

Returns:

text (str) – Tree in “Tree Notation”.

References

to_tuple(self)[source]

Serialize the tree to a tuple.

Notes

The data structure is a tuple that contains two other tuples of integers. A depth-first traversal of the tree visits all nodes. For each node, its symbol and number of children is remembered in two separate tuples. Instead of storing the symbols directly, a number is assigned to each symbol of the grammar and that concise number is stored instead of a potentially long symbol text.

Returns:

serialized_tree (tuple of two tuple objects) – The first tuple describes symbols that are contained in the nodes. The second tuple describes the number of children each node has.

from_tuple(self, serialized_tree)[source]

Deserialize the tree from a tuple.

Parameters:

serialized_tree (tuple containing two tuple objects) – The first tuple describes symbols that are contained in the nodes. The second tuple describes the number of children each node has.

to_json(self)[source]

Serialize the tree to a JSON string.

Returns:

json_string (str) – A string in JSON format that represents the tree.

from_json(self, serialized_tree)[source]

Deserialize the tree from a JSON string.

Parameters:

serialized_tree (str) – A string in JSON format that represents a tree.

plot(self, show_node_indices=None, layout_engine=None, fontname=None, fontsize=None, shape_nt=None, shape_unexpanded_nt=None, shape_t=None, fontcolor_nt=None, fontcolor_unexpanded_nt=None, fontcolor_t=None, fillcolor_nt=None, fillcolor_unexpanded_nt=None, fillcolor_t=None)[source]

Plot the derivation tree as labeled, directed graph.

This method uses Graphviz [27] and therefore requires it to be installed on the system.

Parameters:
  • show_node_indices (bool, optional) – If True, nodes will contain numbers that indicate the order in which they were created during tree construction.

  • layout_engine (str, optional) – Layout engine that calculates node positions.

    Possible values:

    • "circo"

    • "dot"

    • "fdp"

    • "neato"

    • "osage"

    • "patchwork"

    • "sfdp"

    • "twopi"

  • fontname (str, optional) – Fontname of text inside nodes.

  • fontsize (int or str, optional) – Fontsize of text inside nodes.

  • shape_nt (str, optional) – Shape of nodes that represent expanded nonterminals.

    Possible values: See Graphviz documentation: Node shapes

  • shape_unexpanded_nt (str, optional) – Shape of nodes that represent unexpanded nonterminals.

    Possible values: See Graphviz documentation: Node shapes

  • shape_t (str, optional) – Shape of nodes that represent terminals.

    Possible values: See Graphviz documentation: Node shapes

  • fontcolor_nt (str, optional) – Fontcolor of nodes that represent expanded nonterminals.

  • fontcolor_unexpanded_nt (str, optional) – Fontcolor of nodes that represent unexpanded nonterminals.

  • fontcolor_t (str, optional) – Fontcolor of nodes that represent terminals.

  • fillcolor_nt (str, optional) – Fillcolor of nodes that represent expanded nonterminals.

  • fillcolor_unexpanded_nt (str, optional) – Fillcolor of nodes that represent unexpanded nonterminals.

  • fillcolor_t (str, optional) – Fillcolor of nodes that represent terminals.

Returns:

fig (Figure) – Figure object containing the plot, allowing to display or export it.

References

nodes(self, order='dfs')[source]

Get all nodes as a list in order of a chosen tree traversal.

The nodes in a tree can be visited in different orders with different tree traversal methods [28].

Parameters:

order (str, optional) – Possible values:

  • "dfs": Traversal in order of a depth-first search

  • "bfs": Traversal in order of a breadth-first search

Returns:

nodes (list of Node objects)

References

leaf_nodes(self)[source]

Get all leaf nodes by a tree traversal in depth-first order.

Returns:

nodes (list of Node objects)

internal_nodes(self)[source]

Get all internal nodes by a tree traversal in depth-first order.

Returns:

nodes (list of Node objects)

tokens(self)[source]

Get a sequence of tokens in leaf nodes from left to right.

Returns:

tokens (list of Symbol objects)

string(self, separator='')[source]

Get the string contained in the leaf nodes of the tree.

  • If the tree is fully expanded, no nonterminal symbol is left in the leaf nodes, so the obtained string is composed only of terminals and belongs to the language of the grammar.

  • If the tree is not fully expanded, the result is not a string but a so called sentential form that still includes nonterminal symbols.

Parameters:

separator (str, optional) – The separator text used between terminal symbols.

Returns:

string (str)

derivation(self, derivation_order='leftmost', newline=True)[source]

Get a derivation that fits to the structure of the tree.

Parameters:
  • derivation_order (str, optional) – Order in which nonterminals are expanded during the step-by-step derivation.

    Possible values:

    • "leftmost": Expand the leftmost nonterminal in each step.

    • "rightmost": Expand the rightmost nonterminal in each step.

    • "random": Expand a random nonterminal in each step.

  • newline (bool, optional) – If True, the derivation steps are placed on separate lines by adding newline characters between them.

Returns:

derivation (str)

num_expansions(self)[source]

Calculate the number of expansions contained in the tree.

Returns:

num_expansions (int)

num_nodes(self)[source]

Calculate the number of nodes contained in the tree.

Returns:

num_nodes (int)

depth(self)[source]

Calculate the depth of the derivation tree.

The depth [29] [30] [31] of a tree is the number of edges in the longest path (in the graph-theoretic sense) from the root node to a leaf node.

Returns:

depth (int)

References

is_completely_expanded(self)[source]

Check if the tree contains only expanded nonterminal symbols.

Returns:

is_completely_expanded (bool) – If True, the tree is fully expanded which means that it contains only terminals in its leave nodes and that it represents complete derivations that lead to a string of the grammar’s language.

class alogos._grammar.data_structures.Node(sy, ch=None)[source]

Node inside a derivation tree.

A node contains a symbol and refers to child nodes.

__init__(self, sy, ch=None)[source]

Create a node.

Parameters:
  • sy (Symbol)

  • ch (list of child Node objects, optional)

copy(self)[source]

Create a deep copy of the node and its children.

contains_terminal(self)[source]

Check if the node contains a terminal symbol.

Returns:

contains_t (bool)

contains_nonterminal(self)[source]

Check if the node contains a nonterminal symbol.

Returns:

contains_nt (bool)

contains_unexpanded_nonterminal(self)[source]

Check if the node contains a nonterminal symbol and has no child nodes.

Returns:

contains_unexpanded_nt (bool)

class alogos._grammar.data_structures.Symbol(text)[source]

Symbol inside a grammar or derivation tree.

A symbol can be either a nonterminal or terminal of the grammar and it has a text attribute.

__init__(self, text)[source]

Create a symbol in the sense of formal language theory.

A symbol contains a text that can be empty, a single letter or multiple letters.

Parameters:

text (str) – Text contained in the symbol object.

copy(self)[source]

Create a deep copy of the symbol.

class alogos._grammar.data_structures.NonterminalSymbol(text)[source]

Bases: Symbol

Nonterminal symbol inside a grammar or derivation tree.

class alogos._grammar.data_structures.TerminalSymbol(text)[source]

Bases: Symbol

Terminal symbol inside a grammar or derivation tree.