Bigrams

This Jupyter notebook provides an example of using the Python package gravis. The .ipynb file can be found here.

It uses the Natural Language Toolkit (NLTK) to extract word bigrams from a text and filters them by simple criteria (minimum frequency, no stop words, no single-character or non-alphanumeric tokens) to obtain a list of relevant ones. Each bigram is an ordered pair of adjacent words, so a list of bigrams can be interpreted as a directed graph: words are nodes, word pairs are edges, and the frequency of a word pair determines the edge width.
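
To make the graph interpretation concrete, here is a minimal sketch that uses only the standard library rather than NLTK or NetworkX. The helper name `count_bigrams` and the toy sentence are assumptions made for illustration and do not appear in the notebook.

```python
from collections import Counter


def count_bigrams(words):
    """Count adjacent word pairs; each pair (w1, w2) is a directed edge w1 -> w2."""
    return Counter(zip(words, words[1:]))


# Hypothetical toy input, not taken from the notebook's corpora
words = "the cat saw the cat and the dog".split()
edge_weights = count_bigrams(words)

# Each distinct pair becomes one edge; its count would drive the edge width
for (source, target), weight in edge_weights.items():
    print(f"{source} -> {target} (width {weight})")
```

A sequence of n words always yields n - 1 bigrams, which matches the "Number of words" and "Number of bigrams" lines printed further down.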

[1]:
import gravis as gv
import networkx as nx
import nltk
[2]:
# Download text corpora, if not already done before
nltk.download('gutenberg')
nltk.download('stopwords')
[nltk_data] Downloading package gutenberg to /home/r/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /home/r/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[2]:
True
[3]:
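def text_to_bigrams_and_counts(text, min_count=3):
    """Return a dict mapping each relevant bigram of a Gutenberg text to its count."""
    # Check that the text is part of NLTK's Gutenberg corpus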
    known_texts = nltk.corpus.gutenberg.fileids()
    if text not in known_texts:
        message = 'Unknown text "{}".\nPossible values: {}'.format(text, known_texts)
        raise ValueError(message)

    # Words
    words = [word.lower() for word in nltk.corpus.gutenberg.words(text)]
    print('Number of words:', len(words))

    # Bigrams
    bigrams = list(nltk.bigrams(words))
    print('Number of bigrams:', len(bigrams))

    # Bigram counts
    bigrams_counted = {}
    for bg in bigrams:
        try:
            bigrams_counted[bg] += 1
        except KeyError:
            bigrams_counted[bg] = 1
    print('Number of unique bigrams:', len(bigrams_counted))

    # Relevant bigrams
    # A set makes the membership test in include_bigram fast
    stop_words = set(nltk.corpus.stopwords.words('english'))
    def include_bigram(bigram):
        count = bigrams_counted[bigram]
        if count < min_count:
            return False
        for word in bigram:
            if len(word) <= 1:
                return False
            if word in stop_words:
                return False
            if not word.isalnum():
                return False
        return True

    filtered_bigrams = [bg for bg in bigrams if include_bigram(bg)]
    filtered_bigrams = list(set(filtered_bigrams))
    print('Number of filtered bigrams:', len(filtered_bigrams))

    # Relevant bigrams with counts
    filtered_bigrams_and_counts = {bg: bigrams_counted[bg] for bg in filtered_bigrams}
    return filtered_bigrams_and_counts


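def bigram_counts_to_graph(bg_cnt):
    """Build a directed graph with one edge per bigram and its count as edge attribute."""
    graph = nx.DiGraph()
    for bigram, count in bg_cnt.items():
        word1, word2 = bigram
        graph.add_edge(word1, word2, count=count)
    # Node size grows with the number of incoming edges (in-degree)
    for node_id in graph.nodes:
        node = graph.nodes[node_id]
        node['size'] = (graph.in_degree[node_id] + 1) * 3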
    print()
    print('Graph with {} nodes and {} edges.'.format(len(graph.nodes), len(graph.edges)))
    return graph
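
The filter inside `text_to_bigrams_and_counts` can be tried in isolation without downloading any corpus. The following is a rough sketch under stated assumptions: `is_relevant`, the toy counts, and the two-word stop word set are hypothetical stand-ins, whereas the notebook itself uses NLTK's English stop word list and counts from the actual text.

```python
def is_relevant(bigram, count, min_count, stop_words):
    """Apply the same rules as the notebook's filter: the bigram is frequent
    enough, and both words are longer than one character, are not stop words,
    and are purely alphanumeric."""
    if count < min_count:
        return False
    return all(
        len(word) > 1 and word not in stop_words and word.isalnum()
        for word in bigram
    )


# Hypothetical counts and a tiny stand-in stop word set
toy_counts = {
    ("miss", "woodhouse"): 7,   # kept
    ("of", "the"): 50,          # dropped: stop words
    ("white", "rabbit"): 2,     # dropped: count below min_count
    ("said", ","): 9,           # dropped: punctuation token
}
toy_stop_words = {"of", "the"}

relevant = {
    bg: n
    for bg, n in toy_counts.items()
    if is_relevant(bg, n, min_count=3, stop_words=toy_stop_words)
}
print(relevant)
```

Passing the count and stop words as arguments, instead of closing over them as the notebook does, is only a choice made here to keep the sketch self-contained.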
[4]:
for text in ['austen-emma.txt', 'carroll-alice.txt', 'melville-moby_dick.txt', 'shakespeare-caesar.txt']:
    print(text)
    print('-' * len(text))
    bigrams_and_counts = text_to_bigrams_and_counts(text, min_count=5)
    graph = bigram_counts_to_graph(bigrams_and_counts)
    fig = gv.d3(
        graph,
        edge_size_data_source='count',
        use_edge_size_normalization=True,
        zoom_factor=0.5,
    )
    fig.display(inline=True)
    print()
austen-emma.txt
---------------
Number of words: 192427
Number of bigrams: 192426
Number of unique bigrams: 64601
Number of filtered bigrams: 275

Graph with 230 nodes and 275 edges.

carroll-alice.txt
-----------------
Number of words: 34110
Number of bigrams: 34109
Number of unique bigrams: 14864
Number of filtered bigrams: 42

Graph with 61 nodes and 42 edges.

melville-moby_dick.txt
----------------------
Number of words: 260819
Number of bigrams: 260818
Number of unique bigrams: 114181
Number of filtered bigrams: 291

Graph with 283 nodes and 291 edges.

shakespeare-caesar.txt
----------------------
Number of words: 25833
Number of bigrams: 25832
Number of unique bigrams: 14335
Number of filtered bigrams: 34

Graph with 45 nodes and 34 edges.