Bigrams
This Jupyter notebook provides an example of using the Python package gravis. The .ipynb file can be found here.
It uses the Natural Language Toolkit (NLTK) to extract word bigrams from a text and filter them by simple criteria to keep the relevant ones. Each bigram is a pair of words, so a list of bigrams can be interpreted as a directed graph: words become nodes, word pairs become edges, and the frequency of a word pair determines the edge width.
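The pairing step can be illustrated without NLTK: zipping a word list with itself shifted by one yields the same tuples that `nltk.bigrams` produces. This is a minimal pure-Python sketch (the word list is an invented example, not from the corpora used below):

```python
# Pair each word with its successor to obtain bigrams,
# matching the tuples that nltk.bigrams() yields.
words = ['the', 'white', 'whale', 'swam']
bigrams = list(zip(words, words[1:]))
print(bigrams)  # → [('the', 'white'), ('white', 'whale'), ('whale', 'swam')]
```

Note that a text of n words always yields n - 1 bigrams, which matches the counts printed in the output below.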
References
Wikipedia
NLTK
[1]:
import gravis as gv
import networkx as nx
import nltk
[2]:
# Download text corpora, if not already done before
nltk.download('gutenberg')
nltk.download('stopwords')
[nltk_data] Downloading package gutenberg to /home/r/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /home/r/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[2]:
True
[3]:
def text_to_bigrams_and_counts(text, min_count=3):
    # Text
    known_texts = nltk.corpus.gutenberg.fileids()
    if text not in known_texts:
        message = 'Unknown text "{}".\nPossible values: {}'.format(text, known_texts)
        raise ValueError(message)

    # Words
    words = [word.lower() for word in nltk.corpus.gutenberg.words(text)]
    print('Number of words:', len(words))

    # Bigrams
    bigrams = list(nltk.bigrams(words))
    print('Number of bigrams:', len(bigrams))

    # Bigram counts
    bigrams_counted = {}
    for bg in bigrams:
        try:
            bigrams_counted[bg] += 1
        except KeyError:
            bigrams_counted[bg] = 1
    print('Number of unique bigrams:', len(bigrams_counted))

    # Relevant bigrams
    stop_words = nltk.corpus.stopwords.words('english')

    def include_bigram(bigram):
        count = bigrams_counted[bigram]
        if count < min_count:
            return False
        for word in bigram:
            if len(word) <= 1:
                return False
            if word in stop_words:
                return False
            if not word.isalnum():
                return False
        return True

    filtered_bigrams = [bg for bg in bigrams if include_bigram(bg)]
    filtered_bigrams = list(set(filtered_bigrams))
    print('Number of filtered bigrams:', len(filtered_bigrams))

    # Relevant bigrams with counts
    filtered_bigrams_and_counts = {bg: bigrams_counted[bg] for bg in filtered_bigrams}
    return filtered_bigrams_and_counts


def bigram_counts_to_graph(bg_cnt):
    graph = nx.DiGraph()
    for bigram, count in bg_cnt.items():
        word1, word2 = bigram
        graph.add_edge(word1, word2, count=count)
    for node_id in graph.nodes:
        node = graph.nodes[node_id]
        node['size'] = (graph.in_degree[node_id] + 1) * 3
    print()
    print('Graph with {} nodes and {} edges.'.format(len(graph.nodes), len(graph.edges)))
    return graph
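The try/except counting loop in `text_to_bigrams_and_counts` can also be written with `collections.Counter` from the standard library, which produces the same mapping in one call. A small sketch on an invented word list (not one of the corpora used here):

```python
from collections import Counter

# Count bigram occurrences in one step instead of the
# try/except loop over a plain dict.
words = ['to', 'be', 'or', 'not', 'to', 'be']
bigrams = list(zip(words, words[1:]))
bigrams_counted = Counter(bigrams)
print(bigrams_counted[('to', 'be')])  # → 2
```

`Counter` is a dict subclass, so the rest of the function (lookups in `include_bigram`, the final comprehension) would work unchanged.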
[4]:
for text in ['austen-emma.txt', 'carroll-alice.txt', 'melville-moby_dick.txt', 'shakespeare-caesar.txt']:
    print(text)
    print('-' * len(text))
    bigrams_and_counts = text_to_bigrams_and_counts(text, min_count=5)
    graph = bigram_counts_to_graph(bigrams_and_counts)
    fig = gv.d3(
        graph,
        edge_size_data_source='count',
        use_edge_size_normalization=True,
        zoom_factor=0.5,
    )
    fig.display(inline=True)
    print()
austen-emma.txt
---------------
Number of words: 192427
Number of bigrams: 192426
Number of unique bigrams: 64601
Number of filtered bigrams: 275
Graph with 230 nodes and 275 edges.
carroll-alice.txt
-----------------
Number of words: 34110
Number of bigrams: 34109
Number of unique bigrams: 14864
Number of filtered bigrams: 42
Graph with 61 nodes and 42 edges.
melville-moby_dick.txt
----------------------
Number of words: 260819
Number of bigrams: 260818
Number of unique bigrams: 114181
Number of filtered bigrams: 291
Graph with 283 nodes and 291 edges.
shakespeare-caesar.txt
----------------------
Number of words: 25833
Number of bigrams: 25832
Number of unique bigrams: 14335
Number of filtered bigrams: 34
Graph with 45 nodes and 34 edges.