Bigrams
This Jupyter notebook provides an example of using the Python package gravis. The .ipynb file can be found here.
It uses the Natural Language Toolkit (NLTK) to extract word bigrams from a text and filter them by simple criteria to keep the relevant ones. Each bigram is a pair of words, so a list of bigrams can be interpreted as a directed graph: words become nodes, word pairs become edges, and the frequency of a word pair determines the edge width.
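The pairing step can be illustrated without NLTK: zipping a word list with itself shifted by one yields the same tuples that `nltk.bigrams` produces. This is a minimal pure-Python sketch (the word list is an invented example, not from the corpora used below):

```python
# Pair each word with its successor to obtain bigrams,
# matching the tuples that nltk.bigrams() yields.
words = ['the', 'white', 'whale', 'swam']
bigrams = list(zip(words, words[1:]))
print(bigrams)  # → [('the', 'white'), ('white', 'whale'), ('whale', 'swam')]
```

Note that a text of n words always yields n - 1 bigrams, which matches the counts printed in the output below.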
References
Wikipedia
NLTK
[1]:
import gravis as gv
import networkx as nx
import nltk
[2]:
# Download text corpora, if not already done before
nltk.download('gutenberg')
nltk.download('stopwords')
[nltk_data] Downloading package gutenberg to /home/r/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /home/r/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[2]:
True
[3]:
def text_to_bigrams_and_counts(text, min_count=3):
    # Text
    known_texts = nltk.corpus.gutenberg.fileids()
    if text not in known_texts:
        message = 'Unknown text "{}".\nPossible values: {}'.format(text, known_texts)
        raise ValueError(message)

    # Words
    words = [word.lower() for word in nltk.corpus.gutenberg.words(text)]
    print('Number of words:', len(words))

    # Bigrams
    bigrams = list(nltk.bigrams(words))
    print('Number of bigrams:', len(bigrams))

    # Bigram counts
    bigrams_counted = {}
    for bg in bigrams:
        try:
            bigrams_counted[bg] += 1
        except KeyError:
            bigrams_counted[bg] = 1
    print('Number of unique bigrams:', len(bigrams_counted))

    # Relevant bigrams
    stop_words = nltk.corpus.stopwords.words('english')

    def include_bigram(bigram):
        count = bigrams_counted[bigram]
        if count < min_count:
            return False
        for word in bigram:
            if len(word) <= 1:
                return False
            if word in stop_words:
                return False
            if not word.isalnum():
                return False
        return True

    filtered_bigrams = [bg for bg in bigrams if include_bigram(bg)]
    filtered_bigrams = list(set(filtered_bigrams))
    print('Number of filtered bigrams:', len(filtered_bigrams))

    # Relevant bigrams with counts
    filtered_bigrams_and_counts = {bg: bigrams_counted[bg] for bg in filtered_bigrams}
    return filtered_bigrams_and_counts


def bigram_counts_to_graph(bg_cnt):
    graph = nx.DiGraph()
    for bigram, count in bg_cnt.items():
        word1, word2 = bigram
        graph.add_edge(word1, word2, count=count)
    for node_id in graph.nodes:
        node = graph.nodes[node_id]
        node['size'] = (graph.in_degree[node_id] + 1) * 3
    print()
    print('Graph with {} nodes and {} edges.'.format(len(graph.nodes), len(graph.edges)))
    return graph
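The try/except counting loop in `text_to_bigrams_and_counts` can also be written with `collections.Counter` from the standard library, which produces the same mapping in one call. A small sketch on an invented word list (not one of the corpora used here):

```python
from collections import Counter

# Count bigram occurrences in one step instead of the
# try/except loop over a plain dict.
words = ['to', 'be', 'or', 'not', 'to', 'be']
bigrams = list(zip(words, words[1:]))
bigrams_counted = Counter(bigrams)
print(bigrams_counted[('to', 'be')])  # → 2
```

`Counter` is a dict subclass, so the rest of the function (lookups in `include_bigram`, the final comprehension) would work unchanged.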
[4]:
for text in ['austen-emma.txt', 'carroll-alice.txt', 'melville-moby_dick.txt', 'shakespeare-caesar.txt']:
    print(text)
    print('-' * len(text))
    bigrams_and_counts = text_to_bigrams_and_counts(text, min_count=5)
    graph = bigram_counts_to_graph(bigrams_and_counts)
    fig = gv.d3(
        graph,
        edge_size_data_source='count',
        use_edge_size_normalization=True,
        zoom_factor=0.5,
    )
    fig.display(inline=True)
    print()
austen-emma.txt
---------------
Number of words: 192427
Number of bigrams: 192426
Number of unique bigrams: 64601
Number of filtered bigrams: 275
Graph with 230 nodes and 275 edges.
carroll-alice.txt
-----------------
Number of words: 34110
Number of bigrams: 34109
Number of unique bigrams: 14864
Number of filtered bigrams: 42
Graph with 61 nodes and 42 edges.
melville-moby_dick.txt
----------------------
Number of words: 260819
Number of bigrams: 260818
Number of unique bigrams: 114181
Number of filtered bigrams: 291
Graph with 283 nodes and 291 edges.
shakespeare-caesar.txt
----------------------
Number of words: 25833
Number of bigrams: 25832
Number of unique bigrams: 14335
Number of filtered bigrams: 34
Graph with 45 nodes and 34 edges.