OREGANO¶

This notebook explores the biomedical knowledge graph provided by the project OREGANO: Publication (2023), Code, Data on figshare or Zenodo, Query interface

The source file of this notebook is oregano.ipynb and can be found in the repository awesome-biomedical-knowledge-graphs that also contains information about similar projects.

Table of contents¶

  1. Setup
  2. Data download
  3. Data import
  4. Data inspection
  5. Schema discovery
  6. Knowledge graph reconstruction
  7. Subgraph exploration

1. Setup¶

This section prepares the environment for the following exploratory data analysis.

a) Import packages¶

From the Python standard library.

In [1]:
import math
import os

From the Python Package Index (PyPI).

In [2]:
import gravis as gv  # for visualization of the KG schema and subgraphs, developed by the author of this notebook
import igraph as ig
import pandas as pd

From a local Python module named shared_bmkg.py. The functions in it are used in several similar notebooks to reduce code repetition and to improve readability.

In [3]:
import shared_bmkg

b) Create data directories¶

The raw data provided by the project and the transformed data generated throughout this notebook are stored in separate directories. If the notebook is run more than once, the downloaded data is reused instead of fetching it again, but all data transformations are rerun.

In [4]:
project_name = "oregano"
download_dir = os.path.join(project_name, "downloads")
results_dir = os.path.join(project_name, "results")

shared_bmkg.create_dir(download_dir)
shared_bmkg.create_dir(results_dir)

2. Data download¶

This section fetches the data published by the project on figshare. The latest available version at the time of creating this notebook was used: Version 3 (2023-10-16).

All files provided by the project¶

  • OREGANO_V2.1.tsv: Nodes and edges in form of semantic triples (subject, predicate, object).
    • Publication: "File containing all knowledge graph triples. It is composed of 3 columns: Subject; Predicate;Object"
  • Ten further TSV files named ACTIVITY.tsv to TARGET.tsv: Node annotations, each file for one particular node type.
    • Publication: "Cross-reference tables [...] column headers are the names of the sources to which the cross-references belong, and the row headers contain the name of the entity in the OREGANO graph. The first column header of each file is a key consisting of “ID_OREGANO:” followed by the number of entries in the file."
  • oreganov2.1_metadata_complet.ttl: Seemingly the same information as above, but in a format compatible with the RDF standard, a central technology of the semantic web.
    • Publication: "The OREGANO knowledge graph in turtle format with the names and cross-references of the various integrated entities."

Files needed to create the knowledge graph¶

  • OREGANO_V2.1.tsv together with ACTIVITY.tsv to TARGET.tsv contain all information required for reconstructing the knowledge graph.
  • Alternatively, oreganov2.1_metadata_complet.ttl could presumably be used as well, but parsing it is much slower and interpreting it correctly is more tedious compared to using the TSV files.
In [5]:
download_specification = [
    ("OREGANO_V2.1.tsv", "https://figshare.com/ndownloader/files/42612700", "3e534cadf7717c705cbd5e6dd8c392a6"),

    ("ACTIVITY.tsv", "https://figshare.com/ndownloader/files/42612697", "d1faec938bd32ec9bb45ae5bba14d7a2"),
    ("COMPOUND.tsv", "https://figshare.com/ndownloader/files/42612676", "d86277d0e849c3774acb3c6be1878b44"),
    ("DISEASES.tsv", "https://figshare.com/ndownloader/files/42612685", "747a48c9786fd50912bb8d7f33d630df"),
    ("EFFECT.tsv", "https://figshare.com/ndownloader/files/42612694", "9b1656a45df7d0369dc2a8e5f52b5c2b"),
    ("GENES.tsv", "https://figshare.com/ndownloader/files/42612673", "8fe9a917b23ab715a268feda50b4a8d7"),
    ("INDICATION.tsv", "https://figshare.com/ndownloader/files/42612691", "eae2cbaebc64b1387cce1755a386d1cb"),
    ("PATHWAYS.tsv", "https://figshare.com/ndownloader/files/42612688", "d3f0abefe355c4b9cd1eea4cc8d58021"),
    ("PHENOTYPES.tsv", "https://figshare.com/ndownloader/files/42612679", "cada3e9316ba4ec654dd2b9ec2afa23c"),
    ("SIDE_EFFECT.tsv", "https://figshare.com/ndownloader/files/42612670", "8388b5a37db250128e50571d42d96cbc"),
    ("TARGET.tsv", "https://figshare.com/ndownloader/files/42612682", "46e189c556b2b010952d4212b8779cef"),

    ("oreganov2.1_metadata_complete.ttl", "https://figshare.com/ndownloader/files/42700234", "b087300f5e597d263a03c02eaae53bc7"),
]

for filename, url, md5 in download_specification:
    filepath = os.path.join(download_dir, filename)
    shared_bmkg.fetch_file(url, filepath)
    shared_bmkg.validate_file(filepath, md5)
    print()
Found a full local copy of "oregano/downloads/OREGANO_V2.1.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/ACTIVITY.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/COMPOUND.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/DISEASES.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/EFFECT.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/GENES.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/INDICATION.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/PATHWAYS.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/PHENOTYPES.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/SIDE_EFFECT.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/TARGET.tsv".
MD5 checksum is correct.

Found a full local copy of "oregano/downloads/oreganov2.1_metadata_complete.ttl".
MD5 checksum is correct.

3. Data import¶

This section loads the raw files into Python data structures for the following inspection and conversion.

In [6]:
%%time

df_kg = shared_bmkg.read_tsv_file(os.path.join(download_dir, "OREGANO_V2.1.tsv"), header=None)
df_kg.columns = ["subject", "predicate", "object"]
CPU times: user 2.02 s, sys: 291 ms, total: 2.31 s
Wall time: 2.5 s
In [7]:
%%time

df_activity = shared_bmkg.read_tsv_file(os.path.join(download_dir, "ACTIVITY.tsv"))
df_activity.columns.values[0] = "id"

df_compound = shared_bmkg.read_tsv_file(os.path.join(download_dir, "COMPOUND.tsv"))
df_compound.columns.values[0] = "id"

df_diseases = shared_bmkg.read_tsv_file(os.path.join(download_dir, "DISEASES.tsv"))
df_diseases.columns.values[0] = "id"

df_effect = shared_bmkg.read_tsv_file(os.path.join(download_dir, "EFFECT.tsv"))
df_effect.columns.values[0] = "id"

df_genes = shared_bmkg.read_tsv_file(os.path.join(download_dir, "GENES.tsv"))
df_genes.columns.values[0] = "id"

df_indication = shared_bmkg.read_tsv_file(os.path.join(download_dir, "INDICATION.tsv"))
df_indication.columns.values[0] = "id"

df_pathways = shared_bmkg.read_tsv_file(os.path.join(download_dir, "PATHWAYS.tsv"))
df_pathways.columns.values[0] = "id"

df_phenotypes = shared_bmkg.read_tsv_file(os.path.join(download_dir, "PHENOTYPES.tsv"))
df_phenotypes.columns.values[0] = "id"

df_sideeffect = shared_bmkg.read_tsv_file(os.path.join(download_dir, "SIDE_EFFECT.tsv"))
df_sideeffect.columns.values[0] = "id"

df_target = shared_bmkg.read_tsv_file(os.path.join(download_dir, "TARGET.tsv"))
df_target.columns.values[0] = "id"
CPU times: user 6.52 s, sys: 972 ms, total: 7.49 s
Wall time: 8.12 s
In [8]:
%%time

rdf_kg = shared_bmkg.read_ttl_file(os.path.join(download_dir, "oreganov2.1_metadata_complete.ttl"))
CPU times: user 9min 7s, sys: 11.2 s, total: 9min 19s
Wall time: 9min 26s

4. Data inspection¶

This section attempts to reproduce some published numbers by inspecting the raw data and then prints a few exemplary records.

The publication and data repository on figshare mention following statistics about the knowledge graph contents:

  • 88,937 nodes having 11 different node types
  • 824,231 edges having 19 different edge types

a) Number of nodes and edges¶

In [9]:
unique_nodes = list(set(df_kg['subject'].unique().tolist() + df_kg['object'].unique().tolist()))
num_nodes = len(unique_nodes)
num_edges = len(df_kg)

print(f"{num_nodes:,} nodes")
print(f"{num_edges:,} edges")
88,937 nodes
823,005 edges

Interpretation:

  • Inspecting the raw data resulted in 88,937 nodes, which matches the number mentioned in the publication.
  • Inspecting the raw data resulted in 823,005 edges, while the publication mentions 824,231 edges, which is 1226 more.
  • A difference was also present in earlier versions of the public dataset (Version 1 and 2), though in form of yet another number. The reason for these deviations did not become clear.
In [10]:
raw_triples = [(row.subject, row.predicate, row.object) for row in df_kg.itertuples()]
unique_triples = list(set(raw_triples))

num_raw_triples = len(raw_triples)
num_unique_triples = len(unique_triples)
print(f"{num_raw_triples:,} triples")
print(f"{num_unique_triples:,} unique triples")
823,005 triples
819,884 unique triples

Interpretation:

  • Inspecting the raw data further resulted only in 819,884 unique edges, while the original list contains 823,005 edges, which is 3121 more. This means approximately 0.38% of the edges are redundant, because they connect the same pair of nodes via the same type of edge. The reason for this deviation seems to be a mistake in the process that generated the data.

b) Types of nodes and edges¶

In [11]:
def node_name_to_type(node_name):
    try:
        # Most node names contain both the type and id of a node separated by a ":"
        node_type, _ = node_name.split(':', 1)
        node_type = node_type.lower()
    except Exception:
        # Node names that occur in "has_code" relations as target do not have a ":" and explicit type
        # Here the type "code" is assigned to such nodes to be consistent
        node_type = "code"
    return node_type
In [12]:
nt_counts = dict()
for node_name in unique_nodes:
    nt = node_name_to_type(node_name)
    if nt not in nt_counts:
        nt_counts[nt] = 0    
    nt_counts[nt] += 1

print(len(nt_counts), "node types, sorted by their frequency of occurrence:")
for nt, cnt in sorted(nt_counts.items(), key=lambda item: -item[1]):
    print(f"- {nt}: {cnt}")
12 node types, sorted by their frequency of occurrence:
- compound: 32083
- protein: 14505
- gene: 13363
- disease: 8934
- phenotype: 6854
- side_effect: 5364
- code: 3280
- pathway: 2128
- indication: 2080
- effect: 171
- molecule: 97
- activity: 78
In [13]:
et_column = "predicate"
et_counts = dict(df_kg.groupby(et_column).size().sort_values(ascending=False))

print(len(et_counts), "edge types, sorted by their frequency:")
for key, val in et_counts.items():
    print(f"- {key}: {val}")
18 edge types, sorted by their frequency:
- decrease_efficacy: 215222
- has_target: 202539
- has_side_effect: 112532
- has_phenotype: 88791
- acts_within: 44865
- increase_efficacy: 44114
- has_effect: 19160
- increase_effect: 18999
- causes_condition: 17287
- has_activity: 12584
- gene_product_of: 10881
- increase_activity: 10518
- is_affecting: 8905
- has_indication: 8225
- has_code: 3403
- decrease_activity: 3331
- is_substance_that_treats: 1394
- decrease_effect: 255
In [14]:
# Correctness checks

# 1) Do the counts of different node types add up to the total number of nodes?
sum_node_types = sum(nt_counts.values())
assert sum_node_types == num_nodes, f"Node counts differ: {sum_node_types} != {num_nodes}"
print(f"{sum_node_types:,} = {num_nodes:,} nodes")

# 2) Do the counts of different edge types add up to the total number of edges?
sum_edge_types = sum(et_counts.values())
assert sum_edge_types == num_edges, f"Edge counts differ: {sum_edge_types} != {num_edges}"
print(f"{sum_edge_types:,} = {num_edges:,} edges")
88,937 = 88,937 nodes
823,005 = 823,005 edges
In [15]:
85657 - 88937
Out[15]:
-3280

Interpretation:

  • Inspecting the raw data resulted in 12 node types, while the publication mentioned 11 node types, which is 1 fewer.
  • Inspecting the raw data resulted in 18 edge types, while the publication mentioned 19 edge types, which is 1 more.
  • The differences can be explained by two separate reasons:
    • The number of node types differ because of a peculiarity in how node types are encoded in the raw data, which can also be seen in the next section. A node name usually has the form "type:id", which assigns an explicit type to the node. In some cases, however, there are objects whose node name has the form "id" without any type assignment, but because they are always part of a triple with the predicate "has_code", their implicit type is taken to be "code" in this notebook in order construct a proper knowledge graph where every node has a type.
    • The number of edge types do not differ when using an earlier version of the public dataset (Version 1). This suggests that the authors made slight modifications to the knowledge graph creation process after the publication, presumably to correct an error in the computation or to accommodate slightly different input data.

c) Example entries¶

This section prints some example entries of the raw data. It gives an impression of the format chosen by the authors, which differs greatly between projects due to a lack of a broadly accepted standard for biomedical knowledge graphs.

In [16]:
def report_first_n_items(data, n):
    return data.head(n)
In [17]:
def report_last_n_items(data, n):
    return data.tail(n)

Edges and the nodes they connect¶

In [18]:
report_first_n_items(df_kg, 2)
Out[18]:
subject predicate object
0 COMPOUND:1 has_code B01AE02
1 COMPOUND:2 has_code L01FE01
In [19]:
report_last_n_items(df_kg, 2)
Out[19]:
subject predicate object
823003 PROTEIN:22095 gene_product_of GENE:31330
823004 PROTEIN:22096 gene_product_of GENE:25313
In [20]:
# Caution: Some edges contain a target node without type, i.e. there is no ":" in the node name
df_notype = df_kg[~df_kg['subject'].str.contains(':') | ~df_kg['object'].str.contains(':')]
report_first_n_items(df_notype, 4)
Out[20]:
subject predicate object
0 COMPOUND:1 has_code B01AE02
1 COMPOUND:2 has_code L01FE01
2 COMPOUND:3 has_code R05CB13
3 COMPOUND:4 has_code L01XX29

Node annotations¶

In [21]:
report_first_n_items(df_activity, 2)
Out[21]:
id NAME_DRUGBANK
0 ACTIVITY:1 the anticoagulant activities
1 ACTIVITY:10 the hepatotoxic activities
In [22]:
report_first_n_items(df_compound, 2)
Out[22]:
id DRUGBANK ATC DRUGS PRODUCT DATABASE PUBCHEM SUBSTANCE KEGG DRUG PHARMGKB UNIPROTKB THERAPEUTIC TARGETS DATABASE WIKIPEDIA ... NDC THERAPEUTIC TARGETS DATABASE CAS GENBANK UNIPROT CLINICALTRIALS.GOV DPD URL FDA DRUG LABEL AT DAILYMED IUPHAR LIGAND
0 COMPOUND:1 DB00001 B01AE02 11916 46507011 D06880 PA450195 P01050 DAP000541 Lepirudin ... 50419-150-57 DAP000541 NaN NaN P01050 NaN NaN NaN NaN NaN
1 COMPOUND:2 DB00002 L01FE01 13175 46507042 D03455 PA10040 NaN DNC000788 Cetuximab ... 66733-948-23 DNC000788 NaN J00228 NaN NaN NaN NaN NaN NaN

2 rows × 52 columns

In [23]:
report_first_n_items(df_diseases, 2)
Out[23]:
id PHARMGKB MESH SNOMEDCT UMLS NDFRT MEDDRA SNOMEDCT NDFRT MESH MEDDRA UMLS ORPHANET ICD-11 ICD-10 OMIM GARD
0 DISEASE:1 PA165108557 D054868 4325000 C0795841 N0000181176 NaN NaN NaN NaN NaN C0795841 2308 LD44.B0 Q93.5 NaN 307
1 DISEASE:2 PA446766 D018784 75100008 C0243001 N0000003856 10060921 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [24]:
report_first_n_items(df_effect, 2)
Out[24]:
id NAME_DRUGBANK
0 EFFECT:1 bleeding and hemorrhage
1 EFFECT:10 infection and neutropenia
In [25]:
report_first_n_items(df_genes, 2)
Out[25]:
id PHARMGKB COMPARATIVE TOXICOGENOMICS DATABASE ENSEMBL GENATLAS GENECARD HGNC HUMANCYC GENE MODBASE NCBI GENE ... ENSEMBL GENATLAS GENECARD HGNC IUPHAR RECEPTOR URL PHARMVAR GENE GENBANK COMPARATIVE TOXICOGENOMICS DATABASE NCBI GENE
0 GENE:1 PA24356 1.0 ENSG00000121410 A1BG A1BG HGNC5 HS04495 P04217 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 GENE:2 PA165392995 NaN NaN NaN A1BG-AS1 HGNC37133 NaN NaN 503538 ... ENSG00000268895 NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 26 columns

In [26]:
report_first_n_items(df_indication, 2)
Out[26]:
id SIDER NAME
0 INDICATION:1 C0015544 Failure to Thrive
1 INDICATION:2 C0020615 Hypoglycemia
In [27]:
report_first_n_items(df_pathways, 2)
Out[27]:
id REACTOME
0 PATHWAY:10 R-HSA-418990
1 PATHWAY:100 R-HSA-9614657
In [28]:
report_first_n_items(df_phenotypes, 2)
Out[28]:
id HPO UMLS MSH SNOMEDCT_US MEDDRA FYLER NCIT COHD EFO ... EPCC DOID MONDO ICD-O MP MPATH PMID ORPHA SNOMED_CT ICD-9
0 PHENOTYPE:1 HP:0000001 C0444868 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 PHENOTYPE:2 HP:0000002 C4025901 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 23 columns

In [29]:
report_first_n_items(df_sideeffect, 2)
Out[29]:
id SIDER NAME
0 SIDE_EFFECT:1 C0000729 Abdominal cramps
1 SIDE_EFFECT:2 C0000737 Abdominal pain
In [30]:
report_first_n_items(df_target, 2)
Out[30]:
id HUGO GENE NOMENCLATURE COMMITTEE GENATLAS GENBANK GENE DATABASE GENBANK PROTEIN DATABASE GUIDE TO PHARMACOLOGY UNIPROTKB UNIPROT ACCESSION DRUGBANK IUPHAR ... PCDDB SFLD MOONPROT DEPOD UNILECTIN PEROXIBASE ALLERGOME IMGT_GENE-DB REBASE PATRIC
0 PROTEIN:1 HGNC:3535 F2 M17262 339641.0 2362.0 P00734;B2R7F7;B4E1A7;Q4QZ40;Q53H04;Q53H06;Q69E... THRB_HUMAN BE0000048 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 PROTEIN:2 HGNC:3236 EGFR X00588 757924.0 1797.0 P00533;O00688;O00732;P06268;Q14225;Q68GS5;Q927... EGFR_HUMAN BE0000767 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 142 columns

Nodes, edges and annotations in a different format for semantic web technologies¶

In [31]:
for i, (s, p, o) in enumerate(rdf_kg):
    print("Subject:  ", s)
    print("Predicate:", p)
    print("Object:   ", o)
    print()
    if i >= 4:
        break
Subject:   http://erias.fr/oregano/compound/compound_966
Predicate: http://erias.fr/oregano/#decrease_efficacy
Object:    http://erias.fr/oregano/compound/compound_3132

Subject:   http://erias.fr/oregano/compound/compound_24288
Predicate: http://erias.fr/oregano/#has_target
Object:    http://erias.fr/oregano/protein/protein_2870

Subject:   http://erias.fr/oregano/protein/protein_20017
Predicate: http://erias.fr/oregano/#hpa
Object:    ensg00000127125

Subject:   http://erias.fr/oregano/compound/compound_11152
Predicate: http://erias.fr/oregano/#increase_efficacy
Object:    http://erias.fr/oregano/compound/compound_1530

Subject:   http://erias.fr/oregano/protein/protein_14672
Predicate: http://erias.fr/oregano/#bgee
Object:    ensg00000108813

5. Schema discovery¶

This section analyzes the structure of the knowledge graph by determining which types of nodes are connected by which types of edges. To construct this overview, it is necessary to iterate over the entire data once. The result is a condensed representation of all entities and relations, which is known as data model or schema in the context of graph databases.

In [32]:
node_type_to_color = {
    "compound": "green",
    "molecule": "green",

    "gene": "blue",
    "protein": "blue",

    "disease": "red",
    "pathway": "red",
}
In [33]:
%%time

unique_triples = set()
for row in df_kg.itertuples():
    s = node_name_to_type(row.subject)
    p = row.predicate
    o = node_name_to_type(row.object)
    triple = (s, p, o)
    unique_triples.add(triple)
CPU times: user 4.77 s, sys: 30 μs, total: 4.77 s
Wall time: 4.77 s
In [34]:
gs = ig.Graph(directed=True)
unique_nodes = set()
for s, p, o in unique_triples:
    for node in (s, o):
        if node not in unique_nodes:
            unique_nodes.add(node)

            node_size = int(nt_counts[node])
            node_color = node_type_to_color.get(node, '')
            node_hover = f"{node}\n\n{nt_counts[node]} nodes of this type are contained in the knowledge graph."
            gs.add_vertex(node, size=node_size, color=node_color, label_color=node_color, hover=node_hover)

    edge_size = int(et_counts[p])
    edge_color = node_type_to_color.get(s, '')
    edge_hover = f"{p}\n\n{et_counts[p]} edges of this type are contained in the knowledge graph."
    gs.add_edge(s, o, size=edge_size, color=edge_color, hover=edge_hover, label=p, label_color="gray", label_size=5)

gs.vcount(), gs.ecount()
Out[34]:
(12, 19)
In [35]:
fig = gv.d3(
    gs,
    show_node_label=True,
    node_label_data_source="name",

    show_edge_label=True,
    edge_label_data_source="label",
    edge_curvature=0.2,

    use_node_size_normalization=True,
    node_size_normalization_min=10,
    node_size_normalization_max=50,
    node_drag_fix=True,
    node_hover_neighborhood=True,
    
    use_edge_size_normalization=True,
    edge_size_normalization_max=3,

    many_body_force_strength=-3000,
    zoom_factor=0.8,
)
fig
Out[35]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Edge label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Rotation
Angle
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Edge labels
Visibility
Size
Scaling factor
Rotation
Angle
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
Collision force
Radius
Strength
x-positioning force
Strength
y-positioning force
Strength
Centering force
In [36]:
# Export the schema visualization
schema_filepath = os.path.join(results_dir, f"{project_name}_schema.html")
fig.export_html(schema_filepath, overwrite=True)

Interpretation:

  • Each node in the schema corresponds to one of the 12 node types in the data.
    • Node size represents the number of instances, i.e. how often that node type is present in the knowledge graph. The exact number can also be seen when hovering over a node.
    • Node color represents particular node types. The coloring scheme is based on a deliberately simple RGB palette with the same meaning across multiple notebooks to enable some visual comparison. The idea behind it is to highlight an interplay of certain entities, namely that drugs (or small molecules in general) can bind to proteins (or gene products in general) and thereby alter diseases (or involved pathways).
      • green = drugs & other small molecules (e.g. toxins)
      • blue = genes & gene products (e.g. proteins or RNAs)
      • red = diseases & related concepts (e.g. pathways)
      • black = all other types of entities
  • Each edge in the schema stands for one of the 18 edge types in the data. It is possible that the same edge type appears between different nodes.
    • Edge size represents the number of instances, i.e. how often that edge type is present in the knowledge graph.
    • Edge color is identical to the color of the source node, again to highlight the interplay between drugs, targets and diseases.

6. Knowledge graph reconstruction¶

This section first converts the raw data to an intermediate format used in several notebooks, and then reconstructs the knowledge graph from the standardized data with shared code.

  • The intermediate form of the data is created as two simple Python lists, one for nodes and the other for edges, which can be exported to two CSV files.
  • The knowledge graph is built as a graph object from the Python package igraph, which can be exported to a GraphML file.

a) Convert the data into a standardized format¶

Transform the raw data to an standardized format that is compatible with most biomedical knowledge graphs in order to enable shared downstream processing:

  • Each node is represented by three items: id (str), type (str), properties (dict)
  • Each edge is represented by four items: source_id (str), target_id (str), type(str), properties (dict)

This format was initially inspired by a straightforward way in which the content of a Neo4j graph database can be exported to two CSV files, one for all nodes and the other for all edges. This is an effect of the property graph model used in Neo4j and many other graph databases, which also appears to be general enough to fully capture the majority of biomedical knowledge graphs described in scientific literature, despite the large variety of formats they are shared in.

A second motivation was that each line represents a single node or edge, and that no entry is connected to any sections at other locations, such as property descriptions at the beginning of a GraphML file. This structural simplicity makes it very easy to load just a subset of nodes and edges by picking a subset of lines, or to skip the loading of properties if they not required for a task simply by ignoring a single column.

Finally, this format also allows to load the data directly into popular SQL databases like SQLite, MySQL or PostgreSQL with built-in CSV functions (CSV in SQLite, CSV in MySQL, CSV in PostgreSQL). Further, the JSON string in the property column can be accessed directly by built-in JSON functions (JSON in SQLite, JSON in MySQL, JSON in PostgreSQL), which enables sophisticated queries that access or modify specific properties within the JSON data.

Prepare node annotations¶

In [37]:
%%time

def strip_if_str(val):
    if isinstance(val, str):
        val = val.strip()
    return val

def is_nan(val):
    if isinstance(val, float):
        return math.isnan(val)
    return False

dataframes = [
    df_activity,
    df_compound,
    df_diseases,
    df_effect,
    df_genes,
    df_indication,
    df_pathways,
    df_phenotypes,
    df_sideeffect,
    df_target,
]
node_name_to_annotation_map = {}
for df in dataframes:
    data_columns = list(df.columns)[1:]
    for row in df.itertuples():
        # indexing: 0=id (ignored), 1=name, 2-n=annotations
        node_name = row[1]
        annotation = {col: row[i] for i, col in enumerate(data_columns, 2)}
        annotation = {strip_if_str(k):strip_if_str(v) for k, v in annotation.items()
                      if v is not None
                      and strip_if_str(v) != ""
                      and not is_nan(v)}
        node_name_to_annotation_map[node_name] = annotation
CPU times: user 18.1 s, sys: 136 ms, total: 18.3 s
Wall time: 18.3 s
In [38]:
for i, (k, v) in enumerate(node_name_to_annotation_map.items()):
    if i >= 20000:
        print(k, v)
    if i >= 20004:
        break
COMPOUND:28063 {'CHEMBL': 'CHEMBL465430', 'PUBCHEM COMPOUND': '44567115', 'NPASS': 'NPC171225'}
COMPOUND:28074 {'CHEMBL': 'CHEMBL510705', 'PUBCHEM COMPOUND': '11216146', 'NPASS': 'NPC171317'}
COMPOUND:28079 {'CHEMBL': 'CHEMBL2426510', 'PUBCHEM COMPOUND': '72708559', 'NPASS': 'NPC171360'}
COMPOUND:28080 {'CHEMBL': 'CHEMBL479677', 'PUBCHEM COMPOUND': '11687274', 'NPASS': 'NPC171372'}
COMPOUND:28081 {'CHEMBL': 'CHEMBL447356', 'PUBCHEM COMPOUND': '25009833', 'NPASS': 'NPC17138'}

Nodes¶

In [39]:
%%time

nodes = []
seen_nodes = set()
for row in df_kg.itertuples():
    for node_name in (row.subject, row.object):
        if node_name not in seen_nodes:
            seen_nodes.add(node_name)

            node_id = node_name
            node_type = node_name_to_type(node_name)
            node_properties = node_name_to_annotation_map.get(node_id, {})
            node = (node_id, node_type, node_properties)  # default format
            nodes.append(node)
CPU times: user 3.37 s, sys: 8.11 ms, total: 3.38 s
Wall time: 3.38 s

Edges¶

Note: As mentioned before, there are redundant edges in the raw data. These are skipped here deliberately, because igraph would support to place an edge of the same type between the same pair of nodes multiple times, which doesn't add any useful information.

In [40]:
%%time

edges = []
seen_triples = set()
for row in df_kg.itertuples():
    source_id = row.subject
    target_id = row.object
    edge_type = row.predicate
    edge_properties = {}
    
    triple = (source_id, edge_type, target_id)
    if triple not in seen_triples:
        seen_triples.add(triple)

        edge = (source_id, target_id, edge_type, edge_properties)  # default format
        edges.append(edge)
CPU times: user 4.59 s, sys: 368 ms, total: 4.96 s
Wall time: 4.96 s

b) Export the standardized data to two CSV files¶

Both the id and type items are simple strings, while the properties item is collection of key-value pairs represented by a Python dictionary that can be converted to a single JSON string, which the export function does internally. This means each node is fully represented by three strings, and each edge by four strings due to having a source id and target id.

In [41]:
nodes_csv_filepath = shared_bmkg.export_nodes_as_csv(nodes, results_dir, project_name)
In [42]:
edges_csv_filepath = shared_bmkg.export_edges_as_csv(edges, results_dir, project_name)

c) Use the standardized data to build a graph¶

Reconstruct the knowledge graph in form of a Graph object from the package igraph. This kind of graph object allows to have directed multi-edges, i.e. an edge has a source and a target node, and two nodes can be connected by more than one edge. It also allows to have node and edge properties. These features are necessary and sufficient to represent almost any biomedical knowledge graph found in academic literature.

In [43]:
%%time

g = shared_bmkg.create_graph(nodes, edges)
CPU times: user 14.7 s, sys: 764 ms, total: 15.5 s
Wall time: 15.5 s
In [44]:
shared_bmkg.report_graph_stats(g)
Directed multigraph with 88937 nodes, 819884 edges and a density of 0.0001037.
In [45]:
# Correctness checks

# 1) Does the reconstructed graph contain the same number of nodes as the raw data?
num_nodes_in_graph = g.vcount()
assert num_nodes_in_graph == num_nodes, f"Node counts differ: {num_nodes_in_graph} != {num_nodes}"
print(f"{num_nodes_in_graph:,} = {num_nodes:,}")

# 2) Does the reconstructed graph contain the same number of (unique) edges as the raw data?
num_edges_in_graph = g.ecount()
assert num_edges_in_graph == num_unique_triples, f"Edge counts differ: {num_edges_in_graph} != {num_unique_triples}"
print(f"{num_edges_in_graph:,} = {num_unique_triples:,}")
88,937 = 88,937
819,884 = 819,884

d) Export the graph to a GraphML file¶

Export the graph with all nodes, edges and properties as a single GraphML file.

In [46]:
%%time

g_graphml_filepath = shared_bmkg.export_graph_as_graphml(g, results_dir, project_name)
CPU times: user 56.9 s, sys: 3.79 s, total: 1min
Wall time: 59.8 s

7. Subgraph exploration¶

This section explores small subgraphs of the knowledge graph in two ways: first by inspecting the direct neighborhood of a selected node, and second by finding shortest paths between two chosen nodes.

As a simple case study, the goal is to identify some nodes in the knowledge graph that are associated with the success story of the drug Imatinib, which was one of the first targeted therapies against cancer. Detailed background information can for example be found in an article by the National Cancer Institute and in a talk by Brian Druker who played a major role in the development of this paradigm-changing drug. To give a simplified summary, following biological entities and relationships are involved:

  • Mutation: In a bone marrow stem cell, a translocation event between chromosome 9 and 22 leads to what has been called the Philadelphia chromosome, which can be seen under a microscope and got named after the city it originally got discovered in.
  • Gene: It turned out that this particular rearrangement of DNA fuses the BCR) gene on chromosome 22 to the ABL1) gene on chromosome 9, resulting in a new fusion gene known as BCR-ABL1.
  • Disease: BCR-ABL1 acts as an oncogene, because it expresses a protein that is a defective tyrosine kinase in a permanent "on" state, which leads to uncontrolled growth of certain white blood cells and their precursors, thereby driving the disease Chronic Myelogenous Leukemia (CML).
  • Drug: Imatinib (Gleevec) was the first demonstration that a potent and selective Bcr-Abl tyrosine-kinase inhibitor (TKI) is possible and that such a targeted inhibition of an oncoprotein halts the uncontrolled growth of leukemia cells with BCR-ABL1, while having significantly less effect on other cells in the body compared to conventional chemotherapies used in cancer. This revolutionized the treatment of CML and drastically improved the five-year survival rate of patients from less than 20% to over 90%, as well as their quality of life.

In reality the story is a bit more complex, for example because there are other genes involved in disease progression, there are many closely related forms of leukemia, BCR-ABL1 also plays a role in other forms of cancer, there are several drugs available as treatment options today, all of them bind to more than one target and with different affinities, and their individual binding profiles are relevant to their particular therapeutic effects. Inspecting the knowledge graph will focus on highlighting some entities of the simplified story, but the surrounding elements will also indicate some of the complexity encountered in reality. Some simple forms of reasoning on the knowledge graph will demonstrate its potential for discovering new patterns and hypotheses.

a) Search for interesting nodes¶

The OREGANO knowledge graph mainly consists of abstract identifiers for entities rather than names or descriptions. Therefore a search in external databases on the web is necessary to identify interesting entities.

In [47]:
def print_vertex(v):
    print(v.index)
    for k, v in v.attributes().items():
        if v is not None:
            print(f"- {k}: {v}")

Drugs¶

  • Search for drugs online to get their DrugBank Accession Numbers: https://go.drugbank.com/drugs
    • Example: Imatinib shows DrugBank Accession Number DB00619
In [48]:
def find_node_by_drugbank_id(graph, drugbank_id):
    for v in graph.vs:
        if v["DRUGBANK"] == drugbank_id:
            print_vertex(v)
            print()
In [49]:
# Drug: Imatinib - with DrugBank accession number "DB00619"
find_node_by_drugbank_id(g, "DB00619")
1059
- name: COMPOUND:605
- type: compound
- DRUGS PRODUCT DATABASE: 12429
- DPD: 2253283.0
- CHEMBL: CHEMBL941
- WIKIPEDIA: Imatinib
- PHARMGKB: PA10804
- THERAPEUTIC TARGETS DATABASE: DNC001383
- FDA DRUG LABEL AT DAILYMED: 211ef2da-2868-4a77-8055-1cb2cd78e24b
- URL: http://en.wikipedia.org/wiki/Imatinib
- KEGG DRUG: D01441
- ATC: L01EA01;L01XE01
- PUBCHEM COMPOUND: 5291.0
- PUBCHEM SUBSTANCE: 46505055;841977
- DRUGBANK: DB00619
- RXCUI: 282388.0
- NDC: 0078-0401-34
- CLINICALTRIALS.GOV: NCT00180830;NCT01437202;NCT01860456
- CHEMSPIDER: 5101
- CHEBI: CHEBI:45783
- BINDINGDB: 13530
- ZINC: ZINC000019632618
- PDB LIGAND: STI
- PDB: STI

Genes¶

  • Search for genes online to get their NCBI Gene ID: https://www.ncbi.nlm.nih.gov/gene
    • Example: ABL1 shows Gene ID: 25
In [50]:
def find_node_by_ncbi_gene_id(graph, ncbi_gene_id):
    for v in graph.vs:
        if v["NCBI GENE"] == ncbi_gene_id:
            print_vertex(v)
            print()
In [51]:
# Gene: ABL1 - with NCBI Gene ID "25"
find_node_by_ncbi_gene_id(g, "25")
18276
- name: GENE:27548
- type: gene
- NCBI GENE: 25

62046
- name: GENE:117
- type: gene
- UNIPROTKB: P00519;Q59FK4
- PHARMGKB: PA24413
- HGNC: HGNC76
- GENATLAS: ABL1
- ENSEMBL: ENSG00000097007
- NCBI GENE: 25
- OMIM: 189980
- REFSEQ PROTEIN: NP_005148;NP_009297
- MODBASE: P00519
- REFSEQ RNA: NM_005157;NM_007313
- HUMANCYC GENE: HS01874
- GENECARD: ABL1
- REFSEQ DNA: NG_012034;NT_035014
- UCSC GENOME BROWSER: NM_005157
- COMPARATIVE TOXICOGENOMICS DATABASE: 25.0

Diseases¶

  • Online search for diseases to get their MeSH (Medical Subject Headings) Unique ID: https://www.ncbi.nlm.nih.gov/mesh
    • Example: Leukemia, Myelogenous, Chronic, BCR-ABL Positive shows MeSH Unique ID: D015464
    • Example: Leukemia, Myeloid, Acute shows MeSH Unique ID: D015470
In [52]:
def find_node_by_mesh_id(graph, mesh_id):
    for v in graph.vs:
        if v["MESH"] == mesh_id:
            print(v["name"])
In [53]:
# Disease: Chronic Myelogenous Leukemia (CML) - with MeSH ID "D015464"
find_node_by_mesh_id(g, "D015464")
DISEASE:1822
In [54]:
# Disease: Acute Myelogenous Leukemia (AML) - with MeSH ID "D015470"
find_node_by_mesh_id(g, "D015470")
DISEASE:1824

b) Explore the neighborhood of a chosen node¶

In [55]:
# Neighborhood of drug Imatinib
source = "COMPOUND:605"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)

# Export
filename = f"{project_name}_neighbors_imatinib"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 144 nodes, 728 edges and a density of 0.03511.
Out[55]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Edge label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Rotation
Angle
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Edge labels
Visibility
Size
Scaling factor
Rotation
Angle
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
Collision force
Radius
Strength
x-positioning force
Strength
y-positioning force
Strength
Centering force

Interpretation:

  • Compound-Compound relations: The compound Imatinib (green node in the center) stands in two relations (mainly "decrease_efficiency", a few cases of "increase_efficiency") to a lot of other compounds (green nodes).
    • Example: COMPOUND:536 represents Propylthiouracil and interacts with a lot of compounds that also interact with Imatinib, seen as a dense cluster of green nodes in the visualization. A second cluster is formed by COMPOUND:1053 which represents Fludarabine and interacts with many of the same compounds. Why this is the case remains unclear, since an online search does not reveal any obvious connection between Imatinib, Propylthiouracil and Fludarabine.
  • As before, there are many Compound-Gene and Gene-Disease relations present in this subgraph.
    • Example: DISEASE:1228 has MeSH id D046152 and represents "Gastrointestinal Stromal Tumors (GIST)". Imatinib stands in a "is_substance_that_treats" relation to it. This is facilitated by Imatinib standing in a "is_affecting" relation with a lot of genes that in turn stand in a "causes_condition" relation to this disease. For example, GENE:9737 represents the tyrosine kinase KIT), which is a proto-oncogene that is indeed targeted by Imatinib and associated with "gastrointestinal stromal tumors", "acute myeloid leukemia" and a few other diseases according to Wikipedia, which is also reflected in the subgraph.
  • In contrast to before, this subgraph also contains nodes of type Effect or Activity (black nodes).
In [56]:
# Neighborhood of gene ABL1
source = "GENE:117"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)

# Export
filename = f"{project_name}_neighbors_abl1"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 19 nodes, 20 edges and a density of 0.0554.
Out[56]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Edge label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Rotation
Angle
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Edge labels
Visibility
Size
Scaling factor
Rotation
Angle
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
Collision force
Radius
Strength
x-positioning force
Strength
y-positioning force
Strength
Centering force

Interpretation:

  • Gene-Disease relations: The gene ABL1 (blue node in the center) stands in a "causes_condition" relation to 6 diseases (red nodes).
    • Example: Among the disease nodes is DISEASE:1822, which represents CML as seen before.
  • Compound-Gene relations: There are 12 compounds (green nodes) that stand in a "is_affecting" relation to the gene ABL1.
    • Example: Among the compound nodes is COMPOUND:605, which represents Imatinib as seen before. This compound in turn stands in a "is_substance_that_treats" relation to two diseases, where DISEASE:1822 represents CML and DISEASE:2199 represents Neoplasms in general. This can be determined by hovering over it to see its MeSH id and then perform an online search with it that results in D009369.
In [57]:
# Neighborhood of disease CML
source = "DISEASE:1822"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)

# Export
filename = f"{project_name}_neighbors_cml"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 25 nodes, 56 edges and a density of 0.0896.
Out[57]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Edge label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Rotation
Angle
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Edge labels
Visibility
Size
Scaling factor
Rotation
Angle
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
Collision force
Radius
Strength
x-positioning force
Strength
y-positioning force
Strength
Centering force

Interpretation:

  • Gene-Disease relations: There are 22 genes (blue nodes) that stand in a "causes_condition" relation with the disease CML (red node in the center). The edge types can be seen when hovering over an arrow. The schema detection in a previous section also clarified that there is only this particular edge type between gene and disease nodes.
    • Examples: GENE:117 is ABL1 and GENE:1648 is BCR, which was also detected in the node search and can be seen when hovering over them and looking at the GENECARD property. From the Imatinib story it was expected to see these genes here, while others might be more informative.
  • Compound-Disease relations: There are 2 compounds (green nodes) that stand in a "is_substance_that_treats" relation with the disease CML.
    • Examples: COMPOUND:605 is Imatinib and COMPOUND:90342 is Methotrexate, which can be seen when hovering over them and looking at the URL property that refers to Wikipedia.

c) Find shortest paths between two chosen nodes¶

In [58]:
# Paths from drug Imatinib to disease AML
source = "COMPOUND:605"
target = "DISEASE:1824"
subgraph = shared_bmkg.get_paths_subgraph(g, source, target)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source, target)
Directed multigraph with 7 nodes, 10 edges and a density of 0.2041.
Out[58]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Edge label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Rotation
Angle
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Edge labels
Visibility
Size
Scaling factor
Rotation
Angle
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
Collision force
Radius
Strength
x-positioning force
Strength
y-positioning force
Strength
Centering force

Interpretation:

  • As mentioned before, Imatinib (green node on the left) is not only used in the treatment of the disease CML, but also affecting other diseases, including AML (red node on the right). There is no direct link between Imatinib and AML present in the OREGANO knowledge graph, but this subgraph shows some genes (blue nodes in the middle) by which the effect of Imatinib on AML might be facilitated.
  • A simple pattern like this suggests that a link between Imatinib and AML could be discovered by a suitable link prediction method. Similar but perhaps a bit more complex patterns between other nodes may uncover indirect relationships between compounds and diseases that are not yet known. This is one way of forming drug repurposing hypotheses, which means that a drug that has already been shown in clinical trials to be safe to humans could potentially be effective in the treatment of another disease too and thereby find a new purpose.