Monarch¶

This notebook explores the biomedical knowledge graph provided by the project Monarch: Publication (2024), Website, Code, Data

The source file of this notebook is monarch.ipynb and can be found in the repository awesome-biomedical-knowledge-graphs that also contains information about similar projects.

Table of contents¶

  1. Setup
  2. Data download
  3. Data import
  4. Data inspection
  5. Schema discovery
  6. Knowledge graph reconstruction
  7. Subgraph exploration

1. Setup¶

This section prepares the environment for the following exploratory data analysis.

a) Import packages¶

From the Python standard library.

In [1]:
import os

From the Python Package Index (PyPI).

In [2]:
import dask.dataframe as dd
import gravis as gv
import igraph as ig

From a local Python module named shared_bmkg.py. The functions in it are used in several similar notebooks to reduce code repetition and to improve readability.

In [3]:
import shared_bmkg

b) Create data directories¶

The raw data provided by the project and the transformed data generated throughout this notebook are stored in separate directories. If the notebook is run more than once, the downloaded data is reused instead of fetching it again, but all data transformations are rerun.

In [4]:
project_name = "monarch"
download_dir = os.path.join(project_name, "downloads")
results_dir = os.path.join(project_name, "results")

shared_bmkg.create_dir(download_dir)
shared_bmkg.create_dir(results_dir)
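
The helper shared_bmkg.create_dir is not shown in this notebook. As a minimal sketch of what such a helper might look like, assuming it only needs to create a directory when it is missing, the hypothetical function create_dir_sketch below can be called repeatedly without error. This is the property that makes rerunning the notebook safe.

```python
import os
import tempfile


def create_dir_sketch(path):
    """Create a directory, including parent directories, if it does not exist yet.

    Hypothetical stand-in for shared_bmkg.create_dir, whose actual
    implementation is not shown in this notebook.
    """
    os.makedirs(path, exist_ok=True)
    return path


# Calling the function twice is harmless because of exist_ok=True
base_dir = tempfile.mkdtemp()
example_dir = os.path.join(base_dir, "monarch", "downloads")
create_dir_sketch(example_dir)
create_dir_sketch(example_dir)
```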

2. Data download¶

This section fetches the data published by the project in a data repository on a custom web server. The latest available version at the time of creating this notebook was used: 2024-07-12.

All files provided by the project¶

Monarch updates its data regularly and provides it in many different formats.

  • TSV files
    • monarch-kg.tar.gz: Tarball archive with two files
      • monarch-kg_nodes.tsv: Nodes and node annotations
      • monarch-kg_edges.tsv: Edges and edge annotations
    • monarch-kg-denormalized-edges.tsv.gz
    • monarch-kg-denormalized-nodes.tsv.gz
  • Database dumps
    • monarch-kg.db.gz: SQLite relational database
    • monarch-kg.duckdb.gz: DuckDB relational database
    • monarch-kg.neo4j.dump: Neo4j graph database
  • RDF files
    • monarch-kg.jsonl.tar.gz: JSON-LD file
    • monarch-kg.nt.gz: N-Triples file
  • Metadata
    • metadata.yaml
    • merged_graph_stats.yaml
    • qc_report.yaml
  • Other
    • phenio.db.gz: Phenomics Integrated Ontology database
    • solr.tar.gz: Solr index for the website's search function

Remarks in the publication:

  • "For users who want to integrate Monarch’s KG into their own datasets or analysis pipelines, we use Knowledge Graph Exchange (KGX) tools to serialize to various formats (e.g. SQLite, Neo4J, RDF, KGX)."
  • "the Monarch Knowledge Graph (KG), which comprises the combined knowledge of 33 biomedical resources and biomedical ontologies (see Data Sources below), and is updated with the latest data from each source once a month."
  • "ontologies are integrated into a ‘semantic layer,’ a Biolink-conformant representation of the Phenomics Integrated Ontology (PHENIO; github.com/monarch-initiative/phenio), which serves as a hierarchical schema and classification system for the integrated data."

Files needed to create the knowledge graph¶

  • monarch-kg.tar.gz contains all information required for reconstructing the knowledge graph.
  • Alternatively, many other formats are provided and could be used instead.
In [5]:
download_specification = [
    # TAR archive of two files that contain the knowledge graph and annotations for nodes and edges
    # - monarch-kg_nodes.tsv: TSV file that contains the node data
    # - monarch-kg_edges.tsv: TSV file that contains the edge data
    ("monarch-kg.tar.gz", "https://data.monarchinitiative.org/monarch-kg/2024-07-12/monarch-kg.tar.gz", "df24db3c3b743c5829af71b6cd41c9fb"),
]

for filename, url, md5 in download_specification:
    filepath = os.path.join(download_dir, filename)
    shared_bmkg.fetch_file(url, filepath)
    shared_bmkg.validate_file(filepath, md5)
    print()
Found a full local copy of "monarch/downloads/monarch-kg.tar.gz".
MD5 checksum is correct.
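
The validation step compares the MD5 checksum of the downloaded file with a known value. The implementation of shared_bmkg.validate_file is not shown here, so the following is only a sketch of how such a check could be written with the standard library. The function names md5_of_file and is_checksum_correct are hypothetical, and a small temporary file stands in for the real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(filepath, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file by reading it in chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_checksum_correct(filepath, expected_md5):
    """Return True if the file content matches the expected MD5 checksum."""
    return md5_of_file(filepath) == expected_md5.lower()


# Tiny demonstration with a temporary file instead of the real archive
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_md5 = "5d41402abc4b2a76b9719d911017c592"  # MD5 of b"hello"
checksum_ok = is_checksum_correct(demo_path, demo_md5)
os.remove(demo_path)
```

Reading in chunks keeps memory usage constant, which matters for archives as large as monarch-kg.tar.gz.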

In [6]:
filepath_extracted_1 = os.path.join(download_dir, "monarch-kg_nodes.tsv")
filepath_extracted_2 = os.path.join(download_dir, "monarch-kg_edges.tsv")

if os.path.isfile(filepath_extracted_1) and os.path.isfile(filepath_extracted_2):
    print("Found existing files from a previous extraction of the archive.")
else:
    filepath = os.path.join(download_dir, "monarch-kg.tar.gz")
    shared_bmkg.extract_tar_gz(filepath)
Found existing files from a previous extraction of the archive.

3. Data import¶

This section loads the raw files into Python data structures for the following inspection and conversion.

In [7]:
def read_tsv_file(filepath):
    # A Dask dataframe, not Pandas: the file is read lazily, chunk by chunk
    df = dd.read_csv(filepath, sep='\t', dtype=str)
    return df
In [8]:
%%time

df_nodes = read_tsv_file(os.path.join(download_dir, "monarch-kg_nodes.tsv"))
df_edges = read_tsv_file(os.path.join(download_dir, "monarch-kg_edges.tsv"))
CPU times: user 119 ms, sys: 23.9 ms, total: 143 ms
Wall time: 151 ms

4. Data inspection¶

This section attempts to reproduce some published numbers by inspecting the raw data and then prints a few example records.

The file merged_graph_stats.yaml of the 2024-06-10 release mentions the following statistics about the knowledge graph contents, which can be found in the entries total_nodes, total_edges, predicates (number of entries) and node_categories (number of entries):

  • 1,028,155 nodes having 80 different node types
  • 11,076,689 edges having 28 different edge types

a) Number of nodes and edges¶

In [9]:
%%time

num_nodes = len(df_nodes)
num_edges = len(df_edges)

print(f"{num_nodes:,} nodes")
print(f"{num_edges:,} edges")
print()
1,028,155 nodes
11,076,689 edges

CPU times: user 4min 3s, sys: 29.7 s, total: 4min 32s
Wall time: 2min 57s

Interpretation:

  • Inspecting the raw data resulted in 1,028,155 nodes, which matches the number mentioned in the stats file.
  • Inspecting the raw data resulted in 11,076,689 edges, which matches the number mentioned in the stats file.

b) Types of nodes and edges¶

In [10]:
%%time

nt_column = "category"
nt_counts = df_nodes.groupby(nt_column).size().compute()
nt_counts = nt_counts.sort_values(ascending=False)

print(len(nt_counts), "node types, sorted by their frequency of occurrence:")
for node_type, cnt in nt_counts.items():  # nt_counts is already sorted in descending order
    print(f"- {node_type}: {cnt}")
print()
80 node types, sorted by their frequency of occurrence:
- biolink:Gene: 571074
- biolink:Genotype: 133380
- biolink:PhenotypicFeature: 124247
- biolink:BiologicalProcessOrActivity: 38308
- biolink:Disease: 28109
- biolink:GrossAnatomicalStructure: 24210
- biolink:Cell: 22454
- biolink:Pathway: 22343
- biolink:NamedThing: 19576
- biolink:SequenceVariant: 13022
- biolink:AnatomicalEntity: 9978
- biolink:CellularComponent: 5308
- biolink:MolecularEntity: 4618
- biolink:BiologicalProcess: 3656
- biolink:MacromolecularComplex: 2120
- biolink:MolecularActivity: 1446
- biolink:Protein: 1112
- biolink:CellularOrganism: 958
- biolink:Vertebrate: 547
- biolink:Virus: 321
- biolink:BehavioralFeature: 297
- biolink:ChemicalEntity: 267
- biolink:LifeStage: 238
- biolink:PathologicalProcess: 231
- biolink:Drug: 100
- biolink:SmallMolecule: 70
- biolink:OrganismTaxon: 26
- biolink:InformationContentEntity: 23
- biolink:NucleicAcidEntity: 18
- biolink:EvidenceType: 16
- biolink:RNAProduct: 8
- biolink:Transcript: 6
- biolink:Plant: 4
- biolink:Fungus: 4
- biolink:ProcessedMaterial: 3
- biolink:PopulationOfIndividualOrganisms: 2
- biolink:Activity: 2
- biolink:ConfidenceLevel: 2
- biolink:Publication: 2
- biolink:Mammal: 2
- biolink:Agent: 2
- biolink:ProteinFamily: 2
- biolink:Dataset: 2
- biolink:GeneticInheritance: 2
- biolink:EnvironmentalFeature: 2
- biolink:Invertebrate: 2
- biolink:Haplotype: 2
- biolink:Bacterium: 1
- biolink:ChemicalMixture: 1
- biolink:ChemicalExposure: 1
- biolink:CellLine: 1
- biolink:OrganismalEntity: 1
- biolink:Event: 1
- biolink:EnvironmentalProcess: 1
- biolink:DrugExposure: 1
- biolink:Human: 1
- biolink:ProteinDomain: 1
- biolink:Patent: 1
- biolink:Study: 1
- biolink:AccessibleDnaRegion: 1
- biolink:BiologicalSex: 1
- biolink:StudyVariable: 1
- biolink:Zygosity: 1
- biolink:ReagentTargetedGene: 1
- biolink:Exon: 1
- biolink:DiagnosticAid: 1
- biolink:DatasetDistribution: 1
- biolink:Genome: 1
- biolink:MaterialSample: 1
- biolink:MicroRNA: 1
- biolink:IndividualOrganism: 1
- biolink:GenotypicSex: 1
- biolink:Polypeptide: 1
- biolink:PhenotypicSex: 1
- biolink:RegulatoryRegion: 1
- biolink:SiRNA: 1
- biolink:Snv: 1
- biolink:TranscriptionFactorBindingSite: 1
- biolink:Treatment: 1
- biolink:WebPage: 1

CPU times: user 4.69 s, sys: 841 ms, total: 5.53 s
Wall time: 2.56 s
In [11]:
%%time

et_column = "predicate"
et_counts = df_edges.groupby(et_column).size().compute()
et_counts = et_counts.sort_values(ascending=False)

print(len(et_counts), "edge types, sorted by their frequency:")
for key, val in et_counts.items():
    print(f"- {key}: {val}")
print()
28 edge types, sorted by their frequency:
- biolink:interacts_with: 2799181
- biolink:expressed_in: 2320065
- biolink:has_phenotype: 1703070
- biolink:enables: 839097
- biolink:actively_involved_in: 787306
- biolink:orthologous_to: 551418
- biolink:located_in: 500184
- biolink:subclass_of: 491204
- biolink:related_to: 282852
- biolink:participates_in: 272586
- biolink:acts_upstream_of_or_within: 181576
- biolink:active_in: 160549
- biolink:part_of: 96113
- biolink:causes: 16839
- biolink:is_sequence_variant_of: 15605
- biolink:model_of: 9902
- biolink:acts_upstream_of: 9366
- biolink:has_mode_of_inheritance: 8577
- biolink:gene_associated_with_condition: 8026
- biolink:contributes_to: 7746
- biolink:treats_or_applied_or_studied_to_treat: 5653
- biolink:associated_with_increased_likelihood_of: 3244
- biolink:colocalizes_with: 2937
- biolink:genetically_associated_with: 2156
- biolink:acts_upstream_of_positive_effect: 549
- biolink:acts_upstream_of_or_within_positive_effect: 512
- biolink:acts_upstream_of_negative_effect: 196
- biolink:acts_upstream_of_or_within_negative_effect: 180

CPU times: user 1min 39s, sys: 13.3 s, total: 1min 52s
Wall time: 28.4 s
In [12]:
# Correctness checks

# 1) Do the counts of different node types add up to the total number of nodes?
sum_node_types = nt_counts.sum()
assert sum_node_types == num_nodes, f"Node counts differ: {sum_node_types} != {num_nodes}"
print(f"{sum_node_types:,} = {num_nodes:,} nodes")

# 2) Do the counts of different edge types add up to the total number of edges?
sum_edge_types = et_counts.sum()
assert sum_edge_types == num_edges, f"Edge counts differ: {sum_edge_types} != {num_edges}"
print(f"{sum_edge_types:,} = {num_edges:,} edges")
1,028,155 = 1,028,155 nodes
11,076,689 = 11,076,689 edges

Interpretation:

  • Inspecting the raw data resulted in 80 node types, which matches the number mentioned in the stats file.
  • Inspecting the raw data resulted in 28 edge types, which matches the number mentioned in the stats file.
  • Looking at relative frequencies of node and edge types suggests that the dataset is rather unbalanced.
    • The most frequent node type is "biolink:Gene" with 571,074 instances, while the least frequent node types only have 1 instance each.
    • The most frequent edge type is "biolink:interacts_with" with 2,799,181 instances, while the least frequent edge type is "biolink:acts_upstream_of_or_within_negative_effect" with 180 instances, a difference of four orders of magnitude.
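
The type counts above are computed with a Dask groupby. As a rough illustration of the same counting idea using only the standard library, the sketch below streams over TSV lines and tallies one column. The function name count_column_values and the toy identifiers are made up for this example and are not part of the notebook or the dataset.

```python
import csv
import io
from collections import Counter


def count_column_values(lines, column, delimiter="\t"):
    """Count how often each value occurs in one column of a TSV stream."""
    reader = csv.DictReader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


# Toy data with made-up identifiers, laid out like the first two node columns
toy_tsv = io.StringIO(
    "id\tcategory\n"
    "EX:gene1\tbiolink:Gene\n"
    "EX:gene2\tbiolink:Gene\n"
    "EX:disease1\tbiolink:Disease\n"
)
toy_counts = count_column_values(toy_tsv, "category")
for node_type, cnt in toy_counts.most_common():
    print(f"- {node_type}: {cnt}")
```

Passing an open file handle instead of the StringIO object would apply the same function to a real TSV file, one line at a time.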

c) Example entries¶

This section prints some example entries of the raw data. It gives an impression of the format chosen by the authors, which differs greatly between projects due to a lack of a broadly accepted standard for biomedical knowledge graphs.

In [13]:
def report_first_n_items(data, n):
    return data.head(n)
In [14]:
def report_last_n_items(data, n):
    return data.tail(n)

Nodes together with node annotations¶

In [15]:
report_first_n_items(df_nodes, 2)
Out[15]:
id category name xref has_gene in_taxon in_taxon_label provided_by description synonym full_name symbol type deprecated iri same_as
0 CLINVAR:586 biolink:SequenceVariant NM_000277.2(PAH):c.1A>G (p.Met1Val) CA114360 HGNC:8582 NCBITaxon:9606 Homo sapiens clingen_variant_nodes NaN NaN NaN NaN NaN NaN NaN NaN
1 CLINVAR:102844 biolink:SequenceVariant NM_000277.2(PAH):c.806delT (p.Ile269Thrfs) CA229778 HGNC:8582 NCBITaxon:9606 Homo sapiens clingen_variant_nodes NaN NaN NaN NaN NaN NaN NaN NaN
In [16]:
report_last_n_items(df_nodes, 2)
Out[16]:
id category name xref has_gene in_taxon in_taxon_label provided_by description synonym full_name symbol type deprecated iri same_as
450246 MGI:7608104 biolink:Genotype Crlf3<sup>em1Gtm</sup>/Crlf3<sup>+</sup> [bac... NaN NaN NCBITaxon:10090 Mus musculus alliance_genotype_nodes NaN NaN NaN NaN genotype NaN NaN NaN
450247 MGI:7608107 biolink:Genotype Cdc23<sup>em1Lwa</sup>/Cdc23<sup>em1Lwa</sup> ... NaN NaN NCBITaxon:10090 Mus musculus alliance_genotype_nodes NaN NaN NaN NaN genotype NaN NaN NaN

Edges together with edge annotations¶

In [17]:
report_first_n_items(df_edges, 2)
Out[17]:
id original_subject predicate original_object category agent_type aggregator_knowledge_source knowledge_level primary_knowledge_source publications ... frequency_qualifier has_count has_percentage has_quotient has_total negated onset_qualifier sex_qualifier subject object
0 3dfcb65a-26a2-11ef-ace6-e9678ebf82fc NaN biolink:has_phenotype NaN biolink:GenotypeToPhenotypicFeatureAssociation manual_agent infores:monarchinitiative knowledge_assertion infores:zfin ZFIN:ZDB-PUB-060503-2 ... NaN NaN NaN NaN NaN NaN NaN NaN ZFIN:ZDB-FISH-150901-1 ZP:0000041
1 3dfcb65b-26a2-11ef-ace6-e9678ebf82fc NaN biolink:has_phenotype NaN biolink:GenotypeToPhenotypicFeatureAssociation manual_agent infores:monarchinitiative knowledge_assertion infores:zfin ZFIN:ZDB-PUB-060503-2 ... NaN NaN NaN NaN NaN NaN NaN NaN ZFIN:ZDB-FISH-150901-1 ZP:0000055

2 rows × 25 columns

In [18]:
report_last_n_items(df_edges, 2)
Out[18]:
id original_subject predicate original_object category agent_type aggregator_knowledge_source knowledge_level primary_knowledge_source publications ... frequency_qualifier has_count has_percentage has_quotient has_total negated onset_qualifier sex_qualifier subject object
229025 uuid:3324568f-4007-11ef-89e7-6fe0be41fbbf MESH:C426686 biolink:treats_or_applied_or_studied_to_treat MESH:D020521 biolink:ChemicalToDiseaseOrPhenotypicFeatureAs... manual_agent infores:monarchinitiative knowledge_assertion infores:ctd PMID:16305522|PMID:17885258 ... NaN NaN NaN NaN NaN NaN NaN NaN CHEBI:65172 MONDO:0005098
229026 uuid:33245690-4007-11ef-89e7-6fe0be41fbbf MESH:C426686 biolink:treats_or_applied_or_studied_to_treat MESH:D054556 biolink:ChemicalToDiseaseOrPhenotypicFeatureAs... manual_agent infores:monarchinitiative knowledge_assertion infores:ctd PMID:16123915|PMID:16305522 ... NaN NaN NaN NaN NaN NaN NaN NaN CHEBI:65172 MONDO:0005399

2 rows × 25 columns

5. Schema discovery¶

This section analyzes the structure of the knowledge graph by determining which types of nodes are connected by which types of edges. To construct this overview, it is necessary to iterate over the entire data once. The result is a condensed representation of all entities and relations, which is known as a data model or schema in the context of graph databases.
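
The idea can be sketched on a toy example: every edge between two concrete nodes is reduced to a triple of (subject type, predicate, object type), and collecting these triples in a set removes duplicates. The identifiers below and the function name derive_schema are made up for illustration.

```python
# Toy data: the identifiers and types below are made up for illustration
toy_node_types = {
    "gene:1": "biolink:Gene",
    "gene:2": "biolink:Gene",
    "disease:1": "biolink:Disease",
}
toy_edges = [
    ("gene:1", "biolink:interacts_with", "gene:2"),
    ("gene:2", "biolink:interacts_with", "gene:1"),
    ("gene:1", "biolink:causes", "disease:1"),
]


def derive_schema(node_types, edges):
    """Collapse instance-level edges into unique (type, predicate, type) triples."""
    return {(node_types[s], p, node_types[o]) for s, p, o in edges}


toy_schema = derive_schema(toy_node_types, toy_edges)
for triple in sorted(toy_schema):
    print(triple)
```

Three instance-level edges collapse into two schema triples here, because both interacts_with edges connect a gene to a gene.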

In [19]:
node_type_to_color = {
    "biolink:Drug": "green",
    "biolink:ChemicalEntity": "green",
    "biolink:MolecularEntity": "green",
    "biolink:SmallMolecule": "green",

    "biolink:Gene": "blue",
    "biolink:Protein": "blue",

    "biolink:Disease": "red",
    "biolink:Pathway": "red",
    "biolink:BiologicalProcessOrActivity": "red",
}
In [20]:
%%time

node_id_to_type = {row.id: row.category for row in df_nodes.itertuples()}
CPU times: user 20.4 s, sys: 1.76 s, total: 22.1 s
Wall time: 22.2 s
In [21]:
%%time

unique_triples = set()
for row in df_edges.itertuples():
    s = node_id_to_type[row.subject]
    p = row.predicate
    o = node_id_to_type[row.object]
    triple = (s, p, o)
    unique_triples.add(triple)
CPU times: user 4min 42s, sys: 15.3 s, total: 4min 57s
Wall time: 4min 57s
In [22]:
gs = ig.Graph(directed=True)
unique_nodes = set()
for s, p, o in unique_triples:
    for node in (s, o):
        if node not in unique_nodes:
            unique_nodes.add(node)
            
            node_size = int(nt_counts[node])
            node_color = node_type_to_color.get(node, '')
            node_hover = f"{node}\n\n{nt_counts[node]} nodes of this type are contained in the knowledge graph."
            gs.add_vertex(node, size=node_size, color=node_color, label_color=node_color, hover=node_hover)

    edge_size = int(et_counts[p])
    edge_color = node_type_to_color.get(s, '')
    edge_hover = f"{p}\n\n{et_counts[p]} edges of this type are contained in the knowledge graph."
    gs.add_edge(s, o, size=edge_size, color=edge_color, hover=edge_hover, label=p, label_color="gray", label_size=5)

gs.vcount(), gs.ecount()
Out[22]:
(36, 288)
In [23]:
fig = gv.d3(
    gs,
    show_node_label=True,
    node_label_data_source="name",

    show_edge_label=True,
    edge_label_data_source="label",
    edge_curvature=0.2,

    use_node_size_normalization=True,
    node_size_normalization_min=10,
    node_size_normalization_max=50,
    node_drag_fix=True,
    node_hover_neighborhood=True,
    
    use_edge_size_normalization=True,
    edge_size_normalization_max=3,

    many_body_force_strength=-3000,
    zoom_factor=0.3,
)
fig
Out[23]:
[Interactive gravis visualization of the schema graph]
In [24]:
# Export the schema visualization
schema_filepath = os.path.join(results_dir, f"{project_name}_schema.html")
fig.export_html(schema_filepath, overwrite=True)

Interpretation:

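  • Each node in the schema corresponds to one of the 80 node types in the data. Only 36 of them appear in the schema, because it is derived from the edges and the remaining node types do not occur at either end of any edge.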
    • Node size represents the number of instances, i.e. how often that node type is present in the knowledge graph. The exact number can also be seen when hovering over a node.
    • Node color represents particular node types. The coloring scheme is based on a deliberately simple RGB palette with the same meaning across multiple notebooks to enable some visual comparison. The idea behind it is to highlight an interplay of certain entities, namely that drugs (or small molecules in general) can bind to proteins (or gene products in general) and thereby alter diseases (or involved pathways).
      • green = drugs & other small molecules (e.g. toxins)
      • blue = genes & gene products (e.g. proteins or RNAs)
      • red = diseases & related concepts (e.g. pathways)
      • black = all other types of entities
  • Each edge in the schema stands for one of the 28 edge types in the data. The same edge type can appear between different pairs of node types, which is why the schema contains 288 edges.
    • Edge size represents the number of instances, i.e. how often that edge type is present in the knowledge graph.
    • Edge color is identical to the color of the source node, again to highlight the interplay between drugs, targets and diseases.

6. Knowledge graph reconstruction¶

This section first converts the raw data to an intermediate format used in several notebooks, and then reconstructs the knowledge graph from the standardized data with shared code.

  • The intermediate form of the data is created as two simple Python lists, one for nodes and the other for edges, which can be exported to two CSV files.
  • The knowledge graph is built as a graph object from the Python package igraph, which can be exported to a GraphML file.

a) Convert the data into a standardized format¶

Transform the raw data to a standardized format that is compatible with most biomedical knowledge graphs in order to enable shared downstream processing:

  • Each node is represented by three items: id (str), type (str), properties (dict)
  • Each edge is represented by four items: source_id (str), target_id (str), type (str), properties (dict)

This format was initially inspired by a straightforward way in which the content of a Neo4j graph database can be exported to two CSV files, one for all nodes and the other for all edges. This is an effect of the property graph model used in Neo4j and many other graph databases, which also appears to be general enough to fully capture the majority of biomedical knowledge graphs described in scientific literature, despite the large variety of formats they are shared in.

A second motivation was that each line represents a single node or edge, and that no entry is connected to any sections at other locations, such as property descriptions at the beginning of a GraphML file. This structural simplicity makes it very easy to load just a subset of nodes and edges by picking a subset of lines, or to skip the loading of properties if they are not required for a task simply by ignoring a single column.
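
A small sketch can make this concrete. It writes made-up nodes in the (id, type, properties) layout to CSV text with the properties serialized as a JSON string, then reads back only the first two lines and ignores the properties column. The exact column names and quoting used by the shared export functions are not shown in this notebook, so the header below is an assumption.

```python
import csv
import io
import json

# Made-up nodes in the (id, type, properties) layout described above
toy_nodes = [
    ("gene:1", "biolink:Gene", {"label": "Example gene"}),
    ("gene:2", "biolink:Gene", {"label": "Another gene"}),
    ("disease:1", "biolink:Disease", {"label": "Example disease"}),
]

# Write one node per line, with the properties serialized as a JSON string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "type", "properties"])  # assumed header
for node_id, node_type, properties in toy_nodes:
    writer.writerow([node_id, node_type, json.dumps(properties)])

# Read back only the first two data lines and ignore the properties column
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
subset = [(row[0], row[1]) for _, row in zip(range(2), reader)]
print(subset)
```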

Finally, this format also allows loading the data directly into popular SQL databases like SQLite, MySQL or PostgreSQL with built-in CSV functions (CSV in SQLite, CSV in MySQL, CSV in PostgreSQL). Further, the JSON string in the property column can be accessed directly by built-in JSON functions (JSON in SQLite, JSON in MySQL, JSON in PostgreSQL), which enables sophisticated queries that access or modify specific properties within the JSON data.
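
As a hedged illustration with SQLite, the sketch below stores made-up nodes in an in-memory table and reads a single property out of the JSON column with json_extract. For brevity the rows are inserted from Python rather than imported from a CSV file, and the table layout is an assumption. The example also assumes an SQLite build with JSON support, which is the default in recent versions.

```python
import json
import sqlite3

# Made-up nodes in the (id, type, properties) layout, properties as JSON text
toy_rows = [
    ("gene:1", "biolink:Gene", json.dumps({"label": "Example gene"})),
    ("disease:1", "biolink:Disease", json.dumps({"label": "Example disease"})),
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, properties TEXT)")
connection.executemany("INSERT INTO nodes VALUES (?, ?, ?)", toy_rows)

# json_extract reads a single property out of the JSON string
query = "SELECT id, json_extract(properties, '$.label') FROM nodes WHERE type = ?"
gene_labels = connection.execute(query, ("biolink:Gene",)).fetchall()
connection.close()
print(gene_labels)
```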

Nodes¶

In [25]:
%%time

nodes = []
for row in df_nodes.itertuples():
    node_id = row.id
    node_type = row.category
    node_properties = {
        "label": row.name,  # Caution: The attribute "name" is reserved in igraph as unique identifier of a node, therefore using "label"
        "description": row.description,
        "xref": row.xref,
        "provided_by": row.provided_by,
        "synonym": row.synonym,
        "full_name": row.full_name,
        "in_taxon": row.in_taxon,
        "in_taxon_label": row.in_taxon_label,
        "symbol": row.symbol,
        "deprecated": row.deprecated,
        "iri": row.iri,
        "same_as": row.same_as,
    }
    node = (node_id, node_type, node_properties)
    nodes.append(node)
CPU times: user 26.7 s, sys: 2.48 s, total: 29.1 s
Wall time: 29.1 s

Edges¶

In [26]:
%%time

edges = []
for row in df_edges.itertuples():
    source_id = row.subject
    target_id = row.object
    edge_type = row.predicate
    edge_properties = {
        "id": row.id,
        "original_subject": row.original_subject,
        "original_object": row.original_object,
        "category": row.category,
        "agent_type": row.agent_type,
        "aggregator_knowledge_source": row.aggregator_knowledge_source,
        "knowledge_level": row.knowledge_level,
        "primary_knowledge_source": row.primary_knowledge_source,
        "qualifiers": row.qualifiers,
        "provided_by": row.provided_by,
        "has_evidence": row.has_evidence,
        "publications": row.publications,
        "stage_qualifier": row.stage_qualifier,
        "frequency_qualifier": row.frequency_qualifier,
        "has_count": row.has_count,
        "has_percentage": row.has_percentage,
        "has_quotient": row.has_quotient,
        "has_total": row.has_total,
        "negated": row.negated,
        "onset_qualifier": row.onset_qualifier,
        "sex_qualifier": row.sex_qualifier,
    }
    edge = (source_id, target_id, edge_type, edge_properties)
    edges.append(edge)
CPU times: user 5min 36s, sys: 25.4 s, total: 6min 1s
Wall time: 6min 1s

b) Export the standardized data to two CSV files¶

Both the id and type items are simple strings, while the properties item is a collection of key-value pairs represented by a Python dictionary that can be converted to a single JSON string, which the export function does internally. This means each node is fully represented by three strings, and each edge by four strings due to having a source id and a target id.

In [27]:
%%time

nodes_csv_filepath = shared_bmkg.export_nodes_as_csv(nodes, results_dir, project_name)
CPU times: user 43.3 s, sys: 1.66 s, total: 44.9 s
Wall time: 45 s
In [28]:
%%time

edges_csv_filepath = shared_bmkg.export_edges_as_csv(edges, results_dir, project_name)
CPU times: user 12min 13s, sys: 33.6 s, total: 12min 46s
Wall time: 12min 48s

c) Use the standardized data to build a graph¶

Reconstruct the knowledge graph in the form of a Graph object from the package igraph. This kind of graph object supports directed multi-edges, i.e. an edge has a source and a target node, and two nodes can be connected by more than one edge. It also supports node and edge properties. These features are necessary and sufficient to represent almost any biomedical knowledge graph found in academic literature.

In [29]:
%%time

g = shared_bmkg.create_graph(nodes, edges)
CPU times: user 2min 58s, sys: 20 s, total: 3min 18s
Wall time: 3min 17s
In [30]:
shared_bmkg.report_graph_stats(g)
Directed multigraph with 1028155 nodes, 11076689 edges and a density of 1.048e-05.
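
The reported density can be checked by hand. The sketch below assumes the usual definition for directed graphs without self-loops, i.e. the number of edges divided by the number of ordered node pairs. Whether shared_bmkg.report_graph_stats uses exactly this definition is an assumption, since its implementation is not shown here.

```python
def directed_density(num_nodes, num_edges):
    """Number of edges divided by the number of ordered node pairs without self-loops."""
    return num_edges / (num_nodes * (num_nodes - 1))


density = directed_density(1_028_155, 11_076_689)
print(f"{density:.3e}")
```

Under this assumption the result agrees with the value of 1.048e-05 printed above, which reflects how sparse the graph is: only about one in 100,000 possible directed connections is present.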
In [31]:
# Correctness checks

# 1) Does the reconstructed graph contain the same number of nodes as the raw data?
num_nodes_in_graph = g.vcount()
assert num_nodes_in_graph == num_nodes, f"Node counts differ: {num_nodes_in_graph} != {num_nodes}"
print(f"{num_nodes_in_graph:,} = {num_nodes:,}")

# 2) Does the reconstructed graph contain the same number of (unique) edges as the raw data?
num_edges_in_graph = g.ecount()
assert num_edges_in_graph == num_edges, f"Edge counts differ: {num_edges_in_graph} != {num_edges}"
print(f"{num_edges_in_graph:,} = {num_edges:,}")
1,028,155 = 1,028,155
11,076,689 = 11,076,689

d) Export the graph to a GraphML file¶

Export the graph with all nodes, edges and properties as a single GraphML file.

In [32]:
%%time

g_graphml_filepath = shared_bmkg.export_graph_as_graphml(g, results_dir, project_name)
CPU times: user 4min 27s, sys: 25.4 s, total: 4min 52s
Wall time: 4min 50s

7. Subgraph exploration¶

This section explores small subgraphs of the knowledge graph in two ways: first by inspecting the direct neighborhood of a selected node, and second by finding shortest paths between two chosen nodes.
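
Both kinds of exploration can be sketched on a made-up toy graph with the standard library. The adjacency list and the function names neighborhood and shortest_path below are hypothetical. The notebook itself operates on the igraph object g, and this sketch only illustrates the underlying ideas.

```python
from collections import deque

# Made-up directed toy graph as an adjacency list
toy_graph = {
    "drug": ["protein"],
    "protein": ["pathway", "disease"],
    "pathway": ["disease"],
    "disease": [],
}


def neighborhood(graph, node):
    """Return the direct successors and predecessors of a node."""
    successors = set(graph.get(node, []))
    predecessors = {source for source, targets in graph.items() if node in targets}
    return successors | predecessors


def shortest_path(graph, start, goal):
    """Breadth-first search for one shortest directed path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(neighborhood(toy_graph, "protein"))
print(shortest_path(toy_graph, "drug", "disease"))
```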

As a simple case study, the goal is to identify some nodes in the knowledge graph that are associated with the success story of the drug Imatinib, which was one of the first targeted therapies against cancer. Detailed background information can be found, for example, in an article by the National Cancer Institute and in a talk by Brian Druker, who played a major role in the development of this paradigm-changing drug. To give a simplified summary, the following biological entities and relationships are involved:

  • Mutation: In a bone marrow stem cell, a translocation event between chromosomes 9 and 22 leads to what has been called the Philadelphia chromosome, which can be seen under a microscope and was named after the city in which it was originally discovered.
  • Gene: It turned out that this particular rearrangement of DNA fuses the BCR gene on chromosome 22 to the ABL1 gene on chromosome 9, resulting in a new fusion gene known as BCR-ABL1.
  • Disease: BCR-ABL1 acts as an oncogene, because it expresses a protein that is a defective tyrosine kinase in a permanent "on" state, which leads to uncontrolled growth of certain white blood cells and their precursors, thereby driving the disease Chronic Myelogenous Leukemia (CML).
  • Drug: Imatinib (Gleevec) was the first demonstration that a potent and selective Bcr-Abl tyrosine-kinase inhibitor (TKI) is possible and that such a targeted inhibition of an oncoprotein halts the uncontrolled growth of leukemia cells with BCR-ABL1, while having significantly less effect on other cells in the body compared to conventional chemotherapies used in cancer. This revolutionized the treatment of CML and drastically improved the five-year survival rate of patients from less than 20% to over 90%, as well as their quality of life.

In reality the story is a bit more complex, for example because there are other genes involved in disease progression, there are many closely related forms of leukemia, BCR-ABL1 also plays a role in other forms of cancer, there are several drugs available as treatment options today, all of them bind to more than one target and with different affinities, and their individual binding profiles are relevant to their particular therapeutic effects. Inspecting the knowledge graph will focus on highlighting some entities of the simplified story, but the surrounding elements will also indicate some of the complexity encountered in reality. Some simple forms of reasoning on the knowledge graph will demonstrate its potential for discovering new patterns and hypotheses.

a) Search for interesting nodes¶

In [33]:
# Drug: Imatinib - seems not to be contained in Monarch
shared_bmkg.list_nodes_matching_substring(g, "imatinib", "label")
id                        type               label                               
=================================================================================
Reactome:R-HSA-9669917    biolink:Pathway    Imatinib-resistant KIT mutants      
Reactome:R-HSA-9674396    biolink:Pathway    Imatinib-resistant PDGFR mutants    
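
The search helper is part of shared_bmkg and its implementation is not shown here. Judging from the output, the lowercase query "imatinib" matches labels starting with "Imatinib", so the comparison appears to be case-insensitive. A minimal sketch under that assumption, using the hypothetical function find_nodes_by_substring on plain (id, type, properties) tuples instead of the igraph object:

```python
def find_nodes_by_substring(nodes, substring, property_name):
    """Return (id, type, value) for nodes whose property contains the substring.

    Hypothetical stand-in for shared_bmkg.list_nodes_matching_substring;
    the case-insensitive comparison is an assumption based on the output.
    """
    needle = substring.lower()
    matches = []
    for node_id, node_type, properties in nodes:
        value = properties.get(property_name)
        # Skip missing or non-string values such as NaN
        if isinstance(value, str) and needle in value.lower():
            matches.append((node_id, node_type, value))
    return matches


# Two records taken from the output above plus one non-matching record
toy_nodes = [
    ("Reactome:R-HSA-9669917", "biolink:Pathway", {"label": "Imatinib-resistant KIT mutants"}),
    ("Reactome:R-HSA-9674396", "biolink:Pathway", {"label": "Imatinib-resistant PDGFR mutants"}),
    ("HGNC:76", "biolink:Gene", {"label": "ABL1"}),
]
hits = find_nodes_by_substring(toy_nodes, "imatinib", "label")
for node_id, node_type, label in hits:
    print(node_id, node_type, label)
```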
In [34]:
# Gene: ABL1
shared_bmkg.list_nodes_matching_substring(g, "abl1", "label")
id                          type                label                                                                                                                                                                                                                                        
=============================================================================================================================================================================================================================================================================================
HGNC:76                     biolink:Gene        ABL1                                                                                                                                                                                                                                         
MGI:2653892                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>+</sup>  [background:] involves: 129S/SvEv * 129S4/SvJae * C57BL/6J                                                                                               
MGI:2653894                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>+</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>tm1Ajk</sup>  [background:] involves: 129S/SvEv * 129S4/SvJae * C57BL/6J                                                                                               
MGI:2653897                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>tm1Ajk</sup>  [background:] involves: 129S/SvEv * 129S4/SvJae * C57BL/6J                                                                                          
MGI:2665033                 biolink:Genotype    Tg(Igh-Abl1)40Sco/0  [background:] involves: C57BL/6JWehi * SJL/JWehi                                                                                                                                                                        
MGI:2665034                 biolink:Genotype    Tg(Igh-Abl1)40Sco/0 Tg(IghMyc)22Bri/0  [background:] involves: C57BL/6 * C57BL/6JWehi * SJL/J * SJL/JWehi                                                                                                                                    
MGI:3574537                 biolink:Genotype    Dok1<sup>tm1Yyam</sup>/Dok1<sup>tm1Yyam</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: C57BL/6                                                                                                                                        
MGI:3574538                 biolink:Genotype    Dok1<sup>tm1Yyam</sup>/Dok1<sup>tm1Yyam</sup> Dok2<sup>tm1Yyam</sup>/Dok2<sup>tm1Yyam</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: C57BL/6                                                                                          
MGI:3574539                 biolink:Genotype    Dok2<sup>tm1Yyam</sup>/Dok2<sup>tm1Yyam</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: C57BL/6                                                                                                                                        
MGI:3574541                 biolink:Genotype    Dok1<sup>tm1Ppp</sup>/Dok1<sup>tm1Ppp</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: 129S1/Sv                                                                                                                                         
MGI:3574544                 biolink:Genotype    Dok2<sup>tm1Ppp</sup>/Dok2<sup>tm1Ppp</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: 129S1/Sv                                                                                                                                         
MGI:3574545                 biolink:Genotype    Dok1<sup>tm1Ppp</sup>/Dok1<sup>+</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: 129S1/Sv                                                                                                                                              
MGI:3574546                 biolink:Genotype    Dok2<sup>tm1Ppp</sup>/Dok2<sup>+</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: 129S1/Sv                                                                                                                                              
MGI:3583883                 biolink:Genotype    Abl1<sup>tm1Ajk</sup>/Abl1<sup>tm1Ajk</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>tm1Ajk</sup> Tg(Nes-cre)1Kag/?  [background:] involves: 129S4/SvJae                                                                                               
MGI:3693360                 biolink:Genotype    Tg(tetO-BCR/ABL1)27Dgt/0 Tg(MMTVtTA)1Mam/0  [background:] involves: C57BL/6 * FVB/N * SJL                                                                                                                                                    
MGI:3693361                 biolink:Genotype    Tg(tetO-BCR/ABL1)2Dgt/0 Tg(MMTVtTA)1Mam/0  [background:] involves: C57BL/6 * FVB/N * SJL                                                                                                                                                     
MGI:3693373                 biolink:Genotype    Tg(tetO-BCR/ABL1)2Dgt/0 Tg(Tal1-tTA)19Dgt/0  [background:] involves: C57BL/6 * DBA/2 * FVB/N                                                                                                                                                 
MGI:3828503                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup>  [background:] involves: 129S/SvEv * C57BL/6J                                                                                                                                                    
MGI:3834480                 biolink:Genotype    Tg(Ly6a-BCR/ABL1)IS1AIsg/0 Tg(Ly6a-BCR/ABL1)IS1BIsg/0  [background:] involves: C57BL/6J * CBA                                                                                                                                                
MGI:3834481                 biolink:Genotype    Tg(Ly6a-BCR/ABL1)IS1AIsg/0  [background:] involves: C57BL/6J * CBA                                                                                                                                                                           
MGI:3834482                 biolink:Genotype    Tg(Ly6a-BCR/ABL1)IS1BIsg/0  [background:] involves: C57BL/6J * CBA                                                                                                                                                                           
MGI:3834483                 biolink:Genotype    Tg(Ly6a-TK,-BCR/ABL1)IS9AIsg/0  [background:] involves: C57BL/6J * CBA                                                                                                                                                                       
MGI:4354766                 biolink:Genotype    Cbl<sup>tm1Soga</sup>/Cbl<sup>tm1Soga</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: C57BL/6 * DBA/2                                                                                                                                  
MGI:4354767                 biolink:Genotype    Cbl<sup>tm1Soga</sup>/Cbl<sup>+</sup> Tg(Tec-BCR/ABL1)5Hhi/0  [background:] involves: C57BL/6 * DBA/2                                                                                                                                        
MGI:4357933                 biolink:Genotype    Abl1<sup>m1</sup>/Abl1<sup>+</sup>  [background:] involves: 129S/SvEv                                                                                                                                                                        
MGI:4357934                 biolink:Genotype    Abl1<sup>m1</sup>/Abl1<sup>m1</sup>  [background:] either: 129S/SvEv-Abl1<sup>m1</sup> or (involves: 129S/SvEv * CD-1) or (involves: 129S/SvEv * C57BL/6 * DBA/2)                                                                            
MGI:4357935                 biolink:Genotype    Abl1<sup>m1</sup>/Abl1<sup>m1</sup>  [background:] involves: 129S/SvEv                                                                                                                                                                       
MGI:4357938                 biolink:Genotype    Abl1<sup>m1</sup>/Abl1<sup>m1</sup>  [background:] involves: 129S/SvEv * C57BL/6                                                                                                                                                             
MGI:4361586                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup>  [background:] involves: 129S/SvEv * C57BL/6J * CBA                                                                                                                                              
MGI:4361587                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Tg(ACTB-Abl1*I)1Spg/0  [background:] involves: 129S/SvEv * C57BL/6J * CBA                                                                                                                        
MGI:4361588                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Tg(ACTB-Abl1*K290R)1Spg/0  [background:] involves: 129S/SvEv * C57BL/6J * CBA                                                                                                                    
MGI:4361589                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Tg(ACTB-Abl1*IV)1Spg/0  [background:] involves: 129S/SvEv * C57BL/6J * CBA                                                                                                                       
MGI:4361590                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup> Tg(ACTB-Abl1*I)1Spg/0 Tg(ACTB-Abl1*IV)1Spg/0  [background:] involves: 129S/SvEv * C57BL/6J * CBA                                                                                                 
MGI:4421538                 biolink:Genotype    Abl1<sup>tm1Goff</sup>/Abl1<sup>tm1Goff</sup>  [background:] involves: C57BL/6                                                                                                                                                               
MGI:4421539                 biolink:Genotype    Abl1<sup>tm1Goff</sup>/Abl1<sup>tm1Goff</sup> Tg(Myh6-cre)2182Mds/0  [background:] involves: C57BL/6                                                                                                                                         
MGI:4421541                 biolink:Genotype    Abl1<sup>tm1.1Goff</sup>/Abl1<sup>tm1.1Goff</sup>  [background:] involves: C57BL/6                                                                                                                                                           
MGI:4421542                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup>  [background:] 129S/SvEv-Abl1<sup>tm1Mlg</sup>                                                                                                                                                   
MGI:4421543                 biolink:Genotype    Abl1<sup>tm1Mlg</sup>/Abl1<sup>tm1Mlg</sup>  [background:] B6.129-Abl1<sup>tm1Mlg</sup>                                                                                                                                                      
MGI:4421544                 biolink:Genotype    Abl1<sup>m1</sup>/Abl1<sup>m1</sup>  [background:] B6.129-Abl1<sup>m1</sup>                                                                                                                                                                  
MGI:4850043                 biolink:Genotype    Abl1<sup>tm2.2Goff</sup>/Abl1<sup>tm2.2Goff</sup>  [background:] involves: 129P2/OlaHsd * C57BL/6J * FVB/N                                                                                                                                   
MGI:4850044                 biolink:Genotype    Abl1<sup>tm2.1Goff</sup>/Abl1<sup>tm2.1Goff</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>tm1Ajk</sup> Tg(Nes-cre)1Kln/0  [background:] involves: 129 * C57BL/6 * C57BL/6J * SJL                                                                      
MGI:4850045                 biolink:Genotype    Abl1<sup>tm2.1Goff</sup>/Abl1<sup>tm2.1Goff</sup> Abl2<sup>tm1Ajk</sup>/Abl2<sup>tm1Ajk</sup> Tg(Atoh1-cre)1Bfri/0  [background:] involves: 129P2/OlaHsd * 129S4/SvJae * C57BL/6J * CBA                                                      
MGI:4939891                 biolink:Genotype    Abl1<sup>tm1Gcos</sup>/Abl1<sup>+</sup> Myf5<sup>tm2Tajb</sup>/Myf5<sup>tm2Tajb</sup> Myf6<sup>tm1Tajb</sup>/Myf6<sup>tm1Tajb</sup> Myod1<sup>tm2.1(icre)Glh</sup>/Myod1<sup>+</sup>  [background:] involves: 129S * 129X1/SvJ * C57BL/6J    
MGI:4939892                 biolink:Genotype    Abl1<sup>tm1.1Gcos</sup>/Abl1<sup>tm1.1Gcos</sup>  [background:] involves: 129S1/Sv * 129X1/SvJ                                                                                                                                              
MGI:4939893                 biolink:Genotype    Abl1<sup>tm1.1Gcos</sup>/Abl1<sup>+</sup>  [background:] involves: 129S1/Sv * 129X1/SvJ                                                                                                                                                      
MGI:5806781                 biolink:Genotype    Gt(ROSA)26Sor<sup>tm4(CAG-hsb5)Nki</sup>/Gt(ROSA)26Sor<sup>+</sup> Tg(Mx1-cre)1Cgn/0 Tg(Tal1-tTA)19Dgt/0 Tg(tetO-BCR/ABL1)2Dgt/0 TgTn(pb-sb-GrOnc)#aGsva/0  [background:] involves: 129P2/OlaHsd * C57BL/6 * CBA/J * DBA/2 * FVB/N           
MGI:5806784                 biolink:Genotype    Tg(Mx1-cre)1Cgn/0 Tg(Tal1-tTA)19Dgt/0 Tg(tetO-BCR/ABL1)2Dgt/0 TgTn(pb-sb-GrOnc)#aGsva/0  [background:] involves: C57BL/6 * CBA/J * DBA/2 * FVB/N                                                                                             
MGI:5806786                 biolink:Genotype    Gt(ROSA)26Sor<sup>tm4(CAG-hsb5)Nki</sup>/Gt(ROSA)26Sor<sup>+</sup> Tg(Mx1-cre)1Cgn/0 Tg(tetO-BCR/ABL1)2Dgt/0 TgTn(pb-sb-GrOnc)#aGsva/0  [background:] involves: 129P2/OlaHsd * C57BL/6 * CBA/J * FVB/N                                       
MGI:87859                   biolink:Gene        Abl1                                                                                                                                                                                                                                         
MONDO:0004653               biolink:Disease     atypical chronic myeloid leukemia, BCR-ABL1 negative                                                                                                                                                                                         
MONDO:0006115               biolink:Disease     blast phase chronic myelogenous leukemia, BCR-ABL1 positive                                                                                                                                                                                  
MONDO:0011996               biolink:Disease     chronic myelogenous leukemia, BCR-ABL1 positive                                                                                                                                                                                              
MONDO:0035112               biolink:Disease     acute myeloid leukemia with BCR-ABL1                                                                                                                                                                                                         
MONDO:0850161               biolink:Disease     B-lymphoblastic leukemia/lymphoma, BCR-ABL1–like                                                                                                                                                                                             
MONDO:0850449               biolink:Disease     mixed phenotype acute leukemia with BCR-ABL1                                                                                                                                                                                                 
NCBIGene:100524544          biolink:Gene        ABL1                                                                                                                                                                                                                                         
NCBIGene:417181             biolink:Gene        ABL1                                                                                                                                                                                                                                         
NCBIGene:491292             biolink:Gene        ABL1                                                                                                                                                                                                                                         
NCBIGene:540876             biolink:Gene        ABL1                                                                                                                                                                                                                                         
RGD:1584969                 biolink:Gene        Abl1                                                                                                                                                                                                                                         
Xenbase:XB-GENE-17344225    biolink:Gene        abl1.L                                                                                                                                                                                                                                       
Xenbase:XB-GENE-6054675     biolink:Gene        abl1                                                                                                                                                                                                                                         
Xenbase:XB-GENE-6487639     biolink:Gene        abl1.S                                                                                                                                                                                                                                       
ZFIN:ZDB-GENE-100812-9      biolink:Gene        abl1                                                                                                                                                                                                                                         
In [35]:
# Disease: Chronic Myeloid Leukemia (CML)
shared_bmkg.list_nodes_matching_substring(g, "chronic myelogenous leukemia", "label")
id               type                         label                                                          
=============================================================================================================
HP:0005506       biolink:PhenotypicFeature    Chronic myelogenous leukemia                                   
MONDO:0006115    biolink:Disease              blast phase chronic myelogenous leukemia, BCR-ABL1 positive    
MONDO:0011996    biolink:Disease              chronic myelogenous leukemia, BCR-ABL1 positive                
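
The two searches above return labels that differ in capitalization from the query ("Abl1", "abl1", "Chronic myelogenous leukemia"), which suggests a case-insensitive substring match. A minimal, library-free sketch of such a lookup follows. The function `find_nodes_by_substring` and the toy `NODES` table are hypothetical and are not the implementation in shared_bmkg.py.

```python
# Toy node table (id -> attributes); the three rows are copied from the outputs above.
NODES = {
    "HP:0005506": {"type": "biolink:PhenotypicFeature",
                   "label": "Chronic myelogenous leukemia"},
    "MONDO:0011996": {"type": "biolink:Disease",
                      "label": "chronic myelogenous leukemia, BCR-ABL1 positive"},
    "HGNC:76": {"type": "biolink:Gene", "label": "ABL1"},
}


def find_nodes_by_substring(nodes, substring, key="label"):
    """Return sorted (id, type, value) tuples whose `key` attribute contains `substring`.

    Assumption: matching ignores case, as the notebook outputs suggest.
    """
    needle = substring.casefold()
    hits = []
    for node_id, attrs in nodes.items():
        value = str(attrs.get(key, ""))
        if needle in value.casefold():
            hits.append((node_id, attrs.get("type", ""), value))
    return sorted(hits)


for row in find_nodes_by_substring(NODES, "chronic myelogenous leukemia"):
    print(*row, sep="    ")
```

Sorting by id keeps the result order stable, similar to the alphabetical order of the tables printed by the notebook.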

b) Explore the neighborhood of a chosen node¶

In [36]:
# Neighborhood of gene ABL1 - in Homo sapiens
source = "HGNC:76"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)

# Export
filename = f"{project_name}_neighbors_abl1"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
Directed multigraph with 528 nodes, 40417 edges and a density of 0.145.

Interpretation:

  • The neighborhood of the gene ABL1 in Monarch (528 nodes, 40417 edges) is too large to plot and analyze visually.
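
To make the reported numbers more tangible, here is a small sketch of an egocentric subgraph and its density on a plain edge list. It rests on two assumptions that are not confirmed by the notebook: that the neighborhood is the subgraph induced by the chosen node and its direct neighbors in either direction, and that density is computed as edges divided by nodes squared (the reported value 0.145 is numerically consistent with 40417 / 528²). The functions and the toy edge list are hypothetical.

```python
def egocentric_subgraph(edges, center):
    """Return (nodes, edges) induced by `center` and its direct neighbors.

    `edges` is a list of (source, predicate, target) triples of a directed multigraph.
    """
    nodes = {center}
    for source, _, target in edges:
        if source == center:
            nodes.add(target)
        elif target == center:
            nodes.add(source)
    # Keep every edge whose endpoints both lie in the neighborhood,
    # including edges between two neighbors.
    sub_edges = [e for e in edges if e[0] in nodes and e[2] in nodes]
    return nodes, sub_edges


def density(n_nodes, n_edges):
    """Assumed definition: edges / nodes**2 (directed graph, self-loops allowed)."""
    return n_edges / (n_nodes * n_nodes) if n_nodes else 0.0


# Toy data: identifiers starting with "EX:" and "example:" are made up for illustration.
EDGES = [
    ("HGNC:76", "biolink:causes", "MONDO:0011996"),
    ("HGNC:1014", "biolink:causes", "MONDO:0011996"),
    ("MONDO:0011996", "biolink:subclass_of", "MONDO:0004643"),
    ("HGNC:76", "example:toy_relation", "HGNC:1014"),
    ("MONDO:0004643", "biolink:subclass_of", "EX:parent"),
]

nodes, sub_edges = egocentric_subgraph(EDGES, "MONDO:0011996")
print(len(nodes), len(sub_edges), density(len(nodes), len(sub_edges)))
print(round(density(528, 40417), 3))  # numbers reported for the ABL1 neighborhood
```

Because edges between neighbors are kept, the edge count can grow much faster than the node count, which would explain why the ABL1 neighborhood is so dense.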
In [37]:
# Neighborhood of disease CML
source = "MONDO:0011996"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)

# Export
filename = f"{project_name}_neighbors_cml"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 44 nodes, 106 edges and a density of 0.05475.
Out[37]:

Interpretation:

  • Disease-Disease relations: The disease CML with id MONDO:0011996 (red node in the center) stands in a "biolink:subclass_of" relation to many other diseases (red nodes). Hovering over an arrow shows the type of relation. The schema also summarizes which edge types were found between pairs of node types. Between pairs of diseases, there can be "biolink:subclass_of" and "biolink:related_to" edges.
    • Example: The disease MONDO:0010809 (="familial chronic myelocytic leukemia-like syndrome") is a subclass of the disease CML with id MONDO:0011996 (="chronic myelogenous leukemia, BCR-ABL1 positive"), which in turn is a subclass of the disease MONDO:0004643 (="myeloid leukemia").
  • Disease-Gene relations: The disease CML stands in "biolink:gene_associated_with_condition" and "biolink:causes" relations with genes (blue nodes).
    • Example: The two genes HGNC:1014 (="BCR") and HGNC:76 (="ABL1") are connected to the disease CML by both of these relations, while the gene HGNC:10471 (="RUNX1") stands only in the weaker "biolink:gene_associated_with_condition" relation to CML.
  • Other relations: The disease CML stands in various relations to node types like "biolink:Genotype" or "biolink:PhenotypicFeature" (black nodes).
    • Example: The disease CML stands in a "biolink:has_phenotype" relation to the node HP:0001894 (="Thrombocytosis"), which is described as "Increased numbers of platelets in the peripheral blood". This information can be read by hovering over the edges or nodes in this visualization.
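
The relation types read off the visualization can also be tabulated. The following sketch counts the edges incident to a chosen node, grouped by direction, predicate and neighbor type. The function `summarize_relations` is hypothetical, and the toy data only restates the examples from the bullets above. The edge directions (gene to disease, disease to phenotype) are assumptions.

```python
from collections import Counter

NODE_TYPES = {
    "MONDO:0011996": "biolink:Disease",
    "MONDO:0010809": "biolink:Disease",
    "MONDO:0004643": "biolink:Disease",
    "HGNC:1014": "biolink:Gene",
    "HGNC:76": "biolink:Gene",
    "HGNC:10471": "biolink:Gene",
    "HP:0001894": "biolink:PhenotypicFeature",
}

# (source, predicate, target) triples; directions are assumed for illustration.
EDGES = [
    ("MONDO:0010809", "biolink:subclass_of", "MONDO:0011996"),
    ("MONDO:0011996", "biolink:subclass_of", "MONDO:0004643"),
    ("HGNC:1014", "biolink:causes", "MONDO:0011996"),
    ("HGNC:1014", "biolink:gene_associated_with_condition", "MONDO:0011996"),
    ("HGNC:76", "biolink:causes", "MONDO:0011996"),
    ("HGNC:76", "biolink:gene_associated_with_condition", "MONDO:0011996"),
    ("HGNC:10471", "biolink:gene_associated_with_condition", "MONDO:0011996"),
    ("MONDO:0011996", "biolink:has_phenotype", "HP:0001894"),
]


def summarize_relations(edges, node_types, center):
    """Count edges touching `center` by (direction, predicate, neighbor type)."""
    counts = Counter()
    for source, predicate, target in edges:
        if source == center:
            counts[("out", predicate, node_types.get(target, "unknown"))] += 1
        elif target == center:
            counts[("in", predicate, node_types.get(source, "unknown"))] += 1
    return counts


for key, count in sorted(summarize_relations(EDGES, NODE_TYPES, "MONDO:0011996").items()):
    print(count, *key)
```

Such a table scales to neighborhoods that are too large to inspect by hovering, like the one around ABL1.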

c) Find shortest paths between two chosen nodes¶

In [38]:
# Paths from gene ABL1 to disease "myeloid leukemia"
source = "HGNC:76"
target = "MONDO:0004643"
subgraph = shared_bmkg.get_paths_subgraph(g, source, target)

# Export
filename = f"{project_name}_paths_abl1_to_ML"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)

# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source, target)
Directed multigraph with 3 nodes, 2 edges and a density of 0.2222.
Out[38]:

Interpretation:

  • The gene HGNC:76 (="ABL1") is connected to the disease MONDO:0004643 (="myeloid leukemia") by only one shortest path, which runs via the disease MONDO:0011996 (="chronic myelogenous leukemia, BCR-ABL1 positive"). This path was already part of a previous subgraph showing the neighborhood of CML, but there could have been alternative paths as well.
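
The idea of collecting all shortest paths between two nodes can be sketched with a breadth-first search on a plain edge list. This is a minimal illustration, not the code behind `shared_bmkg.get_paths_subgraph`. The function `shortest_paths_subgraph`, the toy edge list and the identifiers starting with "EX:" and "example:" are hypothetical. The sketch follows edge directions and keeps all parallel edges between consecutive path nodes, which may differ from the helper used in the notebook.

```python
from collections import deque


def shortest_paths_subgraph(edges, source, target):
    """Return (nodes, edges) lying on any shortest directed path from source to target."""
    successors, predecessors = {}, {}
    for s, _, t in edges:
        successors.setdefault(s, set()).add(t)
        predecessors.setdefault(t, set()).add(s)

    # Forward BFS: distance of every reachable node from the source.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in successors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if target not in dist:
        return set(), []

    # Backward walk: keep predecessors that are exactly one step closer to the source.
    on_path = {target}
    stack = [target]
    while stack:
        v = stack.pop()
        for u in predecessors.get(v, ()):
            if u in dist and dist[u] + 1 == dist[v] and u not in on_path:
                on_path.add(u)
                stack.append(u)

    path_edges = [(s, p, t) for s, p, t in edges
                  if s in on_path and t in on_path and dist[s] + 1 == dist[t]]
    return on_path, path_edges


EDGES = [
    ("HGNC:76", "biolink:causes", "MONDO:0011996"),
    ("MONDO:0011996", "biolink:subclass_of", "MONDO:0004643"),
    ("HGNC:10471", "biolink:gene_associated_with_condition", "MONDO:0011996"),
    # Made-up detour of length 3, which is longer than the shortest path and thus dropped.
    ("HGNC:76", "example:toy_relation", "EX:a"),
    ("EX:a", "example:toy_relation", "EX:b"),
    ("EX:b", "example:toy_relation", "MONDO:0004643"),
]

nodes, path_edges = shortest_paths_subgraph(EDGES, "HGNC:76", "MONDO:0004643")
print(sorted(nodes))
print(path_edges)
```

If several shortest paths of equal length existed, all of their nodes and edges would end up in the result, which is the situation the interpretation above describes as possible alternative paths.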