HALD
This notebook explores the biomedical knowledge graph provided by the project Human Aging and Longevity Dataset (HALD): Publication (2023), Website, Code, Data
The source file of this notebook is hald.ipynb and can be found in the repository awesome-biomedical-knowledge-graphs that also contains information about similar projects.
1. Setup
This section prepares the environment for the following exploratory data analysis.
a) Import packages
From the Python standard library.
import os
From the Python Package Index (PyPI).
import gravis as gv # for visualization of the KG schema and subgraphs, developed by the author of this notebook
import igraph as ig
import pandas as pd
From a local Python module named shared_bmkg.py. The functions in it are used in several similar notebooks to reduce code repetition and to improve readability.
import shared_bmkg
b) Create data directories
The raw data provided by the project and the transformed data generated throughout this notebook are stored in separate directories. If the notebook is run more than once, the downloaded data is reused instead of fetching it again, but all data transformations are rerun.
project_name = "hald"
download_dir = os.path.join(project_name, "downloads")
results_dir = os.path.join(project_name, "results")
shared_bmkg.create_dir(download_dir)
shared_bmkg.create_dir(results_dir)
2. Data download
This section fetches the data published by the project on figshare. The latest available version at the time of creating this notebook was used: Version 6 (2023-12-12).
All files provided by the project
- Entity_Info.json: Nodes and some node annotations.
- Relation_Info.json: Edges and some edge annotations.
- Entities.csv, Roles.csv: Presumably the same information as above, but in a format compatible with the graph database Neo4j.
- Literature_Info.json: Background information about the literature used as input for the NLP pipeline that identified entities and relations in the texts.
- Aging_Biomarkers.json, Longevity_Biomarkers.json: Subsets of nodes identified in a downstream analysis as either enhancing or reducing aging.
Files needed to create the knowledge graph
- Entity_Info.json and Relation_Info.json contain all information required for reconstructing the knowledge graph.
- Alternatively, Entities.csv and Roles.csv could be used as well, but they are structured specifically for Neo4j.
download_specification = [
("Entity_Info.json", "https://figshare.com/ndownloader/files/43612509", "1746cde24a1bac0460f1ccf646608cc9"),
("Relation_Info.json", "https://figshare.com/ndownloader/files/43612506", "0c1fa199269adc58f64ad4d5b9fd87b9"),
("Entities.csv", "https://figshare.com/ndownloader/files/43612494", "b29f16555759edbbd05e59fa34cccdc5"),
("Roles.csv", "https://figshare.com/ndownloader/files/43612500", "65ad0206fb61bbc483065e47aa113172"),
("Literature_Info.json", "https://figshare.com/ndownloader/files/43612512", "10b78e8ec30f5b85f2a58d8fe24f056b"),
("Aging_Biomarkers.json", "https://figshare.com/ndownloader/files/43612503", "abd0eb6cb7295ae500c5d676b7797324"),
("Longevity_Biomarkers.json", "https://figshare.com/ndownloader/files/43612497", "0dbd9c3f8474dc3cd744ed38af460d75"),
]
# Fetch each file (skipped if a full local copy exists) and verify its MD5 checksum
for filename, url, md5 in download_specification:
    filepath = os.path.join(download_dir, filename)
    shared_bmkg.fetch_file(url, filepath)
    shared_bmkg.validate_file(filepath, md5)
    print()
Found a full local copy of "hald/downloads/Entity_Info.json".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Relation_Info.json".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Entities.csv".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Roles.csv".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Literature_Info.json".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Aging_Biomarkers.json".
MD5 checksum is correct.

Found a full local copy of "hald/downloads/Longevity_Biomarkers.json".
MD5 checksum is correct.
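The helper functions in shared_bmkg.py are not shown in this notebook. As a rough illustration of the kind of check reported above, the following sketch computes an MD5 checksum with the standard library and compares it to an expected value. The function names `md5_of_file` and `is_valid_copy` are hypothetical and are not taken from shared_bmkg.py, and the demonstration uses a temporary file instead of a real download.

```python
import hashlib
import os
import tempfile


def md5_of_file(filepath, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_copy(filepath, expected_md5):
    """Return True if the file exists and matches the expected checksum."""
    return os.path.isfile(filepath) and md5_of_file(filepath) == expected_md5


# Small demonstration with a temporary file instead of a real download
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    demo_ok = is_valid_copy(demo_path, expected)    # checksum matches
    demo_bad = is_valid_copy(demo_path, "0" * 32)   # checksum differs
    print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use low for large files such as Literature_Info.json. A download step could call such a check first and fetch the file only if it returns False.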
3. Data import
This section loads the raw files into Python data structures for the following inspection and conversion.
%%time
data_nodes = shared_bmkg.read_json_file(os.path.join(download_dir, "Entity_Info.json"))
data_edges = shared_bmkg.read_json_file(os.path.join(download_dir, "Relation_Info.json"))
CPU times: user 11.6 s, sys: 3.35 s, total: 14.9 s Wall time: 15.9 s
%%time
df_neo4j_entities = shared_bmkg.read_csv_file(os.path.join(download_dir, "Entities.csv"))
df_neo4j_roles = shared_bmkg.read_csv_file(os.path.join(download_dir, "Roles.csv"))
CPU times: user 311 ms, sys: 104 ms, total: 415 ms Wall time: 322 ms
%%time
data_literature_info = shared_bmkg.read_json_file(os.path.join(download_dir, "Literature_Info.json"))
CPU times: user 21.3 s, sys: 7.13 s, total: 28.5 s Wall time: 30.6 s
%%time
data_aging_biomarkers = shared_bmkg.read_json_file(os.path.join(download_dir, "Aging_Biomarkers.json"))
data_longevity_biomarkers = shared_bmkg.read_json_file(os.path.join(download_dir, "Longevity_Biomarkers.json"))
CPU times: user 58.3 ms, sys: 20.7 ms, total: 79 ms Wall time: 81.6 ms
4. Data inspection
This section attempts to reproduce some published numbers by inspecting the raw data and then prints a few example records.
The publication mentions the following statistics about the knowledge graph contents:
- 12,227 nodes having 10 different node types
- 115,522 edges having a lot of different edge types since they reflect verbs in the input texts
a) Number of nodes and edges
num_nodes = len(data_nodes)
num_edges = len(data_edges)
print(f"{num_nodes:,} nodes")
print(f"{num_edges:,} edges")
12,257 nodes
116,495 edges
Interpretation:
- Inspecting the raw data resulted in 12,257 nodes, while the publication mentioned 12,227 nodes, which is 30 fewer.
- Inspecting the raw data resulted in 116,495 edges, while the publication mentioned 115,522 edges, which is 973 fewer.
- Neither difference was present in an earlier version of the public dataset (Version 3). This suggests that the authors slightly modified the knowledge graph creation process after the publication, presumably to correct an error in the computation or to accommodate slightly different input data.
b) Types of nodes and edges
nt_key = "type"
# Count how many nodes belong to each node type
nt_counts = {}
for key, val in data_nodes.items():
    nt = val[0][nt_key]
    if nt not in nt_counts:
        nt_counts[nt] = 0
    nt_counts[nt] += 1
num_node_types = len(nt_counts)
print(f"{num_node_types} node types, sorted by their frequency of occurrence:")
for nt, cnt in sorted(nt_counts.items(), key=lambda item: -item[1]):
    print(f"- {nt}: {cnt}")
10 node types, sorted by their frequency of occurrence:
- Gene: 5624
- Disease: 3501
- Mutation: 2217
- RNA: 388
- Carbohydrate: 211
- Lipid: 177
- Peptide: 82
- Protein: 29
- Pharmaceutical Preparations: 15
- Toxin: 13
et_key = "relationship"
# Count how many edges belong to each edge type
et_counts = {}
for key, val in data_edges.items():
    et = val[et_key]
    if et not in et_counts:
        et_counts[et] = 0
    et_counts[et] += 1
num_edge_types = len(et_counts)
print(f"{num_edge_types} edge types, sorted by their frequency of occurrence:")
# Print only the most and least frequent edge types, with '...' in between.
# Note: as written, the head contains n + 1 entries and the tail n - 1 entries.
n = 10
print_this = True
for i, (et, cnt) in enumerate(sorted(et_counts.items(), key=lambda item: -item[1])):
    if print_this:
        print(f"- {et}: {cnt}")
    if i == n:
        print('...')
        print_this = False
    if i == len(et_counts) - n:
        print_this = True
3058 edge types, sorted by their frequency of occurrence:
- associated: 19110
- include: 5542
- increase: 2088
- result: 2015
- cause: 2006
- lead: 1990
- occur: 1600
- characterized: 1468
- develop: 1467
- related: 1425
- show: 1250
...
- exhibit performance in: 1
- modify risk of: 1
- be uncommon but devastate cause of: 1
- visualize changes severity in: 1
- contributor to: 1
- Laboratory: 1
- remodeled: 1
- elucidated: 1
- saturated: 1
# Correctness checks
# 1) Do the counts of different node types add up to the total number of nodes?
sum_node_types = sum(nt_counts.values())
assert sum_node_types == num_nodes, f"Node counts differ: {sum_node_types} != {num_nodes}"
print(f"{sum_node_types:,} = {num_nodes:,} nodes")
# 2) Do the counts of different edge types add up to the total number of edges?
sum_edge_types = sum(et_counts.values())
assert sum_edge_types == num_edges, f"Edge counts differ: {sum_edge_types} != {num_edges}"
print(f"{sum_edge_types:,} = {num_edges:,} edges")
12,257 = 12,257 nodes
116,495 = 116,495 edges
Interpretation:
- Inspecting the raw data resulted in 10 node types, which matches the number mentioned in the publication.
- The numbers of instances per node type partially match those presented in a visualization on the website. They differ, for example, for Lipid (177 here vs. 199 on the website). This observation is in line with the difference in total node count.
- Inspecting the raw data resulted in 3,058 edge types, while the publication did not specify any number of edge types.
- The large number of edge types and the omission of a quantification in the publication may be due to the method used for detecting edges. It is based on a natural language processing pipeline designed to "extract open-domain relation triples with no schema input for relations in advance". This means the types of edges are not predetermined by the authors, as is often the case in other projects, but come directly from the texts used as input for the NLP pipeline.
- Looking at relative frequencies of node and edge types suggests that the dataset is rather unbalanced.
- The most frequent node type is "Gene" with 5624 instances, while the least frequent node type is "Toxin" with only 13 instances, a difference of more than two orders of magnitude.
- The most frequent edge type is "associated" with 19110 instances, while the least frequent edge types only have 1 instance each, partly because they come from highly specific phrases only present in one source text.
- This result may be intentional to accurately represent how the source texts describe relationships, but some downstream analyses might be improved if highly similar edge types were combined into a single new type and perhaps if very infrequent ones were dropped. Indeed, one analysis presented in the publication goes in that direction: "In the Biomarkers Identification phase, we classified the relationships into positive, association, and negative ones based on their types. Further identification as biomarkers for human aging and longevity was performed." and "Finally, the entities were further identified as human aging and longevity biomarkers according to their relationships with aging-related diseases."
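The idea of combining similar edge types and dropping rare ones can be sketched as follows. This is only an illustration under stated assumptions: the mapping `CANONICAL`, the function `consolidate_edge_types` and the threshold `min_count` are hypothetical, the grouping of verbs is a guess made for this example, and the toy input is not taken from the dataset. It is not the classification used in the publication.

```python
from collections import Counter

# Hypothetical mapping of closely related verb forms to one canonical edge type
CANONICAL = {
    "associate": "associated",
    "associated": "associated",
    "related": "associated",
    "cause": "cause",
    "lead": "cause",
    "result": "cause",
}


def consolidate_edge_types(edge_types, mapping, min_count=2):
    """Map edge types to canonical names and drop types rarer than min_count."""
    merged = Counter(mapping.get(et, et) for et in edge_types)
    return Counter({et: cnt for et, cnt in merged.items() if cnt >= min_count})


# Toy input, not taken from the dataset
toy_edge_types = ["associated", "associate", "related", "cause", "lead", "elucidated"]
consolidated = consolidate_edge_types(toy_edge_types, CANONICAL)
print(consolidated)
```

In this toy case the six input labels collapse into two types, and the single occurrence of "elucidated" is dropped. A real mapping would need careful curation, since verbs such as "increase" and "protect" carry opposite meanings.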
c) Example entries
This section prints some example entries of the raw data. It gives an impression of the format chosen by the authors, which differs greatly between projects due to a lack of a broadly accepted standard for biomedical knowledge graphs.
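def report_first_n_items(data, n):
    # Print the first n key-value pairs, each truncated to 1000 characters
    for i, item in enumerate(data.items(), 1):
        print(str(item)[:1000], '...')
        print()
        if i == n:
            break

def report_last_n_items(data, n):
    # Print the last n key-value pairs in reverse order, each truncated to 1000 characters
    for i, item in enumerate(reversed(data.items()), 1):
        print(str(item)[:1000], '...')
        print()
        if i == n:
            break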
Nodes together with node annotations
report_first_n_items(data_nodes, 2)
('MLH1', [{'entity': 'MLH1', 'type': 'Gene', 'PMID': ['12612901', '30275527', '25311944', '22936446', '19949675', '22406557', '23240038', '11325821', '21042749', '25556597', '17556535', '29425284', '22740444', '10954253', '37380216'], 'official full name': 'mutL homolog 1', 'sentence': [['Most such cancers have the CpG island methylator phenotype (CIMP+) with methylation and transcriptional silencing of the mismatch repair gene MLH1.'], ['Our group recently demonstrated that aging human HSCs accumulate microsatellite instability coincident with loss of MLH1, a DNA Mismatch Repair (MMR) protein, which could reasonably predispose to radiation-induced HSC malignancies.', 'In addition, whole-exome sequencing analysis revealed high SNVs and INDELs in lymphomas being driven by loss of Mlh1 and frequently mutated genes had a strong correlation with human leukemias.'], ['ARID1A loss was observed in 9% (22/257) of the cohort: 24% of MMR-deficient tumors (14/59, 13 of the 14 being MLH1/PMS2 defi ... 
('CD4', [{'entity': 'CD4', 'type': 'Gene', 'PMID': ['9434661', '9433953', '15210831', '32041953', '8324202', '27243552', '21057376', '34314231', '33587445', '33225623', '31762303', '26284531', '31088755', '27756678', '16113482', '28708810', '25019430', '29529309', '26635008', '35114631', '31530175', '34233446', '23036045', '27940936', '15090829', '18225989', '30979972', '25356944', '24259252', '29165313', '19217939', '33888343', '29762168', '25833895', '33424857', '18925321', '25075743', '16156949', '30748025', '34633448', '1972177', '23981600', '28700495', '30814781', '23068054', '26423550', '35511728', '27097224', '12679605', '32959881', '21298072', '32251142', '31187337', '2631975', '8219229', '30225704', '28737297', '28127989', '28212619', '35249262', '23255844', '19890183', '17318234', '28002550', '34728337', '34143869', '25360575', '23291591', '8819096', '29808701', '33776993', '34106019', '35003076', '30788516', '29535090', '28462821', '23984974', '15050283', '34791781', '344110 ...
report_last_n_items(data_nodes, 2)
('Ototoxicity', [{'entity': 'Ototoxicity', 'type': 'Disease', 'PMID': ['37319406'], 'official full name': None, 'sentence': [['Additionally, the prevalence of aminoglycoside-induced vestibulotoxicity appears to be greater than cochleotoxicity.']], 'numbers of articles': 1, 'JT': ['American journal of audiology'], 'TA': ['Am J Audiol'], 'IF': [1.8], 'IF5': [2.0], 'year': [2023], 'date': [20231101], 'alias names': '', 'description': 'Damage to the EAR or its function secondary to exposure to toxic substances such as drugs used in CHEMOTHERAPY; IMMUNOTHERAPY; or RADIATION.', 'url': 'https://www.ncbi.nlm.nih.gov/mesh/2031054', 'mutation position': '', 'mutation alleles': '', 'MeSH ID': 'D000081015', 'relation': True, 'external links': [], 'aging biomarker': False, 'longevity biomarker': False}]) ... ('Prurigo', [{'entity': 'Prurigo', 'type': 'Disease', 'PMID': ['37903377'], 'official full name': None, 'sentence': [['Late-onset AD with generalized/prurigo lesions was the most predominant phenotype.']], 'numbers of articles': 1, 'JT': ['Folia medica Cracoviensia'], 'TA': ['Folia Med Cracov'], 'IF': [0.0], 'IF5': [0.0], 'year': [2023], 'date': [20230730], 'alias names': '', 'description': 'A name applied to several itchy skin eruptions of unknown cause.', 'url': 'https://www.ncbi.nlm.nih.gov/mesh/68011536', 'mutation position': '', 'mutation alleles': '', 'MeSH ID': 'D011536', 'relation': False, 'external links': [], 'aging biomarker': False, 'longevity biomarker': False}]) ...
Edges together with edge annotations
report_first_n_items(data_edges, 2)
('Pulmonary Disease, Chronic Obstructive-defined-Inflammation', {'source entity': 'Pulmonary Disease, Chronic Obstructive', 'relationship': 'defined', 'target entity': 'Inflammation', 'sentence': ['(1) Background: Chronic obstructive pulmonary disease (COPD) is defined as an inflammatory disorder that presents an increasingly prevalent health problem.'], 'source': ['COPD'], 'target': ['inflammatory disorder'], 'source type': ['Disease'], 'target type': ['Disease'], 'PMID': ['30781849'], 'DP': ['2019 Feb 13'], 'date': [20190213], 'TI': ['Chronic Obstructive Pulmonary Disease as a Main Factor of Premature Aging.'], 'TA': ['Int J Environ Res Public Health'], 'IF': [0.0], 'IF5': [0.0], 'method': ['deep learning', 'shortest path']}) ... ('Anorexia-associate-Sarcopenia', {'source entity': 'Anorexia', 'relationship': 'associate', 'target entity': 'Sarcopenia', 'sentence': ["(1) Background: Appetite loss in older people, the 'Anorexia of Aging' (AA), is common, associated with under-nutrition, sarcopenia, and frailty and yet receives little attention."], 'source': ['Anorexia'], 'target': ['sarcopenia'], 'source type': ['Disease'], 'target type': ['Disease'], 'PMID': ['30641897'], 'DP': ['2019 Jan 11'], 'date': [20190111], 'TI': ['Assessment and Treatment of the Anorexia of Aging: A Systematic Review.'], 'TA': ['Nutrients'], 'IF': [5.9], 'IF5': [6.6], 'method': ['deep learning']}) ...
report_last_n_items(data_edges, 2)
('Saxitoxin-cause-Drug-Related Side Effects and Adverse Reactions', {'source entity': 'Saxitoxin', 'relationship': 'cause', 'target entity': 'Drug-Related Side Effects and Adverse Reactions', 'sentence': ['Saxitoxin (STX) causes high toxicity by blocking voltage-gated sodium channels, and it poses a major threat to marine ecosystems and human health worldwide.'], 'source': ['Saxitoxin'], 'target': ['toxicity'], 'source type': ['Toxin'], 'target type': ['Disease'], 'PMID': ['37888479'], 'DP': ['2023 Oct 19'], 'date': [20231019], 'TI': ['Physiological Effects of Oxidative Stress Caused by Saxitoxin in the Nematode Caenorhabditis elegans.'], 'TA': ['Mar Drugs'], 'IF': [0.0], 'IF5': [0.0], 'method': ['shortest path']}) ... ('Triglycerides-protect-Dementia', {'source entity': 'Triglycerides', 'relationship': 'protect', 'target entity': 'Dementia', 'sentence': ['Higher triglyceride levels may be reflective of better overall health and/or lifestyle behaviors that would protect against dementia development.'], 'source': ['triglyceride'], 'target': ['dementia'], 'source type': ['Lipid'], 'target type': ['Disease'], 'PMID': ['37879942'], 'DP': ['2023 Nov 27'], 'date': [20231127], 'TI': ['Association Between Triglycerides and Risk of Dementia in Community-Dwelling Older Adults: A Prospective Cohort Study.'], 'TA': ['Neurology'], 'IF': [9.9], 'IF5': [10.3], 'method': ['shortest path']}) ...
Nodes, edges and annotations in a different format for Neo4j
df_neo4j_entities
| | entity:ID | name | type | frequency | :LABEL |
|---|---|---|---|---|---|
| 0 | 1 | Pulmonary Disease, Chronic Obstructive | Disease | 1034 | Disease |
| 1 | 2 | Inflammation | Disease | 4175 | Disease |
| 2 | 3 | Anorexia | Disease | 147 | Disease |
| 3 | 4 | Sarcopenia | Disease | 2072 | Disease |
| 4 | 5 | GPT | Gene | 48 | Gene |
| ... | ... | ... | ... | ... | ... |
| 6917 | 6918 | SUPT5H | Gene | 1 | Gene |
| 6918 | 6919 | HOXA3 | Gene | 2 | Gene |
| 6919 | 6920 | G6PC1 | Gene | 2 | Gene |
| 6920 | 6921 | OSR1 | Gene | 2 | Gene |
| 6921 | 6922 | Saxitoxin | Toxin | 1 | Toxin |

6922 rows × 5 columns
df_neo4j_roles
| | :START_ID | :END_ID | relation | weight | method | :TYPE |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | defined | 1 | deep learning; shortest path | defined |
| 1 | 3 | 4 | associate | 1 | deep learning | associate |
| 2 | 4 | 3 | associate | 1 | deep learning | associate |
| 3 | 5 | 6 | recognized | 1 | deep learning | recognized |
| 4 | 5 | 6 | increase | 1 | shortest path | increase |
| ... | ... | ... | ... | ... | ... | ... |
| 116482 | 578 | 2 | alleviate | 1 | shortest path | alleviate |
| 116483 | 104 | 1359 | associated | 1 | deep learning; shortest path | associated |
| 116484 | 1359 | 104 | associated | 1 | deep learning; shortest path | associated |
| 116485 | 104 | 8 | protect | 1 | shortest path | protect |
| 116486 | 6922 | 65 | cause | 1 | shortest path | cause |

116487 rows × 6 columns
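In the Neo4j format, edges refer to nodes by numeric identifiers rather than by name. The following sketch shows one possible way to resolve these identifiers into readable triples with pandas. It uses two small toy tables whose column names and values are copied from the first rows shown above; the variable names `entities`, `roles`, `id_to_name` and `triples` are chosen for this example only.

```python
import pandas as pd

# Toy tables that mimic the column layout shown above (values from the first rows)
entities = pd.DataFrame({
    "entity:ID": [1, 2, 3, 4],
    "name": ["Pulmonary Disease, Chronic Obstructive", "Inflammation", "Anorexia", "Sarcopenia"],
})
roles = pd.DataFrame({
    ":START_ID": [1, 3, 4],
    ":END_ID": [2, 4, 3],
    "relation": ["defined", "associate", "associate"],
})

# Lookup from numeric id to entity name
id_to_name = entities.set_index("entity:ID")["name"]

# Replace the numeric ids of each edge by the corresponding entity names
triples = pd.DataFrame({
    "source": roles[":START_ID"].map(id_to_name),
    "relation": roles["relation"],
    "target": roles[":END_ID"].map(id_to_name),
})
print(triples)
```

Applied to the full tables, the same lookup should work with df_neo4j_entities and df_neo4j_roles, provided that every identifier in the roles table also occurs in the entities table. Missing identifiers would show up as NaN values in the result.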
Background information about the used literature
num_articles = len(data_literature_info)
print(f"There is information about {num_articles:,} articles used as input for "
f"the NLP pipeline that identified entities and relations in them.")
There is information about 342,651 articles used as input for the NLP pipeline that identified entities and relations in them.
report_first_n_items(data_literature_info, 1)
('35796512', {'PMID': '35796512', 'TI': 'Inflammatory biomarkers, multi-morbidity, and biologic aging.', 'AB': 'OBJECTIVES: To study the association between multi-morbidity percentiles, which is a measure of clinical aging, and interleukin (IL)-6, IL-10, and tumor necrosis factor (TNF)-alpha. METHODS: Participants 50 to 95 years of age from the Mayo Clinic Study of Aging were assigned age- and sex-specific multi-morbidity percentiles using look-up tables that were reported previously (n = 1646). Percentiles were divided into quintiles for analysis. Plasma IL-6, IL-10, and TNF-alpha levels were measured in 1595 participants. Median inflammatory marker levels were compared across multi-morbidity quintiles using nonparametric tests. RESULTS: People with higher multi-morbidity percentiles had significantly higher IL-6 and TNF-alpha levels compared with those with lower multi-morbidity percentiles. Tests for trend across five multi-morbidity quintiles were significant among women for IL-6 a ...
Information about a downstream analysis of nodes related to aging or longevity
num_aging_biomarkers = len(data_aging_biomarkers)
num_longevity_biomarkers = len(data_longevity_biomarkers)
print(f"There is information about nodes that were identified as biomarkers by downstream analyses:")
print(f"- {num_aging_biomarkers:,} entries were found to be aging-related biomarkers")
print(f"- {num_longevity_biomarkers:,} entries were found to be longevity-related biomarkers")
There is information about nodes that were identified as biomarkers by downstream analyses:
- 1,871 entries were found to be aging-related biomarkers
- 531 entries were found to be longevity-related biomarkers
report_first_n_items(data_aging_biomarkers, 1)
('GPT', [{'source entity': 'GPT', 'relationship': 'correlated', 'target entity': 'Death', 'sentence': 'RESULTS: Profiling of blood parameters demonstrated that elevated levels of alanine aminotransferase (ALT), total bilirubin (T-bil), blood urea nitrogen (BUN), creatinine (Cr) and a decreased platelet count were significantly correlated with death within 1 week in a training cohort.', 'source': 'alanine aminotransferase', 'target': 'death', 'source type': 'Gene', 'target type': 'Disease', 'PMID': '28011502', 'DP': '2017 Jan', 'date': 20170101, 'TI': 'Objective Predictive Score as a Feasible Biomarker for Short-term Survival in TerminalIy Ill Patients with Cancer.', 'TA': 'Anticancer Res', 'IF': 2.0, 'IF5': 2.2}]) ...
report_first_n_items(data_longevity_biomarkers, 1)
('Glucose', [{'source entity': 'Glucose', 'relationship': 'attenuate', 'target entity': 'Cerebrovascular Disorders', 'sentence': 'Raising NAD+ levels in model organisms by administration of NAD+ precursors improves glucose and lipid metabolism; attenuates diet-induced weight-gain, diabetes, diabetic kidney disease, and hepatic steatosis; reduces endothelial dysfunction; protects heart from ischemic injury; improves left ventricular function in models of heart failure; attenuates cerebrovascular and neurodegenerative disorders; and increases health-span.', 'source': 'glucose', 'target': 'cerebrovascular and neurodegenerative disorders', 'source type': 'Carbohydrate', 'target type': 'Disease', 'PMID': '37364580', 'DP': '2023 Nov 9', 'date': 20231109, 'TI': 'Nicotinamide Adenine Dinucleotide in Aging Biology: Potential Applications and Many Unknowns.', 'TA': 'Endocr Rev', 'IF': 20.3, 'IF5': 25.8}]) ...
5. Schema discovery
This section analyzes the structure of the knowledge graph by determining which types of nodes are connected by which types of edges. To construct this overview, it is necessary to iterate over the entire data once. The result is a condensed representation of all entities and relations, which is known as a data model or schema in the context of graph databases.
Note: Since HALD has an unusually large number of edge types, the visualization would have a lot of arrows between any pair of node types. To make it tidier, all parallel arrows are condensed into a single arrow, and a list of all edge types that the arrow represents can be seen when hovering over it. This representation deviates slightly from the usual way to display a graph schema but conveys the same information.
node_type_to_color = {
"Pharmaceutical Preparations": "green",
"Toxin": "green",
"Gene": "blue",
"Peptide": "blue",
"Protein": "blue",
"RNA": "blue",
"Disease": "red",
}
# Collect the set of edge types found between each (source type, target type) pair
unique_duples_to_edge_types = dict()
for entry in data_edges.values():
    s = entry["source type"][0]
    p = entry["relationship"]
    o = entry["target type"][0]
    duple = (s, o)
    if duple not in unique_duples_to_edge_types:
        unique_duples_to_edge_types[duple] = set()
    unique_duples_to_edge_types[duple].add(p)

# Build the schema graph: one vertex per node type, one arrow per pair of node types
gs = ig.Graph(directed=True)
unique_nodes = set()
for (s, o), ps in unique_duples_to_edge_types.items():
    for node in (s, o):
        if node not in unique_nodes:
            unique_nodes.add(node)
            node_size = int(nt_counts[node])
            node_color = node_type_to_color.get(node, '')
            node_hover = f"{node}\n\n{nt_counts[node]} nodes of this type are contained in the knowledge graph."
            gs.add_vertex(node, size=node_size, color=node_color, label_color=node_color, hover=node_hover)
    edge_size = len(ps)  # number of edge types represented by a single arrow
    edge_color = node_type_to_color.get(s, '')
    edge_type_list = ', '.join(f'"{entry}"' for entry in ps)
    edge_hover = (
        f"{s} -> {o}\n\nThere are {len(ps)} different edge types between these two node types, "
        f"represented here with just a single arrow to keep the depiction tidy.\n\nList of edge types:\n{edge_type_list}")
    gs.add_edge(s, o, size=edge_size, color=edge_color, label_color=edge_color, hover=edge_hover)
gs.vcount(), gs.ecount()
(10, 63)
fig = gv.d3(
gs,
show_node_label=True,
node_label_data_source="name",
show_edge_label=False,
edge_curvature=0.1,
use_node_size_normalization=True,
node_size_normalization_min=10,
node_size_normalization_max=50,
node_drag_fix=True,
node_hover_neighborhood=True,
use_edge_size_normalization=True,
edge_size_normalization_max=3,
many_body_force_strength=-3000,
zoom_factor=1.0,
)
fig
# Export the schema visualization to a standalone HTML file
schema_filepath = os.path.join(results_dir, f"{project_name}_schema.html")
fig.export_html(schema_filepath, overwrite=True)
Interpretation:
- Each node in the schema corresponds to one of the 10 node types in the data.
- Node size represents the number of instances, i.e. how often that node type is present in the knowledge graph. The exact number can also be seen when hovering over a node. The large differences indicate again that the dataset is rather unbalanced.
- Node color represents particular node types. The coloring scheme is based on a deliberately simple RGB palette with the same meaning across multiple notebooks to enable some visual comparison. The idea behind it is to highlight an interplay of certain entities, namely that drugs (or small molecules in general) can bind to proteins (or gene products in general) and thereby alter diseases (or involved pathways).
- green = drugs & other small molecules (e.g. toxins)
- blue = genes & gene products (e.g. proteins or RNAs)
- red = diseases & related concepts (e.g. pathways)
- black = all other types of entities
- Each edge in a schema usually stands for one edge type in the data, and the same edge type may appear between different pairs of node types. In this schema, however, a single edge represents all edge types found between a pair of node types because there are so many of them in HALD. Hovering over an arrow shows the number of edge types it represents and lists all of them.
- Edge size represents the number of distinct edge types that an arrow stands for (the value len(ps) in the code above), not the number of edge instances in the knowledge graph.
- Edge color is identical to the color of the source node, again to highlight the interplay between drugs, targets and diseases.
6. Knowledge graph reconstruction
This section first converts the raw data to an intermediate format used in several notebooks, and then reconstructs the knowledge graph from the standardized data with shared code.
- The intermediate form of the data is created as two simple Python lists, one for nodes and the other for edges, which can be exported to two CSV files.
- The knowledge graph is built as a graph object from the Python package igraph, which can be exported to a GraphML file.
a) Convert the data into a standardized format
Transform the raw data to a standardized format that is compatible with most biomedical knowledge graphs in order to enable shared downstream processing:
- Each node is represented by three items: id (str), type (str), properties (dict)
- Each edge is represented by four items: source_id (str), target_id (str), type (str), properties (dict)
This format was initially inspired by a straightforward way in which the content of a Neo4j graph database can be exported to two CSV files, one for all nodes and the other for all edges. This is an effect of the property graph model used in Neo4j and many other graph databases, which also appears to be general enough to fully capture the majority of biomedical knowledge graphs described in scientific literature, despite the large variety of formats they are shared in.
A second motivation was that each line represents a single node or edge, and that no entry is connected to any sections at other locations, such as property descriptions at the beginning of a GraphML file. This structural simplicity makes it very easy to load just a subset of nodes and edges by picking a subset of lines, or to skip the loading of properties if they are not required for a task simply by ignoring a single column.
Finally, this format also allows loading the data directly into popular SQL databases like SQLite, MySQL or PostgreSQL with built-in CSV functions (CSV in SQLite, CSV in MySQL, CSV in PostgreSQL). Further, the JSON string in the property column can be accessed directly by built-in JSON functions (JSON in SQLite, JSON in MySQL, JSON in PostgreSQL), which enables sophisticated queries that access or modify specific properties within the JSON data.
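To illustrate the last point, the following sketch stores two toy node rows in an in-memory SQLite database and reads a single property out of the JSON string with the SQLite function json_extract. It is a minimal example under stated assumptions: the table name `nodes` and its column names are chosen here and may differ from the exported CSV files, the rows are inserted from Python rather than with SQLite's CSV import, and the SQLite build must include the JSON functions. The property values are copied from the example entries shown earlier.

```python
import json
import sqlite3

# Toy node rows in the (id, type, properties) layout
rows = [
    ("MLH1", "Gene", json.dumps({"official full name": "mutL homolog 1"})),
    ("Prurigo", "Disease", json.dumps({"MeSH ID": "D011536", "relation": False})),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, properties TEXT)")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)

# json_extract pulls one property out of the JSON string.
# Keys that contain spaces need to be quoted inside the JSON path.
query = """
    SELECT id, json_extract(properties, '$."MeSH ID"')
    FROM nodes
    WHERE type = 'Disease'
"""
result = con.execute(query).fetchall()
con.close()
print(result)
```

Because the properties stay in one text column, the table schema does not need to change when different node types carry different properties.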
Nodes
%%time
nodes = []
for entry in data_nodes.values():
    entry = entry[0]  # each value is a list that holds a single record
    node_id = entry["entity"]
    node_type = entry["type"]
    # All remaining keys become node properties
    node_properties = {k: v for k, v in entry.items()
                       if k not in ("entity", "type")}
    node = (node_id, node_type, node_properties)  # default format
    nodes.append(node)
CPU times: user 216 ms, sys: 24.3 ms, total: 240 ms Wall time: 281 ms
Edges
%%time
edges = []
for entry in data_edges.values():
    source_id = entry["source entity"]
    target_id = entry["target entity"]
    edge_type = entry["relationship"]
    # All remaining keys become edge properties
    edge_properties = {k: v for k, v in entry.items()
                       if k not in ("source entity", "target entity", "relationship")}
    edge = (source_id, target_id, edge_type, edge_properties)  # default format
    edges.append(edge)
CPU times: user 4.32 s, sys: 179 ms, total: 4.5 s Wall time: 4.55 s
b) Export the standardized data to two CSV files
Both the id and type items are simple strings, while the properties item is a collection of key-value pairs represented by a Python dictionary, which the export function converts internally to a single JSON string. This means each node is fully represented by three strings, and each edge by four strings because it has a source id and a target id.
nodes_csv_filepath = shared_bmkg.export_nodes_as_csv(nodes, results_dir, project_name)
edges_csv_filepath = shared_bmkg.export_edges_as_csv(edges, results_dir, project_name)
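The export functions are defined in shared_bmkg.py and are not shown here. As a rough sketch of the idea, the following example writes one toy node as a CSV row with a JSON-encoded properties column and then reads it back. The header names `id`, `type` and `properties` are assumptions made for this example, and an in-memory buffer stands in for a real file.

```python
import csv
import io
import json

# One toy node in the (id, type, properties) layout
toy_nodes = [("Prurigo", "Disease", {"MeSH ID": "D011536", "relation": False})]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "type", "properties"])  # assumed header names
for node_id, node_type, props in toy_nodes:
    # The dictionary is serialized to a single JSON string per row
    writer.writerow([node_id, node_type, json.dumps(props)])

# Reading the row back recovers the original dictionary
buffer.seek(0)
reader = csv.DictReader(buffer)
first_row = next(reader)
restored = json.loads(first_row["properties"])
print(restored)
```

The csv module quotes the JSON string automatically, so commas and quotation marks inside the properties do not break the row structure.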
c) Use the standardized data to build a graph
Reconstruct the knowledge graph in the form of a Graph object from the package igraph. This kind of graph object supports directed multi-edges, i.e. an edge has a source and a target node, and two nodes can be connected by more than one edge. It also supports node and edge properties. These features are necessary and sufficient to represent almost any biomedical knowledge graph found in academic literature.
%%time
g = shared_bmkg.create_graph(nodes, edges)
CPU times: user 2.14 s, sys: 28.7 ms, total: 2.16 s Wall time: 2.17 s
shared_bmkg.report_graph_stats(g)
Directed multigraph with 12257 nodes, 116495 edges and a density of 0.0007754.
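The density relates the number of edges to the number of possible edges. How report_graph_stats computes it is not shown in this notebook, so the following sketch simply evaluates two common definitions for a directed graph with the node and edge counts reported above. The variable names are chosen for this example only.

```python
n = 12_257   # number of nodes reported above
m = 116_495  # number of edges reported above

# Directed graph without self-loops: n * (n - 1) possible edges
density_no_loops = m / (n * (n - 1))
# Directed graph with self-loops allowed: n * n possible edges
density_with_loops = m / (n * n)

print(f"{density_no_loops:.7f}")    # 0.0007755
print(f"{density_with_loops:.7f}")  # 0.0007754
```

Rounded to seven decimal places, the second variant agrees with the reported value of 0.0007754. Note that in a multigraph the density is only a rough indicator, because parallel edges between the same pair of nodes are counted individually.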
# Correctness checks
# 1) Does the reconstructed graph contain the same number of nodes as the raw data?
num_nodes_in_graph = g.vcount()
assert num_nodes_in_graph == num_nodes, f"Node counts differ: {num_nodes_in_graph} != {num_nodes}"
print(f"{num_nodes_in_graph:,} = {num_nodes:,}")
# 2) Does the reconstructed graph contain the same number of edges as the raw data?
num_edges_in_graph = g.ecount()
assert num_edges_in_graph == num_edges, f"Edge counts differ: {num_edges_in_graph} != {num_edges}"
print(f"{num_edges_in_graph:,} = {num_edges:,}")
12,257 = 12,257
116,495 = 116,495
%%time
g_graphml_filepath = shared_bmkg.export_graph_as_graphml(g, results_dir, project_name)
CPU times: user 1.2 s, sys: 116 ms, total: 1.32 s Wall time: 1.3 s
7. Subgraph exploration¶
This section explores small subgraphs of the knowledge graph in two ways: first by inspecting the direct neighborhood of a selected node, and second by finding shortest paths between two chosen nodes.
As a simple case study, the goal is to identify some nodes in the knowledge graph that are associated with the success story of the drug Imatinib, which was one of the first targeted therapies against cancer. Detailed background information can be found, for example, in an article by the National Cancer Institute and in a talk by Brian Druker, who played a major role in the development of this paradigm-changing drug. To give a simplified summary, the following biological entities and relationships are involved:
- Mutation: In a bone marrow stem cell, a translocation event between chromosomes 9 and 22 leads to what has been called the Philadelphia chromosome, which can be seen under a microscope and was named after the city in which it was discovered.
- Gene: It turned out that this particular rearrangement of DNA fuses the BCR gene on chromosome 22 to the ABL1 gene on chromosome 9, resulting in a new fusion gene known as BCR-ABL1.
- Disease: BCR-ABL1 acts as an oncogene, because it expresses a protein that is a defective tyrosine kinase in a permanent "on" state, which leads to uncontrolled growth of certain white blood cells and their precursors, thereby driving the disease Chronic Myelogenous Leukemia (CML).
- Drug: Imatinib (Gleevec) was the first demonstration that a potent and selective Bcr-Abl tyrosine-kinase inhibitor (TKI) is possible and that such a targeted inhibition of an oncoprotein halts the uncontrolled growth of leukemia cells with BCR-ABL1, while having significantly less effect on other cells in the body compared to conventional chemotherapies used in cancer. This revolutionized the treatment of CML and drastically improved the five-year survival rate of patients from less than 20% to over 90%, as well as their quality of life.
In reality the story is a bit more complex, for example because there are other genes involved in disease progression, there are many closely related forms of leukemia, BCR-ABL1 also plays a role in other forms of cancer, there are several drugs available as treatment options today, all of them bind to more than one target and with different affinities, and their individual binding profiles are relevant to their particular therapeutic effects. Inspecting the knowledge graph will focus on highlighting some entities of the simplified story, but the surrounding elements will also indicate some of the complexity encountered in reality. Some simple forms of reasoning on the knowledge graph will demonstrate its potential for discovering new patterns and hypotheses.
a) Search for interesting nodes¶
# Drug: Imatinib - seems not to be contained in HALD
shared_bmkg.list_nodes_matching_substring(g, "imatinib")
id  type
==========
# Gene: ABL1
shared_bmkg.list_nodes_matching_substring(g, "abl1")
id    type
================
ABL1  Gene
# Disease: Leukemia - to find Chronic Myeloid Leukemia (CML)
shared_bmkg.list_nodes_matching_substring(g, "leukemia")
id                                                      type
=====================================================================
Leukemia                                                Disease
Leukemia L1210                                          Disease
Leukemia, B-Cell                                        Disease
Leukemia, Biphenotypic, Acute                           Disease
Leukemia, Erythroblastic, Acute                         Disease
Leukemia, Hairy Cell                                    Disease
Leukemia, Large Granular Lymphocytic                    Disease
Leukemia, Lymphocytic, Chronic, B-Cell                  Disease
Leukemia, Lymphoid                                      Disease
Leukemia, Mast-Cell                                     Disease
Leukemia, Megakaryoblastic, Acute                       Disease
Leukemia, Monocytic, Acute                              Disease
Leukemia, Myelogenous, Chronic, BCR-ABL Positive        Disease
Leukemia, Myeloid                                       Disease
Leukemia, Myeloid, Accelerated Phase                    Disease
Leukemia, Myeloid, Acute                                Disease
Leukemia, Myeloid, Chronic, Atypical, BCR-ABL Negative  Disease
Leukemia, Myelomonocytic, Chronic                       Disease
Leukemia, Myelomonocytic, Juvenile                      Disease
Leukemia, Prolymphocytic                                Disease
Leukemia, Prolymphocytic, T-Cell                        Disease
Leukemia, Promyelocytic, Acute                          Disease
Leukemia, T-Cell                                        Disease
Leukemia-Lymphoma, Adult T-Cell                         Disease
Precursor Cell Lymphoblastic Leukemia-Lymphoma          Disease
Precursor T-Cell Lymphoblastic Leukemia-Lymphoma        Disease
Preleukemia                                             Disease
b) Explore the neighborhood of a chosen node¶
# Neighborhood of gene ABL1
source = "ABL1"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)
# Export
filename = f"{project_name}_neighbors_abl1"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)
# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 3 nodes, 2 edges and a density of 0.2222.
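The idea of an egocentric subgraph can be sketched directly on the edge tuples. The sketch assumes the subgraph consists of the chosen node, its direct neighbors in either direction, and all edges among these nodes. Whether shared_bmkg.get_egocentric_subgraph uses exactly this definition is an assumption, and both the function name `egocentric_edges` and the toy edges are made up for illustration.

```python
def egocentric_edges(edges, center):
    """Return the center's neighborhood (node set) and all edges within it."""
    neighborhood = {center}
    for source_id, target_id, *_ in edges:
        if source_id == center:
            neighborhood.add(target_id)
        elif target_id == center:
            neighborhood.add(source_id)
    kept = [e for e in edges if e[0] in neighborhood and e[1] in neighborhood]
    return neighborhood, kept


# Toy edges in the (source_id, target_id, type, properties) format.
toy_edges = [
    ("A", "B", "x", {}),
    ("C", "A", "y", {}),
    ("B", "C", "z", {}),
    ("C", "D", "w", {}),
    ("D", "E", "v", {}),
]
ego_nodes, ego_edges = egocentric_edges(toy_edges, "A")
```

Edges between two neighbors (here B to C) are kept, while edges leading out of the neighborhood (here C to D) are dropped.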
# Neighborhood of disease CML
source = "Leukemia, Myelogenous, Chronic, BCR-ABL Positive"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)
# Export
filename = f"{project_name}_neighbors_cml"
shared_bmkg.export_graph_as_graphml(subgraph, results_dir, filename)
shared_bmkg.export_nodes_as_csv(nodes, results_dir, filename, subgraph)
shared_bmkg.export_edges_as_csv(edges, results_dir, filename, subgraph)
# Report
shared_bmkg.report_graph_stats(subgraph)
subgraph = subgraph.simplify() # Reduced subgraph without multi-edges in order to enable better visualization
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 29 nodes, 1535 edges and a density of 1.825.
Interpretation:
- Disease-Disease relations: The disease CML (red node in the center) is linked to many other hematological diseases such as Anemia and cancers such as AML or Multiple Myeloma.
- Disease-Gene relations: The disease CML is linked to the gene ABL1 (a blue node), which was expected from the Imatinib story. It is also connected to the gene SIRT7 (the other blue node), which is a bit more informative. A publication indicates that this gene has a connection to aging, hence it appears in HALD, but also that there's indeed a connection to CML and AML, described as "reduced SIRT7 expression is associated with hematopoietic disorders like acute myeloid leukemia (AML) and chronic myeloid leukemia (CML)".
# Neighborhood of disease CMML - to show a small example with multi-edges
source = "Leukemia, Myelomonocytic, Chronic"
subgraph = shared_bmkg.get_egocentric_subgraph(g, source)
# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source)
Directed multigraph with 9 nodes, 126 edges and a density of 1.556.
Interpretation:
- The disease Chronic Myelomonocytic Leukemia (red node in the center) is abbreviated CMML and not identical to CML, but yet another form of leukemia, which is less well represented in HALD.
- Since there's not that much information about this disease in HALD, it is possible to visualize its entire neighborhood, including all individual edges, thereby illustrating that the same pair of nodes can be connected by multiple edges of different types. This comes from the fact that HALD analyses scientific texts to recognize entities and to identify verbs between them. These connective words are then used as edge types, which explains why there's a large number of them in the knowledge graph and why two nodes often have a lot of edges between them. Hovering over any edge in this visualization shows its type, e.g. "include" or "associated" are present several times.
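How often each edge type occurs between a given pair of nodes can be counted on the standardized edge tuples. This is a minimal sketch with made-up toy edges. The edge types "include" and "associated" are borrowed from the text above, while the node ids and the function name `edge_types_per_pair` are hypothetical.

```python
from collections import Counter, defaultdict


def edge_types_per_pair(edges):
    """Map each (source_id, target_id) pair to a Counter of its edge types."""
    per_pair = defaultdict(Counter)
    for source_id, target_id, edge_type, _properties in edges:
        per_pair[(source_id, target_id)][edge_type] += 1
    return per_pair


# Toy multi-edges: the same ordered pair is connected several times.
toy_multi_edges = [
    ("N1", "N2", "include", {}),
    ("N1", "N2", "associated", {}),
    ("N1", "N2", "associated", {}),
    ("N2", "N1", "include", {}),
]
type_counts = edge_types_per_pair(toy_multi_edges)
for pair, counts in type_counts.items():
    print(pair, dict(counts))
```

The direction matters here: edges from N1 to N2 and from N2 to N1 are counted under different keys.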
c) Find shortest paths between two chosen nodes¶
# Paths from gene TET2 to disease CML
source = "TET2"
target = "Leukemia, Myelogenous, Chronic, BCR-ABL Positive"
subgraph = shared_bmkg.get_paths_subgraph(g, source, target)
# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source, target)
Directed multigraph with 4 nodes, 4 edges and a density of 0.25.
Interpretation:
- The gene TET2 is involved with myeloproliferative diseases and neoplasms, which in turn are linked to CML.
- CML is a concrete example of both of these more general categories of diseases. Therefore this subgraph doesn't necessarily suggest that TET2 has something to do with CML, because TET2 might just as well be relevant only to certain other diseases that also fit the general notion of a neoplasm. However, one can find studies such as this publication, which suggests there is indeed a link between TET2 and CML that could be further investigated.
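A subgraph of shortest paths like the one interpreted above can be computed with a breadth-first search that remembers every predecessor on a shortest path. The sketch follows edges in their stored direction and uses toy data. How shared_bmkg.get_paths_subgraph handles direction and path length is not shown in this notebook, so this is an assumption, and the function name `shortest_path_nodes` is hypothetical.

```python
from collections import deque


def shortest_path_nodes(edges, source, target):
    """Return all nodes that lie on any shortest directed path from source to target."""
    successors = {}
    for source_id, target_id, *_ in edges:
        successors.setdefault(source_id, set()).add(target_id)

    # Breadth-first search that records all predecessors on shortest paths.
    distance = {source: 0}
    predecessors = {source: set()}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in successors.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                predecessors[neighbor] = {node}
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                predecessors[neighbor].add(node)

    if target not in distance:
        return set()

    # Walk backwards from the target along the recorded predecessors.
    on_path, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in on_path:
            on_path.add(node)
            stack.extend(predecessors[node])
    return on_path


# Toy graph: two paths of length 2 (via A and via B) and one of length 3.
toy_path_edges = [
    ("S", "A", "x", {}), ("A", "T", "x", {}),
    ("S", "B", "x", {}), ("B", "T", "x", {}),
    ("S", "C", "x", {}), ("C", "D", "x", {}), ("D", "T", "x", {}),
]
path_nodes = shortest_path_nodes(toy_path_edges, "S", "T")
```

Only the intermediate nodes of the two shortest paths are returned, while the longer detour via C and D is ignored.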
# Paths from Von Willebrand factor (VWF) to Alzheimer disease - to show an example with more paths
source = "VWF"
target = "Alzheimer Disease"
subgraph = shared_bmkg.get_paths_subgraph(g, source, target)
# Report
shared_bmkg.report_graph_stats(subgraph)
shared_bmkg.visualize_graph(subgraph, node_type_to_color, source, target)
Directed multigraph with 13 nodes, 22 edges and a density of 0.1302.
Interpretation:
- These two nodes were initially selected only because the subgraph is of a reasonably interesting size, but it turned out there's an unfolding story to it.
- The gene VWF (Von Willebrand factor, involved in hemostasis) is connected to vascular and other diseases, which in turn are connected to Alzheimer Disease.
- The indirect relation by various paths could be taken to form the hypothesis that VWF might play a role in Alzheimer Disease. There are indeed studies investigating the possibility of such a link, and for example this publication actually finds that "Higher levels of Von Willebrand factor are associated with an increased risk of dementia, including Alzheimer’s disease, possibly due to direct prothrombotic effects or secondary to endothelial injury."
- This toy example shows the potential of reasoning on knowledge graphs. Here the presence of 1) many indirect links via 2) thematically similar nodes suggested there could be a relevant connection between two nodes that is not yet part of the knowledge graph, and a look at literature showed this might actually be the case. In network theory, the task of link prediction captures this idea as the identification of "unobserved true links".
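One of the simplest link prediction heuristics is the common-neighbors score: the more neighbors two unconnected nodes share, the more plausible a direct link between them becomes. The sketch below ignores edge direction and uses abstract toy data. It is meant as an illustration of the idea and not as the method used in this notebook, and the function name `common_neighbors` is hypothetical.

```python
def common_neighbors(edges, node_a, node_b):
    """Return the set of nodes adjacent to both node_a and node_b (direction ignored)."""
    neighbors = {}
    for source_id, target_id, *_ in edges:
        if source_id != target_id:  # self-loops carry no information here
            neighbors.setdefault(source_id, set()).add(target_id)
            neighbors.setdefault(target_id, set()).add(source_id)
    return neighbors.get(node_a, set()) & neighbors.get(node_b, set())


# Toy graph: G shares two neighbors with X but only one with Y.
toy_link_edges = [
    ("G", "D1", "x", {}), ("G", "D2", "x", {}), ("G", "D3", "x", {}),
    ("D1", "X", "x", {}), ("D2", "X", "x", {}),
    ("D3", "Y", "x", {}),
]
score_g_x = len(common_neighbors(toy_link_edges, "G", "X"))
score_g_y = len(common_neighbors(toy_link_edges, "G", "Y"))
```

Under this heuristic the candidate link between G and X would be ranked above the one between G and Y.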
Appendix: Loading and querying the converted data with SQLite¶
This section demonstrates that the converted data in the intermediary CSV format can be loaded into and queried with a popular SQL database directly.
a) Create an SQLite database with a suitable schema¶
Create a file-based SQLite database
sqlite_db_filepath = os.path.join(results_dir, f"{project_name}_graph.sqlite")
shared_bmkg.delete_file(sqlite_db_filepath)
Create a table for node data
sql_cmd = """
CREATE TABLE nodes (
id TEXT PRIMARY KEY,
type TEXT,
properties TEXT
);
"""
shared_bmkg.run_shell_command(['sqlite3', sqlite_db_filepath, sql_cmd])
Create a table for edge data
sql_cmd = """
CREATE TABLE edges (
source_id TEXT,
target_id TEXT,
type TEXT,
properties TEXT,
FOREIGN KEY (source_id) REFERENCES nodes(id),
FOREIGN KEY (target_id) REFERENCES nodes(id)
);
"""
shared_bmkg.run_shell_command(['sqlite3', sqlite_db_filepath, sql_cmd])
Load node data
sqlite_cmd = f".import --csv --skip 1 {nodes_csv_filepath} nodes"
shared_bmkg.run_shell_command(['sqlite3', sqlite_db_filepath, '-cmd', sqlite_cmd])
Load edge data
sqlite_cmd = f".import --csv --skip 1 {edges_csv_filepath} edges"
shared_bmkg.run_shell_command(['sqlite3', sqlite_db_filepath, '-cmd', sqlite_cmd])
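As an alternative to the sqlite3 command-line tool, the same kind of import can be sketched with Python's built-in sqlite3 and csv modules. The example uses an in-memory database and made-up miniature CSV contents with assumed header names, and the helper `load_csv` is hypothetical. It also shows that SQLite enforces the foreign keys declared in the schema only if the pragma foreign_keys is switched on for the connection.

```python
import csv
import io
import sqlite3

# Made-up miniature CSV contents; header names and the edge are assumptions.
nodes_csv_text = '''id,type,properties
MLF1,Gene,"{""official full name"": ""myeloid leukemia factor 1""}"
"Leukemia, Myeloid",Disease,{}
'''
edges_csv_text = '''source_id,target_id,type,properties
MLF1,"Leukemia, Myeloid",associated,{}
'''

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
demo_conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, properties TEXT)")
demo_conn.execute("""
    CREATE TABLE edges (
        source_id TEXT, target_id TEXT, type TEXT, properties TEXT,
        FOREIGN KEY (source_id) REFERENCES nodes(id),
        FOREIGN KEY (target_id) REFERENCES nodes(id)
    )
""")


def load_csv(conn, csv_text, table, num_columns):
    """Insert all CSV rows except the header into the given table."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    placeholders = ", ".join("?" * num_columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    conn.commit()


load_csv(demo_conn, nodes_csv_text, "nodes", 3)
load_csv(demo_conn, edges_csv_text, "edges", 4)
node_count = demo_conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
edge_count = demo_conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]

# With foreign keys enabled, an edge to an unknown node is rejected.
try:
    demo_conn.execute("INSERT INTO edges VALUES ('MLF1', 'no such node', 'associated', '{}')")
    dangling_edge_rejected = False
except sqlite3.IntegrityError:
    dangling_edge_rejected = True
```

Nodes have to be loaded before edges in this setup, because every edge must reference node ids that already exist.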
c) Query the data¶
import sqlite3
conn = sqlite3.connect(sqlite_db_filepath)
cursor = conn.cursor()
Standard SQL query¶
Find all nodes that contain a certain substring in their id
substring = "leukemia"
query = """
SELECT id, type
FROM nodes
WHERE LOWER(id) LIKE LOWER(?)
ORDER BY id;
"""
search_term = f"%{substring}%"
cursor.execute(query, (search_term,))
result = cursor.fetchall()
for row in result:
print(row)
('Leukemia', 'Disease')
('Leukemia L1210', 'Disease')
('Leukemia, B-Cell', 'Disease')
('Leukemia, Biphenotypic, Acute', 'Disease')
('Leukemia, Erythroblastic, Acute', 'Disease')
('Leukemia, Hairy Cell', 'Disease')
('Leukemia, Large Granular Lymphocytic', 'Disease')
('Leukemia, Lymphocytic, Chronic, B-Cell', 'Disease')
('Leukemia, Lymphoid', 'Disease')
('Leukemia, Mast-Cell', 'Disease')
('Leukemia, Megakaryoblastic, Acute', 'Disease')
('Leukemia, Monocytic, Acute', 'Disease')
('Leukemia, Myelogenous, Chronic, BCR-ABL Positive', 'Disease')
('Leukemia, Myeloid', 'Disease')
('Leukemia, Myeloid, Accelerated Phase', 'Disease')
('Leukemia, Myeloid, Acute', 'Disease')
('Leukemia, Myeloid, Chronic, Atypical, BCR-ABL Negative', 'Disease')
('Leukemia, Myelomonocytic, Chronic', 'Disease')
('Leukemia, Myelomonocytic, Juvenile', 'Disease')
('Leukemia, Prolymphocytic', 'Disease')
('Leukemia, Prolymphocytic, T-Cell', 'Disease')
('Leukemia, Promyelocytic, Acute', 'Disease')
('Leukemia, T-Cell', 'Disease')
('Leukemia-Lymphoma, Adult T-Cell', 'Disease')
('Precursor Cell Lymphoblastic Leukemia-Lymphoma', 'Disease')
('Precursor T-Cell Lymphoblastic Leukemia-Lymphoma', 'Disease')
('Preleukemia', 'Disease')
Non-standard SQL query using JSON support of SQLite¶
Find all nodes that contain a certain substring in the value of a particular key in their JSON object. This can be done with the function json_extract in SQLite, but also with the more broadly supported -> operator, which is available in SQLite, MySQL and PostgreSQL.
%%time
key = "official full name"
substring = "myeloid"
query = f"""
SELECT id, type, json_extract(properties, '$."{key}"') AS official_full_name
FROM nodes
WHERE LOWER(official_full_name) LIKE LOWER(?)
ORDER BY id;
"""
search_term = f'%{substring}%'
cursor.execute(query, (search_term,))
result = cursor.fetchall()
print(f'Nodes with the substring "{substring}" in the value of the key "official full name" in the JSON data')
for row in result:
print(row)
Nodes with the substring "myeloid" in the value of the key "official full name" in the JSON data
('MLF1', 'Gene', 'myeloid leukemia factor 1')
('MZF1', 'Gene', 'myeloid zinc finger 1')
('TREM1', 'Gene', 'triggering receptor expressed on myeloid cells 1')
('TREM2', 'Gene', 'triggering receptor expressed on myeloid cells 2')
CPU times: user 1.86 s, sys: 383 ms, total: 2.24 s
Wall time: 2.24 s
%%time
key = "official full name"
substring = "myeloid"
query = f"""
SELECT id, type, properties -> '$."{key}"' AS official_full_name
FROM nodes
WHERE LOWER(official_full_name) LIKE LOWER(?)
ORDER BY id;
"""
search_term = f'%{substring}%'
cursor.execute(query, (search_term,))
result = cursor.fetchall()
print(f'Nodes with the substring "{substring}" in the value of the key "official full name" in the JSON data')
for row in result:
print(row)
Nodes with the substring "myeloid" in the value of the key "official full name" in the JSON data
('MLF1', 'Gene', '"myeloid leukemia factor 1"')
('MZF1', 'Gene', '"myeloid zinc finger 1"')
('TREM1', 'Gene', '"triggering receptor expressed on myeloid cells 1"')
('TREM2', 'Gene', '"triggering receptor expressed on myeloid cells 2"')
CPU times: user 1.94 s, sys: 303 ms, total: 2.24 s
Wall time: 2.24 s
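The two results differ in one detail: json_extract returns the plain text, while -> returns the value in its JSON representation, including the surrounding double quotes. SQLite additionally offers the ->> operator, which returns an SQL value without the quotes. Both operators were introduced in SQLite 3.38.0, so the following sketch checks the library version before using them. The variable names are made up for this sketch, and the JSON string mirrors one of the entries shown in the output.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
properties_json = '{"official full name": "myeloid leukemia factor 1"}'
json_path = '$."official full name"'

# json_extract returns the text value itself.
extracted = demo_db.execute(
    "SELECT json_extract(?, ?)", (properties_json, json_path)
).fetchone()[0]
print("json_extract:", extracted)

# The -> and ->> operators require SQLite 3.38.0 or newer.
supports_arrow_operators = sqlite3.sqlite_version_info >= (3, 38, 0)
if supports_arrow_operators:
    as_json = demo_db.execute("SELECT ? -> ?", (properties_json, json_path)).fetchone()[0]
    as_text = demo_db.execute("SELECT ? ->> ?", (properties_json, json_path)).fetchone()[0]
    print("->  :", as_json)  # JSON representation, keeps the double quotes
    print("->> :", as_text)  # SQL text, no surrounding quotes
```

For substring searches with LIKE, the difference rarely matters, but for exact comparisons the quoted form returned by -> would not match the plain string.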