PrimeKG
- Website: Precision Medicine Oriented Knowledge Graph
- Preprint: bioRxiv: Building a knowledge graph to enable precision medicine
- Publication: Nature: Building a knowledge graph to enable precision medicine
- "PrimeKG integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs."
- Code repository: GitHub: PrimeKG
- "July 2023 update: In July 2023, this repository was updated to rebuild PrimeKG and update the knowledge graph to include database releases up to July 2023. Note that the files published on Harvard DataVerse remain unchanged; however, we provide new scripts and updated links should users wish to build their own current version of PrimeKG."
- Data repository: Harvard Dataverse: PrimeKG
Load the knowledge graph
- Source used: Harvard Dataverse: PrimeKG > Access Dataset > Original Format ZIP (dataverse_files.zip), which contains kg.csv, drug_features.csv, and disease_features.csv
In [1]:
%%time
import zipfile
import pandas as pd

def read_file(zip_filename, csv_filename):
    # Read one CSV member directly from the ZIP archive without extracting it
    with zipfile.ZipFile(zip_filename, 'r') as zf:
        with zf.open(csv_filename) as f:
            df = pd.read_csv(f, engine="pyarrow")
    return df

df = read_file("dataverse_files.zip", "kg.csv")
df_drug = read_file("dataverse_files.zip", "drug_features.csv")
df_disease = read_file("dataverse_files.zip", "disease_features.csv")
CPU times: user 39.1 s, sys: 11.1 s, total: 50.2 s
Wall time: 20.8 s
In [2]:
df.shape, df_drug.shape, df_disease.shape
Out[2]:
((8100498, 12), (7957, 18), (44133, 18))
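Since loading the full files takes a while, it can help to confirm that a member exists and to look at its column names before reading everything. The following is a minimal sketch using only the standard library; `peek_csv_header` is a hypothetical helper that is not part of the notebook, and the in-memory archive is toy data standing in for `dataverse_files.zip`.

```python
import csv
import io
import zipfile

def peek_csv_header(zip_source, csv_filename):
    """Return the column names of a CSV member without loading the whole file."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        if csv_filename not in zf.namelist():
            raise FileNotFoundError(
                f"{csv_filename!r} not found; archive contains {zf.namelist()}"
            )
        with zf.open(csv_filename) as raw:
            # Wrap the binary stream so csv.reader receives text lines
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return next(csv.reader(text))

# Toy archive built in memory (assumption: stands in for the real download)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "kg.csv",
        "relation,display_relation,x_name,y_name\n"
        "protein_protein,ppi,PHYHIP,KIF15\n",
    )

toy_header = peek_csv_header(buffer, "kg.csv")
print(toy_header)
```

With the real download, the same call would be `peek_csv_header("dataverse_files.zip", "kg.csv")`.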
Inspect rows
In [3]:
df.head(2)
Out[3]:
| | relation | display_relation | x_index | x_id | x_type | x_name | x_source | y_index | y_id | y_type | y_name | y_source |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | protein_protein | ppi | 0 | 9796 | gene/protein | PHYHIP | NCBI | 8889 | 56992 | gene/protein | KIF15 | NCBI |
1 | protein_protein | ppi | 1 | 7918 | gene/protein | GPANK1 | NCBI | 2798 | 9240 | gene/protein | PNMA1 | NCBI |
In [4]:
df_drug.head(2)
Out[4]:
| | node_index | description | half_life | indication | mechanism_of_action | protein_binding | pharmacodynamics | state | atc_1 | atc_2 | atc_3 | atc_4 | category | group | pathway | molecular_weight | tpsa | clogp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14012 | Copper is a transition metal and a trace eleme... | None | For use in the supplementation of total parent... | Copper is absorbed from the gut via high affin... | Copper is nearly entirely bound by ceruloplasm... | Copper is incorporated into many enzymes throu... | Copper is a solid. | None | None | None | None | Copper is part of Copper-containing Intrauteri... | Copper is approved and investigational. | None | None | None | None |
1 | 14013 | Oxygen is an element displayed by the symbol O... | The half-life is approximately 122.24 seconds | Oxygen therapy in clinical settings is used ac... | Oxygen therapy increases the arterial pressure... | Oxygen binds to oxygen-carrying protein in red... | Oxygen therapy improves effective cellular oxy... | Oxygen is a gas. | Oxygen is anatomically related to various. | Oxygen is in the therapeutic group of all othe... | Oxygen is pharmacologically related to all oth... | The chemical and functional group of is medic... | Oxygen is part of Chalcogens ; Elements ; Gase... | Oxygen is approved and vet_approved. | None | The molecular weight is 32.0. | Oxygen has a topological polar surface area of... | None |
In [5]:
df_disease.head(2)
Out[5]:
| | node_index | mondo_id | mondo_name | group_id_bert | group_name_bert | mondo_definition | umls_description | orphanet_definition | orphanet_prevalence | orphanet_epidemiology | orphanet_clinical_description | orphanet_management_and_treatment | mayo_symptoms | mayo_causes | mayo_risk_factors | mayo_complications | mayo_prevention | mayo_see_doc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27165 | 8019 | mullerian aplasia and hyperandrogenism | None | None | Deficiency of the glycoprotein WNT4, associate... | Deficiency of the glycoprotein wnt4, associate... | A rare syndrome with 46,XX disorder of sex dev... | None | None | None | None | None | None | None | None | None | None |
1 | 27165 | 8019 | mullerian aplasia and hyperandrogenism | None | None | Deficiency of the glycoprotein WNT4, associate... | Deficiency of the glycoprotein wnt4, associate... | A rare syndrome with 46,XX disorder of sex dev... | None | None | None | None | None | None | None | None | None | None |
Inspect entity and relation types
In [6]:
def report(df, col):
    # Print the number of unique values in a column and list them if there are fewer than 100
    entries = sorted(df[col].unique())
    n = len(entries)
    print(f'Column "{col}" contains {n} unique entries:')
    if n < 100:
        for entry in entries:
            print(f"- {entry}")
    print()
In [7]:
report(df, "x_type")
Column "x_type" contains 10 unique entries:
- anatomy
- biological_process
- cellular_component
- disease
- drug
- effect/phenotype
- exposure
- gene/protein
- molecular_function
- pathway
In [9]:
report(df, "relation")
Column "relation" contains 30 unique entries:
- anatomy_anatomy
- anatomy_protein_absent
- anatomy_protein_present
- bioprocess_bioprocess
- bioprocess_protein
- cellcomp_cellcomp
- cellcomp_protein
- contraindication
- disease_disease
- disease_phenotype_negative
- disease_phenotype_positive
- disease_protein
- drug_drug
- drug_effect
- drug_protein
- exposure_bioprocess
- exposure_cellcomp
- exposure_disease
- exposure_exposure
- exposure_molfunc
- exposure_protein
- indication
- molfunc_molfunc
- molfunc_protein
- off-label use
- pathway_pathway
- pathway_protein
- phenotype_phenotype
- phenotype_protein
- protein_protein
In [10]:
report(df, "display_relation")
Column "display_relation" contains 18 unique entries:
- associated with
- carrier
- contraindication
- enzyme
- expression absent
- expression present
- indication
- interacts with
- linked to
- off-label use
- parent-child
- phenotype absent
- phenotype present
- ppi
- side effect
- synergistic interaction
- target
- transporter
Search for entries about selected drugs, proteins, and diseases
In [11]:
drugs = ["penicillin", "imatinib"]
proteins = ["polymerase", "cox2", "hsp90"]
diseases = ["liver cancer", "alzheimer", "amyotrophy", "hiv"]
x = df['x_name'].str.lower()
y = df['y_name'].str.lower()
for substring in drugs + proteins + diseases:
    # Count rows in which either endpoint name contains the search term
    n = (x.str.contains(substring) | y.str.contains(substring)).sum()
    print(f'"{substring}": {n} rows')
"penicillin": 4166 rows
"imatinib": 3454 rows
"polymerase": 21100 rows
"cox2": 560 rows
"hsp90": 4768 rows
"liver cancer": 1328 rows
"alzheimer": 784 rows
"amyotrophy": 642 rows
"hiv": 2570 rows
In [12]:
import networkx as nx
import gravis as gv
name = 'hsp90'
mask1 = df['x_name'].str.lower().str.contains(name)
mask2 = df['y_name'].str.lower().str.contains(name)
df_sub = df[mask1 | mask2]
df_sub.shape
Out[12]:
(4768, 12)
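The searches above lowercase the names and call `str.contains`, which by default interprets the search term as a regular expression. That is harmless for terms like "hsp90", but a term containing characters such as "(" or "+" would be read as regex syntax. The following standard-library sketch shows literal, case-insensitive matching on toy rows; `count_matching_rows` and the example rows are hypothetical and only illustrate the idea.

```python
import re

def count_matching_rows(rows, term):
    """Count rows whose x_name or y_name contains `term` literally, ignoring case."""
    # re.escape turns regex metacharacters in the term into literal characters
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(
        1
        for x_name, y_name in rows
        if pattern.search(x_name) or pattern.search(y_name)
    )

# Toy (x_name, y_name) pairs; the last row is invented to include a parenthesis
toy_rows = [
    ("HSP90AB1", "IKBKE"),
    ("Geldanamycin", "HSP90B1"),
    ("PHYHIP", "KIF15"),
    ("Drug-X (salt)", "PROT1"),
]

hsp90_count = count_matching_rows(toy_rows, "hsp90")
paren_count = count_matching_rows(toy_rows, "x (")
print(hsp90_count, paren_count)
```

In pandas, the corresponding option is `str.contains(term, case=False, regex=False)`, which avoids both the manual lowercasing and the regex interpretation.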
In [13]:
report(df_sub, "relation")
Column "relation" contains 14 unique entries: - anatomy_protein_absent - anatomy_protein_present - bioprocess_bioprocess - bioprocess_protein - cellcomp_cellcomp - cellcomp_protein - disease_protein - drug_protein - exposure_protein - molfunc_molfunc - molfunc_protein - pathway_pathway - pathway_protein - protein_protein
In [14]:
report(df_sub, "x_type")
Column "x_type" contains 9 unique entries: - anatomy - biological_process - cellular_component - disease - drug - exposure - gene/protein - molecular_function - pathway
In [15]:
report(df_sub, "y_type")
Column "y_type" contains 9 unique entries: - anatomy - biological_process - cellular_component - disease - drug - exposure - gene/protein - molecular_function - pathway
In [51]:
def node_to_color(id, type, source, name):
    color = None
    name = name.lower()
    if type == "gene/protein":
        color = "green"
    elif type == "drug":
        color = "orange"
    elif type == "disease":
        color = "red"
    elif type == "pathway":
        color = "blue"
    elif type == "biological_process":
        color = "lightblue"
    return color

def node_to_shape(id, type, source, name):
    shape = None
    if type == "disease":
        shape = "hexagon"
    elif type not in ["gene/protein", "drug"]:
        shape = "rectangle"
    return shape

def node_to_size(id, type, source, name):
    size = None
    name = name.lower()
    if type == "gene/protein" and "hsp90" in name:
        size = 16
    elif type in ["pathway", "biological_process"]:
        size = 7
    return size

def node_to_hover(id, type, source, name):
    return f'<b>{name}</b>\n\nType: {type}\nSource: {source}'
g = nx.Graph()
for i, row in df_sub.iterrows():
    x_id = row['x_id']
    x_type = row['x_type']
    x_source = row['x_source']
    x_name = row['x_name']
    y_id = row['y_id']
    y_type = row['y_type']
    y_source = row['y_source']
    y_name = row['y_name']
    relation = row['relation']
    display_relation = row['display_relation']
    # Keep only edges that link a protein to a disease, drug, pathway, or biological process
    if relation in ["disease_protein", "drug_protein", "pathway_protein", "bioprocess_protein"]:
        g.add_node(
            x_id,
            name=x_name,
            hover=node_to_hover(x_id, x_type, x_source, x_name),
            size=node_to_size(x_id, x_type, x_source, x_name),
            color=node_to_color(x_id, x_type, x_source, x_name),
            shape=node_to_shape(x_id, x_type, x_source, x_name),
        )
        g.add_node(
            y_id,
            name=y_name,
            hover=node_to_hover(y_id, y_type, y_source, y_name),
            size=node_to_size(y_id, y_type, y_source, y_name),
            color=node_to_color(y_id, y_type, y_source, y_name),
            shape=node_to_shape(y_id, y_type, y_source, y_name),
        )
        g.add_edge(
            x_id,
            y_id,
            type=relation,
            name=display_relation,
        )
print(f"Result: A graph with {len(g.nodes)} nodes and {len(g.edges)} edges.")
Result: A graph with 294 nodes and 354 edges.
In [52]:
gv.d3(g, show_node_label=False, edge_curvature=0.2)
Inspect a subset of the knowledge graph with OpenCog Hyperon
Convert the HSP90 subset of the knowledge graph (df_sub) into MeTTa expressions, a format that the pattern matcher can query.
In [18]:
import hyperon
def run(program):
    # Run the MeTTa program in a fresh interpreter instance
    runner = hyperon.MeTTa()
    result = runner.run(program)
    return result
Convert rows of the KG into MeTTa expressions
In [19]:
%%time
data = []
for i, row in df_sub.iterrows():
    entity1 = row["x_name"]
    entity2 = row["y_name"]
    relation = row["relation"]
    # One expression per edge: (relation (entity1 entity2))
    expression = f"({relation} ({entity1} {entity2}))"
    data.append(expression)
data[:5]
CPU times: user 1.04 s, sys: 12 ms, total: 1.06 s
Wall time: 1.09 s
Out[19]:
['(protein_protein (HSP90AB1 IKBKE))', '(protein_protein (HSP90AB1 PCGF6))', '(protein_protein (HSP90AB1 MAP2K7))', '(protein_protein (HSP90AB1 BAG2))', '(protein_protein (HSP90AB1 MAPK15))']
In [20]:
len(data)
Out[20]:
4768
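The conversion above inserts names verbatim. MeTTa's S-expression syntax separates atoms by whitespace and uses parentheses for nesting, so a name containing spaces or parentheses, or a relation such as "off-label use", would be read as several atoms rather than one symbol. One possible workaround, sketched here with hypothetical helpers that are not part of the notebook, is to collapse such characters into underscores before building the expression.

```python
import re

def to_metta_symbol(name):
    """Collapse whitespace, parentheses, and double quotes into underscores.

    Hypothetical helper: distinct names may collide after sanitizing.
    """
    return re.sub(r'[\s()"]+', "_", str(name).strip()).strip("_")

def row_to_expression(relation, entity1, entity2):
    """Build a (relation (entity1 entity2)) expression from sanitized parts."""
    rel, e1, e2 = (to_metta_symbol(v) for v in (relation, entity1, entity2))
    return f"({rel} ({e1} {e2}))"

# The first call mirrors a row shown above; the second uses invented names
simple = row_to_expression("protein_protein", "HSP90AB1", "IKBKE")
spaced = row_to_expression("off-label use", "Example drug (salt)", "example disease")
print(simple)
print(spaced)
```

Queries would then need to use the sanitized names as well, and the original names are no longer recoverable from the symbols, so this is a trade-off rather than a drop-in replacement.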
Query the KG with the Pattern Matcher of OpenCog Hyperon
In [21]:
get_ppi = """
(= (get_ppi $p1)
    (match
        &self
        (protein_protein ($p1 $p2))
        (The protein $p1 binds to protein $p2)
    )
)
"""

get_dpi = """
(= (get_dpi $d)
    (match
        &self
        (drug_protein ($d $p))
        (The drug $d binds to protein $p)
    )
)
"""

get_loc = """
(= (get_loc $p)
    (match
        &self
        (anatomy_protein_present ($a $p))
        (The protein $p is present in the $a)
    )
)
"""

get_ddi = """
(= (get_ddi $d1)
    (match
        &self
        (,
            (drug_protein ($d1 $p))
            (drug_protein ($d2 $p))
        )
        (The drugs $d1 and $d2 interact with the same protein $p)
    )
)
"""

function_definitions = [
    get_ppi,
    get_dpi,
    get_loc,
    get_ddi,
]

queries = [
    "!(get_dpi Geldanamycin)",
    "!(get_ddi Radicicol)",
    "!(get_loc HSP90B1)",
    "!(get_ppi HSP90B1)",
]

# Facts first, then function definitions, then the queries to evaluate
program = '\n'.join(data + function_definitions + queries)
In [22]:
%%time
run(program)
CPU times: user 2.63 s, sys: 3.87 ms, total: 2.64 s
Wall time: 2.63 s
Out[22]:
[[(The drug Geldanamycin binds to protein HSP90B1), (The drug Geldanamycin binds to protein HSP90AA1), (The drug Geldanamycin binds to protein HSP90AB1)], [(The drugs Radicicol and Copper interact with the same protein HSP90B1), (The drugs Radicicol and Diglyme interact with the same protein HSP90B1), (The drugs Radicicol and Geldanamycin interact with the same protein HSP90B1), (The drugs Radicicol and 2-Chlorodideoxyadenosine interact with the same protein HSP90B1), (The drugs Radicicol and Rifabutin interact with the same protein HSP90B1), (The drugs Radicicol and Radicicol interact with the same protein HSP90B1), (The drugs Radicicol and SNX-5422 interact with the same protein HSP90AB1), (The drugs Radicicol and Tanespimycin interact with the same protein HSP90AB1), (The drugs Radicicol and Geldanamycin interact with the same protein HSP90AB1), (The drugs Radicicol and CCT-018159 interact with the same protein HSP90AB1), (The drugs Radicicol and Polaprezinc interact with the same protein HSP90AB1), (The drugs Radicicol and Radicicol interact with the same protein HSP90AB1)], [(The protein HSP90B1 is present in the kidney), (The protein HSP90B1 is present in the uterus), (The protein HSP90B1 is present in the tendon), (The protein HSP90B1 is present in the tongue), (The protein HSP90B1 is present in the metanephros), (The protein HSP90B1 is present in the cerebellum), (The protein HSP90B1 is present in the eye), (The protein HSP90B1 is present in the liver), (The protein HSP90B1 is present in the esophagus), (The protein HSP90B1 is present in the blood), (The protein HSP90B1 is present in the decidua), (The protein HSP90B1 is present in the hypothalamus), (The protein HSP90B1 is present in the neocortex), (The protein HSP90B1 is present in the vagina), (The protein HSP90B1 is present in the testis), (The protein HSP90B1 is present in the brain), (The protein HSP90B1 is present in the telencephalon), (The protein HSP90B1 is present in the duodenum), (The protein 
HSP90B1 is present in the myometrium), (The protein HSP90B1 is present in the stomach), (The protein HSP90B1 is present in the deltoid), (The protein HSP90B1 is present in the thymus), (The protein HSP90B1 is present in the nerve), (The protein HSP90B1 is present in the nasopharynx), (The protein HSP90B1 is present in the embryo), (The protein HSP90B1 is present in the heart), (The protein HSP90B1 is present in the putamen), (The protein HSP90B1 is present in the forebrain), (The protein HSP90B1 is present in the amygdala), (The protein HSP90B1 is present in the endometrium), (The protein HSP90B1 is present in the peritoneum), (The protein HSP90B1 is present in the intestine), (The protein HSP90B1 is present in the bronchus), (The protein HSP90B1 is present in the colon), (The protein HSP90B1 is present in the pancreas), (The protein HSP90B1 is present in the spleen), (The protein HSP90B1 is present in the gingiva), (The protein HSP90B1 is present in the caecum), (The protein HSP90B1 is present in the tonsil), (The protein HSP90B1 is present in the trachea), (The protein HSP90B1 is present in the placenta), (The protein HSP90B1 is present in the aorta), (The protein HSP90B1 is present in the jejunum), (The protein HSP90B1 is present in the myocardium), (The protein HSP90B1 is present in the lung), (The protein HSP90B1 is present in the midbrain)], [(The protein HSP90B1 binds to protein DNM1L), (The protein HSP90B1 binds to protein GABRA1), (The protein HSP90B1 binds to protein SUMO2), (The protein HSP90B1 binds to protein CRELD2), (The protein HSP90B1 binds to protein MAPK6), (The protein HSP90B1 binds to protein MGP), (The protein HSP90B1 binds to protein APOB), (The protein HSP90B1 binds to protein IKBKG), (The protein HSP90B1 binds to protein RXRB), (The protein HSP90B1 binds to protein P4HB), (The protein HSP90B1 binds to protein ADAMTS3), (The protein HSP90B1 binds to protein ZNF780A), (The protein HSP90B1 binds to protein H1-2), (The protein HSP90B1 binds to 
protein ERBB2), (The protein HSP90B1 binds to protein EPN3), (The protein HSP90B1 binds to protein ANXA8L1), (The protein HSP90B1 binds to protein CYSLTR2), (The protein HSP90B1 binds to protein EVI5), (The protein HSP90B1 binds to protein VPS13C), (The protein HSP90B1 binds to protein EEF1D), (The protein HSP90B1 binds to protein CCT2), (The protein HSP90B1 binds to protein LIMA1), (The protein HSP90B1 binds to protein ITPR3), (The protein HSP90B1 binds to protein LPIN1), (The protein HSP90B1 binds to protein LIG3), (The protein HSP90B1 binds to protein NKX3-1), (The protein HSP90B1 binds to protein OS9), (The protein HSP90B1 binds to protein CDYL2), (The protein HSP90B1 binds to protein CSNK2A2), (The protein HSP90B1 binds to protein KIT), (The protein HSP90B1 binds to protein FANCA), (The protein HSP90B1 binds to protein CAMLG), (The protein HSP90B1 binds to protein MYO1B), (The protein HSP90B1 binds to protein CALR), (The protein HSP90B1 binds to protein EIF2AK3), (The protein HSP90B1 binds to protein DAAM1), (The protein HSP90B1 binds to protein H1-5), (The protein HSP90B1 binds to protein SH3RF3), (The protein HSP90B1 binds to protein KLF16), (The protein HSP90B1 binds to protein SOD1), (The protein HSP90B1 binds to protein SWAP70), (The protein HSP90B1 binds to protein TRIM68), (The protein HSP90B1 binds to protein GPRC5B), (The protein HSP90B1 binds to protein HSP90B2P), (The protein HSP90B1 binds to protein GANAB), (The protein HSP90B1 binds to protein EYA1), (The protein HSP90B1 binds to protein RASGRP4), (The protein HSP90B1 binds to protein UBC), (The protein HSP90B1 binds to protein APAF1), (The protein HSP90B1 binds to protein SGTB), (The protein HSP90B1 binds to protein KIF20B), (The protein HSP90B1 binds to protein MCF2L), (The protein HSP90B1 binds to protein ASPM), (The protein HSP90B1 binds to protein AIMP1), (The protein HSP90B1 binds to protein COX15), (The protein HSP90B1 binds to protein HCN2), (The protein HSP90B1 binds to protein VDAC1), 
(The protein HSP90B1 binds to protein RUFY1), (The protein HSP90B1 binds to protein ITGB1), (The protein HSP90B1 binds to protein PPP1R12A), (The protein HSP90B1 binds to protein PTGFR), (The protein HSP90B1 binds to protein EBP), (The protein HSP90B1 binds to protein POLR2E), (The protein HSP90B1 binds to protein VWF), (The protein HSP90B1 binds to protein BMPR1A), (The protein HSP90B1 binds to protein BATF), (The protein HSP90B1 binds to protein GPR37), (The protein HSP90B1 binds to protein RACK1), (The protein HSP90B1 binds to protein CSNK2A1), (The protein HSP90B1 binds to protein DERL1), (The protein HSP90B1 binds to protein NFYB), (The protein HSP90B1 binds to protein FANCC), (The protein HSP90B1 binds to protein PRKCSH), (The protein HSP90B1 binds to protein HMGXB4), (The protein HSP90B1 binds to protein DLST), (The protein HSP90B1 binds to protein DLD), (The protein HSP90B1 binds to protein TLR1), (The protein HSP90B1 binds to protein CCDC112), (The protein HSP90B1 binds to protein SMARCC1), (The protein HSP90B1 binds to protein EMC2), (The protein HSP90B1 binds to protein TLR2), (The protein HSP90B1 binds to protein RAB8A), (The protein HSP90B1 binds to protein CCDC171), (The protein HSP90B1 binds to protein GBA), (The protein HSP90B1 binds to protein SLC25A1), (The protein HSP90B1 binds to protein SOAT1), (The protein HSP90B1 binds to protein PCNA), (The protein HSP90B1 binds to protein CNPY2), (The protein HSP90B1 binds to protein EGFR), (The protein HSP90B1 binds to protein BIRC2), (The protein HSP90B1 binds to protein SEL1L), (The protein HSP90B1 binds to protein STXBP2), (The protein HSP90B1 binds to protein LYZL2), (The protein HSP90B1 binds to protein SDHA), (The protein HSP90B1 binds to protein C1GALT1), (The protein HSP90B1 binds to protein H2BC9), (The protein HSP90B1 binds to protein SMARCA4), (The protein HSP90B1 binds to protein PPIB), (The protein HSP90B1 binds to protein SOD2), (The protein HSP90B1 binds to protein TMC6), (The protein 
HSP90B1 binds to protein SMG1), (The protein HSP90B1 binds to protein ESRP2), (The protein HSP90B1 binds to protein GLT1D1), (The protein HSP90B1 binds to protein MYO10), (The protein HSP90B1 binds to protein TXNDC11), (The protein HSP90B1 binds to protein KIF15), (The protein HSP90B1 binds to protein POLK), (The protein HSP90B1 binds to protein RXFP3), (The protein HSP90B1 binds to protein KDM4B), (The protein HSP90B1 binds to protein C15orf39), (The protein HSP90B1 binds to protein FOS), (The protein HSP90B1 binds to protein KCNQ4), (The protein HSP90B1 binds to protein ODAD2), (The protein HSP90B1 binds to protein LDLR), (The protein HSP90B1 binds to protein NR4A1), (The protein HSP90B1 binds to protein NFKB1), (The protein HSP90B1 binds to protein RB1CC1), (The protein HSP90B1 binds to protein UGGT1), (The protein HSP90B1 binds to protein LRRC63), (The protein HSP90B1 binds to protein SUGT1), (The protein HSP90B1 binds to protein MLLT3), (The protein HSP90B1 binds to protein ESR1), (The protein HSP90B1 binds to protein CLU), (The protein HSP90B1 binds to protein CALML3), (The protein HSP90B1 binds to protein HSPA9), (The protein HSP90B1 binds to protein RFWD3), (The protein HSP90B1 binds to protein MDM2), (The protein HSP90B1 binds to protein UBB), (The protein HSP90B1 binds to protein XRRA1), (The protein HSP90B1 binds to protein HSPA5), (The protein HSP90B1 binds to protein SMARCC2), (The protein HSP90B1 binds to protein SP1), (The protein HSP90B1 binds to protein ADRB2), (The protein HSP90B1 binds to protein SLC33A1), (The protein HSP90B1 binds to protein MAP3K7), (The protein HSP90B1 binds to protein XRCC3), (The protein HSP90B1 binds to protein HSD17B10), (The protein HSP90B1 binds to protein TLR4)]]
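The `get_ddi` result lists "Radicicol and Radicicol" because nothing in the conjunction requires `$d2` to differ from `$d1`, so the queried drug also matches itself. The following plain-Python analogue illustrates the same join on a few drug-protein pairs taken from the output above. It is only a sketch of the logic, not a description of how Hyperon evaluates the query, and `drugs_sharing_target` is a hypothetical helper.

```python
def drugs_sharing_target(facts, drug, exclude_self=True):
    """Return sorted (other_drug, protein) pairs for proteins that `drug` also binds."""
    # Proteins bound by the queried drug (first pattern of the conjunction)
    targets = {protein for d, protein in facts if d == drug}
    # Any drug bound to one of those proteins (second pattern of the conjunction)
    return sorted(
        (other, protein)
        for other, protein in facts
        if protein in targets and not (exclude_self and other == drug)
    )

# A small subset of the drug_protein pairs that appear in the query output
toy_facts = {
    ("Radicicol", "HSP90B1"),
    ("Geldanamycin", "HSP90B1"),
    ("Radicicol", "HSP90AB1"),
    ("Tanespimycin", "HSP90AB1"),
}

with_self = drugs_sharing_target(toy_facts, "Radicicol", exclude_self=False)
without_self = drugs_sharing_target(toy_facts, "Radicicol")
print(with_self)
print(without_self)
```

With `exclude_self=False` the self-pairs appear, mirroring the notebook output; filtering them out leaves only the other drugs that share a target.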