Detailed example
The following is a detailed example that introduces most features of the kgw package.
Load the package
[1]:
import kgw
print(kgw.__version__)
0.1.0
List all supported projects
Currently this package supports five knowledge graph projects from the field of biomedicine. They are grouped in a module named after the domain, because future versions of the package may cover more projects from different domains.
[2]:
project_names = [x for x in dir(kgw.biomedicine) if not x.startswith("_")]
print(project_names)
['Ckg', 'Hald', 'MonarchKg', 'Oregano', 'PrimeKg']
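These names are regular attributes of the kgw.biomedicine module, so a project class can also be resolved programmatically from its name. A minimal sketch (no network access involved):

```python
# Resolve each project name back to its class object via getattr.
for name in project_names:
    cls = getattr(kgw.biomedicine, name)
    print(name, "->", cls)
```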
Get details for a project
Each project comes with additional information such as references to its publication, website and data repository. The text can be read in a nicely formatted form in the API documentation. Alternatively, it can be accessed in its raw ReST form by calling Python’s built-in help system on the project’s class, which is done below for the project HALD:
[3]:
help(kgw.biomedicine.Hald)
Help on class Hald in module kgw.biomedicine._hald:
class Hald(kgw._shared.base.Project)
| Hald(version, workdir)
|
| Human Aging and Longevity Dataset (HALD).
|
| References
| ----------
| - Publication: https://doi.org/10.1038/s41597-023-02781-0
| - Website: https://bis.zju.edu.cn/hald
| - Code: https://github.com/zexuwu/hald
| - Data: https://doi.org/10.6084/m9.figshare.22828196
|
| Method resolution order:
| Hald
| kgw._shared.base.Project
| builtins.object
|
| Methods defined here:
|
| to_schema(self)
| Determine the schema of the knowledge graph.
|
| Output: `schema.html`
|
| Generate a standalone HTML file with an interactive graph visualization
| of all entity types in the KG and the relationship types by which they
| are connected.
|
| References
| ----------
| - `Neo4j: Graph modeling guidelines
| <https://neo4j.com/docs/getting-started/data-modeling/guide-data-modeling/>`__
|
| ----------------------------------------------------------------------
| Methods inherited from kgw._shared.base.Project:
|
| __init__(self, version, workdir)
| Initialize a project instance so that tasks can be defined on it.
|
| Parameters
| ----------
| version : `str`
| Version of the dataset that will be downloaded and processed.
| The method :meth:`get_versions` returns all currently available
| versions.
| workdir : `str`
| Path of the working directory in which a unique subdirectory will
| be created to hold all downloaded and generated files for
| this project and version.
|
| Raises
| ------
| ValueError
| Raised if `version` is invalid or unavailable.
| TypeError
| Raised if `workdir` is not a string.
|
| Notes
| -----
| This class does not automatically download or process any data.
| Such tasks first need to be specified by calling the relevant methods
| on the project object and then passing it to the function :func:`~kgw.run`
| that builds and executes a corresponding workflow.
|
| to_csv(self)
| Convert the knowledge graph to two `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__ files.
|
| File names: `kg_nodes.csv` and `kg_edges.csv`
|
| to_graphml(self)
| Convert the knowledge graph to a `GraphML <https://en.wikipedia.org/wiki/GraphML>`__ file.
|
| File name: `kg.graphml`
|
| to_jsonl(self)
| Convert the knowledge graph to two `JSONL <https://jsonlines.org>`__ files.
|
| Output: `kg_nodes.jsonl` and `kg_edges.jsonl`
|
| to_metta(self, representation='spo')
| Convert the knowledge graph to a `MeTTa <https://metta-lang.dev>`__ file.
|
| File name: Depending on the chosen representation, either
| `kg_spo.metta`,
| `kg_properties_aggregated.metta`, or
| `kg_properties_expanded.metta`.
|
| Parameters
| ----------
| representation : str
| Available options:
|
| - `"spo"`: Semantic triples of the form `("subject", "predicate", "object")`.
| If properties are present in the original KG, they are ignored in this
| representation.
| - `"properties_aggregated"`: Properties (=key-value pairs) are represented
| by putting each key on a separate line, but each value is ensured to be a
| single number or string. This means values that hold a compound data type
| like a list or dict are aggregated into one string in JSON string format.
| Text identifiers of nodes are reused to create the association with their
| properties, while text identifiers of the form "e{cnt}" are introduced for
| edges to serve the same purpose.
| - `"properties_expanded"`: Properties (=key-value pairs) are represented
| by fully expanding their keys and values onto as many lines as required.
| Numerical identifiers for nodes and edges are introduced to create the
| association between these elements and their properties.
|
| to_sql(self)
| Convert the knowledge graph to a `SQL <https://docs.fileformat.com/database/sql>`__ file.
|
| Output: `kg.sql`
|
| References
| ----------
| - `<>`__
|
| to_sqlite(self)
| Convert the knowledge graph to a file-based SQLite database.
|
| Output: `kg.sqlite`
|
| References
| ----------
| - `SQLite <https://www.sqlite.org>`__
|
| to_statistics(self)
| Determine some statistical properties of the knowledge graph.
|
| Output: `statistics.json`
|
| This method generates a JSON file with simple statistics
| such as node, edge and type counts.
|
| ----------------------------------------------------------------------
| Class methods inherited from kgw._shared.base.Project:
|
| get_versions()
| Fetch all currently available versions from the data repository of the project.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from kgw._shared.base.Project:
|
| __dict__
| dictionary for instance variables
|
| __weakref__
| list of weak references to the object
Inspect available versions of a project
Projects share their knowledge graphs in various data formats and make them available on different data repositories on the web. Often they publish multiple versions of their knowledge graph, using different naming conventions for the version identifiers. One reason for providing an update is that an error was discovered in a previous version; a more common reason is that newer source data became available, so a new knowledge graph is published to reflect the current state of available information.
This package can fetch all currently available versions for each project from their respective data repositories.
[4]:
kgw.biomedicine.Hald.get_versions()
[4]:
['1', '2', '3', '4', '5', '6']
[5]:
kgw.biomedicine.MonarchKg.get_versions()
[5]:
['2023-09-28',
'2023-10-17',
'2023-11-16',
'2023-12-16',
'2024-01-13',
'2024-02-13',
'2024-03-13',
'2024-03-18',
'2024-04-18',
'2024-05-17',
'2024-05-22',
'2024-06-10',
'2024-07-04',
'2024-07-12',
'2024-08-12',
'2024-09-12']
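The lists shown above appear to be ordered from oldest to newest. Under that assumption, the most recent version identifier can be picked programmatically, as in this sketch:

```python
# Assumes get_versions() returns identifiers ordered from oldest to newest,
# as in the output shown above.
versions = kgw.biomedicine.MonarchKg.get_versions()
newest = versions[-1]
print(newest)
```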
Define a workflow covering multiple projects and versions
This example demonstrates a few points:
- A project is represented by a class that has to be instantiated with a `version` and a `workdir` argument. The version identifier needs to be one of those listed by `get_versions()`. The working directory needs to be a local directory, which after running the workflow will contain both the raw downloads and the generated results.
- A project's knowledge graph can be converted into various output formats by making method calls on the object. The actual processing happens when running the workflow. Decoupling the definition of a workflow from running it makes it possible to analyze dependencies, so that no task is run twice and several independent tasks can be run in parallel.
- Multiple versions of the same project are represented by different objects and can be part of the same workflow. Behind the scenes, different files will be fetched from the data repositories and converted into different target locations.
- Multiple output formats can be defined, and the available ones can vary from project to project. The example below shows all available formats on the first project.
[6]:
workdir = "a_workdir_for_kgw"
# Project 1 in the first version it got published
hald1 = kgw.biomedicine.Hald("1", workdir)
hald1.to_sqlite()
hald1.to_schema()
hald1.to_statistics()
hald1.to_csv()
hald1.to_jsonl()
hald1.to_sql()
hald1.to_graphml()
hald1.to_metta(representation="spo")
hald1.to_metta(representation="properties_aggregated")
hald1.to_metta(representation="properties_expanded")
# Project 1 in the latest version it got published
hald2 = kgw.biomedicine.Hald("latest", workdir)
hald2.to_schema()
hald2.to_graphml()
# Project 2
oregano = kgw.biomedicine.Oregano("latest", workdir)
oregano.to_schema()
oregano.to_statistics()
oregano.to_metta()
# Project 3
monarchkg = kgw.biomedicine.MonarchKg("latest", workdir)
monarchkg.to_schema()
monarchkg.to_csv()
Run it
The workflow definition only included the desired versions, workdir and outputs. Implicitly this involves additional tasks beyond the conversion to the desired outputs, such as creating all target directories and producing intermediate representations. The workflow engine behind `run` will automatically build a dependency graph of all tasks and execute them in the required order and, whenever possible, in parallel.
The `run` function accepts a workflow definition in the form of either 1) a single project object or 2) multiple project objects in a list, which is shown below. Optionally, `verbose=False` can be passed to turn off the printing of a log.
[7]:
status = kgw.run([hald1, hald2, oregano, monarchkg])
Log of performed tasks
======================
2024-10-11 01:18:37 Started CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/results)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/results)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/results)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/results)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/downloads)
2024-10-11 01:18:37 Started CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/downloads)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/results)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/results)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/downloads)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/downloads)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/results)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/results)
2024-10-11 01:18:38 Started CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads)
2024-10-11 01:18:38 Finished CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Relation_Info.json)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Entity_Info.json)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=GENES.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=TARGET.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=DISEASES.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=EFFECT.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=ACTIVITY.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PATHWAYS.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=SIDE_EFFECT.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PHENOTYPES.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=INDICATION.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=COMPOUND.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=OREGANO_V2.1.tsv)
2024-10-11 01:18:38 Started DownloadFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads, filename=monarch-kg.tar.gz)
2024-10-11 01:18:48 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=ACTIVITY.tsv)
2024-10-11 01:18:48 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=EFFECT.tsv)
2024-10-11 01:18:50 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PATHWAYS.tsv)
2024-10-11 01:18:52 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=INDICATION.tsv)
2024-10-11 01:18:54 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=SIDE_EFFECT.tsv)
2024-10-11 01:18:54 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=DISEASES.tsv)
2024-10-11 01:18:56 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PHENOTYPES.tsv)
2024-10-11 01:19:02 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=GENES.tsv)
2024-10-11 01:19:10 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=COMPOUND.tsv)
2024-10-11 01:19:35 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=TARGET.tsv)
2024-10-11 01:19:36 Started FetchAnnotationFiles(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:36 Finished FetchAnnotationFiles(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42 Finished DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=OREGANO_V2.1.tsv)
2024-10-11 01:19:42 Started FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42 Finished FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42 Started CreateSqliteFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:16 Finished DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Relation_Info.json)
2024-10-11 01:20:17 Started FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:20:17 Finished FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:20:21 Finished CreateSqliteFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21 Started CreateSchemaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21 Started CreateStatisticsFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21 Started CreateMettaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3, representation=spo)
2024-10-11 01:20:23 Finished CreateStatisticsFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:30 Finished CreateSchemaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:31 Finished DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:20:31 Started FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:20:31 Finished FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:20:32 Finished CreateMettaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3, representation=spo)
2024-10-11 01:21:20 Finished DownloadFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads, filename=monarch-kg.tar.gz)
2024-10-11 01:21:20 Started FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:20 Finished FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:20 Started DecompressKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:40 Finished DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Entity_Info.json)
2024-10-11 01:21:40 Started FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:40 Finished FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:40 Started CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:51 Finished DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:21:51 Started FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:21:51 Finished FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:21:51 Started CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:02 Finished CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02 Started CreateStatisticsFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02 Started CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02 Started CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02 Started CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02 Started CreateJsonlNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03 Started CreateSqlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03 Started CreateJsonlEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03 Started CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03 Started CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_aggregated)
2024-10-11 01:22:03 Started CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=spo)
2024-10-11 01:22:03 Started CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_expanded)
2024-10-11 01:22:03 Finished CreateStatisticsFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:05 Finished DecompressKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:22:05 Started CreateSqliteFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:22:06 Finished CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:09 Finished CreateJsonlEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:10 Finished CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=spo)
2024-10-11 01:22:15 Finished CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:16 Finished CreateJsonlNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:17 Finished CreateSqlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:28 Finished CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:28 Started CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:28 Started CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:32 Finished CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:41 Finished CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_aggregated)
2024-10-11 01:22:44 Finished CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:47 Finished CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:23:06 Finished CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:24:23 Finished CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_expanded)
2024-10-11 01:35:08 Finished CreateSqliteFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08 Started CreateSchemaFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08 Started CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08 Started CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:35 Finished CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:37:46 Finished CreateSchemaFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:41:22 Finished CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
Summary of workflow results
===========================
Scheduled 55 tasks of which:
* 55 ran successfully:
- 2 CreateCompactSchemaFile(...)
- 2 CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
- 2 CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
- 8 CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads,a_workdir_for_kgw/hald_v1/results,a_workdir_for_kgw/hald_v6/downloads,...)
- 2 CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
...
This progress looks :) because there were no failed tasks or missing dependencies
[8]:
print("The workflow fully succeeded:", status)
The workflow fully succeeded: True
Inspect the results
Running the workflow generates several directories and files within the user-defined working directory. Each combination of project and version gets a unique subdirectory that contains all of its downloads and results.
[9]:
import os
def inspect_directory(path):
    # ANSI escape codes: files in "downloads" directories are printed in red,
    # everything else (i.e. generated results) in green.
    RED = "\033[91m"
    GREEN = "\033[92m"
    RESET = "\033[0m"
    for root, _, files in sorted(os.walk(path)):
        dir_name = os.path.basename(root)
        level = root.replace(path, '').count(os.sep)
        color = RED if dir_name == "downloads" else GREEN
        dir_indent = ' ' * 2 * level
        file_indent = ' ' * 2 * (level + 1)
        print(f"{dir_indent}{dir_name}/")
        for file_name in files:
            print(f"{file_indent}{color}{file_name}{RESET}")

inspect_directory(workdir)
a_workdir_for_kgw/
hald_v1/
downloads/
Entity_Info.json
Relation_Info.json
results/
kg_edges.jsonl
kg.graphml
kg.sqlite
kg_properties_expanded.metta
statistics.json
kg_properties_aggregated.metta
kg_nodes.jsonl
kg_spo.metta
schema.html
kg.sql
kg_edges.csv
kg_nodes.csv
hald_v6/
downloads/
Entity_Info.json
Relation_Info.json
results/
kg.graphml
kg.sqlite
schema.html
monarchkg_v2024-09-12/
downloads/
monarch-kg.tar.gz
results/
kg.sqlite
schema.html
kg_edges.csv
kg_nodes.csv
oregano_v3/
downloads/
COMPOUND.tsv
PATHWAYS.tsv
TARGET.tsv
ACTIVITY.tsv
INDICATION.tsv
GENES.tsv
OREGANO_V2.1.tsv
DISEASES.tsv
PHENOTYPES.tsv
EFFECT.tsv
SIDE_EFFECT.tsv
results/
kg.sqlite
statistics.json
kg_spo.metta
schema.html
Interpret them
The workflow definition at the beginning means that the knowledge graphs of several projects should be converted to several output files of different formats. This requires that the original files are downloaded from the projects’ respective web repositories and then converted step by step into the desired output formats.
Running the workflow auto-generates a directory structure in the user-defined working directory `a_workdir_for_kgw`. First there is a subdirectory for each project in its chosen version, so that no collisions can happen. Each such directory then has two further subdirectories that separate fetched from generated files:
- The `downloads` directory contains all files fetched from the project's web repository in unmodified form, shown in red here. The number and types of files vary between projects because there is no widely accepted standard for how to encode a knowledge graph.
- The `results` directory contains all files derived from the raw downloads, shown in green here. `kg.sqlite` is a file-based SQLite database, which serves as the intermediate format used as the common basis for all conversions and analyses supported by this package. For this reason, it has to be generated before any other output can be produced.
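Since the internal table layout of `kg.sqlite` is not described here, the following sketch only lists the tables it contains, using nothing but Python's built-in sqlite3 module and the directory layout shown above:

```python
import sqlite3

# Open the intermediate database generated for HALD version 1 and list its
# tables; the path follows the directory structure printed earlier.
con = sqlite3.connect("a_workdir_for_kgw/hald_v1/results/kg.sqlite")
rows = con.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([row[0] for row in rows])
con.close()
```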
It is possible to define a small workflow for a single project and output, or a large workflow for multiple projects, versions and output formats. Internally, the Python package luigi is used to build a dependency graph that contains all tasks needed to produce the desired output files. The local inputs and outputs of each task along the way are well defined, so the scheduler can automatically run tasks as early as possible and often in parallel. For example, all downloads are independent and do not need to wait for each other, but some downstream conversions require multiple input files and therefore have to wait for a specific subset of downloads or other conversions to finish. The overall process can be tracked through messages that are written whenever a task starts or finishes. If everything worked, the `run` function returns `True`. If some part failed, e.g. due to a broken web connection, the remaining parts are still completed as far as possible, but `False` is returned to make clear that something is missing. The workflow can then be restarted; it will not begin again from zero, but rather only run tasks that have not yet produced their local outputs. Some work may still be lost, e.g. a conversion that was interrupted midway usually has to start over, but downloads will attempt to continue from partial files.
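As a minimal sketch of this restart behavior, the same workflow definition can simply be passed to run again; tasks whose output files already exist are skipped, and only the missing pieces are attempted:

```python
# Re-running the same workflow is cheap: tasks that already produced their
# local outputs are skipped, only missing ones are executed again.
status = kgw.run([hald1, hald2, oregano, monarchkg], verbose=False)
if not status:
    print("Some outputs are still missing, e.g. due to network issues; try again later.")
```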