Detailed example

This detailed example walks through most features of the kgw package.

Load the package

[1]:
import kgw

print(kgw.__version__)
0.1.0

List all supported projects

Currently this package supports five knowledge graph projects from the field of biomedicine. They are grouped in a module named after the domain, because future versions of the package may cover more projects from different domains.

[2]:
project_names = [x for x in dir(kgw.biomedicine) if not x.startswith("_")]

print(project_names)
['Ckg', 'Hald', 'MonarchKg', 'Oregano', 'PrimeKg']
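
The `dir`-based listing above returns every public name in the module, so it would also pick up helper functions or constants if the module ever exposed any. A slightly stricter variant can be sketched with the standard library's `inspect` module, which keeps only classes. The helper name `list_public_classes` is made up for this illustration and is not part of kgw. A stand-in module is used so that the snippet runs without the package installed.

```python
import inspect
import types


def list_public_classes(module):
    """Return the sorted names of all public classes found in a module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if not name.startswith("_")
    )


# Stand-in for a module such as kgw.biomedicine (contents are hypothetical).
demo = types.ModuleType("demo")
demo.Hald = type("Hald", (), {})
demo.Oregano = type("Oregano", (), {})
demo._Internal = type("_Internal", (), {})  # private, should be skipped
demo.some_constant = 42                     # not a class, should be skipped

print(list_public_classes(demo))
```

With kgw installed, the same helper could presumably be called as `list_public_classes(kgw.biomedicine)`.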

Get details for a project

Each project comes with additional information such as references to its publication, website and data repository. The text can be read in a nicely formatted form in the API documentation. Alternatively, it can be accessed in its raw ReST form by calling Python’s built-in help system on the project’s class, which is done below for the project HALD:

[3]:
help(kgw.biomedicine.Hald)
Help on class Hald in module kgw.biomedicine._hald:

class Hald(kgw._shared.base.Project)
 |  Hald(version, workdir)
 |
 |  Human Aging and Longevity Dataset (HALD).
 |
 |  References
 |  ----------
 |  - Publication: https://doi.org/10.1038/s41597-023-02781-0
 |  - Website: https://bis.zju.edu.cn/hald
 |  - Code: https://github.com/zexuwu/hald
 |  - Data: https://doi.org/10.6084/m9.figshare.22828196
 |
 |  Method resolution order:
 |      Hald
 |      kgw._shared.base.Project
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  to_schema(self)
 |      Determine the schema of the knowledge graph.
 |
 |      Output: `schema.html`
 |
 |      Generate a standalone HTML file with an interactive graph visualization
 |      of all entity types in the KG and the relationship types by which they
 |      are connected.
 |
 |      References
 |      ----------
 |      - `Neo4j: Graph modeling guidelines
 |        <https://neo4j.com/docs/getting-started/data-modeling/guide-data-modeling/>`__
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from kgw._shared.base.Project:
 |
 |  __init__(self, version, workdir)
 |      Initialize a project instance so that tasks can be defined on it.
 |
 |      Parameters
 |      ----------
 |      version : `str`
 |          Version of the dataset that will be downloaded and processed.
 |          The method :meth:`get_versions` returns all currently available
 |          versions.
 |      workdir : `str`
 |          Path of the working directory in which a unique subdirectory will
 |          be created to hold all downloaded and generated files for
 |          this project and version.
 |
 |      Raises
 |      ------
 |      ValueError
 |          Raised if `version` is invalid or unavailable.
 |      TypeError
 |          Raised if `workdir` is not a string.
 |
 |      Notes
 |      -----
 |      This class does not automatically download or process any data.
 |      Such tasks first need to be specified by calling the relevant methods
 |      on the project object and then passing it to the function :func:`~kgw.run`
 |      that builds and executes a corresponding workflow.
 |
 |  to_csv(self)
 |      Convert the knowledge graph to two `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__ files.
 |
 |      File names: `kg_nodes.csv` and `kg_edges.csv`
 |
 |  to_graphml(self)
 |      Convert the knowledge graph to a `GraphML <https://en.wikipedia.org/wiki/GraphML>`__ file.
 |
 |      File name: `kg.graphml`
 |
 |  to_jsonl(self)
 |      Convert the knowledge graph to two `JSONL <https://jsonlines.org>`__ files.
 |
 |      Output: `kg_nodes.jsonl` and `kg_edges.jsonl`
 |
 |  to_metta(self, representation='spo')
 |      Convert the knowledge graph to a `MeTTa <https://metta-lang.dev>`__ file.
 |
 |      File name: Depending on the chosen representation, either
 |      `kg_spo.metta`,
 |      `kg_properties_aggregated.metta`, or
 |      `kg_properties_expanded.metta`.
 |
 |      Parameters
 |      ----------
 |      representation : str
 |          Available options:
 |
 |          - `"spo"`: Semantic triples of the form `("subject", "predicate", "object")`.
 |            If properties are present in the original KG, they are ignored in this
 |            representation.
 |          - `"properties_aggregated"`: Properties (=key-value pairs) are represented
 |            by putting each key on a separate line, but each value is ensured to be a
 |            single number or string. This means values that hold a compound data type
 |            like a list or dict are aggregated into one string in JSON string format.
 |            Text identifiers of nodes are reused to create the association with their
 |            properties, while text identifiers of the form "e{cnt}" are introduced for
 |            edges to serve the same purpose.
 |          - `"properties_expanded"`: Properties (=key-value pairs) are represented
 |            by fully expanding their keys and values onto as many lines as required.
 |            Numerical identifiers for nodes and edges are introduced to create the
 |            association between these elements and their properties.
 |
 |  to_sql(self)
 |      Convert the knowledge graph to a `SQL <https://docs.fileformat.com/database/sql>`__ file.
 |
 |      Output: `kg.sql`
 |
 |      References
 |      ----------
 |      - `<>`__
 |
 |  to_sqlite(self)
 |      Convert the knowledge graph to a file-based SQLite database.
 |
 |      Output: `kg.sqlite`
 |
 |      References
 |      ----------
 |      - `SQLite <https://www.sqlite.org>`__
 |
 |  to_statistics(self)
 |      Determine some statistical properties of the knowledge graph.
 |
 |      Output: `statistics.json`
 |
 |      This method generates a JSON file with simple statistics
 |      such as node, edge and type counts.
 |
 |  ----------------------------------------------------------------------
 |  Class methods inherited from kgw._shared.base.Project:
 |
 |  get_versions()
 |      Fetch all currently available versions from the data repository of the project.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from kgw._shared.base.Project:
 |
 |  __dict__
 |      dictionary for instance variables
 |
 |  __weakref__
 |      list of weak references to the object
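
When only a short description is needed instead of the full help text, the docstring can also be read programmatically. The following is a minimal sketch using `inspect.getdoc` from the standard library. The helper `summary_line` and the class `DemoProject` are hypothetical stand-ins, so the snippet runs without kgw installed.

```python
import inspect


def summary_line(obj):
    """Return the first line of an object's docstring, or an empty string."""
    doc = inspect.getdoc(obj) or ""
    return doc.splitlines()[0] if doc else ""


class DemoProject:
    """Human Aging and Longevity Dataset (HALD).

    References
    ----------
    - Website: https://bis.zju.edu.cn/hald
    """


print(summary_line(DemoProject))
```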

Inspect available versions of a project

Projects share their knowledge graphs in various data formats and host them on different data repositories on the web. They often publish multiple versions of their knowledge graph, each project with its own naming convention for version identifiers. One reason for an update is that an error was discovered in a previous version. Another, more important reason is that newer source data became available, so a new knowledge graph is published to reflect the current state of available information.

This package can fetch all currently available versions for each project from their respective data repositories.

[4]:
kgw.biomedicine.Hald.get_versions()
[4]:
['1', '2', '3', '4', '5', '6']
[5]:
kgw.biomedicine.MonarchKg.get_versions()
[5]:
['2023-09-28',
 '2023-10-17',
 '2023-11-16',
 '2023-12-16',
 '2024-01-13',
 '2024-02-13',
 '2024-03-13',
 '2024-03-18',
 '2024-04-18',
 '2024-05-17',
 '2024-05-22',
 '2024-06-10',
 '2024-07-04',
 '2024-07-12',
 '2024-08-12',
 '2024-09-12']
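
Because version identifiers follow different conventions per project, it can help to validate a requested version against the fetched list before using it. The sketch below is a hypothetical helper, not kgw's own logic. It assumes that the list is ordered from oldest to newest, as the two lists above appear to be, and that the special value "latest" (used later in this example) refers to the last entry.

```python
def resolve_version(requested, available):
    """Map a requested version to a concrete identifier from `available`.

    Assumption: `available` is ordered from oldest to newest.
    """
    if not available:
        raise ValueError("No versions available.")
    if requested == "latest":
        return available[-1]
    if requested not in available:
        raise ValueError(
            f"Unknown version {requested!r}; choose one of {available}."
        )
    return requested


hald_versions = ["1", "2", "3", "4", "5", "6"]
print(resolve_version("latest", hald_versions))
print(resolve_version("3", hald_versions))
```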

Define a workflow covering multiple projects and versions

This example demonstrates a few points:

  1. A project is represented by a class that has to be instantiated with a version and workdir argument.

    • The version identifier needs to be one of those listed by get_versions(), or "latest", which is used below to select the most recent version.

    • The working directory needs to be a local directory, which after running the workflow will contain both the raw downloads and generated results.

  2. A project’s knowledge graph can be converted into various output formats by making method calls on the object. The actual processing happens when running the workflow. Decoupling the definition of a workflow from its execution allows dependencies to be analyzed up front, so that no task runs twice and independent tasks can run in parallel.

  3. Multiple versions of the same project are represented by different objects and can be part of the same workflow. Behind the scenes, different files will be fetched from the data repositories and converted into different target locations.

  4. Multiple output formats can be requested, and the selection can differ from project to project. The example below requests all available formats for the first project.

[6]:
workdir = "a_workdir_for_kgw"

# Project 1 in its first published version
hald1 = kgw.biomedicine.Hald("1", workdir)
hald1.to_sqlite()
hald1.to_schema()
hald1.to_statistics()
hald1.to_csv()
hald1.to_jsonl()
hald1.to_sql()
hald1.to_graphml()
hald1.to_metta(representation="spo")
hald1.to_metta(representation="properties_aggregated")
hald1.to_metta(representation="properties_expanded")

# Project 1 in its latest published version
hald2 = kgw.biomedicine.Hald("latest", workdir)
hald2.to_schema()
hald2.to_graphml()

# Project 2
oregano = kgw.biomedicine.Oregano("latest", workdir)
oregano.to_schema()
oregano.to_statistics()
oregano.to_metta()

# Project 3
monarchkg = kgw.biomedicine.MonarchKg("latest", workdir)
monarchkg.to_schema()
monarchkg.to_csv()
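
The decoupling of definition and execution described in point 2 of the list above can be illustrated with a toy class. This is only a sketch of the general pattern under simplified assumptions, not how kgw is implemented. The names `DeferredProject` and `run_deferred` are invented for the illustration.

```python
class DeferredProject:
    """Toy stand-in that records requested outputs instead of producing them."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.requested = []  # outputs in the order they were first requested

    def _request(self, output):
        # Requesting the same output twice must not schedule it twice.
        if output not in self.requested:
            self.requested.append(output)

    def to_csv(self):
        self._request("csv")

    def to_graphml(self):
        self._request("graphml")


def run_deferred(projects):
    """Process all recorded requests and return the list of performed tasks."""
    performed = []
    for project in projects:
        for output in project.requested:
            performed.append(f"{project.name}_v{project.version}:{output}")
    return performed


demo = DeferredProject("hald", "1")
demo.to_csv()
demo.to_csv()      # duplicate request, recorded only once
demo.to_graphml()
print(demo.requested)        # nothing has been converted yet
print(run_deferred([demo]))  # work happens only at this point
```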

Run it

The workflow definition only included the desired version, workdir and outputs. Implicitly this involves additional tasks beyond the conversion to the desired outputs, such as creating all target directories and producing intermediate representations. The workflow engine behind run will automatically build a dependency graph of all tasks and execute them in the required order and whenever possible in parallel.
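
How a dependency graph yields an execution order can be sketched with the standard library's `graphlib` module (Python 3.9 or later). This is not the engine kgw uses. The task names are borrowed from the log further below, while the dependency edges between them are assumptions made for the illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (edges are assumed).
dependencies = {
    "CreateDirectory": set(),
    "DownloadFile": {"CreateDirectory"},
    "CreateSqliteFile": {"DownloadFile"},
    "CreateCsvNodesFile": {"CreateSqliteFile"},
    "CreateCsvEdgesFile": {"CreateSqliteFile"},
}


def execution_batches(graph):
    """Group tasks into batches; tasks within one batch could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(execution_batches(dependencies), start=1):
    print(number, batch)
```

In this toy graph the two CSV tasks end up in the same batch, because both depend only on the SQLite file.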

The run function accepts a workflow definition in the form of either 1) a single project object or 2) a list of multiple project objects, which is shown below. Optionally, verbose=False can be passed to turn off the printing of a log.

[7]:
status = kgw.run([hald1, hald2, oregano, monarchkg])
Log of performed tasks
======================

2024-10-11 01:18:37  Started   CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/results)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/results)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/results)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/results)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/downloads)
2024-10-11 01:18:37  Started   CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/hald_v6/downloads)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/results)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/results)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/downloads)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/oregano_v3/downloads)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/results)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/results)
2024-10-11 01:18:38  Started   CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads)
2024-10-11 01:18:38  Finished  CreateDirectory(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Relation_Info.json)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Entity_Info.json)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=GENES.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=TARGET.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=DISEASES.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=EFFECT.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=ACTIVITY.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PATHWAYS.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=SIDE_EFFECT.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PHENOTYPES.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=INDICATION.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=COMPOUND.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=OREGANO_V2.1.tsv)
2024-10-11 01:18:38  Started   DownloadFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads, filename=monarch-kg.tar.gz)
2024-10-11 01:18:48  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=ACTIVITY.tsv)
2024-10-11 01:18:48  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=EFFECT.tsv)
2024-10-11 01:18:50  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PATHWAYS.tsv)
2024-10-11 01:18:52  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=INDICATION.tsv)
2024-10-11 01:18:54  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=SIDE_EFFECT.tsv)
2024-10-11 01:18:54  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=DISEASES.tsv)
2024-10-11 01:18:56  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=PHENOTYPES.tsv)
2024-10-11 01:19:02  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=GENES.tsv)
2024-10-11 01:19:10  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=COMPOUND.tsv)
2024-10-11 01:19:35  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=TARGET.tsv)
2024-10-11 01:19:36  Started   FetchAnnotationFiles(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:36  Finished  FetchAnnotationFiles(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42  Finished  DownloadFile(dirpath=a_workdir_for_kgw/oregano_v3/downloads, filename=OREGANO_V2.1.tsv)
2024-10-11 01:19:42  Started   FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42  Finished  FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:19:42  Started   CreateSqliteFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:16  Finished  DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Relation_Info.json)
2024-10-11 01:20:17  Started   FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:20:17  Finished  FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:20:21  Finished  CreateSqliteFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21  Started   CreateSchemaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21  Started   CreateStatisticsFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:21  Started   CreateMettaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3, representation=spo)
2024-10-11 01:20:23  Finished  CreateStatisticsFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:30  Finished  CreateSchemaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3)
2024-10-11 01:20:31  Finished  DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:20:31  Started   FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:20:31  Finished  FetchEdgesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:20:32  Finished  CreateMettaFile(dirpath=a_workdir_for_kgw/oregano_v3, version=3, representation=spo)
2024-10-11 01:21:20  Finished  DownloadFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12/downloads, filename=monarch-kg.tar.gz)
2024-10-11 01:21:20  Started   FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:20  Finished  FetchKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:20  Started   DecompressKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:21:40  Finished  DownloadFile(dirpath=a_workdir_for_kgw/hald_v1/downloads, filename=Entity_Info.json)
2024-10-11 01:21:40  Started   FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:40  Finished  FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:40  Started   CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:21:51  Finished  DownloadFile(dirpath=a_workdir_for_kgw/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:21:51  Started   FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:21:51  Finished  FetchNodesFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:21:51  Started   CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:02  Finished  CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02  Started   CreateStatisticsFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02  Started   CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02  Started   CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02  Started   CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:02  Started   CreateJsonlNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03  Started   CreateSqlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03  Started   CreateJsonlEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03  Started   CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:03  Started   CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_aggregated)
2024-10-11 01:22:03  Started   CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=spo)
2024-10-11 01:22:03  Started   CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_expanded)
2024-10-11 01:22:03  Finished  CreateStatisticsFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:05  Finished  DecompressKnowledgeGraphFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:22:05  Started   CreateSqliteFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:22:06  Finished  CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:09  Finished  CreateJsonlEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:10  Finished  CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=spo)
2024-10-11 01:22:15  Finished  CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:16  Finished  CreateJsonlNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:17  Finished  CreateSqlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:28  Finished  CreateSqliteFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:28  Started   CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:28  Started   CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:32  Finished  CreateCompactSchemaFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:22:41  Finished  CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_aggregated)
2024-10-11 01:22:44  Finished  CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:22:47  Finished  CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1)
2024-10-11 01:23:06  Finished  CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
2024-10-11 01:24:23  Finished  CreateMettaFile(dirpath=a_workdir_for_kgw/hald_v1, version=1, representation=properties_expanded)
2024-10-11 01:35:08  Finished  CreateSqliteFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08  Started   CreateSchemaFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08  Started   CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:08  Started   CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:35:35  Finished  CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:37:46  Finished  CreateSchemaFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
2024-10-11 01:41:22  Finished  CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)


Summary of workflow results
===========================

Scheduled 55 tasks of which:
* 55 ran successfully:
    - 2 CreateCompactSchemaFile(...)
    - 2 CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateCsvEdgesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
    - 2 CreateCsvNodesFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateCsvNodesFile(dirpath=a_workdir_for_kgw/monarchkg_v2024-09-12, version=2024-09-12)
    - 8 CreateDirectory(dirpath=a_workdir_for_kgw/hald_v1/downloads,a_workdir_for_kgw/hald_v1/results,a_workdir_for_kgw/hald_v6/downloads,...)
    - 2 CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v1, version=1) and CreateGraphMlFile(dirpath=a_workdir_for_kgw/hald_v6, version=6)
    ...

This progress looks :) because there were no failed tasks or missing dependencies
[8]:
print("The workflow fully succeeded:", status)
The workflow fully succeeded: True
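
Since the return value is a plain boolean, a script can react to a partial failure, for example by rerunning the workflow a limited number of times. The helper below is a hypothetical sketch. It accepts any zero-argument callable that returns a bool, so a stand-in is used here instead of an actual `kgw.run` call.

```python
import time


def run_with_retries(run_workflow, max_attempts=3, wait_seconds=0.0):
    """Call `run_workflow` until it returns True or attempts are exhausted.

    Returns the number of attempts needed, or None if every attempt failed.
    `run_workflow` could be e.g. `lambda: kgw.run(projects)` (not run here).
    """
    for attempt in range(1, max_attempts + 1):
        if run_workflow():
            return attempt
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    return None


# Stand-in workflow that fails twice (e.g. a flaky connection), then succeeds.
outcomes = iter([False, False, True])
attempts_needed = run_with_retries(lambda: next(outcomes))
print(attempts_needed)
```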

Inspect the results

Running the workflow generates several directories and files within the user-defined working directory. Each combination of project and version gets a unique directory that contains all of its downloads and results.

[9]:
import os

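def inspect_directory(path):
    """Print the directory tree below path, coloring file names by origin."""
    # ANSI escape codes for colored terminal output
    RED = "\033[91m"
    GREEN = "\033[92m"
    RESET = "\033[0m"
    for root, _, files in sorted(os.walk(path)):
        dir_name = os.path.basename(root)
        # Nesting depth relative to the starting directory
        level = root.replace(path, '').count(os.sep)
        # Raw downloads are shown in red, all other files in green
        color = RED if dir_name == "downloads" else GREEN
        dir_indent = ' ' * 2 * level
        file_indent = ' ' * 2 * (level + 1)
        print(f"{dir_indent}{dir_name}/")
        for file_name in files:
            print(f"{file_indent}{color}{file_name}{RESET}")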

inspect_directory(workdir)
a_workdir_for_kgw/
  hald_v1/
    downloads/
      Entity_Info.json
      Relation_Info.json
    results/
      kg_edges.jsonl
      kg.graphml
      kg.sqlite
      kg_properties_expanded.metta
      statistics.json
      kg_properties_aggregated.metta
      kg_nodes.jsonl
      kg_spo.metta
      schema.html
      kg.sql
      kg_edges.csv
      kg_nodes.csv
  hald_v6/
    downloads/
      Entity_Info.json
      Relation_Info.json
    results/
      kg.graphml
      kg.sqlite
      schema.html
  monarchkg_v2024-09-12/
    downloads/
      monarch-kg.tar.gz
    results/
      kg.sqlite
      schema.html
      kg_edges.csv
      kg_nodes.csv
  oregano_v3/
    downloads/
      COMPOUND.tsv
      PATHWAYS.tsv
      TARGET.tsv
      ACTIVITY.tsv
      INDICATION.tsv
      GENES.tsv
      OREGANO_V2.1.tsv
      DISEASES.tsv
      PHENOTYPES.tsv
      EFFECT.tsv
      SIDE_EFFECT.tsv
    results/
      kg.sqlite
      statistics.json
      kg_spo.metta
      schema.html
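
Each results directory in the listing above contains a kg.sqlite file, which can be opened with Python's built-in `sqlite3` module. The sketch below builds a path following the `<project>_v<version>/results/` layout and lists all tables with their row counts. Since the actual table layout of kg.sqlite is not described here, the demo creates a throwaway database with hypothetical `nodes` and `edges` tables. The helper names are also made up for the illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def results_path(workdir, project, version, filename):
    """Build a path following the <project>_v<version>/results/ layout."""
    return os.path.join(workdir, f"{project}_v{version}", "results", filename)


def table_row_counts(db_path):
    """Return a dict mapping each table name in an SQLite file to its row count."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


# Demo on a throwaway database; table names and contents are hypothetical.
with tempfile.TemporaryDirectory() as demo_workdir:
    db_path = results_path(demo_workdir, "hald", "1", "kg.sqlite")
    os.makedirs(os.path.dirname(db_path))
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE nodes (id TEXT)")
        conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
        conn.executemany("INSERT INTO nodes VALUES (?)", [("a",), ("b",)])
        conn.execute("INSERT INTO edges VALUES ('a', 'b')")
        conn.commit()
    demo_counts = table_row_counts(db_path)

print(demo_counts)
```

On real output, `table_row_counts(results_path(workdir, "hald", "1", "kg.sqlite"))` would report whatever tables the generated database actually contains.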

Interpret them

  • The workflow definition at the beginning requests that the knowledge graphs of several projects be converted into output files of different formats. This requires that the original files are downloaded from the projects’ respective web repositories and then converted step by step into the desired output formats.

  • Running the workflow auto-generates a directory structure in the user-defined working directory a_workdir_for_kgw. At the top level there is one subdirectory for each project in its chosen version, so that no collisions can occur. Each of these subdirectories contains two further subdirectories that separate fetched files from generated files:

    • The downloads directory contains all files fetched from the project’s web repository in unmodified form, shown in red here. The number and types of files vary between projects because there is no widely accepted standard for how to encode a knowledge graph.

    • The results directory contains all files derived from the raw downloads, shown in green here.

      • kg.sqlite is a file-based SQLite database that serves as the intermediate format and common basis for all conversions and analyses supported by this package. For this reason, it always has to be generated before any other output is produced.

  • It is possible to define a small workflow for a single project and output, or a large workflow for multiple projects, versions and output formats. Internally, the Python package luigi is used to build a dependency graph that contains all tasks needed to produce the desired output files. The local inputs and outputs of each task are well defined, so the scheduler can run each task as early as possible and often in parallel. For example, all downloads are independent and don’t need to wait for each other, whereas some downstream conversions require multiple input files and have to wait until a specific subset of downloads or other conversions has finished. The overall process can be tracked through log messages that are written whenever a task starts or finishes. If everything worked, the run function returns True. If some part failed, e.g. due to a lost web connection, the remaining parts are still completed as far as possible, but False is returned to make clear that something is missing. The workflow can then be restarted and will not begin from zero, but will only run tasks that have not yet produced their local outputs. Some work may be lost anyway: when a conversion is interrupted midway, its progress is usually lost, whereas downloads attempt to continue from partial files.
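
The restart behavior described above follows from a simple rule that luigi applies by default: a task counts as complete when its output exists. The following sketch imitates that rule with plain files in a temporary directory. It is a simplified illustration and does not use luigi or kgw. The function `run_missing` and the placeholder action are hypothetical.

```python
import tempfile
from pathlib import Path


def run_missing(tasks):
    """Run only tasks whose output file is absent; return names of tasks run.

    `tasks` is a list of (name, output_path, action) tuples in dependency
    order, where `action(output_path)` creates the output file.
    """
    executed = []
    for name, output_path, action in tasks:
        if output_path.exists():
            continue  # output already present, so the task counts as done
        action(output_path)
        executed.append(name)
    return executed


def write_placeholder(path):
    """Stand-in for a real conversion: just create a small file."""
    path.write_text("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp)
    tasks = [
        ("CreateSqliteFile", results / "kg.sqlite", write_placeholder),
        ("CreateCsvNodesFile", results / "kg_nodes.csv", write_placeholder),
    ]
    first_run = run_missing(tasks)
    (results / "kg_nodes.csv").unlink()  # simulate one lost output
    second_run = run_missing(tasks)

print(first_run)
print(second_run)
```

The second call repeats only the task whose output went missing, which mirrors why a restarted workflow does not start from zero.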