Quickstart example

The following provides a minimal example for getting started with the package kgw.

Load the package

[1]:
import kgw

Define a minimal workflow

[2]:
hald = kgw.biomedicine.Hald(version="latest", workdir="a_user_chosen_directory")
hald.to_graphml()
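
Passing version="latest" resolves to the newest available release at run time. For reproducible runs, a concrete version can presumably be pinned instead; the following is a hypothetical sketch, in which the accepted value format is an assumption based on the hald_v6 directory name that appears below:

# Hypothetical: pin a concrete version instead of "latest", so that the same
# data is fetched on every run. Whether the value is passed as the string "6",
# an integer, or some other identifier is an assumption of this sketch.
hald_pinned = kgw.biomedicine.Hald(version="6", workdir="a_user_chosen_directory")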

Run it

[3]:
status = kgw.run(hald)
Log of performed tasks
======================

2024-10-11 01:18:32  Started   CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads)
2024-10-11 01:18:32  Started   CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/results)
2024-10-11 01:18:32  Finished  CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads)
2024-10-11 01:18:32  Finished  CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/results)
2024-10-11 01:18:32  Started   DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:18:32  Started   DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:19:46  Finished  DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:19:46  Started   FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:19:46  Finished  FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44  Finished  DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:21:44  Started   FetchNodesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44  Finished  FetchNodesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44  Started   CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:22:19  Finished  CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:22:19  Started   CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:23:02  Finished  CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)


Summary of workflow results
===========================

Scheduled 8 tasks of which:
* 8 ran successfully:
    - 2 CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads,a_user_chosen_directory/hald_v6/results)
    - 1 CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
    - 1 CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
    - 2 DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json) and DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
    - 1 FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
    ...

This progress looks :) because there were no failed tasks or missing dependencies

[4]:
print("The workflow fully succeeded:", status)
The workflow fully succeeded: True
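
Since run returns a plain boolean, a simple retry can be scripted around it. A minimal sketch, relying on the restart behavior described under “Interpret them” below (tasks that already produced their outputs are skipped, only tasks with missing outputs run again):

# Rerun the workflow once if some task failed, e.g. due to a dropped web
# connection; tasks that already produced their outputs are not repeated.
if not status:
    status = kgw.run(hald)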

Inspect the results

Running the workflow generates several directories and files within the user-defined working directory.

[5]:
import os

def inspect_directory(path):
    # ANSI escape codes for coloring file names in terminal output
    RED = "\033[91m"
    GREEN = "\033[92m"
    RESET = "\033[0m"
    # Walk the tree in a deterministic, alphabetically sorted order
    for root, _, files in sorted(os.walk(path)):
        dir_name = os.path.basename(root)
        # Nesting depth below the given path determines the indentation
        level = root.replace(path, '').count(os.sep)
        # Fetched files (inside "downloads") in red, generated files in green
        color = RED if dir_name == "downloads" else GREEN
        dir_indent = ' ' * 2 * level
        file_indent = ' ' * 2 * (level + 1)
        print(f"{dir_indent}{dir_name}/")
        for file_name in sorted(files):
            print(f"{file_indent}{color}{file_name}{RESET}")

inspect_directory("a_user_chosen_directory")
a_user_chosen_directory/
  hald_v6/
    downloads/
      Entity_Info.json
      Relation_Info.json
    results/
      kg.graphml
      kg.sqlite

Interpret them

  • The workflow definition at the beginning specifies that the knowledge graph of the HALD project should be converted to a GraphML file. This requires downloading the original files from the project’s web repository and then converting them step by step into the desired output format.

  • Running the workflow auto-generates a directory structure in the user-defined working directory a_user_chosen_directory. First, there is a subdirectory named hald_v6 for the project (HALD) in its chosen version (“latest” resolved to version 6 at the time of writing), so that no collisions between projects or versions can occur. Each such project directory then contains two further subdirectories that separate fetched from generated files:

    • The downloads directory contains all files fetched from the project’s web repository in unmodified form. In this case, these are the two JSON files shown in red. The number and types of files vary between projects because there is no widely accepted standard for how to encode a knowledge graph.

    • The results directory contains all files derived from the raw downloads. In this case, these are the two files shown in green, even though only one output format was specified in the workflow definition:

      • kg.sqlite is a file-based SQLite database, which serves as an intermediate format and common basis for all conversions and analyses supported by this package. For this reason, it has to be generated before any other output can be produced. A small inspection sketch follows after this list.

      • kg.graphml is the HALD knowledge graph in the desired GraphML output format. A loading sketch also follows after this list.

  • It is possible to define a larger workflow that includes multiple projects, versions and output formats. Internally, the Python package luigi is used to build a dependency graph that contains all tasks which need to run in order to produce the desired output files. The local inputs and outputs of each task along the way are well defined, so the scheduler can automatically run tasks as early as possible and often in parallel. For example, all downloads are independent, so they don’t need to wait for each other, but some downstream conversions require multiple input files, so they have to wait for a specific subset of downloads or other conversions to finish. The overall process can be tracked through messages that are written whenever a task starts or finishes. If everything worked, the run function returns True. If some part failed, e.g. due to a dropped web connection, the remaining parts are still completed as far as possible, but False is returned to make clear that something is missing. The workflow can then be restarted and will not begin again from zero, but will only run tasks that have not yet produced their local outputs. Some work may still be lost, e.g. the progress of a conversion that was interrupted midway is usually gone, but downloads will attempt to continue from partial files.
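
To make this scheduling model concrete, here is a minimal, self-contained luigi example that is independent of kgw: a task declares its output, and the scheduler skips the task on a rerun if that output already exists. This is a generic illustration of the mechanism, not code from this package:

import luigi

class WriteGreeting(luigi.Task):
    # The declared output is how luigi decides whether the task still
    # needs to run: if the file already exists, the task counts as complete.
    def output(self):
        return luigi.LocalTarget("greeting.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("hello")

# Runs the task once; a second call finds the output file and skips the task
luigi.build([WriteGreeting()], local_scheduler=True)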
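
As a follow-up check, the intermediate SQLite database can be opened with Python’s built-in sqlite3 module. A minimal sketch; since the database schema is not documented in this quickstart, the table names are discovered from the file itself rather than assumed:

import sqlite3

# Connect to the generated database (path taken from the listing above)
con = sqlite3.connect("a_user_chosen_directory/hald_v6/results/kg.sqlite")
try:
    # Discover the tables instead of assuming a schema
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        # Report each table together with its number of rows
        num_rows = con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        print(name, num_rows)
finally:
    con.close()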
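
Similarly, the GraphML output can be loaded with any GraphML-capable tool. A minimal sketch using the third-party package networkx, which is an assumption of this example and not a stated dependency of kgw:

import networkx as nx

# Read the generated GraphML file into a networkx graph object
kg = nx.read_graphml("a_user_chosen_directory/hald_v6/results/kg.graphml")
print(kg.number_of_nodes(), "nodes and", kg.number_of_edges(), "edges")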