Quickstart example
The following provides a minimal example for getting started with the kgw package.
Load the package
[1]:
import kgw
Define a minimal workflow
[2]:
hald = kgw.biomedicine.Hald(version="latest", workdir="a_user_chosen_directory")
hald.to_graphml()
Run it
[3]:
status = kgw.run(hald)
Log of performed tasks
======================
2024-10-11 01:18:32 Started CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads)
2024-10-11 01:18:32 Started CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/results)
2024-10-11 01:18:32 Finished CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads)
2024-10-11 01:18:32 Finished CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/results)
2024-10-11 01:18:32 Started DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:18:32 Started DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:19:46 Finished DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
2024-10-11 01:19:46 Started FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:19:46 Finished FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44 Finished DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json)
2024-10-11 01:21:44 Started FetchNodesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44 Finished FetchNodesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:21:44 Started CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:22:19 Finished CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:22:19 Started CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
2024-10-11 01:23:02 Finished CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
Summary of workflow results
===========================
Scheduled 8 tasks of which:
* 8 ran successfully:
- 2 CreateDirectory(dirpath=a_user_chosen_directory/hald_v6/downloads,a_user_chosen_directory/hald_v6/results)
- 1 CreateGraphMlFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
- 1 CreateSqliteFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
- 2 DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Entity_Info.json) and DownloadFile(dirpath=a_user_chosen_directory/hald_v6/downloads, filename=Relation_Info.json)
- 1 FetchEdgesFile(dirpath=a_user_chosen_directory/hald_v6, version=6)
...
This progress looks :) because there were no failed tasks or missing dependencies
[4]:
print("The workflow fully succeeded:", status)
The workflow fully succeeded: True
Inspect the results
Running the workflow generates several directories and files within the user-defined working directory.
[5]:
import os

def inspect_directory(path):
    # Print the directory tree below path, coloring files red if they
    # live in a "downloads" directory and green otherwise.
    RED = "\033[91m"
    GREEN = "\033[92m"
    RESET = "\033[0m"
    for root, _, files in sorted(os.walk(path)):
        dir_name = os.path.basename(root)
        level = root.replace(path, '').count(os.sep)
        color = RED if dir_name == "downloads" else GREEN
        dir_indent = ' ' * 2 * level
        file_indent = ' ' * 2 * (level + 1)
        print(f"{dir_indent}{dir_name}/")
        for file_name in files:
            print(f"{file_indent}{color}{file_name}{RESET}")

inspect_directory("a_user_chosen_directory")
a_user_chosen_directory/
  hald_v6/
    downloads/
      Entity_Info.json
      Relation_Info.json
    results/
      kg.graphml
      kg.sqlite
Interpret them
The workflow definition at the beginning states that the knowledge graph of the HALD project should be converted to a GraphML file. This requires downloading the original files from the project's web repository and then converting them step by step into the desired output format.
Running the workflow auto-generates a directory structure in the user-defined working directory a_user_chosen_directory. First there is a subdirectory for the project (HALD) in its chosen version ("latest" = version 6 at the time of writing) named hald_v6, so that no collisions between projects or versions can happen. Each such directory has two further subdirectories to separate fetched from generated files:

- The downloads directory contains all files fetched from the project's web repository in unmodified form. In this case these are two JSON files shown in red. The number and types of files vary between projects because there is no widely accepted standard for how to encode a knowledge graph.
- The results directory contains all files derived from the raw downloads. In this case these are two files shown in green, although only one output was specified in the workflow at the beginning:
  - kg.sqlite is a file-based SQLite database, which serves as the intermediate format used as a common basis for all conversions and analyses supported by this package. For this reason, it has to be generated before any other outputs can be produced.
  - kg.graphml is the HALD knowledge graph in the desired output format, GraphML.
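Both result files can be opened with standard tools. As a sketch using only the Python standard library (the table layout inside kg.sqlite is an internal detail of the package, so this only enumerates whatever tables are present rather than assuming specific names):

```python
import sqlite3
import xml.etree.ElementTree as ET

def list_sqlite_tables(db_path):
    # Enumerate the tables in a SQLite file via the sqlite_master catalog.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

def count_graphml_elements(graphml_path):
    # GraphML is XML with a default namespace; nodes and edges appear as
    # <node> and <edge> elements below the <graph> element.
    ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
    root = ET.parse(graphml_path).getroot()
    return (len(root.findall(".//g:node", ns)),
            len(root.findall(".//g:edge", ns)))

# Example usage after a successful run (paths from this quickstart):
# print(list_sqlite_tables("a_user_chosen_directory/hald_v6/results/kg.sqlite"))
# print(count_graphml_elements("a_user_chosen_directory/hald_v6/results/kg.graphml"))
```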
It is possible to define a larger workflow that includes multiple projects, versions and output formats. Internally, the Python package luigi is used to build a dependency graph containing all tasks that need to run in order to produce the desired output files. The local inputs and outputs of each task along the way are well defined, so the scheduler can automatically run tasks as early as possible and often in parallel. For example, all downloads are independent, so they do not need to wait for each other, but some downstream conversions require multiple input files, so they have to wait for a specific subset of downloads or other conversions to finish. The overall process can be tracked through messages that are written whenever a task starts or finishes. If everything worked, the run function returns True. If some part failed, e.g. due to a broken web connection, the other parts are still finished as far as possible, but False is returned to make clear that something is missing. The workflow can then be restarted and will not begin again from zero, but will only run tasks that have not yet produced their local outputs. Some work may still be lost anyway: when a conversion is interrupted in the middle, its progress is usually gone, but downloads will attempt to resume from partial files.
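The resume behavior rests on a simple completeness check: a task counts as done if and only if its output file already exists on disk. The following is a minimal sketch of that pattern, a deliberate simplification and not the package's actual implementation (the function name run_if_needed is hypothetical):

```python
import os

def run_if_needed(name, output_path, action):
    # Luigi-style completeness check: if the output file already exists,
    # the task is considered done and is skipped on a restarted run.
    if os.path.exists(output_path):
        print(f"Skipped {name} (output already present)")
        return False
    action(output_path)
    print(f"Finished {name}")
    return True

# Hypothetical usage: the second call is a no-op because the output
# produced by the first call is still on disk.
# run_if_needed("demo", "out.txt", lambda p: open(p, "w").write("data"))
# run_if_needed("demo", "out.txt", lambda p: open(p, "w").write("data"))
```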