kgw
Welcome! This is the documentation for kgw, an open-source Python 3 package for downloading, converting and analyzing knowledge graphs.
What is this project about?
The name “kgw” is an acronym for “knowledge graph workflows”. This phrase consists of two components with the following meanings:
A knowledge graph, abbreviated as KG below, is information organized according to a graph data model.
A workflow is an organized and repeatable pattern of activity.
In essence, this project allows users to define and run reproducible workflows that retrieve particular knowledge graphs from the web, convert them into chosen target formats, and analyze their contents.
Which goals are pursued by this project?
There are three central motivations that drive this project:
Ease access to high-quality knowledge graphs that are generated by academic projects but shared in various formats on different data repositories in a non-standardized way.
Provide comparable insights into the contents of these knowledge graphs, such as their schemata and statistical properties.
Support the ongoing development of a novel AI/ML framework named OpenCog Hyperon, in particular with respect to scalable representation and querying of knowledge in it.
These objectives are pursued by the following means:
The package makes it easy to define and run ETL workflows:
Extract a KG from a data repository.
A user defines a KG, the desired version of it, and a local working directory. The package downloads all relevant files from the data repository of the project into a uniquely named subdirectory.
Transform the KG to a unified intermediate format.
Different projects represent their KGs in different ways. To unify them into a single intermediate representation, the property graph model was chosen and encoded in the form of a file-based SQLite database. This makes it possible to capture almost any knowledge graph encountered in practice without any loss or modification of information.
Load the KG into various target formats.
A user defines into which target formats a KG shall be converted. Currently supported are CSV, JSONL, SQL, GraphML and several MeTTa representations. All target formats are derived from a single source format, which is the SQLite file that uses the same relational database schema for every KG. This implies that exactly one conversion method has to be implemented per target format. It is therefore easy to add additional methods and formats with minimal programming and testing effort.
The package supports some ways to analyze the content of the knowledge graphs:
Calculate summary statistics.
A JSON file captures quantitative aspects such as node, edge and type counts. This enables comparisons between the numbers reported in the project publications and the actual element counts in the raw data, as well as comparisons between KGs to see differences in coverage and scale.
Detect and visualize the schema of a KG.
A standalone HTML file contains an interactive graph visualization of all node types in the KG as well as the edge types that connect them. Hovering over nodes and edges provides additional information such as the number of instances of a particular node or edge type. This is also indicated visually by different node and edge sizes, providing a graphical overview of which kinds of nodes and edges make up the majority of the information stored in a KG. Examples of such schemata can be found in the last section of a repository about biomedical knowledge graphs, which served as preparation for this package.
The package brings a diversity of knowledge graphs to the OpenCog Hyperon framework:
Knowledge graphs of different shapes and sizes, built on various ontologies, and coming with all kinds of properties, are flexibly ported to OpenCog Hyperon’s language MeTTa. Conceptually, knowledge graphs can be represented in a large number of ways in MeTTa, with no clearly preferable approach yet. The package therefore supports multiple MeTTa representations and can easily be extended to support further ones. This is intended to enable experimentation on how to represent and query KGs efficiently at scale, which can also be used for performance benchmarking of the different MeTTa interpreters that are under active construction at the time of creating this package and writing its documentation.
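The transform, load and statistics steps described above can be illustrated with a minimal sketch that uses only Python's standard library. It is not the code or the database schema of kgw: the table layout (a nodes table and an edges table with JSON-encoded properties), the function names and the toy records are all assumptions made for illustration. The point it shows is why a single relational schema helps: each exporter and the statistics function only need to know that one schema, regardless of which KG was loaded.

```python
import csv
import io
import json
import sqlite3

# Hypothetical schema, NOT the one used by kgw: a property graph stored as
# two tables, with arbitrary properties kept as JSON text.
SCHEMA = """
CREATE TABLE nodes (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    properties TEXT NOT NULL
);
CREATE TABLE edges (
    source_id TEXT NOT NULL REFERENCES nodes(id),
    target_id TEXT NOT NULL REFERENCES nodes(id),
    type TEXT NOT NULL,
    properties TEXT NOT NULL
);
"""


def build_example_db():
    """Create an in-memory SQLite database filled with toy records."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    nodes = [
        ("drug:A", "Drug", {"name": "Drug A"}),
        ("gene:B", "Gene", {"name": "Gene B"}),
        ("disease:C", "Disease", {"name": "Disease C"}),
    ]
    edges = [
        ("drug:A", "gene:B", "targets", {}),
        ("drug:A", "disease:C", "treats", {"note": "toy record"}),
    ]
    con.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(i, t, json.dumps(p)) for i, t, p in nodes],
    )
    con.executemany(
        "INSERT INTO edges VALUES (?, ?, ?, ?)",
        [(s, t, ty, json.dumps(p)) for s, t, ty, p in edges],
    )
    con.commit()
    return con


def nodes_to_jsonl(con):
    """Export all nodes as JSON Lines text, one JSON object per line."""
    rows = con.execute("SELECT id, type, properties FROM nodes ORDER BY id")
    return "\n".join(
        json.dumps({"id": i, "type": t, "properties": json.loads(p)})
        for i, t, p in rows
    )


def edges_to_csv(con):
    """Export all edges as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source_id", "target_id", "type", "properties"])
    writer.writerows(
        con.execute(
            "SELECT source_id, target_id, type, properties FROM edges "
            "ORDER BY source_id, target_id"
        )
    )
    return buf.getvalue()


def summary_statistics(con):
    """Count nodes and edges, in total and per type."""
    return {
        "num_nodes": con.execute("SELECT COUNT(*) FROM nodes").fetchone()[0],
        "num_edges": con.execute("SELECT COUNT(*) FROM edges").fetchone()[0],
        "node_types": dict(
            con.execute("SELECT type, COUNT(*) FROM nodes GROUP BY type")
        ),
        "edge_types": dict(
            con.execute("SELECT type, COUNT(*) FROM edges GROUP BY type")
        ),
    }


if __name__ == "__main__":
    con = build_example_db()
    print(nodes_to_jsonl(con))
    print(edges_to_csv(con))
    print(json.dumps(summary_statistics(con), indent=2))
```

In this sketch, adding a further target format would mean writing one more function that reads from the same two tables, which mirrors the argument made above for deriving all targets from a single source format.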
Why is it relevant?
Since there are no widely accepted standards for how to encode knowledge graphs, academic projects provide their results in various formats and share them via different data repositories on the web. This situation makes it hard for an interested audience to use and compare them. It can be largely resolved, however, by providing automatic download methods, a definition and implementation of a unified KG representation, and a set of conversion methods into widely used target formats. The hope is to thereby lower some barriers to science, and perhaps catalyze analyses and experiments that otherwise would not take place. A simple example is the interactive exploration of the neighborhood of a node of interest, such as a certain disease and all the genes and drugs it is connected to.
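The neighborhood example mentioned above could look roughly like the following sketch. It assumes a hypothetical two-table SQLite layout (nodes and edges) and toy records; neither the schema nor the function name neighborhood is taken from kgw.

```python
import sqlite3


def build_toy_graph():
    """Create a tiny in-memory graph (hypothetical schema and toy records)."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT NOT NULL);
        CREATE TABLE edges (
            source_id TEXT NOT NULL,
            target_id TEXT NOT NULL,
            type TEXT NOT NULL
        );
        """
    )
    con.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        [
            ("disease:C", "Disease"),
            ("gene:B", "Gene"),
            ("drug:A", "Drug"),
            ("drug:D", "Drug"),
        ],
    )
    con.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [
            ("gene:B", "disease:C", "associated_with"),
            ("drug:A", "disease:C", "treats"),
            ("drug:D", "gene:B", "targets"),  # does not touch disease:C
        ],
    )
    con.commit()
    return con


def neighborhood(con, node_id):
    """Return sorted (edge_type, direction, neighbor_id, neighbor_type) tuples
    for every edge that starts or ends at the given node."""
    outgoing = con.execute(
        "SELECT e.type, 'out', n.id, n.type "
        "FROM edges AS e JOIN nodes AS n ON n.id = e.target_id "
        "WHERE e.source_id = ?",
        (node_id,),
    ).fetchall()
    incoming = con.execute(
        "SELECT e.type, 'in', n.id, n.type "
        "FROM edges AS e JOIN nodes AS n ON n.id = e.source_id "
        "WHERE e.target_id = ?",
        (node_id,),
    ).fetchall()
    return sorted(outgoing + incoming)


if __name__ == "__main__":
    con = build_toy_graph()
    for row in neighborhood(con, "disease:C"):
        print(row)
```

Because every KG would share the same layout under this assumption, such a query could be reused unchanged across KGs from different projects.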
How can you get started?
If you want to use this package, the following steps should help you to get results in a short time:
The Installation Guide describes how you can install the package and its optional dependencies.
The Examples provide a minimal quickstart guide that can be run immediately after the installation to check whether it worked, and a more detailed guide that gives a thorough introduction to most features.
The API Documentation contains a comprehensive description of all user-facing functionality available in this package, so you can see the full range of available options.
Where can everything be found?
The Package References page contains a collection of links to all parts of this project, including the source code, distributed code and documentation.
Who is involved in this project?
The design, implementation and documentation of this package were done by Robert Haas.
Preliminary work was published in the repository awesome-biomedical-knowledge-graphs. It contains notebooks with exploratory data analysis of the KG projects that culminated in the code of this package. Beyond that, it also provides a broad literature survey and a curated selection of biomedical knowledge graphs (BMKGs), which gives a more comprehensive account of available KGs in the field of biomedicine and can serve as a basis for incorporating further projects into this package.
Financial support to realize this project was granted by the Deep Funding initiative of SingularityNET for a proposal titled Bringing Network Pharmacology to OpenCog Hyperon. Due to the interest in and focus on pharmacological use cases, such as drug repurposing or side-effect prediction, the currently covered KGs all contain node types that represent drugs, their targets and affected diseases, although to quite different extents and at different levels of differentiation. This provides a close look into how different academic groups approach the modeling challenges and how these choices affect the complexity of querying relevant information and deriving new insights. A hope for this work is that its insights are informative for new projects that model biomedical data in their own ways to power specific use cases. One such project with close relations to SingularityNET is Rejuve, which continuously integrates up-to-date biological data from various sources into its own BMKG for applications in longevity research.