An overview of knowledge graphs and the link prediction task, based on my thesis “Structural Alignment in Link Prediction”.
Introduction
This article is intended as a layperson’s introduction to Structural Alignment, the core contribution of my PhD thesis “Structural Alignment in Link Prediction”.
This is the first in the series, and will provide a quick summary of all the background knowledge that you need to understand the rest of the series. Particularly, I’ll be discussing knowledge graphs (a type of data storage system) and link prediction task (the task of predicting new data based on a knowledge graph). If you already have expertise in these areas, feel free to skip this article.
Finally, if you want a more technical explanation, you can read the thesis in English here, and its extended summary in Irish here.
The key take-aways of this article are:
- Knowledge Graphs (KGs) are databases that store information in as a collection of simple sentences called “triples”, such as (Aragorn, Ally-Of, Frodo) or (Frodo, Enemy-Of, Sauron).
- Link Prediction (LP) is the task of predicting new triples based on those that are already present in a knowledge graph. For example, the triple (Aragorn, Enemy-Of- Sauron) could be predicted based on a knowledge graph containing the triples (Aragorn, Ally-Of, Frodo) and (Frodo, Enemy-Of, Sauron).
- Link prediction is very relevant in many domains — including helping to find new treatments for diseases.
Knowledge Graphs
A knowledge graph models data as simple sentences. The easiest way to see this is in the table below, where every row is a single sentence in the graph. The two nouns in the sentence (called the subject and object) are called “nodes”. The so-called “predicate” describes how they relate. Since every one of these sentences has 3 parts (subject, predicate, object), they are called “triples”.
In practice, it tends to be easy to visualize knowledge graphs as “graphs” containing nodes (circles, representing concepts) and edges (links, connecting concepts to each other). For example, the above knowledge graph could be visualized as seen below:
This view makes the overall structure of the graph very easy to see. For example, the node “Gondor” is readily seen as the most well-connected node, and (in general) nodes at the edges of the graph have fewer links connecting them.
In many practical applications, knowledge graphs have a few other bells and whistles. That said, the core of knowledge graphs — and all that you need to understand Structural Alignment — is what is described above.
Link Prediction
Link prediction is the task of predicting new data that should be added to a knowledge graph. Link prediction can occur in one of two ways:
- Subject Prediction. The user asks what nodes N should be chosen such that the triple (N, predicate, object) is true.
- Object Prediction. The user asks what nodes N should be chosen such that the triple (subject, predicate, N) is true.
In the above graphs, a user might ask “Who are Gandalf’s Allies?”. This question is represented as the query (Gandalf, Ally-Of, ?). A good link prediction system would respond not only with Frodo (which is observed directly in the knowledge graph) but also with Aragorn (which is true, but not observed directly in the graph).
As such, link prediction can be seen as a sort of augmented search. Instead of just telling you what you already know (what is in the knowledge graph), it can make inferences about things that are not known, but that are likely to be true.
Applications of KGs and LP
While knowledge graphs are used in many domains, they are particularly common in biology. There are hundreds of biological knowledge graphs that exist — some (but not all!) of them are shown in the image below from the Linked Open Data Cloud.
Link prediction is also quite useful for biology — after all, lots of biological data can be naturally expressed as knowledge graphs. Perhaps the simplest example is drug-disease interactions, in which:
- Nodes represent various drugs or diseases
- Edges describe what effect, if any, drugs are known to have on various diseases.
In many cases, one drug can be used to treat multiple diseases — although this is not always known in advance. Recently, there has been growing interest in trying to use link prediction to predict such cases. After all, this question really boils down to single link prediction query: (Drug, treats, ?).
In fact, this is more or less the exact research direction that some pharmaceutical companies (for example, here) have explored. And this, at the end of the day, is (one reason) why we care about link prediction.
What about Structural Alignment?
Structural Alignment is a framework for 1) understanding, and 2) performing link prediction. It draws directly on patterns of graph structure to help predict new links — leading to improvements in performance over conventional techniques.
Before we get to the details of Structural Alignment, however, we first must discuss graph structure. This can be found in the next article in this series.