Turning a genome annotation into an RDF knowledge graph.

Introduction

In this article, I will describe the steps taken to generate an RDF (Resource Description Framework) data structure starting from a gff3-formatted genome annotation file. The annotation file in question is the new reference annotation for Pseudomonas fluorescens strain SBW25.

Required packages

I will make use of the following python packages:

  • gffutils to read the gff3 file into a SQLite database.
  • rdflib to construct the RDF graph.
  • requests to fetch data (e.g. ontology files).
  • tqdm to display a progress bar during the conversion loop.

All packages can be installed via conda from the conda-forge channel or from PyPI.

import gffutils
from tqdm import tqdm
from rdflib import URIRef, Literal, Namespace, Graph, BNode
from rdflib.namespace import CSVW, DC, DCAT, DCTERMS, DOAP, FOAF, ODRL2, ORG, OWL, \
                           PROF, PROV, RDF, RDFS, SDO, SH, SKOS, SOSA, SSN, TIME, \
                           VOID, XMLNS, XSD

import requests
import re

Load the gff3 file

To load the gff3 file for the first time, I run

db = gffutils.create_db("MPBAS00001.gff3", "MPBAS00001.gff3.db")

This creates the SQLite database file MPBAS00001.gff3.db and returns a handle to it. Subsequently, I can load the database directly from the database file.

db = gffutils.FeatureDB("MPBAS00001.gff3.db")

Each genome feature can now be accessed via its feature ID, e.g.

feat = db['CDS:PFLU_0409-0']

would load the CDS in the locus tagged as “PFLU_0409”.

gffutils feature properties.

From the feature object, I can query all information given in the gff3 file:

feat.featuretype
'CDS'
feat.start
452623
feat.end
453588
feat.strand
'+'
feat.id
'CDS:PFLU_0409-0'
# The chromosome on which this feature is located.
feat.chrom
'MPBAS00001'
# CDS coding frame.
feat.frame
'0'
feat.source
'ena'

Further annotations that would be found in the 9th column of the gff3 file are accessible via the attributes member of the feat object:

print(feat.attributes)
Dbxref: ['EMBL:AM181176', 'InterPro:IPR007445', 'InterPro:IPR007446', 'InterPro:IPR014717', 'KEGG:pfs:PFLU_0409', 'OrthoDB:1803493at2', 'Pfam:PF04350', 'Pfam:PF04351', 'RefSeq:WP_012721815.1', 'STRING:216595.PFLU_0409', 'eggNOG:COG3167', 'eggNOG:COG3168', 'Pubmed:19432983']
ID: ['CDS:PFLU_0409-0']
Ontology_term: ['GO:0043107', 'GO:0043683']
Parent: ['transcript:PFLU_0409-0']
codon_start: ['1']
confidence_level: ['3']
frame: ['0']
gene: ['PFLU_0409']
inference: ['Predicted']
locus_tag: ['PFLU_0409']
note: ['N-terminus similar to Pseudomonas syringae PilO andC-terminus similar to Pseudomonas syringae PilP']
product: ['Putative fimbriae biogenesis-related fusion protein']
protein_id: ['CAY46686.1']
seqid: ['MPBAS00001']
similarity: ['fasta; with=UniProt:Q52542; Pseudomonas syringae.; pilP; Pilus expression protein.; length=175; id 41.781%; ungapped id 43.571%; E()=6.9e-14; 146 aa overlap; query 181-321; subject 31-175', 'fasta; with=UniProt:Q87V11; Pseudomonas syringae (pv. tomato).; pilO; Type IV pilus biogenesis protein PilO.; length=207; id 36.735%; ungapped id 37.696%; E()=1.6e-17; 196 aa overlap; query 3-193; subject 10-205']
uniprot_annotation_score: ['1 out of 5']
uniprot_review_status: ['unreviewed']

feat.attributes returns a dictionary-like object; each value is a list (iterable).
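
For example, to walk every attribute key together with each of its values (the same pattern is used in the conversion function below):

# Print every attribute key together with each of its values.
for key, vals in feat.attributes.items():
    for val in vals:
        print(key, val)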

For further information about working with gffutils data structures and functions, refer to the reference manual.

Fixing issues with the gff3 file.

The gff3 file had a number of issues, e.g. some Dbxref entries were not formatted correctly. The next three cells clean up these issues. Note that each alteration is committed back to the database file by first deleting all features that had to be fixed and then updating the database with the altered features.

# Fix Pubmed dbxrefs: split entries with multiple ";"-separated IDs.
updates = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get("Dbxref", [])

    indices_to_remove = []
    for i, dbxref in enumerate(dbxrefs):
        if dbxref.startswith("Pubmed") and ";" in dbxref:
            indices_to_remove.append(i)
            pubmed_ids = dbxref.split(":")[1].replace(" ", "").split(";")
            pubmed_ids = [pmid for pmid in pubmed_ids if pmid != '']

            for pmid in pubmed_ids:
                dbxrefs.append("Pubmed:{}".format(pmid))

    if len(indices_to_remove) > 0:
        # Pop from the back so earlier indices stay valid.
        for i in sorted(indices_to_remove, reverse=True):
            dbxrefs.pop(i)

        updates.append(feat)

db = db.delete(updates)
db = db.update(updates)

# Fix dbxrefs: split entries with ";"-separated IDs for the remaining keys.
updates = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get("Dbxref", [])

    indices_to_remove = []
    for i, dbxref in enumerate(dbxrefs):
        if ";" in dbxref:
            indices_to_remove.append(i)
            splt = dbxref.split(":")
            dbkey = splt[0]
            ids = ":".join(splt[1:]).replace(" ", "").split(";")
            ids = [idx for idx in ids if idx != '']

            for idx in ids:
                dbxrefs.append("{}:{}".format(dbkey, idx))

    if len(indices_to_remove) > 0:
        # Pop from the back so earlier indices stay valid.
        for i in sorted(indices_to_remove, reverse=True):
            dbxrefs.pop(i)

        updates.append(feat)

db = db.delete(updates)
db = db.update(updates)

# Fix Ontology_terms: split GO/EC entries with ";"-separated IDs.
updates = []
for feat in db.all_features():
    ontos = feat.attributes.get("Ontology_term", [])

    indices_to_remove = []
    for i, onto in enumerate(ontos):
        if ";" in onto:
            indices_to_remove.append(i)
            splt = onto.split(":")
            dbkey = splt[0]
            ids = ":".join(splt[1:]).replace(" ", "").split(";")
            ids = [idx for idx in ids if idx != '']

            for idx in ids:
                ontos.append("{}:{}".format(dbkey, idx))

    if len(indices_to_remove) > 0:
        # Pop from the back so earlier indices stay valid.
        for i in sorted(indices_to_remove, reverse=True):
            ontos.pop(i)

        updates.append(feat)

db = db.delete(updates)
db = db.update(updates)
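
The three cells share the same split-and-replace logic, so they could be factored into a single helper. A minimal sketch (the function name split_multivalued is mine, not part of the original post or any library):

def split_multivalued(attr_key, key_filter=lambda v: True):
    """Split ";"-separated values of the given attribute across all features."""
    updates = []
    for feat in db.all_features():
        vals = feat.attributes.get(attr_key, [])
        to_remove = []
        for i, val in enumerate(vals):
            if key_filter(val) and ";" in val:
                to_remove.append(i)
                dbkey, _, rest = val.partition(":")
                for idx in [v for v in rest.replace(" ", "").split(";") if v]:
                    vals.append("{}:{}".format(dbkey, idx))
        if to_remove:
            # Pop from the back so earlier indices stay valid.
            for i in sorted(to_remove, reverse=True):
                vals.pop(i)
            updates.append(feat)
    return updates

# Example: the Pubmed fix from above becomes
updates = split_multivalued("Dbxref", key_filter=lambda v: v.startswith("Pubmed"))
db = db.delete(updates)
db = db.update(updates)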

Dbxref keys

In the RDF graph, I want to express database cross-references as RDF objects through their Uniform Resource Identifier (URI). First, I need to find all dbxref keys (the part before the “:” in a dbxref entry) in my gff3 file.

# Get all dbxref keys
dbxref_keys = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get('Dbxref', [])
    for dbxref in dbxrefs:
        key = dbxref.split(':')[0]
        if key not in dbxref_keys:
            dbxref_keys.append(key)
dbxref_keys
['EMBL',
 'GeneID',
 'HAMAP',
 'InterPro',
 'KEGG',
 'OrthoDB',
 'PANTHER',
 'PRIDE',
 'PRINTS',
 'Pfam',
 'RefSeq',
 'STRING',
 'SUPFAM',
 'TIGRFAMs',
 'eggNOG',
 'Pubmed',
 'UniPathway',
 'PSEUDO',
 'PseudoCAP',
 'RFAM',
 'EC']

Then, I define the generic part of each database URI as an rdflib.Namespace. Not every key is mapped; cross-references with unmapped keys are skipped during the conversion below.

dbxref_prefixes = {
    'GeneID': Namespace("https://www.ncbi.nlm.nih.gov/gene/"),
    'InterPro': Namespace("http://purl.uniprot.org/interpro/"),
    'KEGG': Namespace("http://purl.uniprot.org/kegg/"),
    'OrthoDB': Namespace("http://purl.orthodb.org/odbgroup/"),
    'PANTHER': Namespace("http://purl.uniprot.org/panther/"),
    'PRIDE': Namespace("http://purl.uniprot.org/pride/"),
    'PRINTS': Namespace("http://purl.uniprot.org/prints/"),
    'Pfam': Namespace("http://purl.uniprot.org/pfam/"),
    'RefSeq': Namespace("http://purl.uniprot.org/refseq/"),
    'STRING': Namespace("http://purl.uniprot.org/string/"),
    'SUPFAM': Namespace("http://purl.uniprot.org/supfam/"),
    'TIGRFAMs': Namespace("http://purl.uniprot.org/tigrfams/"),
    'eggNOG': Namespace("http://purl.uniprot.org/eggnog/"),
    'Pubmed': Namespace("https://pubmed.ncbi.nlm.nih.gov/"),
    'PseudoCAP': Namespace("https://www.pseudomonas.com/feature/show?id="),
}
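
As a quick check, calling term on a namespace concatenates the base URI with the given identifier:

dbxref_prefixes['Pfam'].term('PF04350')
# rdflib.term.URIRef('http://purl.uniprot.org/pfam/PF04350')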

Load the sequence ontology

Featuretypes are to be expressed via their Sequence Ontology (SO) identifier. Here, I load the Sequence Ontology file as a json dictionary by issuing a GET request.

# Load SO
so_json = requests.get('https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json').json()

The loaded json structure exposes its data under the ‘graphs’ item. The ‘graphs’ item is a list of graphs; here it has only one element. From this graph element, I need the ‘nodes’ item.

so_nodes = so_json['graphs'][0]['nodes']

Now I define a function that returns an iterator over the SO nodes matching a given feature type; the first hit’s ‘id’ entry holds the SO term URI.

def get_so_id(featuretype):
    """Return an iterator over the SO nodes whose label matches the feature type."""
    hits = filter(lambda x: x.get('lbl', "") == featuretype, so_nodes)

    return hits
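
For example, the first hit for featuretype 'CDS' should carry the purl of SO:0000316 (the SO accession for CDS) under its ‘id’ key:

next(get_so_id('CDS'))['id']
# 'http://purl.obolibrary.org/obo/SO_0000316'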

Ontologies and namespaces

I will build the RDF graph by utilizing established domain-specific ontologies. Here, I define the ontology namespaces. The pflu namespace is defined such that pflu:<ID> with a valid feature ID will resolve to the corresponding feature website at the Pseudomonas fluorescens SBW25 genome database. The gffo ontology was created to provide a unique namespace for all feature properties. Each property has an rdfs:isDefinedBy attribute that links it to one or more established ontology elements.

pflu = Namespace(r'http://pflu.evolbio.mpg.de/bio_data/')
so = Namespace(r'http://purl.obolibrary.org/obo/so/')
up = Namespace(r'http://purl.uniprot.org/core/')
gffo = Namespace(r'https://raw.githubusercontent.com/mpievolbio-scicomp/GenomeFeatureFormatOntology/main/gffo#')
obo = Namespace(r'http://purl.obolibrary.org/obo/')
go = Namespace(r'http://amigo.geneontology.org/amigo/term/')
ec = Namespace(r'https://www.brenda-enzymes.org/enzyme.php?ecno=')
faldo = Namespace(r'http://biohackathon.org/resource/faldo#')
gfvo = Namespace(r'http://github.com/BioInterchange/Ontologies/gfvo#')
ogo = Namespace(r'http://protozoadb.biowebdb.org/22/ogroup#')
edam = Namespace('http://edamontology.org/')
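
The conversion loop at the end of this post adds all nodes to a previously created empty rdflib.Graph, presumably set up around this point. A minimal sketch, including prefix bindings for a readable turtle serialization:

graph = Graph()
# Bind prefixes so that the serialized turtle uses short names.
graph.bind('pflu', pflu)
graph.bind('gffo', gffo)
graph.bind('up', up)
graph.bind('faldo', faldo)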

Connecting to the Pseudomonas fluorescens SBW25 genome database

Each feature is to be defined as an RDF subject through a URI. This URI should resolve to a valid web resource. I will define a feature’s subject URI through the previously defined pflu namespace and the feature’s database ID in the Pflu SBW25 genome database. To retrieve this ID, I access the genome database’s public API at https://pflu.evolbio.mpg.de/web-services. As an example, I fetch all features of type biological_region as a json object:

biological_region_json = requests.get('https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region').json()

The returned json object is a deeply nested dictionary. All features are referenced with their label in the member item. E.g.

biological_region_json['member'][0]
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
 '@type': 'biological_region',
 'label': 'atcgggggcaagccccctcccaca',
 'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}

The Pflu SBW25 database ID is the last path element of the ItemPage item (29901 in the above example). To access a given feature’s DB ID, I filter the list of features (member) in the json object for items whose ‘label’ matches the gff3 feature’s ID. E.g., to find the ItemPage for the biological_region feature with ID ‘atcgggggcaagccccctcccaca’ in the gff3 file, I filter

next(filter(lambda x:x.get('label', "")=='atcgggggcaagccccctcccaca', biological_region_json['member']))
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
 '@type': 'biological_region',
 'label': 'atcgggggcaagccccctcccaca',
 'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}

This filtering is now wrapped in a function that takes a gff3 feature object or an ID string as argument and returns the corresponding json blob. Except for featuretype biological_region, I can query all features through the featuretype’s API endpoint with a query for the gff3 feature’s ID. For biological_region features, I use the prefetched json blob from above.

def get_pflu_json(feat=None, feat_id=None):
    if feat is None:
        feat = db[feat_id]

    feat_type = feat.featuretype

    # These featuretypes have no queryable endpoint.
    if feat_type in ['ncRNA_gene', 'chromosome', 'pseudogenic_CDS']:
        return {}

    # biological_region features are looked up in the prefetched json blob.
    elif re.match('biological_region', feat_type):
        return next(filter(lambda x: x.get('label', "") == feat.id, biological_region_json['member']))

    # Query the featuretype's endpoint with the feature's ID.
    pflu_search = 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/{}?identifier={}'.format(feat_type, feat.id)
    response = requests.get(pflu_search).json()

    return response

As in the above example, I can now fetch the json blob through a single function call:

get_pflu_json(feat=db['atcgggggcaagccccctcccaca'])
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
 '@type': 'biological_region',
 'label': 'atcgggggcaagccccctcccaca',
 'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}

Convert a gff3 feature into a list of nodes.

Now I have all utilities and preparations in place to finally code up a function that returns, for a given gff3 feature, its representation as a list of rdf nodes. This function works as follows:

  • Retrieve the feature’s json object from the Pflu SBW25 database.
  • Extract the Pflu SBW25 DB ID from the json object.
  • Define the subject as the URI pflu:ID, which resolves to the Pflu SBW25 DB site for the given feature.
  • Define the subject’s type through the SO term of the feature’s type.
  • Add the Pflu SBW25 DB URI as an rdfs:seeAlso property value.
  • Iterate over top-level feature attributes (source, seqid, start, end, strand, frame, score) and define a node for each one. Each property is anchored in the newly defined gffo ontology, with the appropriate namespace defined above.
  • Add GO and EC annotations as nodes with property values anchored in the respective ontologies’ namespaces.
  • Add database cross-references (Dbxref) as rdfs:seeAlso property values.
  • Add remaining attributes from column 9 of the gff3 file via the properties of the gffo namespace.
  • The Uniprot review status is given as an up:reviewed property.

def feature_to_rdf(feat=None):
    """ Returns a list of rdf nodes for the given feature. """
    nodes = []
    # Get the feature's id in the Pflu SBW25 database.
    response = get_pflu_json(feat)
    # Most featuretypes return a json object with a 'member' list;
    # biological_region features return the member blob directly.
    if 'member' in response:
        response = response['member'][-1]
    pflu_db_id = response['ItemPage'].split("/")[-1]
    so_id = next(get_so_id(feat.featuretype))['id']
    
    # pflu db id
    subject = pflu.term(pflu_db_id)
    nodes.append((subject, RDF.type, gffo.Feature))
    nodes.append((subject, RDF.type, URIRef(so_id)) )
    nodes.append((subject, RDFS.seeAlso, URIRef(response['ItemPage'])))
    
    # source
    nodes.append((subject, gffo.source, Literal(feat.source)))
    
    # seqid
    nodes.append((subject, gffo.seqid, Literal(feat.seqid)))
    
    # position
    nodes.append((subject, gffo.start, Literal(feat.start)))
    nodes.append((subject, gffo.end, Literal(feat.end)))
    nodes.append((subject, gffo.strand, Literal(feat.strand)))
    
#     position = BNode()
#     nodes.append((position, RDF.type, {'+': faldo.ForwardStrandPosition,
#                                        '-': faldo.ReverseStrandPosition}[feat.strand]))
#     nodes.append((position, faldo.begin, Literal(feat.start)))
#     nodes.append((position, faldo.end, Literal(feat.end)))
#     nodes.append((subject, faldo.position, position))
    
    # score
    nodes.append((subject, gffo.score, Literal(feat.score)))
    
    # phase/frame
    nodes.append((subject, gffo.frame, Literal(feat.frame)))
   
    for key, vals in feat.attributes.items():
        if re.match('Ontology_term', key) is not None:
            for ot in vals:
                prefix = ot.split(":")[0]
                if re.match("EC", prefix) is not None:
                    obj = ec.term(ot.split(":")[1])
                elif re.match("GO", prefix) is not None:
                    obj = go.term(ot)
                else:
                    # Skip terms from ontologies without a namespace mapping.
                    continue

                nodes.append((subject, gffo.Ontology_term, obj))
                
        elif re.match('Dbxref', key) is not None:
            for val in vals:
                splt = val.split(":")
                dbkey = splt[0]
                dbval = ":".join(splt[1:])
                if dbkey not in dbxref_prefixes.keys():
                    continue
                    
                target = URIRef(dbxref_prefixes[dbkey].term(dbval))
                node = (
                                subject,
                                RDFS.seeAlso,
                                target
                )
                nodes.append(node)
        # Add protein id as cross ref to embl-cds
        elif re.match('protein_id', key) is not None:
            for val in vals:
                nodes.append((
                                subject,
                                gffo.term(key),
                                Literal(val)
                ))
                nodes.append((subject,
                              RDFS.seeAlso,
                              URIRef(val, base="http://purl.uniprot.org/embl-cds/")
                             ))
            
                
        elif re.match('uniprot_review_status', key) is not None:
            nodes.append ((
                    subject,
                    up.reviewed,
                    {"reviewed": Literal(True), "unreviewed": Literal(False)}[vals[0]]
            ))
        else:
            for val in vals:
                nodes.append((
                                subject,
                                gffo.term(key),
                                Literal(val)
                ))
            
    return nodes
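
As a smoke test, the function can be applied to the CDS feature from the beginning of this post; the result is a plain list of (subject, predicate, object) tuples that can be inspected before adding them to the graph:

# Convert a single feature and look at the first few triples.
nodes = feature_to_rdf(db['CDS:PFLU_0409-0'])
nodes[:3]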

Define rdf nodes for each feature.

Finally, I iterate over all features in the gff3 file and convert them to rdf nodes. Some featuretypes cause problems when pulling their json representation from the Pflu SBW25 database, hence I skip those (see the setup below). All nodes are added to the previously defined empty graph (rdflib.Graph).
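
The skipped_features list is not spelled out above; a plausible definition, assuming it mirrors the featuretypes for which get_pflu_json returns an empty dict:

# Assumption: skip the featuretypes that have no queryable API endpoint.
skipped_features = ['ncRNA_gene', 'chromosome', 'pseudogenic_CDS']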

for feat in tqdm(db.all_features()):
    if feat.featuretype in skipped_features:
        continue
    try:
        nodes = feature_to_rdf(feat)
    except Exception:
        # Report the offending feature and move on.
        print(feat)
        continue
    for node in nodes:
        graph.add(node)

The initial run took approximately 14 hours on a single core. Subsequently, I serialize the graph to a turtle-formatted rdf file.

Serialize and load into triple store.

graph.serialize('MPBAS00001.ttl')
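
As a quick sanity check, the serialized file can be parsed back into a fresh graph and the number of triples inspected:

g_check = Graph()
g_check.parse('MPBAS00001.ttl', format='turtle')
print(len(g_check))  # the number of triples in the graph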

The generated rdf file can then be imported into an rdf triple store such as Apache Jena Fuseki, Virtuoso, or Corese. For its simplicity, I chose Fuseki. In a subsequent post, I will demonstrate some example queries, including federated queries against other biological rdf resources.

Alternatively, I could have submitted each node to an online triple store on the fly. In my case, this turned out to be significantly slower (by a factor of 3) than populating the graph’s nodes first and then serializing the graph (and uploading it to the store).
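
For reference, a minimal sketch of that on-the-fly variant, assuming a SPARQL 1.1 update endpoint (the URL is hypothetical) and building one INSERT DATA request per feature from the terms’ n3() serializations:

FUSEKI_UPDATE = 'http://localhost:3030/pflu/update'  # hypothetical endpoint

def insert_triples(nodes):
    # Assemble one SPARQL INSERT DATA statement from the (s, p, o) tuples.
    triples = " .\n".join("{} {} {}".format(s.n3(), p.n3(), o.n3())
                          for s, p, o in nodes)
    requests.post(FUSEKI_UPDATE, data={'update': 'INSERT DATA { %s }' % triples})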

    Last updated: 2022-03-31T12:19:11.497715+02:00
    
    Python implementation: CPython
    Python version       : 3.9.6
    IPython version      : 7.29.0
    
    Compiler    : GCC 9.3.0
    OS          : Linux
    Release     : 5.10.0-10-amd64
    Machine     : x86_64
    Processor   : 
    CPU cores   : 48
    Architecture: 64bit