Introduction
In this article, I will describe the steps taken to generate an RDF (Resource Description Framework) data structure starting from a gff3-formatted genome annotation file. The annotation file in question is the new reference annotation for Pseudomonas fluorescens strain SBW25.
Required packages
I will make use of the following python packages:
- gffutils to read the gff3 file into a SQLite database.
- rdflib to construct the rdf graph.
- requests to fetch data (e.g. ontology files).
- tqdm to display progress bars.
All packages can be installed via conda from the conda-forge channel or via pip from PyPI.
import gffutils
from tqdm import tqdm
from rdflib import URIRef, Literal, Namespace, Graph, BNode
from rdflib.namespace import CSVW, DC, DCAT, DCTERMS, DOAP, FOAF, ODRL2, ORG, OWL, \
PROF, PROV, RDF, RDFS, SDO, SH, SKOS, SOSA, SSN, TIME, \
VOID, XMLNS, XSD
import requests
import re
Load the gff3 file
To load the gff3 file for the first time, I run
db = gffutils.create_db("MPBAS00001.gff3", "MPBAS00001.gff3.db")
This creates the SQLite database file MPBAS00001.gff3.db and returns an interface to it. Subsequently, I can load the database directly from the database file.
db = gffutils.FeatureDB("MPBAS00001.gff3.db")
Each genome feature can now be accessed via its feature ID, e.g.
feat = db['CDS:PFLU_0409-0']
would load the CDS in the locus tagged as “PFLU_0409”.
gffutils feature properties.
From the feature object, I can query all information given in the gff3 file:
feat.featuretype
'CDS'
feat.start
452623
feat.end
453588
feat.strand
'+'
feat.id
'CDS:PFLU_0409-0'
# The chromosome on which this feature is located.
feat.chrom
'MPBAS00001'
# CDS coding frame.
feat.frame
'0'
feat.source
'ena'
Further annotations that would be found in the 9th column of the gff3 file are accessible via the attributes member of the feat object:
print(feat.attributes)
Dbxref: ['EMBL:AM181176', 'InterPro:IPR007445', 'InterPro:IPR007446', 'InterPro:IPR014717', 'KEGG:pfs:PFLU_0409', 'OrthoDB:1803493at2', 'Pfam:PF04350', 'Pfam:PF04351', 'RefSeq:WP_012721815.1', 'STRING:216595.PFLU_0409', 'eggNOG:COG3167', 'eggNOG:COG3168', 'Pubmed:19432983']
ID: ['CDS:PFLU_0409-0']
Ontology_term: ['GO:0043107', 'GO:0043683']
Parent: ['transcript:PFLU_0409-0']
codon_start: ['1']
confidence_level: ['3']
frame: ['0']
gene: ['PFLU_0409']
inference: ['Predicted']
locus_tag: ['PFLU_0409']
note: ['N-terminus similar to Pseudomonas syringae PilO andC-terminus similar to Pseudomonas syringae PilP']
product: ['Putative fimbriae biogenesis-related fusion protein']
protein_id: ['CAY46686.1']
seqid: ['MPBAS00001']
similarity: ['fasta; with=UniProt:Q52542; Pseudomonas syringae.; pilP; Pilus expression protein.; length=175; id 41.781%; ungapped id 43.571%; E()=6.9e-14; 146 aa overlap; query 181-321; subject 31-175', 'fasta; with=UniProt:Q87V11; Pseudomonas syringae (pv. tomato).; pilO; Type IV pilus biogenesis protein PilO.; length=207; id 36.735%; ungapped id 37.696%; E()=1.6e-17; 196 aa overlap; query 3-193; subject 10-205']
uniprot_annotation_score: ['1 out of 5']
uniprot_review_status: ['unreviewed']
feat.attributes returns a dictionary-like object; each value is a list (iterable).
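For example, single-valued attributes still need to be unpacked from their one-element list:
# Attribute values are always lists, even for single-valued attributes.
feat.attributes['locus_tag']
['PFLU_0409']
feat.attributes['locus_tag'][0]
'PFLU_0409'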
For further information about working with gffutils data structures and functions, refer to the reference manual.
Fixing issues with the gff3 file
The gff3 file had a number of issues, e.g. some Dbxref entries were not formatted correctly. The next three cells clean up these issues. Note that each alteration is committed back to the database file by first deleting all features that had to be fixed and then updating the database with the altered features.
# Fix Pubmed dbxrefs: split entries that contain multiple ";"-separated IDs.
updates = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get("Dbxref", [])
    indices_to_remove = []
    for i, dbxref in enumerate(dbxrefs):
        if dbxref.startswith("Pubmed") and ";" in dbxref:
            indices_to_remove.append(i)
            pubmed_ids = dbxref.split(":")[1].replace(" ", "").split(";")
            pubmed_ids = [pmid for pmid in pubmed_ids if pmid != '']
            for pmid in pubmed_ids:
                dbxrefs.append("Pubmed:{}".format(pmid))
    if len(indices_to_remove) > 0:
        # Pop in reverse order so that earlier indices remain valid.
        for i in sorted(indices_to_remove, reverse=True):
            dbxrefs.pop(i)
        updates.append(feat)
db = db.delete(updates)
db = db.update(updates)
# Fix remaining dbxrefs: split entries that contain multiple ";"-separated IDs.
updates = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get("Dbxref", [])
    indices_to_remove = []
    for i, dbxref in enumerate(dbxrefs):
        if ";" in dbxref:
            indices_to_remove.append(i)
            splt = dbxref.split(":")
            dbkey = splt[0]
            ids = ":".join(splt[1:]).replace(" ", "").split(";")
            ids = [idx for idx in ids if idx != '']
            for idx in ids:
                dbxrefs.append("{}:{}".format(dbkey, idx))
    if len(indices_to_remove) > 0:
        # Pop in reverse order so that earlier indices remain valid.
        for i in sorted(indices_to_remove, reverse=True):
            dbxrefs.pop(i)
        updates.append(feat)
db = db.delete(updates)
db = db.update(updates)
# Fix Ontology_terms (GO/EC): split entries that contain multiple ";"-separated IDs.
updates = []
for feat in db.all_features():
    ontos = feat.attributes.get("Ontology_term", [])
    indices_to_remove = []
    for i, term in enumerate(ontos):
        if ";" in term:
            indices_to_remove.append(i)
            dbkey = term.split(":")[0]
            ids = term.replace(" ", "").split(";")
            ids = [idx for idx in ids if idx != '']
            for idx in ids:
                # Re-attach the ontology prefix unless the chunk already carries one.
                ontos.append(idx if ":" in idx else "{}:{}".format(dbkey, idx))
    if len(indices_to_remove) > 0:
        # Pop in reverse order so that earlier indices remain valid.
        for i in sorted(indices_to_remove, reverse=True):
            ontos.pop(i)
        updates.append(feat)
db = db.delete(updates)
db = db.update(updates)
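As a quick sanity check (a minimal sketch), I can assert that no ";"-contaminated entries remain:
# Verify that all Dbxref and Ontology_term entries are now ";"-free.
for feat in db.all_features():
    for key in ('Dbxref', 'Ontology_term'):
        for entry in feat.attributes.get(key, []):
            assert ";" not in entry, (feat.id, key, entry)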
Dbxref keys
In the rdf graph, I want to express database cross-references as rdf objects through their uniform resource identifier (URI). First, I need to find all dbxref keys (the part before the “:” in a dbxref entry) in my gff3 file.
# Get all dbxref keys.
dbxref_keys = []
for feat in db.all_features():
    dbxrefs = feat.attributes.get('Dbxref', [])
    for dbxref in dbxrefs:
        key = dbxref.split(':')[0]
        if key not in dbxref_keys:
            dbxref_keys.append(key)
dbxref_keys
['EMBL',
'GeneID',
'HAMAP',
'InterPro',
'KEGG',
'OrthoDB',
'PANTHER',
'PRIDE',
'PRINTS',
'Pfam',
'RefSeq',
'STRING',
'SUPFAM',
'TIGRFAMs',
'eggNOG',
'Pubmed',
'UniPathway',
'PSEUDO',
'PseudoCAP',
'RFAM',
'EC']
Then, I define the generic part of each database URI as a rdflib.Namespace. Keys without an entry in this mapping will simply be skipped when building the graph.
dbxref_prefixes = {
    'GeneID': Namespace("https://www.ncbi.nlm.nih.gov/gene/"),
    'InterPro': Namespace("http://purl.uniprot.org/interpro/"),
    'KEGG': Namespace("http://purl.uniprot.org/kegg/"),
    'OrthoDB': Namespace("http://purl.orthodb.org/odbgroup/"),
    'PANTHER': Namespace("http://purl.uniprot.org/panther/"),
    'PRIDE': Namespace("http://purl.uniprot.org/pride/"),
    'PRINTS': Namespace("http://purl.uniprot.org/prints/"),
    'Pfam': Namespace("http://purl.uniprot.org/pfam/"),
    'RefSeq': Namespace("http://purl.uniprot.org/refseq/"),
    'STRING': Namespace("http://purl.uniprot.org/string/"),
    'SUPFAM': Namespace("http://purl.uniprot.org/supfam/"),
    'TIGRFAMs': Namespace("http://purl.uniprot.org/tigrfams/"),
    'eggNOG': Namespace("http://purl.uniprot.org/eggnog/"),
    'Pubmed': Namespace("https://pubmed.ncbi.nlm.nih.gov/"),
    'PseudoCAP': Namespace("https://www.pseudomonas.com/feature/show?id="),
}
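As a quick check, any dbxref string with a known key can now be mapped onto a resolvable URI; a sketch using the Pfam entry from the example feature above:
# Map a dbxref string onto a URI via its namespace.
dbkey, dbval = 'Pfam:PF04350'.split(':', 1)
str(dbxref_prefixes[dbkey].term(dbval))
'http://purl.uniprot.org/pfam/PF04350'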
Load the sequence ontology
Feature types are to be expressed via their Sequence Ontology (SO) identifier. Here, I load the Sequence Ontology file as a json dictionary by issuing a GET request.
# Load SO
so_json = requests.get('https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json').json()
The loaded json structure exposes its data under the ‘graphs’ item, which is a list of graphs; here it has only one element. From this graph element, I need the ’nodes’ item.
so_nodes = so_json['graphs'][0]['nodes']
Now I define a function that returns an iterator over the SO nodes matching a given feature type; the SO ID is then taken from the first hit.
def get_so_id(featuretype):
    """Return an iterator over SO nodes whose label matches the given feature type."""
    hits = filter(lambda x: x.get('lbl', "") == featuretype, so_nodes)
    return hits
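For example, the PURL of the SO term for the ‘CDS’ feature type is found in the first hit’s ‘id’ field:
# Extract the SO identifier (a PURL) for the 'CDS' feature type.
next(get_so_id('CDS'))['id']
'http://purl.obolibrary.org/obo/SO_0000316'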
Ontologies and namespaces
I will build the rdf graph by utilizing established domain-specific ontologies. Here, I define the ontology namespaces. The pflu namespace is defined such that pflu:<ID> with a valid feature ID resolves to the corresponding feature page at the Pseudomonas fluorescens SBW25 genome database. The gffo ontology was created to provide a unique namespace for all feature properties. Each property has a rdfs:isDefinedBy attribute that connects it to one or multiple established ontology elements.
pflu = Namespace(r'http://pflu.evolbio.mpg.de/bio_data/')
so = Namespace(r'http://purl.obolibrary.org/obo/so/')
up = Namespace(r'http://purl.uniprot.org/core/')
gffo = Namespace(r'https://raw.githubusercontent.com/mpievolbio-scicomp/GenomeFeatureFormatOntology/main/gffo#')
obo = Namespace(r'http://purl.obolibrary.org/obo/')
go = Namespace(r'http://amigo.geneontology.org/amigo/term/')
ec = Namespace(r'https://www.brenda-enzymes.org/enzyme.php?ecno=')
faldo = Namespace(r'http://biohackathon.org/resource/faldo#')
gfvo = Namespace(r'http://github.com/BioInterchange/Ontologies/gfvo#')
ogo = Namespace(r'http://protozoadb.biowebdb.org/22/ogroup#')
edam = Namespace('http://edamontology.org/')
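At this point I also create the (for now empty) rdf graph and bind the namespaces to prefixes, so that the Turtle serialization further below stays readable. A minimal sketch; the prefix choices are mine and binding is optional:
# Create the empty rdf graph and bind namespaces to readable prefixes.
graph = Graph()
for prefix, ns in [('pflu', pflu), ('so', so), ('up', up), ('gffo', gffo),
                   ('obo', obo), ('go', go), ('ec', ec), ('faldo', faldo)]:
    graph.bind(prefix, ns)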
Connecting to the Pseudomonas fluorescens SBW25 genome database
Each feature is to be defined as an rdf subject through a URI. This URI should resolve to a valid web resource. I will define a feature’s subject URI through the previously defined pflu namespace and the feature’s database ID in the Pflu SBW25 genome database. To retrieve this ID, I access the genome database’s public API at https://pflu.evolbio.mpg.de/web-services. As an example, I fetch all features of type biological_region as a json object:
biological_region_json = requests.get('https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region').json()
The returned json object is a deeply nested dictionary. All features are referenced with their label in the member item, e.g.
biological_region_json['member'][0]
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
'@type': 'biological_region',
'label': 'atcgggggcaagccccctcccaca',
'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}
The Pflu SBW25 database ID is the last path element of the ItemPage item (29901 in the above example). To access a given feature’s DB ID, I filter the list of features (member) in the json object for items whose ’label’ matches the gff3 feature’s ID. E.g., to find the ItemPage for the biological_region with ID ‘atcgggggcaagccccctcccaca’ in the gff3 file, I filter
next(filter(lambda x:x.get('label', "")=='atcgggggcaagccccctcccaca', biological_region_json['member']))
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
'@type': 'biological_region',
'label': 'atcgggggcaagccccctcccaca',
'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}
This filtering is now wrapped in a function that takes a gff3 feature object or ID string as argument and returns the corresponding json blob. Except for featuretype biological_region, I can query all features through the featuretype’s API endpoint with a query for the gff3 feature’s ID. For biological_region features, I use the prefetched json blob from above.
def get_pflu_json(feat=None, feat_id=None):
    """Return the Pflu SBW25 database's json blob for the given feature (or feature ID)."""
    if feat is None:
        feat = db[feat_id]
    feat_type = feat.featuretype
    # These featuretypes have no queryable API endpoint.
    if feat_type in ['ncRNA_gene', 'chromosome', 'pseudogenic_CDS']:
        return {}
    # biological_region features are looked up in the prefetched json blob.
    if re.match('biological_region', feat_type):
        return next(filter(lambda x: x.get('label', "") == feat.id, biological_region_json['member']))
    # Query the featuretype's endpoint with the feature's ID.
    pflu_search = 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/{}?identifier={}'.format(feat_type, feat.id)
    response = requests.get(pflu_search).json()
    return response
Following the above example, I can now fetch the json blob through a function call:
get_pflu_json(feat=db['atcgggggcaagccccctcccaca'])
{'@id': 'https://pflu.evolbio.mpg.de/web-services/content/v0.1/Biological_Region/29901',
'@type': 'biological_region',
'label': 'atcgggggcaagccccctcccaca',
'ItemPage': 'https://pflu.evolbio.mpg.de/bio_data/29901'}
Convert a gff3 feature into a list of nodes.
Now I have all utilities in place to finally code up a function that returns, for a given gff3 feature, its representation as a list of rdf nodes. This function works as follows:
- Retrieve the feature’s json object from the Pflu SBW25 database.
- Extract the Pflu SBW25 DB ID from the json object.
- Define the subject as the URI pflu:<ID>, which resolves to the Pflu SBW25 DB site for the given feature.
- Define the subject’s type through the SO term of the feature’s type.
- Add the Pflu SBW25 DB URI as a rdfs:seeAlso property value.
- Iterate over the top-level feature attributes (source, seqid, start, end, strand, frame, score) and define a node for each one. The properties are anchored in the newly defined gffo ontology, with the namespace defined above.
- Add GO and EC annotations as nodes with property values anchored in the respective ontology namespaces.
- Add database cross-references (Dbxref) as rdfs:seeAlso property values.
- Add remaining attributes from column 9 of the gff3 file via properties of the gffo namespace.
- The Uniprot review status is given as a up:reviewed property.
def feature_to_rdf(feat=None):
    """Return the given feature's rdf representation as a list of triples."""
    nodes = []
    # Get the feature's id in the Pflu SBW25 database. biological_region lookups
    # return the member entry itself, all others a blob with a 'member' list.
    response = get_pflu_json(feat)
    item = response['member'][-1] if 'member' in response else response
    pflu_db_id = item['ItemPage'].split("/")[-1]
    so_id = next(get_so_id(feat.featuretype))['id']
    # Subject URI from the pflu db id.
    subject = pflu.term(pflu_db_id)
    nodes.append((subject, RDF.type, gffo.Feature))
    nodes.append((subject, RDF.type, URIRef(so_id)))
    nodes.append((subject, RDFS.seeAlso, pflu.term(pflu_db_id)))
    # source
    nodes.append((subject, gffo.source, Literal(feat.source)))
    # seqid
    nodes.append((subject, gffo.seqid, Literal(feat.seqid)))
    # position
    nodes.append((subject, gffo.start, Literal(feat.start)))
    nodes.append((subject, gffo.end, Literal(feat.end)))
    nodes.append((subject, gffo.strand, Literal(feat.strand)))
    # position = BNode()
    # nodes.append((position, RDF.type, {'+': faldo.ForwardStrandPosition,
    #                                    '-': faldo.ReverseStrandPosition}[feat.strand]))
    # nodes.append((position, faldo.begin, Literal(feat.start)))
    # nodes.append((position, faldo.end, Literal(feat.end)))
    # nodes.append((subject, faldo.position, position))
    # score
    nodes.append((subject, gffo.score, Literal(feat.score)))
    # phase/frame
    nodes.append((subject, gffo.frame, Literal(feat.frame)))
    for key, vals in feat.attributes.items():
        if re.match('Ontology_term', key) is not None:
            for ot in vals:
                prefix = ot.split(":")[0]
                if re.match("EC", prefix) is not None:
                    obj = ec.term(ot.split(":")[1])
                elif re.match("GO", prefix) is not None:
                    obj = go.term(ot)
                else:
                    # Skip terms from unhandled ontologies.
                    continue
                nodes.append((subject, gffo.Ontology_term, obj))
        elif re.match('Dbxref', key) is not None:
            for val in vals:
                splt = val.split(":")
                dbkey = splt[0]
                dbval = ":".join(splt[1:])
                if dbkey not in dbxref_prefixes:
                    continue
                target = URIRef(dbxref_prefixes[dbkey].term(dbval))
                nodes.append((subject, RDFS.seeAlso, target))
        # Add protein id both as a literal and as a cross-ref to embl-cds.
        elif re.match('protein_id', key) is not None:
            for val in vals:
                nodes.append((subject, gffo.term(key), Literal(val)))
                nodes.append((subject,
                              RDFS.seeAlso,
                              URIRef(val, base="http://purl.uniprot.org/embl-cds/")))
        # Uniprot review status as a boolean up:reviewed property.
        elif re.match('uniprot_review_status', key) is not None:
            nodes.append((subject,
                          up.reviewed,
                          {"reviewed": Literal(True), "unreviewed": Literal(False)}[vals[0]]))
        else:
            for val in vals:
                nodes.append((subject, gffo.term(key), Literal(val)))
    return nodes
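As a sanity check on a single feature, the first few triples of the example CDS from above can be inspected (a sketch):
# Inspect the first triples generated for the example CDS.
nodes = feature_to_rdf(db['CDS:PFLU_0409-0'])
for s, p, o in nodes[:3]:
    print(s, p, o)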
Define rdf nodes for each feature.
Finally, I iterate over all features in the gff3 file and convert them to rdf nodes. Some featuretypes cause problems when pulling their json representation from the Pflu SBW25 database, hence I skip those. All nodes are added to the previously defined empty graph (rdflib.Graph).
# Featuretypes that fail against the Pflu SBW25 API; assumed here to be the
# endpoint-less types from get_pflu_json, adjust as needed.
skipped_features = ['ncRNA_gene', 'chromosome', 'pseudogenic_CDS']

for feat in tqdm(db.all_features()):
    if feat.featuretype in skipped_features:
        continue
    try:
        nodes = feature_to_rdf(feat)
    except Exception:
        # Report the offending feature and move on.
        print(feat)
        continue
    for node in nodes:
        graph.add(node)
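A quick look at the size of the populated graph (the exact triple count depends on the annotation):
# Number of triples in the populated graph.
print(len(graph))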
The initial run took approximately 14 hours on a single core. Subsequently, I serialize the graph to a Turtle-formatted rdf file.
Serialize and load into triple store.
graph.serialize('MPBAS00001.ttl', format='turtle')
The generated rdf file can then be imported into a rdf triple store such as apache-jena-fuseki, virtuoso, or corese. For its simplicity, I chose fuseki. In a subsequent post, I will demonstrate some example queries, including federated queries to other biological rdf resources.
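For fuseki, the Turtle file can for instance be uploaded via the Graph Store HTTP Protocol; a sketch assuming a local server and a hypothetical dataset named MPBAS00001:
# Upload the Turtle file to a local fuseki server (hypothetical dataset name).
with open('MPBAS00001.ttl', 'rb') as fh:
    r = requests.post(
        'http://localhost:3030/MPBAS00001/data',  # hypothetical endpoint
        data=fh,
        headers={'Content-Type': 'text/turtle'},
    )
r.raise_for_status()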
Alternatively, I could have submitted each node to an online triplestore on the fly. In my case, that turned out to be significantly slower (by a factor of 3) than populating the graph in memory and subsequently serializing it (and uploading to the store).
Last updated: 2022-03-31T12:19:11.497715+02:00
Python implementation: CPython
Python version : 3.9.6
IPython version : 7.29.0
Compiler : GCC 9.3.0
OS : Linux
Release : 5.10.0-10-amd64
Machine : x86_64
Processor :
CPU cores : 48
Architecture: 64bit