How to apply a license to knowledge graphs

How do you apply a license to a semantic web knowledge graph? I stumbled across this question while registering the NFDI4BIOIMAGE Knowledge Graph with the NFDI KGI Registry. Among other data, the registry form requested the license under which the KG is published.

Creative Commons Licenses on Wikidata

Ok, but how? The KGI registry's preferred format is to state the corresponding Wikidata ID; e.g., the Creative Commons license CC-BY 4.0 International is wd:Q6905323. The following Wikidata query returns a list of all International 4.0 CC licenses:
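As an illustration, a label-based query along these lines lists such license items on the Wikidata query service (the filter pattern is my assumption, not necessarily the post's actual query, and a more selective pattern may be needed to stay within the endpoint's time limit):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?license ?label WHERE {
  ?license rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(STRSTARTS(?label, "Creative Commons") && CONTAINS(?label, "4.0 International"))
}
```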

[Read More]

Linked Open Data for Bioimaging – The NFDI4BIOIMAGE Knowledge Graph

This post describes the NFDI4BIOIMAGE Knowledge Graph (N4BIKG). Many aspects of this project are still very much in flux, including a consensus on which ontologies and terms to employ, how to define various namespaces, and much more.

The N4BIKG is accessible through a SPARQL endpoint at https://kg.nfdi4bioimage.de/N4BIKG/sparql. The dataset is split into four named graphs:

PREFIX : <https://nfdi.fiz-karlsruhe.de/ontology#>
PREFIX n4bikg: <https://kg.nfdi4bioimage.de/n4bikg/>
PREFIX nfdicore: <https://nfdi.fiz-karlsruhe.de/ontology/>
PREFIX ome_core: <https://ld.openmicroscopy.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX t4fs: <http://purl.obolibrary.org/obo/T4FS_>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?graph where {
  graph ?graph {?s ?p ?o}
}
graph
https://kg.nfdi4bioimage.de/n4bikg/core
https://kg.nfdi4bioimage.de/n4bikg/n4bi_zenodo_community
https://kg.nfdi4bioimage.de/n4bikg/services
https://kg.nfdi4bioimage.de/n4bikg/owl

As a convenience, the default graph is the union of all named graphs.
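The named-graph query above can be sent to the endpoint from Python with the standard library alone; a minimal sketch (the Accept header requests the standard SPARQL JSON results format):

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://kg.nfdi4bioimage.de/N4BIKG/sparql"

def build_request(query: str) -> urllib.request.Request:
    """Build a SPARQL-protocol GET request asking for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )

query = "SELECT DISTINCT ?graph WHERE { GRAPH ?graph { ?s ?p ?o } }"

# To execute (requires network access):
# with urllib.request.urlopen(build_request(query)) as resp:
#     for binding in json.load(resp)["results"]["bindings"]:
#         print(binding["graph"]["value"])
```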

[Read More]

Open Science Ambassadors Meeting 2024

I attended the 2024 Open Science Ambassadors Meeting in Berlin and gave a presentation on reproducibility in scientific and high-performance computing. The presentation slides are at https://mpievolbio-scicomp.pages.gwdg.de/presentations/osa2024.pdf; the permalink is https://dx.doi.org/10.5281/zenodo.14051129.

I also advertised our FAQ collection on good scientific practice in research software engineering, now online at https://mpg-rse.pages.gwdg.de/gwp-rse. At the workshop, we collected many more questions and also some answers. Thanks to the participants for your contributions!

Turning a pdf collection into a jQuery DataTable

Introduction

As researchers, we give presentations. Over the past years, I have given somewhere between 2 and 10 presentations a year. What I need is an HTML document that lists all my presentations in a table: title, date of presentation, and a link to the PDF document. This table can then be included on my institutional home page. As a plus, I want the table to be searchable and each column to be sortable. I figured this must be possible by extracting the metadata stored inside each presentation's PDF file and writing it into an HTML table. In this way, I avoid having to maintain a separate database of titles, venues, dates, etc.; instead, the PDFs are my sole source of truth. In the following, I will describe a pipeline consisting of
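The table-generation step of such a pipeline can be sketched as follows, assuming the per-PDF metadata (title, date, link) has already been extracted, e.g. with pypdf or pdfinfo; the field names and the table id are illustrative, not from the original post:

```python
import html

# Example records as they might come out of the metadata-extraction step.
talks = [
    {"title": "Example talk", "date": "2024-11-07", "href": "osa2024.pdf"},
]

def table_html(records):
    """Render metadata records as an HTML table with a header row."""
    rows = "".join(
        f'<tr><td>{html.escape(r["title"])}</td>'
        f'<td>{html.escape(r["date"])}</td>'
        f'<td><a href="{html.escape(r["href"])}">pdf</a></td></tr>'
        for r in records
    )
    return (
        '<table id="talks"><thead><tr>'
        "<th>Title</th><th>Date</th><th>PDF</th>"
        "</tr></thead><tbody>" + rows + "</tbody></table>"
    )

table = table_html(talks)
```

jQuery DataTables can then be pointed at the table's id (`$('#talks').DataTable()`) to add the searching and per-column sorting.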

[Read More]

Numba and numpy

Introduction

The python library numba provides (among other things) just-in-time (jit) compilation for python. Python code can gain tremendous speed-ups with hardly any rewriting. Adding a decorator to a function, as illustrated in this example:

from numba import njit

@njit
def accelerate_me():
   ...

can lead to run times comparable to C or Fortran code.

Let's look at a concrete example, where we add random numbers in a for loop (not advisable in pure python, but used here to demonstrate numba).

[Read More]

SLURM Array experiments

Background

The SLURM array directive provides a simple and intuitive syntax to submit a job with multiple similar tasks to the underlying HPC system. To fully exploit this capability, a few things should be kept in mind.

In the following, starting from a simple serial job, we will explore how the total run time of our job behaves when the `array` option is applied.

job_001: A job with one task running on a single node on a single CPU

This is likely the simplest job imaginable: we run the command `hostname` and let the process idle for 5 seconds. Before and after the `sleep` statement, we print out a timestamp (`date`). The time difference between the two timestamps should be 5 seconds.
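A minimal sketch of such a batch script (the directive values are illustrative and may differ from the post's actual script):

```shell
#!/bin/bash
#SBATCH --job-name=job_001
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

hostname   # which node are we on?
date       # timestamp before sleeping
sleep 5
date       # timestamp after sleeping
```

Submitted with `sbatch job_001.sh`, the two timestamps in the job's output should differ by 5 seconds.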

[Read More]

Experimenting with the Tripal v3 JSON-LD API


Introduction

In this post I'm experimenting with querying the JSON-LD (LD stands for Linked Data) API of my Tripal 3 genome database for Pseudomonas fluorescens SBW25. A little bit of background: Tripal is a content management system (CMS) for genomic data, based on the Drupal CMS framework. It facilitates setting up customized genomic database webservers to publish genome assemblies, annotation tracks, and experimental and modeling data for one or several organisms. The current stable release is Tripal v3. Tripal implements the Chado database schema for biological (genomic) databases. Popular examples of Tripal instances are the Rice Genome Hub or the Kiwifruit Genome Database; more examples can be found on the Tripal website. What sets Tripal apart from other webserver frameworks with similar objectives (e.g. machado, the UCSC Genome Browser, or the Ensembl Genome Browser) is that all data in the Tripal database is accessible through a JSON API. Moreover, the JSON response adheres to the JSON-LD standard issued by the World Wide Web Consortium (W3C). This feature makes Tripal (in principle) compatible with Semantic Web applications and eases connecting Linked Data services such as public SPARQL endpoints to Tripal sites, or Tripal sites to each other.
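A standard-library sketch of querying such an API: the base URL below is a placeholder, and the `/web-services/content/v0.1/` path follows Tripal v3's web-services layout but should be verified against your own instance.

```python
import json
import urllib.request

BASE = "https://example.org"  # placeholder for your Tripal site

def jsonld_request(path):
    """Build a request for `path` below BASE, asking for a JSON-LD response."""
    return urllib.request.Request(
        f"{BASE}/{path}",
        headers={"Accept": "application/ld+json"},
    )

# To fetch the content-type listing (requires network access):
# with urllib.request.urlopen(jsonld_request("web-services/content/v0.1/")) as resp:
#     doc = json.load(resp)
#     print(doc["@context"])
```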

[Read More]

Conversion of multiple sequence alignment file formats

Working on a multiple sequence alignment project, I wanted to calculate a distance matrix from an MSA dataset generated by progressiveMauve. The MSA file is in xmfa format. I could not find a tool that calculates a distance matrix directly from the xmfa file, so I searched for conversion utilities to other MSA formats, such as phylip, maf, or msa.

This biostars forum entry always came out on top, but the links provided there are dead. Geneious Prime loads xmfa, but I could not figure out how to write it back in a different format. Final solution: biopython:
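A sketch of the biopython route, assuming Bio.AlignIO's "mauve" parser reads the progressiveMauve xmfa output; the file names are placeholders:

```python
# AlignIO.convert parses the input file and writes it back in the target
# format, returning the number of alignments converted.
from Bio import AlignIO

def convert_msa(src, src_fmt, dst, dst_fmt):
    return AlignIO.convert(src, src_fmt, dst, dst_fmt)

# e.g. convert_msa("alignment.xmfa", "mauve", "alignment.phy", "phylip")
```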

[Read More]

Querying metadata from omero in python

A colleague needed a table listing image ID, image filename, and number of ROIs for all images in a project. Here’s the python script I came up with:

Query Metadata from omero

Imports

import omero
from omero.gateway import BlitzGateway
import pandas as pd
import getpass

Connect to omero

conn = BlitzGateway('XXXXX', getpass.getpass(), host='omero.server.org')
conn.connect()  # prompts for the password and returns True on success

Retrieve datasets

This sets up an iterator over the needed datasets but does not load any data into memory (yet).

datasets = conn.getObjects(obj_type='Dataset', ids=[1, 2, 3, 4])

Loop over datasets and store requested information in a list of tuples.

records = []
for dataset in datasets:
    id = dataset.id
    
    imgs = conn.getObjects("Image", opts={'dataset':id})
    
    for img in imgs:
        records.append((id, img.id, img.getName(), img.getROICount() ))

Write records into a DataFrame

df = pd.DataFrame.from_records(records, columns=['dataset id', 'image id', 'image filename', 'roi count'])

Write data as csv file.

df.to_csv('ds_id_fname_Nroi.csv', index=False)

Create file annotation object and attach to the omero project that encapsulates the datasets.

# Look up the enclosing project first; `project_id` stands for its numeric ID.
project = conn.getObject("Project", project_id)
file_annotation = conn.createFileAnnfromLocalFile(
    'ds_id_fname_Nroi.csv', mimetype="text/plain",
    desc="Table of images, their dataset id, image id, filename, and ROI count.")
project.linkAnnotation(file_annotation)

Running this produces a csv file on disk and attaches this file as a file annotation on the omero project.

[Read More]