Example queries against the Pseudomonas fluorescens SBW25 knowledge graph.

Introduction

In this post I will present example SPARQL queries against the Pseudomonas fluorescens SBW25 knowledge graph (SBW25KG). The knowledge graph was derived from the manually created annotation in gff3 format, as explained in a previous post.

The queries are run against a local instance of the Apache Jena Fuseki triplestore. First, I set the endpoint URL and the maximum number of returned records:

%endpoint http://micropop046:3030/plu/
%show 50
Endpoint set to: http://micropop046:3030/plu/
Result maximum size: 50

Retrieve 10 CDSs from the SBW25KG.

In this first example, I query for CDS features and list their primary name, locus tag, and protein ID.
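
A query along these lines, run in a notebook cell after the magics above, would produce such a listing. This is only a sketch: the sbw25kg: prefix and the property names (name, locus_tag, protein_id) are assumptions about the graph's vocabulary, not the actual terms used in the SBW25KG; obo:SO_0000316 is the Sequence Ontology class for CDS.

    PREFIX obo: <http://purl.obolibrary.org/obo/>
    PREFIX sbw25kg: <http://example.org/sbw25kg/>

    SELECT ?cds ?name ?locus_tag ?protein_id
    WHERE {
      ?cds a obo:SO_0000316 .          # a CDS feature
      ?cds sbw25kg:name ?name ;
           sbw25kg:locus_tag ?locus_tag ;
           sbw25kg:protein_id ?protein_id .
    }
    LIMIT 10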

[Read More]

Turning a genome annotation into an RDF knowledge graph.

Introduction

In this article, I will describe the steps taken to generate an RDF (Resource Description Framework) data structure starting from a gff3-formatted genome annotation file. The annotation file in question is the new reference annotation for Pseudomonas fluorescens strain SBW25.

Required packages

I will make use of the following python packages:

  • gffutils to read the gff3 file into a sqlite database.
  • rdflib to construct the RDF graph.
  • requests to fetch data (e.g. ontology files).

All packages can be installed via conda from the conda-forge channel or from PyPI.
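
As a minimal sketch of how these packages fit together (the file names, the namespace URI, and the property names are placeholders, not the terms actually used in the SBW25KG):

    import gffutils
    from rdflib import Graph, Literal, Namespace, RDF

    # Read the gff3 annotation into a sqlite-backed database.
    db = gffutils.create_db("SBW25.gff3", dbfn="SBW25.db", force=True, keep_order=True)

    # Construct the RDF graph; the namespace is a placeholder.
    SBW25 = Namespace("http://example.org/sbw25/")
    g = Graph()

    for cds in db.features_of_type("CDS"):
        subject = SBW25[cds.id]
        g.add((subject, RDF.type, SBW25.CDS))
        # gff3 attribute values come back as lists of strings.
        for locus_tag in cds.attributes.get("locus_tag", []):
            g.add((subject, SBW25.locus_tag, Literal(locus_tag)))

    g.serialize("sbw25.ttl", format="turtle")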

[Read More]

Importing data to OMERO

Getting your data into the database is one of the most frequent tasks when working with OMERO. There are two ways to import data into OMERO: via the desktop app OMERO.insight or via the commandline client that comes with OMERO.py. Both can be found on the OMERO download site https://www.openmicroscopy.org/omero/downloads/. While the desktop app is easy and intuitive to use, a drawback is that it must remain open while the data is uploaded. It is therefore mostly useful for relatively small datasets of a few GB at most. Larger datasets are more reliably imported with the commandline client. Moreover, if the data resides on a remote computer, the commandline client may be your only option.
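
For the commandline route, a minimal session looks roughly like this (server address, username, and file path are placeholders):

    # log in once; the session is reused by subsequent commands
    omero login username@omero.example.org
    # import a single file; directories can be imported as well
    omero import path/to/image.tif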

[Read More]

Programming Courses at MPI Evolutionary Biology

As in previous years, some postdocs and staff scientists offer courses on computing, programming, data analysis and visualization, and related topics.

Requirements: Some courses require that participants already have a certain level of experience and knowledge. Before signing up, please assess for yourself whether you feel comfortable with these requirements; if in doubt, please contact the course organizer. There will not be enough time during the course to bring everybody up to the expected level.

[Read More]

Jupyter lab tutorial

On May 5 & 6, 2021, I took part in the workshop “Kompetenz Forschungsdatenmanagement” organized by the Max Planck Digital Library. Day 2 featured a full session on “Reproducible Science with Jupyter” with a presentation by Hans Fangohr (slides available here), followed by an interactive hands-on tutorial. In part 1 of the tutorial, I step through a data analysis workflow based on the Johns Hopkins University COVID-19 dataset from github. Parts 2 and 3 are about Bayesian inference of SIR model parameters, kindly provided by Johannes Zierenberg from the MPI for Dynamics and Self-Organization. All notebooks are available at https://gitlab.gwdg.de/mpievolbio-scicomp/fdm2021/-/blob/master/notebooks or interactively on mybinder.org at https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.gwdg.de%2Fmpievolbio-scicomp%2Ffdm2021.git/HEAD?urlpath=lab.

[Read More]

Dask and Jupyter

Parallel python with dask and jupyter

The dask framework provides an incredibly useful environment for parallel execution of python code in interactive settings (e.g. jupyter) or batch mode. Its key features are (from what I’ve seen so far):

  • Representation of threading, multiprocessing, and distributed computing with one unified API and CLI.
  • Abstraction of HPC schedulers (PBS, Moab, SLURM, …).
  • Data structures for distributed computing with pandas and numpy syntax.

Dask-jobqueue

The package dask_jobqueue seems to me to be the most user-friendly when it comes to parallelization on HPC clusters with a scheduling system such as SLURM. For now, the most interesting question for me is how dask maps the parameters given to the cluster API and the cluster.scale method to the parameters usually given in a SLURM batch job script and to the mpirun parameters.
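
A minimal sketch of that mapping (queue name and resource values are placeholders): each argument to SLURMCluster ends up as a line in a generated batch script, and cluster.scale submits that script the requested number of times.

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        queue="medium",        # -> #SBATCH -p medium
        cores=16,              # -> #SBATCH --cpus-per-task=16
        memory="64GB",         # -> #SBATCH --mem
        walltime="01:00:00",   # -> #SBATCH -t 01:00:00
    )
    print(cluster.job_script())  # inspect the generated SLURM batch script

    cluster.scale(jobs=4)        # submit four such jobs
    client = Client(cluster)     # route subsequent dask computations to the workers

The job_script() method is particularly helpful here, since it prints exactly the batch script that scale() will submit.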

[Read More]

Research Software Development Workshop 2020

On Dec. 10 & 11, Nikoletta Glynatsi and I ran our first workshop on “Research Software Development”. Far be it from me to praise myself, but judging from the feedback, it was a big success. For two days, we taught best practices in writing software (exemplified with python), using git for version control, collaborating on gitlab projects, and employing gitlab’s built-in continuous integration tools to run automated tests and build a reference manual.

All material from the workshop, including all presentations and code examples, is available under the terms of the MIT License from this gitlab repository: https://gitlab.gwdg.de/glynatsi/rsd-workshop.

[Read More]

Running matlab code on HPC with SLURM

Running MATLAB scripts on HPC

Today, the question came up of how to run MATLAB code on an HPC system featuring a SLURM scheduler. The syntax for running MATLAB on the command line is indeed a bit counterintuitive, at least if you are (like me) used to running python or R scripts.

Example SLURM script

The following snippet is an example of how to submit a MATLAB script for execution on an HPC server with the SLURM scheduler:
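
A minimal sketch of such a script (resource values and the script name analysis.m are placeholders):

    #!/bin/bash
    #SBATCH --job-name=matlab-job
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --time=01:00:00

    # site-specific; adjust to whatever makes MATLAB available on your cluster
    module load matlab

    # The counterintuitive part: the script name is passed without the .m
    # extension, and an explicit "exit" keeps MATLAB from idling afterwards.
    matlab -nodisplay -nosplash -r "analysis; exit"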

[Read More]

Converting jupyter notebooks with embedded images to pdf.

Inserting images in a jupyter notebook is just drag and drop:

[animation: dragging an image into a notebook cell]

This will automagically produce the image link at the drop position.

[image: the generated image link]

And after executing the cell, the image is rendered:

[image: the rendered image]

So far so good. But ever tried to convert a notebook with embedded images to pdf or html (slides)?

My first guess was: Menu -> File -> Export Notebook As -> PDF.

However, this immediately runs into an Error 500 that traces back to LaTeX not being able to locate the image attachment:5444....
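
Since the Lab export runs nbconvert under the hood, the same failure typically shows up when converting on the command line, where the LaTeX log is easier to inspect (notebook.ipynb is a placeholder):

    jupyter nbconvert --to pdf notebook.ipynb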

[Read More]