Welcome to the blog on scientific computing at the Max-Planck Institute for Evolutionary Biology

In this blog, we summarize the activities in the Scientific Computing Unit. In particular, we cover topics such as

* Tips and Tricks, Dos and Don'ts in scientific computing
* Outlines and summaries of ongoing research and software projects
* Updates and new releases from our software repositories
* Recommended readings, videos, websites

Open Science Ambassadors Meeting 2024

I attended the 2024 Open Science Ambassadors Meeting in Berlin and gave a presentation on reproducibility in scientific and high performance computing. The presentation slides are at https://mpievolbio-scicomp.pages.gwdg.de/presentations/osa2024.pdf, the permalink is https://dx.doi.org/10.5281/zenodo.14051129. I also advertised our FAQ collection on good scientific practice in research software engineering, now online at https://mpg-rse.pages.gwdg.de/gwp-rse. At the workshop, we collected many more questions and also some answers. Thanks to all participants for their contributions! [Read More]

Turning a PDF collection into a jQuery DataTable

Introduction: As researchers, we give presentations. Over the last few years, I have given between 2 and 10 presentations a year. What I need is an HTML document that lists all my presentations in a table, with the title, the date of presentation, and a link to the PDF document for each. This table can then be included on my institutional home page. As a plus, I want the table to be searchable and each column to be sortable. [Read More]
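The table described above could be generated along the following lines. This is only a minimal sketch using the Python standard library, not the approach taken in the full post: the records, file names, the function name `build_table`, and the table id `presentations` are all hypothetical. The generated table has a `<thead>` and a `<tbody>`, which DataTables expects.

```python
from datetime import date
from html import escape

# Hypothetical records; in practice they might come from file names or a CSV file.
PRESENTATIONS = [
    {"title": "Example talk A", "date": date(2023, 5, 2), "pdf": "talk_a.pdf"},
    {"title": "Example talk B", "date": date(2024, 11, 5), "pdf": "talk_b.pdf"},
]


def build_table(records, table_id="presentations"):
    """Return an HTML table with one row per presentation, newest first."""
    rows = []
    for rec in sorted(records, key=lambda r: r["date"], reverse=True):
        title = escape(rec["title"])
        day = rec["date"].isoformat()
        pdf = escape(rec["pdf"], quote=True)
        rows.append(
            f'    <tr><td>{title}</td><td>{day}</td>'
            f'<td><a href="{pdf}">pdf</a></td></tr>'
        )
    return "\n".join(
        [
            f'<table id="{escape(table_id, quote=True)}">',
            "  <thead>",
            "    <tr><th>Title</th><th>Date</th><th>Slides</th></tr>",
            "  </thead>",
            "  <tbody>",
            *rows,
            "  </tbody>",
            "</table>",
        ]
    )


if __name__ == "__main__":
    print(build_table(PRESENTATIONS))
```

Once jQuery and the DataTables script are loaded on the page, calling `$('#presentations').DataTable();` should make the table searchable and sortable.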

Numba and (vs.) SLURM

Introduction: The Python library numba provides, among other features, just-in-time (JIT) compilation for Python. Python code can gain large speed-ups without rewriting existing code. Simply adding a decorator to a function, for example `from numba import njit` followed by `@njit` directly above `def accelerate_me(): ...`, can lead to run times comparable to C or Fortran code. Let's make a concrete example, in which we add random numbers in a for loop (not advised, but used here to demonstrate numba). [Read More]
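A loop of the kind mentioned above might look like the following sketch. It is not the code from the full post: the function names and the loop length are made up, and the `try`/`except` fallback is an assumption added here so that the example also runs (without any speed-up) where numba is not installed.

```python
import random
import time

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to a no-op decorator without numba.
    def njit(func):
        return func


def add_random_python(n):
    """Sum n random numbers in a plain Python loop."""
    total = 0.0
    for _ in range(n):
        total += random.random()
    return total


@njit
def add_random_jit(n):
    """Same loop, compiled by numba on the first call (if numba is available)."""
    total = 0.0
    for _ in range(n):
        total += random.random()
    return total


if __name__ == "__main__":
    n = 1_000_000
    add_random_jit(10)  # warm-up call, so compilation time is not measured below
    for label, func in (("plain Python", add_random_python),
                        ("with @njit", add_random_jit)):
        start = time.perf_counter()
        func(n)
        print(f"{label}: {time.perf_counter() - start:.4f} s")
```

The warm-up call matters because the first call of a jitted function includes the compilation time.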

SLURM Array experiments

Background: The SLURM array directive provides a simple and intuitive syntax to submit a job with multiple similar tasks to the underlying HPC system. To fully exploit this capability, a few things should be kept in mind. In the following, starting from a simple, serial job, we explore how the total run time of our job behaves when the `array` option is applied. job_001: A job with one task running on a single node on a single CPU. This is probably the simplest job imaginable: We run the command `hostname` and let the process idle for 5 seconds. [Read More]
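To illustrate how the tasks of an array job can differ from one another, here is a minimal sketch of a Python worker that picks its input from the `SLURM_ARRAY_TASK_ID` environment variable, which SLURM sets for each array task. The input names and the function `select_input` are hypothetical and not taken from the full post; the batch script is assumed to contain a line such as `#SBATCH --array=0-3`.

```python
import os

# Hypothetical inputs; each array task processes exactly one of them.
INPUTS = ["sample_a", "sample_b", "sample_c", "sample_d"]


def select_input(inputs, environ=os.environ):
    """Return the input assigned to the current array task.

    Defaults to task 0 when the script runs outside of a SLURM array job.
    """
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(inputs):
        raise IndexError(f"task id {task_id} has no matching input")
    return inputs[task_id]


if __name__ == "__main__":
    print(f"Processing {select_input(INPUTS)}")
```

Passing the environment as an argument keeps the function easy to try out on a workstation without a scheduler.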

Experimenting with the tripalv3 JSON-LD API

Introduction: In this post I'm experimenting with querying the JSON-LD (LD stands for Linked Data) API of my Tripal3 genome database for Pseudomonas fluorescens SBW25. A little bit of background: Tripal is a content management system (CMS) for genomic data, based on the Drupal CMS framework. It facilitates setting up customized genomic database webservers to publish genome assemblies, annotation tracks, experimental and modeling data for one or several organisms. [Read More]

Conversion of multiple sequence alignment file formats

Working on a multiple sequence alignment (MSA) project, I wanted to calculate a distance matrix from an MSA dataset generated by progressiveMauve. The MSA file is in xmfa format. I could not find a tool that calculates a distance matrix directly from the xmfa file, so I searched for conversion utilities to other MSA formats, such as phylip, maf, or msa. This biostars forum entry consistently came out on top of the search results, but the links provided there are dead. [Read More]
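For orientation, the xmfa layout can be read with a few lines of standard-library Python. The sketch below is not the conversion route described in the full post. It assumes the usual xmfa layout (comment lines starting with `#`, FASTA-like entries starting with `>`, alignment blocks closed by a line starting with `=`), and the p-distance (fraction of differing sites among gap-free columns) is just one simple distance chosen here for illustration. The example alignment and file names are made up.

```python
from itertools import combinations

# Made-up example with a single alignment block of three sequences.
EXAMPLE = """#FormatVersion Mauve1
> 1:1-8 + genome_a.fasta
ACGTACGT
> 2:1-8 + genome_b.fasta
ACGTTCGT
> 3:1-7 + genome_c.fasta
ACG-ACGA
=
"""


def parse_xmfa(text):
    """Return a list of blocks; each block maps an entry header to its gapped sequence."""
    blocks, current, header = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("="):
            if current:
                blocks.append(current)
            current, header = {}, None
        elif line.startswith(">"):
            header = line[1:].strip()
            current[header] = ""
        elif header is not None:
            current[header] += line
    if current:
        blocks.append(current)
    return blocks


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return float("nan")
    return sum(a != b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    for block in parse_xmfa(EXAMPLE):
        for (name_a, seq_a), (name_b, seq_b) in combinations(block.items(), 2):
            print(f"{name_a} | {name_b}: {p_distance(seq_a, seq_b):.3f}")
```

A real progressiveMauve output contains many blocks, and not every genome is present in every block, so a genome-wide distance would have to combine the per-block values in some way.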

Querying metadata from omero in python

A colleague needed a table listing the image ID, image filename, and number of ROIs for all images in a project. Here’s the Python script I came up with: Query Metadata from omero. Imports: `import omero`, `from omero.gateway import BlitzGateway`, `import pandas as pd`, `import getpass`. Connect to omero: `conn = BlitzGateway('XXXXX', getpass.getpass(), host='omero.server.org')`, then `conn.connect()`, which returns `True` on success. Retrieve datasets: This sets up an iterator over the needed datasets but does not load any data into memory (yet). [Read More]

Example queries against the Pseudomonas fluorescens SBW25 knowledge graph.

Introduction: In this post I will present example SPARQL queries against the Pseudomonas fluorescens SBW25 knowledge graph (SBW25KG). The knowledge graph was derived from the manually created annotation in gff3 format, as explained in a previous post. The queries are run against a local instance of the apache-jena-fuseki triplestore. First, I set the endpoint URL and the maximum number of returned records: `%endpoint http://micropop046:3030/plu/` and `%show 50`. Output: "Endpoint set to: http://micropop046:3030/plu/" and "Result maximum size: 50". Retrieve 10 CDSs from the SBW25KG. [Read More]

Turning a genome annotation into an RDF knowledge graph.

Introduction: In this article, I will describe the steps taken to generate an RDF (Resource Description Framework) data structure starting from a gff3 formatted genome annotation file. The annotation file in question is the new reference annotation for Pseudomonas fluorescens strain SBW25. Required packages: I will make use of the following Python packages: gffutils to read the gff3 file into a sqlite database. rdflib to construct the RDF graph. requests to fetch data (e. [Read More]
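The basic idea of mapping an annotation record to RDF statements can be sketched with the standard library alone. The example below deliberately does not use gffutils or rdflib and is not the procedure from the full post: it parses a single gff3 feature line (nine tab-separated columns) and writes a few N-Triples statements. The namespace `http://example.org/annotation/`, the predicate names, and the example feature are hypothetical.

```python
from urllib.parse import quote, unquote

BASE = "http://example.org/annotation/"  # hypothetical namespace
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"

# Made-up feature line with the nine gff3 columns.
EXAMPLE_LINE = "chr1\tmanual\tgene\t1\t1500\t.\t+\t.\tID=gene0001;Name=exampleA"


def parse_gff3_line(line):
    """Split one gff3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a gff3 feature line has nine tab-separated columns")
    attributes = {}
    for item in cols[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[unquote(key)] = unquote(value)
    return {"seqid": cols[0], "type": cols[2], "start": int(cols[3]),
            "end": int(cols[4]), "strand": cols[6], "attributes": attributes}


def literal(value):
    """Format a string as an N-Triples literal."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def feature_to_ntriples(feature):
    """Return a list of N-Triples statements describing one feature."""
    subject = "<" + BASE + quote(feature["attributes"]["ID"]) + ">"
    triples = [
        f"{subject} {RDF_TYPE} <{BASE}{quote(feature['type'])}> .",
        f"{subject} <{BASE}start> {literal(feature['start'])}^^{XSD_INTEGER} .",
        f"{subject} <{BASE}end> {literal(feature['end'])}^^{XSD_INTEGER} .",
        f"{subject} <{BASE}strand> {literal(feature['strand'])} .",
    ]
    if "Name" in feature["attributes"]:
        triples.append(
            f"{subject} {RDFS_LABEL} {literal(feature['attributes']['Name'])} .")
    return triples


if __name__ == "__main__":
    print("\n".join(feature_to_ntriples(parse_gff3_line(EXAMPLE_LINE))))
```

For a complete annotation, established vocabularies would be preferable to ad hoc predicates like the ones used in this sketch.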