Experimenting with the Tripal 3 JSON-LD API
In this post I'm experimenting with querying the JSON-LD (LD stands for Linked Data) API of my Tripal 3 genome database for Pseudomonas fluorescens SBW25. A little bit of background: Tripal is a content management system (CMS) for genomic data, based on the Drupal CMS framework. It facilitates setting up customized genomic database web servers to publish genome assemblies, annotation tracks, and experimental and modeling data for one or several organisms.
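As a minimal sketch of what such a query can look like, the snippet below fetches a JSON-LD document with the requests library. The hostname and the /web-services/content/v0.1 path are assumptions for illustration, not the actual endpoint of my server.

    # Minimal sketch: fetch a JSON-LD document from a Tripal 3 web-services endpoint.
    # The hostname and the /web-services/content/v0.1 path are assumptions, not my real server.
    import requests

    BASE = "https://pflu.example.org/web-services/content/v0.1"

    resp = requests.get(BASE, headers={"Accept": "application/ld+json"})
    resp.raise_for_status()
    doc = resp.json()

    # JSON-LD carries its vocabulary in @context; inspect it and the top-level keys
    print(doc.get("@context"))
    print(list(doc.keys()))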
[Read More]
Conversion of multiple sequence alignment file formats
Working on a multiple sequence alignment project, I wanted to calculate a distance matrix from an MSA dataset generated by progressiveMauve. The MSA file is in xmfa format. I could not find a tool that calculates a distance matrix directly from the xmfa file, so I searched for conversion utilities to other MSA formats, such as phylip, maf, or msa.
This Biostars forum entry always came up at the top of my searches, but the links provided there are dead.
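One possible route, sketched below, is to read the xmfa blocks with Biopython and write them back out in a phylip dialect; the assumption that Bio.AlignIO's "mauve" parser handles the progressiveMauve output, and the file names, are mine.

    # Sketch: convert an xmfa alignment to phylip with Biopython (file names are placeholders).
    # Assumes Bio.AlignIO's "mauve" parser can read the progressiveMauve output.
    from Bio import AlignIO
    from Bio.Phylo.TreeConstruction import DistanceCalculator

    alignments = list(AlignIO.parse("alignment.xmfa", "mauve"))

    # each xmfa block becomes one MultipleSeqAlignment; write them out as relaxed phylip
    AlignIO.write(alignments, "alignment.phy", "phylip-relaxed")

    # a per-block distance matrix can then be computed, e.g. for the first block
    dm = DistanceCalculator("identity").get_distance(alignments[0])
    print(dm)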
[Read More]
Querying metadata from omero in python
A colleague needed a table listing image ID, image filename, and number of ROIs for all images in a project. Here's the python script I came up with:
Query Metadata from omero

Imports

    import omero
    from omero.gateway import BlitzGateway
    import pandas as pd
    import getpass

Connect to omero

    conn = BlitzGateway('XXXXX', getpass.getpass(), host='omero.server.org')
    conn.connect()
    True

Retrieve datasets

This sets up an iterator over the needed datasets but does not load any data into memory (yet).
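The excerpt ends before the actual query, so the sketch below shows one way the per-image table could be assembled with the BlitzGateway API; the project ID and the use of the ROI service are my assumptions, not necessarily what the full post does.

    # Sketch (not the script from the post): table of image ID, file name and ROI count
    # for every image in a project. The project ID 123 is a placeholder.
    roi_service = conn.getRoiService()
    rows = []
    for dataset in conn.getObject("Project", 123).listChildren():
        for image in dataset.listChildren():
            result = roi_service.findByImage(image.getId(), None)
            rows.append({"image_id": image.getId(),
                         "filename": image.getName(),
                         "n_rois": len(result.rois)})

    df = pd.DataFrame(rows)
    print(df.head())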
[Read More]
Example queries against the Pseudomonas fluorescens SBW25 knowledge graph.
In this post I will present example SPARQL queries against the Pseudomonas fluorescens SBW25 knowledge graph (SBW25KG). The knowledge graph was derived from the manually created annotation in gff3 format, as explained in a previous post.
The queries are run against a local instance of the apache-jena-fuseki triplestore. First, I set the endpoint URL and the maximum number of returned records:
    %endpoint http://micropop046:3030/plu/
    %show 50

    Endpoint set to: http://micropop046:3030/plu/
    Result maximum size: 50

Retrieve 10 CDS's from the SBW25KG.
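The same kind of query can also be sent from plain python, for readers without the sparql-kernel. The sketch below uses SPARQLWrapper and assumes that CDS features are typed with the Sequence Ontology class SO_0000316; that is my guess at the schema, the actual SBW25KG modeling is described in the post.

    # Sketch: query the Fuseki endpoint from python with SPARQLWrapper.
    # The rdf:type obo:SO_0000316 for CDS features is an assumption about the schema.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://micropop046:3030/plu/")
    sparql.setQuery("""
        PREFIX obo: <http://purl.obolibrary.org/obo/>
        SELECT ?cds WHERE { ?cds a obo:SO_0000316 } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for binding in sparql.query().convert()["results"]["bindings"]:
        print(binding["cds"]["value"])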
[Read More]
Turning a genome annotation into an RDF knowledge graph.
In this article, I will describe the steps taken to generate an RDF (Resource Description Framework) data structure starting from a gff3-formatted genome annotation file. The annotation file in question is the new reference annotation for Pseudomonas fluorescens strain SBW25.
Required packages

I will make use of the following python packages:

gffutils to read the gff3 file into a sqlite database.
rdflib to construct the rdf graph.
requests to fetch data (e.
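As a rough sketch of how these pieces fit together (namespace IRI, file names and predicates are illustrative placeholders, simplified compared to the real graph):

    # Rough sketch: load a gff3 file with gffutils and emit simple triples with rdflib.
    # The namespace IRI, file names and predicates are placeholders, not the real schema.
    import gffutils
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    db = gffutils.create_db("SBW25.gff3", dbfn=":memory:")

    EX = Namespace("https://example.org/sbw25/")
    g = Graph()
    g.bind("ex", EX)

    for feature in db.all_features():
        subject = EX[feature.id]
        g.add((subject, RDF.type, EX[feature.featuretype]))
        g.add((subject, EX.start, Literal(feature.start)))
        g.add((subject, EX.end, Literal(feature.end)))

    g.serialize(destination="sbw25.ttl", format="turtle")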
[Read More]
Importing data to OMERO
Getting your data into the database is one of the most frequent tasks when working with OMERO. There are two different ways to import data into OMERO: via the desktop app OMERO.insight or via the command-line client that comes with OMERO.py. Both software packages can be found on the OMERO download site https://www.openmicroscopy.org/omero/downloads/. While the desktop app is easy and intuitive to use, a drawback is that it must remain open while the data is uploaded.
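A minimal command-line session might look like the following; the server name, user name and file path are placeholders.

    # log in once per session (server and user are placeholders, password is prompted for)
    omero login -s omero.example.org -u jdoe
    # import a single image file (path is a placeholder)
    omero import /data/experiment1/image_001.tif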
[Read More]
Programming Courses at MPI Evolutionary Biology
As in previous years, some PostDocs and Staff Scientists offer courses on computing, programming, data analysis and visualization, and related topics.
Requirements: Some courses require that participants already have a certain level of experience and knowledge. Before signing up, please assess for yourself whether you feel comfortable with these requirements; if in doubt, please contact the person responsible for the course. There will not be enough time during the course to bring everybody up to the expected level.
[Read More]
Jupyter lab tutorial
On May 5 & 6, 2021, I took part in the workshop “Kompetenz Forschungsdatenmanagement” organized by the Max Planck Digital Library. Day 2 featured a full session on “Reproducible Science with Jupyter”, with a presentation by Hans Fangohr (slides available here) followed by an interactive hands-on tutorial. In part 1 of the tutorial, I step through a data analysis workflow based on the Johns Hopkins University COVID-19 dataset from GitHub. Parts 2 and 3 are about Bayesian inference of SIR model parameters, kindly provided by Johannes Zierenberg from the MPI for Dynamics and Self-Organization.
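To give a flavour of part 1, the dataset can be pulled straight from the JHU CSSE repository with pandas. This is a minimal sketch rather than the tutorial's actual code, and the raw-file URL is quoted from memory and may have changed.

    # Minimal sketch: load the JHU CSSE confirmed-cases time series with pandas.
    # The raw-file URL is quoted from memory and may have changed since.
    import pandas as pd

    url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
           "csse_covid_19_data/csse_covid_19_time_series/"
           "time_series_covid19_confirmed_global.csv")

    cases = pd.read_csv(url)
    # one row per region, one column per reporting date
    print(cases.shape)
    print(cases.head())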
[Read More]
Dask and Jupyter
Parallel python with dask and jupyter

The dask framework provides an incredibly useful environment for parallel execution of python code in interactive settings (e.g. jupyter) or batch mode. Its key features are (from what I've seen so far):
Representation of threading, multiprocessing, and distributed computing with one unified API and CLI.
Abstraction of HPC schedulers (PBS, Moab, SLURM, …)
Data structures for distributed computing with pandas and numpy syntax

Dask-jobqueue

The package dask_jobqueue seems to me to be the most user-friendly when it comes to parallelization on HPC clusters with a scheduling system such as SLURM.
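A minimal sketch of what this looks like on a SLURM cluster (core counts, memory, walltime and queue name are placeholders):

    # Minimal sketch: start dask workers as SLURM jobs via dask_jobqueue.
    # Core counts, memory, walltime and queue name are placeholders.
    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client
    import dask.array as da

    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00", queue="standard")
    cluster.scale(jobs=4)          # submit 4 SLURM jobs, each running dask workers
    client = Client(cluster)

    # any dask collection now runs on the cluster, e.g. a dask.array reduction
    x = da.random.random((20000, 20000), chunks=(2000, 2000))
    print(x.mean().compute())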
[Read More]
Research Software Development Workshop 2020
On Dec. 10 & 11, Nikoletta Glynatsi and I ran our first workshop on “Research Software Development”. Far be it from me to praise our own work, but judging from the feedback, it was a big success. For two days, we taught best practices in writing software (exemplified with python), using git for version control, collaborating on gitlab projects, and employing gitlab's built-in continuous integration tools to run automated tests and build a reference manual.
All material from the workshop, including all presentations and code examples, is available under the terms of the MIT License from this gitlab repository: https://gitlab.
[Read More]