Introduction
As researchers, we give presentations. Over the last few years, I gave somewhere between 2 and 10 presentations a year. What I need is an html document that lists all my presentations in a table: title, date of presentation, and a link to the pdf document. This table can then be included on my institutional home page. As a plus, I want the table to be searchable and each column to be sortable. I figured this must be possible by extracting the metadata stored inside each presentation's pdf file and writing it into an html table. In this way, I can avoid having to maintain a separate database of titles, venues, dates, etc.; instead, the pdfs are my sole source of truth. In the following, I will describe a pipeline consisting of
- Writing metadata to pdf
  - LaTeX: the hyperxmp package for pdfs compiled from LaTeX source files.
  - exiftool for already existing pdfs (e.g. generated from pptx or odp documents).
- Extracting metadata
  - exiftool with output written to RDF/XML.
  - Running a SPARQL query against the RDF document with output written to json.
  - Loading the json data via jquery ajax and publishing it as a jquery datatable in an html file.
Writing metadata to pdf
… in LaTeX source files
Writing metadata to a pdf file generated from LaTeX source is best done using the hyperref and hyperxmp packages. While hyperref alone supports only a limited set of metadata fields (such as title, author, creation date, modification date), hyperxmp supports many more fields (e.g. DOI, date of presentation, publication type, …) and permits the creation of custom fields.
In the .tex source file, inserting the metadata is done via the `\hypersetup{}` command:
```latex
...
\hypersetup{
  pdfauthor={Carsten Fortmann-Grote},
  pdftitle={KI und Evolution},
  pdfkeywords={artificial intelligence, genetic algorithm, evolutionary computing, novelty search, turing test, causal reasoning},
  pdfsubject={Öffentlicher Abendvortrag am MPI f. Evolutionsbiologie, Plön},
  pdfdate={D:20240130190000Z},
  pdfcreator={Emacs 28.2 (Org mode 9.6.8)},
  pdflang={German},
  pdfdoi={10.5281/ZENODO.10599178},
  pdfcontactemail={carsten.fortmann-grote@evolbio.mpg.de},
  pdfpubtype={presentation},
  pdfpublication={},
  pdflicenseurl={https://creativecommons.org/licenses/by-sa/4.0/deed.en},
  pdfcopyright={Max-Planck-Gesellschaft zur Förderung der Wissenschaften e.V.}
}
\begin{document}
...
\end{document}
```
The \hypersetup{} command appears before the \begin{document} command. Note the rich set of metadata fields like pdfdoi, pdflicenseurl, pdfcopyright, etc. Compiling the .tex file (with pdflatex) then inserts these keys and their values into the XMP part of the pdf. XMP is the ISO standard format for embedded metadata in pdf documents. For more information, please visit this wiki or the official XMP homepage.
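To check which fields actually made it into the compiled pdf, you can dump all embedded tags together with their group names, e.g. with exiftool (introduced in the next section). This is just a quick sanity check, not part of the pipeline; talk.pdf stands for any compiled presentation:

```shell
# List all metadata tags with their family-1 group names (e.g. XMP-prism, XMP-dc):
exiftool -a -G1 talk.pdf
```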
… into existing pdf documents
In case the metadata has to be written directly to the pdf document (e.g. because the LaTeX source file is no longer available or because the pdf was generated from a pptx or odp file), various solutions exist. The aforementioned XMP wiki page contains a list of XMP-capable editors. Here, we use exiftool, a perl library and command line tool to read and write metadata from/to a wide variety of file formats, including pdf.
As an example, the following command writes the DOI field to the file "presentation.pdf":
exiftool -XMP-prism:DOI=10.1001/1234 presentation.pdf
For batch processing, metadata can also be written to a csv file and consumed by exiftool through the -csv=<metadata_file.csv> argument.
So in detail, for a given collection of pdfs located in my current directory, I ran
exiftool -csv *.pdf > metadata.csv
metadata.csv might look like this (showing only the Keywords and Title columns):

| SourceFile | Keywords | Title |
|---|---|---|
| ismb2022.pdf | Pseudomonas fluorescens SBW25, knowledge graph, data integration, RDF, SPARQL, ontologies | From Genome Annotation to Knowledge Graph: The case of Pseudomonas fluorenscens SBW25 |
I then edited the resulting metadata.csv file in my favorite spreadsheet editor (fixing missing data, adding columns for missing keys, etc.):

| SourceFile | Date | Keywords | Title |
|---|---|---|---|
| ismb2022.pdf | 2022:07:13 | Pseudomonas fluorescens SBW25, knowledge graph, data integration, RDF, SPARQL, ontologies | From Genome Annotation to Knowledge Graph: The case of Pseudomonas fluorenscens SBW25 |
and then reloaded the csv file with
exiftool -csv=metadata.csv *.pdf
to update the metadata.
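To verify that the new values were indeed written, the updated tags can be read back from one of the files; again a quick check, not part of the pipeline:

```shell
# Read back selected tags from one of the updated pdfs:
exiftool -Title -Keywords ismb2022.pdf
```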
Extracting metadata to RDF
Again, we use exiftool to extract the metadata and write it into an RDF file:
exiftool -X *.pdf > metadata.rdf.xml
This creates an XML file in RDF syntax describing every pdf file in the current directory; the redirect stores it as metadata.rdf.xml, which the script below expects. This RDF can now be consumed by linked data parsers. Here, I use python's rdflib.
Parsing the metadata and converting to json
I wrote a little python script that runs a SPARQL query on my rdf metadata file and writes the results to a json document. Maybe this step is actually redundant, because jquery datatables, the javascript library I use to generate the formatted html table from a data file, can read xml directly. But since I'm not so literate in javascript, I went with this solution.
The python script is listed below:
```python
from rdflib import Graph

import os
import pandas

# Construct the SPARQL query.
query = """
prefix PDF:    <http://ns.exiftool.org/PDF/PDF/1.0/>
prefix prism:  <http://ns.exiftool.org/XMP/XMP-prism/1.0/>
prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
prefix xmpdc:  <http://ns.exiftool.org/XMP/XMP-dc/1.0/>
prefix xmppdf: <http://ns.exiftool.org/XMP/XMP-pdf/1.0/>

SELECT ?fname ?date ?title ?subject ?keywords ?doi
WHERE {
  ?fname PDF:Title ?title;
         PDF:Subject ?subject.
  optional { ?fname prism:DOI ?doi . }
  optional { ?fname xmppdf:Keywords ?keywords. }
  optional { ?fname xmpdc:Date ?date. }
}"""

# Parse the rdf document into a rdflib.Graph object.
graph = Graph().parse('metadata.rdf.xml')

# Run the query and save as pandas.DataFrame.
results = graph.query(query)
df = pandas.DataFrame(data=results)

# Set proper column names.
df.columns = [str(v) for v in results.vars]

# Strip hostname and user dependent parts of the filename path.
df['fname'] = "/" + df['fname'].apply(os.path.basename)

# Convert DOI to URL.
df['doi'] = "<a href=https://dx.doi.org/" + df['doi'] + ">" + df['doi'] + "</a>"

# Convert title to a link to the pdf (fname already carries the leading slash).
df['title'] = "<a href=https://mpievolbio-scicomp.pages.gwdg.de/blog/presentations" + df['fname'] + ">" + df['title'] + "</a>"

# Don't need the filename column anymore, write to json.
df.drop(axis=1, columns='fname').to_json('presentations.json', orient='split')
```
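For reference, orient='split' produces a json document of roughly the following shape (values abbreviated here with "..." placeholders); its top-level "data" key is what the jquery snippet in the next section points at via dataSrc: 'data':

```json
{
  "columns": ["date", "title", "subject", "keywords", "doi"],
  "index": [0],
  "data": [
    ["2022:07:13",
     "<a href=...>From Genome Annotation to Knowledge Graph: ...</a>",
     "...",
     "...",
     "<a href=...>...</a>"]
  ]
}
```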
Loading the json in the html document with jquery
The generated json file is the data source for our html table. Loading the data and producing the table is taken care of by the jquery and datatables javascript libraries. The following html loads the libraries and the data and, if opened in a browser, shows the rendered table:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
    <link rel="stylesheet" href="https://cdn.datatables.net/2.0.7/css/dataTables.dataTables.css" />
    <script src="https://cdn.datatables.net/2.0.7/js/dataTables.js"></script>
    <meta charset="UTF-8">
    <title>My Presentations</title>

    <!-- This script loads the json data and produces the table body.
         Note that the anchor '#example' refers to the table id below.
         dataSrc: 'data' relates to the 'data' key in the json document. -->
    <script>
      $(document).ready(function() {
        $('#example').DataTable({
          'ajax': {url: 'presentations.json', dataSrc: 'data'}
        });
      });
    </script>
  </head>
  <body>
    <table id='example'>
      <thead>
        <tr>
          <th>Date</th>
          <th>Title</th>
          <th>Subject</th>
          <th>Keywords</th>
          <th>DOI</th>
        </tr>
      </thead>
      <tfoot>
        <tr>
          <th>Date</th>
          <th>Title</th>
          <th>Subject</th>
          <th>Keywords</th>
          <th>DOI</th>
        </tr>
      </tfoot>
    </table>
  </body>
</html>
```
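Note that most browsers will refuse to load presentations.json via ajax from a file:// URL. For a quick local preview, serve the directory over http, e.g. with python's built-in static file server (any static file server works):

```shell
# Serve the current directory at http://localhost:8000/
python3 -m http.server 8000
```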
Putting everything together: gitlab-CI
Of course, I do not intend to run all these steps every time I want to add a presentation to the table. Therefore, I wrapped all necessary commands and scripts into a gitlab CI job and added it to the repo where all my talks' LaTeX source files are deposited. The same job also publishes the generated html table via gitlab pages. You can visit the repo at https://gitlab.gwdg.de/mpievolbio-scicomp/presentations/ and the rendered table at https://mpievolbio-scicomp.pages.gwdg.de/presentations/.
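For orientation, a minimal sketch of such a job is given below. This is not the exact configuration from the linked repo: the image, the Debian package names, the script name extract_metadata.py and the html file name index.html are assumptions, and the pdfs are assumed to be built in an earlier job or committed to the repo.

```yaml
# Hypothetical .gitlab-ci.yml sketch -- image, package and file names are assumptions.
pages:
  image: debian:bookworm-slim
  script:
    - apt-get update && apt-get install -y libimage-exiftool-perl python3-rdflib python3-pandas
    # Extract metadata from all pdfs into RDF/XML.
    - exiftool -X *.pdf > metadata.rdf.xml
    # Run the SPARQL query and write presentations.json (the python script listed above).
    - python3 extract_metadata.py
    # Publish the html page, the json data and the pdfs via gitlab pages.
    - mkdir -p public
    - cp index.html presentations.json *.pdf public/
  artifacts:
    paths:
      - public
```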