Using globus-online for transferring large (>100 GB) datasets over the network

Introduction

Globus Online is a web service for transferring data over the internet. It is particularly suited to large files in the GB to TB range. The largest dataset transferred with Globus so far is a 2.9 PB file from Argonne National Lab.

I just completed a workflow where I used Globus to transfer a ~500 GB dataset from the DRACO HPC cluster at the MPCDF to a shared Samba drive at the MPI for Evolutionary Biology. Here’s a short writeup of the steps involved:

Upload from MPCDF to DataHub

$> echo 'put -r SOURCEDIR/' | sftp -b - datahub.mpcdf.mpg.de:/data/grotec/TARGETDIR/.

This command recursively copies the source directory into the target directory on DataHub.
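To verify that the upload completed, the same batch-mode trick works for listing the remote directory (host and paths as above):

$> echo 'ls -l SOURCEDIR/' | sftp -b - datahub.mpcdf.mpg.de:/data/grotec/TARGETDIR/.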

Globus Online has a web portal at https://www.globus.org/. MPG members can use their LDAP/ActiveDirectory credentials to log in. After the first login, the dashboard is empty since no endpoints are connected yet.

After connecting the DataHub endpoint (datahub.mpcdf.mpg.de), I can list the uploaded files in the file browser.
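Once the globus-cli is installed (see below), the same listing also works from the command line. The endpoint ID placeholder stands for whatever ID the DataHub endpoint reports; paths as in the upload step:

$> globus ls <DATAHUB ENDPOINT ID>:/data/grotec/TARGETDIR/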

Transfer from DataHub to local storage

To get the data from DataHub into local storage, I set up a personal globus endpoint on wallace and initiate the transfer from DataHub to my personal endpoint through the globus online web interface.

Create and run a personal endpoint

Install the ‘globus-cli’ Python package

I did this through conda:

$> conda create -n globus
$> conda activate globus
$> conda install -c conda-forge globus-cli

In between, I had to install tcllib from source into the conda environment prefix $CONDA_PREFIX. Depending on your system, this may not be necessary.

$> wget https://core.tcl-lang.org/tcllib/uv/tcllib-1.19.tar.gz
$> tar xzvf tcllib-1.19.tar.gz
$> cd tcllib-1.19
$> ./configure --prefix=$CONDA_PREFIX
$> make
$> make install
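A quick way to check that the CLI is installed and on the path:

$> globus version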

Set up the globus environment

First, log in to the Globus network:

$> globus login
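This opens a browser window (or prints a URL to paste into a browser) for authentication. To confirm that the login succeeded:

$> globus whoami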

Then, create a new (personal) endpoint:

$> globus endpoint create --personal <ENDPOINT NAME>

Record the endpoint ID and key for later use.
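If you misplace the ID, you can recover it later by searching your own endpoints:

$> globus endpoint search --filter-scope my-endpoints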

Find more detailed information at https://docs.globus.org/cli/.

Start the endpoint

To actually run the endpoint, another tool, globusconnect, is needed. Instructions for installing and running it are at https://docs.globus.org/how-to/globus-connect-personal-linux/.

After downloading and unpacking globusconnect, cd into the globusconnect directory and run

$> globusconnect -setup <ENDPOINT KEY>

Follow the on-screen instructions to authorize the connection.

To start the endpoint (on wallace, you may want to run this in a screen session):

$> globusconnect -start &
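The tool also has a -status flag, which should report whether the endpoint is up and connected to the Globus network:

$> globusconnect -status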

Transfer files from DataHub to personal endpoint

In the web interface, under “Collection”, select the source endpoint in the left panel and the destination endpoint in the right panel. Navigate to the directories containing the source files (or directories) and to the target directory. Select the files/directories to transfer and click the “Start” button.
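Alternatively, the transfer can be submitted from the command line with the globus-cli; the endpoint IDs and the destination path below are placeholders:

$> globus transfer <DATAHUB ENDPOINT ID>:/data/grotec/TARGETDIR/ <PERSONAL ENDPOINT ID>:/path/on/wallace/ --recursive --label "datahub-to-wallace"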

Monitor progress of transfer

Follow the “Activities” icon in the left toolbar to see a log and some statistics on submitted, active, completed, and failed transfers.
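The same information is available from the globus-cli:

$> globus task list
$> globus task show <TASK ID>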