Introduction

GridFTP has two main advantages over normal FTP:

  • Parallel transfer for increased speed
  • Strong certificate based authentication

Setup

A couple of steps are necessary to obtain a certificate. Use the Firefox browser; the certificate does not install properly in Chrome.

  • Set up your UNIMAAS account to be able to obtain a certificate
    • Request page
    • Select UI account & Grid certificate
    • Use as motivation: I would like to obtain a grid certificate for GridFTP usage. I do not plan to use the LSG UI
    • You should receive a confirmation within a day or two that your account has been enabled

 

  • Obtain a Grid Premium certificate from DigiCert
    • DigiCert self service portal
    • Type Universiteit Maastricht (not Maastricht University) as identity provider, and log in using your UNIMAAS credentials
    • Select Grid Premium

 

  • Download the p12 file from the browser (in the Firefox settings)

  • Convert p12 file to key and certificate:
openssl pkcs12 -nocerts -in digicert.p12 -out userkey.pem
openssl pkcs12 -clcerts -nokeys -in digicert.p12 -out usercert.pem
  • Place them in ~/.globus/.
  • Authorize your certificate on the Maastricht University GridFTP server by sending your DN to datahub@mumc.nl.


    # Printing your DN
    grid-cert-info -subject -file ~/.globus/usercert.pem
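
The Globus tools generally refuse to use a private key that other users can read, so it can help to set the file permissions when placing the two .pem files in ~/.globus/. The following is a minimal sketch, assuming both files from the conversion step sit in one directory. The function name install_grid_credentials is hypothetical and not part of any Globus tool.

```shell
# install_grid_credentials SRC_DIR [DEST_DIR]
# Hypothetical helper: copies usercert.pem and userkey.pem into DEST_DIR
# (default: ~/.globus) and restricts the permissions on the private key.
install_grid_credentials() {
    cred_src=$1
    cred_dest=${2:-$HOME/.globus}

    mkdir -p "$cred_dest" || return 1
    # -f lets a rerun replace an existing read-only key file.
    cp -f "$cred_src/usercert.pem" "$cred_src/userkey.pem" "$cred_dest/" || return 1

    chmod 644 "$cred_dest/usercert.pem"   # certificate may be world-readable
    chmod 400 "$cred_dest/userkey.pem"    # private key: owner read-only
}
```

For example, install_grid_credentials . would install the files from the current directory into ~/.globus/.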

Usage

Start by creating a proxy certificate from your certificate:

grid-proxy-init
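
A proxy certificate is only valid for a limited time, so a long-running script may want to check the remaining lifetime before starting a transfer. The following is a sketch under assumptions: the function name proxy_needs_renewal and the one-hour default threshold are hypothetical, and the commented usage assumes that grid-proxy-info -timeleft prints the remaining lifetime in seconds.

```shell
# proxy_needs_renewal SECONDS_LEFT [MIN_SECONDS]
# Succeeds (exit status 0) when the remaining proxy lifetime is below
# the threshold. MIN_SECONDS defaults to 3600, an arbitrary choice.
proxy_needs_renewal() {
    seconds_left=$1
    min_seconds=${2:-3600}
    [ "$seconds_left" -lt "$min_seconds" ]
}

# Intended use (commented out because it needs the Globus tools):
#   left=$(grid-proxy-info -timeleft 2>/dev/null) || left=0
#   if proxy_needs_renewal "$left"; then
#       grid-proxy-init
#   fi
```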

The basic syntax of the command is:

globus-url-copy <sourceURL> <destinationURL>

where <sourceURL> and <destinationURL> can be one of the following:

  • file://<absolute path to your file>, if you are referring to a file local on the machine where you call this command;
  • gsiftp://<remote machine><absolute path to your file>, if you are referring to a file on a remote workstation.
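
Because the file:// form expects an absolute path, a small helper can turn a relative local path into such a URL. This is a minimal sketch; the function name to_file_url is hypothetical and not part of globus-url-copy.

```shell
# to_file_url PATH
# Prints PATH as a file:// URL. Relative paths are resolved against the
# current working directory; a trailing slash on PATH is preserved.
to_file_url() {
    case $1 in
        /*) printf 'file://%s\n' "$1" ;;
        # ${PWD%/} avoids a double slash when the working directory is "/".
        *)  printf 'file://%s/%s\n' "${PWD%/}" "$1" ;;
    esac
}
```

For example, to_file_url /home/johndoe/foo prints file:///home/johndoe/foo.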

After the command globus-url-copy you can specify some options, for example -vb to report the number of bytes transferred and the transfer rate while the transfer is running. To achieve maximal throughput you should specify the -p 10 option; this uses 10 parallel streams for the data transfer. However, it only works if the destinationURL is a gsiftp://-type URL.

Assuming that you want to copy a file named foo from your home folder on the local machine to serverscratch, run:

johndoe@pico$ globus-url-copy -vb -p 10 ~/foo gsiftp://gridftp.maastrichtuniversity.nl/mnt/serverscratch/johndoe/foo

To copy a directory from rawdata to the local directory particles:

johndoe@judac$ globus-url-copy -cd -r -p 10 -vb gsiftp://gridftp.maastrichtuniversity.nl/mnt/rawdata/ARCTICA/20151216_rbg.ravelli/ particles/

Note the trailing slash on both directories, and the -cd flag, which creates the destination directory if it does not exist.

To copy to the current directory instead:

johndoe@judac$ globus-url-copy -cd -r -p 10 -vb gsiftp://gridftp.maastrichtuniversity.nl/mnt/rawdata/ARCTICA/20151216_rbg.ravelli/ ./
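
Since a recursive copy needs a trailing slash on both the source and the destination, a wrapper can add it when it is missing. The sketch below only prints the resulting command so it can be inspected before it is run. The function names ensure_trailing_slash and dir_copy_command are hypothetical; the flags are the ones used in the examples above.

```shell
# ensure_trailing_slash PATH_OR_URL
# Prints the argument with exactly one "/" appended if it has none.
ensure_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# dir_copy_command SOURCE DESTINATION
# Prints (does not execute) a recursive globus-url-copy command line.
dir_copy_command() {
    printf 'globus-url-copy -cd -r -p 10 -vb %s %s\n' \
        "$(ensure_trailing_slash "$1")" \
        "$(ensure_trailing_slash "$2")"
}
```

Calling dir_copy_command with a source directory URL and particles as the destination prints a command line in which both arguments end in a slash.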

 

Usage of Globus Online

Globus Online provides a nice interface for GridFTP. Create an account and add an endpoint.

To authenticate you need to use a MyProxy server. The surfSARA MyProxy server (px.grid.sara.nl) is suitable:

myproxy-init -s px.grid.sara.nl

This creates a proxy of your certificate on the MyProxy server. It first asks for your certificate password and then asks you to choose a new password.

When using the MyProxy server in an endpoint, you supply your username and this self-chosen password. Globus Online can then use your certificate (proxy).

Connecting to the surfSARA GridFTP server for the Life Science Grid

To connect to the LSG storage using GridFTP you also need to specify your VOMS (Virtual Organisation Membership Service) membership.

Become a VOMS member: https://voms.grid.sara.nl:8443/

voms-proxy-init --voms lsgrid # or: myproxy-init --voms lsgrid when you want to use MyProxy

~/.voms/vomses/lsgrid:

"lsgrid" "voms.grid.sara.nl" "30018" "/O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl" "lsgrid"

The surfSARA GridFTP server is located at gsiftp://fly1.grid.sara.nl:2811

The lsgrid data is located at /pnfs/grid.sara.nl/data/lsgrid/
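
The server address and the data root above can be combined into full gsiftp:// URLs for use with globus-url-copy. This is a minimal sketch; the function name lsgrid_url and the example path johndoe/foo are hypothetical.

```shell
# Server and data root as given in the text above.
LSG_SERVER="gsiftp://fly1.grid.sara.nl:2811"
LSG_ROOT="/pnfs/grid.sara.nl/data/lsgrid"

# lsgrid_url RELATIVE_PATH
# Prints the full gsiftp:// URL for a path below the lsgrid data root.
lsgrid_url() {
    # Strip one leading slash so the parts join with exactly one "/".
    printf '%s%s/%s\n' "$LSG_SERVER" "$LSG_ROOT" "${1#/}"
}

# Intended use (commented out because it needs a valid VOMS proxy):
#   globus-url-copy -vb -p 10 file:///home/johndoe/foo "$(lsgrid_url johndoe/foo)"
```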

Advanced usage and performance enhancement

Resume transfers

For a massive transfer, it is good practice to add to globus-url-copy the option -rst (for restarting interrupted operations) together with -df <filename>. The file specified with the -df flag is the so-called dump file, which contains the URLs that still have to be transferred. If globus-url-copy returns to the prompt before finishing, entering exactly the same command again, taking care to preserve both options, resumes the transfer starting from the first incomplete file. On such a rerun, any source file or path on the command line is ignored and globus-url-copy reads the content of the dump file instead.

However, the following restrictions apply:

  • the two options, -rst and -df <filename>, must be present from the first globus-url-copy call, otherwise the dump file cannot be populated;
  • globus-url-copy, when restarted, cannot resume a file transfer at an arbitrary point, but starts moving incomplete file(s) from the beginning.
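
Because resuming means re-entering exactly the same command, the rerun can be automated with a small retry loop. The following is a sketch, assuming that the transfer command returns a non-zero exit status when it fails. The function name retry_transfer and the dump file name transfer.dump are hypothetical.

```shell
# retry_transfer MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND unchanged until it succeeds or MAX_ATTEMPTS is reached,
# waiting DELAY_SECONDS between attempts.
retry_transfer() {
    max_attempts=$1
    delay_seconds=$2
    shift 2

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
    return 1
}

# Intended use (commented out because it needs the Globus tools):
#   retry_transfer 5 60 globus-url-copy -rst -df transfer.dump \
#       -cd -r -p 10 -vb <sourceURL> <destinationURL>
```

Keeping the retry logic outside the command leaves the -rst and -df options identical on every attempt, which is what the dump file mechanism requires.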

Another approach to resuming a transfer is the -sync flag. In this case, globus-url-copy performs a check before moving a file or a folder. The default behaviour is to move the source only if its timestamp is more recent than that of the destination. The user can choose among different synchronisation mechanisms by setting the numerical value of the option -sync-level, according to the following mapping (taken from the man page of globus-url-copy):

  • 0: transfer the source only if it does not exist on the destination machine;
  • 1: the source is copied if not present on the target machine or if the size does not match;
  • 2 (default if -sync-level not specified): move the source if it is newer than what is there at the destination side (and of course if it does not exist at all);
  • 3: compute the checksum of the source and of the destination, performing the transfer only if the two are different.
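
Since only the levels 0 to 3 are defined, a script can validate the level before building the command line. This is a minimal sketch; the function name sync_flags is hypothetical.

```shell
# sync_flags LEVEL
# Prints the synchronisation options for a valid level (0 to 3) and
# fails with an error message for anything else.
sync_flags() {
    case $1 in
        0|1|2|3)
            printf '%s %s\n' '-sync -sync-level' "$1"
            ;;
        *)
            echo "sync level must be 0, 1, 2 or 3" >&2
            return 1
            ;;
    esac
}

# Intended use (commented out because it needs the Globus tools);
# $(sync_flags 1) is left unquoted on purpose so it splits into options:
#   globus-url-copy $(sync_flags 1) -cd -r -vb <sourceURL> <destinationURL>
```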

Performance

Network performance can be influenced and enhanced by tuning the mechanisms of parallelism and concurrency. In particular:

  • parallelism is the number of streams used to move each single file. The flag to use is -p <number of parallel streams>. Reasonable numbers could be 4, 8, 12, and 16. Usually, beyond 16 performance reaches a plateau (or even gets worse again). You have to experiment yourself to find the optimum value, as this value depends on many factors;
  • concurrency is the number of GridFTP servers that are started on the target machine. In other words, it is the number of files that are transferred simultaneously. The related option on the command line is -cc <number of parallel files>. Reasonable numbers could be 2 or 4.

Both mechanisms can be combined to achieve optimal results. However, it is important to remember that there is no single recipe for all cases: the final result depends on the network bandwidth and on the machines (CPU power, memory) involved in the operation. As a rule of thumb, if there are a few big files, then it is better to use only parallelism with a high number of streams, e.g., 16. If the transfer is made up of many small files, then it could be beneficial to introduce concurrency, e.g., 2 or 4, possibly combined with a low level of parallelism, e.g., 2 or 4. Please note that, for example, with -cc 8 and -p 4, globus-url-copy moves 8 files at a time and employs 4 streams for each one, leading to a total of 8 * 4 = 32 streams from the client to the destination. The client and the server should have enough resources in terms of computation and memory buffers to sustain that. The advice is to experiment with the parameters (for instance using -vb to display the current and average speed) and find the optimal compromise.
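
The stream arithmetic above can be checked in a script before a transfer is started. This is a sketch; the function names total_streams and within_stream_budget are hypothetical, and the budget is a limit you choose yourself, not a value imposed by GridFTP.

```shell
# total_streams CONCURRENCY PARALLELISM
# Prints the total number of streams opened for -cc CONCURRENCY -p PARALLELISM.
total_streams() {
    echo $(( $1 * $2 ))
}

# within_stream_budget CONCURRENCY PARALLELISM BUDGET
# Succeeds when the total number of streams does not exceed BUDGET.
within_stream_budget() {
    [ "$(total_streams "$1" "$2")" -le "$3" ]
}
```

For -cc 8 and -p 4, total_streams 8 4 prints 32, matching the example in the text.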

For the sake of completeness, we mention here that in case both of the following conditions apply:

  • server to server transfer, i.e., both the source and the destination are specified as gsiftp://
  • the destination server is made up of multiple different nodes (e.g., login nodes of a supercomputer) sharing the same filesystem

then the most efficient way to perform a transfer is to use so-called striping, specifying the option -stripe instead of -cc and -p. It is not necessary to enter the number of stripes; this is communicated by the target GridFTP server. Please verify with the administrators of the resources whether striping has been configured and can be used.
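
The choice between striping and the -cc/-p combination can be sketched as a small decision function. It can only test the first condition (both URLs are gsiftp://). Whether the destination supports striping has to be confirmed by the administrators and is passed in by hand. The function names are hypothetical, and the fallback values -cc 2 -p 4 are illustrative picks from the ranges mentioned earlier, not a recommendation.

```shell
# is_gsiftp_url URL
# Succeeds when URL starts with gsiftp://.
is_gsiftp_url() {
    case $1 in
        gsiftp://*) return 0 ;;
        *)          return 1 ;;
    esac
}

# choose_transfer_flags SOURCE_URL DEST_URL STRIPING_CONFIRMED
# STRIPING_CONFIRMED must be "yes" once the administrators have confirmed
# that striping is configured on the destination server.
choose_transfer_flags() {
    if is_gsiftp_url "$1" && is_gsiftp_url "$2" && [ "$3" = "yes" ]; then
        echo "-stripe"
    else
        echo "-cc 2 -p 4"
    fi
}
```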

 
