Getting from A to B

How High-speed Data Transfer Could Power Science

Dr. Richard Boardman from the µ-VIS Centre for Computed Tomography (an X-Ray Imaging Centre providing support to the Engineering, Biomedical, Environmental and Archaeological Sciences at the University of Southampton) discusses how improving institutional data transfer speeds could lead to more efficient research. The pilot project set up by the University of Southampton, Diamond Light Source, the Science and Engineering South Consortium and Jisc was recently showcased in a workshop event at the University of Southampton’s Boldrewood campus.

What’s your experience of working with large datasets?

We’re used to handling fairly large datasets at our lab; the radiographs we routinely generate with our equipment are around 8 megabytes in size, and we can capture half a dozen of these per second, more or less continuously, for each machine we operate. Once we have finished rebuilding these radiographs into volumetric data ready for analysis, we end up with around 50 gigabytes of data for perhaps as little as five minutes of experiment time.
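As a rough illustration of the figures above, the raw acquisition rate can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for the example, decimal units (1 GB = 1000 MB) are assumed, and it counts raw radiographs only, whereas the 50 gigabyte figure quoted above is the total once the volumetric data has been rebuilt.

```python
# Figures quoted in the interview; decimal units are an assumption.
RADIOGRAPH_MB = 8        # size of one radiograph
FRAMES_PER_SECOND = 6    # "half a dozen of these per second"


def raw_rate_mb_per_s(frame_mb=RADIOGRAPH_MB, fps=FRAMES_PER_SECOND):
    """Raw data rate from one machine, in megabytes per second."""
    return frame_mb * fps


def raw_volume_gb(seconds, frame_mb=RADIOGRAPH_MB, fps=FRAMES_PER_SECOND):
    """Raw radiograph volume, in gigabytes, after `seconds` of acquisition."""
    return raw_rate_mb_per_s(frame_mb, fps) * seconds / 1000


if __name__ == "__main__":
    print(raw_rate_mb_per_s(), "MB/s per machine")
    print(raw_volume_gb(5 * 60), "GB of raw radiographs in five minutes")
    print(raw_volume_gb(24 * 3600), "GB if one machine ran for a full day")
```

On these assumptions a single machine produces about 48 MB of radiographs per second, or roughly 14 GB in five minutes, before any reconstruction.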

In house, we have a streamlined procedure for looking after these datasets, starting with the initial experimental enquiry, through the sample preparation and mounting, acquisition, post processing and analysis, and finally down to publication, archiving and sharing.

What about off-site experiments?

When we work off-site at a synchrotron or neutron source, we have to find a way to get the data back to our systems for the processing, and this has typically been more problematic.

Although synchrotron light sources, such as the Diamond Light Source at the Harwell Science and Innovation Campus, provide brilliant, high-flux photon beams leading to fast capture times and consequently high throughput, the amount of generated data can quickly overwhelm unprepared researchers. Additionally, the researchers working on these beamlines do shift work to ensure the beamtime is spent most efficiently, so they are aiming to acquire data for 24 hours per day.

“…the amount of generated data can quickly overwhelm unprepared researchers.”

All beamlines, no matter what the light source, provide some degree of capability for transferring data on to hard disk drives, and so mechanisms for working with these became the de facto standard; they are relatively cheap, relatively reliable and readily available. Experimenters will synchronise data on to a collection of these drives as it becomes available.

How are you moving the data from these facilities back to the university?

Right now, once the experiments have concluded, the researchers will return with a collection of hard disk drives, which are ingested into the local data store and then prepared for analysis. Working on the hard disk directly is rarely advisable – aside from the performance considerations for heavyweight dataset analysis, the hard disk is the only effective copy of the data at this point (of course, beamlines may keep copies of these datasets but policies vary between sites). This stage is time consuming and means there is a lead time from when the researchers return to when they can start processing their data.

What happens if something goes wrong?

If there is a problem with the data, then sometimes the researchers will have to return to the site and either reprocess the radiographic data, or reacquire it if conditions were suboptimal. Some beamlines have remote access, which can help in certain situations where reacquisition is not required, although these capabilities vary from site to site.

Finally, there is the risk of data loss on these copies – not just that a disk itself may fail, and the data would need retransferring (relatively rare but possible), but more that a lost disk could cause problems if an unauthorised third party were to access it.

“If there is a problem with the data, then sometimes the researchers will have to return to the site and either reprocess the radiographic data, or reacquire it if conditions were suboptimal”

How would a transfer system that moves real-time data back to the university help?

This would mean that the analysis step could commence as soon as these researchers return. Not only that, but during the acquisition other people could look at the data and verify that they are of a high enough quality; experienced beamline scientists might be able to spot a tired scintillator chip, whereas their junior colleagues at 3am on a Sunday session might miss it.

Similarly, far more people can work on the dataset at the home institution than can be sensibly accommodated at the beamline during acquisition. Preliminary results can guide experimentation and live or nearly-live feedback might result in a far more productive use of this expensive beamtime.

What are the perceived drawbacks to this kind of setup?

The general perception with high throughput datasets such as these is that the network connection will be the bottleneck. After all, the average computer might have a 1Gb/sec campus connection, shared with a number of other people, and so perhaps it is not an unreasonable assumption.

So you set up a pilot project to see if this was possible. Where did you start?

At first we set out to see if using the network to transfer the data would be viable, and if a workflow could be established that would provide not only parity with the existing “bag of disks” setup, but hopefully exceed it.

As the Diamond Light Source already had a Globus endpoint set up, accessible via their own Federal ID, we set up a test endpoint in our own lab. This was a much more straightforward affair than expected, and within just a few minutes we were ready to transfer some data… somewhere.

Where did you send it to?

The “somewhere” ended up being a high-speed datastore awaiting commissioning for another project, with enough disk space and performance to ensure we wouldn’t be bottlenecking our tests. Via an SMB connection, sustained write performance from the development workstation hosting our Globus endpoint was in excess of 800MB/sec – contrast this with a typical external hard disk speed of the order of 100MB/sec.

That’s quite a jump in speed!

Yes, initial tests were encouraging. A good data rate over the 1Gb/sec connection suggested that there was more bandwidth to be had from Diamond if we could get some breathing room to the development workstation. Once a 10Gb/sec connection to the campus edge was installed, we found we could routinely match transfer rates from high-end internal hard disks (~200MB/sec), without the physical and temporal hassle of loading them up with data, and decanting it at the far end.
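To put those rates in perspective, the short Python sketch below estimates how long a dataset would take to move at each of the speeds mentioned in the interview. The 20 terabyte size is an illustrative choice, the function name is invented for the example, and the 125MB/sec figure is simply the theoretical ceiling of a 1Gb/sec link rather than anything measured in the pilot. Decimal units and perfectly sustained throughput are assumed, so real transfers would take longer.

```python
def transfer_hours(dataset_tb, rate_mb_per_s):
    """Hours needed to move `dataset_tb` terabytes at a sustained rate.

    Assumes decimal units (1 TB = 1,000,000 MB) and no protocol overhead.
    """
    total_mb = dataset_tb * 1_000_000
    return total_mb / rate_mb_per_s / 3600


# Rates in MB/sec; all but the second are quoted in the interview.
RATES = {
    "typical external hard disk": 100,
    "1Gb/sec link (theoretical ceiling)": 125,
    "network transfer over the 10Gb/sec link": 200,
    "datastore write via SMB": 800,
}

if __name__ == "__main__":
    for label, rate in RATES.items():
        print(f"{label}: {transfer_hours(20, rate):.1f} hours for 20 TB")
```

At 100MB/sec a 20 terabyte dataset needs more than two days of continuous copying; at 200MB/sec it needs a little over one.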

“Preliminary results can guide experimentation and live or nearly-live feedback might result in a far more productive use of this expensive beamtime.”

So the pilot has generally been a success?

Being able to see and work on the data almost as soon as it is generated is certainly possible with the Globus sync option, and the advantages of a familiar, unintimidating web interface to kick off what might be a daunting transfer should not be underestimated.

So, the transfer rates between the sites were good, and the system itself appears robust. Certainly, for near-live transfers, allowing a team of experienced researchers to get a look at the data coming over, Globus looks like a great solution.

Were there any bumps in the road?

Through the course of the testing, a couple of hiccups were uncovered. One small file – a script – had a permission set that meant the file could not be read. Whilst some systems (notably rsync) would have skipped this file and reported the error in the summary at the end, transferring the rest of the data in the meantime, Globus did not. Instead, it retried the file transfer, failed, retried and failed again, and kept doing so until someone intervened and either cancelled the transfer or (in our case) fixed the file permission on the remote end. However, by the time we noticed this on an overnight transfer, we had lost several hours of possible transfer time.
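The skip-and-report behaviour described here can be sketched in a few lines of Python. This is an illustration of the general idea only, not how rsync or Globus is implemented, and the function name copy_skipping_failures is invented for the example. Any file that cannot be read or copied is noted and passed over, so the rest of the batch still goes through and the problems are listed at the end.

```python
import shutil
from pathlib import Path


def copy_skipping_failures(sources, dest_dir):
    """Copy each file into dest_dir, skipping any that cannot be copied.

    Returns (copied, failed): `copied` is a list of source paths and
    `failed` is a list of (source path, error message) pairs.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, failed = [], []
    for src in map(Path, sources):
        try:
            shutil.copy2(src, dest / src.name)
            copied.append(src)
        except OSError as exc:
            # Covers unreadable (PermissionError) and missing files alike:
            # record the problem and carry on with the next file.
            failed.append((src, str(exc)))
    return copied, failed


def summarise(copied, failed):
    """Build an end-of-run summary listing any files that were skipped."""
    lines = [f"{len(copied)} copied, {len(failed)} skipped"]
    lines += [f"  skipped {src}: {reason}" for src, reason in failed]
    return "\n".join(lines)
```

With this approach, a single unreadable script costs one line in the summary rather than several hours of an overnight transfer.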

The second issue we ran into was one of disk space checking. With a large transfer (20 terabytes) pending, we ensured we had enough drive space (25 terabytes or so) to store the entire dataset. However, during the transfer we noticed that the space ran out entirely. It turned out that the dataset’s owner had underestimated the size of the dataset… by 50 terabytes.

How could you fix this for future transfers?

Preferable behaviour might be a warning, whilst the transfer is starting, that there will not be enough disk space to successfully complete (and by how much). That way, we could optionally cancel it or provision more space during the transfer (or ignore it and get what we get). This would provide an advantage over existing systems that either ignore it and hit a wall, or won’t even start, and force you to rectify the problem before you can start transferring. Keeping options open to the user would be great.
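A pre-flight check along these lines might look like the Python sketch below. It is not a Globus feature; the function names are invented for the example, and it assumes the dataset can be walked as a local directory tree, whereas in a real cross-site transfer the size would have to be measured at the sending end. It warns and reports the shortfall rather than refusing to start, leaving the decision to the user.

```python
import os
import shutil


def dataset_size_bytes(root):
    """Total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # do not count symlink targets
                total += os.path.getsize(path)
    return total


def space_shortfall(required_bytes, free_bytes):
    """Bytes missing at the destination, or 0 if there is enough room."""
    return max(0, required_bytes - free_bytes)


def preflight_warning(source_root, dest_path):
    """Return a warning string if the destination is too small, else None."""
    required = dataset_size_bytes(source_root)
    free = shutil.disk_usage(dest_path).free
    short = space_shortfall(required, free)
    if short:
        return f"Warning: destination is short by {short / 1e12:.2f} TB"
    return None
```

In the case described above, a dataset estimated at 20 terabytes that was really 70 terabytes, sent to a 25 terabyte store, would have produced a shortfall of 45 terabytes before the first file moved.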

Sounds like it’s been a really valuable setup – what’s next?

Setting up and using Globus to transfer datasets in the tens of terabytes was a fairly painless process, and, caveats mentioned aside, it was reliable too. We hope that soon we’ll be able to encourage more of our synchrotron partners to use Globus, and to use it routinely not only for improving our data transfer arrangements, but also for getting more out of our beamtime.
