The software interface between user and machine is an oft-unsung hero at Diamond, and yet is the pivotal mechanism through which beamline data is collected and converted into results, conclusions, and publications. Here we take a look at some of the key programs running the show, and the teams behind their development.
Towards the end of 2009, twelve years on from Eric S. Raymond’s seminal essay on open source revolution, ‘The Cathedral & The Bazaar’, a tribe of Diamond hackers released their Generic Data Acquisition software (GDA). Hacker, not by way of the computer criminal trope perpetuated by news media, but in its original sense as a trailblazer breaking new ground, or according to Raymond, “an enthusiast, an artist, a tinkerer, a problem solver, an expert” (1999: xii). Being both free and open-source, and thus a resource for other synchrotron facilities to adopt, it invited the growth of a community to hack their way to improvements in a system supporting user-driven development.
When the GDA software was made open source, data archives at Diamond had already accumulated what was then an impressive 37-plus terabytes of data since the first gigabyte was recorded in 2007 - and over two-thirds of that data had been gathered during the preceding 12 months. The increase in data volumes, driven largely by the increased efficiency of the beamlines and advancing detector technologies, intensified the challenge of providing a consistent interface able to cope with the size and complexity of the data. And of course the greater volume of data brought with it the need to efficiently process, analyse and present the results back to users quickly enough to inform the design of subsequent experiments.
The collaborative values of GDA are mirrored at Diamond in the data analysis software platforms, which transform, reduce and recombine the data collected by GDA into interpretable 1D, 2D, and 3D visualisations. One of Diamond’s key data analysis packages is DAWN (Data Analysis WorkbeNch), produced as part of a joint project between Diamond, the European Synchrotron Radiation Facility (ESRF), and the European Molecular Biology Laboratory (EMBL, Grenoble), and used on a number of beamlines across these facilities. DAWN, like GDA, builds upon the Eclipse Rich Client Platform (RCP)1, and contributes its Diamond-driven data analysis plug-ins back to the wider community, enabling other users and institutions to repurpose DAWN code under the Eclipse public licence.
The flavour of collaboration permeates the world of scientific software development as large facilities and research groups increasingly pool resources for common problems, with a real drive for making interpretations of data more connected. And whilst the philosophical altruism of freedom for freedom’s sake is writ large in the coders’ collective mind-set, the democratisation of code also advances fields of knowledge, and in turn makes life easier for those scientists with huge datasets to handle.
At Diamond, most new software developments are initiated in response to emerging user needs on the beamline and grown via project collaborations traversing departments, skills, and experience levels. Ever outward-facing when it comes to development, Diamond teams are also closely involved in an array of Collaborative Computational Projects (CCPs) covering MX, tomography, electron microscopy, and small-angle scattering techniques. Funded by the UK research councils, and some with funding links from outside the UK, the CCPs support computational research on a large scale in R&D, exploitation, and distribution phases, further enabling Diamond to produce cutting-edge software based on common frameworks.
Controlling collection and processing
For those scientists applying to use Diamond, the parameters for their data can be set as early as the initial application, with beamline-specific local contacts on hand to provide software support once beamtime has been allocated. Before the user arrives, beamline staff will configure the hardware and GDA accordingly - checking the dance of endstation attachments, sample handling robots, goniometer positions, lights, cameras and other instrumentation that perform acquisition routines - always careful to report opportunities to optimise the data collection process.
The GDA software is generally the user’s first interface with the machine, accessed through the Linux terminals on the beamline. It presents an experiment-specific GUI, tailored to the needs of the user - either walking them through a predefined workflow, or in the more interactive experiments, providing a set of tools to use as required in response to the data coming from the sample.
From their desk in the control cabin (or any remote location worldwide), the user performs data collections comprising a number of scans, which move the motors and record the data. The self-describing nature of the data means that it is packaged in a way that indicates exactly how the experiment was performed, and it is increasingly used to provide live feedback from the chain of processing routines, or ‘pipelines’, to help determine what the next step of the experiment should be.
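The idea of a scan that steps a motor, records a data point, and packages the acquisition metadata alongside the raw data can be sketched in a few lines. This is a minimal illustration only - the class and function names below are hypothetical and do not reflect GDA’s actual API:

```python
# Minimal sketch of a scan loop: move a motor, record a data point,
# and store the metadata that makes the resulting file self-describing.
# Motor, Detector, and scan are illustrative names, not GDA's API.

class Motor:
    def __init__(self, name):
        self.name = name
        self.position = 0.0

    def move_to(self, position):
        self.position = position  # a real motor would drive hardware here

class Detector:
    def __init__(self, name):
        self.name = name

    def acquire(self, exposure_s):
        # Stand-in read-out; a real detector returns an image frame.
        return {"counts": 1000, "exposure_s": exposure_s}

def scan(motor, detector, start, stop, step, exposure_s):
    """Step the motor through a range, acquiring one frame per point."""
    points = []
    position = start
    while position <= stop:
        motor.move_to(position)
        frame = detector.acquire(exposure_s)
        # Record how each point was measured alongside the raw data,
        # so downstream pipelines can interpret the file on its own.
        points.append({"motor": motor.name,
                       "position": position,
                       "data": frame})
        position += step
    return {"scan_metadata": {"start": start, "stop": stop, "step": step},
            "points": points}

result = scan(Motor("theta"), Detector("det1"),
              start=0.0, stop=1.0, step=0.25, exposure_s=0.1)
print(len(result["points"]))  # 5 points: 0.0, 0.25, 0.5, 0.75, 1.0
```

Because the scan parameters travel with the points themselves, any later processing stage can reconstruct exactly how the experiment was performed without consulting the acquisition software.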
For every scan data point (for example, a diffraction image resulting from a short exposure to beam), the raw data is stored and sent down the pipeline for processing. In some cases the data goes via the ‘reducer’, another system developed at Diamond that automatically cherry-picks particular data from the read-out that is needed for the processing and analysis pipelines.
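The reducer’s cherry-picking step amounts to projecting each raw read-out down to the subset of fields the downstream pipeline actually consumes. A hedged sketch, with entirely hypothetical field names:

```python
# Illustrative "reducer" step: from each raw read-out, keep only the
# fields the downstream analysis pipeline needs. All field names here
# are hypothetical, not the actual Diamond read-out schema.

def reduce_readout(raw_frame, wanted_keys):
    """Cherry-pick the subset of a raw read-out used by the pipeline."""
    return {key: raw_frame[key] for key in wanted_keys if key in raw_frame}

raw = {
    "image": [[0] * 4 for _ in range(4)],  # stand-in for a detector image
    "exposure_s": 0.1,
    "temperature_k": 293.0,                # recorded, but not needed downstream
    "shutter_open": True,
}

reduced = reduce_readout(raw, wanted_keys=["image", "exposure_s"])
print(sorted(reduced))  # ['exposure_s', 'image']
```

The raw frame is still archived in full; only the reduced view is forwarded, keeping the processing and analysis pipelines fast.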
Automation and algorithms speed things up
The main focus of much of the analysis software at Diamond, including DAWN, is to create smart algorithms and automate pipelines that do the data analysis behind the scenes, allowing users to concentrate on sample preparation and data interpretation - the tasks that typically require their specialised knowledge. Once the files have been through the reducer, the analysis software automatically processes the data to produce the data-rich visualisations for the user to navigate, investigate, and export.
The MX village has a mature computational backdrop, and a long and rich history of well trusted and extensively tested software. But with detector technology advancing at such a pace, and data-hungry pixel array detectors now predominant, the speed at which the data needs to be collected, processed, and analysed has meant that new algorithms are required to perform these jobs.
Graeme Winter, Senior Software Scientist in the Data Analysis Group explains, “If you go back to the early detectors that used image plates and film, and even when the MX beamlines were first built with the CCD detectors, the data was all fairly noisy, the spots tended to be quite big, and the algorithms required to extract that data were specific to the technology. A dataset that may have taken you fifteen or twenty minutes to collect and measure will now only take one or two minutes. Processing the dataset manually would take longer still, so we need the automatic processing simply to keep up with the speed the data is collected by these fast read-out detectors.”
Collaborative projects supporting emerging needs
One of Diamond’s major new MX software projects is DIALS (Diffraction Integration for Advanced Light Sources). Developed in collaboration with CCP4 (based in the Research Complex) and Berkeley Lab, DIALS will analyse both synchrotron and XFEL diffraction data, meeting new challenges thrown up by smaller X-ray beam sizes, smaller crystals and the extraordinary data demands of pixel-array detectors. The VMXm beamline currently under construction at Diamond will measure diffraction data from sub-micron-sized crystals in vacuo, and is headed up by the DIALS project lead Gwyndaf Evans.
“I have a unique position in that I am developing the new VMXm beamline to produce X-ray diffraction data and ultimately protein structures from micron and submicron sized crystals, while also steering the design of the DIALS analysis software that will be so essential to making the beamline a success,” says Gwyndaf. “To be able to investigate the interface between data measurement and data analysis from both sides allows me to take an integrated view of their design.”
DIALS will become the standard for all crystallography beamlines at Diamond, and is also being adopted by the Small Molecule Single Crystal Diffraction beamline (I19) in the Materials Village. The data on I19 will be treated differently after it is processed, but the mathematics and physics are identical. The use of synchrotron radiation for single crystal diffraction is necessary when structures are too complex, or crystals are of insufficient quality or size to allow structure determination from the relatively low intensity of a laboratory X-ray diffractometer. DIALS will be able to support the high-speed data collection and high flux required on I19 as well as on the MX beamlines, helping to enable full exploitation of the beam in the I19 endstation upgrade, thereby increasing throughput dramatically.
The primary data centre - where all the data from all the experiments goes
Keeping up with data demands
Diamond’s growing capacity to process its ever increasing data volumes can be seen in the number of cores used for processing, which has risen from 96 in 2008 to 1,120 in 2014 - and, as you’d expect, the 37 terabytes seen in 2009 is now officially small-fry. Diamond has recently been collecting around 20-30 TB per day, with a cumulative volume of 2.7 petabytes in the archive. Ongoing improvements in data collection, faster processing, and the more complex analysis tools available to Diamond’s users mean not only an increased number of experiments, but a wider range of experimental methods - as seen with the long duration experiment on I11 - and, of course, a greater number and variety of publications emerging from the beamtime.
Exploiting the increased range, volume, and quality of data by developing user-centred software is one of the major strengths of the data acquisition and analysis groups, and is synonymous with the culture of collaboration that runs through Diamond, from the international computational research projects to the informal coffee-room hacker meet-ups. It is this grassroots approach to co-creation that so typifies the hacker mentality, and gives the scientific community the chance not just to be passive consumers of their experimental data, but to be co-conspirators in tailoring the software to their needs and advancing the world of synchrotron data beyond.
Raymond ES. (1999) The Cathedral & The Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. Sebastopol (CA): O’Reilly Media Inc.
1. The Eclipse RCP is a toolkit for building user interfaces. Diamond is a founding member of the Eclipse Science Working Group alongside IBM and the US Department of Energy’s Oak Ridge National Laboratory, collaborating on producing technologies used for interdisciplinary analysis of scientific data.
Diamond Light Source is the UK's national synchrotron science facility, located at the Harwell Science and Innovation Campus in Oxfordshire.
Copyright © 2017 Diamond Light Source
Diamond Light Source Ltd
Harwell Science & Innovation Campus