ESnet Workshop Report Outlines Data Management Needs in Metagenomics, Precision Medicine

William Barnett, the chief research informatics officer for the Indiana Clinical and Translational Sciences Institute (CTSI) and the Regenstrief Institute at Indiana University, discusses the promise of precision medicine. Photo by Predrag, Indiana University.

February 22, 2018

Like most areas of research, the bioinformatics sciences community is facing an unprecedented explosion in the size and number of data sets being created, spurred largely by the decreasing cost of genome sequencing technology. As a result, there is a critical need for more effective tools for data management, analysis and access.

Adding to the complexity, two major fields in bioinformatics – precision medicine and metagenomics – have unique data challenges and needs. To help address the situation, the Department of Energy’s Energy Sciences Network (ESnet) organized a workshop in 2016 at Lawrence Berkeley National Laboratory. Held as part of a series of CrossConnects workshops, the two-day meeting brought together scientists from metagenomics and precision medicine, along with experts in computing and networking.

A report outlining the findings and recommendations from the workshop was published Dec. 19, 2017, in Standards in Genomic Sciences. The report reflected the input of 59 attendees from 39 organizations.

One driver for publishing the report was the realization that, although each of the two focus areas has unique requirements, their needs overlap in several areas, said ESnet's Kate Mace, lead author of the report. Of these shared needs, data management loomed largest.

“It was clear that many current data management solutions are in the early stages and will require additional and continued community effort in order to store, structure, and share data in efficient ways for collaborators around the world,” the authors wrote. “It is important to realize that recent advances in technology are enabling discoveries at a rate that biological scientists and technologists are racing to keep up with.”


Larry Smarr, founding director of the California Institute for Telecommunications and Information Technology (Calit2), describes "Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Supercomputers and Supernetworks." Photo by Brooklin Gore, ESnet.

One characteristic that separates the bioinformatics sciences from other areas is the prevalence of DNA/RNA sequence data generation in many scattered locations, compared to high energy physics, where massive datasets are generated by a few large facilities and then widely shared and analyzed.

“These aspects, combined with privacy requirements around human health information, must be considered when devising sustainable data management and mobility solutions for the various branches of the bioinformatics community,” the authors noted.


The overarching issue facing both the metagenomics and precision medicine fields is data growth: how best to manage it, store it and share it. Both fields are wholly dependent on computation, and breakthroughs are accelerated through collaboration. Here are the key findings of the workshop:

  • There is a significant need for a series of easily accessible datastores that adhere to community-determined quality standards to hold raw genomics data and metadata.
  • Two main barriers to open data exchange were discussed at the workshop: the difficulty of complying, on current infrastructures, with policies around PHI (Protected Health Information) and with HIPAA (the Health Insurance Portability and Accountability Act) and FISMA (the Federal Information Security Management Act) guidelines; and “data hoarding” between the dates of data collection and publication.
  • The “Medical Science DMZ” is a design pattern that can support workflows containing PHI.
  • The Joint Genome Institute (JGI) Archive and Metadata Organizer (JAMO), launched in 2013, was presented in workshop discussions as a good model for data curation and metadata tagging for dissemination. It was suggested that JAMO developers share best practices to help others in the community implement this model.
  • New software and data portals are needed to efficiently manage and serve bioinformatics data. There is an increasing need for sustainable, well-maintained, quality software and portals built for the management, access and mobility of bioinformatics data. Solutions will need to involve coordination between scientific and clinical researchers, private companies, funding agencies, infrastructure facilities, and educators.


The paper concluded with four recommendations:

  • Scientists and researchers need to be able to easily locate and verify the quality and compliance of (meta)genomic raw data and metadata. A large-scale solution toward data storage and standards is needed in order to accelerate scientific discovery in bioinformatics (see finding 1).
  • Use cases for the development of bio-specific workflows at scale should be organized and documented, drawing on the Pacific Research Platform (PRP) participants and partners (see finding 2).
  • DOE institutions, such as ESnet in partnership with national labs, and other members of the community should develop a bioinformatics-specific community resource of best practices and recommendations for policy-compliant infrastructure models, data transfers, system tuning, data analysis, data standards, etc.
  • At the conclusion of the workshop and through post-workshop attendee feedback, it was clear that data movement and collaboration are critical to the field, but that issues around data privacy, policy compliance, and the lack of a universally accepted data structure are, in some cases, currently a larger hindrance to data movement than physical infrastructure. Community-wide efforts to address these issues are needed in parallel with ongoing national initiatives that are shaping the future of computational and network facilities for bioinformatics.

The paper, which includes additional details on the findings and a list of participants, is openly available. In addition to Mace, the authors are Daniel Jacobson, Oak Ridge National Laboratory; Brooklin Gore, Lauren Rotman and Mary Hester, Lawrence Berkeley National Laboratory/ESnet; and Jennifer Schopf, Predrag Radulovic and William Barnett, Indiana University.

ESnet is a DOE Office of Science User Facility. Argonne and Lawrence Berkeley national laboratories are supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit