ESnet’s Science DMZ Breaks Down Barriers, Speeds up Science
Contact: Jon Bashor, +1 510 486 5849
From individual universities around the country to a consortium of research institutions stretching the length of the west coast, networking teams are deploying an infrastructure architecture known as the Science DMZ to help researchers make productive use of ever-increasing data flows.
The Science DMZ traces its name to an element of network security architecture. In a security context, a DMZ or “demilitarized zone” is a portion of a site network that is dedicated to external-facing services (such as web and email servers). Typically located at the network perimeter, a DMZ has its own security policy because of its dedicated purpose: exchanging data with the outside world. A Science DMZ is dedicated specifically to external-facing, high-performance science services. For example, the data servers for a large data repository would be placed in a Science DMZ so that collaborating institutions could easily transfer hundreds of terabytes of data for analysis.
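To put “hundreds of terabytes” in perspective, the Python sketch below estimates idealized transfer times for a hypothetical 200 TB dataset at several sustained rates. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly sustained rate with no protocol overhead, so real transfers would take longer.

```python
# Idealized transfer-time estimate. Assumes decimal units (1 TB = 10**12 bytes)
# and a perfectly sustained rate with no protocol overhead.

def transfer_hours(size_bytes: float, rate_bps: float) -> float:
    """Hours needed to move size_bytes at a sustained rate_bps (bits per second)."""
    return size_bytes * 8 / rate_bps / 3600

DATASET_BYTES = 200e12  # hypothetical 200 TB dataset

RATES_BPS = {
    "10 Mbps": 10e6,
    "1 Gbps": 1e9,
    "10 Gbps": 10e9,
    "100 Gbps": 100e9,
}

for label, bps in RATES_BPS.items():
    hours = transfer_hours(DATASET_BYTES, bps)
    print(f"{label:>8}: {hours:10.1f} hours ({hours / 24:7.1f} days)")
```

Under these assumptions the 200 TB example takes roughly 44 hours at 10 Gbps, but about five years at 10 Mbps, which is why the speed of every hop between the repository and its users matters.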
Eli Dart, a network engineer with the Department of Energy’s Energy Sciences Network (ESnet), coined the term “Science DMZ” in early 2010 to describe the network configuration linking two DOE sites: the Princeton Plasma Physics Laboratory in New Jersey and the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory in California. ESnet provides high-bandwidth connections between 40 DOE sites in the U.S. and links to collaborators around the globe. Both NERSC and ESnet are DOE Office of Science User Facilities.
In September 2015, Dart and Brent Draney of NERSC received the Berkeley Lab Director’s Award for Exceptional Achievement for their development of the Science DMZ concept as members of the supercomputer center’s Networking, Security and Servers Group. They were recognized for “achievement in operational effectiveness, process re-engineering or improvement, resource management and efficiency, or partnerships across organizational/departmental boundaries.”
Dart formalized the idea in a summer 2010 presentation to a meeting of the ESnet Site Coordinators Committee, then in February 2011 took it to a broader audience at a leading conference for the international networking community.
Since then, the NSF has endorsed the concept, more than 100 universities have replicated it, and several federal research organizations are considering it. It is also the basis for the new Pacific Research Platform, a cutting-edge research infrastructure that will link together the Science DMZs of dozens of West Coast research institutions. On July 30, 2015, the NSF announced a $5 million, five-year award to UC San Diego and UC Berkeley to support the Pacific Research Platform as a science-driven, high-capacity, data-centric “freeway system” on a large regional scale.
Among the institutions using the Science DMZ architecture are General Atomics, a research company studying fusion energy, and the University of Utah, which participates in a number of research collaborations.
Dart said he started thinking about the challenge of accelerating data flows when he joined NERSC in 2001, and his ideas picked up speed after he joined ESnet in 2005.
“Network people live in the middle and see everything everyone does on the network,” Dart said. “By recognizing patterns, we could make the infrastructure for science behave better. And if we could come up with a model that could be pretty easily replicated by other facilities, we could remove a lot of the complexity of creating and maintaining a high performance data infrastructure.”
Several factors are driving the issue: the increasingly collaborative nature of research, the construction of more powerful experimental facilities that collect higher-resolution images at higher speeds, and increased bandwidth on network backbones. As a result, more researchers are creating larger datasets that they need to share. While the data moves quickly across the long-haul science networks, transfers bog down when they encounter traditional local network architectures at the end points.
As an example of the scope of the problem, ESnet operates a 100-gigabit-per-second network linking 40 DOE sites and several research sites in Europe. But ESnet also peers with almost 140 other research and education networks, and 80 percent of the nearly 25 petabytes carried by ESnet in one month either originates or terminates at a site off the main network. Without high-speed connectivity from end to end, the 100G backbone merely speeds datasets from bottleneck to bottleneck.
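The monthly figures above can be turned into an average rate with a little arithmetic. The Python sketch below does this under two stated assumptions: decimal petabytes (1 PB = 10^15 bytes) and a 30-day month. It is a long-run average only; peak rates are much higher.

```python
# Convert a monthly traffic volume into an average bit rate.
# Assumptions: decimal petabytes (1 PB = 10**15 bytes) and a 30-day month.

SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 s

def average_gbps(petabytes_per_month: float) -> float:
    """Average rate in Gbps needed to carry the given monthly volume."""
    bits = petabytes_per_month * 1e15 * 8
    return bits / SECONDS_PER_MONTH / 1e9

monthly_pb = 25.0          # "nearly 25 petabytes" in one month (from the article)
off_network_share = 0.80   # share originating or terminating off ESnet (from the article)

total = average_gbps(monthly_pb)
off_net = total * off_network_share
print(f"average load: {total:.1f} Gbps, of which {off_net:.1f} Gbps starts or ends off ESnet")
```

On these assumptions ESnet averaged roughly 77 Gbps that month, with about 62 Gbps of it starting or ending on another network, traffic whose speed ESnet's own backbone cannot guarantee.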
So when universities learn about the Science DMZ concept, it’s an idea that sells itself, Dart said. Since 2013, ESnet, Indiana University, and Internet2 have organized a series of Operating Innovative Networks (OIN) workshops “to get the ideas into the hands of people doing the work at regional networks and campuses,” Dart said. Over the same period, Dart and Jason Zurawski (another member of ESnet’s Science Engagement Team and ESnet’s lead for the OIN workshop series) have been giving talks at meetings and presenting webinars on the Science DMZ. Zurawski estimates that between them, he and Dart have given about 100 presentations around the country. At the OIN meetings alone, they’ve met with more than 350 networking staff members, 75 percent of them from universities.
But Dart points out that institutions have different infrastructures, different policies and different funding models, meaning that one size does not fit all. There can also be some inertia on the personnel side, making it difficult to implement changes.
“The sociology is really important – different scientific fields have different expectations and different cultures and you need to take that into consideration,” Dart said. “So we have a set of design patterns, not a master blueprint. It’s important to have everybody on board. And sometimes you need to tackle the easy problems first because it’s often more effective to snowball something that’s successful in a small way than to start by banging your head against the biggest thing you can find.”
Kevin Thompson, program manager in the National Science Foundation’s Division of Advanced Cyberinfrastructure, attended a Science DMZ talk given by Dart in early 2012. At the time, the NSF was reviewing several reports from task forces supporting an NSF advisory committee. One report was about campus bridging, the need to address networking beyond the core. It recommended that NSF create a program to fund campus connections to points of presence (POPs) and the national backbones.
“I heard Eli give a talk on the Science DMZ architecture in the context of the research campus border. He talked about an engineering change that would place scientific data flows at the top of the network food chain,” Thompson said. “This was a new context, and the timing was good as we were moving forward with our program to improve campus networking. When we put out our solicitation for the Campus Cyberinfrastructure - Network Infrastructure and Engineering program (CC-NIE), ESnet and others were attacking the performance problem and the DMZ was a natural, timely fit.”
On campuses, scientific data took the same path as dorm traffic, Thompson said. The paths had inline firewalls, were underpowered or misconfigured, and were not set up to handle scientific data flows. He remembers one university that was trying to do a science demo at an SC conference between the campus and the exhibit hall and couldn’t figure out why the flows kept shutting down. The campus network was programmed to treat any flow that large as a denial-of-service attack.
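The failure Thompson describes comes from a rule that judges traffic by size alone. The toy Python sketch below illustrates the problem with a hypothetical 1 Gbps threshold and made-up flow records; it is not any real campus's configuration. Because the rule sees only the rate, it cannot tell a legitimate bulk science transfer from an attack.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    src: str
    dst: str
    gbps: float    # observed rate of the flow
    purpose: str   # ground truth, invisible to the filter

# Hypothetical rule: any flow faster than 1 Gbps is treated as an attack.
DOS_THRESHOLD_GBPS = 1.0

def naive_filter(flow: Flow) -> str:
    """Drop any flow whose rate exceeds the threshold, regardless of its purpose."""
    return "DROP" if flow.gbps > DOS_THRESHOLD_GBPS else "ALLOW"

# Hypothetical flow records.
flows = [
    Flow("dorm-pc", "video-cdn", 0.02, "streaming"),
    Flow("lab-server", "sc-booth", 8.5, "science demo"),
    Flow("botnet", "web-server", 4.0, "attack"),
]

for f in flows:
    print(f"{f.src:>10} -> {f.dst:<10} {f.gbps:5.2f} Gbps  {naive_filter(f)}  ({f.purpose})")
```

The science demo is dropped along with the attack, while the small dorm flow passes. Separating science traffic onto its own path, rather than tuning this threshold, is the kind of rethinking Thompson goes on to describe.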
“What Eli was talking about rang very true with me – re-architecting specifically for scientific data flows. It was not about reconfiguring, but rethinking campus infrastructure,” Thompson said. “This is one example of the kind of infrastructure NSF wanted to support in the CC program.”
It is a considerable undertaking, Thompson pointed out, and an investment of time and resources for the campuses. And the more resources they have, the more complex it is to manage.
“But the change in how campus IT leadership now views scientific data flows is as important as anything else,” Thompson said. “In building a DMZ, you explicitly acknowledge that scientific data flows are of primary importance to the organization. They support distributed collaborations, which exemplify science in the 21st century. It’s all about moving out of the individual researcher’s lab to a connected world. We are definitely seeing quantitative improvements in the reporting back: orders of magnitude improvements in data flow rates in some cases. Performance is something that is really quantifiable.”
Both DOE and NSF have decades of experience in high performance computing (HPC) and networking.
“In the R&E networking space we have a close working relationship and that helps with NSF’s mission to fund research at colleges and universities,” Thompson said. “It’s been a great partnership. The Science DMZ is one of the more important network engineering events for the community to build around in a long time. It’s hard to overstate the importance of this seminal engineering program. ESnet has shown national leadership in campus networking and it’s a big reason why the NSF program has been so successful.”
Over the past three years, NSF has made awards to 135 campuses to implement network improvements such as the Science DMZ, and 20 to 30 or more new awards are expected in 2015, Thompson said.
Taking it on the Road
In his travels and talks, Zurawski has seen a wide range of challenges on campuses interested in deploying Science DMZs. Large schools often have the engineering talent and just need funding to buy additional equipment, such as an upgraded network switch. Medium-sized schools may have restrictive security policies to go with their networks. Small schools may lack both a strong network and the staff to support it.
But in each case, the key is to talk with the people who will be managing the system, as well as those who will be using it, Zurawski said.
“We have to go out and talk to the users, which is the idea behind ESnet’s Science Engagement program,” Zurawski said. “Sometimes it’s hard to reach the smaller schools, so we rely on word of mouth and community mailing lists.”
Zurawski and Dart also go to meetings of regional networks to get the word out. Here are some recent examples:
- In January 2015, they spoke about the Science DMZ at the biannual Westnet meeting, which brings together representatives of universities and networks in Arizona, California, Colorado, Idaho, New Mexico, Utah and Wyoming.
- In March, Zurawski led a webinar on “Upgrading Campus Cyberinfrastructure: An Introduction to the Science DMZ Architecture” for research and education organizations in Pennsylvania.
- At the twenty-second GENI (Global Environment for Network Innovations) Engineering Conference, held in March in Washington, D.C., ESnet staff demonstrated the Science DMZ as a service and showed how the technique for speeding the flow of large datasets can be created on demand.
- In May, Zurawski led a webinar for members of the Great Plains Network, giving a basic overview of cyber threats to a campus and of ways the Science DMZ architecture can be implemented to protect the campus while allowing high-performance activities. The Great Plains Network, recipient of a regional CC*IIE grant from NSF, comprises 20 universities in Arkansas, Kansas, Missouri, Nebraska, Oklahoma and South Dakota, with affiliates in Iowa, Minnesota and Wisconsin.
- Also in May, ESnet partnered with BioTeam, a high-performance consulting practice, to offer a webinar on the Science DMZ architectural paradigm. Although directed toward network operators and researchers working in the life sciences, the webinar was open to the general public.
Linking West Coast Researchers
But perhaps the biggest stage was in March at the 2015 annual conference of CENIC (Corporation for Education Network Initiatives in California) when the Pacific Research Platform was announced. The platform will link together the Science DMZs of top research institutions via three advanced networks: ESnet, CENIC’s California Research & Education Network (CalREN) and Pacific Wave. The project will weave together separate Science DMZs into one large, seamless research platform that enables colleagues worldwide to collaborate while not losing any of the advantages of a network architecture specially optimized for the unique needs of big-data research.
The Pacific Research Platform (PRP) links most of the research universities on the West Coast (the 10 University of California campuses, San Diego State University, Caltech, USC, Stanford and the University of Washington) via the 100G infrastructure of CENIC and Pacific Wave. To demonstrate extensibility, PRP also connects the University of Hawaii System, Montana State University, the University of Illinois at Chicago, Northwestern, and the University of Amsterdam. Other research institutions in the PRP include Lawrence Berkeley National Laboratory (LBNL) and four national supercomputer centers (SDSC-UC San Diego, NERSC-LBNL, NAS-NASA Ames, and NCAR). In addition, the PRP will interconnect with the NSF-funded Chameleon NSFCloud research testbed and the Chicago StarLight/MREN community.
The project was described by a panel that included ESnet’s Dart and Larry Smarr, founding director of the California Institute for Telecommunications and Information Technology (Calit2), a UC San Diego/UC Irvine partnership, and the Harry E. Gruber professor in Computer Science and Engineering at UC San Diego.
“We realized that many West Coast universities were recipients of CC-NIE grants: seven UC campuses, USC, Caltech, Stanford and the University of Washington. Each one had identified its science drivers and built a DMZ into its network,” Smarr said. “The next logical step was to link them all together. The campuses had already invested in CENIC and Pacific Wave, so we needed system integration with the DMZs as the backplane.”
But this was not as straightforward as it sounds, Smarr said, because each proposal interpreted the DMZ principles differently, according to the local IT environment on each campus. The result is a collection of DMZ examples rather than a strict interpretation of a single model.
“Last December, we pulled together key networking experts up and down the West Coast, from San Diego to the University of Washington, to see if we could wire the Science DMZs together, and we set the CENIC 2015 meeting in early March as our deadline,” Smarr said. “Instead of having high bandwidth only from campus gateway to campus gateway, we wanted to do it from inside each campus DMZ to the others. We wanted to create end-to-end integration.”
Again, it was not a trivial task. When they first tested the network, the team found they were getting only 10 Mbps on a 10 Gbps link. Using perfSONAR, a widely deployed network performance measurement tool, together with the GridFTP data transfer tool, they were able to locate the bottlenecks: routers, switches and the transport protocol in use.
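One common way a transport protocol caps throughput, offered here as an illustration rather than as the PRP team's actual diagnosis, is the TCP window. A single TCP connection can have at most one window of unacknowledged data in flight per round trip, so its throughput is at most window / RTT. The Python sketch below computes that ceiling for a 65,535-byte window (the largest TCP allows without the window-scale option) and the window needed to fill a 10 Gbps link, assuming a hypothetical 20 ms round-trip time.

```python
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_s

def required_window_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return rate_bps * rtt_s / 8

RTT_S = 0.020  # hypothetical 20 ms round trip between two West Coast campuses

# 65,535 bytes is the largest window TCP can advertise without window scaling.
ceiling = max_throughput_bps(65535, RTT_S)
needed = required_window_bytes(10e9, RTT_S)

print(f"a 65,535-byte window caps throughput at {ceiling / 1e6:.1f} Mbps")
print(f"filling 10 Gbps needs about {needed / 1e6:.0f} MB in flight")
```

With these assumptions the ceiling is about 26 Mbps, hundreds of times below the link rate. This simple bound ignores packet loss, for example from switches with small buffers, which can reduce single-stream throughput further.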
“Everybody really dug in – total collaboration mode,” Smarr said. “Then by March, we could announce at CENIC that between a number of sites we got performance of 9.6 Gbps, disk to disk, over a 10G link. And we got 36 Gbps over a 40G connection.”
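The figures Smarr cites can be checked with simple arithmetic. The Python sketch below uses only numbers from this article (10 Mbps at first, then 9.6 Gbps on a 10G link and 36 Gbps on a 40G link) to compute link utilization and the improvement on the 10G path.

```python
def utilization(achieved_gbps: float, link_gbps: float) -> float:
    """Fraction of the link's nominal capacity actually achieved."""
    return achieved_gbps / link_gbps

initial_gbps = 0.010  # 10 Mbps first observed on the 10 Gbps link
tuned_gbps = 9.6      # disk-to-disk result announced at CENIC 2015

print(f"10G link: {utilization(tuned_gbps, 10):.0%} utilized after tuning")
print(f"40G link: {utilization(36, 40):.0%} utilized")
print(f"improvement on the 10G path: {tuned_gbps / initial_gbps:.0f}x")
```

The results, 96 and 90 percent utilization and a roughly 960-fold gain, are consistent with the “orders of magnitude improvements” Thompson describes.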
The team has continued working on the project and now has NSF funding to hold an Oct. 14-16 workshop on the Pacific Research Platform.
Smarr, who has had a long and noteworthy career in supercomputing and science, including founding the National Center for Supercomputing Applications (NCSA) at the University of Illinois, sees the emergence of ESnet and the Science DMZ almost as history repeating itself.
“If you go back 30 years, you find the equivalent of what’s going on now. Back then, interest in the DOE supercomputing centers transferred to the university research community,” he said. “When I wrote my proposal for NCSA in 1983, there were no multidisciplinary, nationally available supercomputers in the university sector. I proposed explicitly cloning the Lawrence Livermore National Lab system for NCSA and even used CTSS, the Cray Time Sharing System developed at LLNL. Sid Karin cloned MFECC, now known as NERSC, for the San Diego Supercomputer Center. DOE had the ideal combination of supercomputers and mass storage.”
Smarr added, “By adopting the DOE approach early, NSF took that proven approach in computational science and engineering and moved it into the university sector, and then they trained up a good deal of American industry in supercomputing.”
Fast-forward 30 years: ESnet, with federal stimulus funding, has built a 100 Gbps national footprint.
“They defined the Science DMZ and took it to the DOE science community. NSF has now cloned this approach through the CC-NIE program over the past three years,” Smarr said. “It’s been built out on over 100 campuses and these Science DMZs are all based on ESnet’s approach.”
ESnet is managed by Lawrence Berkeley National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.