National Severe Storms Laboratory Performance Investigation
The National Severe Storms Laboratory (NSSL) is a National Oceanic and Atmospheric Administration (NOAA) facility operated on the campus of the University of Oklahoma. The laboratory is responsible for creating products that assist in the prediction of weather events, and it consumes data feeds from many remote locations. Efficient data movement is a critical part of the scientific process and a direct requirement in the day-to-day operation of the facility's scientific users.
Scientific users at the facility reported performance abnormalities to their provider (N-Wave), but the investigation stalled after the provider concluded that all of its equipment was functioning within specification. ESnet was consulted in May of 2017 to investigate and make recommendations. Over a two-month investigation, multiple problems impacting both LAN and WAN performance were uncovered.
In May of 2017, ESnet was contacted to assist with a network performance problem. Users at the facility had noted months of slow performance when downloading from and uploading to remote locations. Performance to all remote locations, whether directly on the NOAA network (e.g. the Geophysical Fluid Dynamics Laboratory (GFDL) in Princeton, NJ) or at partner institutions (e.g. the Texas Advanced Computing Center (TACC) in Austin, TX), suffered equally. Given the scale of the observations, the problem could have been local to the NSSL or National Weather Service (NWS) network, the University of Oklahoma (OU) campus network, OneNet (the regional provider), or N-Wave. Traffic leaving the facility travels toward a network egress in Boulder, CO (the N-Wave TIC) before reaching any other location.
ESnet engaged with the staff at NSSL to learn more about the observations. Data transfer tools (GridFTP, etc.) were showing slow performance both inbound and outbound. The facility initially had a 1Gbps-connected perfSONAR resource that showed the same behavior (this was upgraded to 10Gbps early in the process). ESnet then began to “map” the end-to-end path between NSSL and a single location experiencing performance issues, in this case GFDL in Princeton, NJ. During this mapping, several intermediate networks were identified in the path:
- NSSL Local Network
- University of Oklahoma Network
- OneNet: the state network of Oklahoma
- Internet2: a national network providing services to NOAA via its Advanced Layer 2 Service (AL2S)
- NOAA N-Wave: an overlay network (operated on top of Internet2 by the Global Research Network Operations Center (GRNOC)) that provides connectivity to NOAA sites
- MAGPI: a regional provider in Pennsylvania and New Jersey
- GFDL Local Network
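The path-mapping step can be sketched in code. The snippet below is a minimal illustration, not the tool ESnet used: the prefixes and hop addresses are made up, and a real mapping would be built from input from the engineers at each facility. It classifies traceroute hops by the administrative domain that owns each address block:

```python
import ipaddress

# Hypothetical prefix-to-network mapping; real prefixes would come from
# the engineers at each facility (or from registry/whois data).
NETWORKS = {
    "10.10.0.0/16": "NSSL Local Network",
    "10.20.0.0/16": "University of Oklahoma Network",
    "10.30.0.0/16": "OneNet",
    "10.40.0.0/16": "Internet2 AL2S",
    "10.50.0.0/16": "NOAA N-Wave",
    "10.60.0.0/16": "MAGPI",
    "10.70.0.0/16": "GFDL Local Network",
}

def classify_hop(ip: str) -> str:
    """Return the administrative domain a traceroute hop belongs to."""
    addr = ipaddress.ip_address(ip)
    for prefix, name in NETWORKS.items():
        if addr in ipaddress.ip_network(prefix):
            return name
    return "unknown"

# Hops collected from a traceroute between NSSL and GFDL (made-up addresses)
hops = ["10.10.1.1", "10.20.5.1", "10.30.2.1", "10.40.7.1",
        "10.50.9.1", "10.60.3.1", "10.70.1.5"]
path = [classify_hop(h) for h in hops]
```

Walking the classified path hop by hop makes it clear which organization to contact about each segment, and where testing resources need to exist.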
The following map was created using input from engineers at each facility:
More importantly, with a detailed map of the environment it was possible to locate testing resources:
- OU Campus
- NOAA N-Wave (Norman, Washington)
Regular testing to all of these locations indicated poor performance, including to the short-latency resource located on NOAA N-Wave in Norman itself. This pointed to a local area networking problem that was impacting all wide area traffic.
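The reasoning can be made concrete with the well-known Mathis et al. model for loss-limited TCP throughput (throughput ≤ MSS / (RTT · √p)): loss introduced on the LAN caps throughput everywhere, and the cap tightens as round-trip time grows, so wide-area transfers suffer most even though the loss is local. The RTT and loss values below are illustrative, not measurements from this investigation:

```python
from math import sqrt

def mathis_throughput_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Upper bound on single-stream TCP throughput (Mathis model), in Mbps."""
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate)) / 1e6

# Same 0.1% packet loss, two very different path lengths (illustrative values):
local = mathis_throughput_mbps(1460, rtt_ms=1, loss_rate=0.001)   # short path, e.g. Norman
wan = mathis_throughput_mbps(1460, rtt_ms=40, loss_rate=0.001)    # long path, e.g. GFDL
```

With these assumed numbers the short path is capped near several hundred Mbps while the 40 ms path is capped below 10 Mbps by the same loss, which matches the pattern of every wide-area destination looking equally slow.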
Working with NSSL engineers, the following steps were taken and issues discovered:
- Use of a 1Gbps test infrastructure for perfSONAR, which was upgraded to 10Gbps
- Establishment of latency and bandwidth tests to other locations
- Discovery of CRC/framing errors on several devices in the NSSL core. Errors of this sort normally indicate either a failing or misconfigured piece of equipment
- Layer 2 (e.g. VLAN) topology that was difficult to understand, including a VLAN with circular paths through the NSSL core
- A BGP peering for NSSL established on a campus firewall, which caused Science DMZ traffic to still be pulled through that device, even though the intention was to bypass it
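The error-counter check that surfaced the CRC/framing problem amounts to a simple audit across interface counters. The sketch below is illustrative: the interface names, counter fields, and values are invented, standing in for data scraped via SNMP or parsed from a device's "show interfaces" output:

```python
# Hypothetical counter snapshots; field names and values are illustrative.
counters = {
    "xe-0/0/1": {"in_errors": 0, "framing_errors": 0},
    "xe-0/0/2": {"in_errors": 1842, "framing_errors": 1771},
    "ae0":      {"in_errors": 5, "framing_errors": 0},
}

def suspect_interfaces(counters, threshold=0):
    """Flag interfaces whose error counters exceed a threshold.

    Non-zero counters that keep incrementing usually point at failing
    optics, dirty fiber, or a misconfigured piece of equipment.
    """
    return sorted(
        name for name, c in counters.items()
        if c["in_errors"] > threshold or c["framing_errors"] > threshold
    )
```

In practice such a check is run periodically and compared against the previous snapshot, since a counter that is non-zero but static may reflect a long-fixed event rather than an active fault.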
The graph below illustrates performance after the problems were addressed.
The performance gains were short-lived; after about 5 days the problems returned, with similar observations: persistent packet loss from latency monitoring tools and low throughput. The behavior was not constant, however, because the use of link aggregation (LAG) resulted in non-deterministic behavior across the infrastructure. Due to a busy period at the facility (weather events and staff vacations), it was not possible to immediately investigate the root cause further.
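The non-deterministic behavior follows from how a LAG pins each flow to one member link via a hash of the flow's headers: flows hashed onto a failing member see loss, while the rest run clean, so identical tests can give different results. A minimal sketch, where the hash function and addresses are illustrative rather than any switch's actual algorithm:

```python
import zlib

def lag_member(src_ip, dst_ip, src_port, dst_port, n_members):
    """Toy flow hash: deterministic per flow, spreading flows across members."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port}".encode()
    return zlib.crc32(key) % n_members

# With one bad member in a 2-member LAG, only flows hashed onto it see loss.
BAD_MEMBER = 1
flows = [("10.0.0.1", "192.0.2.10", port, 5201) for port in range(5000, 5100)]
affected = [f for f in flows if lag_member(*f, n_members=2) == BAD_MEMBER]
clean = [f for f in flows if lag_member(*f, n_members=2) != BAD_MEMBER]
```

Because the hash is deterministic per flow, a single long-running test may look consistently good or consistently bad depending on which member its 5-tuple lands on, which is exactly the kind of confusing signal reported here.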
In early July, another investigation was performed, focusing on the connectivity between the NWS core and the N-Wave Norman MX80. Interface counters on Core5 showed a large number of inbound framing errors from the MX80 (the MX80's own counters did not show these), so the following steps were tried:
- Dropping the one 10Gbps member of the MX80-to-NWS-core LAG that saw the framing errors. After doing so, no further errors were witnessed and tests showed clean results (leaving only one 10Gbps connection for the facility).
- Cleaning the fiber on the dropped member and re-testing. The errors returned.
- Replacing the 10Gbps LR XFP optic. The errors returned again.
- Migrating the connection to a completely different 10Gbps interface (xe-0/0/3 on the MX80).
Moving to the new interface resulted in an immediate performance improvement, shown in Figure 2 below:
A complete timeline for both fixes can be seen below:
Multiple layers of problems are common in investigations like this, with one issue potentially masking others. It is not known at this time why performance remained high for 5-6 days after the initial fix, only to degrade afterward. One explanation is that increased traffic load aggravated a slowly failing component; it could also be pure chance that the failure occurred when it did rather than earlier or later.
Future work for NSSL includes working with the OU campus to migrate the BGP peering away from the firewall, and considering implementation of a more fully featured Science DMZ.