Machine Learning to Add Another Dimension to ESnet's Toolbox for Predicting Data Patterns

Mariam Kiran leads the Deep Learning and Artificial Intelligence for High-Performance Networks (DAPHNE) project as part of her DOE Early Career Award research.

November 6, 2020
By Jon Bashor


Applications like Google Maps are a boon to people looking for the fastest, smoothest way to get from one place to another, whether they're navigating in real time or looking ahead to the best time to avoid congestion when trying to catch a flight.

A research project called DAPHNE, led by Mariam Kiran of the Energy Sciences Network (ESnet), is using deep learning to give scientists the same kind of tools for predicting the best time and date to schedule large-scale data transfers across ESnet, the Department of Energy's (DOE’s) high-speed network for science. In 2017, Kiran received a DOE Early Career Award with funding for five years to pursue DAPHNE, or Deep Learning and Artificial Intelligence for High-Performance Networks.

The project is exploring how artificial intelligence can be used to design and efficiently manage distributed network architectures to improve data transfers, guarantee high throughput, and improve traffic engineering. The goal is to improve network performance using software, rather than the more common practice of upgrading by buying advanced (and costly) hardware such as routers and switches. A paper describing this work has been accepted to the IEEE BigData 2020 conference, which will be held virtually December 10-13. Kiran will also give a lightning talk on DAPHNE at SC20.

The project is refining a tool called NetPredict, which is designed to work with research and education (R&E) networks to schedule big science data transfers up to seven days ahead to achieve optimum network performance. While the ESnet Portal, a software library that gives network administrators and the research community an up-close look at traffic across ESnet, is good for giving current network statistics, NetPredict uses deep learning models and real-time traffic statistics to predict when and where the network will be congested and how long transfers will take to complete.

"We started with 24 hours and now we have pushed our predictions up to seven days broken down into hourly slots,” Kiran said. "As we go farther out, we notice the prediction accuracy drops significantly even with LSTM-based approaches. So, we teamed up with Prasanna Balaprakash, a 2018 DOE Early Career awardee from Argonne National Lab, to develop a dynamic convolutional graph neural network. This allows us to reduce the prediction errors and give reliable estimates to scientists scheduling their transfers.”

In consultation with staff on the ESnet Planning and Architecture team, the DAPHNE team is creating a new user interface. Users will enter the start and end locations of the data transfer and the amount of data to be moved, and the interface will recommend the best time to schedule the transfer. The interface also details how long the transfer will take at the expected bandwidth and issues alerts about potential slowdowns on segments of the route.
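
How such a recommendation might be computed can be illustrated with a minimal Python sketch. It is not NetPredict's code: the function names, the units, and the assumption that the forecast is a simple list of available bandwidth per hourly slot are all hypothetical.

```python
def transfer_hours(size_gb, forecast_gbps, start):
    """Hours needed to move size_gb gigabytes, starting at hourly slot `start`.

    forecast_gbps[i] is the (assumed) available bandwidth in gigabits per
    second during slot i. Returns None if the forecast horizon ends before
    the transfer would complete.
    """
    if size_gb <= 0:
        return 0.0
    remaining_gbit = size_gb * 8.0  # gigabytes -> gigabits
    elapsed = 0.0
    for rate in forecast_gbps[start:]:
        capacity = rate * 3600.0  # gigabits that fit into this one-hour slot
        if capacity >= remaining_gbit:
            return elapsed + remaining_gbit / capacity
        remaining_gbit -= capacity
        elapsed += 1.0
    return None


def best_start(size_gb, forecast_gbps):
    """Return (start_slot, hours) for the start slot with the shortest transfer."""
    candidates = []
    for start in range(len(forecast_gbps)):
        hours = transfer_hours(size_gb, forecast_gbps, start)
        if hours is not None:
            candidates.append((hours, start))
    if not candidates:
        return None
    hours, start = min(candidates)  # ties go to the earliest slot
    return start, hours
```

Given a forecast of `[1.0, 10.0, 10.0]` Gbps and a 4,500 GB data set, `best_start` would recommend waiting one hour, because the transfer then fits entirely into a single uncongested slot.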

NetPredict, currently deployed on Google’s Cloud Platform, runs pre-trained deep learning models based on graph neural network architectures, which allow users to predict multiple hours into the future. When a real-time query comes in, it triggers the machine learning models to run and produce multiple forecast points, which are then charted on the map so users can also see how traffic across a specific network link has looked in the past.

The machine learning models extract seasonality, trends, and regular peaks in order to generate predictions for future values of the system. For example, two years of training data consisting of hourly average network traffic can be used to build a model, with a learning window approach then updating the model in real time as new traffic data arrives.
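
The idea of learning a weekly pattern from hourly averages and refreshing it over a moving window can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the deep learning models DAPHNE uses: it captures only hour-of-week seasonality, not trends, and the class name and window length are assumptions made for illustration.

```python
from collections import deque

HOURS_PER_WEEK = 168


class SeasonalWindowForecaster:
    """Hour-of-week average over the most recent `window_weeks` weeks."""

    def __init__(self, window_weeks=4):
        # One bounded history per hour of the week; old weeks fall out
        # automatically once the deque reaches its maximum length.
        self.slots = [deque(maxlen=window_weeks) for _ in range(HOURS_PER_WEEK)]
        self.next_hour = 0  # count of hourly observations seen so far

    def update(self, value):
        """Add the next hourly traffic average to the window."""
        self.slots[self.next_hour % HOURS_PER_WEEK].append(value)
        self.next_hour += 1

    def forecast(self, horizon):
        """Predict the next `horizon` hours; None where no history exists."""
        predictions = []
        for hour in range(self.next_hour, self.next_hour + horizon):
            history = self.slots[hour % HOURS_PER_WEEK]
            predictions.append(sum(history) / len(history) if history else None)
        return predictions
```

Calling `update` once per hour keeps the model current, and `forecast(168)` would return one prediction for each hourly slot of the coming seven days.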

Applying machine learning yields better predictive results than the traditional method of using periodicity to extrapolate estimates, Kiran said. Also, deep learning methods can be run in real time to further hone the techniques. But how do they assess the accuracy?

"NetPredict has a link to another approach, which shows multiple machine learning models at the same time and how their accuracy is looking. Users can build their confidence in the network predictions, by comparing the prediction mean square error with real-time predictions,” Kiran said. "By comparing results from these models, we can see which perform best."

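Comparing models by mean squared error, as Kiran describes, comes down to a small calculation. The Python sketch below is an illustration only; the function names are hypothetical, and the model names and traffic values in the usage note are made up.

```python
def mean_squared_error(predicted, observed):
    """Average of squared differences between predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def rank_models(predictions, observed):
    """Return (model_name, mse) pairs sorted from most to least accurate.

    `predictions` maps a model name to its list of predicted values for the
    same time slots as `observed`.
    """
    scores = {
        name: mean_squared_error(values, observed)
        for name, values in predictions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

For instance, `rank_models({"model_a": [10, 12, 14], "model_b": [11, 13, 15]}, [10, 12, 14])` would list `model_a` first, since its predictions match the observed traffic exactly.
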
Kiran gave a demonstration of NetPredict at the SC19 conference in Dallas last November. Her demo showed how it worked in real time. The models were triggered to move data between Berkeley Lab and Argonne National Laboratory, located near Chicago, via the Sunnyvale hub. The path crossed several different links as NetPredict chose the least congested path.
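
The article does not say how NetPredict defines "least congested." One plausible reading, assumed here purely for illustration, is the route whose busiest link has the lowest predicted utilization. The Python sketch below finds such a route with a variant of Dijkstra's algorithm; the function name, the link format, and the site names in the usage note are hypothetical.

```python
import heapq


def least_congested_path(links, source, target):
    """Find the path whose most congested link is as lightly loaded as possible.

    `links` maps an undirected link (node_a, node_b) to its predicted
    utilization between 0.0 and 1.0. Returns (path, worst_utilization),
    or None if the target cannot be reached.
    """
    graph = {}
    for (a, b), utilization in links.items():
        graph.setdefault(a, []).append((b, utilization))
        graph.setdefault(b, []).append((a, utilization))

    best = {source: 0.0}  # lowest known bottleneck to reach each node
    queue = [(0.0, source, [source])]
    while queue:
        worst, node, path = heapq.heappop(queue)
        if node == target:
            return path, worst
        if worst > best.get(node, float("inf")):
            continue  # stale queue entry; a better route was already found
        for neighbor, utilization in graph.get(node, []):
            candidate = max(worst, utilization)
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor, path + [neighbor]))
    return None
```

With links `{("site_a", "site_b"): 0.9, ("site_a", "hub"): 0.3, ("hub", "site_b"): 0.4}`, the sketch would route from `site_a` to `site_b` through `hub`, avoiding the direct link predicted to be 90 percent utilized.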

Kiran credits ESnet Director Inder Monga with suggesting the concept of NetPredict and the use of machine learning to predict network transfers. Monga's idea was that such a capability could help scientists who use OSCARS, the ESnet-developed On-Demand Secure Circuits and Advance Reservation System that allows researchers to easily reserve guaranteed end-to-end bandwidth for scheduling massive data transfers. Once NetPredict is ready for prime time, Kiran sees it being added as a clickable feature on the OSCARS reservation form. 

NetPredict will also be a service available as part of the "Superfacility" framework being developed by Berkeley Lab. The superfacility framework comprises the seamless integration of experimental and observational instruments with computational and data facilities using high-speed networking. While the concept sounds straightforward, achieving it requires resolving any number of smaller issues, which vary by facility. Using NetPredict would help make uninterrupted data movement more predictable, Kiran said.