Data, data everywhere, but not a byte that thinks

Data scientists working in energy, finance, healthcare and many other fields face a daunting problem: the amount of information they need to analyze is growing faster than the performance of the tools at their disposal.

Experts say the world created 94 zettabytes of digital data in 2022 alone. In computing parlance, there are 1,024 terabytes in a petabyte, 1,024 petabytes in an exabyte and 1,024 exabytes in a zettabyte, which works out to roughly 103 trillion new gigabytes of information that year, an overwhelming figure to be eclipsed every year for as long as we can imagine.
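
As a quick back-of-the-envelope check of that arithmetic (a few lines of Python; the only inputs are the 94-zettabyte figure and the binary conversion factors above):

    # Rough scale check using binary prefixes (1 ZB = 1,024 EB, 1 EB = 1,024 PB, ...)
    ZETTABYTES_2022 = 94
    gigabytes = ZETTABYTES_2022 * 1024**4   # ZB -> EB -> PB -> TB -> GB
    terabytes = ZETTABYTES_2022 * 1024**3   # ZB -> EB -> PB -> TB
    print(f"{gigabytes:,} GB")              # about 103 trillion gigabytes
    print(f"{terabytes:,} TB")              # about 101 billion terabytes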

The quest to build software that can make sense of this information overload is called massive-scale data science. Today, most organizations that possess massive datasets simply set their data aside until some future day when stronger tools are readily available. An organization like the Internal Revenue Service can examine individual records and groups of records, but not the entire dataset at once, and not in ways that are inherently secure.

New challenges require new thinking

Distinguished Professor David Bader is director of New Jersey Institute of Technology’s Institute for Data Science, where the challenges of massive-scale data are the order of the day.

His team focuses on expanding Arkouda, an open-source code library that originated in the U.S. Department of Defense. Its user interface is written in Python, giving everyday analysts access to infrastructure comparable to a supercomputer's. Python is an extremely common language not just for professionals but also for undergraduates, so the decision to build on it means all hands can be on deck to work on the problem. Without such tools, researchers would face a far higher barrier to entry: working directly in a language designed for parallel supercomputing, where graduate-level knowledge is expected.
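
For a sense of what that looks like from the analyst's side, here is a minimal sketch, assuming an arkouda_server instance is already running and reachable on its default port; the array size and the specific operations are placeholders for illustration, not Bader's actual workloads:

    import arkouda as ak

    ak.connect()                        # attach the Python client to a running arkouda_server
    a = ak.randint(0, 2**32, 10**9)     # a billion-element array lives on the server, not the laptop
    perm = ak.argsort(a)                # the sort runs in parallel on the server's hardware
    sorted_a = a[perm]
    ak.disconnect()

The point of the design is that the session above reads like ordinary Python, while the heavy lifting happens on whatever parallel hardware the server controls.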

Bader’s research includes the development of a new Arkouda module called Arachne for large-scale graph analytics. Arachne is catching on, proving useful for extracting meaningful information from an increasing variety of computationally intensive problems, such as social networks and telecommunications networks. "The number of users is growing and the problems that we're continuing to solve also continues to increase," he said. "At this scale, I'm not aware of other approaches."
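
Arachne's own interface is not reproduced here, but a hedged sketch in plain Arkouda gives the flavor of the edge-list queries such graph work starts from (toy data, invented for illustration):

    import arkouda as ak

    ak.connect()
    # A toy edge list: edge i runs from vertex src[i] to vertex dst[i]
    src = ak.array([0, 0, 1, 2, 2, 2])
    dst = ak.array([1, 2, 2, 0, 1, 3])
    by_src = ak.GroupBy(src)            # group edges by their source vertex
    nodes, out_degree = by_src.count()  # out-degree of every vertex with outgoing edges
    print(nodes, out_degree)
    ak.disconnect()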

COVID provides a spark

Data science has many applications, and one of the best examples is healthcare, where the COVID pandemic required doctors and scientists to work faster than ever.

Oliver Alvarado Rodriguez, a doctoral student at NJIT, is developing software for what are called hypergraphs. These can be used, for example, to evaluate a school's COVID outbreak and learn which classrooms of infected students came into contact with which classrooms of healthy students.
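
A toy sketch of the idea, in plain Python with invented classroom data (not Alvarado Rodriguez's actual software): each classroom is a hyperedge that connects all of its students at once, and an outbreak query asks which other classrooms share members with the exposed ones.

    # Each classroom is a hyperedge: the set of students who share that room (invented data)
    classrooms = {
        "101": {"ana", "ben", "cal"},
        "102": {"cal", "dee"},
        "103": {"dee", "eve"},
    }
    infected = {"cal"}  # students who tested positive

    # Classrooms touched directly by infection, then classrooms overlapping with those
    exposed = {room for room, kids in classrooms.items() if kids & infected}
    at_risk = {room for room, kids in classrooms.items()
               if room not in exposed
               and any(kids & classrooms[r] for r in exposed)}
    print(exposed)  # rooms '101' and '102'
    print(at_risk)  # room '103', which shares a student with room '102'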

Alvarado Rodriguez said other researchers around the world are working on similar problems, but most use a simpler approach that sacrifices some accuracy. Looking forward, "Once we're done with most of the theory, then we can move into coding optimizations that we could do to make our method run in parallel, and there are a few things we could do to speed it up when working with massive-scale datasets. That's something that I would like to look into."

Meanwhile, their NJIT colleagues James Geller and Yehoshua Perl, in the Structural Analysis of Biomedical Ontologies Center, developed software to help healthcare workers interpret the novel coronavirus ontology.

Geller said data science is key to such work. "Students in statistics learn how to determine whether differences in data are statistically significant [while] artificial intelligence trains programs with two sets of data that are assumed to be different and the trained program can then recognize new data as belonging to one group or to the other group," he explained.

"So these are almost two sides of the same coin. The problem is that historically the students who learned the statistics methods did not learn the artificial intelligence methods, and the students who learned the artificial intelligence methods did not learn the statistics. So I see data science as an educational enterprise. We are teaching students both methods."

Beyond the magic

"If you were told by the machine, 'That's how I figured it out', would you be more scared or more relieved?"

That's a question posed by NJIT's Yao Ma, assistant professor of computer science, who studies graph neural networks. His data science dream is for computers to show us their work.

Everyone knows the uncomfortable feeling of an online advertiser or social network that seems to know too much. It could happen when Amazon suggests a product that's embarrassing or clearly made for a different demographic, or when Facebook wants you to befriend someone you're avoiding. When you talk to a friend about possibly having children and then start getting advertisements for diapers, is that a coincidence, or is the Internet spying on you?

Ma develops new algorithms that can be used for transparency. Too often, even software developers don't really know what their code does, he explained. This is a concern of researchers everywhere, whether their fields are described as artificial intelligence, data science or machine learning. The limitation could scare away important users.

"They lack explainability," Ma emphasized. "Algorithms make predictions but they don't tell you how they make the predictions, how they decide. This explainability is a very essential topic, I think, that needs to be handled, especially if we want to apply these kinds of techniques to safety-critical areas. … How can we trust them if we don't know how they make the decisions?"

This content was paid for and created by the New Jersey Institute of Technology. The editorial staff of The Chronicle had no role in its preparation.