Diagnosing data corruption
Open Compute Project Foundation supporting ASU researcher’s work to improve microchip quality assessments
As demand for data usage and storage grows exponentially, the scale, capabilities and applications of data storage have grown in tandem. After nearly 40 years of technological progress, a 128-gigabyte micro SD card used in today’s smartphones can store as much data as two semitrucks of 8-pound Priam hard drives from 1985.
All data must be stored in a physical location, either within the device itself or at a remote data center that the device accesses. The data contained in smartphone apps, books, games and movies are often stored physically in warehouses, known as cloud data centers, owned by tech companies.
A single data center can contain hundreds of thousands of servers. Most cloud computing is performed on these servers, each of which contains complex microprocessor chips built with semiconductor technologies.
In recent years, cloud data centers have faced spontaneous and undetectable chip failure known as silent data corruption, or SDC, which occurs when a central processing unit inadvertently causes errors while processing data.
Most SDCs are not yet traceable by software, meaning data is often processed incorrectly and lost without any indication of the cause, says Krishnendu Chakrabarty, the Fulton Professor of Microelectronics in the School of Electrical, Computer and Energy Engineering, part of the Ira A. Fulton Schools of Engineering at Arizona State University.
“In the old days, one defective part per million was considered great because that meant a system might fail maybe once in three years,” he says. “But now, one cloud data center can have up to a million of these servers running simultaneously, meaning at least one of those parts could fail at any given moment.”
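The arithmetic behind that comparison can be sketched in a few lines of Python. This is a rough illustration only, under assumptions that are not from the article: each server is counted as a single part, and parts are defective independently at a rate of one per million. The function names are hypothetical.

```python
import math


def expected_defects(defect_rate: float, parts: int) -> float:
    """Expected number of defective parts among `parts` independent parts."""
    return defect_rate * parts


def prob_at_least_one(defect_rate: float, parts: int) -> float:
    """Probability that at least one part is defective: 1 - (1 - p)^n.

    log1p and expm1 keep the result accurate when p is very small.
    """
    return -math.expm1(parts * math.log1p(-defect_rate))


# Assumed defect rate: one defective part per million.
RATE = 1e-6

for fleet_size in (1_000, 1_000_000):
    print(
        f"{fleet_size:>9} parts: "
        f"expected defects = {expected_defects(RATE, fleet_size):.3f}, "
        f"P(at least one) = {prob_at_least_one(RATE, fleet_size):.3f}"
    )
```

Under these assumptions, a fleet of a thousand parts will almost certainly contain no defective part, while a fleet of a million is expected to contain about one, with a roughly 63 percent chance of containing at least one.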
Several large tech companies came together to address this issue by establishing the Open Compute Project Foundation in 2011. The consortium supports collaboration between foundries and university researchers to develop innovative data center design solutions.
To improve quality control testing for SDCs, the Open Compute Project Foundation has given Chakrabarty its Driving Innovation in SDC Mitigation Award, which supports his development of models based on generative artificial intelligence, or AI, techniques.
“A ‘one in a million’ failure is far too many,” he says.
Troubleshooting testing methodology
In industries such as health care, security and finance, cloud data centers manage an unfathomable amount of confidential user data. Chakrabarty and his team at the ASU Center for Semiconductor Microelectronics, or ACME, are working to secure the data by troubleshooting the issue at its source.
During quality assurance testing, the chip is evaluated for its ability to accurately perform tasks with known solutions, similar to asking, “What is two plus two?” Chips that respond correctly, such as answering “four,” are then approved, packaged and shipped to vendors.
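The “two plus two” style of check described above can be sketched as a simple known-answer test in Python. This is a toy illustration, not the actual test flow used by chipmakers: the adder functions, the test cases and the injected defect are all hypothetical. It also shows how a defect that only appears for one specific input can pass a short list of checks.

```python
from typing import Callable, List, Tuple

# Each case is (operand a, operand b, expected result).
KnownAnswer = Tuple[int, int, int]


def run_known_answer_test(
    unit_under_test: Callable[[int, int], int],
    cases: List[KnownAnswer],
) -> List[Tuple[int, int, int, int]]:
    """Return (a, b, expected, actual) for every case the unit gets wrong."""
    failures = []
    for a, b, expected in cases:
        actual = unit_under_test(a, b)
        if actual != expected:
            failures.append((a, b, expected, actual))
    return failures


def healthy_adder(a: int, b: int) -> int:
    """Stand-in for a defect-free chip."""
    return a + b


def faulty_adder(a: int, b: int) -> int:
    """Stand-in for a chip with a hypothetical input-dependent defect."""
    result = a + b
    if a == 1023 and b == 1:
        result ^= 0x400  # flip one bit, but only for this input pair
    return result


# Hypothetical test list: "What is two plus two?" and a few more.
CASES = [(2, 2, 4), (0, 0, 0), (7, 8, 15)]

print(run_known_answer_test(healthy_adder, CASES))
print(run_known_answer_test(faulty_adder, CASES))
print(run_known_answer_test(faulty_adder, CASES + [(1023, 1, 1024)]))
```

In this sketch the faulty adder passes the short test list and fails only once the triggering input is added, which is one way a chip could be approved and shipped while still producing wrong results in use.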
Since SDCs do not occur when testing the chip in a standalone setting, Chakrabarty is collaborating with Intel and ARM to introduce functional testing at earlier stages that simulates the environment in which the chips operate during use. In doing so, he will redesign the system to incorporate onboard sensors that check for and correct errors as they occur.
Beyond detecting underperforming chips, Chakrabarty plans to develop a machine learning algorithm to understand the cause of the failures and identify which stimuli or sequences of inputs lead a system to fail.
“The goal is not just to throw out the bad parts, but to see if we can learn from that and use that information as feedback to improve the chip manufacturing process,” Chakrabarty says. “We want to be able to go back to the fabricator and say, ‘look, this step in production needs to be adjusted’ so the foundry can make the necessary changes and improve the yield.”
Farshad Firouzi, an electrical engineering research scientist in the School of Electrical, Computer and Energy Engineering and a collaborator on the project with Chakrabarty, says the method is a breakthrough.
“Using large language models is a new approach, and we are among the first to apply this new technology in this context,” Firouzi says.
ACME is welcoming a full-time employee from Google this fall who will study to earn an electrical engineering doctoral degree and specialize in the work ACME is developing.
Matchmaking microelectronic masterpieces
Chakrabarty joins researchers at other renowned engineering universities, such as Carnegie Mellon and Stanford, in receiving support for innovative approaches to using AI to detect and diagnose the causes of SDCs.
“We are competing with the best of the best,” he says. “It’s a lot of pressure, but I enjoy the responsibility.”
Chakrabarty has a long history of collaborating with major foundries and government agencies to deliver important research results. He is currently the chief technology officer at the Southwest Advanced Prototyping Hub, or the SWAP Hub, an ASU-led and U.S. Department of Defense-funded consortium geared toward developing an ecosystem for advancing the prototyping, fabricating and packaging of microelectronics.
Chakrabarty facilitates connections at the SWAP Hub for more than 150 small businesses, universities and large companies, as well as countless stakeholders. He describes his work as an exercise in matchmaking and notes that the interdisciplinary and collaborative environment has enhanced his perspective toward his own work.
“Some days I feel like I’m back in graduate school because I’m learning new things every day,” he says. “Researchers can get very narrowly focused on their research topics, but in the SWAP Hub, I get to learn how different technologies cater to multiple disciplines and can compare different approaches. It’s a lot of fun.”
As Chakrabarty gears up to conduct the OCP-sponsored research, he is looking to recruit more students to work in the ACME Center, which he describes as a large ecosystem featuring a global perspective that emphasizes education and growth.
“Any student who is doing research with us in ACME will get a chance to work with industry, apply research and see that research being used in meaningful ways.”
Students interested in joining ACME can send their resume to [email protected].