Rice University computer scientists have identified an inexpensive way for tech companies to implement a stringent form of personal data protection when using or sharing large databases for machine learning.
“There are many cases where machine learning could benefit society if data privacy could be ensured,” said Anshumali Shrivastava, an associate professor of computer science at Rice.
“There’s huge potential for improving medical treatments or finding patterns of discrimination, for example, if we could train machine learning systems to search for patterns in large databases of medical or financial records. Today, that’s essentially impossible because data privacy methods do not scale.”
With a new strategy they’ll present this week at CCS 2021, the Association for Computing Machinery’s annual flagship conference on computer and communications security, Shrivastava and Rice graduate student Ben Coleman seek to change that.
Shrivastava and Coleman discovered they could construct a small summary of an enormous database of sensitive records using a technique called locality sensitive hashing. Their method, dubbed RACE, takes its name from these summaries, which are known as “repeated array of count estimators” sketches.
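The idea can be illustrated with a toy sketch (this is a simplified illustration of the general RACE approach, not the authors' implementation): each row of a small count array is indexed by a locality sensitive hash, here signed random projections, so that similar records collide in the same bucket and the counts summarize how much of the data lies near any query.

```python
import numpy as np

rng = np.random.default_rng(0)

class RACESketch:
    """Toy RACE-style sketch (illustrative, not the paper's code).

    Each of `rows` repetitions holds a count array indexed by a
    signed-random-projection LSH code. Averaging the counts found at a
    query's buckets estimates how many records are similar to the query.
    """

    def __init__(self, dim, rows=50, bits=4):
        self.rows, self.bits = rows, bits
        # One random hyperplane per bit per row (SRP LSH).
        self.planes = rng.standard_normal((rows, bits, dim))
        self.counts = np.zeros((rows, 2 ** bits))

    def _hash(self, x):
        # Sign pattern of the projections -> one integer bucket per row.
        signs = (np.einsum('rbd,d->rb', self.planes, x) > 0).astype(int)
        return signs @ (1 << np.arange(self.bits))

    def add(self, x):
        # Increment one bucket in every row for each record.
        self.counts[np.arange(self.rows), self._hash(x)] += 1

    def query(self, x):
        # Mean collision count across rows ~ similarity mass near x.
        return self.counts[np.arange(self.rows), self._hash(x)].mean()
```

The key property is that the sketch's size depends on the number of rows and buckets, not on the number of records, which is what makes it cheap to share.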
RACE sketches, according to Coleman, are both safe to make public and useful for algorithms that use kernel sums, one of machine learning’s core building blocks, and for machine-learning programs that perform common tasks like classification, ranking, and regression analysis.
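For readers unfamiliar with the term, a kernel sum totals the similarity of a query point to every record in a dataset. A minimal sketch with a Gaussian kernel (an illustrative example, not tied to the paper's specific kernels) shows why the exact computation is expensive: it must touch all n records, which is what RACE-style sketches avoid.

```python
import numpy as np

def kernel_sum(query, data, bandwidth=1.0):
    """Exact Gaussian kernel sum: total similarity of `query` to every
    row of `data`. Cost is O(n * dim) per query, since every record
    must be visited."""
    sq_dists = np.sum((data - query) ** 2, axis=1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).sum()
```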
RACE, he added, may allow businesses to profit from large-scale, distributed machine learning while also maintaining a strict kind of data protection known as differential privacy.
Differential privacy, which is used by a number of computer giants, is based on the principle of obscuring individual information with random noise.
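The standard way to achieve this is the Laplace mechanism: add random noise scaled to the query's sensitivity divided by the privacy budget epsilon. A minimal sketch (generic textbook mechanism, not the paper's specific construction):

```python
import numpy as np

def laplace_release(true_count, sensitivity=1.0, epsilon=0.5, rng=None):
    """Release a numeric query result with epsilon-differential privacy
    by adding Laplace noise with scale sensitivity / epsilon.
    Smaller epsilon means more noise and stronger privacy."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise
```

A single individual's presence changes a count by at most 1 (the sensitivity), so the noise masks any one person's contribution while leaving aggregate statistics usable.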
“There are elegant and powerful techniques to meet differential privacy standards today, but none of them scale,” Coleman said. “The computational overhead and the memory requirements grow exponentially as data becomes more dimensional.”
Data is becoming increasingly high-dimensional, meaning it contains both a large number of observations and a large number of distinct features describing each observation.
He explained that RACE is used to create sketches of high-dimensional data. The sketches are compact and easy to distribute, as are the computational and memory requirements for creating them.
“Engineers today must either sacrifice their budget or the privacy of their users if they wish to use kernel sums,” Shrivastava said. “RACE changes the economics of releasing high-dimensional information with differential privacy. It’s simple, fast and 100 times less expensive to run than existing methods.”
Shrivastava and his students have previously developed a number of algorithmic methods to make machine learning and data science faster and more scalable.
They and their collaborators have found a more efficient way for social media companies to keep misinformation from spreading online, shown how to train large-scale deep learning systems up to 10 times faster for “extreme classification” problems, developed a more accurate and efficient way to estimate the number of identified victims killed in the Syrian civil war, and demonstrated that deep neural networks can be trained up to 15 times faster on general-purpose CPUs.
The research was funded by the Basic Research Challenge program of the Office of Naval Research, the National Science Foundation, the Air Force Office of Scientific Research, and Adobe Inc.