School of Public Health leads development of database linkage tool that strengthens both data privacy and research accuracy

Prototype limits use of personal data and tracks linked data access

Data privacy or data utility? In the world of population data management, that’s the big tradeoff, according to public health experts.

“For example, contact tracing is critical to saving lives during outbreaks like COVID-19, but without the right privacy safeguards, it can also cross the line into violating personal privacy,” said Hye-Chung Kum, PhD, a data scientist with the Texas A&M University School of Public Health and director of the university’s Population Informatics Lab. “Prioritizing one means compromising the other, so organizations seek a workable balance.”

In data linkage, computers use details like names and birth dates to connect information about one person from multiple sources. These links provide insights that wouldn’t be possible from any single dataset alone.
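As a rough illustration of the idea, the sketch below links two small, made-up record sets on an exact match of normalized name plus birth date. The field names (`id`, `name`, `dob`), the sample records and the helper functions are hypothetical and are not taken from the study; real linkage systems use far more tolerant comparisons.

```python
from collections import defaultdict

def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(name.lower().split())

def link_records(source_a, source_b):
    """Pair records from two sources that share a normalized name and birth date."""
    # Index the second source by its linkage key for fast lookup.
    index = defaultdict(list)
    for rec in source_b:
        index[(normalize(rec["name"]), rec["dob"])].append(rec)

    links = []
    for rec in source_a:
        key = (normalize(rec["name"]), rec["dob"])
        for match in index.get(key, []):
            links.append((rec["id"], match["id"]))
    return links

# Made-up example data (not from the study).
clinic = [
    {"id": "A1", "name": "Maria  Lopez", "dob": "1980-02-14"},
    {"id": "A2", "name": "John Smith", "dob": "1975-07-01"},
]
survey = [
    {"id": "B7", "name": "maria lopez", "dob": "1980-02-14"},
    {"id": "B9", "name": "Jon Smith", "dob": "1975-07-01"},
]

links = link_records(clinic, survey)
print(links)
```

In this toy data, the spelling "Jon Smith" keeps the second pair from linking even though the birth dates agree, which is the kind of case an exact-match rule cannot settle on its own.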

Like others in the information technology world, Kum and her multidisciplinary, multinational team believe the computer-only approach falls short. Names and other details can be incorrect or recorded differently from one source to the next, yet those same personal details are needed to combine different sources of data.

“People still must manually check the data and correct errors to make sure everything is linked correctly,” Kum said. “As a result, doing research with these data can be difficult, and the results are not always the best quality or easy to reproduce—important obstacles in health care, where lives are on the line.”

She said hybrid human–computer systems can help overcome these obstacles. And now, her team has developed, validated and tested one of the first hybrid human–computer interactive record linkage systems that explicitly protects privacy while maximizing usefulness. This software prototype is called MiNDFIRL, or Minimum Necessary Disclosure For Interactive Record Linkage.

Their research was funded by the Patient-Centered Outcomes Research Institute and published in the International Journal of Medical Informatics.

“MiNDFIRL protects privacy in keeping with laws such as HIPAA by using personal data on an as-needed basis only,” Kum said. “It also maintains data integrity and ensures regulatory compliance by keeping a record of who uses the linked data.”
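The two behaviors Kum describes, revealing personal data only as needed and recording who accessed it, can be pictured with a minimal sketch. This is not MiNDFIRL's code: the `MaskedPair` class, its methods, the field names and the sample values are all hypothetical, chosen only to show the general pattern.

```python
from datetime import datetime, timezone

class MaskedPair:
    """Hypothetical record pair whose fields stay masked until a reviewer asks for them."""

    def __init__(self, record_a, record_b):
        self._a = record_a
        self._b = record_b
        self.revealed = set()   # fields disclosed so far
        self.audit_log = []     # one entry per disclosure

    def view(self):
        """Return both records with every undisclosed field masked."""
        def mask(rec):
            return {k: (v if k in self.revealed else "***") for k, v in rec.items()}
        return mask(self._a), mask(self._b)

    def reveal(self, field, user):
        """Disclose one field for this pair and log who requested it and when."""
        if field not in self._a:
            raise KeyError(field)
        self.revealed.add(field)
        self.audit_log.append({
            "user": user,
            "field": field,
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def disclosure_rate(self):
        """Fraction of this pair's fields that have been disclosed."""
        return len(self.revealed) / len(self._a)

# Made-up example pair (not from the study).
pair = MaskedPair(
    {"first_name": "Maria", "last_name": "Lopez", "dob": "1980-02-14", "email": "mlopez@example.org"},
    {"first_name": "Mariah", "last_name": "Lopez", "dob": "1980-02-14", "email": "mlopez@example.org"},
)
pair.reveal("first_name", user="reviewer_01")
masked_a, masked_b = pair.view()
print(masked_a, pair.disclosure_rate(), len(pair.audit_log))
```

Here the reviewer sees one of four fields, so the disclosure rate for this pair is 0.25, and the log holds a single entry naming the reviewer and the field.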

The 11 researchers validated MiNDFIRL with real-world case studies on electronic health record and patient-generated data from the University of Texas at Houston and the University of Alabama at Birmingham.

Both studies were deduplication studies, which link a dataset to itself to identify duplicate records. The studies covered a total of 10,000 record pairs from electronic health records and 18,240 patient IDs from patient-generated data.

Random Forest was found to be the best machine learning algorithm for the task. It identified 388 matches in the electronic health record data and 539 in the patient-generated data, and flagged an additional 303 and 187 potential pairs, respectively, for manual review. For those cases, the researchers used MiNDFIRL to make the final call, confirming 232 and 84 more matches.
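One common way to split pairs into automatic matches, automatic non-matches and a manual-review group is to apply two cutoffs to a classifier's match score. The article does not say how the team drew these lines, so the sketch below is an assumption throughout: the `triage` function, the 0.2 and 0.8 cutoffs and the sample scores are hypothetical.

```python
def triage(scored_pairs, lower=0.2, upper=0.8):
    """Sort (pair_id, score) tuples into match, review and non_match buckets.

    The cutoffs are illustrative assumptions, not values from the study.
    """
    buckets = {"match": [], "review": [], "non_match": []}
    for pair_id, score in scored_pairs:
        if score >= upper:
            buckets["match"].append(pair_id)        # confident enough to accept automatically
        elif score <= lower:
            buckets["non_match"].append(pair_id)    # confident enough to reject automatically
        else:
            buckets["review"].append(pair_id)       # uncertain: send to a human reviewer
    return buckets

# Made-up scores standing in for a classifier's match probabilities.
scored = [("p1", 0.95), ("p2", 0.50), ("p3", 0.05), ("p4", 0.80)]
result = triage(scored)
print(result)
```

Only the pairs in the `review` bucket would need a person to look at any personal details.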

“MiNDFIRL needed only 30% of personal identifying information in each pair to determine if the records were a match or not,” Kum said. “In 77% of the 303 pairs of electronic health records and 45% of the 187 pairs of patient-generated data, it confirmed that the two records referred to the same person.”
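The percentages Kum cites follow from the counts reported above, as a short calculation shows. The combined totals assume that the matches confirmed in manual review are in addition to the automatically identified ones, which is how the article's "more matches" reads; the variable names are only for illustration.

```python
# Counts reported in the article.
automatic = {"ehr": 388, "patient_generated": 539}   # matches found by Random Forest
reviewed = {"ehr": 303, "patient_generated": 187}    # pairs sent to manual review
confirmed = {"ehr": 232, "patient_generated": 84}    # matches confirmed with MiNDFIRL

# Share of manually reviewed pairs confirmed as matches, as whole percentages.
rates = {k: round(100 * confirmed[k] / reviewed[k]) for k in reviewed}

# Assumption: confirmed matches add to the automatic ones.
totals = {k: automatic[k] + confirmed[k] for k in reviewed}

print(rates)   # {'ehr': 77, 'patient_generated': 45}
print(totals)  # {'ehr': 620, 'patient_generated': 623}
```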

The system primarily used first names and email addresses, which Kum said indicates these are highly reliable identifiers for confirming whether two records refer to the same person or to two different people with similar names.

“While there is no one-size-fits-all answer, MiNDFIRL was found to provide high-quality linkage in real-world settings,” Kum said.

Other researchers with the Population Informatics Lab were Gurudev Ilangovan, MS; Qinbo Li, PhD; Alva G. Ferdinand, DrPH, JD; Mahin Ramezani, MS; Theodoros Giannouchos, PhD; and Cason Schmit, JD. Kum, Ilangovan, Li and Ramezani also are affiliated with the Texas A&M Department of Computer Science and Engineering.

Other researchers were with the University of Florida, the University of Alabama at Birmingham, the University of Texas Health Science Center at Houston, and Canada’s University of Calgary and Alberta Health Services.

Media contact: media@tamu.edu
