May 01, 2024

Biotech Matters: Problems with Life Science Databases in the United States

While humans will retain their ultimate mysteries, many aspects of their traits, diseases, and environment are becoming increasingly tractable. Much of this advance has come from artificial intelligence (AI) algorithms trained on life science databases. These databases should include a variety of data, including deidentified genomes and associated health records, research findings on candidate therapeutics, and metrics about the functioning of different types of cells in the body. Algorithms trained on these databases can reveal important insights including the cause of diseases and optimal strategies for curing them.

In the United States, these databases are too small, too fragmented, and too difficult for legitimate researchers to access and use. This lack of essential databases undercuts U.S. competitiveness against China, which has built a large genomic database as part of a national strategy seeking global leadership in the biotechnologies. The lack of updated rules governing U.S. databases also undercuts the ability of U.S. researchers to work with European Union (EU) partners. These partners are increasingly constrained from sharing data, as EU regulations require much stricter limits on the use of and access to EU data than U.S. researchers can observe.

U.S. policy gaps also pose risks to patients. They leave personally identifiable medical data too vulnerable to cyberattack and leave “deidentified” medical data too vulnerable to being reassociated with the individual the record describes. The ambiguity in the United States about who owns deidentified patient data and how to compensate companies for the use of relevant proprietary data encourages data hoarding. The rules should be changed to incentivize responsible sharing and facilitate the creation of critical databases.

This article recommends three reforms that would enable responsible access to needed data. Specifically, the United States should (1) update laws governing personal medical data, (2) clarify who owns deidentified patient data, and (3) create new authorities for technical experts to set up public-private organizations to acquire, curate, and manage life science data.

AI Transforms Biology

Biology is being transformed from an observational science into more of a classic design, build, and test discipline. Artificial intelligence is foremost among the various technical advances causing this transformation. AI is trained on databases, creating algorithms that can answer fundamental questions.

AI’s contributions to biology include distinguishing between the genomes of individuals with and without a particular disease or trait. These differences help reveal the genetic roots of diseases or traits and lead to faster diagnoses and better treatments.
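As a toy illustration of the comparison described above, the sketch below ranks genetic variants by how differently they appear in people with and without a disease. The variant names and the data are invented for illustration, and real studies rely on far larger cohorts and formal statistical tests rather than a raw difference in frequency.

```python
def carrier_frequency(genomes, variant):
    """Fraction of genomes (each a set of variant IDs) carrying the variant."""
    if not genomes:
        return 0.0
    return sum(variant in genome for genome in genomes) / len(genomes)


def rank_variants(cases, controls):
    """Rank variants by the absolute gap in carrier frequency between
    cases (with the disease) and controls (without it)."""
    variants = set().union(*cases, *controls)
    gaps = {
        v: carrier_frequency(cases, v) - carrier_frequency(controls, v)
        for v in variants
    }
    # Largest gap first; ties broken by variant name for stable output.
    return sorted(gaps.items(), key=lambda item: (-abs(item[1]), item[0]))


# Hypothetical data: each genome is reduced to a set of variant IDs.
cases = [{"v1", "v2"}, {"v1"}, {"v1", "v3"}]
controls = [{"v2"}, {"v3"}, {"v2", "v3"}]

ranking = rank_variants(cases, controls)
top_variant, top_gap = ranking[0]
print(top_variant, round(top_gap, 2))
```

In this invented example, variant v1 appears in every case and in no control, so it ranks first as the strongest candidate for follow-up.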

More complex algorithms can identify the optimal combination of treatments for an individual with a particular combination of the genes that cause a disease. These precision medicine algorithms need databases with deidentified patient genomes and health records—a particularly rare type of searchable database in the United States.

Other algorithms can learn from an expert and then replicate the expert’s performance in a single, narrow task. The Food and Drug Administration has licensed about 500 AI-based tools to perform a single medical task, principally assessing medical images such as X-rays.

New efforts focus on incorporating the newest type of AI—large language models (LLMs) such as ChatGPT—into the life sciences. A medical LLM, with training based on all sorts of medical information, should be able to support medical practitioners by interacting with them in voice or text to answer detailed questions with well-informed responses.

All of these advances are powered by artificial intelligence—but various problems with databases in the United States, particularly those with patient data, will complicate further advances. Three legal reforms are needed.

Patient Data in the Information Age

The first reform calls for updates to the 1996 Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires that personally identifiable health information be protected. This requirement should be preserved and strengthened for the information age. There need to be stronger and more consistent requirements for the security, particularly the cybersecurity, of this personal medical data.

Today, data often remain linked to names and stored in systems vulnerable to hacking, third-party leaks, or dispersion through commercial transactions. Ransomware attacks on hospital systems further demonstrate the vulnerability of these databases.

In addition, the law needs to specify standards to deidentify the data—in other words, to remove the name and identifying information from an individual’s medical sample or record. Once deidentified, information is no longer covered by HIPAA. Deidentified data then can usefully advance life science research while preserving the privacy of an individual’s genome and medical records.


In an age of powerful computers and large name-based datasets, however, deidentified data sometimes can be reassociated with the original identity. This can occur, for example, if the deidentified data is placed in too small a database. Technical standards for robust deidentification should be specified and enforced. Reassociation of an individual’s identity with their genome and medical records in a now-public research database would breach their privacy and could reveal information harmful to them in various contexts like prospective employment.
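One common way to reason about this risk is k-anonymity: every combination of indirectly identifying fields should be shared by at least k records. The sketch below is a minimal illustration, not a standard the article prescribes. The field names (zip3, birth_year, sex) and the records are hypothetical.

```python
from collections import Counter

# Hypothetical quasi-identifiers: fields that are not names but can
# narrow down who a record belongs to when combined.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def _key(record, fields):
    return tuple(record[field] for field in fields)


def smallest_group_size(records, fields=QUASI_IDENTIFIERS):
    """Return k, the size of the smallest group of records sharing the
    same quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(_key(record, fields) for record in records)
    return min(groups.values()) if groups else 0


def risky_records(records, k, fields=QUASI_IDENTIFIERS):
    """Return records whose quasi-identifier combination is shared by
    fewer than k records, making them easier to reassociate."""
    groups = Counter(_key(record, fields) for record in records)
    return [r for r in records if groups[_key(r, fields)] < k]


# Invented records with names already removed.
records = [
    {"zip3": "200", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip3": "200", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"zip3": "200", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip3": "201", "birth_year": 1955, "sex": "M", "diagnosis": "rare disorder"},
]

print(smallest_group_size(records))
print(risky_records(records, k=2))
```

Here the fourth record is the only one with its combination of fields, so anyone holding a name-based dataset with the same fields could plausibly link it back to a person. A larger database tends to make such unique combinations rarer.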

The technical approaches for protecting privacy and assuring deidentification need to be updated regularly as computing and artificial intelligence continue to advance.

Data Marketplaces

The second reform would create incentives to share needed data—instead of the current situation, which creates incentives to hoard it. Specifically, the law needs to resolve questions about who owns deidentified patient data, while creating incentives to encourage the sharing of relevant proprietary data.

HIPAA gives patients the right to review, correct, and receive a copy of their medical information, with a few exceptions. Most observers judge this to mean that individuals own their medical data. There is no federal law that governs patient data at subsequent stages in the process. In most states, the health care providers or the medical facilities own the physical records containing patient data.

These physical records increasingly are contained in electronic health records (EHRs), which are managed by software provided for a fee by private companies. These private companies end up with all the patient data placed in the electronic records by care providers using their software. As most individuals in the United States with health insurance have an electronic health record, there are roughly 250 million EHRs.

Once the data is deidentified, the creator of the deidentified dataset seems to own it. This ownership is merely de facto in some cases; in others it is de jure because of the terms of the contract between the care providers and their software provider. Clarifying ownership of patient data and methods to responsibly deidentify it would enable wider use of the data. The owner of the deidentified data could participate in a marketplace to sell the data.

There are other, smaller sources of patient data, including the databases of commercial gene sequencing companies such as 23andMe and Ancestry. These databases contain the gene sequences of about 26 million individuals. Since the data was provided in the context of a commercial transaction, HIPAA protections do not apply. Public databases include the National Institutes of Health’s high-quality but relatively small All of Us database and the U.S. Department of Veterans Affairs’ Million Veteran Program.

Private companies have been purchasing genetic data and EHRs because of the expectation of profits from new medicines and optimized treatment strategies developed using AI algorithms trained on that data.

For many research purposes, these patient health records would need to be supplemented with other data. Some would be proprietary, such as data about the characteristics of candidate drugs that were screened but not developed into final products by pharmaceutical companies. Others might be underlying datasets used for different but related research purposes—perhaps detailed measurements about the functioning of different types of cells in the human body.

Researcher access to this data would need to be incentivized, probably with a fee to the owners of the data. Many research questions would require information from a variety of data streams, and so acquiring, curating, and then managing the needed data would be a complicated process.

New Authorities for New Organizations

The third legal reform would create new authorities for technical experts to set up public-private organizations to acquire, curate, and manage all these types of data. Creating such organizations is extremely difficult now, given the laws governing the handling of patient data, the lack of marketplaces for deidentified data, and the lack of incentives for sharing proprietary data. Life science research would be accelerated if these new organizations could become one-stop shops for databases required by researchers.

Creating new authorities would allow different groups of technical experts to form organizations that would compete against one another to attract researchers to pay for their services. For example, researchers would need high-quality curation. Curation includes reconciling the different definitions used in separate databases so that the combined inputs can function effectively as one database.

Different researchers, for example, might use entirely different definitions for a “moderate” as opposed to a “severe” presentation of a particular disease. Before the data could be combined into a larger database, the task of reconciling the definitions would need to be performed, ideally by an AI algorithm written for the purpose.
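To make the reconciliation task concrete, the sketch below maps severity labels from two hypothetical source databases onto one shared vocabulary. It uses hand-written lookup tables rather than an AI algorithm, and the source names and grading scales are invented. A real effort would also need clinical review of each mapping.

```python
# Hypothetical per-source rules: each source grades severity on its own
# scale, so each gets its own mapping onto a shared vocabulary.
SEVERITY_MAPS = {
    "clinic_a": {"1": "mild", "2": "moderate", "3": "severe"},
    "clinic_b": {
        "low": "mild",
        "medium": "moderate",
        "high": "severe",
        "critical": "severe",
    },
}


def harmonize(record):
    """Return a copy of the record with severity translated to the shared
    vocabulary. Unmapped labels are flagged for human review, not guessed."""
    source_map = SEVERITY_MAPS.get(record["source"])
    if source_map is None:
        raise KeyError(f"no severity mapping for source {record['source']!r}")
    raw_label = str(record["severity"]).strip().lower()
    harmonized = dict(record)
    harmonized["severity"] = source_map.get(raw_label)
    harmonized["needs_review"] = harmonized["severity"] is None
    return harmonized


# Invented records from the two sources.
combined = [
    harmonize({"source": "clinic_a", "patient": "p1", "severity": 2}),
    harmonize({"source": "clinic_b", "patient": "p2", "severity": "High"}),
    harmonize({"source": "clinic_b", "patient": "p3", "severity": "extreme"}),
]
for row in combined:
    print(row)
```

Flagging the unmapped label instead of guessing keeps a questionable definition from silently contaminating the combined database.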

Working separately, individual researchers often find such curation tasks tricky if not insurmountable. Building tools and expertise on curation within these new organizations would provide a competitive advantage for top-performing organizations and a boon for life science research in the United States.

More clarity about who owns and hence should receive compensation for data would enable access to this data through a marketplace. About $120 billion in public and private money is spent annually on medical research and development. The new organizations urged here should not require much new money in the research ecosystem; they should just spur the allocation of that money in a manner more likely to reward clever strategies to acquire, curate, and manage data in support of worthy research.

These new organizations would have to meet high cyber and physical security standards, given the sensitivity of the data. They would have to assure that each data stream is handled in a manner consistent with its specific restrictions, and they would have to collect fees as appropriate. Restrictions might depend on whether certain data were available only to academic or also commercial researchers, as well as to only domestic or also foreign researchers.
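The kind of per-stream restriction described above can be sketched as a simple policy check. Everything in the example is an assumption made for illustration: the stream names, the fee amounts, and the two restriction flags. It does not describe any existing organization's rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStream:
    name: str
    commercial_allowed: bool  # False means academic researchers only
    foreign_allowed: bool     # False means domestic researchers only
    fee_usd: int              # Hypothetical access fee owed to the data owner


@dataclass(frozen=True)
class Researcher:
    name: str
    commercial: bool
    foreign: bool


def accessible_streams(researcher, streams):
    """Return the streams this researcher may use and the total fee owed."""
    allowed = [
        stream
        for stream in streams
        if (stream.commercial_allowed or not researcher.commercial)
        and (stream.foreign_allowed or not researcher.foreign)
    ]
    return allowed, sum(stream.fee_usd for stream in allowed)


# Invented data streams with different restrictions.
streams = [
    DataStream("public_reference_cells", True, True, 0),
    DataStream("deidentified_ehr", True, False, 500),
    DataStream("academic_only_screening_data", False, True, 200),
]

domestic_company = Researcher("domestic_company", commercial=True, foreign=False)
allowed, total_fee = accessible_streams(domestic_company, streams)
print([stream.name for stream in allowed], total_fee)
```

Keeping the restrictions attached to each stream, rather than to the organization as a whole, lets one organization host data with very different terms of use.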

Resolve Solvable Mysteries

Artificial intelligence could help identify many more life-saving advances and medical innovations if these three reforms are put in place. Current rules make it far too difficult for researchers to access critical life science data in the United States.

The current rules also fail to adequately protect patient data, both when they are personally identifiable and when they are deidentified. The combination of outdated rules and their ambiguity incentivizes data hoarding. The proposed reforms would incentivize making data available and usable by researchers in a responsible manner.

Artificial intelligence has great promise to resolve many medical mysteries. We should strive for a future where the mystery of life continues to puzzle medical researchers, not the solvable issue of who owns medical data and how it can be deidentified.

About the Author

Carol Kuntz, PhD, teaches on the policy implications of artificial intelligence at Georgetown University and conducts research at the Center for Strategic and International Studies. Her most recent report is Genomes: The Era of Purposeful Manipulation Begins. Kuntz is a member of the CNAS BioTech Task Force. She served at the U.S. Department of Defense (DoD) for more than 30 years, providing strategic guidance to incorporate cutting-edge biotechnologies into DoD countermeasure programs, update commercial remote sensing guidelines, shape homeland security programs after the 9/11 terrorist attacks, and craft DoD strategy after the end of the Cold War.


The author is grateful to Vivek Chilukuri, Hannah Kelley, and Maura McCarthy for their valuable feedback and suggestions on earlier drafts of this commentary, as well as to Melody Cook and Rin Rothback for their design support. This commentary series was made possible with general support to CNAS.

As a research and policy institution committed to the highest standards of organizational, intellectual, and personal integrity, CNAS maintains strict intellectual independence and sole editorial direction and control over its ideas, projects, publications, events, and other research activities. CNAS does not take institutional positions on policy issues, and the content of CNAS publications reflects the views of their authors alone. In keeping with its mission and values, CNAS does not engage in lobbying activity and complies fully with all applicable federal, state, and local laws. CNAS will not engage in any representational activities or advocacy on behalf of any entities or interests, and, to the extent that the Center accepts funding from non-U.S. sources, its activities will be limited to bona fide scholastic, academic, and research-related activities, consistent with applicable federal law. The Center publicly acknowledges on its website annually all donors who contribute.

  1. Anna Puglisi, China’s Hybrid Economy: What to Do about BGI? (Washington, DC: Center for Security and Emerging Technology, 2024).
  2. National Academies of Sciences, Engineering, and Medicine, Biodefense in the Age of Synthetic Biology (Washington, DC: The National Academies Press, 2018), 9.
  3. National Academy of Sciences, Heritable Human Genome Editing (Washington, DC: The National Academies Press, 2020).
  4. Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials et al., “Omics-Based Clinical Discovery: Science, Technology, and Applications,” chap. 2 in Evolution of Translational Omics: Lessons Learned and the Path Forward, eds. Christine M. Micheel, Sharly J. Nass, and Gilbert S. Omenn (Washington, DC: National Academy of Sciences, 2012); Danielle Whicher et al., Health Data Sharing to Support Better Outcomes (Washington, DC: National Academy of Medicine, 2021).
  5. Mariana Lenharo, “An AI Revolution Is Brewing in Medicine: What Will It Look Like?” Nature 622, no. 7984 (October 26, 2023): 686–88.
  6. M. Moor et al., “Foundation Models for Generalist Medical Artificial Intelligence,” Nature 616, no. 7956 (2023): 259–65.
  7. Steve Alder, “Editorial: Why Do Criminals Target Medical Records,” The HIPAA Journal, November 2, 2023; Kara Swisher, “What Is 23andMe Doing with Your DNA?” The New York Times, September 20, 2021.
  8. Rebecca Carballo, “Ransomware Attack Disrupts Health Care Services in at Least Three States,” The New York Times, August 5, 2023.
  9. “Your Rights under HIPAA,” U.S. Department of Health and Human Services, accessed November 5, 2023.
  10. The two largest private sector providers of EHRs are Epic, which handles about 28 percent of hospitals, and Oracle Cerner, which handles about 26 percent of hospitals. The rest of the market is split among a handful of other companies. Much information in the Epic EHR and many other EHRs is not searchable because it is in PDF format. Jeff Green, “Who Are the Largest EHR Vendors?” EHR in Practice, March 16, 2023; Alex Milinovich and Michael W. Kattan, “Extracting and Utilizing Electronic Health Data from Epic for Research,” Annals of Translational Medicine 6, no. 3 (February 2, 2018); and Katie Jennings, “The Billionaire Who Really Controls Your Medical Records: Epic Founder Judy Faulkner,” Forbes, May 31, 2001.
  11. Raj Sharma, “Who Really Owns Your Health Data?” Forbes, April 23, 2018; Kyle Jones, “Doctor or Patient? Who Owns Medical Records,” Fresh Perspectives: New Docs in Practice (blog), American Academy of Family Physicians, accessed December 12, 2023; K. Royal, “Who Owns Patient Medical Records?” The Journal of Urgent Care Medicine, March 27, 2017; David W. Parke II, “Medical Record Ownership and Access,” American Academy of Ophthalmology, June 2019.
  12. Antonio Regalado, “More Than 26 Million People Have Taken an At-Home Ancestry Test,” MIT Technology Review, February 11, 2019.
  13. “All of Us” has a goal of 1 million records; it has 502,000 participants who have completed initial steps; it has released 250,000 full genomes (3 billion base-pair) for research. See Katie Palmer, “With a New Center, All of Us Tackles Health Data Silos to Power Precision Medicine,” STAT, November 2, 2023. See also the “All of Us” website. The largest government database with genomic data and health records is the Million Veteran Program. The database is available to VA-employed researchers with only limited exceptions.
  14. Smart money has had a recent interest in health data: the Blackstone Group bought Ancestry in 2020, Oracle bought Cerner in 2022, GSK bought anonymous DNA data from 23andMe in 2023, and Amazon bought OneMedical in 2023. See the discussion of prospects for profits from life science databases in Matthew Ponsford, “Is Your DNA Data Safe in Blackstone’s Hands?” January 28, 2021; Ben Hirschler, “Focus: Cashing in on DNA: Race On to Unlock Value in Genetic Data,” Reuters, August 3, 2018.
  15. See especially: “Panel Discussion on the Pros and Cons of Consortia and Large Databases” in National Academies of Sciences, Engineering, and Medicine, “Large Databases and Consortia,” chap. 8 in Next Steps for Functional Genomics: Proceedings of a Workshop (Washington, DC: National Academies Press, 2000), 123. The government should enable the creation of such consortia as opposed to overseeing the creation itself. See further development of these views in Carol Kuntz, Genomes: The Era of Purposeful Manipulation Begins (Washington, DC: Center for Strategic and International Studies, July 2022).
  16. See National Research Council, “Challenge Problems in Bioinformatics and Computational Biology from Other Reports,” part B in Catalyzing Inquiry at the Interface of Computing and Biology, eds. John C. Wooley and Herbert S. Lin (Washington, DC: National Academies Press, 2005). See also Marissa Mock et al., “AI Can Help to Speed Up Drug Discovery—But Only If We Give It the Right Data,” Nature 621 (September 2023), 467–70.
  17. The National Institutes of Health spent $40 billion on basic research in 2019, while the pharmaceutical industry spent about $83 billion on later-stage research and development. Research and Development in the Pharmaceutical Industry (Washington, DC: Congressional Budget Office, April 2021).


