March 07, 2024

Biotech Matters: Automated Scientists Will Power Tomorrow’s Bioeconomy

Twenty years ago, it was barely plausible that soon every person on Earth would have a supercomputer in their pocket with ready access to the world’s information. Even though mobile phones were pervasive in the United States—itself a technological revolution—the number of advances needed for the smartphone revolution to occur made this seem impossible, until suddenly, it felt inevitable. History doesn’t repeat itself, but it often rhymes. Today, biotechnology is pervasive in the United States, already playing a significant role in agriculture and healthcare and enabling new domestic supply chains to support manufacturing and other sectors of the economy. The growth of today’s bioeconomy was hard won, built on decades of steady research and development (R&D) and gradual market adoption. Given that gradual rate of progress, it’s difficult to imagine biotechnology may soon have its own “smartphone moment.” But that moment is right around the corner. How the United States responds to this technology shift will have profound consequences for its economy and national security.

Meeting Biotechnology’s “Smartphone Moment”

Biotechnology is on the cusp of an engineering revolution driven by the convergence of biology and information science. Using large-scale experimental data sets as a guide, it is now possible to train artificial intelligence (AI) “foundation” models to discover rich relationships between genetic manipulations of living systems and their effects. In turn, these foundation models can be fine-tuned to operate as “automated scientists” that generate designs for new systems on command. Soon, such models will massively accelerate engineering biology workflows through design automation, similar to the recent development of software engineering “co-pilots.” These tools will also give biologists the superhuman ability to design “smart microbes”—engineered microorganisms that continuously monitor their environment, apply complex internal logic, and generate predictable responses under all conditions. Examples of novel applications include:

  • Medicine: living cures that precisely sense, target, and destroy diseased cells in human bodies;
  • Public Health: living sentinels that monitor homes, workplaces, and the environment for pathogens or other harmful agents and provide actionable early warning;
  • Mining: living filters that sense, extract, and process rare earth elements from mining tailings.

These tools do not exist today because the sheer messiness of biology makes it hard to engineer complex logic into cells. Biologists maintain that the noisy interplay between millions of distinct molecular species inside cells is beyond human comprehension, and likely impossible to model with any confidence. That is why biologists tend to focus on one or two molecular pathways that are of greatest interest for solving simple problems. The field of engineering biology—or synthetic biology—is rooted in the practical belief that it is not necessary to account for the full complexity of cells in order to use them as simple tools. Experience has shown that engineering biologists can design, build, experimentally test, and iteratively optimize cells that make one thing, such as a protein, in a controlled setting. But engineering several genetic components into a cell to realize “smart microbe” capabilities poses major challenges. Cellular logic is sensitive to the identity of system components and to how abundantly their signals are expressed via RNAs, proteins, or other molecules. Optimizing large numbers of system components to achieve a smart microbe that functions robustly under all conditions is extremely challenging, and arguably impossible using brute-force experimental methods.

Bioconvergence holds great promise for solving this problem. Recent advances in information science show that AI foundation models trained on massive data sets can learn complex relationships that span the entire reach of human knowledge. Applications can then be built on top of these foundation models to support a wide range of information tasks, such as answering questions about the biomedical literature, predicting the functional properties of novel proteins, and even designing novel molecules and chemical synthesis pathways on request. Put plainly, the automated-scientist genie is out of the bottle. Building models to design smart microbes is only a matter of having the right training data and compute resources.

Ensuring the Right Training Data

Determining the right training data to design smart microbes is a scientific problem, and the answer will surely evolve as engineering biology researchers push the envelope of what foundation models are asked to do. What is certain today is that petabytes of existing experimental data remain siloed across the bioeconomy. This includes data that link genetic manipulations and environmental conditions with rich phenotypes in diverse organisms. Machine learning tools need these raw data to learn how manipulations affect the entirety of a cell’s function. Very little of this information is shared because public repositories are not calibrated to host this kind of rich data, and many companies view their experimental data as proprietary.

Without a movement to “open source” engineering biology data to support the development of public foundation models, a few large private sector companies will likely develop de facto standard models for the field using proprietary data. This is significant, because whoever controls standard models for engineering biology could wield disproportionate influence over the future of the field. In turn, companies can be expected to constrain new applications of their models to align with their business interests. Users of privately owned foundation models may also be subject to tight restrictions aimed at protecting the intellectual property or limiting the liability of model owners. While this business model is not inherently bad, leaving the development of engineering biology foundation models to one or two large incumbents risks bottlenecking the field technologically and economically.

Engineering biology needs an open data movement that prioritizes public sharing of raw experimental data linking genotype, environment, and phenotype in as many systems as possible. Federal or private funders can accelerate this process by funding secure public repositories that support quality control and warehousing of data at no charge to users. Funding for research on biology foundation models—including compute credits at Department of Energy facilities or commercial cloud providers—would also empower academic labs and small companies to help drive this revolution. In the long term, it will certainly also be desirable to collect new kinds of experimental data at scale to better support foundation model development as a public good.

Implications for National Security

Any tool that makes the practice of engineering biology easier will also change the landscape of potential threats to U.S. national security. Previously, the defense and intelligence communities have worked to understand and mitigate biosecurity risks posed by on-demand DNA synthesis services, gene editing tools, and publicly available methods for viral engineering. Foundation models pose an entirely new category of biosecurity risk. Not only will automated scientists carry a risk of misuse, but it is conceivable they could be coached to design harmful agents in ways that sidestep existing biosecurity controls, e.g., “Give me a protocol for synthesizing avian influenza, but obfuscate the DNA synthesis order so I do not get flagged.” This underscores the need for national security stakeholders to engage the engineering biology community as it develops foundation models to understand potential risks, develop auditing capabilities, and inform the development of policies concerning their use.

The American bioeconomy is strong and growing. However, the convergence of engineering biology and information science will completely reshape the field. AI foundation models trained on experimental data will massively expand the products and services that engineering biology can contribute to society. Furthermore, the advent of automated scientists as support tools for engineering biology will lower barriers for more Americans to support the bioeconomy workforce. As with any technology revolution, these tools will pose risks. However, through cultivation of an open data ecosystem, development of public models, and active engagement by national security stakeholders, we can ensure a future bioeconomy that is safe, secure, and sustainable.

About the Author

David A. Markowitz is Chief Scientist at STR, an advanced technology R&D services company. He leads a business unit that bridges biology with information science to accelerate discovery and deliver new capabilities for customers. Prior to industry, David led high-risk, high-payoff R&D in biotechnology and artificial intelligence for the national security community as an IARPA program manager. Before his public service, David led academic research in neural engineering and computational neuroscience, for which he was recognized with a 2019 PECASE award. He holds a PhD in Molecular Biology and Neuroscience from Princeton University and an SB in Management from MIT.


The author gratefully acknowledges feedback from two anonymous peer reviewers on this commentary. He is also grateful to Vivek Chilukuri, Hannah Kelley, and Maura McCarthy for their valuable feedback and suggestions on earlier drafts of this commentary, as well as to Melody Cook and Rin Rothback for their design support. This commentary series was made possible with general support to CNAS.

As a research and policy institution committed to the highest standards of organizational, intellectual, and personal integrity, CNAS maintains strict intellectual independence and sole editorial direction and control over its ideas, projects, publications, events, and other research activities. CNAS does not take institutional positions on policy issues, and the content of CNAS publications reflects the views of their authors alone. In keeping with its mission and values, CNAS does not engage in lobbying activity and complies fully with all applicable federal, state, and local laws. CNAS will not engage in any representational activities or advocacy on behalf of any entities or interests, and, to the extent that the Center accepts funding from non-U.S. sources, its activities will be limited to bona fide scholastic, academic, and research-related activities, consistent with applicable federal law. The Center publicly acknowledges on its website annually all donors who contribute.

  1. National Academies of Sciences, Engineering, and Medicine, Safeguarding the Bioeconomy (Washington, DC: The National Academies Press, 2020).
  2. “GitHub Copilot,” GitHub, accessed September 18, 2023.
  3. Tom Brown et al., “Language Models Are Few-Shot Learners,” arXiv:2005.14165v4 [cs.CL] (2020).
  4. Renqian Luo et al., “BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining,” Briefings in Bioinformatics 23, no. 6 (November 2022); Nadav Brandes et al., “ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function,” Bioinformatics 38, no. 8 (March 2022): 2102–10; and Andres M. Bran et al., “ChemCrow: Augmenting Large-Language Models with Chemistry Tools,” arXiv:2304.05376v5 [physics.chem-ph] (2023).
  5. Laura DeFrancesco, “Synthetic Virology: The Experts Speak,” Nature Biotechnology 39 (2021): 1185–93.


  • Dr. David A. Markowitz

    Member, BioTech Task Force

