Informing Acquisition Functions via Foundation Models for Molecular Discovery
Dec 15, 2025 · 7:56
Machine Learning · Artificial Intelligence · q-bio.QM

Abstract

Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.

Summary

This paper addresses the challenge of accelerating molecular discovery with Bayesian Optimization (BO) in low-data regimes over vast candidate spaces. Traditional BO methods rely on surrogate models, which struggle with limited initial data and with the high-dimensional features produced by large language models (LLMs) and chemistry foundation models. The authors propose a novel likelihood-free BO method called LLMAT (LLM-guided Acquisition Tree) that bypasses explicit surrogate modeling and directly leverages priors from LLMs and foundation models to inform acquisition functions (AFs). LLMAT learns a tree-structured partition of the molecular search space, with a local AF learned at each node, enabling efficient candidate selection via Monte Carlo Tree Search (MCTS). It further incorporates coarse-grained LLM-based clustering to improve scalability by restricting AF evaluations to clusters with statistically higher property values.

LLMAT's key contributions are:

  • a likelihood-free BO method that leverages LLMs and foundation models to inform AFs and learns a tree-structured partition of the molecular search space;
  • a meta-learning approach that trains shared binary classifiers to stabilize the partitioning and the local AFs in low-data regimes; and
  • an LLM-guided pre-clustering strategy with statistical cluster selection that reduces computational cost and improves BO performance.

The method is evaluated on six real-world chemistry datasets, demonstrating improved scalability, robustness, and sample efficiency compared to existing BO methods. LLMAT achieves superior performance with both fixed-feature and fine-tuned models, highlighting the value of pairing general LLMs and domain-specific foundation models with principled algorithmic design for molecular discovery.
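To make the surrogate-free idea concrete, here is a minimal sketch of classifier-based acquisition in the spirit of BORE/LFBO: label evaluated molecules by whether they beat a quantile threshold, fit a probabilistic classifier on fixed embeddings, and rank candidates by the predicted "good" probability. The random-forest choice and all names are illustrative assumptions, not the paper's implementation.

    # Hedged sketch: classifier-based ("likelihood-free") acquisition.
    # Assumes molecules are pre-embedded as fixed feature vectors (e.g., from
    # a frozen LLM); the classifier choice and names are illustrative.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def acquisition_scores(X_obs, y_obs, X_cand, gamma=0.25):
        """Rank candidates by the probability of beating the top-gamma quantile.

        X_obs:  (n, d) embeddings of already-evaluated molecules
        y_obs:  (n,)   measured property values (higher is better)
        X_cand: (m, d) embeddings of unevaluated candidates
        """
        tau = np.quantile(y_obs, 1.0 - gamma)        # promotion threshold
        labels = (y_obs >= tau).astype(int)          # 1 = "good" molecule
        # Assumes both classes are present, i.e., y_obs is not constant.
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X_obs, labels)
        # The class posterior is a monotone transform of an improvement-based
        # utility, so ranking by it stands in for fitting a Bayesian surrogate
        # and then optimizing an acquisition function over it.
        return clf.predict_proba(X_cand)[:, 1]

Because only a lightweight classifier is refit at each BO iteration, the per-step cost stays modest even when the embeddings come from a large foundation model.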

Key Insights

  • LLMAT directly models acquisition functions using density ratio estimation and binary classification, avoiding the computational burden of training deep Bayesian surrogate models, which is a common bottleneck in traditional BO.
  • The method learns a tree-structured partition of the molecular search space with local acquisition functions, progressively refining candidate suggestions; a UCB-based traversal sketch follows this list. On the 1-D Levy function, the leaf-node AF yields a narrower and more accurate confidence region around the true optimum than the single root-level AF of vanilla LFBO.
  • LLM-based clustering pre-screens the candidate space, significantly reducing the computational cost of AF evaluations. The method exploits the ability of LLMs to capture coarse-grained property rankings, even without precise numerical predictions (see the clustering sketch after this list).
  • Meta-learning trains shared binary classifiers for candidate partitioning, stabilizing the learned partitions and local AFs, especially in low-data regimes (a shared-classifier sketch follows this list).
  • LLMAT achieves state-of-the-art performance on six real-world chemistry datasets, outperforming baselines like Gaussian Processes (GP), Laplace approximation (LAPLACE), Likelihood-Free BO (LFBO), and BORE.
  • Ablation studies demonstrate that both Monte Carlo Tree Search (MCTS) for partition selection and meta-learning of the binary classifiers are crucial to LLMAT's performance. For example, tree depth L = 0, which reduces LLMAT to plain LFBO, performs significantly worse than deeper trees.
  • The paper challenges the claim that LLMs are only useful for BO over molecules when pre-trained or fine-tuned on domain-specific data. LLMAT achieves strong performance using features from general-purpose LLMs like T5, GPT-2, and Llama-2, highlighting the importance of algorithmic design in leveraging LLM priors.
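As referenced above, here is a hedged sketch of how a partition tree might be traversed with UCB-style Monte Carlo Tree Search to choose the region whose local AF is optimized next. The Node structure, reward bookkeeping, and exploration constant are assumptions for illustration; the paper's tree construction and node statistics may differ.

    import math

    class Node:
        """One region of the partitioned molecular search space."""
        def __init__(self, left=None, right=None):
            self.left, self.right = left, right   # children; both None at a leaf
            self.visits = 0
            self.value_sum = 0.0                  # rewards observed in this region

        def ucb(self, parent_visits, c=1.4):
            if self.visits == 0:
                return float("inf")               # force exploration of new children
            return (self.value_sum / self.visits
                    + c * math.sqrt(math.log(parent_visits) / self.visits))

    def select_leaf(root):
        """Descend the tree, always following the child with the highest UCB."""
        node, path = root, [root]
        while node.left is not None:              # internal node of a binary tree
            node = max((node.left, node.right),
                       key=lambda child: child.ucb(max(node.visits, 1)))
            path.append(node)
        return node, path                         # leaf region + path for the update

    def backup(path, reward):
        """Propagate a newly measured property value back up the visited path."""
        for node in path:
            node.visits += 1
            node.value_sum += reward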
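The pre-screening step can likewise be sketched: cluster the candidate embeddings, then keep only clusters whose observed property values look statistically higher than the rest before running any AF evaluations. KMeans and a one-sided Welch t-test are stand-ins here; the paper's LLM-guided clustering and statistical test may differ.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.stats import ttest_ind

    def select_promising_clusters(X_cand, X_obs, y_obs, n_clusters=10, alpha=0.1):
        """Return the candidates that fall in statistically promising clusters."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_cand)
        obs_cluster = km.predict(X_obs)           # map evaluated molecules to clusters
        keep = []
        for c in range(n_clusters):
            inside = y_obs[obs_cluster == c]
            outside = y_obs[obs_cluster != c]
            if len(inside) < 2:                   # too little evidence: keep the cluster
                keep.append(c)
                continue
            # one-sided test: is this cluster's mean property higher than the rest?
            _, p_value = ttest_ind(inside, outside, equal_var=False,
                                   alternative="greater")
            if p_value < alpha:
                keep.append(c)
        mask = np.isin(km.labels_, keep)
        return X_cand[mask]                       # only these receive AF evaluations

In practice a fallback (e.g., keeping the best-mean cluster) would be needed when no cluster passes the test.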
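Finally, a loose sketch of the shared-classifier idea referenced above: instead of fitting an independent "good vs. bad" classifier per tree node, which is unstable when each node holds only a few labeled molecules, pool the per-node training sets and fit one model conditioned on a node descriptor. Using the node centroid as that descriptor, and logistic regression as the model, are illustrative assumptions only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_shared_classifier(node_datasets):
        """node_datasets: list of (X_node, labels_node) pairs, one per tree node."""
        feats, labels = [], []
        for X_node, y_node in node_datasets:
            centroid = X_node.mean(axis=0)        # node descriptor (an assumption)
            # condition each example on its node by appending the descriptor
            feats.append(np.hstack([X_node,
                                    np.tile(centroid, (len(X_node), 1))]))
            labels.append(y_node)
        clf = LogisticRegression(max_iter=1000)   # one weight vector for all nodes
        clf.fit(np.vstack(feats), np.concatenate(labels))
        return clf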

Practical Implications

  • LLMAT can be used to accelerate the discovery of molecules with desired properties in various fields, including drug design, materials science, and chemical engineering.
  • Researchers and engineers can use LLMAT to efficiently explore vast chemical spaces and identify promising molecular candidates with limited experimental or computational resources.
  • The LLM-guided pre-clustering strategy can be applied to other optimization problems with large candidate sets and expensive evaluation functions.
  • The modular design of LLMAT allows practitioners to incorporate different LLMs and foundation models, depending on the specific application and available data.
  • Future research directions include exploring different tree-structured partitioning schemes, incorporating more sophisticated LLM prompting strategies, and extending LLMAT to multi-objective optimization problems.
