Tabular In-Context Learners Predict Biomolecular Properties

Davy Guan, Lu Zhang, Asiri Wijesinghe, Allen Zhu, He Zhao, Helen Power, F. Hafna Ahmed, Andrew Warden, Cheng Soon Ong, Daniel M. Steinberg · July 1, 2026

Summary

This paper explores whether tabular foundation models, pretrained on synthetic data, can generalize to biomolecular property prediction, finding they are competitive for protein fitness regression and small-molecule classification when paired with appropriate representations. The choice of molecular representation significantly impacts performance.

The researchers tested whether tabular in-context learning models, such as TabPFN3 and TabICL, can predict biomolecular properties. These models are pretrained on synthetic tabular data derived from random causal graphs, a domain seemingly unrelated to proteins and small molecules. The study nonetheless found that they generalize well to biomolecular tasks.

The evaluation covered two areas: protein fitness regression and small-molecule classification. For protein fitness, with a fixed ESMC representation, tabular in-context learning was consistently competitive. For small-molecule classification with ECFP/RDKit descriptors, no single pairing of representation and model dominated, which suggests that the choice of molecular representation often matters more than the tabular predictor's inductive bias. The authors conclude that tabular foundation models are strong contenders for biomolecular prediction when coupled with suitable sequence or molecular representations.
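The two-stage pattern described here (a fixed-length representation followed by a tabular predictor) can be sketched roughly as follows. This is a minimal illustration and not the paper's pipeline: `composition_features` is a toy stand-in for a pretrained encoder such as ESMC, `KNNRegressor` is a hypothetical stand-in for a tabular in-context learner (assumed here to expose a scikit-learn-style `fit`/`predict` interface), and the sequences and fitness values are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Toy stand-in for a pretrained encoder such as ESMC (assumption):
    maps a protein sequence to a fixed-length vector of 20 residue frequencies."""
    counts = Counter(sequence)
    length = max(len(sequence), 1)
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]


class KNNRegressor:
    """Hypothetical stand-in predictor with a scikit-learn-style interface.
    In practice a tabular in-context learner would fill this slot."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Loosely analogous to in-context learning: the labeled examples are
        # stored and consulted at prediction time, with no gradient training.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            # Average the labels of the k closest stored examples.
            nearest = sorted(
                (math.dist(x, xi), yi) for xi, yi in zip(self.X_, self.y_)
            )[: self.k]
            predictions.append(sum(yi for _, yi in nearest) / len(nearest))
        return predictions


# Made-up sequences and fitness values, for illustration only.
train_seqs = ["AAAA", "AAAC", "CCCC", "CCCA"]
train_fitness = [1.0, 0.9, 0.1, 0.2]
test_seqs = ["AACA", "CCAC"]

model = KNNRegressor(k=2).fit(
    [composition_features(s) for s in train_seqs], train_fitness
)
predicted = model.predict([composition_features(s) for s in test_seqs])
print(predicted)
```

With these toy values the first test sequence is predicted near 0.95 and the second near 0.15. The point of the sketch is the separation of concerns: the featurizer and the predictor can be swapped independently, which is what makes it possible to compare representations under a fixed predictor.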

Why it matters

If the results hold, teams in drug discovery and protein engineering could reuse existing tabular foundation models for data-efficient prediction of biomolecular properties, which may accelerate research and development where labeled data is scarce.

How to implement this in your domain

  1. Identify biomolecular prediction tasks within your R&D pipeline that involve limited labeled data.
  2. Experiment with tabular in-context learning models (e.g., TabPFN3) using existing biomolecular representations.
  3. Evaluate the impact of different molecular representations (e.g., ESMC, ECFP/RDKit) on model performance.
  4. Integrate successful tabular models into early-stage drug discovery or protein design workflows.
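Steps 2 and 3 above amount to scoring every pairing of representation and predictor under the same cross-validation splits. The sketch below shows one way to structure such a comparison. Everything in it is an assumption made for illustration: the hashed character n-gram featurizers are toy stand-ins for real representations such as ECFP or RDKit descriptors, `KNNClassifier` is a hypothetical placeholder for a tabular in-context learner, accuracy is an arbitrary choice of metric, and the molecules and labels are made up.

```python
import math
import random
import zlib
from collections import Counter


def char_ngram_features(n, dim=32):
    """Return a featurizer that hashes character n-grams into `dim` counts.
    Toy stand-in for a real molecular representation (assumption)."""
    def featurize(text):
        vec = [0.0] * dim
        for i in range(len(text) - n + 1):
            # crc32 gives a hash that is stable across Python processes.
            vec[zlib.crc32(text[i:i + n].encode()) % dim] += 1.0
        return vec
    return featurize


class KNNClassifier:
    """Hypothetical placeholder predictor with fit/predict methods."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        out = []
        for x in X:
            nearest = sorted(
                zip(self.X_, self.y_), key=lambda pair: math.dist(x, pair[0])
            )[: self.k]
            # Majority vote among the k nearest stored examples.
            out.append(Counter(lbl for _, lbl in nearest).most_common(1)[0][0])
        return out


def cross_val_accuracy(featurize, make_model, items, labels, folds=3, seed=0):
    """Accuracy over all items, each held out exactly once."""
    X = [featurize(item) for item in items]
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # same splits for every pairing
    correct = 0
    for f in range(folds):
        test_idx = order[f::folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        model = make_model().fit(
            [X[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        preds = model.predict([X[i] for i in test_idx])
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / len(items)


# Made-up SMILES strings and labels, for illustration only.
items = ["CCO", "CCCO", "CCN", "CCCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCC(=O)O"]
labels = [1, 1, 0, 0, 0, 1, 1, 1]

representations = {"1-gram": char_ngram_features(1), "2-gram": char_ngram_features(2)}
predictors = {"1-NN": lambda: KNNClassifier(1), "3-NN": lambda: KNNClassifier(3)}

scores = {
    (rep_name, model_name): cross_val_accuracy(featurize, make_model, items, labels)
    for rep_name, featurize in representations.items()
    for model_name, make_model in predictors.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Fixing the random seed keeps the folds identical across pairings, so differences in score reflect the representation or the predictor rather than the split. On a real task the featurizers and predictors would be replaced by the actual representations and tabular models under evaluation.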

Who benefits

Pharmaceuticals, Biotechnology, Healthcare, Chemical Manufacturing

Key takeaways

  • Tabular in-context learning models can effectively predict biomolecular properties.
  • These models perform competitively for protein fitness regression and small-molecule classification.
  • The choice of biomolecular representation is crucial for optimal performance.
  • This approach offers data-efficient prediction for tasks with limited labeled data.

Original post by Davy Guan, Lu Zhang, Asiri Wijesinghe, Allen Zhu, He Zhao, Helen Power, F. Hafna Ahmed, Andrew Warden, Cheng Soon Ong, Daniel M. Steinberg

"arXiv:2606.31126v1 Announce Type: new Abstract: Predicting biomolecular properties from limited labeled data is a central bottleneck in protein engineering and small-molecule design. As strong pretrained encoders now supply rich fixed-length representations, the difficulty has sh…"

