Encoding Numeric EHR Data for Transformers: A Comparative Study

Maria Elkjær Montgomery, Christian Igel, Mikkel Odgaard, Martin Sillesen, Mads Nielsen · July 3, 2026

Summary

This research systematically compares discrete, continuous, and hybrid encoding strategies for numeric values in Electronic Health Records (EHR) when used with transformer models. It finds that hybrid token-based approaches offer a robust and practical solution, balancing precision with optimization stability.

This study investigates the optimal methods for encoding numerical data, particularly from Electronic Health Records (EHR), for use with transformer-based AI models. Researchers compared various strategies, including discrete, continuous, and hybrid approaches, evaluating them on both synthetic arithmetic tasks and real-world clinical prediction scenarios. The findings highlight a trade-off between the precision of numerical representation, the stability of model optimization, and architectural flexibility. While methods explicitly modeling value-concept interactions performed best on precision-sensitive tasks, hybrid token-based approaches, which involve binning numerical values before projection, emerged as a more robust and broadly applicable default. The study also noted that models consistently achieve "good enough" numerical computation rather than exact arithmetic, and the clinical benefits of incorporating laboratory values are task-dependent. This suggests that for practical deployment, robustness and ease of use often outweigh the pursuit of maximal numerical precision.
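To make "binning numerical values before projection" concrete, here is a minimal sketch of one possible reading of a hybrid token-based encoder. It is not the authors' implementation. The function names, the quantile binning rule, the table sizes, and the choice to add the bin vector to the concept vector are all assumptions made for illustration, and the randomly initialised tables stand in for parameters a real model would learn.

```python
import bisect
import random


def quantile_edges(values, n_bins):
    """Return n_bins - 1 interior cut points so bins hold roughly equal counts."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def bin_index(value, edges):
    """Map a numeric value to a bin id in the range 0..len(edges)."""
    return bisect.bisect_right(edges, value)


def make_table(n_rows, dim, rng):
    """Hypothetical embedding table; a real model would learn these weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_rows)]


def encode(concept_id, value, edges, concept_table, bin_table):
    """Combine a concept embedding with the embedding of the value's bin.

    Summation is an assumption; concatenation or a learned projection
    would be equally plausible ways to combine the two vectors.
    """
    bin_vec = bin_table[bin_index(value, edges)]
    return [c + b for c, b in zip(concept_table[concept_id], bin_vec)]


rng = random.Random(0)
# Synthetic stand-in for the training values of one lab test.
train_values = [rng.gauss(100.0, 15.0) for _ in range(1000)]
edges = quantile_edges(train_values, n_bins=10)

concept_table = make_table(n_rows=5, dim=8, rng=rng)  # 5 hypothetical lab concepts
bin_table = make_table(n_rows=10, dim=8, rng=rng)     # one row per bin

vec = encode(concept_id=2, value=97.3, edges=edges,
             concept_table=concept_table, bin_table=bin_table)
print(len(vec))
```

Because every value inside a bin maps to the same vector, the encoder gives up precision within a bin in exchange for a small, fixed vocabulary of value tokens. That is one way to see the precision versus stability trade-off the summary describes.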

Why it matters

Professionals developing AI solutions for healthcare need to understand the most effective and practical ways to handle numerical data in EHRs to ensure model accuracy and reliability. This research provides guidance on encoding strategies that balance precision with real-world deployability.

How to implement this in your domain

  1. Evaluate current data encoding pipelines for numerical EHR data, considering precision and computational overhead.
  2. Experiment with hybrid token-based encoding strategies, particularly those involving binning, for new or existing transformer models.
  3. Analyze the impact of different binning strategies on model performance and optimization stability for specific clinical tasks.
  4. Prioritize robustness and deployability in model design over achieving absolute maximal numerical precision, especially for production systems.
  5. Consult the study's findings on optimal binning based on dataset size for practical implementation.
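As a starting point for step 3, the sketch below compares two binning strategies by how much information each discards before a value reaches the model. The metric (mean squared gap between each value and the mean of its bin), the synthetic log-normal data, and all function names are assumptions chosen for illustration; the study's own evaluation measures downstream task performance, which this does not replace.

```python
import bisect
import random


def uniform_edges(values, n_bins):
    """Equal-width cut points between the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Cut points chosen so bins hold roughly equal numbers of values."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def reconstruction_error(values, edges):
    """Mean squared gap between each value and the mean of its bin."""
    ids = [bisect.bisect_right(edges, v) for v in values]
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for b, v in zip(ids, values):
        sums[b] += v
        counts[b] += 1
    # Empty bins never get looked up, so their placeholder mean is harmless.
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sum((v - means[b]) ** 2 for b, v in zip(ids, values)) / len(values)


rng = random.Random(0)
# Skewed synthetic data, loosely mimicking a right-tailed lab distribution.
values = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

results = {}
for n_bins in (4, 16, 64):
    results[n_bins] = {
        "uniform": reconstruction_error(values, uniform_edges(values, n_bins)),
        "quantile": reconstruction_error(values, quantile_edges(values, n_bins)),
    }
    print(n_bins, results[n_bins])
```

Running the same comparison on real lab values, and then checking whether lower reconstruction error actually translates into better task metrics, is the part that matters for a specific clinical use case.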

Who benefits

Healthcare, Pharmaceuticals, HealthTech, Medical Research

Key takeaways

  • Hybrid token-based encoding is a robust and practical default for numeric EHR data in transformers.
  • There is a trade-off between numeric precision, optimization stability, and architectural flexibility.
  • Models tend to achieve "good enough" numeric computation rather than exact arithmetic in practice.
  • Robustness and deployability often outweigh maximal numeric precision for real-world applications.

Original post by Maria Elkjær Montgomery, Christian Igel, Mikkel Odgaard, Martin Sillesen, Mads Nielsen

"arXiv:2607.01391v1 Announce Type: new Abstract: How do we encode numeric values in transformer-based sequence processing, particularly in electronic health record (EHR) data? We systematically compare discrete, continuous, and hybrid value encoding strategies using synthetic arit…"
