Entity Embeddings Excel in High-Cardinality Fraud Detection

Xiao Han, Jingjing Liu, Moxuan Zheng, Zhen Zhang, Chenyu Wu · July 2, 2026

Summary

A study comparing seven categorical encoding methods for high-cardinality fraud detection found that entity embeddings achieved the highest AUC-ROC, statistically tying with CatBoost, while off-the-shelf TabNet underperformed.

In fraud detection on datasets whose categorical features take many unique values (high-cardinality features), the choice of encoding method significantly affects model performance. A recent study evaluated seven categorical encoding techniques on a large fraud benchmark dataset, comparing interpretable encoders with learned ones. Entity embeddings, a learned encoding method, delivered the highest AUC-ROC score, although the result was statistically indistinguishable from CatBoost, which uses its own integrated encoding. Among the simpler methods, target encoding performed slightly worse than tier group encoding, which aims for more auditor-friendly boundaries.

TabNet, a neural network approach, did not outperform the tree-based pipelines and struggled with data scarcity. While entity embeddings led on AUC-ROC, CatBoost performed better on AUC-PR, so no single encoder dominated both metrics. According to the analysis, the advantage of embeddings stems from their ability to create joint, multi-column representations.
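To see how two models can swap places between AUC-ROC and AUC-PR, the following minimal sketch computes both metrics in plain Python on a made-up example. The labels and scores are hypothetical and are not taken from the study; average precision is used here as a common estimate of AUC-PR, and the sketch assumes there are no tied scores.

```python
def auc_roc(y_true, scores):
    """AUC-ROC via the rank-sum formulation; tied scores share an average rank."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(y for _, y in pairs)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean of the precision values observed at each positive, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 2 fraud cases among 10 transactions.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Model A ranks both frauds near the top, but one legitimate case sits above them.
scores_a = [0.8, 0.7, 0.9, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
# Model B ranks one fraud first and the other in the middle of the list.
scores_b = [0.95, 0.5, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(auc_roc(y, scores_a), average_precision(y, scores_a))  # 0.875 and about 0.583
print(auc_roc(y, scores_b), average_precision(y, scores_b))  # 0.75 and about 0.667
```

In this toy case model A wins on AUC-ROC while model B wins on average precision, which is why it can be worth reporting both metrics on imbalanced fraud data.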

Why it matters

Data scientists and fraud analysts need to select the most effective categorical encoding strategies to build robust and accurate fraud detection systems, especially with complex, high-cardinality data.

How to implement this in your domain

  1. Experiment with entity embeddings for high-cardinality categorical features in your fraud detection models.
  2. Consider using CatBoost as a strong baseline or alternative, especially when AUC-PR is a critical metric.
  3. Avoid off-the-shelf TabNet for fraud detection if data scarcity is a concern or if tree-based models already perform well.
  4. Perform per-column analysis to understand how different encoding methods impact feature representation and model performance.
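As a starting point for the first step, here is a minimal, dependency-free sketch of the idea behind entity embeddings: each category in each column gets a small trainable vector, the vectors from all columns are concatenated, and a logistic output layer is trained jointly with them. This is an illustration only, not the study's architecture. The function names, the toy merchant/device columns, and the hyperparameters are all assumptions, and a real model would normally use a deep learning framework with hidden layers.

```python
import math
import random


def train_entity_embeddings(rows, labels, cardinalities, dim=2, epochs=300, lr=0.1, seed=0):
    """Jointly train one embedding table per column plus a logistic output layer (plain SGD)."""
    rng = random.Random(seed)
    # tables[col][category_id] is the trainable vector for that category.
    tables = [[[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(card)]
              for card in cardinalities]
    weights = [rng.uniform(-0.5, 0.5) for _ in range(dim * len(cardinalities))]
    bias = 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            # Concatenate the embeddings of every column into one joint representation.
            x = [v for col, cat in enumerate(row) for v in tables[col][cat]]
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of log loss w.r.t. z
            for col, cat in enumerate(row):
                for d in range(dim):
                    k = col * dim + d
                    grad_w, grad_e = g * x[k], g * weights[k]
                    weights[k] -= lr * grad_w
                    tables[col][cat][d] -= lr * grad_e
            bias -= lr * g
    return tables, weights, bias


def predict(model, row):
    """Predicted fraud probability for one row of integer category ids."""
    tables, weights, bias = model
    x = [v for col, cat in enumerate(row) for v in tables[col][cat]]
    return 1.0 / (1.0 + math.exp(-(bias + sum(w * xi for w, xi in zip(weights, x)))))


# Hypothetical toy data: columns are (merchant_id, device_id); merchant 0 is always fraud.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
labels = [1, 1, 0, 0, 0, 0]
model = train_entity_embeddings(rows, labels, cardinalities=[3, 2])
print(predict(model, (0, 0)), predict(model, (1, 0)))
```

After training, rows with merchant 0 should score higher than rows with the other merchants. The learned tables can also be exported and fed to a tree-based model as numeric features, which is one common way to combine embeddings with gradient boosting.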

Who benefits

BFSI, E-commerce, Fintech, Insurance

Key takeaways

  • Entity embeddings are highly effective for high-cardinality categorical features in fraud detection.
  • CatBoost offers competitive performance, particularly for AUC-PR, and integrates its own encoding.
  • Simpler encoding methods like target encoding can be competitive but may not reach the top performance.
  • TabNet may not be suitable for all fraud detection scenarios, especially with limited data.
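For readers who want a simple baseline to compare against, the sketch below shows one common form of target encoding: a smoothed category mean computed out-of-fold, so that a row's own label never leaks into its encoding. The summary does not specify the exact variant used in the study, so the smoothing formula, the round-robin stratified fold assignment, and all names here are assumptions.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign fold ids round-robin within each class so folds have similar fraud rates."""
    fold_of = [0] * len(labels)
    seen = defaultdict(int)
    for i, y in enumerate(labels):
        fold_of[i] = seen[y] % n_folds
        seen[y] += 1
    return fold_of


def oof_target_encode(categories, labels, n_folds=5, smoothing=10.0):
    """Encode each row with a smoothed fraud rate computed only from the other folds."""
    fold_of = stratified_folds(labels, n_folds)
    encoded = [0.0] * len(labels)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(labels)) if fold_of[i] != fold]
        prior = sum(labels[i] for i in train_idx) / len(train_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train_idx:
            sums[categories[i]] += labels[i]
            counts[categories[i]] += 1
        for i in range(len(labels)):
            if fold_of[i] == fold:
                c = categories[i]
                n = counts.get(c, 0)
                # Shrink toward the global prior; unseen categories get the prior itself.
                encoded[i] = (sums.get(c, 0.0) + smoothing * prior) / (n + smoothing)
    return encoded


# Hypothetical toy column: "z" appears once, so it is unseen when its own fold is held out.
cats = ["a", "a", "a", "a", "z"]
ys = [1, 0, 1, 0, 0]
print(oof_target_encode(cats, ys, n_folds=2, smoothing=1.0))
```

A larger `smoothing` value pulls rare categories more strongly toward the overall fraud rate, which matters for high-cardinality columns where many categories have only a handful of rows.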

Original post by Xiao Han, Jingjing Liu, Moxuan Zheng, Zhen Zhang, Chenyu Wu

"arXiv:2607.00477v1 Announce Type: new Abstract: A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation…"

Originally posted by Xiao Han, Jingjing Liu, Moxuan Zheng, Zhen Zhang, Chenyu Wu on X
