TOTEN Improves Technical Text Tokenization for Portuguese.

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa · June 19, 2026

Summary

TOTEN is a knowledge-based ontological tokenization framework for physical quantities and technical notation in Brazilian Portuguese. Instead of deriving tokens statistically, it classifies them against a formal ontology and consults external oracles to preserve meaning, which the authors report yields better numerical reconstruction and unit atomicity than the baselines they tested.

Traditional tokenization methods such as Byte-Pair Encoding (BPE) compress general vocabulary efficiently but often fail to preserve structured technical entities. They tend to fragment physical quantities, numbers, units, and symbolic expressions into arbitrary subwords, losing meaning that is critical in scientific and engineering texts.

To address this, the researchers introduce TOTEN (Knowledge-Based Ontological Tokenization), a framework designed specifically for Brazilian Portuguese technical notation. TOTEN replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). It integrates deterministically with external oracles: Pint for dimensional analysis, the Unicode Character Database for typography, and RSLP for Portuguese morphology.

In evaluations against eight state-of-the-art baselines on internal and external Brazilian Portuguese corpora, TOTEN achieved perfect unit ontological atomicity and higher numerical reconstruction scores (0.775-0.904 on external corpora vs. 0.627-0.703 for the best baseline), indicating that it keeps technical expressions intact.
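To make the idea of declarative classification concrete, here is a minimal sketch in Python. It is not TOTEN's implementation: the `UNIT_CLASSES` table is a hypothetical stand-in for the OEE ontology, Pint and RSLP are not called, and the function names (`tokenize`, `parse_number_ptbr`, `quantities`) are invented for illustration. The only oracle it uses is Python's standard `unicodedata` module, which exposes the Unicode Character Database.

```python
import re
import unicodedata

# Hypothetical mini-ontology: unit symbol -> physical dimension.
# A stand-in for illustration only; it is not TOTEN's OEE and does not call Pint.
UNIT_CLASSES = {
    "kN/m2": "pressure",
    "m/s2": "acceleration",
    "MPa": "pressure",
    "°C": "temperature",
    "kN": "force",
    "kg": "mass",
    "mm": "length",
    "m": "length",
}

# pt-BR numbers: optional thousands dots, decimal comma (e.g. 1.250,5).
NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"
# Longest unit symbols first, so "kN/m2" wins over "kN".
UNIT = "|".join(re.escape(u) for u in sorted(UNIT_CLASSES, key=len, reverse=True))
TOKEN = re.compile(
    rf"(?P<num>{NUMBER})"
    rf"|(?P<unit>(?:{UNIT})(?![\w/]))"  # a unit must not run into more letters
    r"|(?P<word>\w+)"
    r"|(?P<punct>\S)"
)


def tokenize(text):
    """Return (kind, text) pairs, keeping numbers and known units whole."""
    normalized = unicodedata.normalize("NFKC", text)  # e.g. "m²" becomes "m2"
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(normalized)]


def parse_number_ptbr(token):
    """Convert a pt-BR number string such as '1.250,5' to a float."""
    return float(token.replace(".", "").replace(",", "."))


def quantities(text):
    """Pair each number with the unit token that directly follows it."""
    tokens = tokenize(text)
    return [
        (parse_number_ptbr(value), unit, UNIT_CLASSES[unit])
        for (k1, value), (k2, unit) in zip(tokens, tokens[1:])
        if k1 == "num" and k2 == "unit"
    ]


if __name__ == "__main__":
    print(tokenize("Carga de 1.250,5 kN/m² a 25 °C"))
    print(quantities("Carga de 1.250,5 kN/m² a 25 °C"))
```

Because each token is assigned a class by lookup rather than by corpus statistics, a number or unit is either recognized whole or left as an ordinary word; it is never split into subword fragments. A real system would replace the hand-written table with a proper ontology and a dimensional-analysis library.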

Why it matters

For professionals working with technical documents, scientific literature, or engineering specifications in Brazilian Portuguese, TOTEN offers a more accurate and semantically robust tokenization solution. This can greatly improve the performance of downstream NLP tasks such as information extraction, machine translation, and knowledge graph construction in specialized domains.

How to implement this in your domain

1. Assess current NLP pipelines for handling technical text in Brazilian Portuguese, especially regarding physical quantities and units.
2. Investigate integrating TOTEN or similar knowledge-based tokenization approaches into your text processing workflows.
3. Evaluate the impact of improved tokenization on the accuracy of information extraction and semantic understanding tasks.
4. Consider developing custom ontologies for specific technical domains to enhance tokenization precision.
5. Apply this method in applications requiring high-fidelity understanding of scientific and engineering data.
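For the assessment and evaluation steps, a small harness like the following Python sketch can give a first reading on how an existing tokenizer treats quantities. The two scores are simplified proxies assumed here for illustration; the paper's definitions of unit atomicity and numerical reconstruction may differ. The names `unit_atomicity`, `number_recovery`, `split_on_punct`, and the two sample sentences are all hypothetical.

```python
import re

# pt-BR number: optional thousands dots, decimal comma (e.g. 1.200 or 12,5).
PTBR_NUMBER = re.compile(r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?")


def unit_atomicity(tokenize, samples):
    """Share of gold units that survive as exactly one token (assumed proxy)."""
    hits = total = 0
    for text, gold_units, _ in samples:
        tokens = tokenize(text)
        for unit in gold_units:
            total += 1
            hits += unit in tokens
    return hits / total if total else 0.0


def number_recovery(tokenize, samples):
    """Share of gold numbers readable from a single token (assumed proxy)."""
    hits = total = 0
    for text, _, gold_numbers in samples:
        values = {
            float(tok.replace(".", "").replace(",", "."))
            for tok in tokenize(text)
            if PTBR_NUMBER.fullmatch(tok)
        }
        for number in gold_numbers:
            total += 1
            hits += number in values
    return hits / total if total else 0.0


def split_on_punct(text):
    """A naive tokenizer that isolates every punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical gold annotations: (text, units, numeric values).
SAMPLES = [
    ("Carga de 12,5 kN/m", ["kN/m"], [12.5]),
    ("Massa de 1.200 kg", ["kg"], [1200.0]),
]

if __name__ == "__main__":
    for name, tok in [("whitespace", str.split), ("punct", split_on_punct)]:
        print(name, unit_atomicity(tok, SAMPLES), number_recovery(tok, SAMPLES))
```

Whitespace splitting scores well on these two sentences only because every quantity is space-separated; text such as "12,5kN/m" would break it. Any callable that maps a string to a list of token strings can be passed in, so a subword tokenizer can be compared on the same samples after decoding its tokens to text.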

Who benefits

Engineering, Scientific Research, Education, Technical Publishing, AI/NLP Development

Key takeaways

Traditional tokenization struggles with technical notation and physical quantities.
TOTEN uses a knowledge-based, ontological approach for robust technical tokenization.
It significantly improves numerical reconstruction and semantic preservation in Brazilian Portuguese.
This framework enhances downstream NLP tasks for scientific and engineering texts.

Original post by Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

"arXiv:2606.19626v1 Announce Type: new Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically…"

