Stanford Releases EDGAR Filings Dataset for Financial LLMs

Nick Bettencourt, Xiaowei Ding, Kay Giesecke · June 17, 2026

Summary

Stanford University has released the Stanford EDGAR Filings Dataset (SEFD), an open, layout-faithful, and token-efficient corpus of U.S. corporate and financial disclosures. The initial release comprises 152 billion tokens of high-quality, long-context pretraining data for large language models (LLMs) and comes with two new benchmarks, one for financial forecasting and one for OCR.

The Stanford EDGAR Filings Dataset (SEFD) is a new open resource for training and evaluating large language models (LLMs) in the financial domain. It reconstructs U.S. Securities and Exchange Commission (SEC) filings into a layout-faithful, token-efficient MultiMarkdown format. SEFD addresses the growing scarcity of high-quality, long-context training data by providing audited financial statements, risk disclosures, ownership reports, and other financial documents. The corpus is designed to be model-ready and to have minimal overlap with common web corpora, which suits it to financial language modeling, reasoning, forecasting, and compliance applications. The initial public release, SEFD-v1, contains 152 billion tokens; the larger archive is estimated at 550 billion tokens. The researchers also introduce two benchmarks derived from SEFD: EDGAR-Forecast for numerical forecasting and EDGAR-OCR for transcribing complex financial tables.

Why it matters

Financial professionals and AI engineers can use this open dataset to train LLMs specialized for financial analysis, forecasting, and compliance. Domain-specific, long-context training data of this kind may improve the performance of AI applications in the finance sector.

How to implement this in your domain

  1. Download and integrate the Stanford EDGAR Filings Dataset into financial LLM pretraining pipelines.
  2. Develop and fine-tune LLMs specifically for financial reasoning, forecasting, and document understanding using SEFD.
  3. Utilize the EDGAR-Forecast and EDGAR-OCR benchmarks to evaluate the performance of financial AI models.
  4. Explore SEFD for compliance automation, risk assessment, and market intelligence applications.
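
As a rough illustration of steps 1 and 3, the sketch below packs long filings into fixed-length training windows and scores numerical forecasts. It is a minimal sketch under stated assumptions, not the SEFD authors' pipeline: the whitespace tokenizer, the "<eos>" separator, the function names, and the choice of mean absolute percentage error are all hypothetical stand-ins, since the summary does not describe the dataset's loading interface or the EDGAR-Forecast metric.

```python
from typing import Iterable, Iterator, List, Sequence


def pack_documents(
    docs: Iterable[str], window: int, eos: str = "<eos>"
) -> Iterator[List[str]]:
    """Concatenate documents and yield token windows of at most `window` tokens.

    Assumption: whitespace splitting stands in for a real tokenizer, and
    "<eos>" is a hypothetical document separator.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    buffer: List[str] = []
    for doc in docs:
        buffer.extend(doc.split())
        buffer.append(eos)  # mark the document boundary
        while len(buffer) >= window:
            yield buffer[:window]
            buffer = buffer[window:]
    if buffer:
        yield buffer  # final, possibly shorter, window


def mean_absolute_percentage_error(
    actual: Sequence[float], predicted: Sequence[float]
) -> float:
    """Mean absolute percentage error, skipping targets equal to zero.

    Assumption: MAPE is only an example metric; the benchmark may use another.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero targets to score")
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy stand-ins for filings converted to MultiMarkdown text.
    filings = ["Revenue rose in fiscal 2023", "Risk factors include rates"]
    for tokens in pack_documents(filings, window=4):
        print(tokens)
    print(mean_absolute_percentage_error([100.0, 200.0], [110.0, 180.0]))
```

In a real pipeline the whitespace split would be replaced by the model's own tokenizer, and the window size would match the model's context length.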

Who benefits

BFSI, FinTech, Investment Management, LegalTech, Consulting

Key takeaways

  • SEFD is an open, high-quality dataset of SEC filings for training financial LLMs.
  • It provides layout-faithful, token-efficient long-context data for financial reasoning and forecasting.
  • The initial release, SEFD-v1, contains 152 billion tokens; the larger archive is estimated at 550 billion tokens.
  • Two new benchmarks, EDGAR-Forecast and EDGAR-OCR, are introduced for model evaluation.

Original post by Nick Bettencourt, Xiaowei Ding, Kay Giesecke

"arXiv:2606.18192v1. Abstract: As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often prop…"

