BaRA Agent Improves Web Data Collection with BFS and Reflection

Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song · July 2, 2026

Summary

Researchers introduce BaRA (BFS-and-Reflection Agent), a framework for site-level web data collection that combines bounded breadth-first search (BFS) traversal with history-based self-reflection. BaRA outperforms existing LLM-based web agents in link discovery and downloadable multimodal extraction, especially for images and videos.

Large language model (LLM)-based web agents aim to automate web data collection and reduce the need for manual scripting. On live websites, however, these agents often miss relevant pages, return incomplete multimodal outputs, or provide non-downloadable media URLs. BaRA (BFS-and-Reflection Agent) is a framework designed to address these limitations. It targets site-level data collection within a fixed interaction budget and integrates a bounded breadth-first search (BFS) traversal strategy with a history-based self-reflection mechanism. BaRA was evaluated on 50 synthetic websites with ground-truth references and on three public websites featuring complex or dynamic layouts. In these evaluations it significantly outperformed the Pure LLM, SeeAct-Vision, and Browser-use agents in link discovery and in the extraction of downloadable multimodal content, with particular strength in recovering valid images and videos.
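To make the combination of bounded BFS and history-based reflection concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy `SITE` graph, the `section` and `reflect` helpers, and the rule "skip links in a site section that already failed to load" are all assumptions made for illustration. In BaRA the reflection step presumably involves an LLM reviewing its interaction history; a simple rule stands in for it here.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Hypothetical toy site: URL -> outgoing links, or None if the page fails to load.
SITE: Dict[str, Optional[List[str]]] = {
    "/": ["/account", "/gallery", "/about"],
    "/account": None,
    "/gallery": ["/gallery/1", "/gallery/2", "/account/settings"],
    "/about": ["/"],
    "/gallery/1": [],
    "/gallery/2": [],
    "/account/settings": None,
}


def section(url: str) -> str:
    """First path segment, e.g. '/gallery/1' -> 'gallery' (illustrative helper)."""
    return url.strip("/").split("/")[0]


def reflect(history: List[dict], candidates: List[str]) -> List[str]:
    """Rule-based stand-in for reflection: drop links in sections that failed before."""
    failed = {section(h["url"]) for h in history if not h["ok"]}
    return [c for c in candidates if section(c) not in failed]


def bounded_bfs(
    start: str,
    fetch_links: Callable[[str], Optional[List[str]]],
    budget: int,
    max_depth: int,
) -> List[str]:
    """Breadth-first traversal limited by a page-fetch budget and a depth bound."""
    history: List[dict] = []          # one record per page fetch
    seen = {start}                    # URLs already queued
    queue = deque([(start, 0)])
    while queue and len(history) < budget:
        url, depth = queue.popleft()
        links = fetch_links(url)
        history.append({"url": url, "depth": depth, "ok": links is not None})
        if links is None or depth >= max_depth:
            continue                  # failed page, or too deep to expand further
        for link in reflect(history, links):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return [h["url"] for h in history]


visited = bounded_bfs("/", SITE.get, budget=6, max_depth=2)
print(visited)
# ['/', '/account', '/gallery', '/about', '/gallery/1', '/gallery/2']
```

In this toy run, `/account` fails to load, so the reflection rule keeps `/account/settings` out of the queue and the six-fetch budget is spent on pages that can actually be read. Lowering `budget` simply truncates the same breadth-first order.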

Why it matters

Professionals in data science, marketing, and competitive intelligence can leverage BaRA to more efficiently and accurately collect comprehensive web data, including hard-to-find multimodal content, for analysis and strategic decision-making.

How to implement this in your domain

  1. Explore integrating BaRA into existing web scraping or data collection pipelines for enhanced performance.
  2. Utilize BaRA for comprehensive site-level data extraction, focusing on multimodal content like images and videos.
  3. Benchmark BaRA's performance against current LLM-based agents for specific data collection needs.
  4. Adapt BaRA's reflection mechanism to improve data quality and relevance for specific business objectives.
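For the benchmarking step, one simple approach is to score each agent's discovered links against a ground-truth list for the same site. The sketch below uses precision, recall, and F1 over URL sets. This choice of metrics is an assumption for illustration, since this summary does not state which metrics the paper reports, and the function name and example URLs are hypothetical.

```python
from typing import Dict, Iterable


def link_discovery_scores(found: Iterable[str], truth: Iterable[str]) -> Dict[str, float]:
    """Compare an agent's discovered URLs with a ground-truth set."""
    found_set, truth_set = set(found), set(truth)
    hits = found_set & truth_set
    precision = len(hits) / len(found_set) if found_set else 0.0
    recall = len(hits) / len(truth_set) if truth_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the agent found two of four true links plus one wrong link.
truth = ["/a", "/b", "/c", "/d"]
found = ["/a", "/b", "/x"]
scores = link_discovery_scores(found, truth)
print(scores)
```

Running the same function over each agent's output on the same set of sites gives a like-for-like comparison. Recall is the figure to watch for the missed-pages problem described above.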

Who benefits

Market Research, E-commerce, Data Analytics, Media Monitoring, Cybersecurity

Key takeaways

  • BaRA improves web data collection by combining BFS traversal with self-reflection.
  • It addresses common issues like missed pages and incomplete multimodal outputs in LLM agents.
  • BaRA significantly outperforms other agents in link discovery and downloadable media extraction.
  • The framework is particularly effective for recovering valid images and videos from complex websites.

Original post by Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song

"arXiv:2607.00007v1 Announce Type: cross Abstract: Large language model (LLM)-based web agents reduce manual scripting for web data collection, yet on live websites, they often miss relevant pages, return incomplete multimodal outputs, or return media URLs that are not directly do…"
