SemPiper Synthesizes ML Pipeline Code with Semantic Operators

Olga Ovcharenko, Luciano Duarte, Sebastian Schelter · June 15, 2026

Summary

SemPiper is an interactive demonstration of SemPipes, a programming model that extends ML pipelines with declarative, LLM-powered semantic data operators specified through natural language instructions. It synthesizes and optimizes code for these operators and lets developers combine them with standard Python data science libraries.

Developing machine learning (ML) pipelines is often complex and error-prone, largely because of the data preparation, feature engineering, and integration required across heterogeneous data sources. Large language models (LLMs) have shown potential for assisting with programming tasks, but current chat-based interfaces offer limited control over pipeline behavior and often generate code that is difficult to optimize or integrate into production systems.

SemPipes addresses these challenges with a programming model that extends ML pipelines with declarative, LLM-powered semantic data operators. Developers specify data-centric operations as high-level natural language instructions and combine them with conventional Python code from established data science libraries. During pipeline training, the system synthesizes a specialized implementation for each semantic operator, adapting it to the dataset characteristics and the overall pipeline context. This allows flexible yet controlled integration of LLM capabilities into ML development.

The demonstration, SemPiper, provides an interactive interface that visualizes the computational graphs of these pipelines, the synthesized operator implementations, and the optimization trajectories produced by an evolutionary search procedure. Users can explore end-to-end scenarios, modify pipelines, inspect the generated code, and observe the synthesis and iterative optimization of semantic operators. The demonstration is meant to show how declarative semantic operators can make the integration of LLMs into ML pipeline development more controllable, optimizable, and practical.
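To make the idea of a declarative semantic operator concrete, here is a minimal sketch in plain Python. It is not the SemPipes API: `SemanticOperator`, `fake_synthesize`, and `drop_missing` are hypothetical names, and the "synthesis" step is a hard-coded stand-in for what would be an LLM call in a real system. The sketch only illustrates the shape of the idea: a natural language instruction is attached to an operator, an implementation is produced at fit time from the training data, and the operator is combined with ordinary hand-written code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Impl = Callable[[List[Row]], List[Row]]


def fake_synthesize(instruction: str, sample: List[Row]) -> Impl:
    """Hypothetical stand-in for an LLM call that returns operator code.

    A real system would prompt a model with the instruction and dataset
    characteristics. Here one implementation is hard-coded so the sketch
    runs, and the instruction text is not actually interpreted.
    """
    columns = set(sample[0]) if sample else set()

    def impl(rows: List[Row]) -> List[Row]:
        out = []
        for row in rows:
            new_row = dict(row)
            # Adapt to the data seen at fit time: only add the feature
            # when both input columns were present.
            if {"price", "quantity"} <= columns:
                new_row["revenue"] = row["price"] * row["quantity"]
            out.append(new_row)
        return out

    return impl


@dataclass
class SemanticOperator:
    """Assumed interface: an instruction plus a fit/transform pair."""

    instruction: str
    impl: Optional[Impl] = None

    def fit(self, rows: List[Row]) -> "SemanticOperator":
        # Synthesis happens during training, using the training rows.
        self.impl = fake_synthesize(self.instruction, rows)
        return self

    def transform(self, rows: List[Row]) -> List[Row]:
        if self.impl is None:
            raise RuntimeError("call fit() before transform()")
        return self.impl(rows)


def drop_missing(rows: List[Row]) -> List[Row]:
    """Conventional, hand-written pipeline step."""
    return [r for r in rows if all(v is not None for v in r.values())]


train = [
    {"price": 2.0, "quantity": 3},
    {"price": None, "quantity": 1},
    {"price": 5.0, "quantity": 2},
]
clean = drop_missing(train)
op = SemanticOperator("Add a revenue feature from price and quantity").fit(clean)
features = op.transform(clean)
```

The fit/transform split mirrors the convention of common Python ML libraries, which is one plausible way such an operator could sit next to conventional pipeline steps.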

Why it matters

Streamlining ML pipeline development with natural language and LLM-powered semantic operators could reduce development time, improve code quality, and make ML accessible to a broader range of professionals.

How to implement this in your domain

1. Explore SemPiper: Investigate the SemPiper framework for integrating natural language instructions into your ML data preparation.
2. Pilot semantic operators: Experiment with declarative semantic operators for common data transformation and feature engineering tasks in your ML workflows.
3. Integrate LLM assistance: Leverage LLMs to synthesize and optimize code snippets for data operations within your existing Python data science pipelines.
4. Improve MLOps efficiency: Adopt tools that visualize computational graphs and optimization trajectories to enhance transparency and control in ML pipeline development.
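The optimization of synthesized operators and the resulting optimization trajectory can be illustrated with a toy loop. The source does not describe the actual evolutionary search procedure, so the sketch below swaps in a simple (1+λ)-style strategy: mutate the current best candidate, score each child on validation data, and keep a child only if it scores higher. All names (`evolve`, `score`, `mutate`) are hypothetical, and the candidate is a single numeric threshold rather than generated code, which a real system would presumably mutate through an LLM.

```python
import random
from typing import Callable, List, Optional, Tuple


def evolve(
    score: Callable[[float], float],
    mutate: Callable[[float, random.Random], float],
    seed_candidate: float,
    generations: int = 20,
    population: int = 8,
    rng: Optional[random.Random] = None,
) -> Tuple[float, List[float]]:
    """Return the best candidate and the best score after each generation."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    best = seed_candidate
    best_score = score(best)
    trajectory = [best_score]
    for _ in range(generations):
        # All children of a generation are mutated from the same parent.
        children = [mutate(best, rng) for _ in range(population)]
        for child in children:
            child_score = score(child)
            if child_score > best_score:
                best, best_score = child, child_score
        trajectory.append(best_score)
    return best, trajectory


# Toy validation set of (value, label) pairs. The candidate "operator" is
# the binary feature `value > threshold`.
validation = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (9.0, 1)]


def score(threshold: float) -> float:
    """Accuracy of the thresholded feature as a predictor of the label."""
    correct = sum(int(x > threshold) == y for x, y in validation)
    return correct / len(validation)


def mutate(threshold: float, rng: random.Random) -> float:
    """Perturb the threshold with Gaussian noise."""
    return threshold + rng.gauss(0.0, 1.0)


best, trajectory = evolve(score, mutate, seed_candidate=0.0)
```

Because a child replaces the parent only when it scores strictly higher, `trajectory` never decreases. That list is the kind of per-generation record an optimization-trajectory view could plot.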

Who benefits

Data Science, Software Development, AI Development, Business Intelligence, Research & Development

Key takeaways

SemPiper simplifies ML pipeline development using LLM-powered semantic operators.
Developers can use natural language for data operations, combined with Python code.
The system synthesizes optimized code based on data and pipeline context.
It offers interactive visualization for better control and understanding of pipelines.

Original post by Olga Ovcharenko, Luciano Duarte, Sebastian Schelter

"arXiv:2606.14361v1 Announce Type: new Abstract: Machine learning (ML) pipelines require extensive data preparation, feature engineering, and integration across heterogeneous sources, making them tedious and error-prone to develop. While large language models (LLMs) have recently…"

