NaviGen Personalizes Multimodal Generation from User Behavior

Hengji Zhou, Yufeng Liu, Ye Liu, Yong Xu, Lianghao Xia, Liqiang Nie · June 24, 2026

Summary

NaviGen enables personalized multimodal content generation by converting user interaction history into executable instructions for synthesis. It uses a dual identifier that pairs a behavioral (collaborative) code with a textual code, plus a two-stage SFT+RL pipeline that distills preference reasoning and aligns generation with user intent.

Modern AI-generated content (AIGC) pipelines produce high-fidelity images and videos, but they presuppose a precise creation instruction. End users often struggle to articulate detailed visual preferences, so generator output ends up misaligned with user demand. This research introduces NaviGen, a system for personalized content generation that translates a user's interaction history into actionable instructions for downstream synthesis. NaviGen addresses two challenges: encoding user behavior in a form that language-based reasoning can operate on, and teaching the model an instruction-writing skill that is present in neither its pretraining nor the behavioral data. For the first, it represents each item with a dual identifier that combines a collaborative code and a textual code in a single token stream, which serves as both a behavioral substrate and a semantic bridge. For the second, it uses a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL): the SFT stage distills preference reasoning and instruction writing from evolutionarily searched supervision, and the RL stage aligns generation with user intent through hierarchical and self-consistent rewards. Experiments across product, game, and short-video domains demonstrate that NaviGen significantly improves personalized image and video generation, enhances next-item prediction, and produces more specific, relevant, and visually generatable instructions.
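To make the dual-identifier idea concrete, here is a minimal Python sketch of how an interaction history might be flattened into one token stream. It is an illustration under stated assumptions, not the paper's implementation: the `Item` class, the `<item>`, `<txt>`, and `<c{level}_{index}>` token formats, and the function names are all hypothetical, and the summary does not say how the collaborative or textual codes are actually derived.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """One catalog item carrying both halves of a hypothetical dual identifier."""
    collab_code: Tuple[int, ...]  # assumed: discrete codes derived from behavior data
    text_code: Tuple[str, ...]    # assumed: a few descriptive words for the item


def item_tokens(item: Item) -> List[str]:
    """Render one item as collaborative-code tokens followed by its text tokens."""
    # Each collaborative code becomes a special token tagged with its position.
    collab = [f"<c{level}_{idx}>" for level, idx in enumerate(item.collab_code)]
    return ["<item>", *collab, "<txt>", *item.text_code, "</item>"]


def history_to_stream(history: List[Item]) -> str:
    """Concatenate a user's interaction history, oldest first, into one stream."""
    tokens: List[str] = []
    for item in history:
        tokens.extend(item_tokens(item))
    return " ".join(tokens)


if __name__ == "__main__":
    history = [
        Item(collab_code=(3, 17), text_code=("red", "sneaker")),
        Item(collab_code=(3, 42), text_code=("trail", "shoe")),
    ]
    print(history_to_stream(history))
```

In this sketch the two items share the first collaborative token (`<c0_3>`), which is one way behaviorally similar items could look alike to the model, while the text tokens give it words to reason with when writing an instruction.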

Why it matters

For professionals in e-commerce, media, and content platforms, NaviGen suggests a way to deliver personalized multimodal content without asking users to write detailed prompts. By bridging the gap between what users implicitly want and what generators need as input, it could help improve engagement, conversion, and retention.

How to implement this in your domain

  1. Analyze current content generation pipelines for personalization gaps based on user behavior.
  2. Explore integrating dual identifier systems to encode user interaction history for AI models.
  3. Investigate two-stage SFT+RL pipelines for distilling user preferences into actionable instructions.
  4. Pilot NaviGen's approach for personalized recommendations or content creation in specific product categories.
  5. Collaborate with AI research teams to adapt and fine-tune this technology for unique platform requirements.

Who benefits

E-commerce, Social Media, Entertainment, Advertising, Gaming

Key takeaways

  • NaviGen personalizes multimodal content generation by converting user behavior into executable instructions.
  • It uses a dual identifier system to bridge behavioral and semantic information.
  • A two-stage SFT+RL pipeline distills preference reasoning and aligns generation with user intent.
  • NaviGen improves personalized image/video generation and next-item prediction across domains.

Original post by Hengji Zhou, Yufeng Liu, Ye Liu, Yong Xu, Lianghao Xia, Liqiang Nie

"arXiv:2606.24196v1 Announce Type: new Abstract: Modern AIGC pipelines deliver high-fidelity images and videos but presuppose a well-formed creation instruction, while end users rarely articulate visual details, leaving generators misaligned with user demand. We study personalized…"

