Embodied Agents & World Models · June 27, 2026

Daily Paper Digest · Embodied Agents & World Models

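💡 In one sentence: OmniAct puts cyber tools, IoT, navigation, and manipulation into a unified action space, then adds hierarchical memory and asynchronous visual preemption so that a robot can detect failures and recover during real long-horizon tasks.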

2026-06-27 09:15:02 · 8 paper entries

arXiv:2606.27251 arXiv:2606.23565 arXiv:2606.17511 arXiv:2606.22948 arXiv:2606.24525 arXiv:2606.27146 arXiv:2606.27374 arXiv:2606.27355

Date: 2026-06-27

1. Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy

🔗 https://arxiv.org/abs/2606.27251

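🎯 Relevance: This is the one most worth reading. It is not a single-point robot policy but a complete agent runtime ("planner + memory + verifier + execution recovery"), and it is highly isomorphic to InternOS's execution layer, verification layer, and long-term context management.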

2. HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

🔗 https://arxiv.org/abs/2606.23565

💡 In one sentence: It brings the LLM agent's reason-tool-feedback-revise loop to physical robots, using 3D spatial memory to connect perception, planning, and execution.

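🎯 Relevance: For Anna, the value of this paper is in "state representation." Future agent platforms cannot record text traces alone; they need environment state, spatial memory, and action grounding. If InternOS moves toward execution systems, this layer will have to be abstracted out sooner or later.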

3. MagicSim: A Unified Infrastructure for Executable Embodied Interaction

🔗 https://arxiv.org/abs/2606.17511

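💡 In one sentence: It upgrades simulation from a "rendering/testing tool" to a unified runtime in which the same episode can be executed, reproduced, evaluated, and annotated.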

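🎯 Relevance: This one speaks directly to the AI sandbox / hardware infra line. A good sandbox is not just an isolated environment; it is an execution substrate that can be replayed, verified, and used to generate training data.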

4. ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

🔗 https://arxiv.org/abs/2606.22948

💡 In one sentence: Instead of training GUI agents by pure imitation, it searches for trajectories in a real desktop environment and uses an environment verifier to keep only reliable supervision.

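🎯 Relevance: This looks a lot like the digital-environment version of a "generator + verifier + self-improvement loop." The takeaway for InternOS: an agent learns to execute not from prompts but from verified trajectories produced in a closed loop with the environment.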
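To make the "search, then keep only what the environment verifies" idea concrete, here is a minimal Python sketch. It is not the ENVS algorithm: the counter environment, the random-rollout search, and the names `toy_step`, `toy_verifier`, and `verified_search` are all hypothetical stand-ins chosen for illustration.

```python
import random
from typing import Callable, List, Tuple

Action = str
Trajectory = List[Action]

def rollout(step: Callable[[int, Action], int], actions: List[Action],
            start: int, horizon: int, rng: random.Random) -> Tuple[Trajectory, int]:
    """Sample one random trajectory; return it with the final environment state."""
    state, traj = start, []
    for _ in range(horizon):
        action = rng.choice(actions)
        state = step(state, action)
        traj.append(action)
    return traj, state

def verified_search(step, verifier, actions, start, horizon, n_samples, seed=0):
    """Keep only trajectories whose final state passes the environment verifier."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        traj, final_state = rollout(step, actions, start, horizon, rng)
        if verifier(final_state):
            kept.append(traj)
    return kept

# Toy environment (assumption): the state is an integer counter.
def toy_step(state: int, action: Action) -> int:
    return state + 1 if action == "inc" else state - 1

def toy_verifier(state: int) -> bool:
    return state == 2  # the "task" is to end with the counter at 2

verified = verified_search(toy_step, toy_verifier, ["inc", "dec"],
                           start=0, horizon=4, n_samples=200)
```

The trajectories in `verified` are the only ones that would be passed on as supervision; everything the verifier rejects is discarded rather than imitated.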

5. VisCritic: Visual State Comparison as Process Reward for GUI Agents

🔗 https://arxiv.org/abs/2606.24525

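💡 In one sentence: Rather than looking only at the final outcome, it compares the screenshots before and after each action, so the GUI agent gets a visual-level process reward at every step.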

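🎯 Relevance: This is a good template for an execution verifier. When Anna builds an agent platform, it is not enough to log "which tool was called"; the platform also has to judge "whether the environment state actually moved toward the goal."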
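A toy Python sketch of a per-step reward computed from before/after states follows. It rests on strong assumptions: screens are small integer grids, a goal screen is available, and a plain pixel count stands in for whatever critic the paper actually uses. The names `pixel_distance` and `process_reward` are hypothetical.

```python
from typing import List

Screen = List[List[int]]  # toy "screenshot": a grid of pixel values

def pixel_distance(a: Screen, b: Screen) -> int:
    """Count the pixels that differ between two same-sized screens."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def process_reward(before: Screen, after: Screen, goal: Screen) -> int:
    """Positive if the action moved the screen closer to the goal, negative if further."""
    return pixel_distance(before, goal) - pixel_distance(after, goal)

goal   = [[1, 1], [0, 0]]
before = [[0, 0], [0, 0]]
good   = [[1, 0], [0, 0]]  # one more pixel now matches the goal
bad    = [[0, 0], [1, 0]]  # one more pixel now mismatches the goal

print(process_reward(before, good, goal))  # 1
print(process_reward(before, bad, goal))   # -1
```

The point of the sketch is the signature, not the metric: the reward is a function of the observed state change, so a tool call that returns "success" but leaves the screen unchanged scores zero.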

6. PhysReflect-VLA: Physical Feasibility and Self-Reflective Regulation for Reliable Vision-Language-Action Policies

🔗 https://arxiv.org/abs/2606.27146

💡 In one sentence: It inserts an execution-time feasibility evaluator and structured self-reflection into the VLA policy, addressing how easily open-loop execution breaks down in long-horizon manipulation.

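🎯 Relevance: This paper shows that end-to-end action generation alone is not enough for VLA; an external verification and correction layer is required. The same judgment applies to software agents: the policy is only the executor, not the whole of system reliability.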
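The "policy proposes, an external layer checks" pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's evaluator: actions are hypothetical 2D displacements, the feasibility check is a hand-written bound test, and the policy is assumed to supply a ranked list of candidate actions.

```python
from typing import List, Optional, Tuple

Action = Tuple[float, float]  # hypothetical 2D end-effector displacement

def feasible(position: Tuple[float, float], action: Action,
             workspace: float = 1.0, max_step: float = 0.2) -> bool:
    """Hand-written check: bounded step size, and the target stays inside the workspace."""
    dx, dy = action
    if max(abs(dx), abs(dy)) > max_step:
        return False
    x, y = position[0] + dx, position[1] + dy
    return abs(x) <= workspace and abs(y) <= workspace

def gated_step(position: Tuple[float, float], candidates: List[Action],
               check=feasible) -> Optional[Action]:
    """Return the first candidate that passes the check, or None to signal 'stop and replan'."""
    for action in candidates:
        if check(position, action):
            return action
    return None

# Near the workspace edge, the first two proposals are rejected and the third is used.
chosen = gated_step((0.95, 0.0), [(0.2, 0.0), (0.5, 0.0), (-0.1, 0.0)])
print(chosen)  # (-0.1, 0.0)
```

Returning `None` instead of executing the least-bad action is the design choice that matters here: it gives the surrounding system an explicit hook for reflection or replanning.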

7. World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays

🔗 https://arxiv.org/abs/2606.27374

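💡 In one sentence: It uses a World Action Model to generate pseudo replays of old tasks, so the robot policy can rehearse old skills while learning new tasks without storing the original demonstrations.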

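🎯 Relevance: This is a signal that world models are moving from "predicting the future" to "generating training data." For agent systems, future memory will not just be an archive; it will become an experience model that can be sampled, trained on, and replayed.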

8. RouterVLA: Turning Smoke Tests into Supervision for Heterogeneous VLA Selection

🔗 https://arxiv.org/abs/2606.27355

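💡 In one sentence: It uses pre-deployment test rollouts to build a profile of each of several frozen VLA experts, then picks the policy more likely to succeed on the task at hand.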

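🎯 Relevance: Not the sexiest paper, but it has a strong systems flavor. Future agent platforms will not have a single do-everything model; a router will choose the planner, policy, or tool based on historical execution evidence, which fits InternOS's scheduling layer closely.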
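As a rough illustration of routing on execution evidence, the Python sketch below builds per-policy success profiles from smoke-test records and picks the best-scoring policy for a task category. The record format, the Laplace smoothing, the 0.5 default for unseen pairs, and the policy names `vla_a` / `vla_b` are all assumptions for the example, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (policy name, task category, success) records from pre-deployment smoke tests
SmokeResult = Tuple[str, str, bool]

def build_profiles(results: List[SmokeResult]) -> Dict[Tuple[str, str], float]:
    """Laplace-smoothed success rate per (policy, task category)."""
    wins = defaultdict(int)
    trials = defaultdict(int)
    for policy, category, ok in results:
        trials[(policy, category)] += 1
        wins[(policy, category)] += int(ok)
    return {key: (wins[key] + 1) / (trials[key] + 2) for key in trials}

def route(profiles: Dict[Tuple[str, str], float],
          policies: List[str], category: str) -> str:
    """Pick the policy with the highest estimated success rate; unseen pairs score 0.5."""
    return max(policies, key=lambda p: profiles.get((p, category), 0.5))

smoke = [
    ("vla_a", "pick", True), ("vla_a", "pick", True), ("vla_a", "pour", False),
    ("vla_b", "pick", False), ("vla_b", "pour", True), ("vla_b", "pour", True),
]
profiles = build_profiles(smoke)

print(route(profiles, ["vla_a", "vla_b"], "pick"))  # vla_a
print(route(profiles, ["vla_a", "vla_b"], "pour"))  # vla_b
```

The experts stay frozen throughout; only the lookup table changes as more test rollouts arrive, which is what makes this cheap to run in a scheduling layer.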

Today's Take

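Today's trend is clear: embodied agents are shifting from "can the model produce actions" to "can the system execute, verify, recover, and reuse experience over the long run." The line most worth Anna's attention is execution-time verifier + memory + environment-native feedback, which shows up simultaneously in real robots, GUI agents, simulation infra, and VLA policies. Bluntly put: a merely bigger VLA/MLLM is no longer the core selling point; whoever makes the execution loop stable is closer to the future agent OS.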