GPT-4V-Act: Chromium Copilot

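🔥 Update (September 14, 2024): Check out Windows Agent Arena (WAA)! That project includes all the functionality previously planned for GPT-4V-Act, including AI labeling intended to enable Set-of-Mark prompting in desktop UI environments.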

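GPT-4V-Act is a multimodal AI assistant that combines GPT-4V(ision) with a web browser. It is designed to mirror the input and output of a human operator: primarily screen feedback and low-level mouse/keyboard interaction. The goal is a smooth handoff between human and computer operation, enabling tools that improve the accessibility of any user interface (UI), support workflow automation, and enable automated UI testing.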

How it works

GPT-4V-Act uses GPT-4V(ision) and Set-of-Mark Prompting together with a custom auto-labeler. The auto-labeler assigns a unique numerical ID to each interactable UI element.
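The labeling step can be pictured with a small sketch. This is not the project's actual auto-labeler: the `Candidate` shape, the `INTERACTABLE_TAGS` set, and the `labelElements` function are hypothetical names introduced here, and the rule for what counts as "interactable" is an assumption made for illustration.

```typescript
// Hypothetical description of a page element; not taken from the GPT-4V-Act source.
interface Candidate {
  tag: string;
  x: number;
  y: number;
  width: number;
  height: number;
  visible: boolean;
}

// A candidate that has been given a numeric label.
interface LabeledElement extends Candidate {
  id: number;
}

// Assumption: these tags are treated as interactable for the purpose of this sketch.
const INTERACTABLE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Keep visible, non-empty, interactable elements and number them 1, 2, 3, ...
function labelElements(candidates: Candidate[]): LabeledElement[] {
  return candidates
    .filter(
      (c) =>
        c.visible &&
        c.width > 0 &&
        c.height > 0 &&
        INTERACTABLE_TAGS.has(c.tag.toLowerCase())
    )
    .map((c, index) => ({ ...c, id: index + 1 }));
}
```

In a browser, the candidates would presumably come from the DOM (for example via bounding rectangles of matching elements); the sketch keeps that part out so the numbering logic stays easy to follow.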

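Given a task and a screenshot as input, GPT-4V-Act can deduce the next action required to complete the task. For mouse/keyboard output, it refers to the numerical labels to obtain exact pixel coordinates.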
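One way to turn a numeric label back into a click position is to look up the label's bounding box and take its center. The sketch below assumes exactly that; `LabeledBox`, `Point`, and `resolveClickPoint` are hypothetical names, and clicking the center of the box is an illustrative choice, not a confirmed detail of the project.

```typescript
// Hypothetical record of a labeled element's bounding box, in screenshot pixels.
interface LabeledBox {
  id: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Look up the box with the given label and return its center, rounded to whole pixels.
function resolveClickPoint(boxes: LabeledBox[], elementId: number): Point {
  const box = boxes.find((b) => b.id === elementId);
  if (box === undefined) {
    throw new Error(`No labeled element with id ${elementId}`);
  }
  return {
    x: Math.round(box.x + box.width / 2),
    y: Math.round(box.y + box.height / 2),
  };
}

// Usage: the model answers with element 7, which maps to a pixel position.
const clickPoint = resolveClickPoint(
  [{ id: 7, x: 100, y: 40, width: 200, height: 30 }],
  7
);
console.log(clickPoint); // { x: 200, y: 55 }
```

Throwing on an unknown label is deliberate here: the model can return an ID that is not on screen, and failing loudly seems safer than clicking an arbitrary position.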

Get Started!

# Clone the repo
git clone https://github.com/ddupont808/GPT-4V-Act ai-browser
# Navigate to the repo directory
cd ai-browser
# Install the required packages
npm install
# Start the demo
npm start

Features

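🔄 Vision (Partial)
- ✅ JS DOM auto-labeler (with COCO export)
- ❌ AI auto-labeler
✅ Clicking
🔄 Typing (Partial)
- ✅ Typing characters (letters, numbers, strings)
- ❌ Typing special keycodes (Enter, Page Up, Page Down)
❌ Scrolling
❌ Prompting the user for more information
❌ Remembering information relevant to the task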
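For readers unfamiliar with the COCO export mentioned in the list above, the following is a rough sketch of what labeled bounding boxes look like in the COCO object-detection layout (`images`, `annotations`, `categories`, with `bbox` as `[x, y, width, height]`). The function `toCoco`, the `Box` type, and the single `interactable` category are assumptions made for illustration; the project's actual export may be organized differently.

```typescript
// Hypothetical bounding box of one labeled element, in screenshot pixels.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Minimal subset of the COCO detection format used by this sketch.
interface CocoDataset {
  images: { id: number; file_name: string; width: number; height: number }[];
  annotations: {
    id: number;
    image_id: number;
    category_id: number;
    bbox: [number, number, number, number]; // [x, y, width, height]
    area: number;
    iscrowd: 0 | 1;
  }[];
  categories: { id: number; name: string }[];
}

// Build a one-image COCO dataset from a screenshot's labeled boxes.
function toCoco(
  fileName: string,
  imageWidth: number,
  imageHeight: number,
  boxes: Box[]
): CocoDataset {
  return {
    images: [{ id: 1, file_name: fileName, width: imageWidth, height: imageHeight }],
    annotations: boxes.map((b, index) => ({
      id: index + 1,
      image_id: 1,
      category_id: 1,
      bbox: [b.x, b.y, b.width, b.height],
      area: b.width * b.height,
      iscrowd: 0,
    })),
    // Assumption: a single category covering every interactable element.
    categories: [{ id: 1, name: "interactable" }],
  };
}
```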

If you have ideas, feedback, or would like to contribute, feel free to create an Issue or reach out at ddupont@mit.edu.

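Demonstration Prompt

Below is an example of using the user interface to instruct the agent to "play a random song for me".

This is the prompt that GPT-4V sees, along with its corresponding output:

User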

task: play a random song for me

type ClickAction = { action: "click", element: number }
type TypeAction = { action: "type", element: number, text: string }
type ScrollAction = { action: "scroll", direction: "up" | "down" }
type RequestInfoFromUser = { action: "request-info", prompt: string }
type RememberInfoFromSite = { action: "remember-info", info: string }
type Done = { action: "done" }

## response format
{
  briefExplanation: string,
  nextAction: ClickAction | TypeAction | ScrollAction | RequestInfoFromUser | RememberInfoFromSite | Done
}

## instructions
# observe the screenshot, and think about the next action
# output your response in a json markdown code block

Assistant

{
  "briefExplanation": "I'll type 'random song' into the search bar to find a song for you.",
  "nextAction": { "action": "type", "element": 7, "text": "random song" }
}
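Since the prompt asks for the reply inside a JSON markdown code block, the calling code has to extract and check that JSON before acting on it. The sketch below shows one possible way to do so. The action types are copied from the prompt above; `NextAction`, `AgentResponse`, `isNextAction`, and `parseAgentResponse` are hypothetical names, and the validation rules are an assumption rather than the project's own parsing code.

```typescript
// Action types as given in the prompt.
type ClickAction = { action: "click"; element: number };
type TypeAction = { action: "type"; element: number; text: string };
type ScrollAction = { action: "scroll"; direction: "up" | "down" };
type RequestInfoFromUser = { action: "request-info"; prompt: string };
type RememberInfoFromSite = { action: "remember-info"; info: string };
type Done = { action: "done" };

// Hypothetical helper names introduced for this sketch.
type NextAction =
  | ClickAction
  | TypeAction
  | ScrollAction
  | RequestInfoFromUser
  | RememberInfoFromSite
  | Done;

interface AgentResponse {
  briefExplanation: string;
  nextAction: NextAction;
}

// Build the three-backtick fence at runtime so it never appears literally in this listing.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Runtime check that an unknown value matches one of the action shapes.
function isNextAction(value: unknown): value is NextAction {
  if (typeof value !== "object" || value === null) return false;
  const a = value as Record<string, unknown>;
  switch (a["action"]) {
    case "click":
      return typeof a["element"] === "number";
    case "type":
      return typeof a["element"] === "number" && typeof a["text"] === "string";
    case "scroll":
      return a["direction"] === "up" || a["direction"] === "down";
    case "request-info":
      return typeof a["prompt"] === "string";
    case "remember-info":
      return typeof a["info"] === "string";
    case "done":
      return true;
    default:
      return false;
  }
}

// Extract the fenced JSON (or fall back to the whole reply), parse it, and validate it.
function parseAgentResponse(reply: string): AgentResponse {
  const match = FENCED_JSON.exec(reply);
  const jsonText = match?.[1] ?? reply;
  const parsed: unknown = JSON.parse(jsonText);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Response is not a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const explanation = obj["briefExplanation"];
  const action = obj["nextAction"];
  if (typeof explanation !== "string" || !isNextAction(action)) {
    throw new Error("Response does not match the expected format");
  }
  return { briefExplanation: explanation, nextAction: action };
}
```

Validating at runtime matters because TypeScript types are erased at compile time and the model's output is untrusted text; an unrecognized action is rejected instead of being passed on to the mouse/keyboard layer.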