unitedbyai/droidclaw

turn old phones into ai agents - give it a goal in plain english. it reads the screen, thinks about what to do, taps and types via adb, and repeats until the job is done.

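GAI summary

Droidclaw is an open-source project that turns old Android phones into automated AI agents. It uses a large language model to give the phone the ability to perceive and act: the user describes a goal in natural language, and the agent taps, types, and swipes on screen to complete it, with no need to write complex automation flows or integrate APIs.

The project runs a perceive, reason, act loop, analyzing screen content to decide each step in real time. Built-in stuck-loop detection and repeated-action tracking help it avoid logic traps during automated execution. A vision fallback handles complex apps and games that the accessibility tree cannot parse, using screenshot analysis to tap accurately. Action feedback and multi-turn memory let the agent adjust based on the result of the previous step. It calls no third-party APIs; it operates the device the way a person would, which lets it carry out tasks and automate interactions across apps.

The project suits developers and tinkerers who want to use idle Android devices for automated testing, task execution, or smart-assistant experiments, and it can handle tedious automation tasks that have no API.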

⭐ 1.5k Stars | 🔱 224 Forks | TypeScript | Created 2026/2/6 | Updated today

README

droidclaw

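An AI agent that controls your Android phone. Give it a goal in plain English, and it figures out what to tap, type, and swipe.

Download Android APK (v0.5.3) | Dashboard | Discord

I wanted to turn my idle Android devices into AI agents. After a few hours of reverse-engineering accessibility trees and some fiddling with Tailscale, it worked.

Think of it this way: a few years ago, we could only automate Android through predefined flows. Now imagine that automation layer has an LLM brain. It can read any screen, understand the current state, decide on the next step, and execute it. You don't need APIs or custom integrations. Just install the apps you like and tell the agent what you want done.

One of the coolest things it can do right now is delegate an incoming request to ChatGPT, Gemini, or Google Search on the device and report the result back. No API keys for those services are needed; it just uses the apps the way a human would.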

$ bun run src/kernel.ts
enter your goal: open youtube and search for "lofi hip hop"

--- step 1/30 ---
think: i'm on the home screen. launching youtube.
action: launch (842ms)

--- step 2/30 ---
think: youtube is open. tapping search icon.
action: tap (623ms)

--- step 3/30 ---
think: search field focused.
action: type "lofi hip hop" (501ms)

--- step 4/30 ---
action: enter (389ms)

--- step 5/30 ---
think: search results showing. done.
action: done (412ms)

How it works

The core idea is simple: a perception → reasoning → action loop that repeats until the goal is reached or the maximum step limit is hit.
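
A minimal TypeScript sketch of that loop, assuming the three stages are injected as async functions. The names `runAgent`, `AgentDeps`, `Decision`, and `StepRecord` are hypothetical and not taken from the droidclaw source. The default of 30 steps mirrors the `step 1/30` counter in the session log above, and the `done` action mirrors its final step.

```typescript
// Hypothetical types for illustration; not droidclaw's actual data model.
type Decision = { think: string; action: string };
type StepRecord = { step: number; decision: Decision; result: string };
type RunResult = { status: "done" | "max_steps"; history: StepRecord[] };

interface AgentDeps {
  // Returns a textual description of the current screen.
  perceive: () => Promise<string>;
  // Asks the model for the next move, given the goal, screen, and past steps.
  reason: (goal: string, screen: string, history: StepRecord[]) => Promise<Decision>;
  // Executes the action on the device and reports what happened.
  act: (decision: Decision) => Promise<string>;
}

async function runAgent(goal: string, deps: AgentDeps, maxSteps = 30): Promise<RunResult> {
  const history: StepRecord[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const screen = await deps.perceive();
    const decision = await deps.reason(goal, screen, history);
    if (decision.action === "done") {
      // The model judged the goal complete; stop without touching the device.
      history.push({ step, decision, result: "goal reached" });
      return { status: "done", history };
    }
    // The result is stored so the next reasoning call can see it.
    const result = await deps.act(decision);
    history.push({ step, decision, result });
  }
  // Step budget exhausted before the model reported completion.
  return { status: "max_steps", history };
}
```

Passing the stages in as functions keeps the loop independent of any particular LLM provider or device transport, so it can be exercised with stubs before a phone is attached.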

                         ┌─────────────────────────────────────────┐
                         │              your goal                  │
                         │   "send good morning to mom on whatsapp"│
                         └────────────────┬────────────────────────┘
                                          │
                                          ▼
                    ┌─────────────────────────────────────────────────┐
                    │                                                 │
                    │              ┌──────────────┐                   │
                    │              │  1. perceive  │                   │
                    │              └──────┬───────┘                   │
                    │                     │                           │
                    │    dump accessibility tree via adb               │
                    │    parse xml → interactive ui elements           │
                    │    diff with previous screen (detect changes)    │
                    │    optionally capture screenshot                 │
                    │                     │                           │
                    │                     ▼                           │
                    │              ┌──────────────┐                   │
                    │              │  2. reason    │                   │
                    │              └──────┬───────┘                   │
                    │                     │                           │
                    │    send screen state + goal + history to llm     │
                    │    llm returns { think, plan, action }           │
                    │    "i see the search icon at (890, 156).         │
                    │     i should tap it."                            │
                    │                     │                           │
                    │                     ▼                           │
                    │              ┌──────────────┐                   │
                    │              │  3. act       │                   │
                    │              └──────┬───────┘                   │
                    │                     │                           │
                    │    execute via adb: tap, type, swipe, etc.       │
                    │    feed result back to llm on next step          │
                    │    check if goal is done                        │
                    │                     │                           │
                    │                     ▼                           │
                    │               done? ─────── yes ──→ exit        │
                    │                │                                │
                    │                no                               │
                    │                │                                │
                    │                └─────── loop back to perceive   │
                    │                                                 │
                    └─────────────────────────────────────────────────┘
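
The perceive step in the diagram can be sketched as follows. This assumes the accessibility tree arrives as XML in the format produced by `adb shell uiautomator dump`, where each `<node>` carries attributes such as `clickable` and `bounds="[x1,y1][x2,y2]"`; the text above only says the tree is dumped via adb, so the exact command is an assumption. `UiElement`, `parseInteractiveElements`, and `diffScreens` are hypothetical names, and the regex-based parsing is a simplification, not droidclaw's actual parser.

```typescript
// Hypothetical shape for one interactive element; not taken from the droidclaw source.
interface UiElement {
  text: string;
  resourceId: string;
  className: string;
  clickable: boolean;
  scrollable: boolean;
  center: { x: number; y: number };
}

// Collect name="value" pairs from a single <node ...> tag.
function parseAttributes(tag: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  const re = /([\w-]+)="([^"]*)"/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(tag)) !== null) {
    const [, name = "", value = ""] = m;
    attrs[name] = value;
  }
  return attrs;
}

// Keep only enabled nodes that can be clicked or scrolled, with a tap point at their center.
function parseInteractiveElements(xml: string): UiElement[] {
  const elements: UiElement[] = [];
  const nodeRe = /<node\b[^>]*>/g;
  let m: RegExpExecArray | null;
  while ((m = nodeRe.exec(xml)) !== null) {
    const a = parseAttributes(m[0] ?? "");
    const clickable = a["clickable"] === "true";
    const scrollable = a["scrollable"] === "true";
    if (a["enabled"] === "false" || !(clickable || scrollable)) continue;
    const b = /^\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]$/.exec(a["bounds"] ?? "");
    if (b === null) continue; // skip nodes without usable bounds
    const [, x1 = "0", y1 = "0", x2 = "0", y2 = "0"] = b;
    elements.push({
      text: a["text"] || a["content-desc"] || "",
      resourceId: a["resource-id"] ?? "",
      className: a["class"] ?? "",
      clickable,
      scrollable,
      center: {
        x: Math.round((Number(x1) + Number(x2)) / 2),
        y: Math.round((Number(y1) + Number(y2)) / 2),
      },
    });
  }
  return elements;
}

// Identity used to decide whether an element survived from one screen to the next.
function elementKey(e: UiElement): string {
  return `${e.resourceId}|${e.className}|${e.text}|${e.center.x},${e.center.y}`;
}

// Compare two screens so an agent can tell whether the last action changed anything.
function diffScreens(prev: UiElement[], curr: UiElement[]) {
  const prevKeys = new Set(prev.map(elementKey));
  const currKeys = new Set(curr.map(elementKey));
  const added = curr.filter((e) => !prevKeys.has(elementKey(e)));
  const removed = prev.filter((e) => !currKeys.has(elementKey(e)));
  return { added, removed, changed: added.length > 0 || removed.length > 0 };
}
```

For a clickable node with `bounds="[840,106][940,206]"`, the computed tap target is the center, (890, 156). A fuller implementation would likely use a real XML parser and decode entities such as `&amp;` in text values, which this sketch leaves as-is.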

Keeping it reliable

Controlling a UI with an LLM sounds fragile, and it is if failure modes go unhandled. Here is how Droidclaw deals with them: