OmniParser Quick Start Guide

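# OmniParser Quick Start Guide

Drawing on our earlier discussions of Ubuntu deployment, Base64 decoding, multilingual OCR support, and restoring `bbox` coordinates, this guide walks you step by step, from scratch, through building a "screenshot → parse → LLM → execute" visual automation tool with OmniParser V2.

---

## 1. Project Overview

**OmniParser** is Microsoft's open-source, vision-driven GUI parsing tool. Its core capabilities:

1. **Detection**: locates interactable elements (buttons, icons, input fields, and so on) with a YOLOv8-based model.
2. **Captioning**: generates a semantic label for each element with Florence-2/BLIP2.
3. **OCR extraction**: ships with EasyOCR and PaddleOCR; English by default, extensible to Chinese and other languages.
4. **Structured output**: returns a list of elements with fields such as `bbox` (normalized coordinates `[x1, y1, x2, y2]`, expressed as fractions of the image width and height), `content` (text), and `interactivity`.

It turns pixel-level screenshots into structured, actionable elements, giving an LLM (such as GPT-4V or Qwen) precise visual input.

---

## 2. Deploying the Parsing Service (Ubuntu)

1. **Prepare the environment**
   - Ubuntu 20.04+, Python 3.12 (Conda recommended).
   - GPU + CUDA (optional, but recommended for lower latency).
2. **Clone the repository and install dependencies**
   ```bash
   git clone https://github.com/microsoft/OmniParser.git
   cd OmniParser
   conda create -n omni python=3.12 && conda activate omni
   pip install -r requirements.txt
   ```
3. **Download the weights**
   ```bash
   # huggingface-cli keeps each file's repo-relative path under --local-dir,
   # so download into weights/ rather than weights/icon_detect (which would
   # produce weights/icon_detect/icon_detect/...).
   # detection
   huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.pt --local-dir weights
   huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.yaml --local-dir weights
   # caption
   huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/config.json --local-dir weights
   huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/model.safetensors --local-dir weights
   # rename the caption directory to weights/icon_caption_florence
   mv weights/icon_caption weights/icon_caption_florence
   ```
4. **Start the service**
   ```bash
   cd omnitool/omniparserserver
   uvicorn omniparserserver:app --host 0.0.0.0 --port 7861
   ```
   - Open `http://<server-IP>:7861/docs` to browse the Swagger API.

---

## 3. End-to-End Flow

1. **Take a screenshot** (with whatever the client already uses, such as mss or PyAutoGUI) to obtain an image of `W×H` pixels.
2. **Base64-encode it**
   ```bash
   img_b64=$(base64 -w0 screenshot.png)
   ```
3. **Call the parse endpoint**
   ```bash
   curl -X POST http://<server-IP>:7861/parse/ \
     -H 'Content-Type: application/json' \
     -d '{"base64_image":"'"$img_b64"'"}'
   ```
4. **Handle the Base64 prefix and padding** (already done by our modified server)
   - The server automatically strips a `data:…;base64,` prefix and restores any missing `=` padding.
5. **Sample parse result**
   ```json
   {
     "elements": [
       {
         "type": "icon",
         "bbox": [0.0014, 0.4193, 0.1504, 0.4856],
         "interactivity": true,
         "content": "Download",
         "source": "box_yolo_content_ocr"
       }
     ]
   }
   ```
6. **Restore pixel coordinates**
   ```python
   # W, H are the width and height of the screenshot in pixels
   x1, y1, x2, y2 = element['bbox']
   px1, py1 = int(x1 * W), int(y1 * H)
   px2, py2 = int(x2 * W), int(y2 * H)
   # center of the box, used as the click target
   cx, cy = (px1 + px2) // 2, (py1 + py2) // 2
   ```
7. **LLM decision**
   ```python
   import openai

   # build_prompt is your own helper that turns the parsed elements
   # and the user's intent into a text prompt
   prompt = build_prompt(elements, user_intent)
   resp = openai.chat.completions.create(
       model='gpt-4o',
       messages=[{'role': 'user', 'content': prompt}],
   )
   action = resp.choices[0].message.content
   ```
8. **Execute the action** (PyAutoGUI):
   ```python
   import pyautogui

   if action.startswith('CLICK'):
       pyautogui.click(cx, cy)
   ```

---

## 4. Multilingual OCR Support

In `util/utils.py`:

```python
# EasyOCR: Simplified Chinese + English
reader = easyocr.Reader(['ch_sim', 'en'])
# PaddleOCR: the 'ch' model recognizes Chinese and English
paddle_ocr = PaddleOCR(lang='ch', use_angle_cls=True)
```

After restarting the service, Chinese and English are recognized together.

---

## 5. Best Practices and Optimization

- **Confidence filtering**: trigger actions only on elements with `interactivity=true` and a confidence above 0.3.
- **Safety checks**: before a critical action, ask the LLM a second time to confirm.
- **Concurrency and caching**: parse results can be cached and reused to avoid repeated inference.
- **Logging loop**: save screenshots, parse results, decisions, and execution records for later model fine-tuning.

---

You now have the complete OmniParser workflow, from deployment to real-world use. Try it out and let your automation agent both "see" and "act"!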
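The Base64 handling in step 4 of the end-to-end flow is described but the server-side change itself is not shown. The following is only a minimal sketch of what such normalization might look like; the helper name `normalize_base64` is hypothetical and is not part of the OmniParser codebase.

```python
import base64
import re

# Hypothetical helper for illustration; not taken from OmniParser.
# Matches a leading data-URI header such as "data:image/png;base64,".
_DATA_URI_PREFIX = re.compile(r"^data:[^,]*;base64,", re.IGNORECASE)


def normalize_base64(payload: str) -> bytes:
    """Strip an optional data-URI prefix, restore padding, and decode."""
    # Remove whitespace and line breaks that some encoders insert.
    cleaned = "".join(payload.split())
    # Drop the "data:<mime>;base64," prefix if the client sent one.
    cleaned = _DATA_URI_PREFIX.sub("", cleaned, count=1)
    # Base64 text must have a length that is a multiple of 4;
    # append the "=" characters the client may have stripped.
    cleaned += "=" * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


if __name__ == "__main__":
    raw = normalize_base64("data:image/png;base64,aGVsbG8")
    print(raw)
```

Whitespace is removed first because `base64` without `-w0` wraps its output across lines, which would otherwise end up inside the JSON payload.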

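The confidence-filtering practice and the pixel-coordinate restoration from step 6 can be combined into one small helper. This is a sketch under assumptions: the sample response in this guide has no confidence field, so the `confidence` key (and treating a missing key as passing) is an assumption, and `clickable_centers` is a hypothetical name.

```python
def clickable_centers(elements, width, height, min_conf=0.3):
    """Return (content, cx, cy) for interactive elements above the threshold.

    elements: list of dicts shaped like the sample parse result
    width, height: screenshot size in pixels
    """
    targets = []
    for el in elements:
        # Skip anything the parser did not mark as interactable.
        if not el.get("interactivity"):
            continue
        # Assumption: "confidence" may be absent; absent means "keep".
        if el.get("confidence", 1.0) <= min_conf:
            continue
        # Normalized [x1, y1, x2, y2] -> pixel corners -> box center.
        x1, y1, x2, y2 = el["bbox"]
        px1, py1 = int(x1 * width), int(y1 * height)
        px2, py2 = int(x2 * width), int(y2 * height)
        targets.append((el.get("content", ""), (px1 + px2) // 2, (py1 + py2) // 2))
    return targets


if __name__ == "__main__":
    sample = [
        {"bbox": [0.0, 0.0, 0.5, 0.5], "interactivity": True, "content": "Download"},
        {"bbox": [0.5, 0.5, 1.0, 1.0], "interactivity": False, "content": "Label"},
    ]
    for content, cx, cy in clickable_centers(sample, 1000, 800):
        print(content, cx, cy)
```

Returning plain tuples keeps the helper independent of PyAutoGUI, so the same list can be passed to the LLM prompt or logged before anything is clicked.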
