🚀 LocateAnything

Locate any object in images or videos with natural language.
Upload an image/video on the left, choose a task type, enter what you want to find, then click Run Inference. Results with bounding boxes will appear on the right.

Quick Start: ① Select Image or Video → ② Pick a Task Type (Detection / Grounding / OCR / GUI / Pointing) → ③ Type your Categories (comma-separated) → ④ Click 🧠 Run Inference
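As a rough illustration of step ③, a comma-separated Categories string could be split into individual category names as below. This is a minimal sketch, not the demo's actual code: the function name `parse_categories` is hypothetical, and trimming whitespace, skipping empty entries, and dropping case-insensitive duplicates are assumptions.

```python
def parse_categories(text: str) -> list[str]:
    """Split a comma-separated string into clean category names.

    Hypothetical helper: the demo's real parsing rules are not documented here.
    """
    seen = set()
    categories = []
    for part in text.split(","):
        name = part.strip()  # drop spaces around each entry
        key = name.lower()   # assumption: "Dog" and "dog" count as the same category
        if name and key not in seen:
            seen.add(key)
            categories.append(name)
    return categories


if __name__ == "__main__":
    print(parse_categories("person, dog,, Dog , car"))
```

Entering `person, dog, car` in the Categories box would therefore correspond to three separate categories under these assumptions.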

⚠️ Note: magi-attention cannot be installed in this Hugging Face Space, so inputs larger than 1K are resized to 1K in this demo.

For full-resolution inference, please download the weights and run the model locally.
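To get a feel for the resizing mentioned in the note above, the sketch below computes the output size for a given input. It assumes that "1K" means a longest side of 1024 pixels and that the aspect ratio is preserved; neither detail is confirmed by this page, and `fit_within` is a hypothetical helper, not part of the model's code.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side.

    Assumption: "1K" is read as 1024 px on the longest side, aspect ratio kept.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # inputs already within the limit are left untouched
    # Scale both sides by the same factor, keeping each side at least 1 pixel.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_within(4000, 3000))
    print(fit_within(800, 600))
```

Under these assumptions, a 4000×3000 photo would be processed at 1024×768 in this demo, while an 800×600 image would pass through unchanged.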

⚙️ Settings

1. Input Media Type

Select whether to process a single image or a video clip.

2. Task Type

Detection: find all instances of each category | Grounding: locate the object that matches a description | OCR: extract text | GUI: locate a UI element | Pointing: point to a target

4. Inference Mode

fast: multi-token prediction (MTP) parallel decoding | slow: standard autoregressive (AR) decoding | hybrid: switches automatically between the two for the best quality-speed balance

📥 Input Media

📤 Output Result

📝 Raw Input Prompt

🔍 Decoding Visualization


🖼️ Examples

Click any example below to auto-fill the settings and input image.