🚀 LocateAnything

Locate any object in images or videos with natural language.
Upload an image or video on the left, choose a task type, enter what you want to find, then click Run Inference. Results with bounding boxes appear on the right.

Quick Start: ① Select Image or Video → ② Pick a Task Type (Detection / Grounding / OCR / GUI / Pointing) → ③ Type your Categories (comma-separated) → ④ Click 🧠 Run Inference

⚠️ Note: magi-attention cannot be installed in this Hugging Face Space, so inputs larger than 1K are resized to 1K in this demo.

For full-resolution inference, please download the weights and run the model locally.
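The resizing mentioned in the note can be sketched as follows. This is only an illustration, not the demo's actual code: it assumes that "1K" means 1024 pixels on the longer side and that the aspect ratio is preserved. The function name `fit_within` and the parameter `max_side` are hypothetical.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return the size an input would be scaled to so its longer side is at most max_side.

    Assumption: "1K" is read as 1024 px on the longer side, aspect ratio kept.
    """
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: leave the input untouched.
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(800, 600))    # small input, unchanged
    print(fit_within(2048, 1024))  # large input, downscaled
```

Running the model locally at full resolution would skip this step entirely.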

⚙️ Settings

1. Input Media Type

Select whether to process a single image or a video clip.

Image | Video

2. Task Type

Detection: find all instances | Grounding: match description | OCR: extract text | GUI: locate UI element | Pointing: point to target

3. Categories

Enter one or more categories separated by commas. Supports both English and Chinese (e.g. 汽车, 行人).
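A minimal sketch of how a comma-separated category string might be split into a clean list. The demo's real parsing is not shown on this page, so `parse_categories` is a hypothetical helper. Accepting the full-width comma (，) alongside the ASCII comma is an added assumption, made because Chinese input is supported.

```python
import re


def parse_categories(text: str) -> list[str]:
    """Split a category string on commas and drop empty entries.

    Assumption: both "," and the full-width "，" act as separators.
    """
    parts = re.split(r"[,，]", text)
    # Trim surrounding whitespace and skip blanks left by trailing commas.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(parse_categories("car, person"))
    print(parse_categories("汽车, 行人"))
```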

4. Inference Mode

fast: MTP (multi-token prediction) parallel decoding | slow: standard AR (autoregressive) decoding | hybrid: switches automatically between the two for the best quality-speed balance

📥 Input Media

Input Image

📤 Output Result

Detection Result

📝 Raw Input Prompt


This is the prompt sent to the model (auto-generated from your settings above).

🔍 Decoding Visualization

🖼️ Examples

Click any example below to auto-fill the settings and input image.
