---
license: apache-2.0
---
<div align="center">
<img src="https://github.com/FlagOpen/RoboBrain2.5/raw/main/assets/logo2.png" width="500"/>
</div>
<h1 align="center">RoboBrain 2.5: Depth in Sight, Time in Mind.</h1>

<p align="center">
⭐️ <a href="https://superrobobrain.github.io/">Project Page</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/collections/BAAI/robobrain25/">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://github.com/FlagOpen/RoboBrain2.5">GitHub</a>
</p>

# Introduction

**FlagOS** is a unified heterogeneous computing software stack for large models, co-developed with leading global chip manufacturers. Built on core technologies such as the **FlagScale** distributed training/inference framework, the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the FlagOS stack to automatically produce and release various combinations of <chip + open-source model>. This enables efficient, automated model migration across diverse chips, opening a new chapter for large-model deployment and application.

Based on this, the **RoboBrain2.5-8B-FlagOS** model has been adapted for the Ascend chip using the FlagOS software stack, enabling:

### Integrated Deployment

- Deep integration with the open-source [FlagScale framework](https://github.com/FlagOpen/FlagScale)
- Out-of-the-box inference scripts with pre-configured hardware and software parameters
- Released **FlagOS-Ascend** container image supporting deployment within minutes

### Consistency Validation

- Rigorous benchmark testing: performance and results from the FlagOS software stack are compared against the native software stack on multiple public benchmarks.

## 🔥 Overview

**RoboBrain-2.5** is a next-generation Embodied AI foundation model that significantly evolves its predecessor's core capabilities in general perception, spatial reasoning, and temporal modeling through extensive training on high-quality spatiotemporal data. It achieves a paradigm shift in 3D spatial reasoning, moving from 2D relative points to predicting 3D coordinates with depth information, understanding absolute metric constraints, and generating complete manipulation trajectories for complex tasks under physical constraints. It also achieves a breakthrough in temporal value prediction by constructing a general reward modeling method that provides dense progress tracking and multi-granular execution-state estimation across varying viewpoints. This supplies VLA reinforcement learning with immediate, dense feedback signals, enabling robots to achieve high task success rates and robustness in fine-grained manipulation scenarios.

<div align="center">
<img src="https://github.com/FlagOpen/RoboBrain2.5/raw/main/assets/rb25_feature.png" />
</div>

<div align="center">
<img src="https://github.com/FlagOpen/RoboBrain2.5/raw/main/assets/rb25_result.png" />
</div>

## 🚀 Key Highlights

### 1. Comprehensive Upgrade in ✨ Native 3D Spatial Reasoning ✨

Compared to version 2.0, **RoboBrain-2.5** achieves a leap in spatial perception and reasoning capabilities:
* **From 2D to 3D:** Upgraded from predicting coordinate points on 2D images to predicting coordinate points with depth information in **3D space** (3D Spatial Referring).
* **Relative to Absolute:** Evolved from understanding relative spatial relationships to measuring **absolute 3D spatial metric information** (3D Spatial Measuring). The model can follow precise physical-constraint instructions (e.g., "hovering 1-5 cm above").
* **Point to Trace:** Advanced from predicting a single target point for pick-and-place to predicting a **series of key points** that describe the complete manipulation process (3D Spatial Trace), naturally providing spatial planning with 3D absolute metrics.

### 2. Breakthrough in ✨ Dense Temporal Value Estimation ✨

**RoboBrain-2.5** makes significant progress in temporal modeling by constructing a General Reward Model (GRM):
* **Dense Progress Prediction:** Multi-granularity task-progress prediction across different tasks, viewpoints, and embodiments.
* **Execution State Estimation:** Understands task goals and estimates execution states (e.g., success, failure, error occurrence).
* **Empowering VLA Reinforcement Learning:** Provides real-time, dense feedback signals and rewards for VLA (Vision-Language-Action) reinforcement learning. With only **one demonstration**, it achieves a task success rate of **95%+** in complex, fine-grained manipulations.

### 3. More Powerful Core Capabilities Inherited from Version 2.0

**RoboBrain 2.5** also retains the core capabilities of version 2.0: ***interactive reasoning*** with long-horizon planning and closed-loop feedback, ***spatial perception*** for precise point and bounding-box prediction from complex instructions, ***temporal perception*** for future trajectory estimation, and ***scene reasoning*** through real-time structured memory construction and update.

# Technical Overview

## **FlagScale Distributed Training and Inference Framework**

FlagScale is an end-to-end framework for large models across heterogeneous computing resources, maximizing computational efficiency and ensuring model validity through its core technologies. Its key advantages include:

- **Unified Deployment Interface:** Standardized command-line tools support one-click service deployment across multiple hardware platforms, significantly reducing adaptation costs in heterogeneous environments.
- **Intelligent Parallel Optimization:** Automatically generates optimal distributed parallel strategies based on chip computing characteristics, achieving dynamic load balancing of computation and communication resources.
- **Seamless Operator Switching:** Deep integration with the FlagGems operator library allows high-performance operators to be invoked via environment variables without modifying model code (see the sketch below).
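
The minimal sketch below illustrates the last two points using only the commands and environment variables shown later in this guide: FlagGems kernels are toggled through the `USE_FLAGGEMS` environment variable (set in the container under "Start the inference service"), and the pre-configured serving task is launched through FlagScale's command-line interface (the same `flagscale serve rb25` command used in the "Serve" step). Treat it as an illustration of the workflow, not a complete deployment script.

```python
import os
import subprocess

# Enable FlagGems operators through an environment variable, without touching
# the model code (the released container sets the same flag).
env = dict(os.environ, USE_FLAGGEMS="true")

# Launch the pre-configured RoboBrain2.5 serving task via FlagScale's unified CLI
# (equivalent to running `flagscale serve rb25` inside the container).
subprocess.run(["flagscale", "serve", "rb25"], env=env, check=True)
```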

## **FlagGems Universal Large-Model Operator Library**

FlagGems is a Triton-based, cross-architecture operator library collaboratively developed with industry partners. Its core strengths include:

- **Full-stack Coverage**: Over 100 operators, with a broader range of operator types than competing libraries.
- **Ecosystem Compatibility**: Supports 7 accelerator backends; ongoing optimizations have significantly improved performance.
- **High Efficiency**: Employs unique code generation and runtime optimization techniques for faster secondary development and better runtime performance than alternatives.

## **FlagEval Evaluation Framework**

**FlagEval (Libra)** is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools that help researchers assess model and training-algorithm performance. It features:

- **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
- **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation.

## 🛠️ Setup

```bash
# clone repo.
git clone https://github.com/FlagOpen/RoboBrain2.5.git
cd RoboBrain2.5

# build conda env.
conda create -n robobrain2_5 python=3.10
conda activate robobrain2_5
pip install -r requirements.txt
```

## 💡 Quickstart

### 1. Usage for General VQA

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example:
prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

pred = model.inference(prompt, image, task="general")
print(f"Prediction:\n{pred}")
```

# Evaluation Results

## Benchmark Results

| Metrics | RoboBrain2.5-8B-NV-CUDA | RoboBrain2.5-8B-ascend-FlagOS|
|-------------------|--------------------------|-----------------------------|
| erqa | 41.250 | 41.000 |
| Where2Place | 1.220 | 0.450 |
| blink_val_ev | 78.180 | 80.490 |
| cv_bench_test | 87.490 | 87.680 |
| embspatial_bench | 75.910 | 77.390 |
| SAT | 69.330 | 74.000 |
| vsi_bench_tiny | 38.340 | 35.610 |
| robo_spatial_home_all | 49.429 | 44.286 |
| all_angles_bench | 48.360 | 50.420 |
| egoplan_bench2 | 49.050 | 49.430 |
| EmbodiedVerse-Open-Sampled | 44.320 | 44.910 |
| ERQAPlus | 21.880 | 20.880 |

# User Guide

**Environment Setup**

| Component | Version |
| ------------------------------- | ----------------------------------- |
| Accelerator Card Driver | 25.2.0 |
| CANN | 8.3.0.2.220 (8.3.RC2) |
| FlagTree | 0.4.0 |
| FlagGems | 4.2.0 |
| VLLM-FL | 0.0.0 |

### 2. Usage for Visual Grounding (VG)

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example:
prompt = "the person wearing a red hat"
image = "./assets/demo/grounding.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt, image, task="grounding", plot=True, do_sample=False)
print(f"Prediction:\n{pred}")
```

## Operation Steps

### Download Open-source Model Weights

```bash
pip install modelscope
modelscope download --model FlagRelease/RoboBrain2.5-8B-FlagOS --local_dir /data/workspace-robobrain2.5/RoboBrain2.5-8B
```

### Download FlagOS Image

```bash
docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-ascend-release-model_robobrain2.5-8b-tree_0.4.0_ascend3.2e-gems_4.2.0-scale_1.0.0-cx_0.8.0-python_3.11.13-torch_npu2.8.0-pcp_cann8.3.0.2.220_8.3.rc2-gpu_ascend001-arc_arm64-driver_25.2.0:latest
```

### Start the inference service

```bash
# Container startup
docker run -itd --name flagos -u root --privileged=true --shm-size=1000g --net=host \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /data/workspace-robobrain2.5:/workspace-robobrain2.5 \
-v /root/.cache:/root/.cache \
-w /workspace-robobrain2.5 \
-e VLLM_USE_V1=1 \
-e CPU_AFFINITY_CONF=2 \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e USE_FLAGGEMS=true \
harbor.baai.ac.cn/flagrelease-public/flagrelease-ascend-release-model_robobrain2.5-8b-tree_0.4.0_ascend3.2e-gems_4.2.0-scale_1.0.0-cx_0.8.0-python_3.11.13-torch_npu2.8.0-pcp_cann8.3.0.2.220_8.3.rc2-gpu_ascend001-arc_arm64-driver_25.2.0:latest bash
```

### Serve

```bash
docker exec -it flagos bash
flagscale serve rb25
```

### 3. Usage for Affordance Prediction (Embodied)

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example:
prompt = "the affordance area for holding the cup"
image = "./assets/demo/affordance.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt, image, task="pointing", plot=True, do_sample=False)
print(f"Prediction:\n{pred}")
```

### 4. Usage for Referring Prediction (Embodied)

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example:
prompt = "Identify spot within the vacant space that's between the two mugs"
image = "./assets/demo/pointing.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt, image, task="pointing", plot=True, do_sample=True)
print(f"Prediction:\n{pred}")
```

## Service Invocation

### API-based Invocation Script

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://<server_ip>:9014/v1/"

model = "RoboBrain2.5-8B-ascend-flagos"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like today?"},
]

# Send a non-streaming chat completion request to the FlagOS service.
response = openai.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.7,
    top_p=0.95,
    stream=False,
)
print(response.choices[0].message.content)
```
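
Since RoboBrain2.5 is a vision-language model, image inputs can also be sent through the same OpenAI-compatible endpoint. The snippet below is a minimal sketch that assumes the served model accepts the standard OpenAI multimodal chat format (an `image_url` content part); the endpoint, API key, and model name are taken from the script above, and the image path is one of the demo assets used in the Quickstart examples.

```python
import base64

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://<server_ip>:9014/v1/"

# Optional: confirm the service is reachable and list the served model ids.
print([m.id for m in openai.models.list().data])

# Encode a local image as a data URL (a plain http(s) URL also works).
with open("./assets/demo/affordance.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = openai.chat.completions.create(
    model="RoboBrain2.5-8B-ascend-flagos",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```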

### 5. Usage for Navigation Tasks (Embodied)

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example 1:
prompt_1 = "Identify spot within toilet in the house"
image = "./assets/demo/navigation.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt_1, image, task="pointing", plot=True, do_sample=True)
print(f"Prediction:\n{pred}")

# Example 2:
prompt_2 = "Identify spot within the sofa in the house"
image = "./assets/demo/navigation.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt_2, image, task="pointing", plot=True, do_sample=True)
print(f"Prediction:\n{pred}")
```

### AnythingLLM Integration Guide

#### 1. Download & Install

- Visit the official site: https://anythingllm.com/
- Choose the appropriate version for your OS (Windows/macOS/Linux)
- Follow the installation wizard to complete the setup

#### 2. Configuration

- Launch AnythingLLM
- Open settings (bottom left, fourth tab)
- Configure the core LLM parameters (for this deployment, point the provider at the OpenAI-compatible endpoint above, e.g., base URL `http://<server_ip>:9014/v1/`, model `RoboBrain2.5-8B-ascend-flagos`, API key `EMPTY`)
- Click "Save Settings" to apply the changes
#### 3. Model Interaction

- After the model has loaded:
  - Click **"New Conversation"**
  - Enter your question (e.g., "Explain the basics of quantum computing")
  - Click the send button to get a response

# Contributing

We warmly welcome global developers to join us:

1. Submit Issues to report problems
2. Create Pull Requests to contribute code
3. Improve technical documentation
4. Expand hardware adaptation support

### 6. Usage for ✨ 3D Trajectory Prediction ✨ (Embodied)

```python
from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.5-8B-NV")

# Example:
prompt = "reach for the banana on the plate"
image = "./assets/demo/trajectory.jpg"

# Visualization results will be saved to ./result, if `plot=True`.
pred = model.inference(prompt, image, task="trajectory", plot=True, do_sample=False)
print(f"Prediction:\n{pred}")
```

### 7. Usage for ✨ Temporal Value Estimation ✨ (Embodied)
***We highly recommend referring to [Robo-Dopamine](https://github.com/FlagOpen/Robo-Dopamine) for detailed usage instructions.***
```bash
# clone Robo-Dopamine repo.
git clone https://github.com/FlagOpen/Robo-Dopamine.git
cd Robo-Dopamine
```
```python
import os
from examples.inference import GRMInference

# model = GRMInference("tanhuajie2001/Robo-Dopamine-GRM-3B")
model = GRMInference("BAAI/RoboBrain2.5-8B-NV")

TASK_INSTRUCTION = "organize the table"
BASE_DEMO_PATH = "./examples/demo_table"
GOAL_IMAGE_PATH = "./examples/demo_table/goal_image.png"
OUTPUT_ROOT = "./results"

output_dir = model.run_pipeline(
cam_high_path = os.path.join(BASE_DEMO_PATH, "cam_high.mp4"),
cam_left_path = os.path.join(BASE_DEMO_PATH, "cam_left_wrist.mp4"),
cam_right_path = os.path.join(BASE_DEMO_PATH, "cam_right_wrist.mp4"),
out_root = OUTPUT_ROOT,
task = TASK_INSTRUCTION,
frame_interval = 30,
batch_size = 1,
goal_image = GOAL_IMAGE_PATH,
eval_mode = "incremental",
visualize = True
)

print(f"Episode ({BASE_DEMO_PATH}) processed with Incremental-Mode. Output at: {output_dir}")
# License

```
本模型的权重来源于BAAI/RoboBrain2.5-8B-NV,以apache2.0协议https://www.apache.org/licenses/LICENSE-2.0.txt开源。