
Deploying the SAM3 Large Model

Meta's open-source model

Posted by LXG on March 26, 2026

SAM3-github

Introduction to the SAM3 Model

In the multimodal large-model space, the SAM 2/3 series and Qwen3-VL-8B-Instruct represent two distinctly different evolutionary directions: the former focuses on general-purpose segmentation and spatio-temporal perception, while the latter leans toward comprehensive visual understanding and interaction.

Setting Up the Environment


conda create -n sam3 python=3.12
conda deactivate
conda activate sam3

pip install opencv-python einops pycocotools psutil

Access Requirements

Type                 Purpose              Required?
Hugging Face Token   Login and download   ✅ Required
SAM3 Repo Access     Model authorization  ✅ Required (the critical one)

Authorization for the model weights is generally hard to obtain; if the request is not approved, you will need to find another way to get them.

Downloading and Installing the SAM3 API


git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

Test Script and Model Weights


(sam3) root@ebm9tkrl:~/code/sam3_test# tree -L 3
.
├── sam3_github
│   └── sam3
│       ├── CODE_OF_CONDUCT.md
│       ├── CONTRIBUTING.md
│       ├── LICENSE
│       ├── MANIFEST.in
│       ├── README.md
│       ├── README_TRAIN.md
│       ├── assets
│       ├── examples
│       ├── pyproject.toml
│       ├── sam3
│       ├── sam3.egg-info
│       └── scripts
├── sam3_model
│   ├── LICENSE
│   ├── README.md
│   ├── config.json
│   ├── gitattributes
│   ├── merges.txt
│   ├── model-e7be5886a47c.safetensors.qkdownloading
│   ├── processor_config.json
│   ├── sam3.pt
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── sam3_test.py
└── test_scene.jpg

File Transfer over SSH


# Download from the server: scp -P [port] [user]@[server IP]:[remote path] [local path]
scp -P 10357 root@ip:/root/code/sam3_test/object.jpg ./code/AI/

# Upload to the server: scp -P [port] [local file path] [user]@[server IP]:[remote path]
scp -P 10357 car_768x768.jpg root@ip:/root/code/sam3_test/

RTX 5090 PyTorch Version Issue

A PyTorch build that supports the RTX 5090 is required; the stock wheel fails with:


NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.

Install a wheel built for CUDA 12.8 or later:


pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
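After reinstalling, you can confirm the new wheel actually ships Blackwell (sm_120) kernels by inspecting `torch.cuda.get_arch_list()`. The check itself is just a membership test; a minimal sketch below uses illustrative arch lists (the example lists are assumptions, not output from a real install):

```python
def supports_sm(arch_list, major, minor):
    """True if the wheel was compiled with kernels for compute capability sm_<major><minor>."""
    return f"sm_{major}{minor}" in arch_list

# Illustrative arch lists; on a real install, pass torch.cuda.get_arch_list() instead.
cu118_archs = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
cu128_archs = cu118_archs + ["sm_100", "sm_120"]

print(supports_sm(cu118_archs, 12, 0))  # False -> produces the sm_120 error above
print(supports_sm(cu128_archs, 12, 0))  # True  -> RTX 5090 is usable
```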

Inference Script


import argparse
import os
import sys
import time

import cv2
import numpy as np
import torch
from PIL import Image

# 1. Force offline mode (set before any Hugging Face code runs)
os.environ["HF_HUB_OFFLINE"] = "1"

# 2. Path setup
sys.path.insert(0, "/root/code/sam3_test/sam3_github/sam3")
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Command-line arguments
parser = argparse.ArgumentParser(description="SAM3 image segmentation")
parser.add_argument("image", help="input image path")
parser.add_argument("--prompt", default="object", help="text prompt (default: object)")
parser.add_argument("--output", default="result.jpg", help="output image path (default: result.jpg)")
args = parser.parse_args()

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_path = "/root/code/sam3_test/sam3_model/sam3.pt"

# 3. Model preloading (timed)
t0 = time.perf_counter()
model = build_sam3_image_model(
    checkpoint_path=ckpt_path,
    load_from_HF=False,
    device=device
)
print("✅ Model built successfully.")

state_dict = torch.load(ckpt_path, map_location="cpu")
if "model" in state_dict:
    state_dict = state_dict["model"]
model.load_state_dict(state_dict, strict=False)
model.to(device)
model.eval()
t_load = time.perf_counter() - t0
print(f"✅ Model loaded in {t_load:.2f} s")

# 4. Initialize the processor (lower the threshold; the default 0.5 filters out almost all results)
processor = Sam3Processor(model, confidence_threshold=0.05)

# 5. Load the image
image = Image.open(args.image).convert("RGB")
print(f"Image: {args.image}  size={image.size}")

# 6. Image encoding (vision encoder)
t1 = time.perf_counter()
inference_state = processor.set_image(image)
t_encode = time.perf_counter() - t1
print(f"Image encoding time: {t_encode*1000:.1f} ms")

# 7. Text-prompt inference
prompt = args.prompt
print(f"Querying for: '{prompt}'")
t2 = time.perf_counter()
output = processor.set_text_prompt(
    state=inference_state,
    prompt=prompt
)
t_text = time.perf_counter() - t2
print(f"Text detection time: {t_text*1000:.1f} ms")

# 8. Collect results
masks = output["masks"]
scores = output["scores"]
print(f"masks shape: {masks.shape}, dtype: {masks.dtype}")
print(f"scores: {scores}")

if masks.numel() == 0:
    print("⚠️  Text prompt found no targets; trying a full-image box prompt as a sanity check...")
    t3 = time.perf_counter()
    output = processor.add_geometric_prompt(
        box=[0.5, 0.5, 1.0, 1.0],
        label=True,
        state=inference_state,
    )
    t_box = time.perf_counter() - t3
    masks = output["masks"]
    scores = output["scores"]
    print(f"Box prompt result -> masks shape: {masks.shape}, scores: {scores}  ({t_box*1000:.1f} ms)")

# 9. Visualization
vis = np.array(image)
for i, mask in enumerate(masks):
    score_val = scores[i].item() if scores.numel() > 0 else 0.0
    print(f"  mask[{i}] score={score_val:.4f}, shape={mask.shape}")
    if mask.dim() > 2:
        mask = mask.squeeze()

    m = mask.cpu().numpy().astype(bool)
    print(f"  positive pixels: {m.sum()} / {m.size}")
    if score_val < 0.1:
        print("  skipped (score too low)")
        continue
    if m.sum() == 0:
        print("  skipped (empty mask)")
        continue

    # Blend a random color into the masked region
    color = np.random.randint(0, 255, (3,), dtype=np.uint8)
    vis[m] = (vis[m] * 0.5 + color * 0.5).astype(np.uint8)

cv2.imwrite(args.output, cv2.cvtColor(vis, cv2.COLOR_RGB2BGR))
t_total = time.perf_counter() - t1  # excludes model loading (includes visualization and save)
print(f"✅ Done! Result saved to {args.output}")
print(f"Total inference time (encode + detect): {t_total*1000:.1f} ms")
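The visualization loop's filtering rule (drop masks scoring below 0.1 or with no positive pixels) can be isolated as a small helper. A sketch in plain NumPy follows; `filter_masks` is an illustrative name, not part of the SAM3 API:

```python
import numpy as np

def filter_masks(masks, scores, min_score=0.1):
    """Keep only masks above the score threshold that contain at least one positive pixel."""
    kept = []
    for mask, score in zip(masks, scores):
        if score < min_score:
            continue  # corresponds to "skipped (score too low)"
        if not mask.any():
            continue  # corresponds to "skipped (empty mask)"
        kept.append((mask, score))
    return kept

masks = [np.ones((4, 4), bool), np.zeros((4, 4), bool), np.ones((4, 4), bool)]
scores = [0.0849, 0.3552, 0.9781]
print(len(filter_masks(masks, scores)))  # → 1 (only the last mask passes both checks)
```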

Run Results

python sam3_test_preload.py test_scene_1024.jpg --prompt "object" --output object.jpg


(sam3) root@ebm9tkrl:~/code/sam3_test# python sam3_test_preload.py test_scene_1024.jpg --prompt "object" --output object.jpg
/root/code/sam3_test/sam3_github/sam3/sam3/model_builder.py:8: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
✅ Model built successfully.
✅ Model loaded in 10.96 s
Image: test_scene_1024.jpg  size=(1024, 1024)
Image encoding time: 327.2 ms
Querying for: 'object'
Text detection time: 394.8 ms
masks shape: torch.Size([53, 1, 1024, 1024]), dtype: torch.bool
scores: tensor([0.0849, 0.0939, 0.0723, 0.0742, 0.3552, 0.0846, 0.0578, 0.3125, 0.0942,
        0.0552, 0.1710, 0.1374, 0.0574, 0.1886, 0.1119, 0.0551, 0.2212, 0.1630,
        0.1422, 0.1208, 0.2016, 0.0632, 0.2288, 0.1188, 0.0694, 0.0538, 0.1000,
        0.0706, 0.1827, 0.1730, 0.1118, 0.1243, 0.0990, 0.1559, 0.1440, 0.1035,
        0.0789, 0.0746, 0.1141, 0.1205, 0.1125, 0.0927, 0.0549, 0.0550, 0.1415,
        0.0893, 0.0965, 0.0509, 0.1359, 0.1781, 0.0923, 0.1687, 0.1606],
       device='cuda:0')
  mask[0] score=0.0849, shape=torch.Size([1, 1024, 1024])
  positive pixels: 9879 / 1048576
  skipped (score too low)
  mask[1] score=0.0939, shape=torch.Size([1, 1024, 1024])
  positive pixels: 430 / 1048576
  skipped (score too low)
  mask[2] score=0.0723, shape=torch.Size([1, 1024, 1024])
  positive pixels: 8369 / 1048576
  skipped (score too low)
  mask[3] score=0.0742, shape=torch.Size([1, 1024, 1024])
  positive pixels: 254 / 1048576
  skipped (score too low)
  mask[4] score=0.3552, shape=torch.Size([1, 1024, 1024])
  positive pixels: 6833 / 1048576
  mask[5] score=0.0846, shape=torch.Size([1, 1024, 1024])
  positive pixels: 282 / 1048576
  skipped (score too low)
  mask[6] score=0.0578, shape=torch.Size([1, 1024, 1024])
  positive pixels: 11310 / 1048576
  skipped (score too low)
  mask[7] score=0.3125, shape=torch.Size([1, 1024, 1024])
  positive pixels: 929 / 1048576
  mask[8] score=0.0942, shape=torch.Size([1, 1024, 1024])
  positive pixels: 238 / 1048576
  skipped (score too low)
  mask[9] score=0.0552, shape=torch.Size([1, 1024, 1024])
  positive pixels: 1998 / 1048576
  skipped (score too low)
  mask[10] score=0.1710, shape=torch.Size([1, 1024, 1024])
  positive pixels: 1877 / 1048576
  mask[11] score=0.1374, shape=torch.Size([1, 1024, 1024])
  positive pixels: 19673 / 1048576
  mask[12] score=0.0574, shape=torch.Size([1, 1024, 1024])
  positive pixels: 741 / 1048576
  skipped (score too low)
  mask[13] score=0.1886, shape=torch.Size([1, 1024, 1024])
  positive pixels: 9654 / 1048576
  mask[14] score=0.1119, shape=torch.Size([1, 1024, 1024])
  positive pixels: 195 / 1048576
  mask[15] score=0.0551, shape=torch.Size([1, 1024, 1024])
  positive pixels: 144 / 1048576
  skipped (score too low)
  mask[16] score=0.2212, shape=torch.Size([1, 1024, 1024])
  positive pixels: 695 / 1048576
  mask[17] score=0.1630, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2465 / 1048576
  mask[18] score=0.1422, shape=torch.Size([1, 1024, 1024])
  positive pixels: 460 / 1048576
  mask[19] score=0.1208, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2945 / 1048576
  mask[20] score=0.2016, shape=torch.Size([1, 1024, 1024])
  positive pixels: 7558 / 1048576
  mask[21] score=0.0632, shape=torch.Size([1, 1024, 1024])
  positive pixels: 588 / 1048576
  skipped (score too low)
  mask[22] score=0.2288, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2686 / 1048576
  mask[23] score=0.1188, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2184 / 1048576
  mask[24] score=0.0694, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2092 / 1048576
  skipped (score too low)
  mask[25] score=0.0538, shape=torch.Size([1, 1024, 1024])
  positive pixels: 158663 / 1048576
  skipped (score too low)
  mask[26] score=0.1000, shape=torch.Size([1, 1024, 1024])
  positive pixels: 271 / 1048576
  skipped (score too low)
  mask[27] score=0.0706, shape=torch.Size([1, 1024, 1024])
  positive pixels: 785 / 1048576
  skipped (score too low)
  mask[28] score=0.1827, shape=torch.Size([1, 1024, 1024])
  positive pixels: 19244 / 1048576
  mask[29] score=0.1730, shape=torch.Size([1, 1024, 1024])
  positive pixels: 19096 / 1048576
  mask[30] score=0.1118, shape=torch.Size([1, 1024, 1024])
  positive pixels: 122 / 1048576
  mask[31] score=0.1243, shape=torch.Size([1, 1024, 1024])
  positive pixels: 256 / 1048576
  mask[32] score=0.0990, shape=torch.Size([1, 1024, 1024])
  positive pixels: 5375 / 1048576
  skipped (score too low)
  mask[33] score=0.1559, shape=torch.Size([1, 1024, 1024])
  positive pixels: 8390 / 1048576
  mask[34] score=0.1440, shape=torch.Size([1, 1024, 1024])
  positive pixels: 1206 / 1048576
  mask[35] score=0.1035, shape=torch.Size([1, 1024, 1024])
  positive pixels: 1462 / 1048576
  mask[36] score=0.0789, shape=torch.Size([1, 1024, 1024])
  positive pixels: 94 / 1048576
  skipped (score too low)
  mask[37] score=0.0746, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2789 / 1048576
  skipped (score too low)
  mask[38] score=0.1141, shape=torch.Size([1, 1024, 1024])
  positive pixels: 955 / 1048576
  mask[39] score=0.1205, shape=torch.Size([1, 1024, 1024])
  positive pixels: 2747 / 1048576
  mask[40] score=0.1125, shape=torch.Size([1, 1024, 1024])
  positive pixels: 1663 / 1048576
  mask[41] score=0.0927, shape=torch.Size([1, 1024, 1024])
  positive pixels: 4945 / 1048576
  skipped (score too low)
  mask[42] score=0.0549, shape=torch.Size([1, 1024, 1024])
  positive pixels: 104 / 1048576
  skipped (score too low)
  mask[43] score=0.0550, shape=torch.Size([1, 1024, 1024])
  positive pixels: 4836 / 1048576
  skipped (score too low)
  mask[44] score=0.1415, shape=torch.Size([1, 1024, 1024])
  positive pixels: 714 / 1048576
  mask[45] score=0.0893, shape=torch.Size([1, 1024, 1024])
  positive pixels: 155 / 1048576
  skipped (score too low)
  mask[46] score=0.0965, shape=torch.Size([1, 1024, 1024])
  positive pixels: 162 / 1048576
  skipped (score too low)
  mask[47] score=0.0509, shape=torch.Size([1, 1024, 1024])
  positive pixels: 6947 / 1048576
  skipped (score too low)
  mask[48] score=0.1359, shape=torch.Size([1, 1024, 1024])
  positive pixels: 8776 / 1048576
  mask[49] score=0.1781, shape=torch.Size([1, 1024, 1024])
  positive pixels: 31010 / 1048576
  mask[50] score=0.0923, shape=torch.Size([1, 1024, 1024])
  positive pixels: 9588 / 1048576
  skipped (score too low)
  mask[51] score=0.1687, shape=torch.Size([1, 1024, 1024])
  positive pixels: 3370 / 1048576
  mask[52] score=0.1606, shape=torch.Size([1, 1024, 1024])
  positive pixels: 13970 / 1048576
✅ Done! Result saved to out.jpg
Total inference time (encode + detect): 970.5 ms

768x768

python sam3_test_preload.py test_scene_768.jpg --prompt "object" --output object.jpg

RTX5090 + Intel(R) Xeon(R) Platinum 8473C


✅ Model loaded in 10.22 s
Image: test_scene_768.jpg  size=(768, 768)
Image encoding time: 285.0 ms
Querying for: 'object'
Text detection time: 341.8 ms
✅ Done! Result saved to object.jpg
Total inference time (encode + detect): 755.9 ms

RTX3090 + i9-14900KF


✅ Model loaded in 5.94 s
Image: test_scene_768.jpg  size=(768, 768)
Image encoding time: 410.5 ms
Querying for: 'object'
Text detection time: 182.4 ms
✅ Done! Result saved to object_768_3090.jpg
Total inference time (encode + detect): 657.8 ms

Stage           5090 + Xeon 8473C   3090 + i9-14900KF   Winner ✅      Underlying reason
Model loading   10.22 s             5.94 s              🟢 3090 rig    Memory latency + single-core clocks (not a GPU issue)
Image encoding  285 ms              410 ms              🔵 5090 rig    GPU compute + Tensor Cores
Text detection  341 ms              182 ms              🟢 3090 rig    ❗ CPU-dominated + Python scheduling
Total           755 ms              657 ms              🟢 3090 rig    ❗ The pipeline is not GPU-bound
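One quick way to quantify "not GPU-bound" is to subtract the two measured GPU-heavy phases from the end-to-end time; what remains is Python glue, image decoding, and post-processing on the CPU. (This is a lower bound, since the text-detection phase itself also contains CPU work.) Using the 5090 numbers from the 768×768 log above:

```python
# Timings taken from the 768x768 run on the 5090 rig
encode_ms, detect_ms, total_ms = 285.0, 341.8, 755.9
other_ms = total_ms - encode_ms - detect_ms
print(f"CPU-side overhead: {other_ms:.1f} ms ({other_ms / total_ms:.0%} of the pipeline)")
# → CPU-side overhead: 129.1 ms (17% of the pipeline)
```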

How SAM3 Inference Maps onto Hardware

Stage  Term                               Hardware            Corresponding log output           Purpose
A      Image decoding                     CPU                 at script startup                  .jpg → pixel matrix (HWC → tensor)
B      Vision encoding                    GPU                 image encoding time                pixels → high-dimensional image embedding
C      Text processing                    mostly CPU          Querying for: "object"             prompt → tokens (string → numbers)
D      Text encoding                      CPU + light GPU     part of the text detection time    tokens → text embedding
E      Cross-modal fusion                 GPU (low util.)     bulk of the text detection time    relate image embedding to text embedding
F      Mask decoding                      GPU                 masks shape / scores               generate candidate segmentation masks
G      Filtering / ranking                CPU                 skipped (score too low)            filter by score threshold
H      Postprocessing                     CPU                 before saving                      mask → visualization (resize / overlay)
I      Image encoding (save)              CPU                 saving the jpg                     tensor → .jpg file
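To attribute wall-clock time to stages A–I yourself, the ad-hoc `time.perf_counter()` pairs in the script can be wrapped in a reusable context manager. This is a generic timing pattern, not SAM3-specific; the stage bodies below are stand-ins for the real calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name, timings):
    """Record the wall-clock duration of one pipeline stage, in milliseconds."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - t0) * 1000

timings = {}
with stage("B_vision_encode", timings):
    sum(i * i for i in range(100_000))   # stand-in for processor.set_image(...)
with stage("E_cross_modal", timings):
    sum(i * i for i in range(100_000))   # stand-in for processor.set_text_prompt(...)

for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
```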

Car Detection


(sam3) root@ebm9tkrl:~/code/sam3_test# python sam3_test_preload.py car_768x768.jpg --prompt "car" --output object.jpg
✅ Model loaded in 9.18 s
Image: car_768x768.jpg  size=(768, 768)
Image encoding time: 264.6 ms
Querying for: 'car'
Text detection time: 308.8 ms
masks shape: torch.Size([2, 1, 768, 768]), dtype: torch.bool
scores: tensor([0.0643, 0.9781], device='cuda:0')
  mask[0] score=0.0643, shape=torch.Size([1, 768, 768])
  positive pixels: 142272 / 589824
  skipped (score too low)
  mask[1] score=0.9781, shape=torch.Size([1, 768, 768])
  positive pixels: 144442 / 589824
✅ Done! Result saved to object.jpg
Total inference time (encode + detect): 622.1 ms

car_768x768_sam

Enabling FP16 + TensorRT

Comparison

Item              PyTorch (eager)                      FP16 + TensorRT
Inference speed   🟡 moderate (baseline)               🚀 2–5× faster
GPU utilization   unstable                             high and steady
Latency jitter    present (Python + kernel launches)   very low
VRAM usage        baseline                             ↓ 30%–50%
Batch throughput  average                              very strong
Deployment        simple                               complex (engine build)

Why is TensorRT so much faster?

PyTorch inference path


Python
  ↓
PyTorch eager execution
  ↓
ATen operators (many small ops)
  ↓
many separate CUDA kernel launches
  ↓
GPU execution

TensorRT inference path

PyTorch SAM3
  ↓ export
ONNX
  ↓
TensorRT graph optimization
  ↓
layer fusion(Conv + BN + Act)
  ↓
kernel merging (fewer but larger kernels)
  ↓
CUDA execution
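A toy cost model shows why fewer, larger kernels matter: eager mode pays Python dispatch plus a kernel launch for every ATen op, while a fused engine pays launch overhead only for its merged kernels. All numbers below are illustrative assumptions, not measurements, and the model ignores FP16's separate memory-bandwidth gains:

```python
ops = 2000                 # assumed ATen ops per forward pass in eager mode
per_op_overhead_us = 60    # assumed Python dispatch + kernel launch cost per op
fused_kernels = 150        # assumed kernel count after TensorRT layer fusion
launch_us = 10             # assumed launch cost per fused kernel
compute_ms = 300           # pure GPU math, identical in both paths

eager_ms = compute_ms + ops * per_op_overhead_us / 1000
fused_ms = compute_ms + fused_kernels * launch_us / 1000
print(f"eager ≈ {eager_ms:.0f} ms, fused ≈ {fused_ms:.0f} ms")  # → eager ≈ 420 ms, fused ≈ 302 ms
```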

Installing the Environment


pip install tensorrt pycuda onnx onnxsim

sudo apt-get install nvidia-cuda-dev

Exporting the SAM3 Model to ONNX

sam3-onnxruntime

sam3-onnxruntime did not run successfully

SAM3-TensorRT