Deploying Qwen Large Models Locally

ModelScope is an open-source model community initiated by Alibaba (DAMO Academy).

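Item	Qwen3-VL-8B-Instruct	Qwen3-VL-8B-Thinking	Qwen3-VL-8B-Instruct-FP8
Purpose	Instruction-following inference (BF16)	Stronger thinking / logical reasoning	Instruction-following inference (FP8-quantized weights)
Best suited for	Task-guided understanding, classification, description tasks	Complex multi-step, logical, and sequential reasoning	Memory-sensitive deployment (lower VRAM use)
Vision support (VL)	✅ Vision + text	✅ Vision + text	✅ Vision + text
Inference speed	🟢 Medium	🟡 Slightly slower	🔵 Faster
VRAM usage	🟠 Moderate	🔴 Higher	🔵 Lower
Int4 / FP8 suitability	⭐ Int4	⚠ Int4 works less well than on Instruct	⭐ FP8 works well
Inference accuracy	⭐⭐	⭐⭐⭐ (stronger logic and reasoning)	⭐⭐ (on par with Instruct)
Fit for scene classification	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Recommended use	Primary model for classification and task execution	Multi-step reasoning scenarios	Deployment-optimized variant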

RTX 3090

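The RTX 3090 is an Ampere-architecture card (compute capability 8.6). Ampere Tensor Cores do not natively support FP8; native FP8 Tensor Core support arrived with Ada Lovelace (RTX 40 series) and Hopper. On a 3090, an FP8 checkpoint therefore runs as weight-only FP8: weights are stored in FP8 and the matrix math is done in 16-bit, as the vLLM log later in this document confirms.

VRAM headroom: the 3090 has 24 GB of VRAM.
- Running BF16 (Instruct/Thinking): about 20 GB of VRAM is used. The remaining 4 GB can hardly support high-resolution (1080p+) video streams or long sequences of VSLAM history frames, so OOM is likely.
- Running FP8: only about 10 GB is used. That leaves room to run two models on the same card, or to reserve roughly 14 GB for the KV cache and keep a much longer scene context in memory.
Compute throughput: because the 3090 has no FP8 Tensor Cores, FP8 does not double compute throughput over BF16 on this card. The benefit is on the memory side: smaller weights reduce VRAM use and the bytes read per decoded token. For real-time scene classification on a moving robot, latency is a safety concern.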
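The VRAM figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below estimates the memory needed for the weights alone, assuming a nominal 8 billion parameters, 2 bytes per parameter for BF16 and 1 byte for FP8. These are simplifying assumptions: the measured footprint is higher (the vLLM log later in this document reports 10.42 GiB for the FP8 checkpoint), since a real deployment also holds activations, runtime buffers, and the KV cache.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    nominal_params = 8e9   # assumption: the nominal "8B" parameter count
    card_vram_gib = 24     # RTX 3090
    for name, nbytes in (("BF16", 2), ("FP8", 1)):
        weights = weight_memory_gib(nominal_params, nbytes)
        print(f"{name}: weights ~{weights:.1f} GiB, "
              f"~{card_vram_gib - weights:.1f} GiB left before any runtime overhead")
```

Under these assumptions the weights take about 14.9 GiB in BF16 and about 7.5 GiB in FP8.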

Metric	RTX 3090	RTX 4090	RTX 5090
Architecture	Ampere	Ada Lovelace	Blackwell
Release year	2020	2022	2025
CUDA cores	10,496	16,384	21,760
Tensor Cores	328	512	680
VRAM	24 GB GDDR6X	24 GB GDDR6X	32 GB GDDR7
Memory bandwidth	936 GB/s	1010 GB/s	1790 GB/s
Peak FP32 throughput	35.6 TFLOPS	82.6 TFLOPS	104.8 TFLOPS
Process node	8 nm	5 nm	4 nm
Relative inference speed (4090 = 1×)	~0.4-0.6×	1×	~1.2×
Suitable workloads	Small/medium quantized models	Medium/large models	Large models / long context
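The memory bandwidth row matters most for single-stream decoding, which is usually limited by how fast the weights can be read from VRAM. A rough upper bound on decode speed is bandwidth divided by weight size, since every generated token reads all weights once. The sketch below applies that rule of thumb using the bandwidths from the table and the 10.42 GiB weight footprint reported in the vLLM log later in this document. It ignores KV cache reads and kernel overhead, so real numbers will be lower.

```python
GIB_TO_GB = 1024 ** 3 / 1e9


def decode_tokens_per_second_upper_bound(bandwidth_gb_s: float, weight_gib: float) -> float:
    """Bandwidth-bound ceiling: every decoded token reads all weights once."""
    return bandwidth_gb_s / (weight_gib * GIB_TO_GB)


if __name__ == "__main__":
    bandwidth_gb_s = {"RTX 3090": 936, "RTX 4090": 1010, "RTX 5090": 1790}  # from the table
    weight_gib = 10.42  # FP8 checkpoint size as loaded, from the vLLM log
    for gpu, bw in bandwidth_gb_s.items():
        ceiling = decode_tokens_per_second_upper_bound(bw, weight_gib)
        print(f"{gpu}: at most ~{ceiling:.0f} tokens/s per stream")
```

For the RTX 3090 this gives a ceiling of roughly 84 tokens/s. The vLLM run in this document measured 28.49 output tokens/s in eager mode, so there is headroom below the bandwidth limit.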

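Qwen3-VL-8B-Instruct-FP8

Qwen3-VL-8B-Instruct-FP8 is the FP8-quantized release of the Qwen3 vision-language model Qwen3-VL-8B-Instruct. It is derived from the Instruct checkpoint by quantization, not trained separately.

Property	Description
Model name	Qwen3-VL-8B-Instruct-FP8
Version / release date	2026.03.02
Parameters	8B (8 billion)
Type	Vision-language (VL) multimodal understanding
Positioning	Instruction-following model (Instruct), FP8-quantized
Capabilities	- Scene classification (office / corridor / road, etc.) - Lighting assessment (normal / low_light / strong_glare) - Structural feature recognition (walls, ceilings, floor materials) - Joint text + image reasoning
Framework support	Transformers / accelerate / bitsandbytes / PyTorch
License	Apache 2.0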

Goal

Run Qwen3-VL-8B-Instruct-FP8 on an RTX 3090 with image input and JSON output.

Environment setup

Install Conda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda

# Initialize the current shell
eval "$($HOME/miniconda/bin/conda shell.bash hook)"

# Write the shell config so conda is available on the next login
$HOME/miniconda/bin/conda init

# Accept the terms of service
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

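What is Conda

Conda is both an environment manager and a package manager.

Without Conda

The system has a single Python installation.

Project A:
  needs torch 2.1 + CUDA11.8

Project B:
  needs torch 1.13 + CUDA11.3

👉 Direct conflict 💥

With Conda

Environment 1 (qwen_vl):
  Python 3.10 + torch 2.1

Environment 2 (old_project):
  Python 3.8 + torch 1.13

Create a clean environment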

# Create
conda create -n qwen_vl python=3.10 -y

# Activate
conda activate qwen_vl

# Deactivate
conda deactivate

# Remove
conda remove -n qwen_vl --all

After creation

xt@xt-2288H-V5:~/miniconda3/envs/qwen_vl$ tree -L 1
.
├── bin
├── compiler_compat
├── conda-meta
├── etc
├── include
├── lib
├── man
├── share
├── ssl
└── x86_64-conda-linux-gnu

After activation, the shell switches to the new qwen_vl environment

(base) xt@xt-2288H-V5:~/miniconda3/envs/qwen_vl$ conda activate qwen_vl
(qwen_vl) xt@xt-2288H-V5:~/miniconda3/envs/qwen_vl$ python3 --version
Python 3.10.20
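To confirm from inside Python that the interpreter really belongs to the activated environment, a small check like the following can help. It is a minimal sketch: `describe_python_environment` is a hypothetical helper name, and it relies on `conda activate` setting the `CONDA_DEFAULT_ENV` variable and on the environment directory being named `qwen_vl` as above.

```python
import os
import sys


def describe_python_environment() -> dict:
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Root directory of the running interpreter's environment.
        "prefix": sys.prefix,
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # True when the interpreter lives in a directory named qwen_vl.
        "in_qwen_vl": os.path.basename(sys.prefix) == "qwen_vl",
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(f"{key}: {value}")
```

If `in_qwen_vl` is False after activation, `python` is probably resolving to a different installation earlier on `PATH`.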

PyTorch environment

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
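After installing PyTorch it is worth checking that CUDA is visible and which compute capability the GPU reports, because that determines whether FP8 math runs natively. The sketch below assumes that FP8 Tensor Cores require compute capability 8.9 (Ada Lovelace) or newer; an RTX 3090 reports 8.6. `has_native_fp8` and `report_gpu` are hypothetical helper names, and the PyTorch import is guarded so the script also runs where PyTorch is missing.

```python
def has_native_fp8(capability) -> bool:
    """FP8 Tensor Cores are available from compute capability 8.9 (Ada) upward."""
    return tuple(capability) >= (8, 9)


def report_gpu() -> None:
    """Print the first GPU's name, compute capability, and FP8 status."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("PyTorch is installed, but CUDA is not available")
        return
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 6) on an RTX 3090
    mode = "native FP8" if has_native_fp8(capability) else "weight-only FP8 (no FP8 Tensor Cores)"
    print(f"{name}: compute capability {capability[0]}.{capability[1]}, {mode}")


if __name__ == "__main__":
    report_gpu()
```

On a card without native FP8, inference engines fall back to weight-only FP8, which saves memory but does not speed up the matrix math.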

Dependency conflict check and install

pip check

pip install future

Download the model

pip install modelscope

modelscope download --model Qwen/Qwen3-VL-8B-Instruct-FP8

Model directory after download

(qwen_vl) xt@xt-2288H-V5:~/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8$ ls
chat_template.json  configuration.json      model-00001-of-00002.safetensors  model.safetensors.index.json  README.md              tokenizer.json                  vocab.json
config.json         generation_config.json  model-00002-of-00002.safetensors  preprocessor_config.json      tokenizer_config.json  video_preprocessor_config.json

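File / directory	Purpose
config.json	Model architecture config: number of Transformer layers, hidden size, number of attention heads, and other core parameters. Transformers reads it when loading the model.
configuration.json	ModelScope metadata file (for example framework and task type). It is not a copy of `config.json` and does not define the model architecture.
generation_config.json	Generation defaults (inference parameters) such as maximum length, temperature, top-k, and top-p. Transformers consults it when generating text.
model-00001-of-00002.safetensors & model-00002-of-00002.safetensors	Model weight shards (FP8-quantized) holding the network parameters. FP8 quantization substantially reduces VRAM use. `.safetensors` is a safer and faster-loading weight format than pickle-based `.pt`.
model.safetensors.index.json	Shard index that tells the loader which shard contains each tensor, so the shards can be combined into the full model.
tokenizer.json	Tokenizer definition (BPE vocabulary and encoding rules) used to convert text into token IDs. Transformers uses it for text preprocessing.
tokenizer_config.json	Additional tokenizer settings, such as special tokens (for example `<|im_start|>`, `<|im_end|>`, `<|endoftext|>`).
vocab.json	Vocabulary file mapping each token to its integer ID. Normally used together with `tokenizer.json`.
preprocessor_config.json	Image preprocessing config: input image size, normalization, channel order, and so on, used for vision input.
video_preprocessor_config.json	Video preprocessing config for multi-frame/video input: frame sampling, size, channel normalization, and so on.
chat_template.json	Built-in chat template that wraps prompts for Instruct mode, so that instructions and structured answers are formatted the way the model expects.
README.md	Model description, usage instructions, and license information. Read it first for usage caveats.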
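As a quick integrity check after download, the shard index can be read to see how tensors are distributed across the two weight files. This is a minimal sketch that assumes the standard Hugging Face index layout (a `weight_map` object mapping tensor names to shard file names, plus an optional `metadata.total_size`); `summarize_shards` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_shards(model_dir) -> dict:
    """Read model.safetensors.index.json and count the tensors stored in each shard."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    with index_path.open(encoding="utf-8") as handle:
        index = json.load(handle)
    tensors_per_shard = Counter(index["weight_map"].values())
    total_size = index.get("metadata", {}).get("total_size")
    return {"total_size_bytes": total_size, "tensors_per_shard": dict(tensors_per_shard)}


if __name__ == "__main__":
    # Cache location taken from the directory listing above.
    model_dir = Path.home() / ".cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8"
    if (model_dir / "model.safetensors.index.json").exists():
        summary = summarize_shards(model_dir)
        print(f"total_size_bytes: {summary['total_size_bytes']}")
        for shard, count in sorted(summary["tensors_per_shard"].items()):
            present = (model_dir / shard).exists()
            print(f"{shard}: {count} tensors, file present: {present}")
    else:
        print(f"No shard index found under {model_dir}")
```

A shard listed in the index but missing on disk usually points to an interrupted download.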

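Model deployment

Feature	vLLM	SGLang
Deployment complexity	Medium; install with `pip install vllm` (bitsandbytes is not needed for this FP8 checkpoint)	Simple; install directly with pip
Performance	High throughput; strong at multi-GPU / parallel inference	Adequate on a single GPU, weaker under high concurrency
FP8 support	Native support, efficient VRAM use	Supported, depends on internal implementation; fine on a single GPU
Image + text multimodal	Supported; needs AutoProcessor	Supported; easy to integrate for rapid development
Scalability	Excellent; suited to batch / long-sequence robot workloads	Suited to quick validation and rapid edge deployment
Learning curve	Slightly higher; requires understanding the vLLM API	Low; intuitive API

Install the CUDA Toolkit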

wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run

sudo sh cuda_12.2.2_535.104.05_linux.run

# Add the following three lines to ~/.bashrc (for example with: vim ~/.bashrc)

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

source ~/.bashrc

nvcc -V

(base) xt@xt-2288H-V5:~/Qwen3-VL-8B-Instruct-FP8$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Prompt file: prompt.md

你是机器人视觉感知模块。请严格基于图片内容回答，不要凭空假设。

任务：判断场景类型、区域类型、光照条件，并评估是否适合建立稳定 SpatialNode。

可选值：
- scene_type: office | corridor | intersection | road | lobby | unknown
- zone_type: desk_area | doorway | path | open_space | unknown
- lighting_condition: normal | low_light | strong_glare

要求：
1) 先在脑中识别至少 3 个可见视觉证据（例如门、桌子、走廊墙面、地面材质、天花板灯）。
2) 如果确实看不清，再给 unknown；否则不要默认 unknown。
3) 仅输出 JSON，不要输出 markdown 代码围栏。

输出字段：
{
  "scene_classification": {
    "scene_type": "...",
    "zone_type": "...",
    "lighting_condition": "...",
    "confidence": 0.0
  },
  "spatial_node_suitability": true,
  "depth_camera_impact": "...",
  "structural_features": ["...", "..."],
  "chinese_explanation": "...",
  "evidence": ["...", "...", "..."]
}
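Because the prompt fixes the allowed values and the output fields, the model's reply can be checked mechanically before it is used downstream. The sketch below is one possible validator, not part of the original scripts: `validate_scene_json` is a hypothetical helper, the allowed values and field names are copied from the prompt above, and the requirement that `confidence` lies in [0, 1] is an assumption.

```python
import json

# Allowed values copied from prompt.md.
ALLOWED_VALUES = {
    "scene_type": {"office", "corridor", "intersection", "road", "lobby", "unknown"},
    "zone_type": {"desk_area", "doorway", "path", "open_space", "unknown"},
    "lighting_condition": {"normal", "low_light", "strong_glare"},
}
REQUIRED_FIELDS = (
    "scene_classification",
    "spatial_node_suitability",
    "depth_camera_impact",
    "structural_features",
    "chinese_explanation",
    "evidence",
)


def validate_scene_json(raw: str) -> list:
    """Return a list of problems found in the model output; empty means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in data]

    classification = data.get("scene_classification")
    if isinstance(classification, dict):
        for field, allowed in ALLOWED_VALUES.items():
            value = classification.get(field)
            if not isinstance(value, str) or value not in allowed:
                problems.append(f"{field} has unexpected value: {value!r}")
        confidence = classification.get("confidence")
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        # Assumption: confidence is a probability-like score in [0, 1].
        if not is_number or not 0.0 <= confidence <= 1.0:
            problems.append("confidence must be a number in [0, 1]")
    elif "scene_classification" in data:
        problems.append("scene_classification is not an object")

    evidence = data.get("evidence")
    if isinstance(evidence, list) and len(evidence) < 3:
        problems.append("fewer than 3 evidence items")
    return problems
```

A reply that omits the `evidence` field, or that falls back to `unknown` everywhere, is then easy to flag and retry instead of being passed on to the mapping stack.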

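Using the SGLang inference framework

SGLang (lightweight, flexible inference)

# Downloads from the default index often fail
pip install sglang

# Install from the Tsinghua mirror instead
pip install sglang -i https://pypi.tuna.tsinghua.edu.cn/simple

Why is the SGLang install so large?

Compiled kernel libraries (custom kernels): it ships prebuilt CUDA kernels for several GPU architectures, including Ampere. The RTX 3090 has no native FP8 compute, so these kernels serve weight-only FP8 paths on this card.
Multimodal support: Qwen3-VL has to process images and video, and SGLang pulls in attention kernel libraries such as FlashInfer.
Runtime dependencies: it depends on large libraries such as torch and cupy; a complete environment can easily exceed 10-15 GB.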

模型运行脚本

import argparse
import time

from sglang import Engine
from qwen_vl_utils import process_vision_info
from modelscope import AutoProcessor


def main(image_path, prompt_text):
    """
    Run one multimodal inference with Qwen3-VL.

    Args:
        image_path: path to the input image
        prompt_text: user prompt / instruction
    """
    # Path to the model checkpoint
    checkpoint_path = "/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8"

    # Load the pretrained processor
    processor = AutoProcessor.from_pretrained(checkpoint_path)

    # Build the chat message: one image plus the text prompt
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path,
                },
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    # Apply the chat template to get the text prompt the model expects
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load and resize the image; image_patch_size keeps the size aligned to the patch grid
    image_inputs, _ = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size
    )

    # Initialize the inference engine
    llm = Engine(
        model_path=checkpoint_path,
        enable_multimodal=True,      # enable multimodal support
        mem_fraction_static=0.8,     # use 80% of VRAM for the static allocation
        tp_size=1,                   # single-GPU inference
    )

    # Record the start time
    start = time.time()

    # Sampling parameters
    sampling_params = {
        "max_new_tokens": 512,       # generate at most 512 new tokens
        "temperature": 0.7            # sampling temperature (controls output diversity)
    }

    # Run inference with the templated text and the image data
    response = llm.generate(
        prompt=text,
        image_data=image_inputs,
        sampling_params=sampling_params
    )

    # Report the elapsed time and the generated text
    elapsed_time = time.time() - start
    print(f"推理耗时: {elapsed_time:.2f}s")
    print("=" * 60)
    print("模型生成的文本:")
    print("=" * 60)
    print(response["text"])
    print("=" * 60)


# Command-line entry point: image path and prompt are passed as arguments
if __name__ == "__main__":
    # Create the argument parser
    parser = argparse.ArgumentParser(
        description="Qwen3-VL 多模态推理脚本 - 用于场景理解和视觉分析"
    )

    # Required argument: image file path
    parser.add_argument(
        "--image",
        type=str,
        required=True,
        help="输入图像文件的路径（例如：./test_scene.jpg）"
    )

    # Required argument: prompt text
    parser.add_argument(
        "--prompt",
        type=str,
        required=True,
        help="用户提示词/指令文本（可通过 $(cat prompt.txt) 从文件读入）"
    )

    # Parse the command-line arguments
    args = parser.parse_args()

    # Run inference
    main(args.image, args.prompt)

Inference failure

The model failed to recognize the image content.

(qwen_vl) xt@xt-2288H-V5:~/Qwen3-VL-8B-Instruct-FP8$ python qwen_vl_demo.py --image ./test_scene.jpg --prompt-file prompt.md
✓ 图像路径: /home/xt/Qwen3-VL-8B-Instruct-FP8/test_scene.jpg
✓ 图像加载成功: 格式=JPEG, 尺寸=(1280, 1707)
✓ 提示词长度: 1077 字符
✓ 图像已处理: 1 张图像
✓ 首个图像特征类型: Image, shape=None, len=N/A
初始化推理引擎...
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.47s/it]

Capturing batches (bs=1 avail_mem=2.54 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  3.80it/s]
执行推理...
============================================================
推理耗时: 4.96s
============================================================
模型生成的文本:
============================================================
{
  "scene_classification": {
    "scene_type": "unknown",
    "zone_type": "unknown",
    "lighting_condition": "normal",
    "confidence": 0.65
  },
  "spatial_node_suitability": false,
  "depth_camera_impact": "正常光照下深度相机精度受影响较小",
  "structural_features": [],
  "chinese_explanation": "图像中未提供有效视觉信息，无法明确判断场景类型、区域类型及结构特征。建议获取清晰图像以提高识别准确率。"
}

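The vLLM inference framework

vLLM was originally developed by a research team at UC Berkeley's Sky Computing Lab, the successor to RISELab (where Ray originated) and, before that, AMPLab (where Spark originated).

In May 2025 vLLM formally joined the PyTorch Foundation as a hosted project. Its governance is therefore vendor-neutral and it does not belong to any single company.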

conda create -n qwen_vllm_env python=3.10 -y
conda activate qwen_vllm_env

# Slow from the default index; a mirror is recommended
pip install vllm

Tsinghua University mirror (recommended):
https://pypi.tuna.tsinghua.edu.cn/simple

Aliyun mirror:
https://mirrors.aliyun.com/pypi/simple/

Baidu Cloud mirror:
https://mirror.baidu.com/pypi/simple/

pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple

# The Aliyun mirror was the fastest in this test
pip install vllm -i https://mirrors.aliyun.com/pypi/simple/

# Install qwen-vl-utils
pip install qwen-vl-utils -i https://mirrors.aliyun.com/pypi/simple/

# Download the test scripts
curl -L -O -C - --progress-bar http://rk3588image.wif.ink/Qwen3-VL-8B-Instruct-FP8.zip

# Download the model

pip install modelscope
modelscope download --model Qwen/Qwen3-VL-8B-Instruct-FP8 --local_dir ./Qwen3-VL-8B-Instruct-FP8

vLLM inference script

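# -*- coding: utf-8 -*-
"""Qwen3-VL + vLLM multimodal inference script.

Runs single-image inference in a local/offline environment. Goals:
1) accept an image and a prompt from the CLI;
2) build the multimodal input the way the official Qwen3-VL examples do;
3) run generation through vLLM and print the result.

Stability settings kept for the current environment:
- disable the standalone compile path to avoid a known compatibility issue;
- force eager mode to bypass the graph compilation path;
- cap max_model_len to reduce the VRAM needed for the KV cache.
"""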

import argparse
import os
import time

from PIL import Image
import torch
from qwen_vl_utils import process_vision_info
from modelscope import AutoProcessor
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["VLLM_USE_STANDALONE_COMPILE"] = "0"

DEFAULT_MODEL_PATH = "/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8"
DEFAULT_GPU_MEMORY_UTILIZATION = 0.90
DEFAULT_MAX_MODEL_LEN = 16384


def normalize_prompt(text):
    """Clean the prompt text.

    Removes markdown code-fence lines (triple backticks) so the model does not
    treat the fences as part of the instruction.
    """
    lines = text.splitlines()
    cleaned = []
    for line in lines:
        if line.strip().startswith("```"):
            continue
        cleaned.append(line)
    return "\n".join(cleaned).strip()


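def resolve_image_path(image_path):
    """Resolve the image path and run basic checks.

    - A URL is returned unchanged;
    - A local path is checked for existence and opened with PIL, so a corrupt image fails early.
    """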
    if image_path.startswith(("http://", "https://")):
        return image_path

    if not os.path.exists(image_path):
        raise FileNotFoundError(f"图像文件不存在: {image_path}")

    absolute_path = os.path.abspath(image_path)
    with Image.open(absolute_path) as img:
        print(f"✓ 图像加载成功: 路径={absolute_path}, 格式={img.format}, 尺寸={img.size}")
    return absolute_path


def build_messages(image_path, prompt_text):
    """Build the Qwen3-VL chat message structure.

    Qwen3-VL uses a multi-turn chat format; the image and the text both go in the content list.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]


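def prepare_inputs_for_vllm(messages, processor):
    """Convert chat messages into the input dict that vLLM consumes.

    The returned dict has three parts:
    - prompt: the text after applying the chat template;
    - multi_modal_data: image/video data;
    - mm_processor_kwargs: extra multimodal arguments (mostly used for video).
    """
    # Build the final text prompt (chat template plus the generation start marker).
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Process the visual inputs with the official helper.
    # Note: this returns preprocessed image/video objects, not final tokens.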
    image_inputs, video_inputs, video_kwargs = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size,
        return_video_kwargs=True,
        return_video_metadata=True,
    )

    mm_data = {}
    if image_inputs is not None:
        mm_data["image"] = image_inputs
        print(f"✓ 已提取图像输入: {len(image_inputs)} 张")
    if video_inputs is not None:
        mm_data["video"] = video_inputs
        print(f"✓ video_kwargs: {video_kwargs}")

    return {
        "prompt": prompt,
        "multi_modal_data": mm_data,
        "mm_processor_kwargs": video_kwargs,
    }


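def load_prompt(args):
    """Read the prompt according to the command-line arguments.

    Priority:
    - if --prompt-file is given, read from the file;
    - otherwise use the --prompt text;
    - in both cases clean the text and check that it is not empty.
    """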
    if args.prompt_file:
        with open(args.prompt_file, "r", encoding="utf-8") as handle:
            prompt_text = handle.read()
    else:
        prompt_text = args.prompt

    prompt_text = normalize_prompt(prompt_text or "")
    if not prompt_text:
        raise ValueError("提示词为空，请使用 --prompt 或 --prompt-file 提供内容。")
    return prompt_text


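def main(args):
    """Main flow: prepare inputs, initialize the model, run inference, print the result."""
    # 1) Resolve and validate the image up front, so a bad path or corrupt file fails before inference.
    image_path = resolve_image_path(args.image)
    # 2) Read and clean the prompt.
    prompt_text = load_prompt(args)
    # 3) Load the processor, which handles chat templating and visual preprocessing.
    processor = AutoProcessor.from_pretrained(args.model_path)

    messages = build_messages(image_path, prompt_text)
    inputs = [prepare_inputs_for_vllm(messages, processor)]

    # 4) Initialize vLLM.
    # - enforce_eager=True: disable the compile/CUDA graph path for better compatibility.
    # - max_model_len: cap the context length so the KV cache reservation does not cause OOM.
    # - gpu_memory_utilization: fraction of VRAM vLLM may use; a higher value leaves more for the KV cache.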
    llm = LLM(
        model=args.model_path,
        trust_remote_code=True,
        gpu_memory_utilization=args.gpu_memory_utilization,
        enforce_eager=True,
        max_model_len=args.max_model_len,
        tensor_parallel_size=max(torch.cuda.device_count(), 1),
        seed=args.seed,
    )

    # 5) Sampling parameters.
    # temperature=0 gives near-deterministic output, which suits structured extraction tasks.
    sampling_params = SamplingParams(
        temperature=args.temperature,
        max_tokens=args.max_tokens,
        top_k=-1,
        stop_token_ids=[],
    )

    print("=" * 40)
    print(f"模型路径: {args.model_path}")
    print(f"max_model_len: {args.max_model_len}")
    print(f"gpu_memory_utilization: {args.gpu_memory_utilization}")
    print(f"prompt长度: {len(inputs[0]['prompt'])}")
    print("=" * 40)

    # 6) Run inference and measure the elapsed time.
    start_time = time.time()
    outputs = llm.generate(inputs, sampling_params=sampling_params)
    elapsed = time.time() - start_time

    for output in outputs:
        generated_text = output.outputs[0].text
        print("\n" + "=" * 40)
        print(f"推理耗时: {elapsed:.2f}s")
        print("Generated text:")
        print(generated_text)


if __name__ == "__main__":
    # CLI design: defaults are chosen to "just run", and can be tuned to the machine's resources.
    parser = argparse.ArgumentParser(description="Qwen3-VL vLLM 多模态推理脚本")
    parser.add_argument("--image", type=str, required=True, help="输入图像路径或 URL")
    parser.add_argument("--prompt", type=str, default=None, help="直接传入提示词")
    parser.add_argument("--prompt-file", type=str, default=None, help="从文件读取提示词")
    parser.add_argument("--model-path", type=str, default=DEFAULT_MODEL_PATH, help="模型路径")
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=DEFAULT_GPU_MEMORY_UTILIZATION,
        help="vLLM 显存利用率",
    )
    parser.add_argument(
        "--max-model-len",
        type=int,
        default=DEFAULT_MAX_MODEL_LEN,
        help="限制模型上下文长度，避免 KV cache 显存不足",
    )
    parser.add_argument("--max-tokens", type=int, default=1024, help="最大生成 token 数")
    parser.add_argument("--temperature", type=float, default=0.0, help="采样温度")
    parser.add_argument("--seed", type=int, default=0, help="随机种子")

    parsed_args = parser.parse_args()
    if not parsed_args.prompt and not parsed_args.prompt_file:
        parser.error("必须提供 --prompt 或 --prompt-file 之一。")

    main(parsed_args)

Inference result

$ python qwen_vl_vllm_demo.py --image ./test_scene.jpg --prompt-file ./prompt.md
✓ 图像加载成功: 路径=/home/xt/Qwen3-VL-8B-Instruct-FP8/test_scene.jpg, 格式=JPEG, 尺寸=(1280, 1707)
✓ 已提取图像输入: 1 张
INFO 03-25 16:34:30 [utils.py:233] non-default args: {'trust_remote_code': True, 'max_model_len': 16384, 'disable_log_stats': True, 'enforce_eager': True, 'model': '/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8'}
INFO 03-25 16:34:30 [model.py:533] Resolved architecture: Qwen3VLForConditionalGeneration
INFO 03-25 16:34:30 [model.py:1582] Using max model len 16384
INFO 03-25 16:34:31 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-25 16:34:31 [vllm.py:754] Asynchronous scheduling is enabled.
WARNING 03-25 16:34:31 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 03-25 16:34:31 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-25 16:34:31 [vllm.py:964] Cudagraph is disabled under eager mode
INFO 03-25 16:34:31 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=311200) INFO 03-25 16:34:44 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8', speculative_config=None, tokenizer='/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 
'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=311200) INFO 03-25 16:34:45 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.111.157:36131 backend=nccl
(EngineCore pid=311200) INFO 03-25 16:34:45 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=311200) INFO 03-25 16:34:48 [gpu_model_runner.py:4481] Starting to load model /home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8...
(EngineCore pid=311200) INFO 03-25 16:34:49 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=311200) INFO 03-25 16:34:49 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=311200) INFO 03-25 16:34:49 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=311200) WARNING 03-25 16:34:49 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=311200) WARNING 03-25 16:34:49 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=311200) INFO 03-25 16:34:49 [vllm.py:964] Cudagraph is disabled under eager mode
(EngineCore pid=311200) INFO 03-25 16:34:49 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=311200) INFO 03-25 16:34:49 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=311200) INFO 03-25 16:34:49 [flash_attn.py:598] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.61s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.68s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.67s/it]
(EngineCore pid=311200) 
(EngineCore pid=311200) INFO 03-25 16:34:53 [default_loader.py:384] Loading weights took 3.42 seconds
(EngineCore pid=311200) WARNING 03-25 16:34:53 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=311200) INFO 03-25 16:34:54 [gpu_model_runner.py:4566] Model loading took 10.42 GiB memory and 4.114282 seconds
(EngineCore pid=311200) INFO 03-25 16:34:54 [gpu_model_runner.py:5488] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=311200) INFO 03-25 16:35:09 [gpu_worker.py:456] Available KV cache memory: 8.44 GiB
(EngineCore pid=311200) INFO 03-25 16:35:09 [kv_cache_utils.py:1316] GPU KV cache size: 61,456 tokens
(EngineCore pid=311200) INFO 03-25 16:35:09 [kv_cache_utils.py:1321] Maximum concurrency for 16,384 tokens per request: 3.75x
(EngineCore pid=311200) INFO 03-25 16:35:09 [core.py:281] init engine (profile, create kv cache, warmup model) took 15.20 seconds
(EngineCore pid=311200) WARNING 03-25 16:35:09 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=311200) WARNING 03-25 16:35:09 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=311200) INFO 03-25 16:35:09 [vllm.py:964] Cudagraph is disabled under eager mode
INFO 03-25 16:35:09 [llm.py:391] Supported tasks: ['generate']
========================================
模型路径: /home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct-FP8
max_model_len: 16384
gpu_memory_utilization: 0.9
prompt长度: 813
========================================
Rendering prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.98s/it]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.37s/it, est. speed input: 442.30 toks/s, output: 28.49 toks/s]

========================================
推理耗时: 8.36s
Generated text:
{
  "scene_classification": {
    "scene_type": "corridor",
    "zone_type": "path",
    "lighting_condition": "normal",
    "confidence": 0.98
  },
  "spatial_node_suitability": true,
  "depth_camera_impact": "low",
  "structural_features": ["glass partition walls", "wooden laminate flooring", "ceiling-mounted fluorescent lights"],
  "chinese_explanation": "该场景为办公室走廊，地面平整，照明充足，结构清晰，适合建立稳定SpatialNode。",
  "evidence": ["走廊地面为木纹地板", "天花板有嵌入式照明灯具", "左侧有玻璃隔断墙和金属货架"]
}
(EngineCore pid=311200) INFO 03-25 16:35:18 [core.py:1201] Shutdown initiated (timeout=0)
(EngineCore pid=311200) INFO 03-25 16:35:18 [core.py:1224] Shutdown complete
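The KV cache figures in the log above ("Available KV cache memory: 8.44 GiB", "GPU KV cache size: 61,456 tokens", "Maximum concurrency for 16,384 tokens per request: 3.75x") can be reproduced by hand. The sketch below assumes the language model has 36 layers, 8 KV heads, and a head dimension of 128, with a 16-bit KV cache. These architecture values are assumptions that should be checked against `config.json`; only the three log numbers come from the run.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_capacity_tokens(kv_memory_gib: float, bytes_per_token: int) -> float:
    """How many tokens fit into the given amount of KV cache memory."""
    return kv_memory_gib * GIB / bytes_per_token


if __name__ == "__main__":
    # Assumed architecture values; verify against config.json before relying on them.
    per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128, dtype_bytes=2)
    capacity = kv_capacity_tokens(8.44, per_token)   # 8.44 GiB from the log
    concurrency = 61456 / 16384                       # cache size / max_model_len, both from the log
    print(f"KV cache per token: {per_token} bytes ({per_token / 1024:.0f} KiB)")
    print(f"Estimated capacity: ~{capacity:,.0f} tokens")
    print(f"Max concurrency at full context: {concurrency:.2f}x")
```

Under these assumptions each token costs 144 KiB of cache, and the estimated capacity lands within a few tokens of the 61,456 that vLLM reports. The same arithmetic shows why lowering `max_model_len` raises the number of full-length requests that fit at once.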

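Why did vLLM succeed where SGLang failed?

The root cause was not isolated in this test, so the two points below are unverified hypotheses. The runs were also not a controlled comparison: the SGLang script above samples with temperature 0.7, while the vLLM script defaults to 0.0.

Vision Transformer (ViT) complexity: the vision side of Qwen3-VL uses dynamic resolution. vLLM's multimodal input pipeline may handle image tokens for this model more robustly than the SGLang version used here.

Kernel paths: SGLang and vLLM select different kernels and backends. A less-tested kernel path for this FP8 checkpoint on Ampere could explain the degraded output.

Model preload test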

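# -*- coding: utf-8 -*-
"""Qwen3-VL + vLLM multimodal inference script (preloaded-runtime variant).

Runs single-image inference in a local/offline environment. Goals:
1) accept an image and a prompt from the CLI;
2) build the multimodal input the way the official Qwen3-VL examples do;
3) initialize the processor and the vLLM engine once, warm them up, and reuse them for each request.

Stability settings kept for the current environment:
- disable the standalone compile path to avoid a known compatibility issue;
- force eager mode to bypass the graph compilation path;
- cap max_model_len to reduce the VRAM needed for the KV cache.
"""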

import argparse
import gc
import os
import time

from PIL import Image
import torch
from qwen_vl_utils import process_vision_info
from modelscope import AutoProcessor
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["VLLM_USE_STANDALONE_COMPILE"] = "0"

DEFAULT_MODEL_PATH = "/home/xt/.cache/modelscope/hub/models/Qwen/Qwen3-VL-4B-Instruct-FP8"
# The 4B model defaults to more aggressive low-latency settings, so long contexts and long outputs do not slow it down.
DEFAULT_GPU_MEMORY_UTILIZATION = 0.92
DEFAULT_MAX_MODEL_LEN = 8192
DEFAULT_MAX_TOKENS = 128


def normalize_prompt(text):
    """Clean the prompt text.

    Removes markdown code-fence lines (triple backticks) so the model does not
    treat the fences as part of the instruction.
    """
    lines = text.splitlines()
    cleaned = []
    for line in lines:
        if line.strip().startswith("```"):
            continue
        cleaned.append(line)
    return "\n".join(cleaned).strip()


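def resolve_image_path(image_path):
    """Resolve the image path and run basic checks.

    - A URL is returned unchanged;
    - A local path is checked for existence and opened with PIL, so a corrupt image fails early.
    """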
    if image_path.startswith(("http://", "https://")):
        return image_path

    if not os.path.exists(image_path):
        raise FileNotFoundError(f"图像文件不存在: {image_path}")

    absolute_path = os.path.abspath(image_path)
    with Image.open(absolute_path) as img:
        print(f"✓ 图像加载成功: 路径={absolute_path}, 格式={img.format}, 尺寸={img.size}")
    return absolute_path


def build_messages(image_path, prompt_text):
    """Build the Qwen3-VL chat message structure.

    Qwen3-VL uses a multi-turn chat format; the image and the text both go in the content list.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]


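def prepare_inputs_for_vllm(messages, processor):
    """Convert chat messages into the input dict that vLLM consumes.

    The returned dict has three parts:
    - prompt: the text after applying the chat template;
    - multi_modal_data: image/video data;
    - mm_processor_kwargs: extra multimodal arguments (mostly used for video).
    """
    # Build the final text prompt (chat template plus the generation start marker).
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Process the visual inputs with the official helper.
    # Note: this returns preprocessed image/video objects, not final tokens.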
    image_inputs, video_inputs, video_kwargs = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size,
        return_video_kwargs=True,
    )

    mm_data = {}
    if image_inputs is not None:
        mm_data["image"] = image_inputs
        print(f"✓ 已提取图像输入: {len(image_inputs)} 张")
    if video_inputs is not None:
        mm_data["video"] = video_inputs
        print(f"✓ video_kwargs: {video_kwargs}")

    return {
        "prompt": prompt,
        "multi_modal_data": mm_data,
        "mm_processor_kwargs": video_kwargs,
    }


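def load_prompt(args):
    """Read the prompt according to the command-line arguments.

    Priority:
    - if --prompt-file is given, read from the file;
    - otherwise use the --prompt text;
    - in both cases clean the text and check that it is not empty.
    """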
    if args.prompt_file:
        with open(args.prompt_file, "r", encoding="utf-8") as handle:
            prompt_text = handle.read()
    else:
        prompt_text = args.prompt

    prompt_text = normalize_prompt(prompt_text or "")
    if not prompt_text:
        raise ValueError("提示词为空，请使用 --prompt 或 --prompt-file 提供内容。")
    return prompt_text


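def init_runtime(args):
    """Initialize the runtime objects once and keep them resident.

    The returned processor and llm can be reused across requests, so the model is not reloaded each time.
    """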
    load_start = time.time()
    processor = AutoProcessor.from_pretrained(args.model_path)

    # 4) 初始化 vLLM。
    # - enforce_eager=True: 禁用 compile/cudagraph 路径，提升兼容性。
    # - max_model_len: 显式限制上下文长度，避免 KV cache 预留过大导致 OOM。
    # - gpu_memory_utilization: 允许 vLLM 使用更多显存构建 KV cache。
    llm = LLM(
        model=args.model_path,
        trust_remote_code=True,
        gpu_memory_utilization=args.gpu_memory_utilization,
        enforce_eager=True,
        max_model_len=args.max_model_len,
        tensor_parallel_size=max(torch.cuda.device_count(), 1),
        seed=args.seed,
    )

    print("=" * 40)
    print(f"Runtime initialized in {time.time() - load_start:.2f}s")
    print(f"Model path: {args.model_path}")
    print(f"max_model_len: {args.max_model_len}")
    print(f"gpu_memory_utilization: {args.gpu_memory_utilization}")
    print("=" * 40)
    return processor, llm


def run_inference_once(args, processor, llm, image_path, prompt_text):
    """Run a single inference request."""
    messages = build_messages(image_path, prompt_text)
    inputs = [prepare_inputs_for_vllm(messages, processor)]

    # 5) Sampling parameters.
    # temperature=0 gives near-deterministic (greedy) output, which suits structured extraction tasks.
    sampling_params = SamplingParams(
        temperature=args.temperature,
        max_tokens=args.max_tokens,
        top_k=-1,
        stop_token_ids=[],
    )

    print("=" * 40)
    print(f"Prompt length (characters): {len(inputs[0]['prompt'])}")
    print("=" * 40)

    # 6) Run inference and measure the elapsed time.
    start_time = time.time()
    outputs = llm.generate(inputs, sampling_params=sampling_params)
    elapsed = time.time() - start_time

    for output in outputs:
        generated_text = output.outputs[0].text
        print("\n" + "=" * 40)
        print(f"Inference time: {elapsed:.2f}s")
        print("Generated text:")
        print(generated_text)


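def run_warmup(args, processor, llm):
    """Run one very short warm-up request.

    Purpose: trigger lazy weight loading, kernel initialization, and cache setup ahead of time, reducing the latency of the first real request.
    """
    warmup_prompt = "Reply with OK only"
    warmup_sampling = SamplingParams(temperature=0.0, max_tokens=8, top_k=-1, stop_token_ids=[])
    # Resolve the image path the same way main() does before building the message.
    warmup_messages = build_messages(resolve_image_path(args.image), warmup_prompt)
    warmup_inputs = [prepare_inputs_for_vllm(warmup_messages, processor)]
    _ = llm.generate(warmup_inputs, sampling_params=warmup_sampling)
    print("✓ Warm-up done")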


def main(args):
    """Main flow: initialize the runtime, then run one or more inferences."""
    # The first request still needs an image and a prompt, for the actual inference or the warm-up.
    image_path = resolve_image_path(args.image)
    prompt_text = load_prompt(args)

    processor, llm = init_runtime(args)

    try:
        if args.warmup:
            run_warmup(args, processor, llm)

        # Run the request passed on the command line first.
        run_inference_once(args, processor, llm, image_path, prompt_text)

        # Optional interactive mode: keep the process alive and reuse the loaded model.
        if not args.interactive:
            return

        print("\nInteractive mode: run repeated inferences; enter q to quit.")
        while True:
            next_image = input("Image path (or q to quit): ").strip()
            if next_image.lower() in {"q", "quit", "exit"}:
                break
            if not next_image:
                continue

            next_prompt = input("Prompt (leave empty to reuse the startup prompt): ").strip() or prompt_text
            try:
                next_image = resolve_image_path(next_image)
                run_inference_once(args, processor, llm, next_image, next_prompt)
            except Exception as exc:
                print(f"Interactive request failed: {exc}")
    finally:
        # Explicitly drop the vLLM engine and free cached GPU memory.
        # Do not rely on cleanup at process exit: del the objects, then call empty_cache.
        print("\nReleasing GPU memory...")
        del llm
        del processor
        gc.collect()
        torch.cuda.empty_cache()
        print("✓ GPU memory released")


if __name__ == "__main__":
    # CLI design: defaults aim to "just run", while allowing tuning to the machine's resources.
    parser = argparse.ArgumentParser(description="Qwen3-VL vLLM multimodal inference script")
    parser.add_argument("--image", type=str, required=True, help="Input image path or URL")
    parser.add_argument("--prompt", type=str, default=None, help="Prompt text passed directly")
    parser.add_argument("--prompt-file", type=str, default=None, help="Read the prompt from a file")
    parser.add_argument("--model-path", type=str, default=DEFAULT_MODEL_PATH, help="Model path")
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=DEFAULT_GPU_MEMORY_UTILIZATION,
        help="Fraction of GPU memory vLLM may use",
    )
    parser.add_argument(
        "--max-model-len",
        type=int,
        default=DEFAULT_MAX_MODEL_LEN,
        help="Cap the model context length to avoid running out of memory for the KV cache",
    )
    parser.add_argument("--max-tokens", type=int, default=DEFAULT_MAX_TOKENS, help="Maximum number of generated tokens")
    parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument("--warmup", action="store_true", help="Run one short warm-up request after startup")
    parser.add_argument("--interactive", action="store_true", help="Keep the model loaded and enter interactive mode")

    parsed_args = parser.parse_args()
    if not parsed_args.prompt and not parsed_args.prompt_file:
        parser.error("Provide either --prompt or --prompt-file.")

    main(parsed_args)

Tuning parameters to shorten recognition time

python qwen_vl_vllm_preload.py --image ./test_scene_1280x1280.jpg --prompt-file ./prompt.md --warmup --interactive

(Worker_TP0 pid=843340) WARNING 03-26 09:53:24 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

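The RTX 3090 has no native FP8 compute. What runs here is weight-only FP8 (FP8-compressed weights plus the Marlin kernel), which may turn out slower rather than faster, especially for compute-heavy workloads such as an 8B model.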

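Model format	Recommended GPU	Reason	Not recommended
FP32	Any GPU	Best compatibility	❌ Too slow
FP16	3090 / 4090	Native Tensor Core support	none
BF16	A100 / H100	Strong training stability	3090
INT8	3090 / 4090 / A100	Mature inference option	none
FP8	5090 / H100	Native support	❌ 3090 / A100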

Choosing a GPU for FP8

GPU	Architecture	FP8 support	FP8 performance
3090	Ampere	❌	None
A100	Ampere	❌	None
4090	Ada	✅ (FP8 Tensor Cores)	Lower than H100
H100	Hopper	✅🔥	🚀
5090	Blackwell	✅🔥🔥	🚀🚀
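
Whether FP8 runs natively can be predicted from the GPU's CUDA compute capability. The sketch below is a minimal illustration, assuming the thresholds described in vLLM's FP8 documentation (full FP8 on compute capability 8.9 and above, weight-only FP8 through Marlin on Ampere). `fp8_mode` is a hypothetical helper name, not part of vLLM.

```python
def fp8_mode(major: int, minor: int) -> str:
    """Classify FP8 support from a CUDA compute capability (assumed thresholds)."""
    capability = (major, minor)
    if capability >= (8, 9):   # Ada (8.9), Hopper (9.0), Blackwell
        return "native"
    if capability >= (8, 0):   # Ampere: A100 (8.0), RTX 3090 (8.6)
        return "weight-only"
    return "unsupported"


if __name__ == "__main__":
    # Compute capabilities of the cards compared above.
    cards = {"RTX 3090": (8, 6), "A100": (8, 0), "RTX 4090": (8, 9),
             "H100": (9, 0), "RTX 5090": (12, 0)}
    for name, cap in cards.items():
        print(f"{name}: compute capability {cap[0]}.{cap[1]} -> {fp8_mode(*cap)}")
```

On a live machine, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` tuple to pass in. A "weight-only" result corresponds to the Marlin warning shown in the log above.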

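Renting an RTX 5090

Not available for rent on Alibaba Cloud.

GPU compute cloud (算力云)

RTX 5090, single-card rental price: ¥2.39/hour or ¥1762/month
NVIDIA H100, single-card rental price: ¥9.58/hour or ¥6,900/month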
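
The two price points give the number of hours per month above which the monthly plan becomes cheaper than hourly billing. A small sketch using only the prices quoted above; `break_even_hours` is an illustrative helper name.

```python
def break_even_hours(hourly_price: float, monthly_price: float) -> float:
    """Hours of use per month at which hourly billing equals the monthly price."""
    return monthly_price / hourly_price


# (CNY per hour, CNY per month), taken from the listing above.
prices = {
    "RTX 5090": (2.39, 1762.0),
    "H100": (9.58, 6900.0),
}

for gpu, (hourly, monthly) in prices.items():
    hours = break_even_hours(hourly, monthly)
    print(f"{gpu}: monthly plan pays off above {hours:.0f} h/month "
          f"({hours / 24:.1f} days of continuous use)")
```

With these prices the break-even is about 737 hours for the RTX 5090 and about 720 hours for the H100, which is close to a full month of continuous use, so hourly billing is the cheaper choice for intermittent experiments.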

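New RTX 5090 server configuration

Platform	Overall positioning	Estimated 5090 utilization	Inference bottleneck
New machine (Platinum 8473C)	🔥 Full-performance AI platform	≈100%	No obvious bottleneck
Huawei server (C4215R)	⚠️ Mid-to-high end but dated	≈60%	PCIe + CPU scheduling bottleneck
Old machine (E5-2682 v4)	❌ Traditional compute server	≈30%	CPU and PCIe, both bottlenecks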

Monitoring performance


watch -n 1 "nvidia-smi; free -h; top -b -n 1 | head -20"
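
To log GPU load from Python instead of watching the terminal, `nvidia-smi` can be queried in CSV form. The following is a minimal sketch: `parse_gpu_line` and `sample_gpus` are hypothetical helper names, and the sample line passed to the parser is made up for illustration.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def sample_gpus() -> list:
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# The parser can be tried without a GPU on an illustrative line.
print(parse_gpu_line("37, 18432, 32607"))
```

Calling `sample_gpus()` in a loop with `time.sleep(1)` while a benchmark runs gives a utilization trace to compare against the latency numbers. Fields that the driver reports as `[N/A]` would need extra handling, since the parser assumes integers.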

Overall configuration

root@ebm9tkrl:~# hostnamectl
   Static hostname: n/a                             
Transient hostname: ebm9tkrl.vm
         Icon name: computer-vm
           Chassis: vm
        Machine ID: ae56c173389a45fa90d60c3ca9622682
           Boot ID: dbb71e8cbab4454c8c462860838e9fd6
    Virtualization: microsoft
  Operating System: Ubuntu 22.04.2 LTS              
            Kernel: Linux 5.15.0-60-generic
      Architecture: x86-64
   Hardware Vendor: QEMU
    Hardware Model: Standard PC _Q35 + ICH9, 2009_

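This is a cloud virtual machine with an RTX 5090 attached. hostnamectl reports the virtualization as microsoft while the hardware vendor is QEMU, which most likely indicates a QEMU/KVM guest exposing Hyper-V-compatible interfaces; on its own it does not show that the VM runs on Azure.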

CPU

root@ebm9tkrl:~# sudo lscpu
sudo: unable to resolve host ebm9tkrl.vm: Name or service not known
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8473C
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            4200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
                          rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdr
                         and hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi
                         1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_
                         bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdi
                         ri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    32 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; TSX disabled

Memory

root@ebm9tkrl:~# sudo lshw -short -C memory
H/W path          Device     Class          Description
=======================================================
/0/1000                      memory         96GiB System Memory
/0/1000/0                    memory         16GiB DIMM RAM
/0/1000/1                    memory         16GiB DIMM RAM
/0/1000/2                    memory         16GiB DIMM RAM
/0/1000/3                    memory         16GiB DIMM RAM
/0/1000/4                    memory         16GiB DIMM RAM
/0/1000/5                    memory         16GiB DIMM RAM
/0/0                         memory         96KiB BIOS

GPU

root@ebm9tkrl:~# nvidia-smi 
Thu Mar 26 13:56:17 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:09:00.0 Off |                  N/A |
|  0%   25C    P8              4W /  600W |       2MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Inference on the new server

python qwen_vl_vllm_preload_8B.py --image ./test_scene.jpg --prompt-file ./prompt.md --warmup --interactive

# 10-run benchmark
python qwen_vl_vllm_preload_8B_multi.py --image ./test_scene.jpg --prompt-file ./prompt.md --warmup --benchmark --benchmark-runs 10

# Concurrency test
python qwen_vl_vllm_preload_8B_multi.py --image ./test_scene.jpg --prompt-file ./prompt.md --warmup --benchmark --benchmark-runs 10 --parallel-requests 8

Benchmark data: one image, 10 serial inference runs

Image size = (1280, 1707)

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 5090
Mean input preparation time: 0.003s
Mean inference time: 4.196s
Mean output throughput: 36.011 tok/s
Median inference time (P50): 4.202s
Median output throughput (P50): 35.935 tok/s
========================================


========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 3090
Mean input preparation time: 0.003s
Mean inference time: 4.438s
Mean output throughput: 34.319 tok/s
Median inference time (P50): 4.441s
Median output throughput (P50): 34.404 tok/s
========================================

Image resized to a 1280 px long side

Image size = (960, 1280)

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 5090
Mean input preparation time: 0.001s
Mean inference time: 3.362s
Mean output throughput: 44.548 tok/s
Median inference time (P50): 3.247s
Median output throughput (P50): 45.892 tok/s
========================================

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 3090
Mean input preparation time: 0.001s
Mean inference time: 4.375s
Mean output throughput: 34.079 tok/s
Median inference time (P50): 4.344s
Median output throughput (P50): 34.086 tok/s
========================================

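Image resized to a 720 px long side

Image size = (540, 720)

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 5090
Mean input preparation time: 0.001s
Mean inference time: 3.977s
Mean output throughput: 41.654 tok/s
Median inference time (P50): 3.854s
Median output throughput (P50): 42.556 tok/s
========================================

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 3090
Mean input preparation time: 0.001s
Mean inference time: 5.175s
Mean output throughput: 32.513 tok/s
Median inference time (P50): 4.858s
Median output throughput (P50): 34.445 tok/s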
========================================
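
The relative advantage of the RTX 5090 in the serial runs can be read directly off the mean inference times. A short sketch that only reuses the numbers reported above; `speedup` is an illustrative helper.

```python
# Mean inference times in seconds, copied from the benchmark summaries above.
mean_latency = {
    (1280, 1707): {"RTX 5090": 4.196, "RTX 3090": 4.438},
    (960, 1280): {"RTX 5090": 3.362, "RTX 3090": 4.375},
    (540, 720): {"RTX 5090": 3.977, "RTX 3090": 5.175},
}


def speedup(slow: float, fast: float) -> float:
    """How many times faster the `fast` measurement is than the `slow` one."""
    return slow / fast


for size, t in mean_latency.items():
    ratio = speedup(t["RTX 3090"], t["RTX 5090"])
    print(f"{size[0]}x{size[1]}: RTX 5090 is {ratio:.2f}x faster")
```

This gives about 1.06x at 1280x1707 and about 1.30x at the two smaller sizes, so in single-request mode the newer card is far from showing its full headroom.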

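Effect of resolution

Image size	Dominant computation	GPU utilization	Bottleneck	Behavior
1280×1707	Image encoding (Vision Encoder)	⚠️ Medium	Memory bandwidth + large feature maps	5090 advantage not obvious
960×1280	Encoding and decoding balanced	✅ High	GPU compute	⭐ Best performance
540×720	LLM decoding (token generation)	⚠️ Low	Serial token generation	GPU not saturated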
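
The three test sizes are consistent with proportional downscaling of the original 1280×1707 image to a fixed long side. The sketch below reproduces them and adds a very rough image-token estimate. Both helpers are hypothetical, and the 32-pixel token footprint is an assumption (patch size 16 with 2×2 merging) that should be checked against the processor configuration of the model actually deployed.

```python
def fit_long_side(width: int, height: int, long_side: int) -> tuple:
    """Scale (width, height) so the longer side equals `long_side`, keeping aspect ratio."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)


def rough_visual_tokens(width: int, height: int, pixels_per_token: int = 32) -> int:
    """Rough image-token estimate; 32 px per token side is an assumption."""
    return round(width / pixels_per_token) * round(height / pixels_per_token)


original = (1280, 1707)
for target in (1707, 1280, 720):
    w, h = fit_long_side(*original, target)
    print(f"long side {target}: {w}x{h}, roughly {rough_visual_tokens(w, h)} image tokens")
```

Because the token count scales with the pixel area, halving the long side cuts the visual input to roughly a quarter, which is why the vision encoder and prefill cost drop quickly at lower resolutions while decode cost stays the same.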

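Time breakdown

CPU image processing
Data copy to the GPU (PCIe)
Vision Encoder (ViT/CNN)
LLM Prefill (first pass)
LLM Decode (token-by-token generation)
Output handling (CPU)

1280×1707: measured total ≈ 4.2s. The per-stage split below is approximate (the stage values add up to about 4.45s).

┌────────────────────────────────────────┐
│ CPU preprocessing  0.1s                │
├────────────────────────────────────────┤
│ PCIe copy          0.05s               │
├────────────────────────────────────────┤
│ Vision Encoder     2.5s  🔴 largest    │
├────────────────────────────────────────┤
│ LLM Prefill        0.4s                │
├────────────────────────────────────────┤
│ LLM Decode         1.3s  🔴 bottleneck │
├────────────────────────────────────────┤
│ Output handling    0.1s                │
└────────────────────────────────────────┘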

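CPU image processing (~0.05 to 0.15s)

Decode the image (JPEG/PNG)
Resize (for example 1280×1707 → 960×1280)
Normalize (mean/std)
Convert to a PyTorch Tensor

Data copy to the GPU (~0.02 to 0.08s)

Copy the Tensor from RAM to GPU VRAM
Trigger CUDA kernel preparation

Vision Encoder (🔥 largest share: ~1.5 to 2.8s)

Image → ViT / CNN → image feature embedding

Turns the image into vectors the language model can understand

LLM Prefill (~0.3 to 0.5s)

Feed the image embedding into the LLM
Initialize the KV cache
Compute the first attention pass

LLM Decode (🔥 key bottleneck: ~1.0 to 2.5s)

Read the KV cache
Compute attention
Emit the next token

Output handling (CPU) (~0.05 to 0.2s)

tokenizer decode
String concatenation
JSON / API output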
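
The stage times above can be checked by measurement rather than estimation. A minimal sketch of a stage timer follows; the `timed` helper and the stand-in workloads are hypothetical, and in the real script the blocks would wrap image loading, `prepare_inputs_for_vllm(...)` and `llm.generate(...)`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


# Stand-in workloads so the sketch runs anywhere.
with timed("cpu_preprocess"):
    sum(range(100_000))
with timed("generate"):
    time.sleep(0.01)

total = sum(stage_seconds.values())
for stage, seconds in stage_seconds.items():
    print(f"{stage:<16}{seconds:.4f}s  {seconds / total:6.1%}")
```

Timing from outside `llm.generate` only yields the combined cost of the vision encoder, prefill and decode; separating those three requires a profiler or engine-level metrics. When timing individual GPU operations in PyTorch, call `torch.cuda.synchronize()` before reading the clock, since kernels launch asynchronously.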

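RTX 5090 vs RTX 3090

Step	Uses GPU	5090 advantage	Reason	Gain
① CPU image processing	❌	❌ None	Runs on the CPU	0%
② PCIe copy	⚠️	❌ Very small	Limited by the bus	~0 to 5%
③ Vision Encoder	✅	⚠️ Moderate	memory-bound	~5 to 20%
④ LLM Prefill	✅	✅ Large	Highly parallel compute	🚀 1.5x to 3x
⑤ LLM Decode	✅	⚠️ Moderate	Serial dependency	~20 to 40%
⑥ Output handling	❌	❌ None	Runs on the CPU	0%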
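
Combining the per-stage gains with the approximate stage times gives an Amdahl-style estimate of the end-to-end speedup. The sketch below treats the table's percentage ranges as low and high multipliers, which is an interpretation on my part, and reuses the approximate 1280×1707 stage times.

```python
# Approximate serial stage times in seconds (1280x1707 breakdown).
stage_time = {"cpu": 0.1, "pcie": 0.05, "vision": 2.5,
              "prefill": 0.4, "decode": 1.3, "output": 0.1}

# Per-stage speedup ranges (low, high) read from the comparison table.
stage_speedup = {"cpu": (1.0, 1.0), "pcie": (1.0, 1.05), "vision": (1.05, 1.2),
                 "prefill": (1.5, 3.0), "decode": (1.2, 1.4), "output": (1.0, 1.0)}


def overall_speedup(times: dict, speedups: dict) -> float:
    """End-to-end speedup when each stage is accelerated by its own factor."""
    before = sum(times.values())
    after = sum(times[name] / speedups[name] for name in times)
    return before / after


low = overall_speedup(stage_time, {k: v[0] for k, v in stage_speedup.items()})
high = overall_speedup(stage_time, {k: v[1] for k, v in stage_speedup.items()})
print(f"expected end-to-end speedup: {low:.2f}x to {high:.2f}x")
```

With these inputs the estimate is roughly 1.1x to 1.3x. The large prefill gain barely moves the total because prefill is a small share of the request; the slowly improving stages dominate.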

Concurrency test: 8 parallel requests, 10 benchmark runs

python qwen_vl_vllm_preload_8B_conc.py --image ./test_scene.jpg --prompt-file ./prompt.md --warmup --benchmark --benchmark-runs 10 --parallel-requests 8

Image size: 1280x1707

GPU model: NVIDIA GeForce RTX 5090

========================================
Parallel batch size: 8
Total batch inference time: 3.14s
Amortized time per request: 0.39s
Total output tokens in batch: 1197
Total batch throughput: 380.86 tok/s
========================================

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 5090
Parallel requests: 8
Mean input preparation time: 0.003s
Mean batch inference time: 3.090s
Mean amortized time per request: 0.386s
Mean output throughput: 388.299 tok/s
Median batch inference time (P50): 3.023s
Median amortized time per request (P50): 0.378s
Median output throughput (P50): 395.661 tok/s
========================================

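GPU model: NVIDIA GeForce RTX 3090

========================================
Parallel batch size: 8
Total batch inference time: 4.90s
Amortized time per request: 0.61s
Total output tokens in batch: 1261
Total batch throughput: 257.23 tok/s
========================================

========================================
Benchmark summary
GPU model: NVIDIA GeForce RTX 3090
Parallel requests: 8
Mean input preparation time: 0.003s
Mean batch inference time: 5.264s
Mean amortized time per request: 0.658s
Mean output throughput: 240.998 tok/s
Median batch inference time (P50): 4.956s
Median amortized time per request (P50): 0.619s
Median output throughput (P50): 254.727 tok/s
========================================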

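Stage	Serial time	Concurrent (amortized, estimated)	Speedup	Notes
CPU preprocessing	0.1s	0.02s	🚀 ~5x	Amortized across the batch
PCIe copy	0.05s	0.01s	🚀 ~5x	Batched copy
Vision Encoder	2.5s	0.6s	⚠️ ~4x	Improves, but limited
Prefill	0.4s	0.15s	🚀 ~3x	Parallel GPU compute
Decode	1.3s	0.2s	🔥🔥 ~6x to 8x	Largest gain
Output handling	0.1s	0.03s	🚀 ~3x	Amortized on the CPU

Note: concurrency does not make each individual step faster. What it does is:

👉 Keep the GPU busy at all times 👉 Hide waiting time 👉 Turn a serial problem into a parallel one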
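
The amortized figures in the concurrency logs follow from two simple ratios. A short sketch that recomputes them from the reported batch numbers; `batch_metrics` is an illustrative helper.

```python
def batch_metrics(batch_seconds: float, batch_size: int, output_tokens: int) -> dict:
    """Amortized per-request latency and aggregate throughput for one batch."""
    return {
        "per_request_s": batch_seconds / batch_size,
        "tokens_per_s": output_tokens / batch_seconds,
    }


# Single-batch figures reported above (8 parallel requests, 1280x1707 image).
runs = {"RTX 5090": (3.14, 8, 1197), "RTX 3090": (4.90, 8, 1261)}
for gpu, (seconds, size, tokens) in runs.items():
    m = batch_metrics(seconds, size, tokens)
    print(f"{gpu}: {m['per_request_s']:.2f} s/request, {m['tokens_per_s']:.0f} tok/s")

# Throughput gain of batching over the serial runs (mean tok/s from the summaries).
for gpu, serial, batched in [("RTX 5090", 36.011, 388.299), ("RTX 3090", 34.319, 240.998)]:
    print(f"{gpu}: {batched / serial:.1f}x throughput with 8 parallel requests")
```

The amortized time is a throughput measure, not a latency: each of the 8 requests still waits for the whole batch (about 3.1s on the RTX 5090 and 4.9s on the RTX 3090) before its answer is available. The small difference between the recomputed throughput and the logged value comes from the rounded batch time.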
