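MiniMax-M2.7-W8A8 Two-Node DP=2 Deployment
This guide describes how to deploy MiniMax-M2.7-W8A8 on two Ascend servers with 8 NPUs each.
Target hardware: Ascend 910B, two nodes with 16 NPUs in total (TP=8 × DP=2)
Image: `quay.io/ascend/vllm-ascend:v0.18.0rc1`
1. Start the containers
Run the following on both machines, replacing `{container_name}` with the actual value:
docker run -itd -u 0 --ipc=host --privileged \
--name {container_name} \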
--net=host \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--shm-size=1200g \
-e VLLM_USE_MODELSCOPE=True \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/:/home/ \
-v /root/.cache:/root/.cache \
quay.io/ascend/vllm-ascend:v0.18.0rc1 bash
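The `docker run` command above maps several device nodes and driver paths from the host. A quick pre-flight check can catch a missing path before the container is created. The following is a minimal sketch; the helper name `missing_paths` is hypothetical and not part of any Ascend tooling.

```shell
# Print every path from the argument list that does not exist on this host.
# An empty output means all paths are present.
missing_paths() {
    for p in "$@"; do
        [ -e "$p" ] || printf '%s\n' "$p"
    done
}
```

For example, `missing_paths /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc /usr/local/dcmi /usr/local/bin/npu-smi /etc/ascend_install.info` should print nothing on a host where the driver is installed as this guide assumes.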
2. Patch modelslim_config.py (required)
The `modelslim_config.py` in the v0.18.0rc1 image lacks the `MODELSLIM_CONFIG_FILENAME` constant, which causes an ImportError. **Both containers must be patched**; the order does not matter.
2.1 Download the official configuration
Run on each host:
git clone https://gitcode.com/vLLM_Ascend/MiniMax-M2.5-W8A8.git /tmp/minimax25-w8a8
2.2 Replace the file in the container and append the constant
Master container:
docker cp /tmp/minimax25-w8a8/modelslim_config.py {master_container}:/vllm-workspace/vllm-ascend/vllm_ascend/quantization/modelslim_config.py
docker exec {master_container} bash -c \
'echo "MODELSLIM_CONFIG_FILENAME = \"quant_model_description.json\"" >> /vllm-workspace/vllm-ascend/vllm_ascend/quantization/modelslim_config.py'
Worker container:
docker cp /tmp/minimax25-w8a8/modelslim_config.py {worker_container}:/vllm-workspace/vllm-ascend/vllm_ascend/quantization/modelslim_config.py
docker exec {worker_container} bash -c \
'echo "MODELSLIM_CONFIG_FILENAME = \"quant_model_description.json\"" >> /vllm-workspace/vllm-ascend/vllm_ascend/quantization/modelslim_config.py'
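Because the fix appends with `>>`, running it twice leaves two copies of the constant in the file. One way to make the step safe to repeat is to append only when the constant is absent. This is a sketch under that assumption; `ensure_constant` is a hypothetical helper meant to be run inside the container.

```shell
# Append the constant to the given file only if no line already defines it,
# so repeating the fix does not add duplicate lines.
ensure_constant() {
    file=$1
    if ! grep -q '^MODELSLIM_CONFIG_FILENAME *=' "$file"; then
        printf 'MODELSLIM_CONFIG_FILENAME = "quant_model_description.json"\n' >> "$file"
    fi
}
```

Inside the container, it could be called as `ensure_constant /vllm-workspace/vllm-ascend/vllm_ascend/quantization/modelslim_config.py`.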
3. Start vLLM
3.1 Confirm the bond1 interface
Check on both machines:
ip a | grep bond1
The output should show an inet address on `bond1`.
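The address shown on bond1 is needed again when the start commands set `HCCL_IF_IP`. Rather than copying it by hand, it can be extracted from the one-line output format of `ip`. This is a minimal sketch, assuming the private IP is the first IPv4 address on bond1; `parse_ipv4` is a hypothetical helper.

```shell
# Read `ip -4 -o addr show dev <iface>` output on stdin and print the bare
# IPv4 address. In the one-line (-o) format, field 3 is "inet" and field 4
# is ADDRESS/PREFIX.
parse_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}
```

Usage on either host: `ip -4 -o addr show dev bond1 | parse_ipv4`.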
3.2 Startup order
Start the worker first, then start the master about 10 seconds later.
3.3 Start the worker
docker exec -d {worker_container} bash -c '
export HCCL_IF_IP="{worker_private_ip}"
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
nohup vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/MiniMax-M2___7-w8a8-QuaRot \
--served-model-name "MiniMax-M2.7" \
--host 0.0.0.0 --port 8077 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address {master_private_ip} \
--data-parallel-rpc-port 13389 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config "{\"enable_cpu_binding\":true}" \
> /tmp/vllm-worker.log 2>&1 & '
3.4 Start the master (after waiting 10 seconds)
docker exec -d {master_container} bash -c '
export HCCL_IF_IP="{master_private_ip}"
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
nohup vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/MiniMax-M2___7-w8a8-QuaRot \
--served-model-name "MiniMax-M2.7" \
--host 0.0.0.0 --port 8077 \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address {master_private_ip} \
--data-parallel-rpc-port 13389 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config "{\"enable_cpu_binding\":true}" \
> /tmp/vllm-master.log 2>&1 & '
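Comparing the two start commands, only four things differ between the nodes: the `HCCL_IF_IP` value, the `--headless` flag on the worker, the `--data-parallel-start-rank` value, and the log file name. If the commands are kept in one script, the role-specific flags could be isolated as sketched below; `role_args` is a hypothetical helper, not a vLLM feature.

```shell
# Print the vllm serve flags that differ between the two nodes,
# as they appear in sections 3.3 (worker) and 3.4 (master).
role_args() {
    case "$1" in
        worker) echo "--headless --data-parallel-start-rank 1" ;;
        master) echo "--data-parallel-start-rank 0" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

All remaining flags and environment variables are identical on both nodes, so keeping them in one place makes it harder for the two commands to drift apart.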
4. Verification
Wait about 5 minutes, then run the following on either node:
curl http://{master_private_ip}:8077/v1/models
The response should list `MiniMax-M2.7` with `max_model_len: 196608`.
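The 5-minute wait is an estimate, and load time may vary between machines. Polling the endpoint until it answers avoids guessing. The following is a generic retry sketch; `wait_until` is a hypothetical helper, and the attempt count and interval are arbitrary choices.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until 60 10 curl -sf http://{master_private_ip}:8077/v1/models` checks every 10 seconds for up to about 10 minutes and returns non-zero if the service never responds.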
Inference test:
curl --location "http://{master_private_ip}:8077/v1/chat/completions" \
--header "Content-Type: application/json" \
--data '{"model":"MiniMax-M2.7","messages":[{"role":"user","content":"hello"}],"stream":false}'
5. Access the service
Once the service is up, access it through port 8077 on the master node:
http://{master_private_ip}:8077
API endpoints:
- `GET /v1/models`: list the available models
- `POST /v1/chat/completions`: chat
- `POST /v1/completions`: text completion
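Section 4 shows a chat request; the text completion endpoint takes a `prompt` field instead of `messages`. As a rough sketch, the request body could be built as follows. `completion_payload` is a hypothetical helper, and it inserts the prompt verbatim, so the prompt must not contain quotes or backslashes.

```shell
# Build a minimal JSON body for POST /v1/completions.
# $1 = prompt text (no quotes or backslashes), $2 = max_tokens.
completion_payload() {
    printf '{"model":"MiniMax-M2.7","prompt":"%s","max_tokens":%s}\n' "$1" "$2"
}
```

It could then be sent with `curl -s http://{master_private_ip}:8077/v1/completions --header "Content-Type: application/json" --data "$(completion_payload hello 16)"`.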
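6. Recovery after a reboot
6.1 Start the containers
docker start {master_container}
docker start {worker_container}
6.2 Restart vLLM (worker first, then master)
Wait a few seconds, then run the start commands from 3.3 and 3.4 in that order.
6.3 Confirm the processes are running
On the worker:
docker exec {worker_container} bash -c "ps -ef | grep 'vllm serve' | grep -v grep"
On the master:
docker exec {master_container} bash -c "ps -ef | grep 'vllm serve' | grep -v grep"
Each node should show one vllm process.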
7. Troubleshooting
7.1 View the startup logs
Master log:
docker exec {master_container} tail -100 /tmp/vllm-master.log
Worker log:
docker exec {worker_container} tail -100 /tmp/vllm-worker.log
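7.2 Follow the logs in real time
The start commands redirect vLLM output to a file under /tmp, so follow that file rather than `docker logs` (use vllm-worker.log on the worker):
docker exec {master_container} tail -f /tmp/vllm-master.log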
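7.3 Confirm the port is listening
Run this against the master container; the worker is started with `--headless` and does not serve the API:
docker exec {master_container} bash -c "netstat -tlnp | grep 8077"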
7.4 Confirm the NPU processes
docker exec {container_name} bash -c "npu-smi info | grep 'Process id'"
Each node should show 8 processes (TP=8).
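7.5 Confirm HCCL communication
Run the following against each container. It is only a smoke test that `torch.distributed` imports cleanly inside the container; it does not exercise traffic between the nodes:
docker exec {container_name} bash -c "HCCL_INFO=1 python -c 'import torch.distributed as dist; print(dist.is_initialized())'"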
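7.6 Common errors
ImportError: MODELSLIM_CONFIG_FILENAME
→ modelslim_config.py has not been patched; see section 2
Connection refused on port 8077
→ the vllm process is not running; check the logs
HCCL timeout / DP Coordinator timeout
→ check that the two nodes can reach each other over bond1, and confirm that the worker was started before the master
Worker exits immediately after starting
→ check that `--data-parallel-address` is set to the master's private IP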
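The error-to-hint mapping above can be turned into a rough log scanner. This is only a sketch: the matched strings are taken from the error names listed in this section, the exact wording in real vLLM logs may differ, and `diagnose` is a hypothetical helper.

```shell
# Read a log on stdin and print a hint for each known error string found.
# Patterns follow the error names in section 7.6; real log wording may differ.
diagnose() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'MODELSLIM_CONFIG_FILENAME'; then
        echo "modelslim_config.py is not patched; redo section 2"
    fi
    if printf '%s\n' "$log" | grep -qi 'connection refused'; then
        echo "vllm process is not listening; check that it started"
    fi
    if printf '%s\n' "$log" | grep -qi 'timeout'; then
        echo "check bond1 connectivity and the worker-first start order"
    fi
}
```

For example: `docker exec {master_container} tail -100 /tmp/vllm-master.log | diagnose`. An empty output means none of the listed patterns appeared in the log tail.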