[AI] ragflow: document initialization on multiple 4090 GPUs fails with NCCL Error 2: unhandled system error
Fault
18:53:19 Task has been received.
18:53:26 Page(1~100000001): Text extraction finished.
18:53:40 Page(1~100000001): Image extraction finished
18:53:41 Page(1~100000001): Generate 69 chunks
18:53:42 Page(1~100000001): [ERROR]Generate embedding error:NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
18:53:42 [ERROR][Exception]: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
Solution
1. Adjust the Docker container startup parameters (core step)
Modify the ragflow-server configuration in docker-compose.yml and add the following parameters:
services:
  ragflow-server:
    image: infiniflow/ragflow:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Newly added settings
    ipc: host
    shm_size: 8g
    environment:
      - NCCL_DEBUG=INFO
      - CUDA_VISIBLE_DEVICES=0,1 # Optional: limit which GPUs are used
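Key parameters:
ipc: host lets the container share the host's IPC namespace, which resolves the NCCL multi-GPU communication problem.
shm_size: 8g increases the shared memory size (the default of 64 MB is not enough).
CUDA_VISIBLE_DEVICES is optional and restricts the container to specific GPUs for testing.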
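To confirm that the shared memory and environment settings actually reached the container, a small check can be run inside it (for example through docker exec). The following is only a minimal sketch: the helper names parse_size, shm_total_bytes and is_shm_sufficient are made up for illustration, and the 8g target simply mirrors the value used in this post.

```python
import os
import shutil

# Single-letter suffixes only; this is a simplification of Docker's size syntax.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size string such as '8g' or '64m' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


def is_shm_sufficient(total_bytes, required="8g"):
    """True if the shared memory size meets the required size."""
    return total_bytes >= parse_size(required)


if __name__ == "__main__":
    if os.path.exists("/dev/shm"):
        total = shm_total_bytes()
        print(f"/dev/shm total: {total / 1024**2:.0f} MiB")
        print("meets 8g target:", is_shm_sufficient(total))
    else:
        print("/dev/shm not found on this system")
    # Show the NCCL/CUDA variables that the compose file is expected to set.
    for name in ("NCCL_DEBUG", "CUDA_VISIBLE_DEVICES"):
        print(f"{name} = {os.environ.get(name, '<unset>')}")
```

When ipc: host is set, the container most likely sees the host's /dev/shm, so the size reported here may reflect the host rather than the shm_size value.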
The complete docker-compose-gpu.yml after the change:
# The RAGFlow team do not actively maintain docker-compose-gpu.yml, so use them at your own risk.
# However, you are welcome to file a pull request to improve it.
include:
  - ./docker-compose-base.yml
services:
  ragflow:
    depends_on:
      mysql:
        condition: service_healthy
    image: ${RAGFLOW_IMAGE}
    container_name: ragflow-server
    ports:
      - ${SVR_HTTP_PORT}:9380
      - 80:80
      - 443:443
    volumes:
      - ./ragflow-logs:/ragflow/logs
      - ./nginx/ragflow.conf:/etc/nginx/conf.d/ragflow.conf
      - ./nginx/proxy.conf:/etc/nginx/proxy.conf
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    env_file: .env
    # Newly added settings to fix the multi-GPU communication fault "NCCL Error 2: unhandled system error"
    ipc: host
    shm_size: 8g
    environment:
      - TZ=${TIMEZONE}
      - HF_ENDPOINT=${HF_ENDPOINT}
      - MACOS=${MACOS}
      - NCCL_DEBUG=INFO
    networks:
      - ragflow
    restart: on-failure
    # https://docs.docker.com/engine/daemon/prometheus/#create-a-prometheus-configuration
    # If you're using Docker Desktop, the --add-host flag is optional. This flag makes sure that the host's internal IP gets exposed to the Prometheus container.
    extra_hosts:
      - "host.docker.internal:host-gateway"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
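With this change, the NCCL Error 2: unhandled system error fault is resolved.
However, not all of the GPUs are actually being put to work yet; I will look into that later.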
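As a starting point for that follow-up, per-GPU utilization can be sampled while a document is being embedded to see which cards stay idle. This is only a rough sketch, not something tested in this setup: it assumes nvidia-smi is on the PATH and reports plain integers for every field. The helper names and the 5% idle threshold are arbitrary choices for illustration.

```python
import subprocess

# Standard nvidia-smi query flags; memory.used is reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, mem' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append(
            {"index": int(index), "util_percent": int(util), "mem_used_mib": int(mem)}
        )
    return rows


def idle_gpus(rows, util_threshold=5):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [r["index"] for r in rows if r["util_percent"] < util_threshold]


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    try:
        rows = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not query nvidia-smi:", exc)
    else:
        for row in rows:
            print(row)
        print("GPUs that look idle:", idle_gpus(rows))
```

A single sample can be misleading, so running this several times during the embedding step gives a better picture of whether the work is really spread across cards.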