[AI] ragflow with multiple 4090 GPUs: document initialization fails with NCCL Error 2: unhandled system error


hkNaruto · Published 2025-02-21 12:36:20

Problem


18:53:19 Task has been received.
18:53:26 Page(1~100000001): Text extraction finished.
18:53:40 Page(1~100000001): Image extraction finished
18:53:41 Page(1~100000001): Generate 69 chunks
18:53:42 Page(1~100000001): [ERROR]Generate embedding error:NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
18:53:42 [ERROR][Exception]: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
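
When the task log is longer than the excerpt above, the failing step can be picked out with a few lines of Python. This is only a minimal sketch: `find_errors` and `mentions_nccl` are hypothetical helper names, and the sample text is copied from the log above.

```python
def find_errors(log_text: str) -> list[str]:
    """Return the log lines that carry an [ERROR] marker, in original order."""
    return [line.strip() for line in log_text.splitlines() if "[ERROR]" in line]


def mentions_nccl(line: str) -> bool:
    """True if the line refers to NCCL (the multi-GPU communication library)."""
    return "NCCL" in line


# Sample taken from the task log in this post.
SAMPLE_LOG = """\
18:53:41 Page(1~100000001): Generate 69 chunks
18:53:42 Page(1~100000001): [ERROR]Generate embedding error:NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
18:53:42 [ERROR][Exception]: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
"""

if __name__ == "__main__":
    for line in find_errors(SAMPLE_LOG):
        tag = "NCCL" if mentions_nccl(line) else "other"
        print(f"[{tag}] {line}")
```

Both error lines point at NCCL during embedding generation, which is why the fix below targets multi-GPU communication inside the container.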


Solution

1. Adjust the Docker container startup parameters (key step)

Edit the ragflow-server configuration in docker-compose.yml and add the following parameters:


services:
  ragflow-server:
    image: infiniflow/ragflow:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Add the following settings
    ipc: host
    shm_size: 8g
    environment:
      - NCCL_DEBUG=INFO
      - CUDA_VISIBLE_DEVICES=0,1  # optional: restrict which GPUs are used

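Key parameters:

ipc: host: lets the container share the host's IPC namespace, which resolves the NCCL multi-GPU communication problem.
shm_size: 8g: increases the shared memory size (the default of 64 MB is not enough).
NCCL_DEBUG=INFO: prints detailed NCCL logs, as the error message itself suggests.
CUDA_VISIBLE_DEVICES: optional; restricts the container to specific GPUs for testing.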
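
To confirm that the shared memory setting took effect, the size of /dev/shm can be checked from inside the container. The following is a minimal sketch using only the Python standard library; `parse_size`, `shm_is_large_enough` and `report` are hypothetical helper names, and the 8g threshold simply mirrors the value above. Note that with ipc: host the container most likely sees the host's /dev/shm, so the reported size may reflect the host rather than shm_size.

```python
import shutil

GIB = 1024 ** 3


def parse_size(text: str) -> int:
    """Convert a Docker-style size such as '8g' or '64m' into bytes."""
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # a bare number is treated as bytes


def shm_is_large_enough(total_bytes: int, wanted: str = "8g") -> bool:
    """True if the shared memory filesystem is at least `wanted` in size."""
    return total_bytes >= parse_size(wanted)


def report(path: str = "/dev/shm", wanted: str = "8g") -> str:
    """Describe the size of the shared memory mount at `path`."""
    total = shutil.disk_usage(path).total
    status = "OK" if shm_is_large_enough(total, wanted) else "TOO SMALL"
    return f"{path}: {total / GIB:.2f} GiB ({status}, wanted >= {wanted})"


if __name__ == "__main__":
    try:
        print(report())
    except OSError as exc:  # e.g. /dev/shm does not exist on this system
        print(f"could not inspect /dev/shm: {exc}")
```

One way to run it is to copy the script into the container and execute it there, for example with docker exec against the ragflow-server container.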

The modified docker-compose-gpu.yml:

# The RAGFlow team do not actively maintain docker-compose-gpu.yml, so use them at your own risk.
# However, you are welcome to file a pull request to improve it.
include:
  - ./docker-compose-base.yml

services:
  ragflow:
    depends_on:
      mysql:
        condition: service_healthy
    image: ${RAGFLOW_IMAGE}
    container_name: ragflow-server
    ports:
      - ${SVR_HTTP_PORT}:9380
      - 80:80
      - 443:443
    volumes:
      - ./ragflow-logs:/ragflow/logs
      - ./nginx/ragflow.conf:/etc/nginx/conf.d/ragflow.conf
      - ./nginx/proxy.conf:/etc/nginx/proxy.conf
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    env_file: .env
    # Added settings below to fix the multi-GPU communication failure NCCL Error 2: unhandled system error
    ipc: host
    shm_size: 8g
    environment:
      - TZ=${TIMEZONE}
      - HF_ENDPOINT=${HF_ENDPOINT}
      - MACOS=${MACOS}
      - NCCL_DEBUG=INFO
    networks:
      - ragflow
    restart: on-failure
    # https://docs.docker.com/engine/daemon/prometheus/#create-a-prometheus-configuration
    # If you're using Docker Desktop, the --add-host flag is optional. This flag makes sure that the host's internal IP gets exposed to the Prometheus container.
    extra_hosts:
      - "host.docker.internal:host-gateway"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
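
After recreating the container, it can help to confirm that the environment variables actually reached the process. The sketch below is an assumption-laden illustration, not part of RAGFlow: `visible_gpus` and `summarize` are hypothetical helper names, and the script only reads environment variables without touching the GPUs.

```python
import os
from typing import Mapping, Optional


def visible_gpus(env: Optional[Mapping[str, str]] = None) -> Optional[list[str]]:
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    Entries are kept as strings because the variable may hold indices or UUIDs.
    None means the variable is unset, so no restriction is applied.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def summarize(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the settings this post relies on into one dictionary."""
    env = os.environ if env is None else env
    return {
        "nccl_debug": env.get("NCCL_DEBUG", "unset"),
        "gpus": visible_gpus(env),
    }


if __name__ == "__main__":
    print(summarize())
```

Inside a container started from the file above, `nccl_debug` should read INFO; `gpus` stays None unless CUDA_VISIBLE_DEVICES was also set.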

With these changes, the NCCL Error 2: unhandled system error failure is resolved.