Problem


18:53:19 Task has been received.
18:53:26 Page(1~100000001): Text extraction finished.
18:53:40 Page(1~100000001): Image extraction finished
18:53:41 Page(1~100000001): Generate 69 chunks
18:53:42 Page(1~100000001): [ERROR]Generate embedding error:NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
18:53:42 [ERROR][Exception]: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)

GPU

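Solution

1. Adjust the Docker container startup parameters (core step)

In docker-compose.yml, modify the ragflow-server service configuration and add the following parameters: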


services:
  ragflow-server:
    image: infiniflow/ragflow:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Newly added settings
    ipc: host
    shm_size: 8g
    environment:
      - NCCL_DEBUG=INFO
      - CUDA_VISIBLE_DEVICES=0,1  # Optional: restrict which GPUs are visible

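Key parameters

  • ipc: host: lets the container share the host's IPC namespace, which resolves the NCCL multi-GPU communication problem.
  • shm_size: 8g: increases the shared memory size (the 64 MB default is not enough).
  • CUDA_VISIBLE_DEVICES: optional; restricts the container to specific GPUs for testing.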
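To confirm that the shared memory setting actually took effect, one option is to check the size of /dev/shm from inside the container. The following is only a minimal sketch: the function name shm_report and the 8 GiB target (taken from shm_size: 8g above) are illustrative assumptions, not part of RAGFlow.

```python
import os
import shutil


def shm_report(path="/dev/shm", required_gib=8.0):
    """Return (total_gib, ok) for the filesystem mounted at `path`.

    `required_gib` is a hypothetical target mirroring shm_size: 8g.
    """
    usage = shutil.disk_usage(path)
    total_gib = usage.total / 2**30
    return total_gib, total_gib >= required_gib


if __name__ == "__main__":
    # /dev/shm only exists on Linux; skip quietly elsewhere.
    if os.path.isdir("/dev/shm"):
        total, ok = shm_report()
        print(f"/dev/shm total: {total:.2f} GiB, meets 8 GiB target: {ok}")
    else:
        print("/dev/shm not found on this system")
```

Saved under a name of your choice and run inside the container (for example through docker exec against the ragflow-server container), a reported size close to 64 MB would suggest the new settings were not applied.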

docker-compose-gpu.yml

# The RAGFlow team do not actively maintain docker-compose-gpu.yml, so use them at your own risk.
# However, you are welcome to file a pull request to improve it.
include:
  - ./docker-compose-base.yml

services:
  ragflow:
    depends_on:
      mysql:
        condition: service_healthy
    image: ${RAGFLOW_IMAGE}
    container_name: ragflow-server
    ports:
      - ${SVR_HTTP_PORT}:9380
      - 80:80
      - 443:443
    volumes:
      - ./ragflow-logs:/ragflow/logs
      - ./nginx/ragflow.conf:/etc/nginx/conf.d/ragflow.conf
      - ./nginx/proxy.conf:/etc/nginx/proxy.conf
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    env_file: .env
    # Added to fix the multi-GPU communication failure "NCCL Error 2: unhandled system error"
    ipc: host
    shm_size: 8g
    environment:
      - TZ=${TIMEZONE}
      - HF_ENDPOINT=${HF_ENDPOINT}
      - MACOS=${MACOS}
      - NCCL_DEBUG=INFO
    networks:
      - ragflow
    restart: on-failure
    # https://docs.docker.com/engine/daemon/prometheus/#create-a-prometheus-configuration
    # If you're using Docker Desktop, the --add-host flag is optional. This flag makes sure that the host's internal IP gets exposed to the Prometheus container.
    extra_hosts:
      - "host.docker.internal:host-gateway"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

With this change, the "NCCL Error 2: unhandled system error" failure is resolved.

However, not all of the GPUs are actually being used. I will look into that later.
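As a starting point for that investigation, per-GPU load can be sampled with nvidia-smi while a parsing task runs. The sketch below is an assumption-laden illustration rather than a RAGFlow feature: the helper names and the idle thresholds (5% utilization, 100 MiB memory) are hypothetical, and it assumes every GPU reports numeric values for the queried fields.

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, mem' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append(
            {"index": int(index), "util_pct": int(util), "mem_used_mib": int(mem)}
        )
    return rows


def idle_gpus(rows, util_threshold=5, mem_threshold_mib=100):
    """Return indices of GPUs at or below both (hypothetical) thresholds."""
    return [
        r["index"]
        for r in rows
        if r["util_pct"] <= util_threshold and r["mem_used_mib"] <= mem_threshold_mib
    ]


if __name__ == "__main__":
    try:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi not available: {exc}")
    else:
        print("idle GPUs:", idle_gpus(parse_gpu_csv(out)))
```

If the same GPU indices are reported idle on every sample taken during embedding, the workload is probably being placed on only a subset of the devices.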
