GLM-OCR Deployment Tutorial: Writing Ansible Automation Scripts to Bring a Hundred Servers Online in Batches

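1. Project overview and environment preparation

GLM-OCR is a high-performance OCR model built on a multimodal architecture and designed for complex document understanding. It combines a CogViT visual encoder, a cross-modal connector, and a GLM language decoder, and supports text, table, and formula recognition.

In production, when GLM-OCR has to be deployed to hundreds of servers, doing it by hand one machine at a time is not realistic. An automation tool such as Ansible makes the rollout efficient and consistent.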

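1.1 Prerequisites

Before writing the Ansible deployment scripts, make sure the following are in place:

  • Control node: a Linux server with Ansible 2.9+ installed
  • Target nodes: the server cluster that will run GLM-OCR (Ubuntu 20.04+ / CentOS 7+)
  • Network connectivity: the control node can reach every target node over SSH
  • Privileges: the account used by the control node has sudo rights on the target nodes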
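
The connectivity requirement can be checked before Ansible is involved at all. The following is a minimal sketch, assuming SSH listens on the default port 22; the function names `ssh_port_open` and `unreachable_hosts` are illustrative and not part of any tool used in this tutorial. It only tests that the TCP port accepts connections, not that key-based login or sudo works.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_hosts(hosts, port: int = 22, timeout: float = 3.0):
    """Return the hosts whose SSH port could not be reached."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]
```

Running `unreachable_hosts([...])` from the control node gives a quick list of machines to fix before the first playbook run.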

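1.2 Project layout

Start by planning the directory structure of the Ansible project:

glm-ocr-ansible/
├── inventories/           # Inventory files
│   ├── production        # Production server list
│   └── staging           # Staging server list
├── group_vars/           # Group variables
│   ├── all.yml           # Global variables
│   └── ocr_servers.yml   # Variables specific to the OCR servers
├── roles/                # Roles
│   └── glm-ocr/         # GLM-OCR deployment role
│       ├── tasks/       # Task files
│       ├── handlers/    # Handlers
│       ├── templates/   # Template files
│       └── files/       # Static files
├── playbooks/           # Playbooks
│   └── deploy-ocr.yml   # Main deployment playbook
└── ansible.cfg          # Ansible configuration file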
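
As a convenience, the layout above could be created with a short script. This is only a sketch: the `scaffold` helper is hypothetical, and it creates empty placeholder files that still have to be filled in as described in the following sections.

```python
from pathlib import Path

# Directories and files taken from the layout above.
DIRS = [
    "inventories",
    "group_vars",
    "playbooks",
    "roles/glm-ocr/tasks",
    "roles/glm-ocr/handlers",
    "roles/glm-ocr/templates",
    "roles/glm-ocr/files",
]
FILES = [
    "inventories/production",
    "inventories/staging",
    "group_vars/all.yml",
    "group_vars/ocr_servers.yml",
    "playbooks/deploy-ocr.yml",
    "ansible.cfg",
]


def scaffold(root) -> Path:
    """Create the glm-ocr-ansible skeleton under root and return its path."""
    base = Path(root) / "glm-ocr-ansible"
    for d in DIRS:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        # touch() leaves existing files untouched, so re-running is safe.
        (base / f).touch(exist_ok=True)
    return base
```

Calling `scaffold(".")` in an empty working directory produces the tree; running it again changes nothing.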

2. Ansible inventory and variable configuration

2.1 Define server groups

First, define the server groups in inventories/production:

[ocr_servers]
server01 ansible_host=192.168.1.101 ansible_user=root
server02 ansible_host=192.168.1.102 ansible_user=root
server03 ansible_host=192.168.1.103 ansible_user=root

[ocr_servers:vars]
ansible_ssh_private_key_file=~/.ssh/deploy_key
ansible_python_interpreter=/usr/bin/python3

[gpu_servers:children]
ocr_servers

[all:vars]
env=production
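
Typing a hundred host lines by hand is error-prone, so the inventory can be generated instead. The sketch below assumes the servers have consecutive IPv4 addresses and follow the `serverNN` naming used above; both are assumptions for illustration, and `render_inventory` is a hypothetical helper.

```python
import ipaddress


def render_inventory(first_ip: str, count: int, user: str = "root") -> str:
    """Render an INI inventory for `count` hosts with consecutive addresses."""
    start = ipaddress.IPv4Address(first_ip)
    width = max(2, len(str(count)))  # server01 for small fleets, server001 for 100+
    lines = ["[ocr_servers]"]
    for i in range(count):
        name = f"server{i + 1:0{width}d}"
        lines.append(f"{name} ansible_host={start + i} ansible_user={user}")
    lines += [
        "",
        "[ocr_servers:vars]",
        "ansible_ssh_private_key_file=~/.ssh/deploy_key",
        "ansible_python_interpreter=/usr/bin/python3",
        "",
        "[gpu_servers:children]",
        "ocr_servers",
        "",
        "[all:vars]",
        "env=production",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result of `render_inventory("192.168.1.101", 100)` to inventories/production yields one entry per server in the same format as the hand-written example.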

2.2 Configure global variables

Define the global variables in group_vars/all.yml:

# GLM-OCR deployment variables
glm_ocr_version: "latest"
glm_ocr_port: 7860
glm_ocr_model_path: "/root/ai-models/ZhipuAI/GLM-OCR"
glm_ocr_project_dir: "/root/GLM-OCR"

# System settings
system_user: "root"
system_group: "root"

# Dependency versions
python_version: "3.10.19"
miniconda_version: "py310"
torch_version: "2.9.1"
transformers_version: "5.0.1.dev0"

# Download URLs
miniconda_url: "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"

2.3 Configure group variables

Define the variables specific to the server group in group_vars/ocr_servers.yml:

# GPU settings
gpu_required: true
cuda_version: "11.8"
gpu_memory_min: "8000"  # minimum GPU memory in MiB (about 8 GB)

# Service settings
service_workers: 4
service_timeout: 300
max_memory: "16G"

# Monitoring settings
monitoring_enabled: true
log_retention_days: 30

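3. Writing the Ansible role tasks

3.1 Main task file

Create roles/glm-ocr/tasks/main.yml:

- name: Check system requirements
  block:
    - name: Verify the operating system
      fail:
        msg: "Only Ubuntu 20.04+ and CentOS 7+ are supported"
      when: >-
        ansible_distribution not in ["Ubuntu", "CentOS"]
        or (ansible_distribution == "Ubuntu" and ansible_distribution_version is version("20.04", "<"))
        or (ansible_distribution == "CentOS" and ansible_distribution_major_version | int < 7)

    # Ansible gathers no NVIDIA facts by default, so the memory size is read from nvidia-smi (values in MiB)
    - name: Query GPU memory (only when a GPU is required)
      command: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
      register: gpu_memory_query
      changed_when: false
      failed_when: false
      when: gpu_required | bool

    - name: Fail when no GPU is usable or GPU memory is too small
      fail:
        msg: "GPU unavailable or not enough GPU memory; at least 8 GB is required"
      when: >-
        gpu_required | bool
        and (gpu_memory_query.rc != 0
        or gpu_memory_query.stdout_lines | map('int') | min < gpu_memory_min | int)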

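# Package names differ between distribution families, so each family gets its own list
- name: Install system dependencies (Debian/Ubuntu)
  package:
    name: [wget, curl, git, build-essential, python3-dev, python3-pip]
    state: present
  when: ansible_os_family == "Debian"

- name: Install system dependencies (RHEL/CentOS)
  package:
    name: [wget, curl, git, gcc, gcc-c++, make, python3-devel, python3-pip]
    state: present
  when: ansible_os_family == "RedHat"

- name: Create project directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ system_user }}"
    group: "{{ system_group }}"
    mode: '0755'
  loop:
    - "{{ glm_ocr_project_dir }}"
    - "{{ glm_ocr_model_path }}"
    - "{{ glm_ocr_project_dir }}/logs"

- include_tasks: install_miniconda.yml
- include_tasks: setup_python_env.yml
- include_tasks: deploy_glm_ocr.yml
- include_tasks: configure_service.yml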
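
The GPU memory requirement in the system check comes down to comparing a reported memory size with `gpu_memory_min`. A minimal sketch of that comparison follows, assuming the input is the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one integer in MiB per GPU. The helper names are illustrative only.

```python
def min_gpu_memory_mib(nvidia_smi_output: str) -> int:
    """Return the smallest per-GPU memory size (MiB) found in the output."""
    values = [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]
    if not values:
        raise ValueError("no GPU reported")
    return min(values)


def gpu_meets_requirement(nvidia_smi_output: str, required_mib: int = 8000) -> bool:
    """True if every reported GPU has at least required_mib of memory."""
    try:
        return min_gpu_memory_mib(nvidia_smi_output) >= required_mib
    except ValueError:
        # Empty or non-numeric output is treated as "no usable GPU".
        return False
```

Using the smallest GPU on a host is a conservative choice: a server with one 24 GB card and one 4 GB card would be rejected rather than deployed with an undersized device.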

3.2 Install Miniconda

Create roles/glm-ocr/tasks/install_miniconda.yml:

- name: Check whether Miniconda is already installed
  stat:
    path: "/opt/miniconda3/bin/conda"
  register: miniconda_installed

- name: Download the Miniconda installer
  get_url:
    url: "{{ miniconda_url }}"
    dest: "/tmp/miniconda_install.sh"
    mode: '0755'
  when: not miniconda_installed.stat.exists

- name: Install Miniconda
  command: "bash /tmp/miniconda_install.sh -b -p /opt/miniconda3"
  args:
    creates: "/opt/miniconda3/bin/conda"
  when: not miniconda_installed.stat.exists

- name: Add conda to the user's PATH
  lineinfile:
    # root's home directory is /root, not /home/root
    path: "{{ '/root' if system_user == 'root' else '/home/' ~ system_user }}/.bashrc"
    line: 'export PATH="/opt/miniconda3/bin:$PATH"'
    state: present
    create: yes
  when: not miniconda_installed.stat.exists

3.3 Set up the Python environment

Create roles/glm-ocr/tasks/setup_python_env.yml:

- name: Create the conda environment
  command: "/opt/miniconda3/bin/conda create -n {{ miniconda_version }} python={{ python_version }} -y"
  args:
    creates: "/opt/miniconda3/envs/{{ miniconda_version }}"

- name: Install PyTorch with CUDA support
  # The wheel index is derived from cuda_version (11.8 -> cu118); confirm that it publishes wheels for torch_version
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install torch=={{ torch_version }} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu{{ cuda_version | replace('.', '') }}"
  when: gpu_required

- name: Install the CPU-only build of PyTorch
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install torch=={{ torch_version }} torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu"
  when: not gpu_required

- name: Install Transformers and other dependencies
  # accelerate is required by the device_map argument used in serve_gradio.py
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install git+https://github.com/huggingface/transformers.git@v{{ transformers_version }} gradio accelerate"
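
The two PyTorch tasks differ only in the wheel index they point at. As a sketch of how that index could be derived from the `cuda_version` and `gpu_required` variables instead of being hard-coded, consider the helper below. The function names are hypothetical, and the sketch does not check whether a given index actually publishes wheels for a particular `torch_version`; that has to be verified against the PyTorch download page.

```python
BASE_URL = "https://download.pytorch.org/whl"


def torch_index_url(cuda_version: str, gpu_required: bool = True) -> str:
    """Map a CUDA version such as '11.8' to the matching wheel index URL."""
    if not gpu_required or not cuda_version:
        return f"{BASE_URL}/cpu"
    major, minor = cuda_version.split(".")[:2]
    return f"{BASE_URL}/cu{major}{minor}"


def pip_install_args(torch_version: str, cuda_version: str, gpu_required: bool = True):
    """Build the argument list for `pip` that installs the PyTorch packages."""
    return [
        "install",
        f"torch=={torch_version}",
        "torchvision",
        "torchaudio",
        "--index-url",
        torch_index_url(cuda_version, gpu_required),
    ]
```

Keeping the CUDA version in one variable means a later driver upgrade only requires editing group_vars/ocr_servers.yml.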

3.4 Deploy GLM-OCR

Create roles/glm-ocr/tasks/deploy_glm_ocr.yml:

- name: Clone the GLM-OCR project
  git:
    repo: "https://github.com/THUDM/GLM-OCR.git"
    dest: "{{ glm_ocr_project_dir }}"
    version: "main"
    force: yes

- name: Install the start script
  template:
    src: "start_vllm.sh.j2"
    dest: "{{ glm_ocr_project_dir }}/start_vllm.sh"
    mode: '0755'

- name: Install the service script
  template:
    src: "serve_gradio.py.j2"
    dest: "{{ glm_ocr_project_dir }}/serve_gradio.py"
    mode: '0644'

- name: Check whether the model files exist
  stat:
    path: "{{ glm_ocr_model_path }}/config.json"
  register: model_files

- name: Download the model files if they are missing
  # snapshot_download with local_dir puts config.json directly under glm_ocr_model_path, where the check above looks for it.
  # The repo id is kept from the original; confirm it on the model hub you download from.
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/python -c 'from huggingface_hub import snapshot_download; snapshot_download(repo_id=\"ZhipuAI/GLM-OCR\", local_dir=\"{{ glm_ocr_model_path }}\")'"
  when: not model_files.stat.exists

4. Writing the template files

4.1 Start script template

Create roles/glm-ocr/templates/start_vllm.sh.j2:

#!/bin/bash

# GLM-OCR start script
# Generated on {{ ansible_date_time.iso8601 }}

export PATH="/opt/miniconda3/bin:$PATH"
source activate {{ miniconda_version }}

cd {{ glm_ocr_project_dir }}

# Set environment variables
export PYTHONPATH={{ glm_ocr_project_dir }}:$PYTHONPATH
export TRANSFORMERS_CACHE={{ glm_ocr_model_path }}
export HF_HOME={{ glm_ocr_model_path }}

# Create the log directory
mkdir -p logs

# Build the log file name
LOG_FILE="logs/glm_ocr_$(date +%Y%m%d_%H%M%S).log"

echo "$(date): starting the GLM-OCR service" >> "$LOG_FILE"

# Start the service in the background
nohup /opt/miniconda3/envs/{{ miniconda_version }}/bin/python serve_gradio.py \
    --port {{ glm_ocr_port }} \
    --model-path {{ glm_ocr_model_path }} \
    --workers {{ service_workers }} \
    >> "$LOG_FILE" 2>&1 &

echo "Service started on port {{ glm_ocr_port }}"
echo "Log file: $LOG_FILE"
echo "URL: http://$(hostname -I | awk '{print $1}'):{{ glm_ocr_port }}"

4.2 Service script template

Create roles/glm-ocr/templates/serve_gradio.py.j2:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import logging

import gradio as gr
import torch
# The Auto class is kept from the original; check the model card for the class the checkpoint expects
from transformers import AutoProcessor, AutoModelForCausalLM

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Model path and port (rendered by Ansible)
MODEL_PATH = "{{ glm_ocr_model_path }}"
PORT = {{ glm_ocr_port }}

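_MODEL_CACHE = None  # (processor, model), filled on first use


def load_model():
    """Load the GLM-OCR model once and reuse it for later requests."""
    global _MODEL_CACHE
    if _MODEL_CACHE is not None:
        return _MODEL_CACHE

    logger.info("Loading the GLM-OCR model...")

    try:
        use_cuda = torch.cuda.is_available()
        processor = AutoProcessor.from_pretrained(MODEL_PATH)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            # float16 is mainly useful on GPU; fall back to float32 on CPU
            torch_dtype=torch.float16 if use_cuda else torch.float32,
            device_map="auto" if use_cuda else "cpu",
            trust_remote_code=True
        )
        logger.info("Model loaded")
        _MODEL_CACHE = (processor, model)
        return _MODEL_CACHE
    except Exception as e:
        logger.error(f"Model loading failed: {str(e)}")
        raise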

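def process_image(image, prompt):
    """Handle one image recognition request."""
    try:
        processor, model = load_model()

        # The upload widget passes a file path, so open it as a PIL image first
        if isinstance(image, str):
            from PIL import Image
            image = Image.open(image).convert("RGB")

        # Preprocess the image and move the tensors to the model's device
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

        # Generate the output
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=4096)

        # Decode the result
        result = processor.decode(outputs[0], skip_special_tokens=True)
        return result

    except Exception as e:
        logger.error(f"Processing failed: {str(e)}")
        return f"Processing error: {str(e)}"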

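# Build the Gradio interface
with gr.Blocks(title="GLM-OCR Service") as demo:
    gr.Markdown("# GLM-OCR Document Recognition Service")
    gr.Markdown("Supports text, table, and formula recognition")

    with gr.Row():
        with gr.Column():
            image_input = gr.Image(type="filepath", label="Upload image")
            prompt_input = gr.Dropdown(
                choices=[
                    "Text Recognition:",
                    "Table Recognition:",
                    "Formula Recognition:"
                ],
                value="Text Recognition:",
                label="Recognition type"
            )
            submit_btn = gr.Button("Recognize", variant="primary")

        with gr.Column():
            output_text = gr.Textbox(label="Result", lines=10)

    submit_btn.click(
        fn=process_image,
        inputs=[image_input, prompt_input],
        outputs=output_text
    )

if __name__ == "__main__":
    import argparse

    # The launchers pass --port and --model-path; other flags such as --workers are ignored
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=PORT)
    parser.add_argument("--model-path", default=MODEL_PATH)
    args, _ = parser.parse_known_args()
    MODEL_PATH = args.model_path

    demo.launch(
        server_name="0.0.0.0",
        server_port=args.port,
        share=False
    )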

5. Service configuration and monitoring

5.1 Configure the system service

Create roles/glm-ocr/tasks/configure_service.yml:

- name: Create the systemd unit file
  template:
    src: "glm-ocr.service.j2"
    dest: "/etc/systemd/system/glm-ocr.service"
    mode: '0644'

- name: Reload the systemd configuration
  systemd:
    daemon_reload: yes

- name: Enable and start the GLM-OCR service
  systemd:
    name: glm-ocr
    state: started
    enabled: yes

- name: Configure log rotation
  template:
    src: "glm-ocr.logrotate.j2"
    dest: "/etc/logrotate.d/glm-ocr"
    mode: '0644'

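5.2 Create the service unit (systemd supervision)

Create roles/glm-ocr/templates/glm-ocr.service.j2:

[Unit]
Description=GLM-OCR Document Recognition Service
After=network.target

[Service]
Type=simple
User={{ system_user }}
Group={{ system_group }}
WorkingDirectory={{ glm_ocr_project_dir }}
# systemd does not expand $PATH or $PYTHONPATH inside Environment=, so the values are written out in full
Environment=PATH=/opt/miniconda3/envs/{{ miniconda_version }}/bin:/opt/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=PYTHONPATH={{ glm_ocr_project_dir }}
Environment=TRANSFORMERS_CACHE={{ glm_ocr_model_path }}
Environment=HF_HOME={{ glm_ocr_model_path }}

ExecStart=/opt/miniconda3/envs/{{ miniconda_version }}/bin/python {{ glm_ocr_project_dir }}/serve_gradio.py --port {{ glm_ocr_port }} --model-path {{ glm_ocr_model_path }}

Restart=always
RestartSec=10
# "syslog" is deprecated as an output target in recent systemd releases; "journal" works on old and new versions
StandardOutput=journal
StandardError=journal
SyslogIdentifier=glm-ocr

# Resource limits (MemoryMax was added in systemd 231; older releases such as the systemd 219 in CentOS 7 use MemoryLimit)
MemoryMax={{ max_memory }}
CPUQuota=200%

[Install]
WantedBy=multi-user.target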

5.3 Configure log rotation

Create roles/glm-ocr/templates/glm-ocr.logrotate.j2:

{{ glm_ocr_project_dir }}/logs/*.log {
    daily
    missingok
    rotate {{ log_retention_days }}
    compress
    delaycompress
    notifempty
    copytruncate
    create 644 {{ system_user }} {{ system_group }}
}

6. Main deployment playbook and batch execution

6.1 Write the main deployment playbook

Create playbooks/deploy-ocr.yml:

---
- name: Deploy GLM-OCR to the server cluster
  hosts: ocr_servers
  serial: 10  # deploy to 10 servers per batch
  gather_facts: yes
  become: yes
  become_method: sudo

  vars_files:
    - ../group_vars/all.yml
    - ../group_vars/ocr_servers.yml

  roles:
    - role: ../roles/glm-ocr

  handlers:
    - name: Restart the GLM-OCR service
      systemd:
        name: glm-ocr
        state: restarted

  tasks:
    - name: Wait for the service to start
      wait_for:
        port: "{{ glm_ocr_port }}"
        host: "{{ ansible_host }}"
        delay: 10
        timeout: 300
      delegate_to: localhost
      become: no  # runs on the control node and needs no sudo

    - name: Verify service health
      uri:
        url: "http://{{ ansible_host }}:{{ glm_ocr_port }}"
        method: GET
        timeout: 30
        status_code: 200
      register: health_check
      until: health_check.status == 200
      retries: 10
      delay: 10
      delegate_to: localhost
      become: no  # runs on the control node and needs no sudo

    - name: Report the deployment result
      debug:
        msg: "Server {{ ansible_host }} deployed successfully, service URL: http://{{ ansible_host }}:{{ glm_ocr_port }}"
      when: health_check.status == 200
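
The `serial: 10` setting makes Ansible run the whole play on ten hosts, then on the next ten, and so on, so a broken change stops after the first batch instead of reaching every server. The batching itself can be sketched as follows; `serial_batches` is an illustrative helper and the host names are made up.

```python
def serial_batches(hosts, serial: int):
    """Split hosts into consecutive batches of at most `serial` hosts."""
    if serial < 1:
        raise ValueError("serial must be at least 1")
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# Example with 100 hypothetical hosts and the batch size used in the playbook.
hosts = [f"server{i:03d}" for i in range(1, 101)]
batches = serial_batches(hosts, 10)
print(len(batches), "batches; first batch starts with", batches[0][0])
```

With 100 servers this gives ten batches, which also gives a rough way to estimate total rollout time: the time for one batch multiplied by the number of batches.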

6.2 Run the batch deployment

Use the following commands to run the batch deployment:

# Check connectivity to the servers
ansible -i inventories/production ocr_servers -m ping

# Run the deployment playbook
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml

# Check only, without making changes (dry run)
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml --check

# Limit the run to specific servers
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml --limit server01,server02

# Increase output verbosity
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml -vvv
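
When the rollout is driven from another script or a CI job, the command variants above can be assembled programmatically. The sketch below only builds the argument list and uses the same flags shown above (`-i`, `--limit`, `--check`, `-v`); `build_playbook_command` is a hypothetical helper, not part of Ansible.

```python
def build_playbook_command(inventory, playbook, limit=None, check=False, verbosity=0):
    """Return the ansible-playbook invocation as an argument list."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        cmd += ["--limit", ",".join(limit)]
    if check:
        cmd.append("--check")
    if verbosity > 0:
        cmd.append("-" + "v" * verbosity)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with host names or paths.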

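6.3 Post-deployment verification script

Create the verification script scripts/verify_deployment.sh:

#!/bin/bash

# Deployment verification script
# Usage: scripts/verify_deployment.sh <inventory>
INVENTORY="${1:?usage: $0 <inventory>}"
PORT=7860
LOG_FILE="deployment_verify_$(date +%Y%m%d_%H%M%S).log"

echo "Starting GLM-OCR deployment verification - $(date)" | tee -a "$LOG_FILE"

# Check the service status (read-only; this does not start the service)
ansible -i "$INVENTORY" ocr_servers -m shell -a "systemctl is-active glm-ocr" | tee -a "$LOG_FILE"

# Check that the port is listening
ansible -i "$INVENTORY" ocr_servers -m shell -a "ss -tlnp | grep :$PORT" | tee -a "$LOG_FILE"

# Check the model files
ansible -i "$INVENTORY" ocr_servers -m shell -a "ls -la /root/ai-models/ZhipuAI/GLM-OCR/" | tee -a "$LOG_FILE"

# Test the HTTP endpoint
ansible -i "$INVENTORY" ocr_servers -m uri -a "url=http://localhost:$PORT method=GET return_content=yes" | tee -a "$LOG_FILE"

echo "Verification finished - $(date)" | tee -a "$LOG_FILE"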
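
The HTTP check can also be run directly from the control node against every server at once. The following is a minimal sketch using only the Python standard library; it assumes the service answers a plain GET on its root URL with status 200, as the playbook's health check does, and `is_healthy` and `check_hosts` are illustrative names.

```python
import http.client
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def is_healthy(url: str, retries: int = 3, delay: float = 1.0, timeout: float = 5.0) -> bool:
    """Return True if a GET on url yields HTTP 200 within the given retries."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # connection refused, timeout, or an HTTP error status
        if attempt < retries - 1:
            time.sleep(delay)
    return False


def check_hosts(hosts, port: int = 7860, **kwargs):
    """Check all hosts concurrently and return {host: healthy}."""
    urls = {h: f"http://{h}:{port}" for h in hosts}
    workers = min(32, max(1, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: is_healthy(u, **kwargs), urls.values())
        return dict(zip(urls.keys(), results))
```

Hosts that map to `False` in the result are the ones to inspect with `journalctl -u glm-ocr` or the log files under the project directory.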

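7. Summary and best practices

With the Ansible-based approach described in this article, GLM-OCR can be deployed quickly and consistently across hundreds of servers. Its main advantages are:

Batch deployment efficiency: Ansible runs each task on many hosts in parallel, so a large fleet can be deployed in a short time with far less operator effort.

Configuration consistency: templates and variables keep every server's configuration identical and avoid the drift that manual deployment tends to introduce.

Maintainability: the modular role design and clear directory layout make later maintenance and upgrades straightforward.

Flexibility: settings can be adjusted per environment (production, staging) and per server group to fit different deployment needs.

In practice, validate the deployment scripts in a small test environment first and only then roll out to production. Also check server resource usage regularly and tune the service parameters to the actual load.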

