GLM-OCR Deployment Tutorial: Writing Ansible Automation Scripts to Bring a Hundred Servers Online in Bulk
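1. Project Overview and Environment Preparation
GLM-OCR is a high-performance OCR model built on a multimodal architecture and designed for complex document understanding. It combines a CogViT visual encoder, a cross-modal connector, and a GLM language decoder, and supports text, table, and formula recognition.
In production, when GLM-OCR has to run on hundreds of servers, deploying by hand one machine at a time is not practical. An automation tool such as Ansible makes the rollout efficient and consistent.
1.1 Prerequisites
Before writing the Ansible deployment scripts, make sure the following are in place:
- Control node: a Linux server with Ansible 2.9+ installed
- Target nodes: the server cluster that will run GLM-OCR (Ubuntu 20.04+/CentOS 7+)
- Network connectivity: the control node can reach every target node over SSH
- Privileges: the account used by the control node has sudo rights on the target nodes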
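Before involving Ansible at all, it can help to confirm that the SSH port of every target answers. The following is a minimal sketch of such a pre-check using only the Python standard library; `ssh_port_open` and `unreachable_hosts` are hypothetical helper names, and the check only proves that TCP port 22 accepts connections, not that key-based login or sudo works.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


def unreachable_hosts(hosts, port=22, timeout=3.0, workers=32):
    """Probe all hosts in parallel and return those whose port did not answer."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda h: ssh_port_open(h, port, timeout), hosts))
    return [h for h, ok in zip(hosts, results) if not ok]
```

Calling `unreachable_hosts(["192.168.1.101", "192.168.1.102"])` returns the subset that could not be reached, which can then be fixed or excluded before the first playbook run.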
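1.2 Project structure
Start by laying out the directory structure of the Ansible project:
glm-ocr-ansible/
├── inventories/              # inventory directory
│   ├── production            # production server list
│   └── staging               # staging server list
├── group_vars/               # group variables
│   ├── all.yml               # global variables
│   └── ocr_servers.yml       # variables specific to OCR servers
├── roles/                    # roles
│   └── glm-ocr/              # GLM-OCR deployment role
│       ├── tasks/            # task files
│       ├── handlers/         # handlers
│       ├── templates/        # templates
│       └── files/            # static files
├── playbooks/                # playbooks
│   └── deploy-ocr.yml        # main deployment playbook
└── ansible.cfg               # Ansible configuration file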
2. Ansible Inventory and Variable Configuration
2.1 Defining server groups
First, define the server groups in inventories/production:
[ocr_servers]
server01 ansible_host=192.168.1.101 ansible_user=root
server02 ansible_host=192.168.1.102 ansible_user=root
server03 ansible_host=192.168.1.103 ansible_user=root

[ocr_servers:vars]
ansible_ssh_private_key_file=~/.ssh/deploy_key
ansible_python_interpreter=/usr/bin/python3

[gpu_servers:children]
ocr_servers

[all:vars]
env=production
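With a hundred hosts, typing the inventory by hand is error-prone. The sketch below generates the `[ocr_servers]` section in the same format as the three sample lines. It assumes, purely for illustration, that hosts are numbered sequentially (`server01`, `server02`, ...) on consecutive addresses starting at 192.168.1.101; `build_inventory` and its parameters are hypothetical names, not part of Ansible.

```python
def build_inventory(count, subnet="192.168.1", first_octet=101,
                    user="root", group="ocr_servers"):
    """Return an INI inventory section with `count` sequentially numbered hosts."""
    if count < 1 or first_octet + count - 1 > 254:
        raise ValueError("host range does not fit into a single /24 subnet")
    lines = [f"[{group}]"]
    for i in range(count):
        # server01, server02, ... mapped to consecutive IPv4 addresses
        lines.append(
            f"server{i + 1:02d} ansible_host={subnet}.{first_octet + i} "
            f"ansible_user={user}"
        )
    return "\n".join(lines) + "\n"


# Print a section for 100 hosts; redirect the output into inventories/production
print(build_inventory(100), end="")
```

Real fleets rarely have such regular addressing, in which case the same loop can iterate over an exported host list instead of a numeric range.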
2.2 Global variables
Define the global variables in group_vars/all.yml:
# GLM-OCR deployment variables
glm_ocr_version: "latest"
glm_ocr_port: 7860
glm_ocr_model_path: "/root/ai-models/ZhipuAI/GLM-OCR"
glm_ocr_project_dir: "/root/GLM-OCR"
# System configuration
system_user: "root"
system_group: "root"
# Dependency versions (confirm that each pinned version is actually published before rollout)
python_version: "3.10.19"
miniconda_version: "py310"
torch_version: "2.9.1"
transformers_version: "5.0.1.dev0"
# Download URLs
miniconda_url: "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
2.3 Server group variables
Define the group-specific variables in group_vars/ocr_servers.yml:
# GPU configuration
gpu_required: true
cuda_version: "11.8"
gpu_memory_min: "8000"  # minimum GPU memory in MiB (about 8 GB)
# Service configuration
service_workers: 4
service_timeout: 300
max_memory: "16G"
# Monitoring configuration
monitoring_enabled: true
log_retention_days: 30
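3. Writing the Ansible Role Tasks
3.1 Main task file
Create roles/glm-ocr/tasks/main.yml:
---
- name: Check system requirements
  block:
    - name: Validate operating system
      fail:
        msg: "Only Ubuntu 20.04+ and CentOS 7+ are supported"
      when: >-
        ansible_distribution not in ["Ubuntu", "CentOS"]
        or (ansible_distribution == "Ubuntu" and ansible_distribution_version is version('20.04', '<'))
        or (ansible_distribution == "CentOS" and (ansible_distribution_major_version | int) < 7)

    # Ansible gathers no NVIDIA facts, so the GPU is queried through nvidia-smi
    - name: Query GPU memory (if required)
      command: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
      register: gpu_query
      changed_when: false
      failed_when: false
      when: gpu_required

    - name: Check GPU availability (if required)
      fail:
        msg: "GPU unavailable or insufficient GPU memory; at least 8 GB is required"
      when: >-
        gpu_required and (
          (gpu_query.rc | default(1)) != 0
          or (gpu_query.stdout_lines | default([]) | length) == 0
          or (gpu_query.stdout_lines[0] | int) < (gpu_memory_min | int)
        )

- name: Install common system dependencies
  package:
    name:
      - wget
      - curl
      - git
      - python3-pip
    state: present

# build-essential and python3-dev exist only on Debian/Ubuntu;
# gcc, gcc-c++, make and python3-devel are the assumed CentOS equivalents
- name: Install build dependencies
  package:
    name: "{{ ['build-essential', 'python3-dev'] if ansible_os_family == 'Debian' else ['gcc', 'gcc-c++', 'make', 'python3-devel'] }}"
    state: present

# logs/ is not created here: git cannot clone into a non-empty directory,
# and start_vllm.sh creates logs/ on first start
- name: Create project directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ system_user }}"
    group: "{{ system_group }}"
    mode: '0755'
  loop:
    - "{{ glm_ocr_project_dir }}"
    - "{{ glm_ocr_model_path }}"

- include_tasks: install_miniconda.yml
- include_tasks: setup_python_env.yml
- include_tasks: deploy_glm_ocr.yml
- include_tasks: configure_service.yml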
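The GPU memory threshold can also be checked outside Ansible, which is handy when debugging a single host. The sketch below assumes the memory figures come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one MiB value per GPU. The helper names are hypothetical, and treating the smallest GPU as the limiting one is an assumption of this sketch, not something the article specifies.

```python
def gpu_memory_mib(smi_output):
    """Parse one integer MiB value per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def meets_gpu_requirement(smi_output, min_mib=8000):
    """Return True if at least one GPU is listed and every GPU has min_mib or more."""
    values = gpu_memory_mib(smi_output)
    return bool(values) and min(values) >= min_mib


# Example with captured output from a host with two 24 GiB cards
sample = "24576\n24576\n"
print(meets_gpu_requirement(sample))
```

On a target host the input string would come from `subprocess.run([...], capture_output=True, text=True).stdout`; an empty string (no GPU listed) counts as a failed check.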
3.2 Installing Miniconda
Create roles/glm-ocr/tasks/install_miniconda.yml:
---
- name: Check whether Miniconda is already installed
  stat:
    path: "/opt/miniconda3/bin/conda"
  register: miniconda_installed

- name: Download the Miniconda installer
  get_url:
    url: "{{ miniconda_url }}"
    dest: "/tmp/miniconda_install.sh"
    mode: '0755'
  when: not miniconda_installed.stat.exists

- name: Install Miniconda
  command: "bash /tmp/miniconda_install.sh -b -p /opt/miniconda3"
  args:
    creates: "/opt/miniconda3/bin/conda"
  when: not miniconda_installed.stat.exists

# root's home directory is /root, not /home/root
- name: Add conda to the PATH in .bashrc
  lineinfile:
    path: "{{ '/root' if system_user == 'root' else '/home/' ~ system_user }}/.bashrc"
    line: 'export PATH="/opt/miniconda3/bin:$PATH"'
    state: present
3.3 Setting up the Python environment
Create roles/glm-ocr/tasks/setup_python_env.yml:
---
- name: Create the conda environment
  command: "/opt/miniconda3/bin/conda create -n {{ miniconda_version }} python={{ python_version }} -y"
  args:
    creates: "/opt/miniconda3/envs/{{ miniconda_version }}"

# The wheel index is derived from cuda_version ("11.8" becomes cu118);
# check that the pinned torch version is published on that index
- name: Install PyTorch with CUDA support
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install torch=={{ torch_version }} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu{{ cuda_version | replace('.', '') }}"
  when: gpu_required

- name: Install the CPU build of PyTorch
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install torch=={{ torch_version }} torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu"
  when: not gpu_required

# Assumes a git tag named v{{ transformers_version }} exists in the upstream repository
- name: Install Transformers and other dependencies
  command: "/opt/miniconda3/envs/{{ miniconda_version }}/bin/pip install git+https://github.com/huggingface/transformers.git@v{{ transformers_version }} gradio"
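3.4 Deploying GLM-OCR
Create roles/glm-ocr/tasks/deploy_glm_ocr.yml:
---
- name: Clone the GLM-OCR project
  git:
    repo: "https://github.com/THUDM/GLM-OCR.git"
    dest: "{{ glm_ocr_project_dir }}"
    version: "main"
    force: yes

- name: Install the startup script
  template:
    src: "start_vllm.sh.j2"
    dest: "{{ glm_ocr_project_dir }}/start_vllm.sh"
    mode: '0755'

- name: Install the service script
  template:
    src: "serve_gradio.py.j2"
    dest: "{{ glm_ocr_project_dir }}/serve_gradio.py"
    mode: '0644'

- name: Check whether the model files exist
  stat:
    path: "{{ glm_ocr_model_path }}/config.json"
  register: model_files

# from_pretrained(cache_dir=...) nests files under models--<org>--<name>/snapshots/,
# so config.json would never appear at the path checked above. snapshot_download
# with local_dir writes the files directly into the directory. The repo id is kept
# from the original text and must exist on the hub being used.
- name: Download the model files (if missing)
  command:
    argv:
      - "/opt/miniconda3/envs/{{ miniconda_version }}/bin/python"
      - "-c"
      - "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ZhipuAI/GLM-OCR', local_dir='{{ glm_ocr_model_path }}')"
  when: not model_files.stat.exists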
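The presence of config.json is only a weak signal that a model download finished, since an interrupted transfer can leave the config in place without any weights. A slightly stricter check is sketched below. It assumes the usual Hugging Face layout in which weights are stored as `.safetensors` or `.bin` files next to config.json; `model_dir_ready` is a hypothetical helper name, and the check does not verify checksums.

```python
from pathlib import Path

# Assumed weight file extensions (Hugging Face convention)
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def model_dir_ready(model_dir):
    """Return True if model_dir holds config.json and at least one non-empty weight file."""
    root = Path(model_dir)
    if not (root / "config.json").is_file():
        return False
    return any(
        p.is_file() and p.suffix in WEIGHT_SUFFIXES and p.stat().st_size > 0
        for p in root.iterdir()
    )
```

A script like this could be copied to the targets and called from a `command` task, with its exit status deciding whether the download step runs.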
4. Writing the Templates
4.1 Startup script template
Create roles/glm-ocr/templates/start_vllm.sh.j2:
#!/bin/bash
# GLM-OCR startup script
# Generated automatically at {{ ansible_date_time.iso8601 }}
export PATH="/opt/miniconda3/bin:$PATH"
source activate {{ miniconda_version }}
cd {{ glm_ocr_project_dir }}
# Environment variables
export PYTHONPATH={{ glm_ocr_project_dir }}:$PYTHONPATH
export TRANSFORMERS_CACHE={{ glm_ocr_model_path }}
export HF_HOME={{ glm_ocr_model_path }}
# Create the log directory
mkdir -p logs
# Build the log file name
LOG_FILE="logs/glm_ocr_$(date +%Y%m%d_%H%M%S).log"
echo "$(date): starting GLM-OCR service" >> "$LOG_FILE"
# Start the service
nohup /opt/miniconda3/envs/{{ miniconda_version }}/bin/python serve_gradio.py \
    --port {{ glm_ocr_port }} \
    --model-path {{ glm_ocr_model_path }} \
    --workers {{ service_workers }} \
    >> "$LOG_FILE" 2>&1 &
echo "Service started on port {{ glm_ocr_port }}"
echo "Log file: $LOG_FILE"
echo "URL: http://$(hostname -I | awk '{print $1}'):{{ glm_ocr_port }}"
4.2 Service script template
Create roles/glm-ocr/templates/serve_gradio.py.j2:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
import logging
from functools import lru_cache

import gradio as gr
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Defaults are rendered by Ansible; the flags passed by the startup script
# and the systemd unit override them
parser = argparse.ArgumentParser(description="GLM-OCR Gradio service")
parser.add_argument("--port", type=int, default={{ glm_ocr_port }})
parser.add_argument("--model-path", default="{{ glm_ocr_model_path }}")
parser.add_argument("--workers", type=int, default=1)  # accepted but currently unused
args = parser.parse_args()
MODEL_PATH = args.model_path
PORT = args.port


@lru_cache(maxsize=1)
def load_model():
    """Load the GLM-OCR model once and reuse it for every request."""
    logger.info("Loading GLM-OCR model...")
    try:
        processor = AutoProcessor.from_pretrained(MODEL_PATH)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto" if torch.cuda.is_available() else "cpu",
            trust_remote_code=True
        )
        logger.info("Model loaded")
        return processor, model
    except Exception as e:
        logger.error(f"Model loading failed: {str(e)}")
        raise


def process_image(image, prompt):
    """Handle one recognition request."""
    try:
        processor, model = load_model()
        # Preprocess the image and move the tensors to the model's device
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
        # Generate the output
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=4096)
        # Decode the result
        result = processor.decode(outputs[0], skip_special_tokens=True)
        return result
    except Exception as e:
        logger.error(f"Processing failed: {str(e)}")
        return f"Processing error: {str(e)}"


# Build the Gradio interface
with gr.Blocks(title="GLM-OCR Service") as demo:
    gr.Markdown("# GLM-OCR Document Recognition Service")
    gr.Markdown("Supports text, table, and formula recognition")
    with gr.Row():
        with gr.Column():
            # type="pil" hands the processor an image object rather than a file path
            image_input = gr.Image(type="pil", label="Upload image")
            prompt_input = gr.Dropdown(
                choices=[
                    "Text Recognition:",
                    "Table Recognition:",
                    "Formula Recognition:"
                ],
                value="Text Recognition:",
                label="Recognition type"
            )
            submit_btn = gr.Button("Recognize", variant="primary")
        with gr.Column():
            output_text = gr.Textbox(label="Result", lines=10)
    submit_btn.click(
        fn=process_image,
        inputs=[image_input, prompt_input],
        outputs=output_text
    )

if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",
        server_port=PORT,
        share=False
    )
5. Service Configuration and Monitoring
5.1 Configuring the system service
Create roles/glm-ocr/tasks/configure_service.yml:
---
- name: Create the systemd unit file
  template:
    src: "glm-ocr.service.j2"
    dest: "/etc/systemd/system/glm-ocr.service"
    mode: '0644'

- name: Reload the systemd configuration
  systemd:
    daemon_reload: yes

- name: Enable and start the GLM-OCR service
  systemd:
    name: glm-ocr
    state: started
    enabled: yes

- name: Configure log rotation
  template:
    src: "glm-ocr.logrotate.j2"
    dest: "/etc/logrotate.d/glm-ocr"
    mode: '0644'
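5.2 Service supervision with systemd
Create roles/glm-ocr/templates/glm-ocr.service.j2:
[Unit]
Description=GLM-OCR Document Recognition Service
After=network.target

[Service]
Type=simple
User={{ system_user }}
Group={{ system_group }}
WorkingDirectory={{ glm_ocr_project_dir }}
# systemd does not expand $PATH or $PYTHONPATH in Environment= lines,
# so the values are written out in full
Environment=PATH=/opt/miniconda3/envs/{{ miniconda_version }}/bin:/opt/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=PYTHONPATH={{ glm_ocr_project_dir }}
Environment=TRANSFORMERS_CACHE={{ glm_ocr_model_path }}
Environment=HF_HOME={{ glm_ocr_model_path }}
ExecStart=/opt/miniconda3/envs/{{ miniconda_version }}/bin/python {{ glm_ocr_project_dir }}/serve_gradio.py --port {{ glm_ocr_port }} --model-path {{ glm_ocr_model_path }}
Restart=always
RestartSec=10
# Output goes to the journal (the syslog value is deprecated in recent systemd releases)
StandardOutput=journal
StandardError=journal
SyslogIdentifier=glm-ocr
# Resource limits. MemoryMax is not recognized by older systemd releases
# (for example the one shipped with CentOS 7), where MemoryLimit is the equivalent.
MemoryMax={{ max_memory }}
CPUQuota=200%

[Install]
WantedBy=multi-user.target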
5.3 Configuring log rotation
Create roles/glm-ocr/templates/glm-ocr.logrotate.j2:
{{ glm_ocr_project_dir }}/logs/*.log {
    daily
    missingok
    rotate {{ log_retention_days }}
    compress
    delaycompress
    notifempty
    copytruncate
    create 644 {{ system_user }} {{ system_group }}
}
6. Main Playbook and Batch Execution
6.1 Writing the main deployment playbook
Create playbooks/deploy-ocr.yml:
---
- name: Deploy GLM-OCR to the server cluster
  hosts: ocr_servers
  serial: 10  # deploy to 10 servers per batch
  gather_facts: yes
  become: yes
  become_method: sudo
  vars_files:
    - ../group_vars/all.yml
    - ../group_vars/ocr_servers.yml
  roles:
    - role: ../roles/glm-ocr
  handlers:
    - name: Restart GLM-OCR service
      systemd:
        name: glm-ocr
        state: restarted
  tasks:
    # The checks below run on the control node, so privilege escalation is turned off for them
    - name: Wait for the service to start
      wait_for:
        port: "{{ glm_ocr_port }}"
        host: "{{ ansible_host }}"
        delay: 10
        timeout: 300
      delegate_to: localhost
      become: no

    - name: Verify service health
      uri:
        url: "http://{{ ansible_host }}:{{ glm_ocr_port }}"
        method: GET
        timeout: 30
        status_code: 200
      register: health_check
      until: health_check.status == 200
      retries: 10
      delay: 10
      delegate_to: localhost
      become: no

    - name: Report the deployment result
      debug:
        msg: "Server {{ ansible_host }} deployed successfully, service URL: http://{{ ansible_host }}:{{ glm_ocr_port }}"
      when: health_check.status == 200
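The `serial: 10` setting means the play runs to completion on ten hosts before moving on to the next ten, so a bad change stops after the first batch instead of reaching the whole fleet. A small sketch of how a host list splits into such batches follows; `rolling_batches` is a hypothetical helper for illustration, not an Ansible API.

```python
def rolling_batches(hosts, serial):
    """Split hosts into consecutive batches of at most `serial` entries."""
    if serial < 1:
        raise ValueError("serial must be at least 1")
    hosts = list(hosts)
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# 100 hosts with serial=10 give 10 batches of 10 hosts each
fleet = [f"server{n:02d}" for n in range(1, 101)]
batches = rolling_batches(fleet, 10)
print(len(batches), [len(b) for b in batches])
```

When the host count is not a multiple of the batch size, the last batch is simply smaller: 25 hosts with a batch size of 10 are processed as 10, 10, and 5.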
6.2 Running the batch deployment
Use the following commands to run the deployment:
# Check connectivity to the servers
ansible -i inventories/production ocr_servers -m ping
# Run the deployment playbook
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml
# Dry run without making changes (command-based tasks are skipped in check mode)
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml --check
# Limit the run to specific servers
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml --limit server01,server02
# Increase output verbosity
ansible-playbook -i inventories/production playbooks/deploy-ocr.yml -vvv
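6.3 Post-deployment verification script
Create the verification script scripts/verify_deployment.sh:
#!/bin/bash
# Deployment verification script
# Usage: scripts/verify_deployment.sh <inventory>
INVENTORY="${1:?usage: $0 <inventory>}"
LOG_FILE="deployment_verify_$(date +%Y%m%d_%H%M%S).log"
echo "Verifying GLM-OCR deployment - $(date)" | tee -a "$LOG_FILE"
# Check service status (read-only; the systemd module with state=started would start a stopped service)
ansible -i "$INVENTORY" ocr_servers -m command -a "systemctl is-active glm-ocr" | tee -a "$LOG_FILE"
# Check that the port is listening (ss replaces netstat, which is often not installed)
ansible -i "$INVENTORY" ocr_servers -m shell -a "ss -tln | grep :7860" | tee -a "$LOG_FILE"
# Check the model files
ansible -i "$INVENTORY" ocr_servers -m shell -a "ls -la /root/ai-models/ZhipuAI/GLM-OCR/" | tee -a "$LOG_FILE"
# Test the HTTP endpoint
ansible -i "$INVENTORY" ocr_servers -m uri -a "url=http://localhost:7860 method=GET return_content=yes" | tee -a "$LOG_FILE"
echo "Verification finished - $(date)" | tee -a "$LOG_FILE"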
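The HTTP check can also be run from any machine that reaches the service port, without going through Ansible. The sketch below mirrors the retry pattern used in the playbook (10 attempts, 10 seconds apart) with the Python standard library only; `wait_until_healthy` is a hypothetical helper name, and treating a plain HTTP 200 on the root URL as "healthy" is the same assumption the playbook makes.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, retries=10, delay=10.0, timeout=30.0):
    """Poll url until it answers with HTTP 200; return False after `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused, timeout, or a non-2xx answer: retry below
            pass
        if attempt < retries:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://192.168.1.101:7860")` blocks for at most roughly `retries * (delay + timeout)` seconds before giving up.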
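7. Summary and Best Practices
With the Ansible approach described in this article, GLM-OCR can be deployed quickly and consistently across hundreds of servers. Its main advantages are:
- Batch deployment efficiency: Ansible runs tasks on many hosts in parallel, so a large number of servers can be deployed in a short time.
- Consistent configuration: templates and variables keep every server's configuration identical and avoid the drift that manual deployment tends to introduce.
- Maintainability: the modular role design and clear directory structure make later maintenance and upgrades straightforward.
- Flexibility: settings can be adjusted per environment (production, staging) and per server configuration.
In practice, validate the deployment scripts in a small test environment first and only then extend them to production. Also review server resource usage regularly and tune the service parameters to the actual load.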