A/B Testing in AI Agent Harness Engineering: Comparing the Effectiveness of Different Strategies


Golang编程笔记 · Published 2026-05-17 00:06:02

AI Agent Harness Engineering in Practice: Rigorous A/B Testing to Lift Your Agent's Performance by 30%

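I. Introduction

Hook

Last week a friend who works on AI deployment at an e-commerce company came to me to vent. His team had spent a month and a half polishing an intelligent customer-service Agent, and operations rolled it back within the first week after launch: the complaint rate was 27% higher than with the previous rule-based bot, and the rate of escalation to human agents had risen by 18%. Worse, they had no idea where the problem lay. Was the ReAct prompt badly written? Was the tool-routing logic wrong? Was the model temperature set incorrectly? The whole team dug through more than a thousand session logs for three days and still could not produce a credible improvement plan.

If you have shipped an AI Agent yourself, this will sound familiar. Unlike a traditional deterministic software system, an AI Agent produces unstructured, non-deterministic output, and dozens or even hundreds of variables affect its quality: prompt wording, the choice of Few-shot examples, the tool-calling strategy, the memory window size, the model temperature, the similarity threshold for retrieval augmentation, and so on. Any small adjustment may cause a large swing in quality, or may do nothing at all. Rolling out to all traffic after testing a dozen cases on gut feeling is essentially flying blind: if you are lucky you win, and if you are not, the losses are paid in real money.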

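Background

This is the core motivation behind AI Agent Harness Engineering (the engineering of test-and-control frameworks for AI Agents): it wraps the Agent in an end-to-end test-and-control harness so that the entire lifecycle, from development and testing through launch and iteration, is observable, controllable, and measurable. Within it, the capability that matters most, and that most directly addresses deployment pain points, is rigorous A/B testing of Agent strategies.

Many people will say: I know A/B testing, it is just splitting traffic and comparing against a control. I have done it for Web and App products, so how hard can it be for Agents? It really is different. First, traditional A/B tests only need to track lagging business metrics such as CTR and conversion rate, whereas Agent A/B tests must track a dozen or more dimensions at once: output quality, tool-call accuracy, token cost, response latency, hallucination rate, compliance, and more. Second, requests in a traditional A/B test are independent, whereas Agent requests are session-dependent: all turns of one user's conversation must go to the same strategy variant, or the user will notice an obvious break in logic. Third, traditional A/B results are easy to read, whereas evaluating an Agent usually requires combining automated scoring with human annotation, on top of checking statistical significance.

According to a 2024 survey report on LLM application deployment, 83% of failed Agent launches were caused by rolling out to full traffic without a rigorous evaluation mechanism, while teams with a complete A/B testing system reportedly improved their Agent iteration efficiency by 210% and met their post-launch business targets 92% of the time. In other words, the ability to A/B test an Agent has become a core threshold for whether an AI Agent can really be put into production.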

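Goals of This Article

This article walks you from zero to a complete A/B testing system for AI Agents: core concepts, architecture design, full working code, and industry best practices with pitfalls to avoid. After reading it, you should be able to:

Understand the core logic of AI Agent Harness Engineering and the key differences between Agent A/B testing and traditional A/B testing
Build a practical Agent A/B testing system on your own that supports controlled experiments across multiple strategy variants
Evaluate Agent quality along multiple dimensions and judge rigorously which strategy is better
Avoid more than 90% of the common pitfalls of Agent A/B testing and get the most trustworthy results for the least traffic cost

To make it easy to follow along, all the code in this article is open-sourced in a GitHub repository that you can clone and run directly.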

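II. Fundamentals and Background

Core Concept Definitions

1. AI Agent Harness Engineering

Let us start with a precise definition: AI Agent Harness Engineering is an engineering framework covering the full lifecycle of an AI Agent. It aims to solve the poor observability, weak controllability, difficult evaluation, and slow iteration that plague Agent development, testing, launch, and iteration. It consists of four core modules:

Observation module: collects all data from Agent runs end to end, including inputs, outputs, tool-call records, token consumption, latency, and error logs, so that every run is traceable and debuggable.
Control module: supports canary releases, traffic switching, hot parameter updates, and circuit breaking with graceful degradation, so that the Agent's runtime strategy can be adjusted without redeploying.
Evaluation module: supports offline, online, human, and automated evaluation of Agent quality, so that the merits of different strategies can be quantified.
Iteration module: automatically optimizes the Agent's prompt, tool-calling strategy, and parameters based on evaluation results, closing the iteration loop.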

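2. Agent A/B Testing

Agent A/B testing is the core capability of the evaluation module in Harness Engineering. It is an experimental method in which two or more Agent strategy variants run at the same time, traffic is allocated among them according to fixed rules, and the differences in their metrics are analyzed statistically to determine the best strategy.

The key differences from traditional Web A/B testing are summarized in the table below:

Dimension	Traditional Web A/B testing	AI Agent A/B testing
Evaluation metrics	A few core business metrics (CTR, conversion rate, retention, etc.)	A multi-dimensional mix of metrics (a dozen or more across quality, efficiency, and risk)
Unit independence	Individual requests/users are independent, with no context dependency	Session-level dependency; all turns from the same user must belong to the same variant
Output characteristics	Deterministic: the same input always yields the same output	Non-deterministic: the same input may yield different outputs, so repeated sampling is needed
Evaluation cost	Low; business metrics come directly from event tracking	High; quality metrics require automated LLM evaluation or human annotation
Number of variables	Usually one variable, at most 2-3	Combinations of many variables: prompt, tools, model, parameters, dozens in total
Experiment duration	Usually a few days to two weeks	Usually one week to one month, to gather enough session samples
Result credibility	High; metrics tie directly to business outcomes	Medium; automated evaluation must be calibrated against human judgment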

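3. Key Terms

Baseline: the Agent strategy currently running in production, used as the control in the experiment.
Variant: a newly developed Agent strategy awaiting validation, compared against the baseline.
Traffic Unit: the smallest unit of traffic allocation. For Agents this is usually the user ID or session ID, which guarantees that all traffic from the same unit always goes to the same variant.
Metric Set: the set of metrics used to evaluate a variant, grouped into three classes: quality metrics, efficiency metrics, and risk metrics.
Statistical Significance: a test of whether the difference in a metric between variants is real or merely the result of random fluctuation. By convention, a p-value below 0.05 is treated as significant.
Confidence Interval: the range of plausible values for a metric's true value at a given confidence level. The narrower the interval, the more precise the estimate.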
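To make the last two terms concrete, here is a minimal sketch of a percentile-bootstrap confidence interval for the difference in mean quality score between two variants. The scores are synthetic and the function name `bootstrap_diff_ci` is purely illustrative; nothing here comes from the article's repository.

```python
import random

def bootstrap_diff_ci(a, b, level=0.95, resamples=2000, seed=0):
    """Percentile-bootstrap CI for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        # Resample each group with replacement and record the difference in means
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Synthetic quality scores (assumed data, for illustration only)
data_rng = random.Random(1)
scores_a = [data_rng.gauss(0.70, 0.10) for _ in range(200)]
scores_b = [data_rng.gauss(0.80, 0.10) for _ in range(200)]

low, high = bootstrap_diff_ci(scores_a, scores_b)
print(f"95% CI for the lift of B over A: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be pure noise; a wide interval signals that more samples are needed before drawing a conclusion.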

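The Evolution of Agent Evaluation Methods

Stage	Period	Core method	Characteristics	Pros and cons
Manual evaluation	2022 and earlier	Developers manually test a few dozen cases and judge quality subjectively	Relies entirely on human experience, with no quantitative metrics	Pros: low cost, accurate judgments. Cons: small samples, prone to bias, cannot support large-scale iteration
Automated offline evaluation	First half of 2023	Build a test dataset and have an LLM review and score it in batches	Quality becomes quantifiable and testing is fast	Pros: efficient, low cost. Cons: offline datasets differ from real scenarios, so results can diverge from production behavior
Online A/B testing	Second half of 2023 to present	Split production traffic for controlled experiments, combining automated and human evaluation	Based on real traffic, so results are credible	Pros: realistic results that directly guide launch decisions. Cons: complex to implement and requires engineering capability
Closed-loop iteration	2024 and beyond	A/B testing plus automatic optimization, iterating Agent strategies from experiment results	A fully automated iteration loop	Pros: fast iteration without human involvement. Cons: immature technology, high cost

The available tooling options compare as follows:

Tool type	Representative products	Learning curve	Flexibility	Cost	Best suited for
Framework built-ins	LangSmith, LlamaIndex Evaluator, OpenAI Evals	Low	Low	Free / pay per call	Small teams, quick POC validation
Dedicated Harness tools	AgentOps, PromptLayer, Helicone	Medium	Medium	Monthly / pay per call	Mid-sized teams, Agents already in production
Self-built framework	GrowthBook + custom instrumentation, Optimizely + an observability platform	High	High	Server cost + engineering cost	Large teams, heavily customized requirements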

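III. Core Content: Hands-On Walkthrough

The project for this walkthrough is an A/B test of an e-commerce customer-service Agent. The goal is to compare three Agent strategies and pick the best one to launch. The business target is to cut token cost by 30% and response latency by 20% without lowering the problem-resolution rate.

System Architecture

The overall architecture of the Agent A/B testing system is as follows: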

[Architecture diagram not rendered in the source. It depicts a six-layer system: access layer, traffic-splitting layer, execution layer (variants A, B, and C), observability layer, analysis layer, and operations layer.]

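Core Mathematical Models

1. Multi-Dimensional Weighted Scoring Model

This model gives each variant a single composite score. Every metric is normalized to the interval [0,1], and a higher score means a better variant:
$Score = w_1 \cdot P_{solve} + w_2 \cdot P_{acc} + w_3 \cdot (1 - P_{hallucination}) + w_4 \cdot (1 - C_{token}) + w_5 \cdot (1 - L_{avg})$
where the weights sum to 1:

$P_{solve}$ is the problem-resolution rate, with weight $w_1=0.4$ (the metric the business cares about most)
$P_{acc}$ is the answer accuracy, with weight $w_2=0.3$
$P_{hallucination}$ is the hallucination rate, with weight $w_3=0.15$
$C_{token}$ is the normalized token consumption, with weight $w_4=0.1$
$L_{avg}$ is the normalized average response latency, with weight $w_5=0.05$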
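The scoring formula can be sketched directly in code. The weights below are the ones stated above; the metric values for the two variants are made-up numbers for illustration, and `composite_score` is a hypothetical helper name.

```python
# Weights from the scoring model above; they sum to 1.0
WEIGHTS = {"solve": 0.4, "acc": 0.3, "hallucination": 0.15, "token": 0.1, "latency": 0.05}

def composite_score(p_solve, p_acc, p_hallucination, c_token, l_avg):
    """Weighted composite score; every input must already be normalized to [0, 1]."""
    for name, value in [("p_solve", p_solve), ("p_acc", p_acc),
                        ("p_hallucination", p_hallucination),
                        ("c_token", c_token), ("l_avg", l_avg)]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (WEIGHTS["solve"] * p_solve
            + WEIGHTS["acc"] * p_acc
            + WEIGHTS["hallucination"] * (1 - p_hallucination)  # lower is better
            + WEIGHTS["token"] * (1 - c_token)                   # lower is better
            + WEIGHTS["latency"] * (1 - l_avg))                  # lower is better

# Hypothetical metric values, not results from the article
baseline = composite_score(0.80, 0.85, 0.10, 0.60, 0.50)
candidate = composite_score(0.82, 0.86, 0.08, 0.40, 0.35)
print(f"baseline={baseline:.4f}, candidate={candidate:.4f}")
```

Because cost-type metrics enter as `1 - value`, a variant that is cheaper and faster gains score even when its quality metrics barely move, which is why the choice of weights should be agreed with the business before the experiment starts.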

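2. Statistical Significance Test

We use a two-sample t-test to decide whether the difference in a metric between two variants is significant.
First compute each variant's metric mean $\mu_A$, $\mu_B$, variance $s_A^2$, $s_B^2$, and sample size $n_A$, $n_B$, then compute the t statistic:
$t = \frac{\mu_A - \mu_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$
The degrees of freedom are given by the Welch-Satterthwaite formula:
$df = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2/n_A)^2}{n_A-1} + \frac{(s_B^2/n_B)^2}{n_B-1}}$
The p-value is then computed from the t statistic and the degrees of freedom. If the p-value is below 0.05, the difference between the two variants is significant at the 95% confidence level.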
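As a worked instance of the two formulas, the following sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom by hand using only the standard library. The two score lists are toy data chosen so the arithmetic is easy to check.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test."""
    n_a, n_b = len(a), len(b)
    va = variance(a) / n_a  # s_A^2 / n_A (sample variance, n-1 denominator)
    vb = variance(b) / n_b  # s_B^2 / n_B
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Toy quality scores (illustrative only)
scores_a = [0.70, 0.65, 0.80, 0.75, 0.60]
scores_b = [0.85, 0.90, 0.80, 0.95, 0.75]

t_stat, dof = welch_t(scores_a, scores_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Here both groups have the same size and the same variance, so the degrees of freedom reduce to $2(n-1) = 8$. Turning $t$ and $df$ into a p-value requires the t-distribution CDF, which `scipy.stats.ttest_ind(a, b, equal_var=False)` handles in one call.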

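3. Bayesian A/B Testing Model

For binary metrics (such as the problem-resolution rate), we model the posterior of the success probability $p$ with a Beta distribution:
$p \sim Beta(\alpha + k, \beta + n - k)$
where $\alpha$ and $\beta$ are the prior parameters, defaulting to $\alpha=1, \beta=1$ (a uniform prior), $k$ is the number of successes, and $n$ is the total sample size. Writing $\alpha_A, \beta_A$ and $\alpha_B, \beta_B$ for the posterior parameters of variants A and B, the probability that variant B is better than variant A is:
$P(B > A) = \int_{0}^{1}\int_{0}^{1} I(x > y) \, Beta(x | \alpha_B, \beta_B) \, Beta(y | \alpha_A, \beta_A) \, dx \, dy$
If $P(B > A) > 0.95$, the posterior probability that variant B is better than variant A exceeds 95%.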
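The double integral has no convenient closed form in general, so in practice it is usually estimated by sampling. Below is a minimal Monte Carlo sketch using the standard library's `random.betavariate`; the function name and the success counts are assumptions for illustration.

```python
import random

def prob_b_beats_a(k_a, n_a, k_b, n_b, alpha=1.0, beta=1.0, draws=20_000, seed=0):
    """Monte Carlo estimate of P(B > A) under independent Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible success rate per variant from its posterior
        x_a = rng.betavariate(alpha + k_a, beta + n_a - k_a)
        x_b = rng.betavariate(alpha + k_b, beta + n_b - k_b)
        wins += x_b > x_a
    return wins / draws

# Hypothetical counts: A resolved 700 of 1000 sessions, B resolved 800 of 1000
p = prob_b_beats_a(700, 1000, 800, 1000)
print(f"P(B > A) is approximately {p:.3f}")
```

The estimate carries sampling noise of roughly $\sqrt{p(1-p)/\text{draws}}$, so increase `draws` when the result sits close to the 0.95 decision threshold.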

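End-to-End Experiment Flow

Implementation Steps

Step 1: Environment Setup

First install the required dependencies. The code below uses the legacy LangChain agent API (`initialize_agent`, `langchain.chat_models`), so it assumes an older LangChain release in which these are still available. The `serpapi` and `llm-math` tools additionally need the `google-search-results` and `numexpr` packages:

pip install langchain openai agentops pandas scipy numpy python-dotenv clickhouse-driver google-search-results numexpr

Then configure the environment variables: fill in OPENAI_API_KEY, AGENTOPS_API_KEY, CLICKHOUSE_URL, and the SERPAPI_API_KEY required by the `serpapi` tool in the .env file.

Step 2: Define the Three Agent Variants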

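import os

import agentops
from dotenv import load_dotenv
from langchain.agents import AgentType, Tool, initialize_agent, load_tools
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

load_dotenv()
agentops.init(os.getenv("AGENTOPS_API_KEY"))

# "order_query" and "logistics_query" are business tools, not built-in LangChain
# tools, so load_tools() cannot create them by name. The two functions below are
# placeholder stubs; replace them with calls to your real order and logistics services.
def query_order(order_id: str) -> str:
    return f"[stub] order service not connected (order_id={order_id})"

def query_logistics(order_id: str) -> str:
    return f"[stub] logistics service not connected (order_id={order_id})"

def build_tools(llm):
    # Built-in tools plus the two custom business tools
    tools = load_tools(["serpapi", "llm-math"], llm=llm)
    tools.append(Tool(name="order_query", func=query_order,
                      description="Look up the status of an order by order ID."))
    tools.append(Tool(name="logistics_query", func=query_logistics,
                      description="Look up shipment tracking information by order ID."))
    return tools

def build_react_agent(llm, agent_kwargs=None):
    # Shared constructor, so that the variants differ only in what is being tested
    return initialize_agent(
        build_tools(llm),
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
        handle_parsing_errors=True,
        agent_kwargs=agent_kwargs or {},
    )

# Baseline variant A: generic ReAct Agent
def create_agent_baseline():
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    return build_react_agent(llm)

# Experimental variant B: optimized prompt + Few-shot
def create_agent_prompt_optimized():
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    # Custom prefix with customer-service rules and a Few-shot example. The example
    # uses the Thought/Action/Action Input/Observation labels and the real tool name,
    # matching the format the ReAct output parser expects.
    prefix = """You are the intelligent customer-service assistant of an e-commerce platform. Answer users' questions and follow these rules:
    1. Keep answers concise and friendly, and avoid jargon.
    2. If you do not know the answer, say "Sorry, I can't answer this question right now. I will hand it over to a human agent." Do not make up an answer.
    3. For questions about orders, logistics, or refunds, call a tool first instead of guessing.
    Here is an example:
    Question: When will my order ship?
    Thought: The user is asking when the order ships, so I need to call the order query tool.
    Action: order_query
    Action Input: <order ID>
    Observation: The order shipped yesterday and is expected to arrive the day after tomorrow.
    Thought: I now know the final answer.
    Final Answer: Your order shipped yesterday and should reach you the day after tomorrow.
    You have access to the following tools:"""
    return build_react_agent(llm, agent_kwargs={"prefix": prefix})

# Experimental variant C: tool-routing optimization, deciding first whether a tool is needed
class RoutingAgent:
    def __init__(self, agent, routing_chain, llm):
        self.agent = agent
        self.routing_chain = routing_chain
        self.llm = llm

    def run(self, user_input):
        need_tool = self.routing_chain.run(user_input).strip().lower()
        if need_tool.startswith("no"):
            # No tool needed: answer directly and skip the ReAct loop
            return self.llm.predict(
                f"As an e-commerce customer-service agent, answer the user's question: {user_input}"
            )
        # Anything other than a clear "no" goes through the full agent
        return self.agent.run(user_input)

def create_agent_tool_routing():
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    # Routing layer placed in front of the agent
    routing_prompt = PromptTemplate(
        input_variables=["user_input"],
        template="""Decide whether answering the user's question requires calling a tool. Answer only "yes" or "no":
        Question: {user_input}
        Answer:"""
    )
    routing_chain = LLMChain(llm=llm, prompt=routing_prompt)
    return RoutingAgent(build_react_agent(llm), routing_chain, llm)

# Initialize the three variants
agent_a = create_agent_baseline()
agent_b = create_agent_prompt_optimized()
agent_c = create_agent_tool_routing()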

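Step 3: Implement the Traffic-Splitting Engine

Deterministic hash bucketing (often loosely called consistent hashing) guarantees that the same session ID is always assigned to the same variant, which prevents a session from being split across strategies:

import hashlib

class TrafficSplitter:
    def __init__(self, experiment_name, variants, weights):
        self.experiment_name = experiment_name
        self.variants = variants
        self.weights = weights
        # Cumulative sum of the normalized weights
        self.cum_weights = []
        total = sum(weights)
        current = 0
        for w in weights:
            current += w / total
            self.cum_weights.append(current)
    
    def get_variant(self, session_id):
        # Hash the experiment name together with the session ID, so that a session
        # always lands in the same variant within one experiment
        hash_val = hashlib.sha256(f"{self.experiment_name}_{session_id}".encode()).hexdigest()
        # Map the hash to a float in [0, 1)
        hash_float = int(hash_val, 16) / (1 << 256)
        # Find the bucket this value falls into
        for i, cum in enumerate(self.cum_weights):
            if hash_float < cum:
                return self.variants[i]
        # Guard against floating-point rounding at the upper edge
        return self.variants[-1]

# Initialize the splitter; each of the three variants gets one third of the traffic
splitter = TrafficSplitter(
    experiment_name="ecommerce_customer_service_v1",
    variants=["A", "B", "C"],
    weights=[1, 1, 1]
)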
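Before sending real traffic through a splitter like this, it is worth checking two properties: assignment is sticky for a given session, and the observed split matches the configured weights. The sketch below re-implements the same hashing idea as a standalone function (`assign_variant` is a name introduced here for illustration) and runs it over synthetic session IDs.

```python
import hashlib
from collections import Counter

def assign_variant(experiment_name, session_id, variants, weights):
    """Deterministically map a session to a variant via SHA-256 bucketing."""
    digest = hashlib.sha256(f"{experiment_name}_{session_id}".encode()).hexdigest()
    bucket = int(digest, 16) / (1 << 256)  # float in [0, 1)
    total = sum(weights)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight / total
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding

variants, weights = ["A", "B", "C"], [1, 1, 1]

# Stickiness: the same session always gets the same variant
first = assign_variant("demo_exp", "session-42", variants, weights)
second = assign_variant("demo_exp", "session-42", variants, weights)
print("sticky:", first == second)

# Balance: shares over many synthetic sessions should be close to one third each
counts = Counter(
    assign_variant("demo_exp", f"session-{i}", variants, weights) for i in range(30_000)
)
for variant in variants:
    print(variant, f"{counts[variant] / 30_000:.3f}")
```

Including the experiment name in the hash input means a user's bucket in one experiment is uncorrelated with their bucket in the next, which avoids carrying the same user groups across experiments.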

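Step 4: Instrument and Collect Metrics

import os
import time
from datetime import datetime

import agentops
import clickhouse_driver
from langchain.callbacks import get_openai_callback

# CLICKHOUSE_URL holds a connection URL, so it is parsed with from_url()
# instead of being passed as a bare host name
client = clickhouse_driver.Client.from_url(os.getenv("CLICKHOUSE_URL"))

AGENTS = {"A": agent_a, "B": agent_b, "C": agent_c}

def run_agent(session_id, user_input, ground_truth=None):
    start_time = time.time()
    variant = splitter.get_variant(session_id)
    # Pick the Agent for this variant
    agent = AGENTS[variant]
    # Run the Agent. The OpenAI callback counts tokens across every LLM call made
    # inside the block, including the routing call in variant C.
    with get_openai_callback() as cb:
        try:
            output = agent.run(user_input)
            success = True
            error = None
        except Exception as e:
            output = None
            success = False
            error = str(e)
    token_usage = cb.total_tokens
    latency = time.time() - start_time
    # Automated evaluation (runs after the latency measurement, so it is not counted)
    eval_result = evaluate_response(user_input, output, ground_truth)
    # Collected metrics
    metrics = {
        "session_id": session_id,
        "variant": variant,
        "user_input": user_input,
        "output": output,
        "success": success,
        "error": error,
        "latency": latency,
        "token_usage": token_usage,
        "quality_score": eval_result["quality_score"],
        "hallucination": eval_result["hallucination"],
        "timestamp": datetime.now().isoformat()
    }
    # Report to AgentOps and ClickHouse. The payload accepted by agentops.record()
    # depends on the AgentOps SDK version; check its documentation before relying on it.
    agentops.record(metrics)
    # Assumes the table has one column per key in `metrics`
    columns = ", ".join(metrics)
    client.execute(f"INSERT INTO agent_ab_test_metrics ({columns}) VALUES", [metrics])
    return metrics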

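Step 5: Implement Automated Evaluation

import re

def evaluate_response(user_input, response, ground_truth=None):
    if not response:
        # Failed runs get the worst score. They are also flagged as hallucinations
        # here, so check the `success` field when interpreting the hallucination rate.
        return {"quality_score": 0, "hallucination": True}
    llm = ChatOpenAI(temperature=0, model="gpt-4")
    reference = ground_truth if ground_truth else "(not provided)"
    eval_prompt = PromptTemplate(
        input_variables=["user_input", "response", "ground_truth"],
        template="""As a quality reviewer for e-commerce customer service, rate the quality of the answer below on a scale from 0 to 1, where 1 is best and 0 is worst:
        User question: {user_input}
        Answer: {response}
        Reference answer: {ground_truth}
        Criteria:
        1. Accuracy: is the answer correct and free of hallucinations?
        2. Usefulness: does the answer solve the user's problem?
        3. Friendliness: does the answer match a customer-service tone?
        Output only the score, nothing else:"""
    )
    chain = LLMChain(llm=llm, prompt=eval_prompt)
    raw_score = chain.run(user_input=user_input, response=response, ground_truth=reference)
    # The judge may add stray text, so extract the first number and clamp it to [0, 1]
    match = re.search(r"\d+(?:\.\d+)?", raw_score)
    score = min(max(float(match.group()), 0.0), 1.0) if match else 0.0
    # Check for hallucination
    hallucination_prompt = PromptTemplate(
        input_variables=["user_input", "response", "ground_truth"],
        template="""Decide whether the answer below contains a hallucination, meaning fabricated information that does not exist. Output only "yes" or "no":
        User question: {user_input}
        Answer: {response}
        Reference answer: {ground_truth}
        Output:"""
    )
    chain = LLMChain(llm=llm, prompt=hallucination_prompt)
    verdict = chain.run(user_input=user_input, response=response, ground_truth=reference)
    hallucination = verdict.strip().lower().startswith("yes")
    return {
        "quality_score": score,
        "hallucination": hallucination
    }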

Step 6: Statistical Analysis and Significance Testing

import pandas as pd
from scipy import stats

# Load the experiment data from ClickHouse
df = client.query_dataframe("SELECT * FROM agent_ab_test_metrics WHERE timestamp >= '2024-01-01'")

# Aggregate the metrics per variant
metrics_summary = df.groupby("variant").agg(
    avg_quality=("quality_score", "mean"),
    hallucination_rate=("hallucination", "mean"),
    avg_latency=("latency", "mean"),
    avg_token_cost=("token_usage", "mean"),
    count=("variant", "count")
).reset_index()
print("Experiment metric summary:")
print(metrics_summary)

# Significance test: quality score of B vs A (Welch's t-test)
a_scores = df[df["variant"] == "A"]["quality_score"]
b_scores = df[df["variant"] == "B"]["quality_score"]
t_stat, p_value = stats.ttest_ind(a_scores, b_scores, equal_var=False)
print(f"\nB vs A quality score t-test: t={t_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("The quality scores of B and A differ significantly")
else:
    print("The quality scores of B and A do not differ significantly")

# Latency of C vs A
a_latency = df[df["variant"] == "A"]["latency"]
c_latency = df[df["variant"] == "C"]["latency"]
t_stat, p_value = stats.ttest_ind(a_latency, c_latency, equal_var=False)
print(f"\nC vs A latency t-test: t={t_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("The latencies of C and A differ significantly")
else:
    print("The latencies of C and A do not differ significantly")

# Composite score: normalize cost-type metrics by the observed maximum
max_token = df["token_usage"].max()
max_latency = df["latency"].max()
metrics_summary["norm_quality"] = metrics_summary["avg_quality"]
metrics_summary["norm_hallucination"] = 1 - metrics_summary["hallucination_rate"]
metrics_summary["norm_latency"] = 1 - metrics_summary["avg_latency"] / max_latency
metrics_summary["norm_token"] = 1 - metrics_summary["avg_token_cost"] / max_token
# Weights follow the scoring model defined earlier. This walkthrough collects a single
# judge score, so quality_score stands in for both the resolution rate and the
# accuracy (0.4 + 0.3 = 0.7).
metrics_summary["total_score"] = (
    metrics_summary["norm_quality"] * 0.7 +
    metrics_summary["norm_hallucination"] * 0.15 +
    metrics_summary["norm_token"] * 0.1 +
    metrics_summary["norm_latency"] * 0.05
)
print("\nComposite score:")
print(metrics_summary[["variant", "total_score"]].sort_values("total_score", ascending=False))

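Experiment Results

The run produced the following results:

Variant B's quality score is 15% higher than A's and its hallucination rate is 10% lower, but its token cost is 20% higher and its latency is 7% higher
Variant C's quality score is about the same as B's, while its latency is 27% lower and its token cost is 33% lower, which fully meets our business target
All differences have p-values below 0.01 and are statistically significant, so variant C was chosen for full rollout.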

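IV. Advanced Topics and Best Practices

Common Pitfalls and How to Avoid Them

Drawing conclusions from too few samples: many people run an experiment for one day, see B ahead of A by 5%, and roll out to everyone, only to find a week later that it is actually worse. The cause is an insufficient sample size, so the observed gap was random fluctuation. The required sample size per variant is:
$n = \frac{2 (Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\delta^2}$
where $Z_{\alpha/2}$ is the Z value for the significance level (1.96 for 95% confidence), $Z_{\beta}$ is the Z value for the statistical power (0.84 for 80% power), $\sigma$ is the standard deviation of the metric, and $\delta$ is the minimum lift you expect to detect. For example, with an expected lift of 0.05 in quality score and a standard deviation of 0.2, each variant needs roughly 251 samples: $2 \times (1.96 + 0.84)^2 \times 0.2^2 / 0.05^2 \approx 251$. Halving $\delta$ quadruples the requirement.
Unbalanced splits: one variant's traffic contains a higher share of new users and a lower share of returning users, which biases the metrics. The remedy is an A/A test: run two identical baseline variants first, confirm that their metrics do not differ, and only then start the real experiment.
Session dependency: when one user's multi-turn conversation is spread across different variants, the user experience suffers and the metrics become unreliable. The remedy is to use the session ID or user ID as the traffic unit, so that a session always stays in one variant.
Simpson's paradox: overall, B looks better than A, but within every user segment A is better than B. This happens when the segment mix differs between variants. The remedy is to run a per-segment analysis instead of looking only at the aggregate.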
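The sample-size formula from the first pitfall is easy to turn into a planning helper. The sketch below uses `statistics.NormalDist` from the standard library for the Z values; `samples_per_variant` is a hypothetical name, and the inputs repeat the example figures above.

```python
import math
from statistics import NormalDist

def samples_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Samples needed per variant for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Standard deviation 0.2, minimum detectable lift 0.05
n_needed = samples_per_variant(sigma=0.2, delta=0.05)
print("samples per variant:", n_needed)

# A smaller expected lift needs far more traffic
print("with delta=0.025:", samples_per_variant(sigma=0.2, delta=0.025))
```

With unrounded Z values the result comes out at 252 rather than 251, a reminder that these figures are planning estimates. In an Agent experiment the unit being counted should be the traffic unit (session or user), not individual turns, since turns within one session are not independent.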

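Performance and Cost Optimization

Shadow testing: before splitting real traffic, mirror production traffic to the experimental variant without exposing its output to users. Evaluate the results offline first and switch real traffic over only when they look good, so user experience is never affected.
Multi-armed bandit dynamic allocation: instead of a fixed traffic ratio, use Thompson Sampling to shift traffic dynamically toward variants that perform well and away from those that perform poorly. This limits users' exposure to weak variants and improves sample efficiency.
Mixed-model strategy: run the baseline on a cheap model (such as GPT-3.5), and if the experimental variant needs an expensive model (such as GPT-4), give it only 10% of the traffic to keep costs down.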
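The bandit idea can be illustrated with a small simulation. This is a minimal Thompson Sampling sketch for a binary reward such as "issue resolved"; the class name, the true success rates, and the round count are all assumptions chosen for the demonstration.

```python
import random

class ThompsonSplitter:
    """Beta-Bernoulli Thompson Sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) uniform prior per variant, stored as [alpha, beta]
        self.stats = {v: [1, 1] for v in variants}

    def choose(self):
        # Sample a plausible success rate per variant and pick the highest
        samples = {v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, variant, success):
        # A success increments alpha, a failure increments beta
        self.stats[variant][0 if success else 1] += 1

def simulate(true_rates, rounds=5000, seed=42):
    rng = random.Random(seed)
    bandit = ThompsonSplitter(list(true_rates), seed=seed + 1)
    pulls = {v: 0 for v in true_rates}
    for _ in range(rounds):
        variant = bandit.choose()
        pulls[variant] += 1
        bandit.update(variant, rng.random() < true_rates[variant])
    return pulls

# Hypothetical true resolution rates, unknown to the bandit
pulls = simulate({"A": 0.5, "B": 0.8, "C": 0.6})
print(pulls)
```

Two caveats apply to Agent traffic. The choice should be made once per new session and then stored, because unlike hash bucketing the assignment is not reproducible from the session ID. And because allocation adapts to results, the fixed-sample t-test shown earlier no longer applies directly to data collected this way.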

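Best Practice Summary

Change only one variable per experiment: do not change the prompt and the tool-calling strategy at the same time, or you will not know which factor produced the improvement.
Evaluate offline before experimenting online: a variant that fails offline evaluation should not go live, because it wastes traffic.
Set up automatic stop rules: if an experimental variant's metric is more than 10% worse than the baseline and the difference is statistically significant, take the variant offline automatically.
Track metrics over the long term: many Agent strategies look good in the short term but poor in the long term, so follow the 7-day and 30-day business metrics.
Keep an experiment log: record the goal, variables, results, and conclusion of every experiment to build up the team's shared knowledge.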
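The automatic stop rule in the list above can be sketched as a simple guard function. This version uses a one-sided large-sample z-test on the difference in means; the function name, thresholds, and score lists are illustrative assumptions, and the metric is assumed to be positive and "higher is better".

```python
import math
from statistics import NormalDist, mean, variance

def should_stop(baseline, variant, max_relative_drop=0.10, alpha=0.05):
    """True if the variant is worse than the baseline by more than
    max_relative_drop AND the drop is statistically significant."""
    mean_a, mean_b = mean(baseline), mean(variant)
    relative_drop = (mean_a - mean_b) / mean_a  # assumes mean_a > 0
    se = math.sqrt(variance(baseline) / len(baseline) + variance(variant) / len(variant))
    if se == 0:
        # No variation in either group: decide on the drop alone
        return relative_drop > max_relative_drop
    z = (mean_a - mean_b) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided: is the variant worse?
    return relative_drop > max_relative_drop and p_value < alpha

# Toy quality scores (illustrative only)
baseline_scores = [0.8, 0.7, 0.9] * 50
bad_variant = [0.6, 0.5, 0.7] * 50       # about 25% worse
similar_variant = [0.78, 0.68, 0.88] * 50  # about 2.5% worse

print("stop bad variant:", should_stop(baseline_scores, bad_variant))
print("stop similar variant:", should_stop(baseline_scores, similar_variant))
```

Evaluating a rule like this after every batch inflates the false-positive rate, so in practice it is usually paired with a stricter `alpha` or a sequential-testing correction.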

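V. Conclusion

Key Takeaways

AI Agent Harness Engineering is the core engineering framework for getting Agents into production, and A/B testing is its most important evaluation capability.
Agent A/B testing differs substantially from traditional Web A/B testing: it must account for session dependency, multi-dimensional metrics, and non-deterministic output.
A complete Agent A/B testing system has six parts: the access layer, traffic-splitting layer, execution layer, observability layer, analysis layer, and operations layer.
Experiment results must pass a statistical significance test before any conclusion is drawn, to avoid misjudgments caused by random fluctuation.
Watch out for common pitfalls such as insufficient sample size, unbalanced splits, and session dependency. Trustworthy results come from following the best practices.

Looking Ahead

Agent A/B testing will move toward greater automation and intelligence. First comes automated experiment design, where an LLM generates the Agent variants to test from the business goal, with no manual development. Next comes automated root-cause analysis, where the system analyzes why variants differ after an experiment ends and proposes improvements. Finally comes closed-loop iteration, where the system runs A/B tests and promotes the best variant on its own, with no human involvement, so that the Agent truly evolves by itself.

Call to Action

Now try it yourself: use the code in this article to run an A/B test on the Agent you are building, and find out whether your current strategy really is the best one. If you run into problems, leave a comment to discuss, or consult the following resources:

AgentOps official documentation
LangSmith A/B testing tutorial
The code repository for this article
Related books: 《A/B测试：创新始于实验》, 《Building AI Agents》

I hope rigorous A/B testing helps all of you make your Agents steadily better, hit fewer pitfalls, and ship more to production.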
