
Is Traditional Chunking Dead? Agentic Chunking Fixes Semantic Fragmentation, with a Reported 40% Gain in RAG Accuracy: A Must-Read for LLM Developers
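A colleague working on an LLM project recently came to me with a problem: a proper noun appeared many times in a document, yet the RAG pipeline kept missing key information about it. After some digging, the cause turned out to be the traditional chunking method: closely related sentences that sat several pages apart had been split into different chunks. I offered some general advice, such as replacing pure semantic retrieval with hybrid retrieval and generating QA pairs from each chunk. He then asked whether a chunking technique could reduce this kind of problem. I suggested trying a recently proposed chunking strategy: Agentic Chunking.
Why does chunking matter so much?
In a RAG pipeline, text chunking is the first step and arguably the most critical one. Traditional methods such as recursive character splitting are simple to use, but they have an obvious drawback: they cut the text at a fixed chunk size (measured in characters or tokens), so a single topic can end up spread across several chunks, which breaks the continuity of the context.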
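To make the drawback concrete, here is a minimal sketch of fixed-size splitting in plain Python. It is not LangChain's recursive splitter: `split_fixed` and the sample text are made up for illustration, and the size is counted in characters.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Two sentences about the same topic (illustrative sample text)
text = (
    "Self-attention lets every token attend to every other token. "
    "It is the core of the Transformer."
)

pieces = split_fixed(text, 40)
for piece in pieces:
    # Each piece is cut purely by length, regardless of sentence boundaries
    print(repr(piece))
```

Real splitters try separators such as paragraph and sentence breaks before falling back to a hard cut, but the size limit still decides where a topic ends.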
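Another common method is semantic splitting, which cuts wherever it detects a shift in meaning between adjacent sentences. It is smarter than recursive character splitting, but it has limits too. When a document switches back and forth between topics, semantic splitting can still place related content in different chunks, leaving the information disjointed.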
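The limitation can be sketched with a toy version of semantic splitting. This is an illustration only: word-set overlap stands in for embedding similarity, and `similarity`, `semantic_split`, the threshold, and the sample sentences are all assumptions made for the example.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a toy stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def semantic_split(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk whenever two adjacent sentences fall below the threshold.

    Assumes a non-empty list of sentences.
    """
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) < threshold:
            chunks.append([cur])      # meaning shifted: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return chunks


# Topic A, then topic B, then back to topic A
sentences = [
    "the transformer architecture uses attention",
    "ticket prices start at 19 dollars",
    "the transformer core is self attention",
]
chunks = semantic_split(sentences, threshold=0.2)
for chunk in chunks:
    print(chunk)
```

Because only adjacent sentences are compared, the first and third sentences end up in separate chunks even though they are similar to each other.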
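Both methods fail in a scenario like this one:
“Xiao Ming introduced the Transformer architecture… (five paragraphs of other content in between) … Finally, he stressed that the core of the Transformer is the self-attention mechanism.”
Traditional methods either put these two sentences into different chunks or let the material in between break the semantic link. A human chunking by hand would naturally file both under a “model principles” group. This kind of association across textual distance is exactly what Agentic Chunking is meant to capture.
How Agentic Chunking works
The core idea of Agentic Chunking is to have a large language model (LLM) actively evaluate every sentence and assign it to the most suitable chunk. Unlike traditional methods, it does not rely on a fixed length or on detected semantic shifts. Instead, it uses the LLM's judgment to group sentences that are far apart in the document but related in topic.
For example, suppose we have the following text: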
`On July 20, 1969, astronaut Neil Armstrong walked on the moon. He was leading the NASA’s Apollo 11 mission. Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface. `
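In Agentic Chunking, the LLM first applies propositioning to these sentences: each sentence is made self-contained so that it carries its own explicit subject. The processed text looks like this: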
`On July 20, 1969, astronaut Neil Armstrong walked on the moon. Neil Armstrong was leading the NASA’s Apollo 11 mission. Neil Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface. `
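This lets the LLM examine each sentence on its own and assign it to the most suitable chunk.
Propositioning can be thought of as a sentence-level makeover of the document that makes every sentence complete and independent.
How to implement Agentic Chunking
The keys to implementing Agentic Chunking are propositioning and the dynamic creation and updating of chunks. Tools such as LangChain and Pydantic can be used to build this. The process breaks down into the following steps:
1. Propositioning the text
First, every sentence in the text goes through propositioning. We can use a prompt template from the LangChain Hub and let the LLM do the work. Here is a simple code example: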
```python
from typing import List

from langchain import hub
from langchain_core.prompts import ChatPromptTemplate  # used in the later steps
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

# Pull the propositioning prompt from the LangChain Hub
obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model="gpt-4o")


# Structured output schema: a flat list of standalone sentences
class Sentences(BaseModel):
    sentences: List[str]


extraction_llm = llm.with_structured_output(Sentences)
extraction_chain = obj | extraction_llm

# Run the chain on the sample text; the result is a Sentences object
sentences = extraction_chain.invoke(
    """
    On July 20, 1969, astronaut Neil Armstrong walked on the moon.
    He was leading the NASA's Apollo 11 mission.
    Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
    """
)
```
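2. Creating and updating chunks
Next, we need a function that creates and updates chunks dynamically. Each chunk holds propositions on a similar topic, and its title and summary are refreshed as new propositions are added.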
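```python
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# In-memory store of chunks, keyed by chunk id
chunks = {}


# The original snippet uses ChunkMeta without showing it; this minimal
# definition is inferred from the two fields read below (title and summary).
class ChunkMeta(BaseModel):
    title: str = Field(description="The title of the chunk.")
    summary: str = Field(description="The summary of the chunk.")


def create_new_chunk(chunk_id, proposition):
    # Ask the LLM for a title and summary of the new chunk
    summary_llm = llm.with_structured_output(ChunkMeta)
    summary_prompt_template = ChatPromptTemplate.from_messages([
        ("system", "Generate a new summary and a title based on the propositions."),
        ("user", "propositions:{propositions}"),
    ])
    summary_chain = summary_prompt_template | summary_llm
    chunk_meta = summary_chain.invoke({"propositions": [proposition]})

    # Store the chunk with its first proposition
    chunks[chunk_id] = {
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }
```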
3. Pushing a proposition into the right chunk
Finally, we need an AI agent that decides which chunk a new proposition belongs to. If no suitable chunk exists, the agent creates a new one.
```python
from pydantic import BaseModel, Field


def find_chunk_and_push_proposition(proposition):
    # Structured output: the id of the chunk the proposition belongs to
    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")

    allocation_llm = llm.with_structured_output(ChunkID)
    allocation_prompt = ChatPromptTemplate.from_messages([
        ("system", "Find the chunk that best matches the proposition. If no chunk matches, return a new chunk id."),
        ("user", "proposition:{proposition} chunks_summaries:{chunks_summaries}"),
    ])
    allocation_chain = allocation_prompt | allocation_llm

    # Only the summaries are sent to the LLM, keyed by chunk id
    chunks_summaries = {chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()}
    best_chunk_id = allocation_chain.invoke(
        {"proposition": proposition, "chunks_summaries": chunks_summaries}
    ).chunk_id

    # An unknown id means no existing chunk matched: open a new one
    if best_chunk_id not in chunks:
        create_new_chunk(best_chunk_id, proposition)
    else:
        # add_proposition is called here but is not shown in this article
        add_proposition(best_chunk_id, proposition)
```
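The article does not show `add_proposition` or the loop that ties the three steps together. The following is one possible, dependency-free sketch of that loop, not the author's implementation. The LLM calls are replaced by two plain callables, `allocate` and `summarize` (hypothetical names), and the keyword-based stand-ins exist only so the example runs offline. Unlike the snippet above, the allocator here returns `None` for "no match" instead of a new id.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Chunk:
    summary: str = ""
    propositions: List[str] = field(default_factory=list)


def agentic_chunk(
    propositions: List[str],
    allocate: Callable[[str, Dict[int, str]], Optional[int]],
    summarize: Callable[[List[str]], str],
) -> Dict[int, Chunk]:
    """Assign each proposition to an existing chunk or open a new one.

    allocate(proposition, summaries) returns a chunk id, or None for "no match".
    summarize(propositions) returns a fresh summary for the chunk.
    In a real pipeline both would be LLM calls.
    """
    chunks: Dict[int, Chunk] = {}
    for prop in propositions:
        summaries = {cid: c.summary for cid, c in chunks.items()}
        cid = allocate(prop, summaries)
        if cid is None or cid not in chunks:
            cid = len(chunks)          # open a new chunk
            chunks[cid] = Chunk()
        chunks[cid].propositions.append(prop)
        # Refresh the summary every time the chunk grows
        chunks[cid].summary = summarize(chunks[cid].propositions)
    return chunks


# Toy stand-ins for the LLM: topics are detected by keyword only
TOPICS = ("transformer", "ticket")


def toy_topic(text: str) -> str:
    lowered = text.lower()
    return next((t for t in TOPICS if t in lowered), "other")


def toy_summarize(props: List[str]) -> str:
    return toy_topic(props[0])


def toy_allocate(prop: str, summaries: Dict[int, str]) -> Optional[int]:
    topic = toy_topic(prop)
    return next((cid for cid, s in summaries.items() if s == topic), None)


propositions = [
    "Xiao Ming introduced the Transformer architecture.",
    "Ticket prices start at 19 dollars.",
    "The core of the Transformer is self-attention.",
]
result = agentic_chunk(propositions, toy_allocate, toy_summarize)
for cid, chunk in result.items():
    print(cid, chunk.summary, chunk.propositions)
```

The first and third propositions land in the same chunk although an unrelated sentence sits between them, which is the behavior the earlier Transformer example calls for.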
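How does it perform in practice?
I used the introductory text for Wings of Time, a well-known attraction on Sentosa in Singapore, as the test input and processed it with a GPT-4 model. The text covers the attraction's description, ticketing information, opening hours, and more, which makes it a good test sample.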
`Product Name: Wings of Time Product Description: Wings of Time is one of Sentosa's most breathtaking attractions, combining water, laser, fire, and music to create a mesmerizing night show about friendship and courage. Situated on the scenic (https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/) Siloso Beach , this award-winning spectacle is staged nightly, promising an unforgettable experience for visitors of all ages. Be wowed by spellbinding laser, fire, and water effects set to a majestic soundtrack, complete with a jaw-dropping fireworks display. A fitting end to your day out at Sentosa, it’s possibly the only place in Singapore where you can witness such an awe-inspiring performance. Get ready for an even better experience starting 1 February 2025 ! Wings of Time Fireworks Symphony, Singapore’s only daily fireworks show, now features a fireworks display that is four times longer! Important Note: Please visit (https://www.sentosa.com.sg/sentosa-reservation) here if you need to change your visit date. All changes must be made at least 1 day prior to the visit date. Product Category: Shows Product Type: Attraction Keywords: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets Meta Description: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening. Product Tags: Family Fun,Popular experiences,Frequently Bought Locations: Beach Station [Tickets] Name: Wings of Time (Std) Terms: • All Wings of Time (WOT) Open-Dated tickets require prior redemption at Singapore Cable Car Ticketing counters and are subjected to seats availability on a first come first serve basis. • This is a rain or shine event. Tickets are non-exchangeable or nonrefundable under any circumstances. 
• Once timeslot is confirmed, no further amendments are allowed. Please proceed to WOT admission gates to scan your issued QR code via mobile or physical printout for admission. • Gates will open 15 minutes prior to the start of the show. • Show Duration: 20 minutes per show. • Please be punctual for your booked time slot. • Admission will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host. • Standard seats are applicable to guest aged 4 years and above. • No outside Food & Drinks are allowed. • Refer to (https://www.mountfaberleisure.com/attraction/wings-of-time/) https://www.mountfaberleisure.com/attraction/wings-of-time/ for more information on Wings of Time. Pax Type: Standard Promotion A: Enjoy $1.90 off when you purchase online! Discount will automatically be applied upon checkout. Price: 19 Opening Hours: Daily Show 1: 7.40pm Show 2: 8.40pm Accessibilities: Wheelchair [Information] Title: Terms & Conditions Description: For more information, click (https://www.sentosa.com.sg/en/promotional-general-store-terms-and-conditions) here for Terms & Conditions Title: Getting Here Description: By Sentosa Express: Alight at Beach Station By Public Bus: Board Bus 123 and alight at Beach Station By Intra-Island Bus: Board Sentosa Bus A or B and alight at Beach Station Nearest Car Park Beach Station Car Park Title: Contact Us Description: Beach Station +65 6361 0088 (mailto:guestrelations@mflg.com.sg) guestrelations@mflg.com.sg `
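The system first turns the source text into more than 50 standalone statements (propositions). Interestingly, in doing so it rewrote most sentences to use "Wings of Time" as the explicit subject, which suggests the model identified the main topic of the text correctly.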
`[ "Wings of Time is one of Sentosa's most breathtaking attractions.", 'Wings of Time combines water, laser, fire, and music to create a mesmerizing night show.', 'The night show of Wings of Time is about friendship and courage.', 'Wings of Time is situated on the scenic Siloso Beach.', 'Wings of Time is an award-winning spectacle staged nightly.', 'Wings of Time promises an unforgettable experience for visitors of all ages.', 'Wings of Time features spellbinding laser, fire, and water effects set to a majestic soundtrack.', 'Wings of Time includes a jaw-dropping fireworks display.', 'Wings of Time is a fitting end to a day out at Sentosa.', 'Wings of Time is possibly the only place in Singapore where such an awe-inspiring performance can be witnessed.', 'Wings of Time will offer an even better experience starting 1 February 2025.', 'Wings of Time Fireworks Symphony is Singapore’s only daily fireworks show.', 'Wings of Time Fireworks Symphony now features a fireworks display that is four times longer.', 'Visitors should visit the provided link if they need to change their visit date to Wings of Time.', 'All changes to the visit date must be made at least 1 day prior to the visit date.', 'Wings of Time is categorized as a show.', 'Wings of Time is a type of attraction.', 'Keywords for Wings of Time include: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets.', 'The meta description for Wings of Time is: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. 
Perfect for a memorable evening.', 'Product tags for Wings of Time include: Family Fun, Popular experiences, Frequently Bought.', 'Wings of Time is located at Beach Station.', 'Wings of Time (Std) tickets require prior redemption at Singapore Cable Car Ticketing counters.', 'Wings of Time (Std) tickets are subjected to seats availability on a first come first serve basis.', 'Wings of Time is a rain or shine event.', 'Tickets for Wings of Time are non-exchangeable or nonrefundable under any circumstances.', 'Once the timeslot for Wings of Time is confirmed, no further amendments are allowed.', 'Visitors should proceed to Wings of Time admission gates to scan their issued QR code via mobile or physical printout for admission.', 'Gates for Wings of Time will open 15 minutes prior to the start of the show.', 'The show duration for Wings of Time is 20 minutes per show.', 'Visitors should be punctual for their booked time slot for Wings of Time.', 'Admission to Wings of Time will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host.', 'Standard seats for Wings of Time are applicable to guests aged 4 years and above.', 'No outside food and drinks are allowed at Wings of Time.', 'More information on Wings of Time can be found at the provided link.', 'The pax type for Wings of Time is Standard.', 'Promotion A for Wings of Time offers $1.90 off when purchased online.', 'The discount for Promotion A will automatically be applied upon checkout.', 'The price for Wings of Time is 19.', 'Wings of Time has opening hours daily with Show 1 at 7.40pm and Show 2 at 8.40pm.', 'Wings of Time is accessible by wheelchair.', "The title for terms and conditions is 'Terms & Conditions'.", 'More information on terms and conditions can be found at the provided link.', "The title for getting to Wings of Time is 'Getting Here'.", 'Visitors can get to Wings of Time by Sentosa Express by alighting at Beach Station.', 'Visitors can get to 
Wings of Time by Public Bus by boarding Bus 123 and alighting at Beach Station.', 'Visitors can get to Wings of Time by Intra-Island Bus by boarding Sentosa Bus A or B and alighting at Beach Station.', 'The nearest car park to Wings of Time is Beach Station Car Park.', "The title for contacting Wings of Time is 'Contact Us'.", 'The contact location for Wings of Time is Beach Station.', 'The contact phone number for Wings of Time is +65 6361 0088.', 'The contact email for Wings of Time is guestrelations@mflg.com.sg.'] `
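After agentic chunking, the text fell into four main parts:
- Main information chunk: the core description of Wings of Time, its features, location, and other general information
- Scheduling policy chunk: information about changing a booking
- Pricing and discount chunk: discounts and payment
- Legal terms chunk: the various terms and conditions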
```
Chunk (a641f): Sentosa's Wings of Time Show & Visitor Information
Summary: This chunk contains comprehensive details about the Wings of Time attraction in Sentosa, including its features, themes, location, visitor experience, ticketing and admission procedures, future enhancements, promotions, classification as a show and attraction, unique fireworks display, daily show schedule, accessibility options, importance of punctuality and ticket redemption, extended fireworks display in the Fireworks Symphony, transportation options to reach the venue, and the necessity of adhering to non-exchangeable ticket policies, with a focus on the standard ticketing process and visitor guidelines, and the recent update on the extended fireworks display, as well as the contact information and accessibility details, and the new experience starting February 2025.

Chunk (ae2b8): Scheduling Policies
Summary: This chunk contains information about policies regarding changes to scheduled dates and times.

Chunk (dadbb): Retail & Discounts
Summary: This chunk contains information about the application of discounts during the checkout process.

Chunk (3347c): Legal Terms & Conditions
Summary: This chunk contains information about terms and conditions, including their titles and where to find more information.
```
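After this chunking, each chunk has a clear topic with no overlap. The important information sits in the main chunk, and supporting information is filed by category. Grouping related information this way should also help recall from the vector store, and in turn RAG accuracy.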
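One way to prepare such chunks for a vector store is to embed the title, summary, and propositions together as a single text. The sketch below assumes the dictionary layout built by `create_new_chunk`; `chunk_to_document` is a hypothetical helper, and the two propositions are picked for illustration because the article does not show which propositions landed in which chunk.

```python
def chunk_to_document(chunk: dict) -> str:
    """Flatten one chunk (title, summary, propositions) into a single string for embedding."""
    lines = [f"Title: {chunk['title']}", f"Summary: {chunk['summary']}"]
    lines.extend(f"- {p}" for p in chunk["propositions"])
    return "\n".join(lines)


# Title and summary are taken from the output above; the proposition
# membership is an assumption made for this example.
chunk = {
    "title": "Retail & Discounts",
    "summary": "This chunk contains information about the application of discounts during the checkout process.",
    "propositions": [
        "Promotion A for Wings of Time offers $1.90 off when purchased online.",
        "The discount for Promotion A will automatically be applied upon checkout.",
    ],
}

document = chunk_to_document(chunk)
print(document)
```

Whether to embed the summary alone or the full text is a design choice worth testing against your own retrieval queries.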
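Summary
Agentic Chunking is a powerful chunking technique that groups sentences which are far apart in a document but related in topic, which can improve RAG results. The trade-off is higher cost and latency. A colleague who tried it told me that accuracy improved by 40%, while costs rose threefold. So when should we use Agentic Chunking?
In my project experience, it is a particularly good fit for:
- Unstructured text (such as customer service conversation logs)
- Content that keeps jumping between topics (such as transcripts of tech meetups)
- QA systems that need to link information across paragraphs
For clearly structured papers, manuals, and the like, traditional chunking and semantic chunking remain the cost-effective choice.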
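A rough back-of-the-envelope shows where the extra cost comes from. It assumes the flow described earlier: one propositioning call per document, then one allocation call and one title/summary call for every proposition. These call counts are an assumption about this particular flow, not a measurement, and `estimate_llm_calls` is a hypothetical helper.

```python
def estimate_llm_calls(num_propositions: int, calls_per_proposition: int = 2) -> int:
    """One propositioning call for the document, plus the per-proposition calls."""
    return 1 + calls_per_proposition * num_propositions


# A document that yields 50 propositions, as in the test above
calls = estimate_llm_calls(50)
print(calls)  # 101
```

Recursive character splitting needs no LLM calls at all, so the gap grows linearly with the number of propositions.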