LLM Powered Autonomous Agents
Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potential of LLM extends beyond generating well-written copy, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview
In an LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:
- Planning
  - Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
  - Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.
- Memory
  - Short-term memory: I would consider all the in-context learning (see Prompt Engineering) as utilizing short-term memory of the model to learn.
  - Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.
- Tool use
  - The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.
Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and sheds light on the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
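To make the search concrete, here is a minimal sketch of ToT-style BFS in Python. It assumes a generic `llm` callable (prompt in, text out) as a hypothetical stand-in for any completion API; `propose` and `score` are illustrative helpers, not code from the paper:

```python
# Minimal Tree-of-Thoughts-style BFS sketch. `llm` is a hypothetical
# prompt -> str completion function, not any specific API.
def propose(llm, problem, partial, k):
    # Ask the model for k candidate next thoughts given the steps so far.
    prompt = f"Problem: {problem}\nSteps so far:\n{partial}\nPropose one next step:"
    return [llm(prompt) for _ in range(k)]

def score(llm, problem, partial):
    # Ask the model to rate how promising a partial solution is (0 to 1).
    reply = llm(f"Problem: {problem}\nSteps so far:\n{partial}\nRate 0-1 how promising:")
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0

def tot_bfs(llm, problem, depth=3, k=5, beam=3):
    frontier = [""]  # each state is the concatenation of thoughts so far
    for _ in range(depth):
        candidates = [s + "\n" + t for s in frontier for t in propose(llm, problem, s, k)]
        candidates.sort(key=lambda s: score(llm, problem, s), reverse=True)
        frontier = candidates[:beam]  # keep only the most promising states
    return frontier[0]
```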
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions, e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
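As a minimal illustration of option (1), the sketch below sends one of the prompts above to a hypothetical `llm` completion function and parses the numbered list it returns; the parsing is deliberately simple and assumes well-formed output:

```python
def decompose(llm, task):
    # Simple prompting per option (1): elicit a numbered list of subgoals.
    completion = "1." + llm(f"Steps for {task}.\n1.")
    subgoals = []
    for line in completion.splitlines():
        line = line.strip()
        # Keep lines that look like "3. do something".
        if line and line[0].isdigit() and "." in line:
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals
```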
Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner, which is common in certain robotic setups but not in many other domains.
Self-Reflection
Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.
ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables LLM to interact with the environment (e.g. use Wikipedia search API), while the latter prompts LLM to generate reasoning traces in natural language.
The ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as:
Thought: ...
Action: ...
Observation: ...
... (Repeated many times)
Fig. 2. Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al. 2023)
In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act-only baseline, where the Thought: ... step is removed.
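A bare-bones sketch of the ReAct loop might look as follows; `llm` and the `tools` mapping (tool name to callable) are hypothetical stand-ins, and the parsing assumes the model sticks to the Thought/Action/Observation format:

```python
def react_loop(llm, tools, question, max_steps=8):
    # Accumulate a transcript of Thought / Action / Observation turns.
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")  # model continues the trace
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Expect an action like "Action: search[query]" in the continuation.
        for line in step.splitlines():
            if line.startswith("Action:"):
                name, _, arg = line[len("Action:"):].strip().partition("[")
                observation = tools[name.strip()](arg.rstrip("]"))
                transcript += f"Observation: {observation}\n"
                break
    return transcript  # fell out of the loop without a final answer
```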
Reflexion (Shinn & Labash 2023) is a framework that equips agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the setup in ReAct, where the task-specific action space is augmented with language to enable complex reasoning steps. After each action $a_t$, the agent computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial depending on the self-reflection results.
Fig. 3. Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023)
The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment.
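A literal reading of that heuristic can be sketched as below; the trajectory is a list of (action, observation) pairs, and the thresholds are illustrative assumptions rather than values from the paper:

```python
def should_stop(trajectory, max_repeat=3, max_len=30):
    # Inefficient planning: the trial has run too long without success.
    if len(trajectory) > max_len:
        return True
    # Hallucination: the same (action, observation) pair repeats consecutively.
    repeats = 1
    for prev, cur in zip(trajectory, trajectory[1:]):
        repeats = repeats + 1 if cur == prev else 1
        if repeats >= max_repeat:
            return True
    return False
```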
Self-reflection is created by showing two-shot examples to LLM and each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM.
Fig. 4. Experiments on AlfWorld Env and HotpotQA. Hallucination is a more common failure than inefficient planning in AlfWorld. (Image source: Shinn & Labash, 2023)
The idea of Chain of Hindsight (CoH; Liu et al. 2023) is to present a history of sequentially improved outputs in context and train the model to pick up the trend and produce better outputs. Algorithm Distillation (AD; Laskin et al. 2023) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Considering that an agent interacts with the environment many times and in each episode the agent gets a little better, AD concatenates this learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself.
Fig. 6. Illustration of how Algorithm Distillation (AD) works. (Image source: Laskin et al. 2023)
The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used for training, such that the learned policy is task-agnostic.
In reality, the model has limited context window length, so episodes should be short enough to construct multi-episode history. Multi-episodic contexts of 2-4 episodes are necessary to learn a near-optimal in-context RL algorithm. The emergence of in-context RL requires long enough context.
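A rough sketch of the data preparation, under the assumption that each episode is a list of (observation, action, reward) steps ordered by training time; the function name and window size are illustrative, not from the paper's code:

```python
def make_ad_sequences(learning_history, episodes_per_context=4):
    # learning_history: episodes of one source agent, ordered by training
    # time, so later episodes reflect a better policy than earlier ones.
    sequences = []
    for i in range(len(learning_history) - episodes_per_context + 1):
        window = learning_history[i : i + episodes_per_context]
        flat = [step for episode in window for step in episode]
        # Behavioral cloning target: predict each action from all prior steps.
        sequences.append(flat)
    return sequences
```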
In comparison with three baselines, including ED (expert distillation, behavior cloning with expert trajectories instead of learning history), source policy (used for generating trajectories for distillation by UCB) and RL^2 (Duan et al. 2017; used as an upper bound since it needs online RL), AD demonstrates in-context RL with performance getting close to RL^2, despite only using offline RL, and learns much faster than other baselines. When conditioned on partial training history of the source policy, AD also improves much faster than the ED baseline.
Fig. 7. Comparison of AD, ED, source policy and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for "dark" environments and DQN for watermaze. (Image source: Laskin et al. 2023)
Component Two: Memory
(Big thank you to ChatGPT for helping me draft this section. I’ve learned a lot about the human brain and data structure for fast MIPS in my conversations with ChatGPT.)
Types of Memory
Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains.
- Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc.) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).
- Short-Term Memory (STM) or Working Memory: It stores information that we are currently aware of and that is needed to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have the capacity of about 7 items (Miller 1956) and lasts for 20-30 seconds.
- Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM:
  - Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
  - Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard.
Fig. 8. Categorization of human memory.
We can roughly consider the following mappings:
- Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities;
- Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of Transformer.
- Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval.
Maximum Inner Product Search (MIPS)
The external memory can alleviate the restriction of finite attention span. A standard practice is to save the embedding representation of information into a vector store database that can support fast maximum inner-product search (MIPS). To optimize the retrieval speed, the common choice is the approximate nearest neighbors (ANN) algorithm to return approximately top k nearest neighbors to trade off a little accuracy lost for a huge speedup.
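For reference, exact MIPS is just an argmax over inner products; a minimal numpy sketch, assuming embeddings are stored as rows of a matrix:

```python
import numpy as np

def exact_mips(query, store, k=5):
    # store: (N, d) matrix of stored embeddings; query: (d,) vector.
    scores = store @ query           # inner product with every stored vector
    return np.argsort(-scores)[:k]   # indices of the top-k matches

# The ANN algorithms below approximate this argmax to avoid the full O(N*d) scan.
```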
A couple of common choices of ANN algorithms for fast MIPS (a FAISS-based sketch follows the list):
- LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs.
- ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure is a set of random projection trees, binary trees where each non-leaf node represents a hyperplane splitting the input space in half and each leaf stores one data point. Trees are built independently and at random, so to some extent they mimic a hashing function. ANNOY searches in all the trees, iteratively descending into the half that is closest to the query, and then aggregates the results. The idea is quite related to KD trees but a lot more scalable.
- HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small world networks, where most nodes can be reached from any other node within a small number of steps; e.g. the “six degrees of separation” property of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layer contains the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can’t get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality.
- FAISS (Facebook AI Similarity Search): It operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then further looks into each cluster with finer quantization.
- ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point $x_i$ to $\tilde{x}_i$ such that the inner product $\langle q, x_i \rangle$ is as similar to the original distance of $\angle q, x_i$ as possible, instead of picking the closest quantization centroid point.
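As one concrete example, here is a minimal FAISS sketch of the coarse-then-fine inner-product search described above; the dimensions, nlist and nprobe values are arbitrary, and this assumes the faiss package is installed:

```python
import numpy as np
import faiss  # assumes the faiss package is installed

d, n = 128, 10_000
xb = np.random.rand(n, d).astype("float32")  # stored embeddings
xq = np.random.rand(1, d).astype("float32")  # query embedding

# IVF index: coarse quantization into nlist clusters, then a finer
# search inside the nprobe clusters closest to the query.
quantizer = faiss.IndexFlatIP(d)             # exact inner-product quantizer
index = faiss.IndexIVFFlat(quantizer, d, 100, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                              # learn the cluster centroids
index.add(xb)
index.nprobe = 8                             # clusters to visit per query

distances, ids = index.search(xq, 10)        # top-10 approximate matches
```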
Fig. 9. Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020)
Check more MIPS algorithms and performance comparison in ann-benchmarks.com.
Component Three: Tool Use
Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model capabilities.
Fig. 10. A picture of a sea otter using a rock to crack open a seashell, while floating in the water. While some other animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools)
MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API).
They did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight that even when external symbolic tools can work reliably, knowing when and how to use the tools is crucial, and that is determined by the LLM capability.
Both TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) fine-tune a LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs. See more details in the “External APIs” section of Prompt Engineering.
ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls).
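A minimal sketch of the 2023-era OpenAI function-calling flow; the `get_weather` tool and its schema are hypothetical, and the API surface has evolved since, so treat this as illustrative rather than current:

```python
import json
import openai

# Hypothetical tool schema advertised to the model.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a tool
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    # Execute the tool with these args, then send its result back as a
    # {"role": "function", ...} message to get the final answer.
```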
HuggingGPT (Shen et al. 2023) is a framework that uses ChatGPT as the task planner to select models available on the HuggingFace platform according to the model descriptions and to summarize the response based on the execution results.
Fig. 11. Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)
The system comprises 4 stages:
(1) Task planning: LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning.
Instruction:
The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.
(2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed.
Instruction:
Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list.
(3) Task execution: Expert models execute on the specific tasks and log results.
Instruction:
With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.
(4) Response generation: LLM receives the execution results and provides summarized results to users.
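The four stages chain together roughly as in the sketch below; `llm`, `run_model`, and `model_catalog` are hypothetical stand-ins for the paper's ChatGPT calls, HuggingFace model execution, and model descriptions:

```python
import json

def hugginggpt(llm, run_model, model_catalog, user_request):
    # (1) Task planning: parse the request into structured tasks.
    tasks = json.loads(llm(
        f"Parse into JSON tasks with task/id/dep/args fields: {user_request}"))
    # (2) Model selection: pick one expert model per task; candidates are
    # pre-filtered by task type to fit the limited context window.
    chosen = {t["id"]: llm(f"Select one model id for {t} "
                           f"from {model_catalog[t['task']]}") for t in tasks}
    # (3) Task execution: run each expert model and log its result.
    results = {tid: run_model(model_id, tasks) for tid, model_id in chosen.items()}
    # (4) Response generation: summarize the execution results for the user.
    return llm(f"Answer the request {user_request!r} given results {results}")
```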
To put HuggingGPT into real-world usage, a couple of challenges need to be solved: (1) efficiency improvement is needed, as both LLM inference rounds and interactions with other models slow down the process; (2) it relies on a long context window to communicate over complicated task content; (3) stability of LLM outputs and of external model services needs to improve.
API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflow and more. Because there are a large number of APIs, LLM first has access to API search engine to find the right API to call and then uses the corresponding documentation to make a call.
Fig. 12. Pseudo code of how LLM makes an API call in API-Bank. (Image source: Li et al. 2023)
In the API-Bank workflow, LLMs need to make a couple of decisions, and at each step we can evaluate how accurate that decision is. Decisions include (a minimal loop sketch follows the list):
- Whether an API call is needed.
- Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for Search Engine API).
- Response based on the API results: the model can choose to refine and call again if results are not satisfying.
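Put together, the three decisions form a loop roughly like the sketch below, where `llm`, `api_search`, and `call_api` are hypothetical helpers for the search engine, documentation lookup, and call execution:

```python
def answer_with_apis(llm, api_search, call_api, query, max_calls=5):
    context = query
    for _ in range(max_calls):
        # Decision 1: is an API call needed at all?
        if llm(f"Is an API call needed to answer: {context}? yes/no") != "yes":
            break
        # Decision 2: find the right API via the API search engine.
        api_doc = api_search(llm(f"Search keywords for an API given: {context}"))
        # Decision 3: call it, then decide whether to refine and call again.
        result = call_api(api_doc, llm(f"Arguments for {api_doc} given: {context}"))
        context += f"\nAPI result: {result}"
    return llm(f"Answer based on: {context}")
```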
This benchmark evaluates the agent’s tool use capabilities at three levels:
- Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
- Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
- Level-3 assesses the ability to plan API beyond retrieval and calling. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.
Case Studies
Scientific Discovery Agent
ChemCrow (Bran et al. 2023) is a domain-specific example in which LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was previously described in the ReAct and MRKL sections and combines CoT reasoning with tools relevant to the tasks:
- The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output.
- It is then instructed to answer a user-given prompt using the tools provided when necessary. The instruction suggests the model follow the ReAct format: Thought, Action, Action Input, Observation.
One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations by experts, oriented towards the completion and chemical correctness of the solutions, showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using LLM to evaluate its own performance on domains that require deep expertise. The lack of expertise may cause LLMs to be unaware of their own flaws and thus unable to judge the correctness of task results well.
Boiko et al. (2023) also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs.
For example, when requested to "develop a novel anticancer drug", the model came up with the following reasoning steps:
1. inquired about current trends in anticancer drug discovery;
2. selected a target;
3. requested a scaffold targeting these compounds;
4. once the compound was identified, the model attempted its synthesis.
They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a Web search while 2 were rejected based on prompt only.
Generative Agents Simulation
Generative Agents (Park, et al. 2023) is a super fun experiment where 25 virtual characters, each controlled by a LLM-powered agent, are living and interacting in a sandbox environment, inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.
The design of generative agents combines LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents.
- Memory stream: a long-term memory module (external database) that records a comprehensive list of agents’ experience in natural language.
  - Each element is an observation, an event directly provided by the agent. Inter-agent communication can trigger new natural language statements.
- Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance (a scoring sketch follows Fig. 13 below).
  - Recency: recent events have higher scores.
  - Importance: distinguishes mundane from core memories. Ask LM directly.
  - Relevance: based on how related it is to the current situation / query.
- Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (note that this is a bit different from self-reflection above).
  - Prompt LM with the 100 most recent observations and ask it to generate the 3 most salient high-level questions given a set of observations/statements. Then ask LM to answer those questions.
- Planning & Reacting: translate the reflections and the environment information into actions.
  - Planning is essentially about optimizing believability at the moment vs. over time.
  - Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
  - Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
  - Environment information is presented in a tree structure.
Fig. 13. The generative agent architecture. (Image source: Park et al. 2023)
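A sketch of the retrieval score mentioned above: recency decays exponentially, importance is an LM-rated score, and relevance is embedding similarity. The decay rate, normalization, and equal weighting below are illustrative assumptions, not exact values from the paper:

```python
import math
import time

def retrieval_score(memory, query_embedding, now=None, decay=0.99):
    # Recency: exponential decay over hours since the memory was last accessed.
    now = now if now is not None else time.time()
    recency = decay ** ((now - memory.last_accessed) / 3600)
    # Importance: an LM-assigned 1-10 rating, normalized to [0, 1].
    importance = memory.importance / 10
    # Relevance: similarity between the memory and the current query.
    relevance = cosine_similarity(memory.embedding, query_embedding)
    return recency + importance + relevance  # equal weights, by assumption

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```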
This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others).
Proof-of-Concept Examples
AutoGPT has drawn a lot of attention into the possibility of setting up autonomous agents with LLM as the main controller. It has quite a lot of reliability issues given the natural language interface, but nevertheless a cool proof-of-concept demo. A lot of code in AutoGPT is about format parsing.
Here is the system message used by AutoGPT, where {{...}} are user inputs:
You are {{ai-name}}, {{user-provided AI bot description}}.
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}
2. {{user-provided goal 2}}
3. ...
4. ...
5. ...

Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
5. Use subprocesses for commands that will not terminate within a few minutes

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args:
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args:
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
GPT-Engineer is another project to create a whole repository of code given a task specified in natural language. The GPT-Engineer is instructed to think over a list of smaller components to build and ask for user input to clarify questions as needed.
Here is a sample conversation for task clarification sent to the OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}}.
[ {"role":"system","content":"You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n" }, {"role":"user","content":"We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n" }, {"role":"assistant","content":"Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?" }, {"role":"user","content":"{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"." }, {"role":"assistant","content":"Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?" }, {"role":"user","content":"{{Make your own assumptions and state them explicitly before starting}}" } ]
Then after these clarifications, the agent moved into the code writing mode with a different system message. System message:
You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code.
Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.
Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code:
FILENAME
CODE
You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders.
Follow a language and framework appropriate best practice file naming convention. Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other. Ensure to implement all code, if you are unsure, write a plausible implementation. Include module dependency or package manager dependency definition file. Before you finish, double check that all parts of the architecture is present in the files.
Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project.
Python toolbelt preferences:
- pytest
- dataclasses
Conversation samples:
[ {"role":"system","content":"You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n" }, # … same conversation as earlier, ended with"Make your own assumptions and state them explicitly before starting". {"role":"assistant","content":"Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?" }, {"role":"user","content":"Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. 
No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n" } ]
Challenges
After going through key ideas and demos of building LLM-centered agents, I start to see a couple of common limitations:
- Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.
- Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.
- Reliability of natural language interface: The current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.
Citation
Cited as:
Weng, Lilian. (Jun 2023). “LLM-powered Autonomous Agents”. Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/.
Or
@article{weng2023agent,
  title   = "LLM-powered Autonomous Agents",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2023",
  month   = "Jun",
  url     = "https://lilianweng.github.io/posts/2023-06-23-agent/"
}
References
[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022.
[2] Yao et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv preprint arXiv:2305.10601 (2023).
[3] Liu et al. “Chain of Hindsight Aligns Language Models with Feedback.” arXiv preprint arXiv:2302.02676 (2023).
[4] Liu et al. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.” arXiv preprint arXiv:2304.11477 (2023).
[5] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023.
[6] Google Blog. “Announcing ScaNN: Efficient Vector Similarity Search.” July 28, 2020.
[7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389
[8] Shinn & Labash. “Reflexion: an autonomous agent with dynamic memory and self-reflection.” arXiv preprint arXiv:2303.11366 (2023).
[9] Laskin et al. “In-context Reinforcement Learning with Algorithm Distillation.” ICLR 2023.
[10] Karpas et al. “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.” arXiv preprint arXiv:2205.00445 (2022).
[11] Nakano et al. “WebGPT: Browser-assisted question-answering with human feedback.” arXiv preprint arXiv:2112.09332 (2021).
[12] Parisi et al. “TALM: Tool Augmented Language Models.”
[13] Schick et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv preprint arXiv:2302.04761 (2023).
[14] Weaviate Blog. “Why is Vector Search so fast?” Sep 13, 2022.
[15] Li et al. “API-Bank: A Benchmark for Tool-Augmented LLMs.” arXiv preprint arXiv:2304.08244 (2023).
[16] Shen et al. “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace.” arXiv preprint arXiv:2303.17580 (2023).
[17] Bran et al. “ChemCrow: Augmenting large-language models with chemistry tools.” arXiv preprint arXiv:2304.05376 (2023).
[18] Boiko et al. “Emergent autonomous scientific research capabilities of large language models.” arXiv preprint arXiv:2304.05332 (2023).
[19] Joon Sung Park, et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv preprint arXiv:2304.03442 (2023).
[20] AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT
[21] GPT-Engineer. https://github.com/AntonOsika/gpt-engineer