UAV Vision-and-Language Navigation from Beginner to Mastery (14): Large Language Models and Navigation Decision-Making

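This article surveys how large language models (LLMs) are applied to vision-and-language navigation (VLN). It covers LLMs as high-level planners that decompose navigation tasks, chain-of-thought reasoning for better decisions, prompt engineering for interaction design, and strategies for combining LLMs with traditional navigation methods. It also describes navigation-specific reasoning frameworks such as NavCoT, which use hierarchical architectures and structured prompts to exploit LLM strengths in language understanding, commonsense reasoning, and task planning. These methods improve how well navigation systems generalize in complex scenes.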

Mark Zero


Mark Zero · Published 2026-01-02 01:42:48


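Abstract

Large language models (LLMs), with their strong language understanding and reasoning abilities, have opened a new paradigm for vision-and-language navigation (VLN). This article surveys how LLMs are used in VLN tasks: as high-level planners, with chain-of-thought reasoning, through prompt engineering, and in combination with traditional navigation methods. After reading it, you should understand how to use an LLM's knowledge and reasoning to improve a navigation system.

Keywords: large language models, GPT, LLaMA, chain-of-thought, prompt engineering, navigation planning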

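1. Introduction

Large language models have advanced rapidly in recent years. Models such as GPT-3/4, LLaMA, and Claude show strong language understanding, knowledge-based reasoning, and instruction following. These abilities map directly onto the needs of VLN:

Language understanding: parse complex navigation instructions accurately
Commonsense reasoning: use world knowledge to support navigation decisions
Planning: decompose tasks and draw up an execution plan
Generalization: handle unseen scenes and instructions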

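2. Roles of LLMs in Navigation

2.1 Usage Paradigms

An LLM can play several roles in a VLN system:

Role	Description	Strength	Challenge
High-level planner	Decomposes the task and generates subgoals	Draws on commonsense reasoning	Needs a low-level executor
Instruction parser	Parses instruction semantics	Deep language understanding	May over-interpret
Decision reasoner	Analyzes the scene and reasons about actions	Highly interpretable	Inference latency
Knowledge base	Supplies world knowledge	Rich background knowledge	Hallucination risk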

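2.2 LLM as a High-Level Planner

Workflow of the hierarchical navigation architecture:

The LLM receives the full navigation instruction
It decomposes the instruction into a sequence of subgoals
A low-level controller executes each subgoal
The LLM adjusts the plan based on execution feedback

Example:

Input instruction: "Starting from the bedroom, go through the hallway, pass the bathroom, reach the kitchen, and pick up the coffee cup on the table"

LLM plan output:

1. Leave the bedroom and enter the hallway
2. Walk along the hallway
3. Pass the bathroom door
4. Enter the kitchen
5. Locate the table
6. Approach the table
7. Locate the coffee cup
8. Perform the grasp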
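
The planner returns free text, so the controller needs the numbered plan as an ordered list of subgoals before it can execute anything. The sketch below shows one way to do that split. The function name `parse_plan` and the assumption that every subgoal sits on its own line starting with "N." or "N)" are illustrative and not part of any particular system.

```python
import re


def parse_plan(plan_text: str) -> list[str]:
    """Split a numbered plan ("1. ...", "2) ...") into ordered subgoal strings."""
    subgoals = []
    for line in plan_text.splitlines():
        # Accept "1. text" or "1) text"; ignore lines that are not numbered.
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            subgoals.append(match.group(1).strip())
    return subgoals


plan_text = """
1. Leave the bedroom and enter the hallway
2. Walk along the hallway
3. Pass the bathroom door
4. Enter the kitchen
"""
subgoals = parse_plan(plan_text)
for index, subgoal in enumerate(subgoals, start=1):
    print(index, subgoal)
```

Lines that do not match the numbered pattern are skipped, so any preamble the LLM adds around the plan does not end up in the subgoal queue.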

2.3 LLM as an Instruction Parser

Semantic parsing task:

$\text{LLM}: \text{Instruction} \rightarrow \{(\text{action}_i, \text{landmark}_i, \text{relation}_i)\}$

Parsing example:

Input: "Turn left at the stairs and go straight until you see a red door"

Parsed output:

{
  "steps": [
    {
      "action": "turn_left",
      "landmark": "stairs",
      "relation": "at"
    },
    {
      "action": "go_straight",
      "condition": "until",
      "landmark": "red door",
      "relation": "see"
    }
  ]
}
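
Downstream modules are easier to write against typed records than against raw dictionaries. As a minimal sketch, the parsed output above could be loaded as follows. The `Step` dataclass and `parse_steps` function are hypothetical names, and treating `condition` as optional is an assumption based on the example, where only the second step carries it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str
    landmark: str
    relation: str
    condition: Optional[str] = None  # e.g. "until"; absent for unconditional steps


def parse_steps(raw: str) -> list[Step]:
    """Convert the parser's JSON reply into a list of Step records."""
    data = json.loads(raw)
    return [
        Step(
            action=item["action"],
            landmark=item["landmark"],
            relation=item["relation"],
            condition=item.get("condition"),
        )
        for item in data["steps"]
    ]


raw = """
{"steps": [
  {"action": "turn_left", "landmark": "stairs", "relation": "at"},
  {"action": "go_straight", "condition": "until",
   "landmark": "red door", "relation": "see"}
]}
"""
steps = parse_steps(raw)
for step in steps:
    print(step)
```

A missing required key raises `KeyError`, which is a convenient signal to re-query the LLM instead of acting on a partial parse.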

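3. Chain-of-Thought Reasoning

3.1 Overview of Chain-of-Thought

Chain-of-thought (CoT) prompting has the LLM write out its intermediate reasoning steps before giving an answer, which improves performance on complex multi-step tasks.

Basic form:

$\text{Prompt} + \text{Question} \xrightarrow{\text{CoT}} \text{Reasoning Steps} \rightarrow \text{Answer}$

Application to VLN:

Question: Given the instruction "Walk to the stairs, go down two floors, and find the blue sofa" and a current scene containing
[door, window, stair entrance, bookshelf], which direction should the agent choose?

Chain-of-thought reasoning:
1. The instruction first asks to "walk to the stairs"
2. The current scene contains a "stair entrance"
3. So the stairs are within view
4. The agent should head toward the stair entrance
5. Conclusion: choose the direction facing the stair entrance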

3.2 CoT Variants for Navigation

Zero-shot CoT:

Append "Let's think step by step" to the end of the prompt:

Given the instruction and current observation, decide the next action.
Let's think step by step.

Few-shot CoT:

Provide examples that include the reasoning process:

Example 1:
Instruction: Go to the kitchen
Observation: [living room, hallway entrance, dining area]
Reasoning: The instruction asks to go to the kitchen. Kitchens are
typically connected to dining areas. I should head towards the dining
area to find the kitchen.
Action: Move towards dining area

Example 2: ...

Now solve:
Instruction: {current_instruction}
Observation: {current_observation}
Reasoning:
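
The few-shot template above can be assembled programmatically so that the examples and the current query always share one layout. The following is only a sketch: the `Example` dataclass and `build_few_shot_prompt` are hypothetical helpers, and observations are assumed to arrive as lists of short strings.

```python
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    observation: list[str]
    reasoning: str
    action: str


def build_few_shot_prompt(examples: list[Example], instruction: str,
                          observation: list[str]) -> str:
    """Render worked examples followed by the unsolved query."""
    parts = []
    for index, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {index}:\n"
            f"Instruction: {ex.instruction}\n"
            f"Observation: [{', '.join(ex.observation)}]\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Action: {ex.action}\n"
        )
    # The prompt ends on "Reasoning:" so the model continues with its own chain.
    parts.append(
        "Now solve:\n"
        f"Instruction: {instruction}\n"
        f"Observation: [{', '.join(observation)}]\n"
        "Reasoning:"
    )
    return "\n".join(parts)


examples = [
    Example(
        instruction="Go to the kitchen",
        observation=["living room", "hallway entrance", "dining area"],
        reasoning="Kitchens are typically connected to dining areas.",
        action="Move towards dining area",
    )
]
prompt = build_few_shot_prompt(examples, "Find the sofa", ["hallway", "door"])
print(prompt)
```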

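Self-consistency:

Sample $N$ reasoning paths and pick the action that most of them agree on, where $a_i$ is the action produced by the $i$-th path: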

$a^* = \arg\max_a \sum_{i=1}^{N} \mathbb{1}[a_i = a]$
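
The vote in the formula above is a plain majority count over sampled actions. In the sketch below, `sample_action` stands in for one stochastic LLM call (temperature above zero) that returns a parsed action. It is a hypothetical callable, replaced here by a fixed sequence so the example runs without a model.

```python
from collections import Counter
from typing import Callable


def self_consistent_action(sample_action: Callable[[], str],
                           n_samples: int = 5) -> str:
    """Sample n_samples actions and return the most frequent one."""
    actions = [sample_action() for _ in range(n_samples)]
    # most_common(1) yields [(action, count)] for the top-voted action.
    return Counter(actions).most_common(1)[0][0]


# Stand-in for five sampled reasoning paths and their final actions.
sampled = iter(["turn_left", "move_forward", "turn_left",
                "turn_left", "move_forward"])
result = self_consistent_action(lambda: next(sampled), n_samples=5)
print(result)
```

Each sample costs one full LLM call, so $N$ trades accuracy against latency and API cost.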

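3.3 NavCoT: Navigation-Specific Chain-of-Thought

NavCoT is a chain-of-thought method designed specifically for vision-and-language navigation.

Reasoning steps:

Instruction understanding: identify the sub-instruction to execute now
Scene description: describe the currently observed scene
Matching analysis: analyze how well the scene matches the instruction
Action reasoning: choose an action based on the analysis
Progress estimation: estimate how much of the task is complete

Template: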

[Instruction Understanding]
The current instruction segment is: "{sub_instruction}"
Key landmarks to find: {landmarks}

[Scene Description]
I can see: {visible_objects}
Possible directions: {candidate_directions}

[Matching Analysis]
Comparing instruction with scene:
- {landmark_1} is {visible/not visible}
- The instruction mentions {direction}, I can go {available_directions}

[Action Reasoning]
Based on the analysis:
- {reasoning_step_1}
- {reasoning_step_2}
Therefore, I should {action}

[Progress Estimation]
Estimated progress: {X}%

4. Prompt Engineering

4.1 Prompt Design Principles

A clear task description:

You are a navigation agent in an indoor environment. Your task is to
follow natural language instructions to reach a target location.

A structured input format:

=== Navigation Task ===
Instruction: {instruction}

=== Current State ===
Location: {position}
Heading: {direction}
Step: {step_number}

=== Observation ===
Visible objects: {objects}
Candidate viewpoints: {viewpoints}

=== Action Space ===
Available actions: {actions}

Based on the above information, select the best action.

Output format constraints:

Please respond in the following JSON format:
{
  "reasoning": "your step-by-step reasoning",
  "action": "selected action",
  "confidence": "high/medium/low"
}
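
Asking for this format does not guarantee the model complies, so the reply is worth checking before it reaches the controller. The sketch below validates the three fields and rejects actions outside the current action space. `validate_response` and the choice to raise `ValueError` are illustrative assumptions, not part of any cited method.

```python
import json

VALID_CONFIDENCE = {"high", "medium", "low"}
REQUIRED_FIELDS = {"reasoning", "action", "confidence"}


def validate_response(raw: str, valid_actions: set[str]) -> dict:
    """Parse an LLM reply and raise ValueError if it violates the format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    action = data["action"]
    if not isinstance(action, str) or action not in valid_actions:
        raise ValueError(f"action not in action space: {action!r}")
    confidence = data["confidence"]
    if not isinstance(confidence, str) or confidence not in VALID_CONFIDENCE:
        raise ValueError(f"invalid confidence: {confidence!r}")
    return data


reply = '{"reasoning": "The door is ahead.", "action": "move_forward", "confidence": "high"}'
checked = validate_response(reply, {"move_forward", "turn_left", "turn_right", "stop"})
print(checked["action"], checked["confidence"])
```

A caller can catch `ValueError` and either re-prompt with the error message or fall back to a default action.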

4.2 Designing Context Information

Navigation history:

=== Navigation History ===
Step 1: Moved forward (saw: door, window)
Step 2: Turned left (saw: hallway, painting)
Step 3: Moved forward (saw: stairs, plant)
Current step: 4

Map information (if available):

=== Explored Map ===
- Living room (visited)
- Hallway (visited)
- Stairs area (current location)
- Kitchen (not visited, estimated direction: north)

Instruction progress:

=== Instruction Progress ===
Full instruction: "Go upstairs, turn right, enter the bedroom"
Completed: "Go upstairs" ✓
Current: "turn right"
Remaining: "enter the bedroom"
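
A progress block like the one above can be generated from a list of sub-instructions plus a counter of how many are done. The helper below is a sketch under that assumption. `render_progress` is a hypothetical name, and joining the sub-instructions with commas to rebuild the full instruction is a simplification.

```python
def render_progress(sub_instructions: list[str], completed: int) -> str:
    """Render the instruction-progress block for the prompt.

    completed is the number of sub-instructions already finished.
    """
    full = ", ".join(sub_instructions)
    lines = ["=== Instruction Progress ===", f'Full instruction: "{full}"']
    for text in sub_instructions[:completed]:
        lines.append(f'Completed: "{text}" ✓')
    if completed < len(sub_instructions):
        lines.append(f'Current: "{sub_instructions[completed]}"')
    for text in sub_instructions[completed + 1:]:
        lines.append(f'Remaining: "{text}"')
    return "\n".join(lines)


block = render_progress(["Go upstairs", "turn right", "enter the bedroom"], completed=1)
print(block)
```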

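4.3 Designing Few-Shot Examples

Example selection strategies:

Strategy	Description	Suitable for
Random selection	Sample examples at random	General-purpose use
Similarity-based selection	Pick examples similar to the current instruction	Scene-specific tasks
Diversity-based selection	Cover different instruction types	Complex tasks
Difficulty progression	Order examples from easy to hard	Complex reasoning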
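
As one illustration of similarity-based selection, the sketch below ranks a pool of example instructions by word-level Jaccard overlap with the current instruction. Jaccard overlap is a stand-in chosen to keep the example dependency-free. A real system would more likely compare sentence embeddings. `jaccard` and `select_examples` are hypothetical names.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two instructions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_examples(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k pool instructions most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex), reverse=True)[:k]


pool = [
    "Go to the door",
    "Turn left at the plant and go to the kitchen",
    "Go past the second door on your right",
]
chosen = select_examples("Turn left at the sofa", pool, k=1)
print(chosen)
```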

Example template:

=== Example 1 (Easy) ===
Instruction: "Go to the door"
Observation: [door ahead, window left, wall right]
Reasoning: The door is directly ahead, matching the instruction.
Action: move_forward

=== Example 2 (Medium) ===
Instruction: "Turn left at the plant and go to the kitchen"
Observation: [plant left, hallway ahead, door right]
Reasoning: I see a plant on the left. The instruction says to turn
left at the plant. I should turn left first.
Action: turn_left

=== Example 3 (Hard) ===
Instruction: "Go past the second door on your right"
Observation: [door1 right-front, door2 right-back, hallway ahead]
Reasoning: I need to pass two doors on my right. I can see two doors.
I should move forward to pass the first door, then continue to pass
the second one.
Action: move_forward

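4.4 Prompt Optimization Techniques

Iterative prompt tuning:

Refine the prompt through repeated experiments:

# Prompt optimization (pseudocode): keep the variant that scores best on a validation set
best_prompt = base_prompt
best_score = evaluate(base_prompt, validation_set)
for prompt in generate_prompt_variants(base_prompt):
    score = evaluate(prompt, validation_set)
    if score > best_score:
        best_prompt = prompt
        best_score = score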

Dynamic prompts:

Adjust the prompt to the current state:

def build_prompt(state, instruction, history):
    prompt = BASE_TEMPLATE

    # Use more examples for harder instructions
    if is_complex(instruction):
        prompt += get_examples(n=3)
    else:
        prompt += get_examples(n=1)

    # Shift the emphasis according to progress
    if near_goal(state):
        prompt += GOAL_FINDING_HINT
    elif stuck(history):
        prompt += EXPLORATION_HINT

    return prompt

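5. Representative Methods

5.1 LM-Nav (2022)

LM-Nav uses an LLM to decompose the instruction into a sequence of landmarks, then uses a vision-language model (VLM) to ground those landmarks in images.

Architecture and workflow:

Instruction parsing: GPT-3 extracts the landmark sequence

Input: "Go past the fire hydrant and stop at the bench"
Output: ["fire hydrant", "bench"]

Visual grounding: CLIP scores how well each image $v$ matches each landmark $l$
$\text{score}(v, l) = \text{cos}(\text{CLIP}_{img}(v), \text{CLIP}_{text}(l))$
Path planning: plan a path that passes the landmarks in order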
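
The grounding score above is a cosine similarity between two embedding vectors. The sketch below computes it in plain Python on made-up three-dimensional vectors and picks the best-scoring node for each landmark. The embeddings, node names, and `ground_landmarks` are all hypothetical. In practice the vectors would come from a CLIP image encoder and text encoder, and choosing each landmark independently is a simplification of planning a single path through all of them.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground_landmarks(image_embs: dict[str, list[float]],
                     text_embs: dict[str, list[float]]) -> dict[str, str]:
    """Map each landmark to the node whose image embedding scores highest."""
    return {
        landmark: max(image_embs, key=lambda node: cosine(image_embs[node], text_vec))
        for landmark, text_vec in text_embs.items()
    }


# Toy stand-ins for CLIP outputs (assumed values, for illustration only).
image_embs = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.0, 1.0, 0.0],
    "node_c": [0.7, 0.7, 0.0],
}
text_embs = {
    "fire hydrant": [0.9, 0.1, 0.0],
    "bench": [0.1, 0.9, 0.0],
}
grounding = ground_landmarks(image_embs, text_embs)
print(grounding)
```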

5.2 NavGPT (2023)

NavGPT uses GPT directly for navigation reasoning, with no additional training.

Prompt design:

You are an intelligent navigation assistant. Given the following
information, help me navigate.

[Task] Navigate following: "{instruction}"

[Observation]
Current view shows: {scene_description}
Possible actions: {action_list}

[History]
Previous actions: {action_history}
Visited locations: {location_history}

[Reasoning Required]
1. What does the instruction ask me to do?
2. What relevant objects/landmarks can I see?
3. Which action best follows the instruction?

[Your Decision]

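Iterative decision loop:

def navgpt_navigate(instruction, env, max_steps=30):
    # max_steps is an assumed safety cap in case the LLM never emits STOP
    history = []
    for _ in range(max_steps):
        observation = env.get_observation()
        prompt = build_prompt(instruction, observation, history)
        response = gpt4(prompt)
        action = parse_action(response)
        history.append((observation, action))

        # STOP ends the episode and is not sent to the environment
        if action == "STOP":
            break
        env.step(action)
    return history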

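5.3 LLM-Planner (2023)

LLM-Planner uses an LLM for high-level planning and traditional methods for low-level control.

Layered architecture:

Layer	Module	Function
High	LLM planner	Task decomposition and subgoal generation
Middle	Subgoal executor	Navigation to a specific landmark
Low	Local controller	Obstacle avoidance and path tracking

Replanning mechanism:

def llm_planner_navigate(instruction, env, max_replans=3):
    plan = list(llm_plan(instruction))  # initial plan
    replans = 0  # max_replans is an assumed cap to avoid replanning forever

    while plan:
        subgoal = plan.pop(0)
        success = execute_subgoal(subgoal, env)

        if not success:
            if replans >= max_replans:
                return False
            # Execution failed: request a new plan from the current state
            current_state = env.get_state()
            plan = list(llm_replan(instruction, current_state, [subgoal] + plan))
            replans += 1
    return True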

5.4 DiscussNav (2023)

DiscussNav has several LLM roles discuss the situation before a decision is made.

Multi-role design, illustrated by an example discussion:

[Observer]: I can see a hallway ahead, a door on the left, and
stairs on the right. The hallway appears to lead to more rooms.

[Planner]: The instruction says to "find the bedroom upstairs".
We need to go upstairs first, then look for a bedroom.

[Discusser]: Given that we need to go upstairs and I can see
stairs on the right, we should head towards the stairs. The
hallway might also lead to stairs, but the visible stairs are
a more direct option.

[Executor]: Based on the discussion, I will turn right towards
the stairs.

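5.5 VELMA (2023)

VELMA verbalizes the agent's trajectory and visual observations as text so that an LLM can reason explicitly about the next action.

Reasoning framework:

Scene understanding: describe the current visual observation
Instruction alignment: relate the observation to the instruction
Hypothesis generation: propose candidate actions
Hypothesis evaluation: assess each hypothesis
Action selection: choose the best action

Prompt structure: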

[Scene Understanding]
Describe what you see in the current observation:
{scene_description}

[Instruction Alignment]
Current instruction segment: {current_instruction}
Relevant elements in scene: {relevant_elements}

[Hypothesis Generation]
Possible interpretations:
H1: {hypothesis_1}
H2: {hypothesis_2}
H3: {hypothesis_3}

[Hypothesis Evaluation]
Evaluating each hypothesis:
H1: {evaluation_1}
H2: {evaluation_2}
H3: {evaluation_3}

[Action Selection]
Best hypothesis: {best_hypothesis}
Selected action: {action}

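6. Combining LLMs with Traditional Methods

6.1 Hybrid Architectures

Ways an LLM can augment a traditional model:

Augmentation type	Description
Instruction augmentation	The LLM rewrites or expands the instruction
Knowledge injection	The LLM supplies scene knowledge
Confidence calibration	The LLM assesses the model's output
Error recovery	The LLM detects and corrects errors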

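6.2 Knowledge Distillation

From the LLM to a lightweight model, with the LLM's action distribution as the target that the student matches:

$\mathcal{L}_{KD} = \text{KL}(P_{LLM} \,\|\, P_{student})$

Distillation pipeline:

Use the LLM to generate reasoning traces
Collect (state, reasoning, action) triples
Train the student model to imitate the LLM's behavior

# Knowledge distillation (pseudocode)
llm_data = []
for episode in dataset:
    for state in episode:
        reasoning = llm.reason(state)
        action = llm.act(state)
        llm_data.append((state, reasoning, action))

# Train the student model on the collected triples
student_model.train(llm_data)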

6.3 LLM as a Judge

Action evaluation:

Given the navigation context, evaluate the proposed action.

Context:
- Instruction: {instruction}
- Current observation: {observation}
- Proposed action: {action}
- Action reasoning: {reasoning}

Evaluate on:
1. Instruction alignment (1-5): Does the action follow the instruction?
2. Feasibility (1-5): Is the action physically possible?
3. Progress (1-5): Does the action make progress towards the goal?

Provide your evaluation with brief justification.
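
The judge replies in free text, so the three 1-5 scores have to be extracted before they can be compared across candidate actions. The sketch below assumes the reply repeats each criterion name followed by a colon and a digit, optionally echoing the "(1-5)" label. That reply layout, along with `parse_scores` and `mean_score`, is an assumption for illustration.

```python
import re
from typing import Optional

CRITERIA = ("Instruction alignment", "Feasibility", "Progress")


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per criterion from the judge's reply."""
    scores = {}
    for name in CRITERIA:
        # Skip an echoed "(1-5)" label so it is not mistaken for the score.
        pattern = rf"{re.escape(name)}(?:\s*\(1-5\))?\s*[:=]\s*([1-5])\b"
        match = re.search(pattern, reply, re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


def mean_score(scores: dict[str, int]) -> Optional[float]:
    """Average the scores, or return None if any criterion is missing."""
    if len(scores) != len(CRITERIA):
        return None
    return sum(scores.values()) / len(scores)


reply = (
    "1. Instruction alignment (1-5): 4 - the action heads toward the stairs.\n"
    "2. Feasibility: 5 - the path is clear.\n"
    "3. Progress: 3 - the goal is still far away."
)
scores = parse_scores(reply)
print(scores, mean_score(scores))
```

Returning `None` for an incomplete reply lets the caller re-query instead of ranking on partial evidence.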

Reranking:

def llm_rerank(candidates, state, instruction):
    scores = []
    for action in candidates:
        prompt = build_eval_prompt(action, state, instruction)
        eval_result = llm(prompt)
        scores.append(parse_score(eval_result))

    # Index of the highest score (plain-Python argmax)
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]

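7. Implementation Details

7.1 Optimizing API Calls

Batching:

import asyncio

async def batch_llm_call(prompts):
    # Issue all requests concurrently and wait for every result
    tasks = [llm.async_generate(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    return results

Caching:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_llm_call(prompt):
    # lru_cache keys on the prompt string, so identical prompts reuse the stored reply
    return llm.generate(prompt)

Timeout handling: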

import asyncio

async def llm_with_timeout(prompt, timeout=5.0):
    try:
        result = await asyncio.wait_for(
            llm.async_generate(prompt),
            timeout=timeout
        )
        return result
    except asyncio.TimeoutError:
        return fallback_action()

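7.2 Inference Acceleration

Model quantization:

INT8/INT4 quantization shrinks the model's memory footprint and, with suitable kernels, can reduce inference time:

# Requires the bitsandbytes and accelerate packages
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

KV cache:

Reuse the attention keys and values computed for earlier tokens to speed up autoregressive generation.

Parallel decoding:

Use speculative decoding to speed up generation:

# Pseudocode: a small model drafts tokens, the large model verifies them
draft_tokens = small_model.generate(prompt, n=5)
verified_tokens = large_model.verify(draft_tokens)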
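
To make the draft-then-verify idea concrete, here is a toy, greedy version of one speculative decoding round. Both "models" are ordinary Python functions that map a token context to the next token. `speculative_step`, `target_model`, and `draft_model` are hypothetical stand-ins. Real implementations verify all drafted positions in a single batched forward pass and use a probabilistic acceptance rule when sampling.

```python
from typing import Callable

NextToken = Callable[[list[str]], str]


def speculative_step(context: list[str], draft: NextToken,
                     target: NextToken, k: int = 4) -> list[str]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposed: list[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) The target model checks each proposal in order.
    accepted: list[str] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3) Every draft was accepted, so the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


SENTENCE = ["move", "forward", "then", "turn", "left", "and", "stop"]


def target_model(ctx: list[str]) -> str:
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft_model(ctx: list[str]) -> str:
    # Agrees with the target except at position 3, where it guesses wrongly.
    return "go" if len(ctx) == 3 else target_model(ctx)


tokens = speculative_step([], draft_model, target_model, k=4)
print(tokens)
```

The round yields the same tokens the target model would have produced on its own, which is the property that makes speculative decoding safe to use: the draft model only affects speed, not the output.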

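7.3 Local Deployment Options

Open-source model choices:

Model	Parameters	Notes
LLaMA-2-7B	7B	Balances performance and efficiency
Mistral-7B	7B	Efficient inference
Qwen-7B	7B	Good Chinese support
Phi-2	2.7B	Lightweight and efficient

Deployment framework:

# Efficient serving with vLLM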
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256
)

def generate(prompts):
    outputs = llm.generate(prompts, sampling_params)
    return [o.outputs[0].text for o in outputs]

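8. Challenges and Limitations

8.1 Hallucination

Problem: the LLM may produce descriptions or reasoning that do not match the actual scene.

Mitigation strategies:

Visual verification: check the LLM's output against a vision model
Confidence threshold: fall back to a traditional method when confidence is low
Multi-path verification: sample several reasoning paths and take the consensus

def verified_reasoning(observation, instruction):
    # LLM reasoning
    reasoning = llm.reason(observation, instruction)

    # Visual verification
    mentioned_objects = extract_objects(reasoning)
    detected_objects = vision_model.detect(observation)

    # Consistency check: every object the LLM mentions must actually be detected
    for obj in mentioned_objects:
        if obj not in detected_objects:
            return fallback_reasoning(observation, instruction)

    return reasoning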

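8.2 Real-Time Performance

Problem: LLM inference latency can slow down navigation.

Solutions:

Approach	Description	Suitable for
Asynchronous inference	Predict the next step ahead of time	Continuous navigation
Model compression	Use a lightweight model	Resource-constrained platforms
Selective invocation	Call the LLM only at key decision points	Complex scenes
Cache reuse	Cache results for similar scenes	Repeated environments

# Asynchronous prediction
import asyncio

class AsyncNavigator:
    def __init__(self):
        self.future_action = None

    async def step(self, observation):
        # Use the action predicted during the previous step, if any.
        # It was computed from an estimated observation, so it is only
        # reliable when that estimate is close to the real observation.
        if self.future_action:
            action = await self.future_action
        else:
            action = await self.llm_decide(observation)

        # Start predicting the next step in the background
        next_obs_estimate = self.estimate_next(observation, action)
        self.future_action = asyncio.create_task(
            self.llm_decide(next_obs_estimate)
        )

        return action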

8.3 Cost

API call cost:

Model	Input price (USD per 1K tokens)	Output price (USD per 1K tokens)
GPT-4	$0.03/1K	$0.06/1K
GPT-3.5	$0.001/1K	$0.002/1K
Claude 3 Opus	$0.015/1K	$0.075/1K
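
To see what these prices mean for one navigation episode, the sketch below multiplies them out. The episode shape (20 LLM calls, 1,500 prompt tokens and 200 reply tokens per call) is an assumed workload chosen for illustration, not a measurement, and `episode_cost` is a hypothetical helper.

```python
def episode_cost(steps: int, input_tokens_per_step: int, output_tokens_per_step: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated API cost in USD for one episode with one LLM call per step."""
    input_cost = steps * input_tokens_per_step / 1000 * input_price_per_1k
    output_cost = steps * output_tokens_per_step / 1000 * output_price_per_1k
    return input_cost + output_cost


# Assumed workload: 20 steps, 1,500 input tokens and 200 output tokens per step.
gpt4_cost = episode_cost(20, 1500, 200, 0.03, 0.06)
gpt35_cost = episode_cost(20, 1500, 200, 0.001, 0.002)
print(f"GPT-4:   ${gpt4_cost:.3f} per episode")
print(f"GPT-3.5: ${gpt35_cost:.3f} per episode")
```

Under these assumptions a GPT-4 episode costs about $1.14 (30K input tokens at $0.03 plus 4K output tokens at $0.06), against roughly $0.04 for GPT-3.5, which is why calling the LLM only at selected steps matters at evaluation scale.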

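Cost optimization:

def cost_aware_navigation(instruction, env, budget):
    cost = 0.0
    history = []
    done = False

    while not done:
        observation = env.get_observation()

        if cost < budget and is_critical_point(observation, history):
            # Call the LLM only at critical decision points, while budget remains
            prompt = build_prompt(instruction, observation, history)
            action = llm_decide(prompt)
            cost += estimate_cost(len(prompt))
        else:
            # Use the lightweight model in simple scenes or once the budget is spent
            action = lightweight_model(observation)

        history.append((observation, action))
        done = action == "STOP"
        if not done:
            env.step(action)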

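8.4 Controllability

Problem: the LLM's output format is unstable and may contain invalid actions.

Solutions:

import json
import re

def parse_llm_output(output):
    # First try strict JSON parsing
    try:
        result = json.loads(output)
        if isinstance(result, dict) and "action" in result:
            return result["action"]
    except (json.JSONDecodeError, TypeError):
        pass

    # Then fall back to regular-expression matching
    patterns = [
        r"Action:\s*(\w+)",
        r"I will\s+(\w+)",
        r"Selected action:\s*(\w+)"
    ]
    for pattern in patterns:
        match = re.search(pattern, output, re.IGNORECASE)
        if match:
            return normalize_action(match.group(1))

    # Last resort: a safe default action
    return default_action()

def constrained_generation(prompt, valid_actions, tokenizer):
    """Bias generation toward valid actions (this favors them but does not guarantee them)."""
    # logit_bias is keyed by token ID, so each action is tokenized first;
    # tokenizer is an assumed argument that must match the model's vocabulary
    logit_bias = {
        token_id: 10
        for action in valid_actions
        for token_id in tokenizer.encode(action)
    }
    return llm.generate(prompt, logit_bias=logit_bias)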

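9. Experimental Results

9.1 Method Comparison

Performance comparison on the R2R dataset:

Method	SR	SPL	Inference time
Traditional end-to-end	55%	50%	Fast
LM-Nav	58%	52%	Medium
NavGPT (GPT-4)	63%	55%	Slow
LLM-Planner	61%	56%	Medium
Hybrid method	65%	58%	Medium

9.2 Ablation Studies

Effect of CoT:

Configuration	SR	Reasoning quality
No CoT	55%	Low
Zero-shot CoT	60%	Medium
Few-shot CoT	63%	High
NavCoT	65%	High

Effect of the number of examples:

Examples	SR	Token usage
0	52%	Low
1	58%	Medium
3	63%	High
5	64%	Very high

9.3 Generalization

Zero-shot performance in unseen environments:

Method	SR in training environments	SR in new environments	Generalization gap
End-to-end training	65%	45%	-20 points
LLM-based method	62%	58%	-4 points

The LLM-based method loses 4 points of SR in new environments versus 20 for end-to-end training, which the article attributes to the LLM's broad world knowledge.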

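10. Summary

This article surveyed how large language models are applied to vision-and-language navigation:

LLM roles:
- High-level planner: task decomposition and subgoal generation
- Instruction parser: semantic understanding and structured output
- Decision reasoner: scene analysis and action selection
- Knowledge base: world knowledge and commonsense reasoning
Chain-of-thought reasoning:
- Zero-shot and few-shot CoT
- NavCoT: navigation-specific chain-of-thought
- Self-consistency: voting over multiple reasoning paths
Prompt engineering:
- Structured input and output
- Context information design
- Dynamic prompt adjustment
Representative methods:
- LM-Nav: landmark sequence planning
- NavGPT: end-to-end LLM navigation
- LLM-Planner: layered architecture
- DiscussNav: multi-role discussion
Hybrid strategies:
- LLM-augmented traditional models
- Knowledge distillation
- LLM as a judge
Challenges and countermeasures:
- Hallucination: visual verification
- Latency: asynchronous inference and model compression
- Cost: selective invocation
- Controllability: output constraints

LLMs open up new possibilities for VLN, particularly in language understanding, commonsense reasoning, and zero-shot generalization. The next article covers vision-language models (VLMs) in VLN and how multimodal large models can improve navigation performance.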

References

[1] Shah D, Osinski B, Levine S. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. CoRL, 2022.

[2] Zhou X, et al. NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models. arXiv:2305.16986, 2023.

[3] Song C H, et al. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. ICCV, 2023.

[4] Wei J, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022.

[5] Long Y, et al. Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions. arXiv:2309.11382, 2023.

[6] Schumann R, et al. VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View. arXiv:2307.06082, 2023.

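Coming Up Next

The next article, "Vision-Language Models in VLN", introduces the capabilities of vision-language models such as GPT-4V, LLaVA, and Qwen-VL, and discusses how they are applied to vision-and-language navigation along with their strengths and limitations.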
