[Interview Notes] Multimodal Large Models

晓山清

377 views · 2026-05-09 12:38:40

Knowledge Framework

Fundamentals: Deep Learning, NLP, and CV Core

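Transformer architecture (almost always asked): Self-Attention, Multi-Head Attention, Positional Encoding (sinusoidal, RoPE, ALiBi), Feed-Forward Network, LayerNorm, residual connections. Be ready to hand-write Multi-Head Attention code.
CV basics: CNN (ResNet), Faster R-CNN, ViT (Vision Transformer) principles, YOLO, image Patch Embedding, positional encoding.
Evaluation metrics: precision, recall, accuracy, F1, mAP, IoU
Loss functions: cross-entropy, MSE, RMSE
NLP basics: bag-of-words model, n-gram, Markov chains
Speech basics: properties of sound (e.g., frequency, amplitude), TTS, ASR
Other: e.g., EEG
Basic NLP/CV tasks: language modeling (LM), text classification, object detection (e.g., DETR), image classification, segmentation.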
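
The metrics in the list above often come up as "write it on the spot" questions. The following is a minimal sketch in plain Python, not a reference implementation; the function names and the `(x1, y1, x2, y2)` box format are assumptions chosen for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.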

Large Language Model (LLM) Core

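Architecture variants: Causal Decoder (GPT series), Encoder-Decoder (T5), Prefix LM. Why is decoder-only the mainstream choice today?
Pre-training: Next Token Prediction, data sources and cleaning, Tokenizer (BPE, SentencePiece).
Efficient training: mixed precision (FP16/BF16), gradient accumulation, ZeRO optimizer (Stage 1/2/3), FlashAttention 1 & 2.
Parameter-efficient fine-tuning (PEFT): LoRA (principle, why it reduces trainable parameters, how weights are merged), Adapter, Prefix Tuning. Interviewers love asking about LoRA.
Alignment: RLHF (reward model, PPO), DPO (Direct Preference Optimization; simpler and widely used). Know the basic pipelines and how they differ.
Inference optimization: KV-Cache, quantization (GPTQ, AWQ, GGUF), speculative decoding.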

Multimodal Core

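Common paradigms
- Alignment: CLIP, ALIGN (dual-tower models, contrastive learning). CLIP's InfoNCE loss and training details.
- Fusion: LLaVA, Flamingo, BLIP-2, Qwen-VL, InternVL.
Key components and techniques
- Vision encoder: CLIP-ViT, SigLIP, DINOv2. Why use a frozen pre-trained ViT?
- Connector/adapter: MLP Projector (LLaVA), Q-Former (BLIP-2), Perceiver Resampler (Flamingo). What each one does.
- Position injection: absolute/relative position, 2D RoPE.
- Multimodal training stages: pre-training (image-text pairs) -> instruction tuning (multimodal dialogue data) -> alignment/DPO.
Advanced capabilities: video understanding (frame sampling, spatio-temporal attention), referring expression segmentation/detection, embodied AI basics, Any-to-Any (e.g., NExT-GPT, AnyGPT).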

Engineering and Tools

Deep learning framework: PyTorch (must be mastered, especially nn.Module, Dataset, DataLoader, autograd).
Distributed training: torch.distributed, torchrun, NCCL.
Common libraries: HuggingFace (transformers, datasets, peft, accelerate), DeepSpeed, vLLM, XTuner, LLaMA-Factory.
Evaluation benchmarks: MMLU, C-Eval (text); MMBench, SEED-Bench, MME, MMMU, MathVista (multimodal).

Common Questions

Fundamentals

Q: Explain convolution in CNN. What is receptive field? (convolution, receptive field)

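Convolution slides a learnable kernel over an input feature map to extract local patterns. The receptive field is the region of the input that a particular output feature (one pixel in the output map) can see. With stacked 3x3 convolutions at stride 1, the receptive field grows by 2 per layer: 3x3 after one layer, 5x5 after two, 7x7 after three. Strided layers make it grow faster.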
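
The growth of the receptive field can be computed layer by layer. The sketch below uses the standard recurrence (each layer adds `(kernel - 1) * jump`, where `jump` is the product of the strides so far); the function name and the `(kernel_size, stride)` layer format are assumptions for illustration, and dilation is ignored.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent features at the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3x3 convs, stride 1
strided = receptive_field([(3, 2), (3, 1)])              # a stride-2 conv followed by a stride-1 conv
```

Three stacked 3x3 convolutions cover a 7x7 region, the same as a single 7x7 kernel but with fewer parameters, which is one common argument for small kernels.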

Q: Difference between L1 and L2 regularization? (regularization)

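L1 adds the sum of absolute weights to the loss, which drives many weights to exactly zero (sparsity, implicit feature selection). L2 adds the sum of squared weights, which shrinks all weights smoothly toward zero (weight decay) without making them exactly zero.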
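
One way to see why L1 produces exact zeros and L2 does not is to compare the shrinkage each penalty applies to a single weight. The sketch below uses the proximal operators of the two penalties (soft-thresholding for `lam * |w|`, uniform scaling for `(lam / 2) * w**2`); the function names and the sample weights are assumptions for illustration.

```python
import math


def l1_prox(w, lam):
    """Soft-thresholding: subtract lam from |w| and clamp at zero."""
    return math.copysign(max(abs(w) - lam, 0.0), w)


def l2_prox(w, lam):
    """Shrinkage for the penalty (lam / 2) * w**2: scale w by 1 / (1 + lam)."""
    return w / (1.0 + lam)


weights = [0.05, -0.3, 1.0]
after_l1 = [l1_prox(w, 0.1) for w in weights]  # the small weight becomes exactly zero
after_l2 = [l2_prox(w, 0.1) for w in weights]  # every weight shrinks but stays nonzero
```

L1 removes a constant amount from every weight, so small weights hit zero; L2 removes an amount proportional to the weight, so a weight only approaches zero.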

Q: What is vanishing gradient? How to mitigate? (vanishing gradient)

Gradients become extremely small in deep networks, preventing early layers from learning. Solutions: ReLU activation, batch normalization, residual connections, careful weight initialization (e.g., He).
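
A back-of-the-envelope illustration of the effect: the sigmoid derivative is at most 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The sketch below deliberately ignores weight matrices and only multiplies activation derivatives, which is a simplifying assumption; the function names are hypothetical.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid at x; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def chained_gradient(depth, local_grad):
    """Gradient scale after backpropagating through `depth` layers with the same local derivative."""
    return local_grad ** depth


sigmoid_chain = chained_gradient(20, sigmoid_grad(0.0))  # best case for sigmoid, still tiny
relu_chain = chained_gradient(20, 1.0)                   # an active ReLU passes the gradient unchanged
```

Even in the sigmoid's best case, 20 layers scale the gradient by 0.25^20, on the order of 1e-12, while active ReLU units and residual (identity) paths keep the factor at 1.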

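Q: Difference between ResNet18, ResNet50, ResNet101? (ResNet)

They have 18, 50, and 101 weighted layers respectively. The deeper difference is the block type: ResNet18 uses BasicBlock (two 3x3 convs) with [2, 2, 2, 2] blocks per stage, while ResNet50 and ResNet101 use Bottleneck blocks (1x1, 3x3, 1x1 convs) with [3, 4, 6, 3] and [3, 4, 23, 3] blocks per stage. The final feature dimension is 512 for ResNet18 and 2048 for ResNet50/101.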
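
The layer counts follow from the per-stage block configuration: one stem convolution, the convolutions inside the residual blocks, and one final fully connected layer. A small sketch of that arithmetic (the dictionary and function names are assumptions for illustration; downsampling shortcut convolutions are conventionally not counted):

```python
# (block type, number of residual blocks in each of the four stages)
RESNET_CONFIGS = {
    "ResNet18": ("basic", [2, 2, 2, 2]),
    "ResNet50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet101": ("bottleneck", [3, 4, 23, 3]),
}

# BasicBlock has two 3x3 convs; Bottleneck has 1x1, 3x3, 1x1 convs.
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def count_weighted_layers(block_type, blocks_per_stage):
    """Stem conv + convs inside residual blocks + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1


depths = {name: count_weighted_layers(*cfg) for name, cfg in RESNET_CONFIGS.items()}
```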

Q: YOLO vs Faster R-CNN – key differences?

YOLO is one-stage: it predicts boxes and classes directly in a single forward pass, so it is very fast but has traditionally been slightly less accurate on small objects. Faster R-CNN is two-stage: an RPN first proposes regions, then a second stage classifies and refines them, which is slower but typically more accurate.

Q: YOLO structure
Three parts: the Backbone for feature extraction, the Neck for feature fusion, and the Head for final prediction. The details below match YOLOv8.
Backbone: an enhanced CSPDarknet that uses C2f (Cross Stage Partial with 2 convolutions) modules instead of C3 modules.
Neck (bidirectional feature fusion): PAN-FPN (Path Aggregation Network with Feature Pyramid Network) structure.
Head: Decoupled + Anchor-Free.
Loss Function: CIoU + BCE/VFL + DFL
Input: 640 x 640 x 3

Q: What is NMS? How does Soft-NMS differ?
NMS keeps the highest-confidence box, removes every remaining box whose overlap with it exceeds an IoU threshold, and repeats on what is left. Soft-NMS decays the confidence scores of overlapping boxes instead of suppressing them completely, which helps when objects are close together.
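
Both procedures are short enough to write by hand. The sketch below is plain Python for readability, not an optimized implementation; the box format `(x1, y1, x2, y2)`, the function names, and the default thresholds are assumptions, and Soft-NMS is shown in its Gaussian form (score multiplied by `exp(-IoU^2 / sigma)`).

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Hard NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decays scores of overlapping boxes instead of removing them."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        for i in remaining:
            scores[i] *= math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        # Boxes are discarded only once their decayed score becomes negligible.
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
hard_keep = nms(boxes, scores)
soft_keep, soft_scores = soft_nms(boxes, scores)
```

In this example the second box overlaps the first with IoU of about 0.68, so hard NMS discards it, while Soft-NMS keeps it with a reduced score and ranks it below the non-overlapping third box.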

LLM

Q: Describe the Self-Attention computation: where Q, K, and V come from, what the formula is, and why we divide by $\sqrt{d_k}$. Also, how does Multi-Head Attention work?
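
As a starting point for an answer: Q, K, and V are linear projections of the same input X (Q = X W_Q, K = X W_K, V = X W_V), and the output is softmax(Q K^T / sqrt(d_k)) V. The scaling matters because the dot product of two d_k-dimensional vectors with unit-variance components has variance d_k; without it, large logits push the softmax into a saturated region with tiny gradients. Multi-head attention splits the projected features into several heads, runs attention in each, concatenates the results, and applies an output projection. The sketch below is plain Python on nested lists so the arithmetic is visible; all names are assumptions, and masking, dropout, biases, and batching are omitted.

```python
import math


def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project x, split the feature dimension into heads, attend per head, concat, project."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        out, _ = attention([r[cols] for r in q], [r[cols] for r in k], [r[cols] for r in v])
        heads.append(out)
    # Concatenate the per-head outputs for each token, then apply the output projection.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
mha_out = multi_head_attention(tokens, identity, identity, identity, identity, num_heads=2)
```

Here `tokens` holds 3 tokens with model dimension 4, and the identity matrices stand in for learned projections; the output keeps the input shape (3 x 4). In an interview the same structure is usually written with `torch.nn.Linear` layers and tensor reshapes.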

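Q: LoRA (Low-Rank Adaptation) is currently the most common method for fine-tuning LLMs. Explain the core idea of LoRA: how does it achieve "parameter efficiency"? Suppose we fine-tune a 7B model with LoRA (rank=8): roughly how many parameters and how much GPU memory does it save? Finally, at inference time, do we need to keep both the original base model and the LoRA weights as two separate files?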
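
A rough way to work through the numbers: LoRA freezes each adapted weight W (d_out x d_in) and learns a low-rank update B A, with B of shape d_out x r and A of shape r x d_in, so each adapted matrix adds only r * (d_in + d_out) trainable parameters. The sketch below assumes a LLaMA-7B-like shape (32 layers, hidden size 4096) with LoRA on the query and value projections only, a nominal 7e9 total parameters, and the common rule of thumb of about 16 bytes per trainable parameter for mixed-precision Adam (weights, gradients, and optimizer states, activations excluded). All of these are assumptions for illustration, not measured values.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def merge_lora(w, a, b, alpha, rank):
    """Fold the adapter into the base weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [
        [w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank)) for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Assumed LLaMA-7B-like configuration (hypothetical constants for the estimate).
HIDDEN, LAYERS, RANK, ADAPTED_PER_LAYER = 4096, 32, 8, 2  # q_proj and v_proj
TOTAL_PARAMS = 7e9
BYTES_PER_TRAINABLE_PARAM = 16  # rule of thumb for mixed-precision Adam
BYTES_PER_FROZEN_PARAM = 2      # frozen base weights kept in FP16/BF16

trainable = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = trainable / TOTAL_PARAMS

full_ft_gb = TOTAL_PARAMS * BYTES_PER_TRAINABLE_PARAM / 1e9
lora_gb = (TOTAL_PARAMS * BYTES_PER_FROZEN_PARAM + trainable * BYTES_PER_TRAINABLE_PARAM) / 1e9
```

Under these assumptions LoRA trains about 4.2M parameters (roughly 0.06% of the model), and the weight-plus-optimizer footprint drops from about 112 GB to about 14 GB. For inference there are two options: keep the base model and the small adapter file separately (convenient for swapping adapters), or merge once with `W' = W + (alpha / r) B A` and ship a single set of weights with no extra latency.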

MLLM

Q: How does CLIP work?
CLIP learns a joint embedding space for images and text using contrastive learning. It trains an image encoder (ViT/ResNet) and a text encoder (Transformer) on 400M image-text pairs. At inference, it matches an image with the most similar text prompt, which enables zero-shot classification.

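Q: Describe CLIP's training process: how a batch is composed and the form of the loss function (InfoNCE). Also, what are CLIP's obvious weaknesses? For example, why does it perform poorly on fine-grained tasks (counting, attribute recognition)?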
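
For the loss part of the question: a batch contains N matched image-text pairs, the N x N cosine-similarity matrix is scaled by a temperature, and the loss is the average of two cross-entropies (image-to-text over rows, text-to-image over columns) whose correct class is the diagonal. The sketch below is plain Python; the function names are assumptions, and the temperature is fixed at 0.07 here for simplicity, whereas CLIP learns it during training.

```python
import math


def normalize(vec):
    """L2-normalize a vector so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: image_emb[i] and text_emb[i] form the positive pair."""
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts] for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Cross-entropy with the diagonal entry as the target, in both directions.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapping the texts gives a large loss. Because every other item in the batch acts as a negative, larger batches provide more negatives per step.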

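Q: Explain the core architecture of LLaVA (Large Language and Vision Assistant). What training stages does it have? Which modules are frozen or updated in each stage (e.g., vision encoder, MLP connector, LLM)? If you were to improve LLaVA, what would you do?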
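
As a memory aid for the freeze/update part of the question, the two-stage recipe of the original LLaVA can be written down as a small table: stage 1 (feature alignment pre-training) updates only the projector, and stage 2 (visual instruction tuning) updates the projector and the LLM while the vision encoder stays frozen. The dictionary keys and helper below are hypothetical names for illustration, not part of any LLaVA codebase, and later variants may deviate from this recipe.

```python
# True means the module's weights are updated in that stage.
LLAVA_STAGES = {
    "stage1_feature_alignment": {"vision_encoder": False, "projector": True, "llm": False},
    "stage2_visual_instruction_tuning": {"vision_encoder": False, "projector": True, "llm": True},
}


def trainable_modules(stage):
    """Names of the modules that receive gradient updates in the given stage."""
    return sorted(name for name, trainable in LLAVA_STAGES[stage].items() if trainable)


stage1 = trainable_modules("stage1_feature_alignment")
stage2 = trainable_modules("stage2_visual_instruction_tuning")
```

In a PyTorch training script this typically maps to setting `requires_grad` to `False` on the parameters of the frozen modules before building the optimizer.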

Applications

Q: How do you optimize a PyTorch model for deployment?
Export the model to ONNX, then convert it to a runtime format such as TensorRT or TFLite. Apply INT8 quantization and layer fusion, and remove unused operations. On an NPU, use the vendor SDK (e.g., Huawei HiAI).
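
To make the INT8 step concrete, the sketch below shows the arithmetic of symmetric per-tensor quantization: values are divided by a scale, rounded to integers in [-127, 127], and multiplied back at inference. It is only an illustration of the idea with hypothetical function names; real toolchains such as TensorRT choose scales through calibration and often quantize per channel.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: q = round(x / scale), scale = max|x| / 127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [x * scale for x in q]


weights = [-1.0, 0.0, 0.25, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error budget that calibration tries to keep small for the values that matter most.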