[Interview Notes] Multimodal Large Models
Knowledge Framework
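- Fundamentals: deep learning, NLP, and CV core concepts
  - Transformer architecture (almost always tested): Self-Attention, Multi-Head Attention, Positional Encoding (sinusoidal, RoPE, ALiBi), Feed-Forward Network, LayerNorm, residual connections. Be ready to write Multi-Head Attention code by hand.
  - CV basics: CNN (ResNet), Faster R-CNN, ViT (Vision Transformer) principles, YOLO, image Patch Embedding, positional encoding.
  - Evaluation metrics: precision, recall, accuracy, F1, mAP, IoU
  - Loss functions: cross-entropy, MSE, RMSE
  - NLP basics: bag-of-words, n-gram, Markov chains
  - Speech basics: sound properties (e.g., frequency, amplitude), TTS, ASR
  - Other: e.g., EEG
  - Basic NLP/CV tasks: language modeling (LM), text classification, object detection (e.g., DETR), image classification, segmentation.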
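- Large language model (LLM) core
  - Architecture variants: Causal Decoder (GPT series), Encoder-Decoder (T5), Prefix LM. Why is Decoder-only the mainstream choice today?
  - Pretraining: Next Token Prediction, data sources and cleaning, Tokenizer (BPE, SentencePiece).
  - Efficient training: mixed precision (FP16/BF16), gradient accumulation, ZeRO optimizer (Stage 1/2/3), FlashAttention 1 and 2.
  - Parameter-efficient fine-tuning (PEFT): LoRA (principle, why it reduces trainable parameters, how weights are merged), Adapter, Prefix Tuning. Interviewers love to ask about LoRA.
  - Alignment techniques: RLHF (reward model, PPO), DPO (Direct Preference Optimization; simpler and increasingly common). Know the basic pipelines and how they differ.
  - Inference optimization: KV-Cache, quantization (GPTQ, AWQ; GGUF as the llama.cpp file format for quantized weights), speculative decoding.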
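- Multimodal core
  - Common paradigms
    - Alignment: CLIP, ALIGN (dual-tower models, contrastive learning). CLIP's InfoNCE loss and training details.
    - Fusion: LLaVA, Flamingo, BLIP-2, Qwen-VL, InternVL.
  - Key components and techniques
    - Vision encoder: CLIP-ViT, SigLIP, DINOv2. Why use a frozen pretrained ViT?
    - Connector/adapter: MLP Projector (LLaVA), Q-Former (BLIP-2), Perceiver Resampler (Flamingo). What role does each play?
    - Position injection: absolute/relative position, 2D RoPE.
    - Multimodal training stages: pretraining (image-text pairs) -> instruction tuning (multimodal dialogue data) -> alignment/DPO.
  - Advanced capabilities: video understanding (frame sampling, spatio-temporal attention), referring expression segmentation/detection, embodied AI basics, Any-to-Any (e.g., NExT-GPT, AnyGPT).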
- Engineering and tools
  - Deep learning framework: PyTorch (must be mastered, especially nn.Module, Dataset, DataLoader, autograd).
  - Distributed training: torch.distributed, torchrun, NCCL.
  - Common libraries: HuggingFace (transformers, datasets, peft, accelerate), DeepSpeed, vLLM, XTuner, LLaMA-Factory.
  - Evaluation benchmarks: MMLU, C-Eval (text); MMBench, SEED-Bench, MME, MMMU, MathVista (multimodal).
Common Questions
Fundamentals
Q: Explain convolution in a CNN. What is the receptive field? (convolution, receptive field)
Convolution slides a learnable kernel over an input feature map to extract local patterns. The receptive field is the region of the input that a given output feature (one pixel of the output map) depends on. It grows with depth: n stacked 3x3 stride-1 convolutions have a (2n+1)x(2n+1) receptive field, and strides or pooling make it grow faster.
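To make the growth of the receptive field concrete, here is a minimal sketch of the standard recurrence. The function name and the (kernel_size, stride) layer encoding are illustrative assumptions, and dilation and padding are ignored.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a conv stack.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent features of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3 stride-1 convs, then two 3x3 stride-2 convs.
print(receptive_field([(3, 1)] * 3))
print(receptive_field([(3, 2), (3, 2)]))
```

Both calls give 7: three stride-1 layers reach it through depth alone, while two stride-2 layers reach it because the stride multiplies the contribution of every later layer.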
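Q: What is the difference between L1 and L2 regularization? (regularization)
L1 adds the sum of absolute weights to the loss, which drives many weights to exactly zero (sparsity, implicit feature selection). L2 adds the sum of squared weights, which shrinks every weight in proportion to its size (weight decay): the solution stays small and smooth but is rarely exactly sparse.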
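The sparsity difference is easiest to see on a single weight. The sketch below is an illustration under assumptions, not a training loop: it applies one L2 gradient step for the penalty (lam/2) * w^2 and one L1 proximal (soft-thresholding) step for lam * |w|. The function names are hypothetical.

```python
def l2_step(w, lr, lam):
    # Gradient of (lam / 2) * w**2 is lam * w: the weight shrinks proportionally.
    return w - lr * lam * w


def l1_prox_step(w, lr, lam):
    # Soft-thresholding, the proximal operator of lam * |w|:
    # weights smaller than lr * lam in magnitude become exactly zero.
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


for w in (1.0, 0.05, -0.02):
    print(w, l2_step(w, lr=0.1, lam=1.0), l1_prox_step(w, lr=0.1, lam=1.0))
```

Small weights survive the L2 step (scaled by 0.9 here) but are set to exactly zero by the L1 step, which is where the sparsity comes from.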
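Q: What is the vanishing gradient problem? How do you mitigate it? (vanishing gradients)
Gradients shrink as they are backpropagated through many layers (for example through saturating sigmoid or tanh activations), so early layers learn very slowly or not at all. Mitigations: ReLU-family activations, batch normalization, residual connections, and careful weight initialization (e.g., He).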
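A toy calculation shows the effect. The sketch below assumes a chain of scalar sigmoid layers with all weights fixed to 1, which is a deliberate simplification; `backprop_factor` is a hypothetical helper that multiplies the local derivatives the chain rule would multiply.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25 (reached at z = 0)


def backprop_factor(pre_activations, residual=False):
    """Product of per-layer derivatives along a chain of scalar layers.

    Plain layer:    y = sigmoid(z)      -> dy/dz = sigmoid'(z)
    Residual layer: y = z + sigmoid(z)  -> dy/dz = 1 + sigmoid'(z)
    """
    factor = 1.0
    for z in pre_activations:
        local = sigmoid_grad(z)
        factor *= (1.0 + local) if residual else local
    return factor


for depth in (5, 10, 20):
    zs = [0.0] * depth  # best case for sigmoid: derivative is 0.25 everywhere
    print(depth, backprop_factor(zs), backprop_factor(zs, residual=True))
```

Even in the best case the plain chain scales the gradient by 0.25 per layer, while the identity path of a residual layer keeps each factor at or above 1.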
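Q: What is the difference between ResNet-18, ResNet-50, and ResNet-101? (ResNet)
The number is the count of weighted layers: 18, 50, and 101. ResNet-18 stacks BasicBlocks (two 3x3 convs) with [2, 2, 2, 2] blocks per stage. ResNet-50 and ResNet-101 use Bottleneck blocks (1x1, 3x3, 1x1 convs) with [3, 4, 6, 3] and [3, 4, 23, 3] blocks per stage, so ResNet-101 differs from ResNet-50 only in a deeper conv4_x stage. Parameter counts are roughly 11.7M, 25.6M, and 44.5M.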
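The layer counts in the names can be reproduced from the block layout. This is a small sketch using the standard ResNet stage configurations; the dictionary and function names are illustrative, and the 1x1 convs on downsampling shortcuts are not counted, following the usual convention.

```python
# Residual blocks per stage (conv2_x .. conv5_x) and weighted conv layers per block.
RESNET_CONFIGS = {
    "ResNet-18": ([2, 2, 2, 2], 2),    # BasicBlock: two 3x3 convs
    "ResNet-50": ([3, 4, 6, 3], 3),    # Bottleneck: 1x1, 3x3, 1x1 convs
    "ResNet-101": ([3, 4, 23, 3], 3),  # Bottleneck with a deeper conv4_x stage
}


def count_layers(blocks_per_stage, convs_per_block):
    # +1 for the 7x7 stem conv, +1 for the final fully connected layer.
    return 1 + sum(blocks_per_stage) * convs_per_block + 1


for name, (blocks, convs) in RESNET_CONFIGS.items():
    print(name, count_layers(blocks, convs))
```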
Q: YOLO vs Faster R-CNN: what are the key differences?
YOLO is one-stage: it predicts boxes and classes directly in a single pass, so it is very fast but slightly less accurate on small objects. Faster R-CNN is two-stage: an RPN first proposes regions, then a second stage classifies and refines them, which is slower but more accurate.
Q: Describe the YOLO architecture.
A YOLO detector has three parts: the Backbone for feature extraction, the Neck for feature fusion, and the Head for final prediction. The details below match YOLOv8.
Backbone: an enhanced CSPDarknet that uses the C2f module (CSP bottleneck with 2 convolutions) in place of the C3 module.
Neck: a PAN-FPN (Path Aggregation Network plus Feature Pyramid Network) structure for bidirectional feature fusion.
Head: decoupled (separate classification and box-regression branches) and anchor-free.
Loss function: CIoU + DFL for box regression, BCE (or VFL) for classification.
Input: 640 x 640 x 3.
Q: What is NMS? How does Soft-NMS differ?
NMS keeps the highest-confidence box, removes every remaining box whose IoU with it exceeds a threshold, and repeats on what is left. Soft-NMS decays the confidence scores of overlapping boxes instead of suppressing them completely, which helps when objects are close together.
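Both procedures fit in a few lines of plain Python. The sketch below assumes boxes are (x1, y1, x2, y2) tuples with positive area, and uses the Gaussian decay variant of Soft-NMS; the default values for `sigma` and `score_thresh` are illustrative choices, not prescribed settings.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_thresh=0.5):
    """Hard NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: returns (index, decayed_score) pairs in selection order."""
    scores = list(scores)  # copy so the caller's list is not modified
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)  # decay, do not delete
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
print(soft_nms(boxes, scores))
```

In this example hard NMS drops the second box entirely, while Soft-NMS keeps it with a reduced score, so a nearby true object can still be recovered by a later score threshold.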
LLM
Q: Walk through the Self-Attention computation: where do Q, K, and V come from, what is the formula, and why do we divide by $\sqrt{d_k}$? Also, how does Multi-Head Attention work?
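As a starting point for an answer: Q, K, and V are linear projections of the same input X (Q = X W_q, K = X W_k, V = X W_v), and each head computes softmax(Q K^T / sqrt(d_k)) V. The division by sqrt(d_k) keeps the dot products, whose variance grows with d_k, from pushing softmax into saturation. The sketch below is a framework-free illustration using nested lists; in an interview you would normally write it with PyTorch tensors. Biases, masking, and dropout are omitted, and all function names are illustrative.

```python
import math


def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's feature slice
        heads.append(attention([r[cols] for r in q],
                               [r[cols] for r in k],
                               [r[cols] for r in v]))
    # Concatenate the heads along the feature axis, then apply the output projection.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)
```

Each head attends within its own d_model / num_heads slice of the projected features, so different heads can learn different attention patterns at roughly the cost of one full-width head.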
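Q: LoRA (Low-Rank Adaptation) is currently the most widely used method for fine-tuning LLMs. Explain its core idea: how does it achieve parameter efficiency? Suppose we fine-tune a 7B model with LoRA (rank=8): roughly how many parameters and how much GPU memory does it save? Finally, at inference time, do we need to keep both the original base model and the LoRA weights as two separate files?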
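A back-of-the-envelope calculation helps with the numbers. LoRA freezes W and learns a low-rank update B A, so a d_out x d_in matrix needs only rank * (d_in + d_out) trainable parameters. The sketch below assumes a LLaMA-7B-like shape (32 layers, hidden size 4096), adapters on the query and value projections only, and 16 bytes of model state per trainable parameter under mixed-precision Adam (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments). All of these are assumptions for illustration; activations are not counted.

```python
def lora_params(d_in, d_out, rank):
    # A has shape (rank, d_in), B has shape (d_out, rank).
    return rank * (d_in + d_out)


def model_state_gb(frozen_params, trainable_params):
    # Frozen weights: 2 bytes each (bf16). Trainable: 16 bytes each (assumption above).
    return (2 * frozen_params + 16 * trainable_params) / 1e9


hidden, num_layers, rank = 4096, 32, 8  # assumed LLaMA-7B-like shape
adapted_per_layer = 2                   # assumed: q_proj and v_proj only
base_params = 7_000_000_000             # nominal "7B"

trainable = num_layers * adapted_per_layer * lora_params(hidden, hidden, rank)
print("LoRA trainable params:", trainable)
print("fraction of base:", trainable / base_params)
print("full fine-tune model states (GB):", model_state_gb(0, base_params))
print("LoRA model states (GB):", model_state_gb(base_params, trainable))
```

Under these assumptions LoRA trains about 4.2M parameters (roughly 0.06% of the base model), and model-state memory drops from about 112 GB to about 14 GB because gradients and optimizer states exist only for the adapters. For inference the update can be merged once as W' = W + (alpha / r) * B A, which yields a single weight file with no extra latency; keeping the adapter separate is only needed when several adapters share one base model.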
MLLM
Q: How does CLIP work?
CLIP learns a joint embedding space for images and text using contrastive learning. It trains an image encoder (ViT or ResNet) and a text encoder (Transformer) jointly on 400M image-text pairs. At inference, it scores an image against a set of text prompts and picks the most similar one, which enables zero-shot classification.
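Q: Describe CLIP's training process: how a batch is composed and the form of the loss function (InfoNCE). Also, what are CLIP's obvious weaknesses? For example, why does it perform poorly on fine-grained tasks such as counting and attribute recognition?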
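The loss itself can be sketched without a framework. A batch contains N matched (image, text) pairs; the N diagonal entries of the N x N similarity matrix are positives and every other entry is an in-batch negative. The code below is a minimal illustration with a fixed temperature of 0.07 (CLIP learns this value during training, starting from 0.07); the function names are illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over N matched (image, text) pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text: row i should pick column i. Text-to-image: column j should pick row j.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Because the objective only asks each whole image to match its whole caption better than the other captions in the batch, nothing forces the embedding to encode counts or attribute bindings, which is one common explanation for the weakness on fine-grained tasks.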
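Q: Explain the core architecture of LLaVA (Large Language and Vision Assistant). What training stages does it have, and which modules (e.g., vision encoder, MLP connector, LLM) are frozen or updated in each stage? If you were to improve LLaVA, what would you do?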
Applications
Q: How do you optimize a PyTorch model for deployment?
Export the model to ONNX, then convert it to a runtime format such as TensorRT or TFLite. Apply quantization (e.g., INT8), layer fusion, and removal of unused operations. On an NPU, use the vendor SDK (e.g., Huawei HiAI).
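Q: Suppose we train a 7B LLM plus a vision encoder (~300M parameters) on 8 A100 (80 GB) GPUs. With bf16 mixed precision, how much GPU memory is needed? Which distributed parallelism strategies (DP, ZeRO, TP, PP) would you use? How would you configure them with HuggingFace accelerate or DeepSpeed?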
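One way to start the estimate is the usual mixed-precision Adam accounting: per parameter, 2 bytes of bf16 weights, 2 bytes of gradients, and 12 bytes of optimizer state (fp32 master weights plus two Adam moments). The sketch below applies that accounting to each ZeRO stage. It assumes every parameter is trainable (a frozen vision encoder would need less) and ignores activations, buffers, and fragmentation. The `ds_config` values are illustrative assumptions rather than tuned settings.

```python
def zero_model_state_gb(params, num_gpus, stage):
    """Per-GPU model-state memory in GB under mixed-precision Adam."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards the weights
    return params * (weights + grads + optim) / 1e9


total_params = 7.0e9 + 0.3e9  # LLM + vision encoder
for stage in range(4):
    gb = zero_model_state_gb(total_params, num_gpus=8, stage=stage)
    print(f"ZeRO-{stage}: {gb:.1f} GB of model states per GPU")

# Minimal DeepSpeed config sketch for ZeRO-3 with bf16 (batch sizes are placeholders).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
}
```

Under these assumptions plain data parallelism needs about 117 GB per GPU and does not fit on an 80 GB card, while ZeRO-1, ZeRO-2, and ZeRO-3 need roughly 40, 27, and 15 GB, leaving the rest for activations. That is why ZeRO-2 or ZeRO-3 across the 8 GPUs is a common first choice for a model of this size before reaching for TP or PP.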
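Q: Design a small multimodal model that, given a user's spoken description, can "highlight" all red vehicles in a complex street-scene image. Real-time performance is not required. Describe your model architecture, loss function, and how you would construct the training data.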