Why does qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 need more than 8 GB of extra VRAM to run inference on an image?

I have recently been testing some low-VRAM VLMs, hoping that QA with 1-2 images attached can stay within 16 GB of VRAM.

The list of models under test is in another question: "What open-source vision LLMs are currently available?"

While testing qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 I hit a problem: pure-text inference uses a bit over 7 GB of VRAM, but as soon as the question includes a single image, it immediately OOMs.

Just one image, and it OOMs?

The GPU I am testing on is a Tesla T4 with 16 GB of VRAM. The traceback:
File ~/.local/share/virtualenvs/modelscope_example-DACykz4b/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py:404, in VisionSdpaAttention.forward(self, hidden_states, cu_seqlens, rotary_pos_emb)
    402 k = k.transpose(0, 1)
    403 v = v.transpose(0, 1)
--> 404 attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
    405 attn_output = attn_output.transpose(0, 1)
    406 attn_output = attn_output.reshape(seq_length, -1)

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.10 GiB. GPU 0 has a total capacity of 14.58 GiB of which 679.56 MiB is free. Including non-PyTorch memory, this process has 13.91 GiB memory in use. Of the allocated memory 13.57 GiB is allocated by PyTorch, and 220.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
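
Where does the 6.10 GiB come from? My reading of the traceback is that it is the attention score matrix inside the vision tower: without flash attention, the SDPA math path materializes a seq_len x seq_len matrix per head, and seq_len here is the number of image patches, which grows with image resolution. A rough back-of-the-envelope sketch (the 14x14 patch size, 16 heads, fp16 activations, and the assumed resized resolution of demo.jpeg are my guesses, not measured values):

# Back-of-the-envelope estimate of the ViT attention score allocation (a sketch, not a measurement).
# Assumptions: demo.jpeg ends up at roughly 2044x1372 after preprocessing, the vision tower
# uses 14x14 patches, 16 attention heads, and fp16 activations.
patch = 14
heads = 16                                     # assumed head count
h, w = 2044, 1372                              # assumed resized resolution
seq_len = (h // patch) * (w // patch)          # number of vision patches (~14k)
score_bytes = heads * seq_len * seq_len * 2    # fp16 attention scores
print(seq_len, score_bytes / 1024 ** 3)        # in the ballpark of the 6.10 GiB in the error

Under these assumptions a single large image alone costs roughly 6 GiB in that one allocation, which is why the commented-out max_pixels hint in the code below matters.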

The test code follows.

Pure text only:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_dir)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            # {
            #     "type": "image",
            #     "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            # },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

When this runs, it uses 7.6 GB of VRAM.
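
For comparison with that number (presumably read from nvidia-smi, which also counts CUDA context overhead), PyTorch's own peak allocation can be logged with a small addition around the generate call; a minimal sketch:

import torch

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")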



With an image:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_dir)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

This time it OOMs.
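
The most direct mitigation I can see is the one already hinted at in the commented-out lines of the script: cap max_pixels on the processor so a large image is downscaled before being turned into vision patches, and/or enable flash_attention_2 so the score matrix is never materialized. A sketch of the processor change (the 256/1280 values are the example numbers from the model card, not tuned for the T4):

# Cap the visual token budget so a large image does not explode the ViT sequence length.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28   # bounds the patch count, hence the attention size
processor = AutoProcessor.from_pretrained(
    model_dir, min_pixels=min_pixels, max_pixels=max_pixels
)

With the patch count capped this way, the quadratic attention term from the estimate above shrinks to the order of hundreds of MB under the same assumptions.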


If you want to reproduce this, the full dependency list is below (an install note follows the list):

loguru
peewee
pydantic
PyMySQL
python-dotenv
python-multipart
pytz
PyYAML
requests
uvicorn
fastapi
minio
filetype



modelscope
opencv-python
torch
regex
ftfy
torchvision

transformers
bitsandbytes
accelerate

addict
decord
qwen-vl-utils
optimum

auto-gptq # https://github.com/chatchat-space/Langchain-Chatchat/issues/2993
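
To reproduce, I am assuming the list above is saved as requirements.txt and installed the usual way (auto-gptq in particular may need a build that matches your torch/CUDA version):

pip install -r requirements.txt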

GLM-4V also seems to be very VRAM-hungry; compared with GLM-4 it uses 300%+ more.

https://github.com/THUDM/GLM-4/tree/main/finetune_demo
