
Multimodal

Also called multi-modal, vision-language model, or VLM.

A multimodal model can process more than one kind of input: text + images, text + audio, sometimes text + video. The model is still a transformer at its core, but its tokenizer has been extended to encode pixels, audio waveforms, or video frames as token-like representations the transformer can attend to.

Originally, language models worked on text only: tokens in, tokens out. Multimodal models extend the input side. An image gets cut into patches; each patch gets encoded into a vector that lives in the same embedding space as the text tokens. From the transformer's perspective, an image is just a sequence of "image tokens" sitting alongside the text tokens. Attention works the same way; the model can read text and look at the image with one mechanism.
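To make that concrete, here is a minimal sketch of the pipeline in plain NumPy: cut an image into fixed-size patches, project each patch into the same embedding width the text tokens use, and concatenate the two sequences. The patch size, embedding width, and random projection matrix are illustrative assumptions, not any real model's weights.

```python
# Sketch of how an image becomes "image tokens" next to text tokens.
# Sizes and the projection are illustrative, not a real model's layers.
import numpy as np

d_model = 768      # shared embedding width for text and image tokens
patch_size = 16    # image is cut into 16x16 pixel patches

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into flattened patch vectors."""
    h, w, c = image.shape
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size * patch_size * c)
    )
    return patches  # (num_patches, 768) since 16 * 16 * 3 = 768

rng = np.random.default_rng(0)

# A learned projection maps each flattened patch into the text embedding space.
W_proj = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, d_model))

image = rng.random((224, 224, 3))
image_tokens = patchify(image) @ W_proj        # (196, 768) "image tokens"

text_tokens = rng.normal(size=(12, d_model))   # stand-in for embedded text tokens
sequence = np.concatenate([image_tokens, text_tokens])  # one sequence, one attention
print(sequence.shape)                          # (208, 768)
```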

Frontier models in 2026 are nearly all multimodal on input. You can paste a screenshot, attach a PDF, or share a photo, and the model can describe, summarize, extract structured data, transcribe text in the image, or reason about the visual content alongside text. Output is mostly still text; image generation (Gemini Nano Banana, GPT image-1, Claude image generation) is handled by a separate set of models trained for that.
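From the developer side, this usually just means attaching an image to an ordinary chat request. Below is a minimal sketch using the OpenAI-style image_url content part; the model name, API key, file name, and prompt are placeholders, so check your provider's documentation for the exact request shape.

```python
# Minimal sketch: send an image plus a text prompt to a multimodal chat endpoint.
# Model name, key, and file are placeholders; verify against your provider's docs.
import base64
import requests

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "gpt-4o",  # any vision-capable model
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the data points from this chart as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```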

The capability matrix is uneven. Strong areas: optical character recognition (OCR), chart reading, diagram interpretation, document analysis, "what is this object" identification. Weak areas: precise spatial reasoning ("which corner is the cup in"), counting (especially with many similar objects), reasoning about novel visual compositions, anything requiring 3D understanding.

For multimodal pricing, image input is typically billed as a token equivalent: a single image counts as some number of input tokens (often 500-2000 depending on resolution and model). The cost calculation in the chat side rail aggregates image tokens with text tokens for the displayed cost.
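A back-of-envelope version of that calculation, with made-up per-image token counts and prices purely to show the shape of the math:

```python
# Toy cost math, assuming an image bills as a flat token equivalent.
# The 1,100-token figure and the prices are placeholders; real values
# vary by model and image resolution.
def request_cost(text_tokens: int, images: int, output_tokens: int,
                 tokens_per_image: int = 1100,
                 input_price_per_m: float = 3.00,
                 output_price_per_m: float = 15.00) -> float:
    input_tokens = text_tokens + images * tokens_per_image
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# One screenshot plus a short prompt and a medium-length answer:
print(f"${request_cost(text_tokens=400, images=1, output_tokens=600):.4f}")
```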

Audio multimodal (speech in, speech or text out) is following a similar trajectory. Real-time voice models are now common (GPT realtime, Gemini Live). Their core architecture is still a transformer; the tokenizer just understands waveforms.

Video is the frontier. Frame sampling, temporal attention, and very long contexts are the engineering challenges. Useful video understanding exists but lags image understanding by 12-18 months.
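A common interim approach is to do the frame sampling yourself and send the sampled frames as ordinary images. A rough sketch with OpenCV follows; the frame count and even spacing are arbitrary choices here, not any provider's actual preprocessing.

```python
# Illustrative frame sampling for video input: pick N evenly spaced frames,
# then send each frame as an image. num_frames is an arbitrary choice.
import cv2

def sample_frames(path: str, num_frames: int = 8):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        idx = int(i * (total - 1) / max(num_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # each frame is an (H, W, 3) BGR array
    cap.release()
    return frames

frames = sample_frames("clip.mp4")
print(f"sampled {len(frames)} frames")
```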
