Vision Benchmark Report

mac-bench Vision Benchmark

Automated local benchmark comparing vision-capable LLMs on a Mac mini (Apple M4 Pro, 64 GB RAM). Each model was asked to describe the same set of images, and we measured latency, memory use, throughput, and reliability. The run produced 20 stable results.

Prompt

What the run actually asked

Every model received the exact same prompt and images. The prompt below was sent as the user message alongside each test image.
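
To make the setup concrete, here is a minimal sketch of one timed request, assuming LM Studio's OpenAI-compatible local server on its default port; the model id and the placeholder prompt are illustrative stand-ins, not the values the benchmark actually used:

```python
import base64
import time

import requests

ENDPOINT = "http://localhost:1234/v1/chat/completions"  # LM Studio default; adjust if needed
PROMPT = "Describe this image."  # placeholder, not the benchmark's real prompt

def describe_image(model: str, image_path: str) -> tuple[float, str]:
    """Send the prompt plus one base64-encoded image; return (latency_s, reply)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=300)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    return elapsed, resp.json()["choices"][0]["message"]["content"]
```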

Leaderboard

Fastest reliable results

Top 5 models that successfully processed every image, ranked by average response time. Click a model name to view it in LM Studio.

Speed vs Size

Latency vs memory footprint

Each bubble is a model. Position shows the trade-off between response speed (vertical) and RAM usage (horizontal). Bubble size reflects token throughput. Ideal models sit in the bottom-left corner: fast and light.

Legend: Stable (all images OK) · Partial (some failures) · Failed (no successes)

Latency

Average response time

How long each stable model takes on average to describe an image. Shorter bars are faster.

Reliability

Success rate by model

What fraction of test images each model handled without errors. A full bar means every image got a valid response.

Memory

RAM usage comparison

How much system memory each model requires when loaded. Lower is better, especially on machines with limited RAM.

Throughput

Token generation speed

How many completion tokens each model generates per second. Higher throughput means a model can deliver longer, more detailed responses in the same amount of time.
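
Assuming the server reports an OpenAI-style `usage` block, throughput for one response is:

```python
def throughput(resp_json: dict, elapsed_s: float) -> float:
    """Completion tokens per second for a single response.

    elapsed_s is the wall-clock time for the whole request, so prompt
    processing is included; pure generation speed will be slightly higher.
    """
    tokens = resp_json["usage"]["completion_tokens"]
    return tokens / elapsed_s if elapsed_s > 0 else 0.0
```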

Efficiency

Tokens per second per GiB

A combined efficiency metric: how many tokens each model generates per second for each GiB of RAM it uses. Higher is better — it rewards models that are both fast and lightweight.
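
In code the metric is a single division (field names here are illustrative):

```python
def efficiency(tok_per_s: float, ram_gib: float) -> float:
    """Tokens per second per GiB of resident RAM; rewards fast *and* light models."""
    return tok_per_s / ram_gib if ram_gib > 0 else 0.0

# Example: 42 tok/s in 8 GiB of RAM -> 5.25 tok/s per GiB.
```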

Load Time

Model loading speed

How long each model takes to load into memory from disk. Shorter load times mean faster cold starts.

Latency Spread

Min / Median / Max latency per model

Shows the range of response times across images for each stable model. A tight spread indicates consistent performance; a wide spread means the model is sensitive to image complexity.
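
A per-model spread reduces to the three plotted statistics with the standard library (assuming a non-empty list of per-image latencies):

```python
from statistics import median

def latency_spread(latencies: list[float]) -> tuple[float, float, float]:
    """Return (min, median, max) response time across images for one model."""
    return min(latencies), median(latencies), max(latencies)
```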

Heatmap

Per-image latency heatmap

Each cell shows how long a model took to respond to a specific image. Darker cells mean slower responses. Helps identify which images are hardest for each model.

Format Comparison

GGUF vs safetensors performance

Average latency and throughput grouped by model format. Compares how different weight formats perform on Apple Silicon.
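
A sketch of the grouping, assuming each result carries 'format', 'avg_s', and 'tok_s' keys (hypothetical field names, not the benchmark's actual schema):

```python
from collections import defaultdict
from statistics import mean

def by_format(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average latency and throughput per weight format (e.g. GGUF, safetensors)."""
    groups: defaultdict[str, list[dict]] = defaultdict(list)
    for r in results:
        groups[r["format"]].append(r)
    return {
        fmt: {
            "avg_s": mean(r["avg_s"] for r in rows),
            "tok_s": mean(r["tok_s"] for r in rows),
        }
        for fmt, rows in groups.items()
    }
```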

Speed Tiers

Models by response time tier

Models grouped into speed tiers: under 2s (real-time), 2–5s (interactive), and 5s+ (batch). The green zone marks the sub-2-second target.
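
A tiering rule like the following reproduces those buckets (the boundary handling at exactly 2s and 5s is an assumption; the report does not state which side ties fall on):

```python
def speed_tier(avg_latency_s: float) -> str:
    """Bucket a model by its average per-image response time."""
    if avg_latency_s < 2.0:
        return "real-time (<2s)"
    if avg_latency_s < 5.0:
        return "interactive (2-5s)"
    return "batch (5s+)"
```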

Responses

Side-by-side response comparison

For each test image, compare what every model actually said. Useful for judging response quality and accuracy beyond just speed.

Summary Table

All benchmarked results

Full metrics for every model tested. Click a model name to open it in LM Studio. RAM is the memory used while the model is loaded; Avg/Median are per-image response times; Tok/s is completion token throughput.

Columns: Result · Format · RAM (GiB) · Avg (s) · Median (s) · Tok/s · Success · Reasoning
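
A row could be modeled like this (field names and types are assumptions inferred from the columns, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ResultRow:
    """One summary-table row; fields mirror the columns above."""
    model: str
    format: str       # weight format, e.g. "GGUF" or "safetensors"
    ram_gib: float    # resident memory while the model is loaded
    avg_s: float      # mean per-image response time
    median_s: float   # median per-image response time
    tok_s: float      # completion-token throughput
    success: float    # fraction of images answered without error
    reasoning: bool   # whether the model emits reasoning tokens
```
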
Model Detail

Per-image outputs

Expand any model to see exactly how it responded to each test image, including timing, token counts, and the raw response text.
