What the run actually asked
Every model received the exact same prompt and images. The prompt below was sent as the user message alongside each test image.
An automated local benchmark comparing vision-capable LLMs on a Mac mini with an Apple M4 Pro and 64 GB of RAM. Each model was asked to describe the same set of images, and we measured latency, memory use, throughput, and reliability, yielding 20 stable results.
Top 5 models that successfully processed every image, ranked by average response time. Click a model name to view it in LM Studio.
Each bubble is a model. Position shows the trade-off between response speed (vertical) and RAM usage (horizontal). Bubble size reflects token throughput. Ideal models sit in the bottom-left corner: fast and light.
How long each stable model takes on average to describe an image. Shorter bars are faster.
What fraction of test images each model handled without errors. A full bar means every image got a valid response.
How much system memory each model requires when loaded. Lower is better, especially on machines with limited RAM.
How many completion tokens each model generates per second. Higher throughput means faster, more detailed responses.
A combined efficiency metric: how many tokens each model generates per second for each GiB of RAM it uses. Higher is better — it rewards models that are both fast and lightweight.
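The metric itself is a simple ratio. A minimal sketch, using hypothetical numbers rather than figures from the benchmark table:

```python
def efficiency(tok_per_s: float, ram_gib: float) -> float:
    """Tokens generated per second, per GiB of resident RAM."""
    return tok_per_s / ram_gib

# Illustrative only: a model producing 30 tok/s while occupying 6 GiB
print(efficiency(30.0, 6.0))  # 5.0 tok/s per GiB
```

A model that is twice as fast but uses four times the memory scores lower on this metric, which is the intended trade-off.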
How long each model takes to load into memory from disk. Shorter load times mean faster cold starts.
Shows the range of response times across images for each stable model. A tight spread indicates consistent performance; a wide spread means the model is sensitive to image complexity.
Each cell shows how long a model took to respond to a specific image. Darker cells mean slower responses. Helps identify which images are hardest for each model.
Average latency and throughput grouped by model format. Compares how different weight formats perform on Apple Silicon.
Models grouped into speed tiers: under 2s (real-time), 2–5s (interactive), and 5s+ (batch). The green zone marks the sub-2-second target.
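The tier boundaries above can be expressed as a small classifier; this is a sketch of the bucketing logic, not code from the benchmark itself:

```python
def speed_tier(avg_latency_s: float) -> str:
    """Bucket an average per-image latency into the report's tiers."""
    if avg_latency_s < 2.0:
        return "real-time"    # under 2 s
    if avg_latency_s < 5.0:
        return "interactive"  # 2-5 s
    return "batch"            # 5 s and above

print(speed_tier(1.4))  # real-time
```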
For each test image, compare what every model actually said. Useful for judging response quality and accuracy beyond just speed.
Full metrics for every model tested. Click a model name to open it in LM Studio. RAM is the memory used while the model is loaded; Avg/Median are per-image response times; Tok/s is completion token throughput.
| Result | Format | RAM (GiB) | Avg (s) | Median (s) | Tok/s | Success | Reasoning |
|---|---|---|---|---|---|---|---|
Expand any model to see exactly how it responded to each test image, including timing, token counts, and the raw response text.