arxiv:2508.08531

Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Published on Aug 12, 2025

Authors:

Abstract

Apple Silicon's unified memory architecture provides efficient on-device LLM inference performance that rivals NVIDIA GPUs, with detailed analysis of hardware metrics and quantization effects across multiple device configurations.

AI-generated summary

A systematic understanding of Apple Silicon is lacking in the current landscape of hardware efficiency; research focus is largely centered on accelerating GPUs for large-scale training or inference on CUDA devices. This paper investigates Apple Silicon's unique memory architecture that offers a unified memory integrating CPU and GPU memory and its implications for on-device LLM inference. We decipher myths about whether Apple Silicon is efficient for on-device inference compared to competitors such as NVIDIA GPUs by directly conducting latency and throughput comparison benchmarks. We explain the performance gap between them through profiling low level hardware metrics - ALU utilization, memory bandwidth, buffer usage, cache residency etc. at runtime. We draw several insights regarding performance bottlenecks such as dequantization overhead, compute throughput and memory bandwidth. We debunk existing false claims regarding large language model inference such as compressing models to lower bit precision is a defacto promise for faster inference across all hardware platforms. We find that the large unified memory enables Apple Silicon to be both cost effective and efficient against NVIDIA GPUs for ultra large language models. Our large scale evaluation on 5 hardware testbeds incorporating three Apple M-series devices: M2 Ultra, M2 Max and M4 Pro and two NVIDIA GPUs: NVIDIA RTX A6000, a multi GPU setup with 2xNVIDIA RTX A6000, 5 model scales ranging from 8B to 405B parameters and 14 quantization schemes gives an understanding of how Apple Silicon fits within the paradigm of on-device LLM inference. Our analysis reveals multiple resource interdependencies and unexpected findings, while also quantifying established insights. To the best of our knowledge, this study makes the first attempt to present a thorough characterization and analysis of Apple Silicon for on-device inference.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.08531 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.08531 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.08531 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.