I have been running Qwen3 locally, and it can be a bit slow: my GPU is not up to the task, so the model ends up running on my CPU. With GPU shortages a real concern, I started digging into how Tensor Processing Units (TPUs) work and whether they could be useful for local LLM inference. Here is what I found.
What is a TPU, exactly?
TPUs (Tensor Processing Units) are Google's custom-designed chips built specifically to accelerate machine learning workloads, particularly the matrix math that dominates neural network training and inference. General-purpose CPUs are flexible but inefficient for this kind of work. GPUs helped considerably, since their parallel architecture handles matrix operations far better than CPUs. But GPUs were designed for graphics first and ML second. Google built TPUs from the ground up with one goal: run tensor operations as fast as possible, as efficiently as possible.
The core architecture
The heart of a TPU is a systolic array: a grid of simple multiply-accumulate (MAC) units through which data flows rhythmically. Each unit multiplies an incoming value by a locally held weight, adds the product to a running partial sum, and passes results along to its neighbors. This maps perfectly onto matrix multiplication, which is essentially what a neural network's forward and backward passes reduce to.
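To make the dataflow concrete, here is a toy NumPy sketch of the multiply-accumulate pattern. It is purely illustrative: a real systolic array pipelines these MACs in hardware, with values reused in flight, rather than looping in software.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy model of a weight-stationary systolic array computing A @ B.

    Imagine cell (p, j) of the grid holding weight B[p, j]. Activations
    from A stream across the rows; each cell does one multiply-accumulate
    per streamed value and passes the partial sum onward. Here we just
    model the per-cell MAC work directly.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):          # each row of activations streamed in
        for p in range(k):      # value A[i, p] flows across grid row p
            for j in range(m):  # cell (p, j) multiplies by its held weight
                C[i, j] += A[i, p] * B[p, j]  # the MAC: multiply, add, pass on
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The triple loop is exactly the work a matmul requires; the hardware win comes from doing every iteration of the inner loops simultaneously across the grid.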
The key insight is that data reuse is baked into the hardware. Values flow through the grid and get used multiple times in flight, dramatically reducing memory bandwidth pressure. That bandwidth bottleneck is typically what limits performance in ML workloads, so designing around it yields real gains.
TPUs also lean heavily on reduced precision formats. Google invented the bfloat16 format specifically for this purpose, and INT8 inference is a first-class concern. Neural networks are surprisingly tolerant of numerical imprecision, so you get roughly 2 to 4x the throughput by using smaller data types without meaningful quality loss.
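A minimal sketch of the idea behind INT8 quantization, using symmetric per-tensor scaling. This is illustrative of why low precision is tolerable, not a description of the TPU's actual quantization scheme:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map weights into [-127, 127] with one scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(dequantize(q, scale) - w).max()

# Rounding error is bounded by half a quantization step, which is why
# networks tolerate it: the noise is tiny relative to the weight range.
assert max_err <= scale / 2 + 1e-6
```

Each INT8 weight is a quarter the size of an FP32 one, which is where the throughput and bandwidth gains come from.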
TPU vs GPU: A Quick Comparison
| Attribute | GPU | TPU |
|---|---|---|
| Design goal | General parallel compute | Tensor ops only |
| Flexibility | High (CUDA ecosystem) | Lower |
| Memory bandwidth | High | Very high (HBM) |
| Precision focus | FP32 / FP16 / BF16 | BF16 / INT8 |
| Programming | CUDA / ROCm | XLA via JAX or TF |
| Ecosystem | Mature, massive | Narrower, growing |
GPUs win on flexibility and ecosystem maturity. TPUs win on efficiency for the specific workloads they target. Google claims TPUs deliver significantly better performance per watt for training large models.
Note on the software moat
CUDA has a decade-plus head start. Every major ML framework, library, and developer workflow is built around it. TPUs are programmed through the XLA compiler, typically via JAX or TensorFlow, which has historically meant a steeper learning curve. That said, AI coding tools are making JAX considerably more approachable, which could meaningfully shift this dynamic over the next few years.
How TPUs Are Used for LLMs
TPUs are actively used for LLM inference at scale, primarily through Google's own products. Gemini across all its model sizes runs inference on TPU infrastructure in Google's data centers. Google Search's AI features, Gemini in Workspace, and the Gemini API are all serving responses off TPU hardware. This is arguably the largest LLM inference operation in the world.
Why TPUs suit LLM inference
During inference the bottleneck shifts compared to training. The challenge becomes memory bandwidth rather than raw compute, because you are loading billions of model weights from memory for each token generated. TPUs with High Bandwidth Memory are well-suited here, and recent TPU generations have been specifically tuned for inference workloads alongside training.
When serving thousands of requests simultaneously, you can batch them together and keep the systolic arrays saturated. Google's serving infrastructure compiles inference graphs ahead of time via XLA, so the serving path is highly optimized rather than relying on dynamic dispatch at runtime.
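A back-of-envelope calculation makes the bandwidth argument concrete. The model size and bandwidth figures below are assumed round numbers for illustration, not specs for any particular chip:

```python
model_bytes = 7e9 * 2    # hypothetical 7B-parameter model in BF16 (2 bytes/param)
bandwidth = 1.2e12       # assumed accelerator memory bandwidth: ~1.2 TB/s

# At batch size 1, every generated token needs one full read of the weights,
# so tokens/sec is capped at bandwidth / model size. Batching reuses each
# weight read across the whole batch, so throughput scales with batch size
# until compute, not bandwidth, becomes the limit.
for batch in (1, 8, 64):
    tok_s = batch * bandwidth / model_bytes
    print(f"batch {batch:>2}: ~{tok_s:,.0f} tokens/s (bandwidth-limited ceiling)")
```

In practice the ceiling is lower, since the KV cache also consumes bandwidth, but the scaling argument is why batched serving on high-bandwidth hardware is so effective.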
What about external developers?
Google Cloud's TPU v5e was explicitly positioned for inference workloads and is available to cloud customers running their own models. Hugging Face has done integration work so models from their hub can run on Cloud TPUs. For most developers outside of Google, though, NVIDIA GPUs remain dominant because the tooling (vLLM, TensorRT-LLM, llama.cpp) is GPU-native, and most open-weight models are released with PyTorch weights rather than JAX.
What About the Coral Edge TPU?
This is where my research took an interesting turn. Consumer-accessible TPU hardware does exist: the Google Coral line includes USB sticks, PCIe cards, and dev boards. As a hardware engineer, I found this naturally appealing. Could something like Qwen3 run on one?
The short answer is no, and the reasons are instructive.
Coral's hard constraints
- 4MB of on-chip SRAM only, with no external memory interface
- INT8 quantized models exclusively, no floating point support
- Approximately 4 TOPS of compute
- TensorFlow Lite models only, compiled through a specific Coral toolchain
The Coral was designed for a completely different problem class: real-time image classification on a Raspberry Pi, keyword spotting, anomaly detection on sensor data. Think MobileNet and EfficientDet, not transformers.
Why Qwen3 is a non-starter on Coral
Even the smallest Qwen3 model at 0.6B parameters presents an insurmountable problem. A 0.6B parameter model in INT8 is roughly 600MB of weights. The Coral TPU has 4MB of on-chip SRAM and no way to stream weights from external memory the way a CPU or GPU can. The entire accelerated model must fit on-chip. Memory alone makes it impossible, before even considering that transformer attention mechanisms do not map cleanly onto what the Coral compiler can handle.
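The arithmetic is worth spelling out, using the figures above:

```python
params = 0.6e9                 # Qwen3-0.6B, the smallest model in the family
weight_mb = params * 1 / 1e6   # INT8 uses one byte per parameter -> ~600 MB
coral_sram_mb = 4              # Coral's on-chip SRAM (figure quoted above)

shortfall = weight_mb / coral_sram_mb
print(f"weights: ~{weight_mb:.0f} MB vs {coral_sram_mb} MB SRAM "
      f"(~{shortfall:.0f}x too large)")
```

A two-orders-of-magnitude gap, before accounting for activations or the KV cache, and with no external memory to spill into.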
Key insight
The Coral Edge TPU predates the LLM era entirely. Its constraints reflect a different generation of ML thinking, where small, static, quantized models were the target. It is genuinely excellent at that problem. It is simply a different category of hardware from what LLM inference requires.
Realistic Hardware for Local LLM Inference
If the goal is faster local LLM inference without relying on cloud GPUs, here is what actually delivers:
Raspberry Pi 5 — CPU-only, but llama.cpp runs small models at usable speeds. A reasonable low-cost starting point.
Apple Silicon (M-series) — Unified memory architecture is surprisingly strong for LLM inference per dollar. Even an M1 Mac Mini punches well above its price.
NVIDIA Jetson Orin — The serious edge ML platform. Handles small LLMs well with proper GPU acceleration.
Hailo-8 / 8L — More capable edge AI chips with greater memory headroom than Coral. Worth watching as the ecosystem matures.
For pure CPU inference today, a well-specced x86 machine running llama.cpp with a quantized small model (Qwen3-0.6B or 1.7B in Q4) is often the most practical path for tinkering without significant hardware investment.
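For sizing purposes, here is a rough footprint estimate for Q4 models. The ~4.5 effective bits per parameter is an assumed ballpark for common llama.cpp Q4 variants once quantization overhead is included, not an exact figure:

```python
def q4_size_gb(params_billions, bits_per_param=4.5):
    """Approximate in-RAM size of a Q4-quantized model (rough ballpark)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for p in (0.6, 1.7):
    print(f"~{p}B params in Q4: ~{q4_size_gb(p):.2f} GB")
```

Both sizes fit comfortably in ordinary system RAM, which is why CPU-only inference on small quantized models is workable at all.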
TPUs are remarkable hardware: purpose-built, efficient, and quietly responsible for serving a significant fraction of the world's AI inference workloads. They are just not accessible in the way GPUs are for consumer and hobbyist use. The consumer Coral line occupies a completely different niche and is not suited for LLMs in any meaningful sense.
For local LLM experimentation, the most practical upgrade path today is either Apple Silicon for the unified memory advantage, or waiting for the GPU market to normalize and picking up a used NVIDIA card with sufficient VRAM. The TPU rabbit hole was worth exploring, even if it led back to familiar ground.