Part 2 of the Research Notes series. Part 1: TPUs and LLMs: A Hardware Engineer's Research Rabbit Hole
It Started at CERN
After going deep on TPUs, I thought the hardware rabbit hole was closed. Then I came across this piece in IEEE Spectrum about AI and particle physics. The part that stopped me was a throwaway detail buried in the middle of the article: at the Large Hadron Collider, the detectors record 40 million particle collisions per second. There is no way to save all that data. So engineers build filters to decide what is interesting and what gets thrown away. And increasingly, those split-second decisions are being handed to machine learning models.
Running on FPGAs.
The article even quotes a theorist pleading with an engineer: "Which of my algorithms fits on your bloody FPGA?" That line alone made me want to understand what is so special about these chips that CERN is betting particle physics discoveries on them.
The CERN Connection
At the LHC, ML systems running on field-programmable gate arrays make real-time decisions about which collision data gets saved, because no other hardware can match that latency profile at scale. The code has to run on the chip's limited logic and memory, which means compressing a neural network into that hardware is a genuine engineering challenge. — IEEE Spectrum, March 2026
That sent me down a new rabbit hole entirely. What makes FPGAs so compelling for ML inference? And the question that actually drove the research: could a consumer-grade FPGA run something like Qwen3 better than my CPU is doing right now?
What is an FPGA?
A Field-Programmable Gate Array is a chip that is, by design, unfinished. Unlike a CPU or GPU where the silicon logic is permanently baked in at fabrication, an FPGA ships as a blank canvas of configurable logic blocks, lookup tables (LUTs), DSP slices, and on-chip memory. You program it not with software, but with a hardware description language like VHDL or Verilog that literally defines the circuit. Every time you power it on, it loads that configuration and becomes the specific piece of hardware you described.
This is the property that makes FPGAs so unusual. A GPU is a fixed parallel processor. An ASIC is a custom chip frozen in silicon forever. An FPGA sits between them: it can be reconfigured, but when it is configured, it behaves like dedicated custom hardware for exactly the task you described. You are not running code on a processor. You are building a processor optimized for your workload.
The key building blocks
- LUTs (Lookup Tables) — The basic logic unit: a small memory that can implement any Boolean function of its N inputs. Essentially programmable combinational logic.
- DSP Slices — Hard-wired multiply-accumulate units. Fast, power-efficient, and directly relevant to matrix math.
- BRAM / URAM — On-chip block RAM and ultra RAM. High bandwidth, single-cycle-class latency versus hundreds of nanoseconds for anything off-chip. The resource that determines what fits on-chip.
- Configurable interconnect — The programmable routing fabric that connects everything. This is what makes FPGAs flexible, and also what makes them slower clock-for-clock than a GPU.
- HBM (high-end boards) — High Bandwidth Memory stacked directly on the package, found on data center cards like the Alveo series. Changes the memory bandwidth equation dramatically.
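The LUT concept from the list above is easy to sketch in software. A hypothetical N-input LUT is nothing more than a table of 2^N precomputed output bits indexed by the input bits, which is exactly how the FPGA fabric turns arbitrary logic into memory reads:

```python
def make_lut(func, n_inputs):
    """Precompute a truth table for an n-input Boolean function.

    This mirrors what FPGA synthesis does: any function of N bits
    becomes a 2^N-entry memory, and "evaluation" is just a lookup.
    """
    return [func(*((i >> b) & 1 for b in range(n_inputs)))
            for i in range(2 ** n_inputs)]

def eval_lut(table, *inputs):
    # Pack the input bits into an index, then read the table.
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return table[index]

# A 3-input majority function, "synthesized" into an 8-entry LUT.
majority = make_lut(lambda a, b, c: int(a + b + c >= 2), 3)
print(eval_lut(majority, 1, 1, 0))  # 1
```

Real LUTs are typically 6-input on modern Xilinx parts, and the synthesis tools decompose larger functions into networks of them.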
Can you program an FPGA to be a CPU or GPU?
Yes, and this actually happens in practice. Xilinx ships a soft CPU core called MicroBlaze, ARM licenses soft cores for FPGA implementation, and the open-source RISC-V ecosystem has produced several implementations that run on FPGAs routinely. You can also implement GPU-like parallel processing architectures, and researchers have built soft GPU cores. The catch is that an FPGA's reconfigurable fabric runs at far lower clock speeds than hard silicon (typically 200–500MHz vs a GPU's 2GHz+), so a soft GPU will never match the real thing in raw throughput. The interesting flip side of this is that AMD acquired Xilinx in 2022 partly to combine FPGA fabric with CPU and GPU silicon on the same package, which is exactly what the Versal and Alveo product lines do. The FPGA does not replace the GPU; it works alongside it, handling the workloads it is better suited for.
Why FPGAs for AI and ML?
The CERN use case reveals something fundamental. The LHC's trigger system needs to process data with microsecond latency, in real time, continuously. GPUs are powerful but they have overhead: you enqueue work, the driver dispatches it, the CUDA runtime manages it. That software stack takes time. FPGAs have no runtime. The circuit is the computation. Data flows in, and results flow out, with deterministic latency measured in nanoseconds.
For ML inference, this maps onto a concept called dataflow architecture. Rather than running a neural network layer by layer on a general-purpose processor, you build a pipeline where distinct hardware units handle specific operations, connected by streaming buffers. Data flows continuously through the pipeline. Multiple layers execute in parallel. The bottleneck of repeatedly reading weights from memory gets dramatically reduced because data is passing between on-chip units, not shuttling back and forth to DRAM.
The Dataflow Advantage
FPGA spatial architectures specialize distinct processing units for specific operators or layers, facilitating direct communication between them through streaming buffers. This dataflow execution substantially reduces off-chip memory accesses and enables concurrent processing of multiple pipeline stages. — Chen et al., 2023
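To make the dataflow idea concrete, here is a toy sketch in plain Python, nothing FPGA-specific: each "layer" is a generator stage connected by streaming, so one element can be in layer 2 while the next is still in layer 1, the software analogue of pipeline stages executing concurrently.

```python
def stage(fn, upstream):
    """A pipeline stage: applies fn to each item as it streams past.

    On an FPGA each stage would be its own hardware unit with a FIFO
    in front of it; here generators stand in for the streaming fabric.
    """
    for item in upstream:
        yield fn(item)

# A toy 3-layer "network": scale, add bias, clamp (ReLU-like).
source = iter([1.0, -2.0, 3.0])
pipeline = stage(lambda x: max(x, 0.0),          # layer 3: ReLU
           stage(lambda x: x + 0.5,              # layer 2: bias
           stage(lambda x: x * 2.0, source)))    # layer 1: scale

print(list(pipeline))  # [2.5, 0.0, 6.5]
```

The key property: no stage ever writes an intermediate result back to main memory. Data moves stage to stage, which is the off-chip-access reduction the quote describes.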
There is also the reconfigurability angle. A GPU has fixed Tensor Cores designed for standard matrix multiply in standard precisions. An FPGA lets you design compute units tailored to unusual numeric formats: 4-bit integers, ternary weights (values of just -1, 0, +1), or entirely custom arithmetic. As neural network research pushes toward increasingly aggressive quantization to reduce model size, FPGAs become increasingly interesting because they can be built to exploit those quantization schemes natively.
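As a rough illustration of why ternary weights are so hardware-friendly, here is a naive post-training ternarizer. This is a simplification for intuition only: real ternary models are trained with this constraint from scratch. The point is that every weight collapses to -1, 0, or +1 plus one shared scale, so a "multiply" degenerates into add, subtract, or skip, operations an FPGA can implement in bare LUT fabric without touching a DSP slice.

```python
def ternarize(weights, threshold=0.5):
    """Naive ternarization: map each weight to {-1, 0, +1}.

    A shared scale preserves magnitude; the threshold (relative to
    the mean absolute weight) decides which weights become zero.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    cut = threshold * scale
    ternary = [0 if abs(w) < cut else (1 if w > 0 else -1) for w in weights]
    return ternary, scale

def ternary_dot(ternary, scale, activations):
    # No weight multiplies: just add, subtract, or skip each activation.
    acc = sum(a for t, a in zip(ternary, activations) if t == 1)
    acc -= sum(a for t, a in zip(ternary, activations) if t == -1)
    return acc * scale

w = [0.9, -0.05, -1.1, 0.4]
t, s = ternarize(w)
print(t)  # [1, 0, -1, 1]
```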
The Memory Wall Problem (Again)
In the TPU article I spent time on the memory wall: the bottleneck in LLM inference is not compute, it is the cost of loading billions of model weights from memory for every token generated. This killed the Coral Edge TPU dream, and it is the same wall FPGAs have to deal with.
But FPGAs attack it differently. Instead of working around the wall, research is exploring how to make it irrelevant by keeping as much of the model on-chip as possible, or by redesigning the computation so that memory accesses happen less frequently and more efficiently.
The key insight from recent research is that LLM inference at small batch sizes is almost entirely memory-bound, not compute-bound. A GPU at batch size 1 is catastrophically underutilized because its thousands of CUDA cores sit idle waiting for memory. An FPGA's dataflow pipeline, by contrast, can be sized exactly to the memory bandwidth available, with no wasted silicon. This 2023 paper from Cornell and CMU showed that FPGA spatial acceleration can actually outperform an NVIDIA A100 GPU in the decode stage of LLM inference, specifically in the single-batch latency scenario that matters for local use, with 5.7x better energy efficiency.
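The memory-bound claim is easy to sanity-check with back-of-envelope arithmetic. At batch size 1, every weight must be read once per generated token, so memory bandwidth sets a hard ceiling on tokens per second regardless of how much compute is available. The bandwidth figures below are illustrative round numbers, not datasheet values:

```python
def decode_ceiling_tok_s(params_billions, bytes_per_weight, bandwidth_gb_s):
    """Upper bound on decode tokens/sec for a weight-streaming model.

    Each token requires reading all weights once, so:
        tokens/sec <= bandwidth / model size in bytes.
    """
    model_bytes = params_billions * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / model_bytes

# Qwen3 1.7B at 4-bit (0.5 bytes/weight) on illustrative memory tiers:
for name, bw in [("dual-channel DDR4 (~50 GB/s)", 50),
                 ("one HBM2 stack (~460 GB/s)", 460)]:
    print(f"{name}: <= {decode_ceiling_tok_s(1.7, 0.5, bw):.0f} tok/s")
```

The ceiling scales linearly with bandwidth and inversely with model size, which is exactly why the batch-1 decode stage rewards hardware sized to the memory system rather than to peak FLOPS.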
Can an FPGA Run Qwen3?
Here is where the research got genuinely exciting. It turns out there is a November 2025 paper that does exactly this: runs Qwen3 1.7B on an FPGA. Not a simulation. An actual implementation.
The paper is called LUT-LLM, from researchers at UCLA and Microsoft Research Asia. The key idea is radical: instead of doing matrix multiplication with arithmetic operations, replace it entirely with lookup tables. Pre-compute the possible dot products between quantized weights and quantized activations, store the results in a table, and then during inference just look up the answer rather than calculating it. FPGAs are literally built around lookup tables as their fundamental unit of logic, so this plays directly to the hardware's strengths.
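A stripped-down sketch of the idea follows. This is my own simplification for intuition, not the paper's actual kernel: if activations are quantized to a small codebook, the product of each weight with every possible activation code can be precomputed once, so inference replaces multiply-accumulates with table reads and additions.

```python
def build_tables(weights, codebook):
    """Precompute weight * code for every (weight, activation-code) pair.

    With a 2-bit activation codebook (4 entries), each weight yields a
    4-entry table; a dot product becomes one table read per element.
    """
    return [[w * c for c in codebook] for w in weights]

def lut_dot(tables, codes):
    # No multiplies at inference time: just indexed reads and adds.
    return sum(table[code] for table, code in zip(tables, codes))

codebook = [-1.0, -0.25, 0.25, 1.0]   # hypothetical 2-bit activation codes
weights = [0.5, -2.0, 1.5]
tables = build_tables(weights, codebook)

codes = [3, 0, 2]                     # activations quantized to code indices
print(lut_dot(tables, codes))         # 0.5*1.0 + (-2.0)*(-1.0) + 1.5*0.25
```

The real scheme operates on weight sub-vectors rather than scalars and co-designs the quantization with the table layout, but the structural trick is the same: arithmetic moves into precomputation, and the hot loop becomes pure lookup, which is the one operation FPGAs are made of.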
| Metric | Result |
|---|---|
| Latency vs AMD MI210 GPU | 1.66x lower |
| Energy efficiency vs NVIDIA A100 | 1.72x better |
| Memory bandwidth reduction | 7x less used |
Those numbers are from the LUT-LLM paper, implemented on an AMD Alveo V80 FPGA running a Qwen3 1.7B model with activation-weight co-quantization. Beating an A100 in energy efficiency while using a fraction of the memory bandwidth is a result that deserves attention. The paper also projects the same approach scaling to Qwen3 32B with a 2.16x energy efficiency gain over the A100.
Separately, TerEffic, a February 2025 paper, demonstrated ternary-quantized LLM inference on an AMD U280 FPGA achieving 16,300 tokens per second on a 370M parameter model, which is 192x higher throughput than a Jetson Orin Nano, at 19x better power efficiency. For a larger 2.7B model using HBM-assisted inference, they hit 727 tokens per second at 46W, which is 3x the throughput of an A100 in that configuration.
What makes LUT inference work
Activation-weight co-quantization reduces the number of unique entries needed in the lookup tables because multiple weight vectors map to the same centroid. This reduces both the memory footprint and the bandwidth pressure simultaneously, which is why the approach can outperform GPUs despite using less bandwidth. — LUT-LLM, 2025
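The quote's point about shared centroids can be shown with a small count. In this hypothetical sketch, six weight rows cluster onto two centroids, so only two lookup tables ever need to be built and stored, however many rows the matrix has:

```python
def nearest_centroid(vec, centroids):
    """Assign a weight vector to its closest centroid (squared distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, cent)) for cent in centroids]
    return dists.index(min(dists))

# Hypothetical: 6 weight rows collapse onto 2 shared centroids,
# so only 2 lookup tables are needed instead of 6.
centroids = [[1.0, 0.0], [0.0, 1.0]]
rows = [[0.9, 0.1], [1.1, -0.1], [0.1, 0.9],
        [0.95, 0.0], [0.0, 1.05], [-0.1, 1.0]]
assignments = [nearest_centroid(r, centroids) for r in rows]
print(assignments)            # [0, 0, 1, 0, 1, 1]
print(len(set(assignments)))  # 2 unique tables needed
```

Fewer unique table entries means less on-chip memory and less bandwidth per token, which is the mechanism behind the 7x bandwidth reduction in the results table.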
The COTS Reality Check
Before you start shopping, it is worth grounding expectations.
The AMD Alveo V80 that LUT-LLM ran on is a data center accelerator card. It costs somewhere in the range of $10,000 to $15,000. The U280 used in TerEffic and FlightLLM work is more accessible but still runs several thousand dollars on the used market. These are not hobbyist boards.
Consumer FPGA options look like this:
| Board | On-chip RAM | External Memory | Approx. Cost | LLM Viability |
|---|---|---|---|---|
| Xilinx Artix-7 | ~4.8MB BRAM | DDR3 (slow BW) | $50–150 | Tiny models only, severe limits |
| Zynq UltraScale+ | ~11MB BRAM+URAM | DDR4 | $200–500 | Sub-100M models feasible |
| AMD Kria KV260 | ~11MB BRAM+URAM | 4GB DDR4 | ~$300 | Research platform, tight fit |
| Alveo U50 | ~28MB + HBM | 8GB HBM2 | $1,500–2,500 | Small LLMs feasible with work |
| Alveo U280 | ~42MB + HBM | 8GB HBM2 | $3,000–6,000 | Research papers run 1–7B models here |
The consumer boards face the same fundamental constraint as the Coral Edge TPU: not enough memory bandwidth or capacity to stream the weights of a modern LLM at useful speed. The gap between a Zynq and an Alveo U280 is not just price; it is the HBM that makes the difference for anything above a few hundred million parameters.
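The viability column in the table follows directly from capacity arithmetic. Treating the on-chip figures as rough budgets (illustrative, not datasheet-exact), here is what fits fully on-chip at different quantization widths:

```python
def weights_mb(params_millions, bits_per_weight):
    """Model weight footprint in MB for a given quantization width."""
    return params_millions * 1e6 * bits_per_weight / 8 / 1e6

# Illustrative fits against rough on-chip budgets from the table above:
for params, bits, budget, board in [(50, 1.6, 11, "Zynq US+"),
                                    (50, 4.0, 11, "Zynq US+"),
                                    (1700, 4.0, 42, "Alveo U280")]:
    mb = weights_mb(params, bits)
    verdict = "fits on-chip" if mb <= budget else "needs off-chip streaming"
    print(f"{params}M @ {bits}-bit on {board}: "
          f"{mb:.0f} MB vs {budget} MB budget -> {verdict}")
```

A 4-bit 1.7B model is around 850MB of weights, far beyond any on-chip budget, which is why the billion-parameter results in the research all lean on HBM to stream weights at speed rather than holding them resident.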
Can you just add more memory to an FPGA board?
It is a reasonable instinct, and the answer is: not really, at least not in the way you are probably imagining. The memory interface on an FPGA board (DDR4, HBM) is routed through fixed I/O pins on a PCB designed specifically for those chips. You cannot solder more RAM onto a board that was not designed for it without a full PCB redesign. The two workarounds people do use are connecting external memory expansion boards via PCIe or high-speed interfaces, and writing custom memory controllers in the FPGA fabric to reach whatever storage is physically accessible. Neither is straightforward. The harder truth for HBM boards is that the memory is stacked directly on the chip package during fabrication. It is physically part of the assembly and completely non-modifiable after the fact. This is part of why HBM bandwidth numbers are so good, and also why the memory spec of the board you buy is the memory spec you are stuck with.
That said, the ternary quantization direction is genuinely promising for smaller form factors. TeLLMe v2 demonstrates end-to-end ternary LLM inference achieving up to 25 tokens per second on edge FPGAs, targeting boards much closer to the consumer tier. The catch: ternary models require quantization-aware training from scratch, not just post-training quantization of an existing Qwen3 checkpoint. You cannot simply convert a standard model.
The tooling gap
Unlike GPUs where you load a GGUF file into llama.cpp and run, FPGA-based LLM inference requires compiling a hardware design, synthesizing it, and generating a bitstream specific to your FPGA and your model. AMD's Vitis HLS toolchain is the main path, and the compile times are measured in hours, not seconds. This is active research territory, not a plug-and-play solution yet.
The Bigger Picture
FPGAs represent a fundamentally different philosophy from every other accelerator covered in this series. A GPU is a fixed piece of silicon and you adapt your workload to it. An FPGA lets you adapt the hardware to your workload. As LLM research pushes toward more aggressive quantization (ternary, binary, lookup-table approaches), the hardware that can natively exploit those representations becomes more compelling.
The CERN use case is a perfect illustration of where FPGAs genuinely win: when latency is paramount, batch size is one, and the application justifies building custom hardware. Local LLM inference shares two of those three properties. The question is whether the third one (justifying the effort) crosses the threshold for hobbyists and developers as tooling matures.
The research trajectory is moving fast. In 2024, running a 1B+ parameter LLM on an FPGA was considered a hard problem. By November 2025, there was a paper running Qwen3 1.7B on an FPGA and beating an A100 on energy efficiency. A year from now, the picture will look different again.
The particle physicists at CERN figured out how to compress neural networks into FPGAs tight enough to catch new physics at 40 million collisions per second. The hobbyist running Qwen3 on a CPU at 3 tokens per second is working with different constraints, but the underlying hardware philosophy is the same: stop fighting the memory wall, and redesign the computation around it.
The rabbit hole is still open. The next one probably involves ternary quantization, BitNet, and whether training a purpose-built small model for FPGA deployment makes sense for edge applications. That is a question for another article.