Wllama

May 26th, 2026

Transformers.js provides a JavaScript API for running machine learning models in the browser. It’s super cool, and it has worked well for a few side quests I’ve built, most recently Whimscribe. Under the hood, Transformers.js uses ONNX Runtime, which means models generally need to be available in ONNX form. Recently, I came across wllama, which takes a different route. It is a WebAssembly binding for llama.cpp that runs inference directly in the browser using WebAssembly SIMD and, more recently, optional WebGPU support.

The key difference, at least for my use cases, is model format. wllama accepts GGUF models, the format used across the llama.cpp ecosystem. That opens up the possibility of using a large number of small, quantized LLMs that may not have ready-to-use ONNX equivalents.

I wanted to understand where wllama sits in practice. If roughly the same model is available in both ONNX and GGUF form, are the browser runtimes in the same performance ballpark?

The recently published Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU paper was a useful starting point. It provides a number of benchmarks and argues that browser LLM inference is becoming increasingly practical, while still acknowledging that it is constrained by memory, hardware variability, model formats, and browser support.

I decided to run a small-scale test.

This is by no means a definitive benchmark! I used one machine, one model, one prompt, one browser, and a tiny number of runs. I mainly wanted to get a feel for the rough runtime characteristics.

Test setup: Bonsai 1.7B q1

For this test I used Bonsai 1.7B, a small 1-bit language model. I ran the GGUF version through wllama (prism-ml/Bonsai-1.7B-gguf) and an ONNX version through Transformers.js (onnx-community/Bonsai-1.7B-ONNX).

I tested each runtime with the same short prompt and the same output lengths:

64 tokens
128 tokens
256 tokens
512 tokens

For each output length I ran five samples and reported the median.
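
As a minimal sketch of the aggregation step, here is one way to take the median of a set of samples. The function name and the sample values in the comment are made up for illustration; they are not taken from the benchmark code.

```typescript
// Median of a set of timing samples (e.g. five load/init times in ms).
// The median is less sensitive to a single slow outlier run than the mean.
function median(samples: number[]): number {
  if (samples.length === 0) {
    throw new Error("median of an empty sample set is undefined");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical usage with five made-up samples:
// median([812, 790, 800, 805, 761])
```

With only five samples per configuration, the median is simply the third value once the samples are sorted.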

The benchmark recorded:

load/init time
time to first emitted token
decode tokens/sec
generation time

All runs were performed after the model files were already cached by the browser. Load/init time measures runtime/model initialisation after page load, not network download time. Each sample created a fresh runtime instance. Generation timing starts immediately before calling the generation API and ends after the final streamed token.
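
The timing definitions above can be sketched as a small pure function over timestamps. This is an illustration, not the benchmark's actual code: `summarizeRun` and `RunMetrics` are hypothetical names, and the decode-rate definition (tokens after the first, divided by the time between the first and last token) is an assumption. The figures in the results table look consistent with it (for example, 135 ms plus 63 tokens at 66.24 tok/s comes to about 1086 ms against a reported 1085 ms), but that is an inference from the numbers.

```typescript
interface RunMetrics {
  firstEmitMs: number;      // start of generation call -> first streamed token
  generationMs: number;     // start of generation call -> final streamed token
  decodeTokPerSec: number;  // assumed: tokens after the first / time after the first
}

// startMs: timestamp taken immediately before calling the generation API.
// tokenTimesMs: one timestamp per streamed token, in arrival order.
function summarizeRun(startMs: number, tokenTimesMs: number[]): RunMetrics {
  if (tokenTimesMs.length < 2) {
    throw new Error("need at least two tokens to compute a decode rate");
  }
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodeMs = last - first;
  if (decodeMs <= 0) {
    throw new Error("token timestamps must be increasing");
  }
  return {
    firstEmitMs: first - startMs,
    generationMs: last - startMs,
    // The first token is excluded because its latency includes prompt processing.
    decodeTokPerSec: ((tokenTimesMs.length - 1) * 1000) / decodeMs,
  };
}
```

In a browser, the timestamps would typically come from `performance.now()`, recorded once before the call and once inside whatever per-token streaming callback the runtime exposes.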

I ran this on my MacBook Pro M3 Pro (36GB RAM) using Google Chrome (Version 148.0.7778.179 (arm64)) with WebGPU enabled for both runtimes.

Results

Here is the summary from this run, using Bonsai 1.7B q1 with WebGPU and a short 18-token prompt:

Output tokens	Runtime	Load/init	First emit	Decode	Generation
64	Transformers.js	2538 ms	135 ms	66.24 tok/s	1085 ms
64	wllama	800 ms	155 ms	60.26 tok/s	1201 ms
128	Transformers.js	2505 ms	134 ms	66.27 tok/s	2051 ms
128	wllama	753 ms	156 ms	60.14 tok/s	2268 ms
256	Transformers.js	2523 ms	134 ms	65.29 tok/s	4041 ms
256	wllama	737 ms	157 ms	58.75 tok/s	4499 ms
512	Transformers.js	2523 ms	134 ms	61.75 tok/s	8409 ms
512	wllama	734 ms	155 ms	58.00 tok/s	8965 ms

Off the bat, I would not read this as “Transformers.js is faster than wllama”. This only says that for this model, on my machine, with my browser and prompt, Transformers.js had slightly better sustained decode while wllama initialised much faster.

Transformers.js decoded at around 62–66 tokens/sec, compared with wllama at around 58–60 tokens/sec. It also reached the first emitted token slightly sooner: around 134–135 ms versus 155–157 ms.

The biggest difference was load/init time. wllama stayed between 734 ms and 800 ms, while Transformers.js was between 2505 ms and 2538 ms, a little over three times as long to get ready. That gap matters more than it might seem. In a lot of browser ML applications, the model isn’t sitting warm and ready; it gets initialised in response to something the user does. Think of a transcription app that loads a model on first upload. In these cases, the user is waiting. 800 ms can still feel fairly responsive. 2.5 seconds is much more noticeable, and long enough that you would likely add a loading state.
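
To put the init gap next to the decode gap, here is a rough sketch that adds the load/init and generation medians from the results table to estimate a cold-start total per row. Summing two independently computed medians is only an approximation, and the `Row` type and `coldTotalMs` helper are names invented for this illustration.

```typescript
interface Row {
  runtime: string;
  outputTokens: number;
  loadInitMs: number;   // median load/init from the results table
  generationMs: number; // median generation time from the results table
}

const rows: Row[] = [
  { runtime: "Transformers.js", outputTokens: 64, loadInitMs: 2538, generationMs: 1085 },
  { runtime: "wllama", outputTokens: 64, loadInitMs: 800, generationMs: 1201 },
  { runtime: "Transformers.js", outputTokens: 128, loadInitMs: 2505, generationMs: 2051 },
  { runtime: "wllama", outputTokens: 128, loadInitMs: 753, generationMs: 2268 },
  { runtime: "Transformers.js", outputTokens: 256, loadInitMs: 2523, generationMs: 4041 },
  { runtime: "wllama", outputTokens: 256, loadInitMs: 737, generationMs: 4499 },
  { runtime: "Transformers.js", outputTokens: 512, loadInitMs: 2523, generationMs: 8409 },
  { runtime: "wllama", outputTokens: 512, loadInitMs: 734, generationMs: 8965 },
];

// Approximate time from "start initialising" to "last token streamed".
function coldTotalMs(row: Row): number {
  return row.loadInitMs + row.generationMs;
}

for (const row of rows) {
  console.log(`${row.outputTokens} tokens, ${row.runtime}: ~${coldTotalMs(row)} ms cold`);
}
```

On these numbers, the cold-start total is lower for wllama at every tested length, including 512 tokens (about 9699 ms against about 10932 ms), because the faster init outweighs the slower decode over outputs this short.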

There are ways to mitigate a slow init: load eagerly in the background, show a progress state, or warm the model on page load rather than on first use. But a faster cold start removes much of the problem in the first place. That makes wllama’s faster init a meaningful practical advantage.
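
The warm-on-page-load mitigation can be sketched as a memoised loader that works with either runtime. `createLazyModel` is a hypothetical helper, and the `load` callback stands in for whatever initialisation call the chosen runtime provides; neither is part of Transformers.js or wllama.

```typescript
// Wraps an async loader so that initialisation runs at most once at a time
// and every caller shares the same in-flight (or completed) promise.
function createLazyModel<T>(load: () => Promise<T>) {
  let pending: Promise<T> | undefined;
  return {
    get(): Promise<T> {
      if (pending === undefined) {
        pending = load().catch((err: unknown) => {
          // Forget a failed attempt so a later call can retry.
          pending = undefined;
          throw err;
        });
      }
      return pending;
    },
  };
}

// Hypothetical usage:
//   const model = createLazyModel(() => initialiseRuntimeSomehow());
//   void model.get();                  // on page load: start warming in the background
//   const ready = await model.get();   // on first upload: reuses the same promise
```

If the user acts before warming finishes, they wait only for the remainder of the init rather than all of it, and a loading state can be tied to the same promise.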

It’s worth noting that both init times were measured after model files were already cached by the browser. This is runtime initialisation, not download time. The Llamas on the Web paper also discusses optimisations around browser model loading for wllama, including OPFS caching and avoiding redundant memory copies.

Where wllama fits

Transformers.js is still the library I would reach for first for a lot of browser ML work, mostly because it has a mature API and I am reasonably familiar with it. But wllama is now very much on my radar. GGUF support is a big deal. A lot of local LLM experimentation already happens in the llama.cpp/GGUF world, especially with small quantized models. wllama has now opened the door to that ecosystem in the browser.

Thank you!