Source

Supports both Llama and Gemma architectures. Try the Web engine here.

(demo animation)


Building the engine (macOS / Linux)

```shell
make llama3pure
```

Building the engine (Windows)

Use the x64 Native Tools Command Prompt for VS.

```shell
cl /O2 llama3pure-c-engine.c /Fe:llama3pure.exe
```

Running the engine (macOS / Linux / Windows)

```shell
# On macOS / Linux
./llama3pure -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
./llama3pure -model Llama3.gguf -chathistory chat.txt

# On Windows
llama3pure.exe -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
llama3pure.exe -model Llama3.gguf -chathistory chat.txt
```
| Argument | Required | Description | Default Value |
| --- | --- | --- | --- |
| `-model` | Yes | Path to a GGUF model file. | - |
| `-prompt` | No | Input prompt text (single-turn, alternative to `-chathistory`). | - |
| `-chathistory` | No | Path to a `.txt` file containing a JSON chat history (multi-turn, alternative to `-prompt`). | - |
| `-system_prompt` | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `-max_tokens` | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `-context_size` | No | Context window size (capped by the model's own limit). | Model's max. |
| `-temperature` | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `-top_p` | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `-top_k` | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
| `-debug` | No | Show detailed model loading and performance logs (including tok/s). | disabled |

Sample chat history in tests.txt.
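The file's exact schema isn't documented here, but judging from the `chatHistory` format used by the Node.js and Web engines below, a `chat.txt` would plausibly contain a JSON array of role/content messages:

```json
[
  { "role": "user", "content": "Tell me in 1 line what is Microsoft." },
  {
    "role": "assistant",
    "content": "Microsoft is a global technology leader known for its innovative products and services."
  },
  { "role": "user", "content": "Tell me in 1 line the names of the founders." }
]
```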

Running in Node.js

Read the GGUF file into an `ArrayBuffer` and pass it to `llama3pure` with `type: "load"`.

```javascript
import llama3pure from "./llama3pure-js-engine.js"
import fs from "fs"

const readFileAsArrayBuffer = (filePath) => {
  const fd = fs.openSync(filePath, "r")
  const fileSize = fs.fstatSync(fd).size
  const arrayBuffer = new ArrayBuffer(fileSize)
  const fileUint8 = new Uint8Array(arrayBuffer)
  // Read in 256 MB chunks: a single fs.readSync call cannot read arbitrarily large files.
  const chunkSize = 256 * 1024 * 1024
  let pos = 0
  while (pos < fileSize) {
    const toRead = Math.min(chunkSize, fileSize - pos)
    // readSync may return fewer bytes than requested; advance by the actual count.
    const bytesRead = fs.readSync(fd, fileUint8, pos, toRead, pos)
    if (bytesRead === 0) break // unexpected EOF
    pos = pos + bytesRead
  }
  fs.closeSync(fd)
  return arrayBuffer
}

llama3pure({
  type: "load",
  model: readFileAsArrayBuffer("/path/to/your-model.gguf"),
  cbRender: (token) => {
    process.stdout.write(token)
  },
  systemPrompt: "You are a helpful assistant.",
  maxTokens: 256,
  contextSize: 2048,
  temperature: 0.9,
  topP: 0.9,
  topK: 40,
})
```
| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| `type` | string | Yes | Must be `load`. | - |
| `model` | ArrayBuffer | Yes | The GGUF model file contents. | - |
| `cbRender` | function | Yes | Callback invoked with each generated token as a string. | - |
| `systemPrompt` | string | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `maxTokens` | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `contextSize` | number | No | Context window size (capped by the model's own limit). | Model's max. |
| `temperature` | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `topP` | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `topK` | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
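The `temperature`, `topK` and `topP` parameters interact in the usual way for nucleus sampling: temperature reshapes the distribution, top-K truncates it to the K most probable tokens, and top-P then keeps the smallest prefix whose cumulative probability reaches the threshold. The sketch below illustrates that filtering pipeline; it is not the engine's actual sampler.

```javascript
// Sketch of temperature + top-K + top-P (nucleus) filtering over raw logits.
// Illustrative only; the engine's real sampling code may differ in details.
function filterLogits(logits, temperature, topK, topP) {
  // 1. Temperature scaling: lower values sharpen, higher values flatten.
  const scaled = logits.map((l) => l / temperature)
  // 2. Softmax (shifted by the max for numerical stability).
  const maxL = Math.max(...scaled)
  const exps = scaled.map((l) => Math.exp(l - maxL))
  const sum = exps.reduce((a, b) => a + b, 0)
  let tokens = exps.map((e, id) => ({ id, p: e / sum }))
  // 3. Top-K: keep only the K most probable tokens.
  tokens.sort((a, b) => b.p - a.p)
  tokens = tokens.slice(0, topK)
  // 4. Top-P: keep the smallest prefix whose cumulative probability reaches topP.
  const kept = []
  let cum = 0
  for (const t of tokens) {
    kept.push(t)
    cum += t.p
    if (cum >= topP) break
  }
  // 5. Renormalize; the next token is then drawn from this reduced distribution.
  const total = kept.reduce((a, t) => a + t.p, 0)
  return kept.map((t) => ({ id: t.id, p: t.p / total }))
}
```

As temperature approaches 0 the softmax collapses onto the single most probable token, which is why low temperatures give near-deterministic output.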

Call `llama3pure` with `type: "generate"` and a `chatHistory` array. The engine streams tokens through the `cbRender` callback provided during `load`. The last message in `chatHistory` should have `role: "user"`; that is the message the model will respond to. Earlier messages provide conversation context, enabling multi-turn conversations.

```javascript
llama3pure({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `generate`. |
| `chatHistory` | array | Yes | Array of message objects representing the conversation. |
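Multi-turn chat then reduces to appending each exchange to `chatHistory` before the next `generate` call. The helpers below are illustrative bookkeeping, not part of the engine's API:

```javascript
// Illustrative helpers for maintaining the chatHistory array across turns.
const chatHistory = []

const addUserMessage = (history, content) => {
  history.push({ role: "user", content })
}

const addAssistantMessage = (history, content) => {
  history.push({ role: "assistant", content })
}

// The engine expects the final message to have role "user".
const isReadyToGenerate = (history) =>
  history.length > 0 && history[history.length - 1].role === "user"

addUserMessage(chatHistory, "Tell me in 1 line what is Microsoft.")
// ...after the first generation completes, record the model's reply:
addAssistantMessage(
  chatHistory,
  "Microsoft is a global technology leader known for its innovative products and services."
)
addUserMessage(chatHistory, "Tell me in 1 line the names of the founders.")
// chatHistory is now ready to pass to llama3pure({ type: "generate", chatHistory })
```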

Full example in llama3pure-nodejs-demo.js.

Running in Web Environments

Read the GGUF file as an `ArrayBuffer` and send it to the worker with `type: "load"`. The `ArrayBuffer` is transferred (not copied) for performance.

```javascript
const reader = new FileReader()

reader.onload = (event) => {
  const arrayBuffer = event.target.result
  worker.postMessage(
    {
      type: "load",
      model: arrayBuffer,
      systemPrompt: "You are a helpful assistant.",
      maxTokens: 256,
      contextSize: 2048,
      temperature: 0.9,
      topP: 0.9,
      topK: 40,
    },
    [arrayBuffer]
  )
}

reader.readAsArrayBuffer(file)
```
| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| `type` | string | Yes | Must be `load`. | - |
| `model` | ArrayBuffer | Yes | The GGUF model file contents. | - |
| `systemPrompt` | string | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `maxTokens` | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `contextSize` | number | No | Context window size (capped by the model's own limit). | Model's max. |
| `temperature` | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `topP` | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `topK` | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
Send a `generate` message with a `chatHistory` array, mirroring the Node.js API:

```javascript
worker.postMessage({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `generate`. |
| `chatHistory` | array | Yes | Array of message objects representing the conversation. |
Handle the worker's messages to track loading progress and receive generated tokens:

```javascript
worker.onmessage = function (e) {
  const data = e.data
  switch (data.type) {
    case "progress":
      // Fired during model loading
      break

    case "loaded":
      // Fired once the model is fully loaded and ready
      break

    case "token":
      // Fired for each generated token during inference
      console.log(data.token)
      break

    case "complete":
      // Fired when generation is finished
      console.log(data.output)
      break
  }
}
```
| Event | Fields | Description |
| --- | --- | --- |
| `progress` | - | Emitted during model loading to indicate progress. |
| `loaded` | - | Emitted once when the model has been fully loaded and is ready for inference. |
| `token` | `token` (string) | Emitted for each token as it is generated, enabling real-time streaming of the response. |
| `complete` | `output` (string) | Emitted when generation finishes. Contains the full generated text. |
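On the page side, a typical pattern is to append `token` events to a growing response and replace it with the full text on `complete`. A minimal framework-free sketch (`handleEngineMessage` and the `state` object are illustrative, not part of the engine):

```javascript
// Illustrative reducer over the worker events documented above.
const state = { loading: true, response: "" }

function handleEngineMessage(state, data) {
  switch (data.type) {
    case "progress":
      state.loading = true // model is still loading
      break
    case "loaded":
      state.loading = false // ready for inference
      break
    case "token":
      state.response += data.token // stream tokens into the partial answer
      break
    case "complete":
      state.response = data.output // swap in the full generated text
      break
  }
  return state
}

// In a real page this would be wired as:
//   worker.onmessage = (e) => handleEngineMessage(state, e.data)
// Simulated event sequence:
handleEngineMessage(state, { type: "loaded" })
handleEngineMessage(state, { type: "token", token: "Hello" })
handleEngineMessage(state, { type: "token", token: " world" })
handleEngineMessage(state, { type: "complete", output: "Hello world." })
```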

Try the Web engine here or with custom maxTokens, contextSize, topP and topK here.

A standalone version is available here; it offers the same functionality as the standard version but uses a base64-embedded Worker, allowing you to run it as a local file without a web server.

Suggested Models and Engines

MODEL C NODE.JS WEB
Gemma-3-1B-it-Q8_0.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Gemma-3-4b-it-Q8_0.gguf

Tested Models

MODEL C NODE.JS WEB
Gemma-3-270M-it-Q2_K_L.gguf
Gemma-3-270M-it-Q3_K_M.gguf
Gemma-3-270M-it-Q4_K_M.gguf
Gemma-3-270M-it-Q5_K_M.gguf
Gemma-3-270M-it-Q6_K.gguf
Gemma-3-270M-it-Q8_0.gguf
Gemma-3-270M-it-F16.gguf
Gemma-3-1B-it-Q2_K_L.gguf
Gemma-3-1B-it-Q3_K_M.gguf
Gemma-3-1B-it-Q4_K_M.gguf
Gemma-3-1B-it-Q5_K_M.gguf
Gemma-3-1B-it-Q6_K.gguf
Gemma-3-1B-it-Q8_0.gguf
Gemma-3-1B-it-BF16.gguf
Llama-3.2-1B-Instruct-Q3_K_L.gguf
Llama-3.2-1B-Instruct-Q4_K_L.gguf
Llama-3.2-1B-Instruct-Q5_K_L.gguf
Llama-3.2-1B-Instruct-Q6_K_L.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-1B-Instruct-f16.gguf
Llama-3.2-3B-Instruct-Q3_K_L.gguf
Llama-3.2-3B-Instruct-Q4_K_L.gguf
Llama-3.2-3B-Instruct-Q5_K_L.gguf
Llama-3.2-3B-Instruct-Q6_K_L.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-f16.gguf
Gemma-3-4b-it-Q2_K_L.gguf
Gemma-3-4b-it-Q3_K_M.gguf
Gemma-3-4b-it-Q4_K_M.gguf
Gemma-3-4b-it-Q5_K_M.gguf
Gemma-3-4b-it-Q6_K.gguf
Gemma-3-4b-it-Q8_0.gguf
Gemma-3-4b-it-BF16.gguf
Llama-3-8B-Instruct-Q2_K.gguf
Llama-3-8B-Instruct-Q3_K_M.gguf
Llama-3-8B-Instruct-Q4_K_M.gguf
Llama-3-8B-Instruct-Q5_K_M.gguf
Llama-3-8B-Instruct-Q6_K.gguf
Llama-3-8B-Instruct-Q8_0.gguf
Llama-3-8B-Instruct-fp16.gguf

Author's Notes

Based on the work of

https://github.com/karpathy/llama2.c



Last modified 22 March 2026