Source

Supports both Llama and Gemma architectures. Try the Web engine here.

(demo animation)


Building the engine (macOS / Linux)

```shell
make llama3pure
```

Building the engine (Windows)

Use the x64 Native Tools Command Prompt for VS.

```shell
cl /O2 llama3pure-c-engine.c /Fe:llama3pure.exe
```

Running the engine (macOS / Linux / Windows)

```shell
# On macOS / Linux
./llama3pure -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
./llama3pure -model Llama3.gguf -chathistory chat.txt

# On Windows
llama3pure.exe -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
llama3pure.exe -model Llama3.gguf -chathistory chat.txt
```
| Argument | Required | Description | Default Value |
| --- | --- | --- | --- |
| `-model` | Yes | Path to a GGUF model file. | - |
| `-prompt` | No | Input prompt text (single-turn, alternative to `-chathistory`). | - |
| `-chathistory` | No | Path to a `.txt` file containing a JSON chat history (multi-turn, alternative to `-prompt`). | - |
| `-system_prompt` | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `-max_tokens` | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `-context_size` | No | Context window size (capped by the model's own limit). | Model's max. |
| `-temperature` | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `-top_p` | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `-top_k` | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
| `-debug` | No | Show detailed model loading and performance logs (including tok/s). | disabled |

Sample chat history in tests.txt.
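The file's exact schema isn't documented here, but judging from the `chatHistory` format used by the Node.js and Web engines below, a `chat.txt` would plausibly contain a JSON array of role/content messages:

```json
[
  { "role": "user", "content": "Tell me in 1 line what is Microsoft." },
  {
    "role": "assistant",
    "content": "Microsoft is a global technology leader known for its innovative products and services."
  },
  { "role": "user", "content": "Tell me in 1 line the names of the founders." }
]
```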

Running in Node.js

Read the GGUF file into an `ArrayBuffer` and pass it to `llama3pure` with `type: "load"`.

```javascript
import llama3pure from "./llama3pure-js-engine.js"
import fs from "fs"

const readFileAsArrayBuffer = (filePath) => {
  const fd = fs.openSync(filePath, "r")
  const fileSize = fs.fstatSync(fd).size
  const arrayBuffer = new ArrayBuffer(fileSize)
  const fileUint8 = new Uint8Array(arrayBuffer)
  // Read in 256 MB chunks: a single fs.readSync call cannot read arbitrarily large files.
  const chunkSize = 256 * 1024 * 1024
  let pos = 0
  while (pos < fileSize) {
    const toRead = Math.min(chunkSize, fileSize - pos)
    // readSync may return fewer bytes than requested; advance by the actual count.
    const bytesRead = fs.readSync(fd, fileUint8, pos, toRead, pos)
    if (bytesRead === 0) break // unexpected EOF
    pos = pos + bytesRead
  }
  fs.closeSync(fd)
  return arrayBuffer
}

llama3pure({
  type: "load",
  model: readFileAsArrayBuffer("/path/to/your-model.gguf"),
  cbRender: (token) => {
    process.stdout.write(token)
  },
  systemPrompt: "You are a helpful assistant.",
  maxTokens: 256,
  contextSize: 2048,
  temperature: 0.9,
  topP: 0.9,
  topK: 40,
})
```
| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| `type` | string | Yes | Must be `load`. | - |
| `model` | ArrayBuffer | Yes | The GGUF model file contents. | - |
| `cbRender` | function | Yes | Callback invoked with each generated token as a string. | - |
| `systemPrompt` | string | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `maxTokens` | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `contextSize` | number | No | Context window size (capped by the model's own limit). | Model's max. |
| `temperature` | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `topP` | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `topK` | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
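The `temperature`, `topK` and `topP` parameters interact in the usual way for nucleus sampling: temperature reshapes the distribution, top-K truncates it to the K most probable tokens, and top-P then keeps the smallest prefix whose cumulative probability reaches the threshold. The sketch below illustrates that filtering pipeline; it is not the engine's actual sampler.

```javascript
// Sketch of temperature + top-K + top-P (nucleus) filtering over raw logits.
// Illustrative only; the engine's real sampling code may differ in details.
function filterLogits(logits, temperature, topK, topP) {
  // 1. Temperature scaling: lower values sharpen, higher values flatten.
  const scaled = logits.map((l) => l / temperature)
  // 2. Softmax (shifted by the max for numerical stability).
  const maxL = Math.max(...scaled)
  const exps = scaled.map((l) => Math.exp(l - maxL))
  const sum = exps.reduce((a, b) => a + b, 0)
  let tokens = exps.map((e, id) => ({ id, p: e / sum }))
  // 3. Top-K: keep only the K most probable tokens.
  tokens.sort((a, b) => b.p - a.p)
  tokens = tokens.slice(0, topK)
  // 4. Top-P: keep the smallest prefix whose cumulative probability reaches topP.
  const kept = []
  let cum = 0
  for (const t of tokens) {
    kept.push(t)
    cum += t.p
    if (cum >= topP) break
  }
  // 5. Renormalize; the next token is then drawn from this reduced distribution.
  const total = kept.reduce((a, t) => a + t.p, 0)
  return kept.map((t) => ({ id: t.id, p: t.p / total }))
}
```

As temperature approaches 0 the softmax collapses onto the single most probable token, which is why low temperatures give near-deterministic output.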

Call `llama3pure` with `type: "generate"` and a `chatHistory` array. The engine streams tokens through the `cbRender` callback provided during `load`. The last message in `chatHistory` should have `role: "user"`; that is the message the model will respond to. Earlier messages provide conversation context, enabling multi-turn conversations.

```javascript
llama3pure({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `generate`. |
| `chatHistory` | array | Yes | Array of message objects representing the conversation. |
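Multi-turn chat then reduces to appending each exchange to `chatHistory` before the next `generate` call. The helpers below are illustrative bookkeeping, not part of the engine's API:

```javascript
// Illustrative helpers for maintaining the chatHistory array across turns.
const chatHistory = []

const addUserMessage = (history, content) => {
  history.push({ role: "user", content })
}

const addAssistantMessage = (history, content) => {
  history.push({ role: "assistant", content })
}

// The engine expects the final message to have role "user".
const isReadyToGenerate = (history) =>
  history.length > 0 && history[history.length - 1].role === "user"

addUserMessage(chatHistory, "Tell me in 1 line what is Microsoft.")
// ...after the first generation completes, record the model's reply:
addAssistantMessage(
  chatHistory,
  "Microsoft is a global technology leader known for its innovative products and services."
)
addUserMessage(chatHistory, "Tell me in 1 line the names of the founders.")
// chatHistory is now ready to pass to llama3pure({ type: "generate", chatHistory })
```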

Full example in llama3pure-nodejs-demo.js.

Running in Web Environments

Read the GGUF file as an `ArrayBuffer` and send it to the worker with `type: "load"`. The `ArrayBuffer` is transferred (not copied) for performance.

```javascript
const reader = new FileReader()

reader.onload = (event) => {
  const arrayBuffer = event.target.result
  worker.postMessage(
    {
      type: "load",
      model: arrayBuffer,
      systemPrompt: "You are a helpful assistant.",
      maxTokens: 256,
      contextSize: 2048,
      temperature: 0.9,
      topP: 0.9,
      topK: 40,
    },
    [arrayBuffer]
  )
}

reader.readAsArrayBuffer(file)
```
| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| `type` | string | Yes | Must be `load`. | - |
| `model` | ArrayBuffer | Yes | The GGUF model file contents. | - |
| `systemPrompt` | string | No | System prompt prepended to every conversation. | "You are a helpful assistant." |
| `maxTokens` | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| `contextSize` | number | No | Context window size (capped by the model's own limit). | Model's max. |
| `temperature` | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| `topP` | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| `topK` | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
Send a `generate` message with a `chatHistory` array, mirroring the Node.js API:

```javascript
worker.postMessage({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Must be `generate`. |
| `chatHistory` | array | Yes | Array of message objects representing the conversation. |
Handle the worker's messages to track loading progress and receive generated tokens:

```javascript
worker.onmessage = function (e) {
  const data = e.data
  switch (data.type) {
    case "progress":
      // Fired during model loading
      break

    case "loaded":
      // Fired once the model is fully loaded and ready
      break

    case "token":
      // Fired for each generated token during inference
      console.log(data.token)
      break

    case "complete":
      // Fired when generation is finished
      console.log(data.output)
      break
  }
}
```
| Event | Fields | Description |
| --- | --- | --- |
| `progress` | - | Emitted during model loading to indicate progress. |
| `loaded` | - | Emitted once when the model has been fully loaded and is ready for inference. |
| `token` | `token` (string) | Emitted for each token as it is generated, enabling real-time streaming of the response. |
| `complete` | `output` (string) | Emitted when generation finishes. Contains the full generated text. |
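On the page side, a typical pattern is to append `token` events to a growing response and replace it with the full text on `complete`. A minimal framework-free sketch (`handleEngineMessage` and the `state` object are illustrative, not part of the engine):

```javascript
// Illustrative reducer over the worker events documented above.
const state = { loading: true, response: "" }

function handleEngineMessage(state, data) {
  switch (data.type) {
    case "progress":
      state.loading = true // model is still loading
      break
    case "loaded":
      state.loading = false // ready for inference
      break
    case "token":
      state.response += data.token // stream tokens into the partial answer
      break
    case "complete":
      state.response = data.output // swap in the full generated text
      break
  }
  return state
}

// In a real page this would be wired as:
//   worker.onmessage = (e) => handleEngineMessage(state, e.data)
// Simulated event sequence:
handleEngineMessage(state, { type: "loaded" })
handleEngineMessage(state, { type: "token", token: "Hello" })
handleEngineMessage(state, { type: "token", token: " world" })
handleEngineMessage(state, { type: "complete", output: "Hello world." })
```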

Try the Web engine here or with custom maxTokens, contextSize, topP and topK here.

A standalone version is available here; it offers the same functionality as the standard version but uses a base64-embedded Worker, allowing you to run it as a local file without a web server.

Suggested Models and Engines

MODEL C NODE.JS WEB
Gemma-3-1B-it-Q8_0.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Gemma-3-4b-it-Q8_0.gguf

Tested Models

MODEL C NODE.JS WEB
Gemma-3-270M-it-Q2_K_L.gguf
Gemma-3-270M-it-Q3_K_M.gguf
Gemma-3-270M-it-Q4_K_M.gguf
Gemma-3-270M-it-Q5_K_M.gguf
Gemma-3-270M-it-Q6_K.gguf
Gemma-3-270M-it-Q8_0.gguf
Gemma-3-270M-it-F16.gguf
Gemma-3-1B-it-Q2_K_L.gguf
Gemma-3-1B-it-Q3_K_M.gguf
Gemma-3-1B-it-Q4_K_M.gguf
Gemma-3-1B-it-Q5_K_M.gguf
Gemma-3-1B-it-Q6_K.gguf
Gemma-3-1B-it-Q8_0.gguf
Gemma-3-1B-it-BF16.gguf
Llama-3.2-1B-Instruct-Q3_K_L.gguf
Llama-3.2-1B-Instruct-Q4_K_L.gguf
Llama-3.2-1B-Instruct-Q5_K_L.gguf
Llama-3.2-1B-Instruct-Q6_K_L.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-1B-Instruct-f16.gguf
Llama-3.2-3B-Instruct-Q3_K_L.gguf
Llama-3.2-3B-Instruct-Q4_K_L.gguf
Llama-3.2-3B-Instruct-Q5_K_L.gguf
Llama-3.2-3B-Instruct-Q6_K_L.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-f16.gguf
Gemma-3-4b-it-Q2_K_L.gguf
Gemma-3-4b-it-Q3_K_M.gguf
Gemma-3-4b-it-Q4_K_M.gguf
Gemma-3-4b-it-Q5_K_M.gguf
Gemma-3-4b-it-Q6_K.gguf
Gemma-3-4b-it-Q8_0.gguf
Gemma-3-4b-it-BF16.gguf
Llama-3-8B-Instruct-Q2_K.gguf
Llama-3-8B-Instruct-Q3_K_M.gguf
Llama-3-8B-Instruct-Q4_K_M.gguf
Llama-3-8B-Instruct-Q5_K_M.gguf
Llama-3-8B-Instruct-Q6_K.gguf
Llama-3-8B-Instruct-Q8_0.gguf
Llama-3-8B-Instruct-fp16.gguf

Author's Notes

Based on the work of

https://github.com/karpathy/llama2.c



Last modified 22 March 2026