Source

LLVM IR via Rust, OpenCL JIT for GPU map operations.

Camp: Syntactic
Also spans: Verification
Author: Paul Williams (paulprogrammer)
Implementation language: Rust
Compilation target: LLVM IR (then native via clang); OpenCL JIT for GPU map kernels at runtime
Licence: GPL-3.0 with Runtime Exception
First seen: May 2026
Maturity: working compiler

Agent tooling:
- MCP server (llm-mcp binary, stdio transport)
- MCP tools: analyze_codebase, search_symbols, get_definition, get_diagnostics, find_callers, structural_search, patch_symbol
- MCP resources: llm://spec (LLM_SPEC.md), llm://agent-workflow (MCP_GUIDE.md)
- GEMINI.md (Gemini CLI orientation)
- Stable diagnostic codes (E000-E018, W001) catalogued in DIAGNOSTICS.md
- .llmi signature files for cross-module imports

Key idea

LLMLang takes the token-efficiency move to its density extreme. Source is a prefix-arity AST in single-character ASCII operators (+, -, >, $, ~, ?, ., #); variables are De Bruijn indices (^0, ^1) rather than names; affine ownership is enforced at compile time (move >, borrow $, mut-borrow ~). The compiler ships OpenTelemetry auto-instrumentation as a metadata marker (M "otel" "span_name" : func) that injects span entry/exit and timing around the function body, plus an OpenCL JIT that translates pure map bodies to GPU kernels at runtime and falls back to CPU vectorisation if OpenCL is absent.

Thesis

LLMLang takes the syntactic camp's premise — that the symbols an LLM emits cost tokens, so the language surface should minimise them — to its density extreme. The LLM_SPEC.md header is [TOKEN_OPTIMIZED: HIGH_DENSITY] and the design guide names the audience directly: "Target Audience: Large Language Models (LLMs). Non-Goal: Human readability." Source is a prefix-arity AST written in single-character ASCII operators: + 10 20 is addition, > ^0 consumes the most-recent binding, $ ^1 borrows the next-most-recent, ? cond t f is a branch, # Point x y declares a struct-of-arrays shape, : name args body defines a function, . e1 e2 sequences. There are no parentheses, no semicolons, no infix precedence to disambiguate. Variables are referenced by their De Bruijn index in the binding stack — ^0, ^1, ^2 — rather than by names; the parser also accepts named identifiers but resolves them to indices before the AST stores anything.
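
A minimal sketch of how a prefix-arity surface can be parsed without parentheses, and how named identifiers collapse to De Bruijn indices before the AST is stored. It is written in Python for illustration and is not the compiler's Rust parser. The arity table covers only a subset of operators and is an assumption, as are the tuple node shapes and the `parse` function name.

```python
# Assumed arities for a small operator subset; the authoritative grammar
# is LLM_SPEC.md, which this sketch does not reproduce.
ARITY = {"+": 2, "-": 2, "*": 2, ">": 1, "$": 1, "~": 1, "?": 3, ".": 2}


def parse(tokens, scope):
    """Parse one prefix expression from the front of `tokens`.

    `scope` is the binding stack, oldest first, so its last entry is ^0.
    Returns (ast, remaining_tokens).
    """
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head in ARITY:
        # The operator's arity says how many sub-expressions follow,
        # so no parentheses or precedence rules are needed.
        args = []
        for _ in range(ARITY[head]):
            arg, rest = parse(rest, scope)
            args.append(arg)
        return (head, *args), rest
    if head.startswith("^") and head[1:].isdigit():
        return ("var", int(head[1:])), rest
    if head.isdigit():
        return ("lit", int(head)), rest
    if head in scope:
        # Distance from the top of the binding stack is the De Bruijn index.
        return ("var", scope[::-1].index(head)), rest
    raise SyntaxError(f"unknown token {head!r}")
```

With the binding stack `["a", "b"]`, the source `+ $ a > b` and the source `+ $ ^1 > ^0` produce the same tree, which is the sense in which names are resolved to indices before anything is stored.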

The distinctive move sits in two places at once. The first is the density lever: where NERD bets on English keywords because BPE tokenisers fragment punctuation, LLMLang bets the opposite — that single ASCII characters cost one token each in the right tokeniser and the win is biggest when there is no punctuation to fragment. The second is enforcement: affine ownership (> move, $ borrow, ~ mut-borrow) is verified at compile time in src/compiler/analysis/verify.rs, with a VariableState stack that issues E004 for use-after-move, E005 for double-move, E009 for branch-state mismatch, and E016 for moving a borrowed variable. The same syntactic-camp surface ships a Rust-style borrow checker rather than relying on convention, which is why the entry spans into verification — the safety story is enforced, not advisory.
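
The state tracking described above can be sketched as a small state machine over a binding stack. This is a Python illustration under stated assumptions, not a port of verify.rs: the `VarState` enum, the `step`, `run`, and `check_branch` helpers, and the explicit `release` event that ends a borrow are all hypothetical, and the entry does not say how the real verifier scopes borrows. Only the mapping from diagnostic codes to their meanings comes from the description above.

```python
from enum import Enum


class VarState(Enum):
    OWNED = "owned"
    BORROWED = "borrowed"
    MOVED = "moved"


def step(op, index, states):
    """Apply one operator to the binding at De Bruijn `index`.

    Returns a diagnostic code, or None if the operation is allowed.
    """
    slot = len(states) - 1 - index  # ^0 is the last entry on the stack
    state = states[slot]
    if op == ">":  # move
        if state is VarState.MOVED:
            return "E005"  # double-move
        if state is VarState.BORROWED:
            return "E016"  # moving a borrowed variable
        states[slot] = VarState.MOVED
    elif op in ("$", "~"):  # borrow or mut-borrow
        if state is VarState.MOVED:
            return "E004"  # use-after-move
        states[slot] = VarState.BORROWED
    elif op == "release":  # hypothetical end-of-borrow event
        if state is VarState.BORROWED:
            states[slot] = VarState.OWNED
    else:
        raise ValueError(f"unknown op {op!r}")
    return None


def run(ops, states):
    """Apply a sequence of (op, index) pairs and collect diagnostics."""
    diags = []
    for op, index in ops:
        code = step(op, index, states)
        if code:
            diags.append(code)
    return diags


def check_branch(then_ops, else_ops, states):
    """Check both arms from the same starting state; E009 if they diverge."""
    then_states, else_states = list(states), list(states)
    diags = run(then_ops, then_states) + run(else_ops, else_states)
    if then_states != else_states:
        diags.append("E009")  # branch-state mismatch
    states[:] = then_states
    return diags
```

Under these assumptions, moving ^0 in one arm of a branch and leaving it untouched in the other yields E009, because code after the branch could not know whether the binding is still owned.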

What it looks like

// Factorial. ^0 refers to the most-recent binding (the parameter n).
: fact n ? ^0 * $ ^0 @ fact - > ^0 1 > ^0

// Auto-instrumented function. The M marker triggers compiler-injected
// span entry/exit and timing around handle_request.
M "otel" "handle_request" : handle_request req
+ $ req 1

Every form is prefix-arity; ^0 is De Bruijn for "most-recent binding"; > consumes, $ borrows. The M metadata marker is read by the compiler in src/main.rs and routes the following definition through a code path that wraps the body in llm_otel_enter_span / llm_get_time_ns / llm_otel_emit_span / llm_otel_exit_span calls.
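
As a rough analogue of that wrapping, the Python decorator below records an enter event, times the body, then records emit and exit events in that order. It is a sketch only: the `otel` decorator, the `EVENTS` list, and the event tuples are hypothetical stand-ins, and the real signatures of the llm_otel_* runtime functions are not given in this entry.

```python
import time

EVENTS = []  # hypothetical stand-in for the runtime's span emission queue


def otel(span_name):
    """Hypothetical Python analogue of the M "otel" metadata marker."""
    def wrap(func):
        def instrumented(*args):
            EVENTS.append(("enter", span_name))          # cf. llm_otel_enter_span
            start = time.perf_counter_ns()               # cf. llm_get_time_ns
            result = func(*args)
            elapsed = time.perf_counter_ns() - start
            EVENTS.append(("emit", span_name, elapsed))  # cf. llm_otel_emit_span
            EVENTS.append(("exit", span_name))           # cf. llm_otel_exit_span
            return result
        return instrumented
    return wrap


@otel("handle_request")
def handle_request(req):
    # Mirrors the body of the example above: add 1 to the request value.
    return req + 1
```

The function body itself carries no tracing code, which is the point of doing the injection in the compiler rather than in source.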

Maturity

v0.4.0 at the time of cataloguing, sixteen tagged releases (v0.1.0 to v0.4.0) cut between 18 and 24 May 2026 against a repository created 18 May 2026 — one feature wave per day for roughly a week, then consolidation commits through 27 May. Roughly 13,300 lines of Rust and C across 46 source files (src/compiler/{lexer,parser,ast,analysis,codegen} and a C runtime covering HTTP client and server with picohttpparser, TLS via mbedtls, cJSON, SQLite/Redis/MongoDB drivers, OpenCL dispatcher, MPSC emission queue, and a libtai-baseline temporal module); 31 self-hosted test programs under tests/lang/ and 47 Rust unit tests in tests/compiler_tests.rs. GPLv3 with the llmlang Runtime Exception — a GCC-style carve-out that keeps the compiler copyleft but lets generated binaries link the runtime libraries into proprietary code without the licence propagating. Single author Paul Williams (paulprogrammer, Denver, Colorado, GitHub bio "Barefoot Coders"); 0 stars and 0 forks at time of cataloguing.

The README opens with the disclosure: "This entire repository has been largely vibecoded with humans acting as the product owners, and the LLM acting as the developer." That places LLMLang in the same factual family as AILANG's "written autonomously by AI agents" framing and Codong's "designed for AI to write, humans to review" position — what is shipped is real engineering with real automated tests, and the catalogue notes the authorship model as context rather than judgement. MAYBE.md separates roadmap from shipped: first-class AST manipulation beyond the existing patch_symbol, formal intent-and-contract metadata nodes, and TDD/BDD scenario nodes are not yet in the compiler, with OpenTelemetry already crossed off the list. The bet is the syntactic camp's bet intensified — that a surface compressed to single-character prefix operators with indexed variables, plus an MCP server that exposes the same AST the compiler sees, will produce more correct output per token than a conventional language plus a smarter model.

Agent tooling

The llm-mcp binary is the primary agent surface and ships as a second cargo target alongside the compiler. It exposes seven tools over stdio: analyze_codebase walks a directory and parses every .llm file into the same AST the compiler uses; search_symbols looks up functions and shapes by name; get_definition returns the realised AST and file location of a symbol; get_diagnostics runs the parser front-end against a file and returns E00x/W00x codes; find_callers traverses the call graph; structural_search computes a SHA-256 hash of the operator-and-control-flow shape of a function body (literals and names omitted) and returns other functions sharing the same fingerprint, so an LLM can ask "what else does the same thing?" without relying on name similarity; and patch_symbol accepts a JSON AST for a new function body, parses the source file, swaps the matching Define node's body, and rewrites the file through the compiler's own pretty-printer (PrettyExpr in src/compiler/ast/display.rs), so edits stay syntactically valid by construction. Two MCP resources back the tools: llm://spec embeds LLM_SPEC.md directly (the token-density grammar reference), and llm://agent-workflow embeds MCP_GUIDE.md (the analyse → locate → extract → patch workflow). Stable diagnostic codes (E000-E018, W001) are catalogued in DIAGNOSTICS.md so the same identifiers appear in compiler output, MCP responses, and the spec text the model receives from llm://spec.
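
The fingerprinting idea behind structural_search can be sketched as follows. This Python version is an illustration, not the tool's implementation: the tuple-based AST, the `shape` serialisation format, and the function names are assumptions. Only the use of SHA-256 over a shape that omits literals and names comes from the description above.

```python
import hashlib


def shape(node):
    """Serialise an AST to its operator shape, dropping literals and names.

    Assumed representation: an operator node is a tuple (op, child, ...),
    and anything else (a literal, an index, a name) is a leaf.
    """
    if isinstance(node, tuple):
        op, *children = node
        return "(" + op + "".join(shape(child) for child in children) + ")"
    return "_"  # every leaf collapses to the same placeholder


def fingerprint(node):
    """Return the SHA-256 hex digest of the node's operator shape."""
    return hashlib.sha256(shape(node).encode("utf-8")).hexdigest()
```

Two bodies such as `("+", ("$", "req"), 1)` and `("+", ("$", "total"), 99)` share a fingerprint because they differ only in names and literals, while swapping `+` for `-` changes it.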

Tags: language, ai, syntactic, verification, llvm

Last modified 21 June 2026