Reading
General
- Sebastian Raschka: author of "How to Build a Large Language Model (from Scratch)" and "How to Build a Reasoning Model (from Scratch)"
- Are language models a commodity?
- A Beginner's Reading list for LLMs for 2026
- 100+ LLM interview questions with answers
- 7 things I wish I knew when I started self-hosting LLMs
- "Boost your AI Apps with DSLs"
- "Why LLMs get called “stochastic parrots”: That label (from Bender et al., 2021) was meant as a caution:
  - Mimicry over meaning: LLMs are trained to predict the next word, not to “understand.”
  - Surface fluency: They can generate convincing text that looks like knowledge without grounding in truth.
  - Bias and error risks: Because they remix training data, they can reinforce biases or produce confident nonsense.

  So the “parrot” critique points to the danger of mistaking correlation for comprehension. Today’s LLMs do much more than repeat phrases:
  - Cross-domain synthesis: They can connect physics, philosophy, and finance in ways that no single expert can, because their training corpus spans millions of sources.
  - Speed and breadth: A human expert might master a domain over decades, but an LLM can retrieve and correlate patterns across thousands of domains instantly.
  - Interactive reasoning: With Gen-AI, you can push through Socratic dialogues—testing assumptions, iterating on drafts, exploring alternatives. That’s not “parroting”; it’s a kind of statistical exploration of human knowledge.

  Where the gap remains:
  - Grounding: LLMs still lack intentionality or teleonomy (a sense of purpose). They don’t know why something matters.
  - Validation: Their statistical generalizations can mislead without external grounding (e.g., real data, experiments, or trusted sources).
  - Knowledge vs. Wisdom: They provide access to structured correlations; the human role is to discern meaning, relevance, and ethical direction.

  So: the “parrot” label misses the true capability. LLMs are more like global pattern synthesizers—statistical telescopes that let us see across domains of human knowledge. But until they are coupled with grounding mechanisms (like digital genomes, cognizing oracles, or real-world data feedback), their outputs remain knowledge access tools rather than autonomous knowers." --https://www.linkedin.com/posts/raomikkilineni_why-llms-get-called-stochastic-parrots-activity-7366446664368201729-doqb
- How to Enrich LLM Context to Significantly Enhance Capabilities
- MIT researchers propose a new model for legible, modular software: "The coding framework uses modular concepts and simple synchronization rules to make software clearer, safer, and easier for LLMs to generate. Coding with large language models (LLMs) holds huge promise, but it also exposes some long-standing flaws in software: code that’s messy, hard to change safely, and often opaque about what’s really happening under the hood. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are charting a more “modular” path ahead. Their new approach breaks systems into “concepts,” separate pieces of a system, each designed to do one job well, and “synchronizations,” explicit rules that describe exactly how those pieces fit together. The result is software that’s more modular, transparent, and easier to understand. A small domain-specific language (DSL) makes it possible to express synchronizations simply, in a form that LLMs can reliably generate. In a real-world case study, the team showed how this method can bring together features that would otherwise be scattered across multiple services."
- What You See Is What It Does: A Structural Pattern for Legible Software: "The opportunities offered by LLM coders (and their current limitations) demand a reevaluation of how software is structured. Software today is often “illegible”—lacking a direct correspondence between code and observed behavior—and insufficiently modular, leading to a failure of three key requirements of robust coding: incrementality (the ability to deliver small increments by making localized changes), integrity (avoiding breaking prior increments) and transparency (making clear what has changed at build time, and what actions have happened at runtime). A new structural pattern offers improved legibility and modularity. Its elements are concepts and synchronizations: fully independent services and event-based rules that mediate between them. A domain-specific language for synchronizations allows behavioral features to be expressed in a granular and declarative way (and thus readily generated by an LLM). A case study of the RealWorld benchmark is used to illustrate and evaluate the approach."
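The concepts-and-synchronizations pattern described above can be sketched in miniature. This is a hedged illustration under my own assumptions: the `Article` and `Label` concepts and the `Sync` rule table are hypothetical stand-ins, not the MIT DSL or its case study.

```python
# Two independent "concepts": each does one job well and knows
# nothing about the other.
class Article:
    def __init__(self):
        self.posts = {}

    def create(self, post_id, body):
        self.posts[post_id] = body
        # Actions return events that synchronizations can react to.
        return {"action": "Article.create", "post_id": post_id}


class Label:
    def __init__(self):
        self.labels = {}

    def add(self, post_id, label):
        self.labels.setdefault(post_id, []).append(label)
        return {"action": "Label.add", "post_id": post_id}


# A "synchronization": an explicit, declarative rule that mediates
# between concepts, so cross-cutting behavior lives in one place.
class Sync:
    def __init__(self):
        self.rules = []

    def when(self, action, then):
        self.rules.append((action, then))

    def emit(self, event):
        for action, then in self.rules:
            if event["action"] == action:
                then(event)


articles, labels, sync = Article(), Label(), Sync()
# Rule: every newly created article is labeled "unreviewed".
sync.when("Article.create", lambda e: labels.add(e["post_id"], "unreviewed"))

sync.emit(articles.create("p1", "hello"))
print(labels.labels["p1"])  # ['unreviewed']
```

Neither concept imports the other; deleting the rule removes the behavior without touching either module, which is the legibility claim the papers make.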
- How AI fixed my procrastination
- 10 LLM Engineering Concepts Explained in 10 Minutes:
- Understanding Context Engineering
- Implementing Tool Calling
- Adopting the Model Context Protocol
- Enabling Agent-to-Agent Communication
- Leveraging Semantic Caching
- Utilizing Contextual Compression
- Applying Reranking
- Implementing Hybrid Retrieval
- Designing Agent Memory Architectures
- Managing Inference Gateways and Intelligent Routing
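Of the concepts above, semantic caching is the easiest to sketch: reuse a previous LLM response when a new query is semantically close to one already answered, instead of calling the model again. A minimal sketch with a toy bag-of-words embedding; a real system would use a sentence-embedding model and a vector store.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words token counts. Stand-in for a
    # real sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Serve a cached response when a new query is similar enough
    (cosine >= threshold) to one already answered."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, response)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller falls through to the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france?"))  # near-duplicate hit
print(cache.get("how do transformers work"))        # miss -> None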
- LLM Inference: Inside a Fast LLM Inference Server
- You can (and should) run a tiny LLM on your Android phone: Using MNN (Play Store)
Using LLMs
Implementation
- Building an Intelligent Architecture Agent: Leveraging LLMs for Automated Diagram Generation and Documentation
- "How to Build a Large Language Model from Scratch Using Python"
- "How to build knowledge graphs using LLMs"
- "Language Models are Few-Shot Learners": "We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora."
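The few-shot setting in the abstract, where demonstrations are supplied purely via text with no gradient updates, amounts to plain prompt construction. A minimal sketch; the translation examples echo the paper's famous English-to-French demo, and the commented `llm.generate` call is a hypothetical placeholder since no particular API is assumed.

```python
def few_shot_prompt(task, demos, query):
    """Build a prompt from a task description plus input/output
    demonstrations, ending at the point the model should complete."""
    lines = [task, ""]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    task="Translate English to French.",
    demos=[("cheese", "fromage"), ("sea otter", "loutre de mer")],
    query="peppermint",
)
print(prompt)
# The prompt would then be sent to a model, e.g.:
# completion = llm.generate(prompt)   # hypothetical client
```

All "learning" happens in-context: the model infers the task from the demonstrations at inference time.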
- The Large Language Model Course
- Simple-LLM: ~950 line, minimal, extensible LLM inference engine built from scratch.
- Let's reproduce GPT-2 (124M)
- "Researchers from Virginia Tech and Microsoft Introduce Algorithm of Thoughts: An AI Approach That Enhances Exploration of Ideas And Power of Reasoning In Large Language Models (LLMs)"
- A Survey of Reinforcement Learning for Large Reasoning Models: "In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs"
- WebLLM: High-Performance In-Browser LLM Inference Engine (Website | Source)
Context Engineering
- Context-Engineering - a frontier, first-principles handbook inspired by Karpathy and 3Blue1Brown for moving beyond prompt engineering to the wider discipline of context design, orchestration, and optimization
- Awesome-Context-Engineering - a comprehensive survey on Context Engineering: from prompt engineering to production-grade AI systems
Verification
- BEAVER: As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification, and secure code generation tasks across multiple state-of-the-art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.
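The idea of sound bounds can be illustrated on a toy scale, in the spirit of BEAVER but not its actual token-trie algorithm: explore prefixes of a hand-written next-token table, count the mass of finished sequences that satisfy a prefix-closed constraint as the lower bound, and add any unexplored mass for the upper bound. The model table and constraint here are assumptions for illustration only.

```python
# Toy autoregressive model: prefix (tuple of tokens) -> next-token probs.
MODEL = {
    (): {"good": 0.6, "bad": 0.4},
    ("good",): {"<eos>": 0.7, "bad": 0.3},
    ("bad",): {"<eos>": 1.0},
    ("good", "bad"): {"<eos>": 1.0},
}

def satisfies(prefix):
    # Prefix-closed constraint: once "bad" appears, no extension recovers.
    return "bad" not in prefix

def bounds(max_steps=100):
    lower = 0.0                 # mass of finished, satisfying sequences
    frontier = [((), 1.0)]      # (prefix, probability mass reaching it)
    for _ in range(max_steps):
        if not frontier:
            break
        prefix, mass = frontier.pop()
        if not satisfies(prefix):
            continue            # whole subtree is a proven violation
        for tok, p in MODEL[prefix].items():
            if tok == "<eos>":
                lower += mass * p
            else:
                frontier.append((prefix + (tok,), mass * p))
    # Unexplored prefixes might still satisfy: they pad the upper bound.
    unexplored = sum(m for _, m in frontier)
    return lower, lower + unexplored

lo, hi = bounds()
print(lo, hi)  # the true satisfaction probability (0.42) lies in [lo, hi]
```

Under a step budget the interval is wide but always sound; running to exhaustion closes it, which is the trade-off the paper's budgeted evaluation measures.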
- The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality: "We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts ."
Wow
Python
awesome-local-llm - TODO: split these apart into pages, eliminate dupes
Inference engines
- BitNet - official inference framework for 1-bit LLMs
- distributed-llama - connect home devices into a powerful cluster to accelerate LLM inference
- dynamo - a datacenter scale distributed inference serving framework
- exo - run your own AI cluster at home with everyday devices
- FastFlowLM - run LLMs on AMD Ryzen™ AI NPUs
- flashinfer - kernel library for LLM serving
- gpustack - simple, scalable AI model deployment on GPU clusters
- LiteRT - Google's on-device framework for high-performance ML & GenAI deployment on edge platforms, via efficient conversion, runtime, and optimization
- LiteRT-LM - Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models
- krasis - a Hybrid LLM runtime which focuses on efficient running of larger models on consumer grade VRAM limited hardware
- llm-scaler - run LLMs on Intel Arc™ Pro B60 GPUs
- mistral.rs - fast, flexible LLM inference
- mlx-lm - generate text and fine-tune large language models on Apple silicon with MLX
- mlx-vlm - a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX
- Nano-vLLM - a lightweight vLLM implementation built from scratch
- omlx - LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
- TensorRT-LLM - provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs
Tools
Computer Use
- open-interpreter - a natural language interface for computers
- OmniParser - a simple screen parsing tool towards pure vision based GUI agent
- openwork - an open-source alternative to Claude Cowork, powered by OpenCode
- cua - the Docker Container for Computer-Use AI Agents
- Agent-S - an open agentic framework that uses computers like a human
- self-operating-computer - a framework to enable multimodal models to operate a computer
- OpenRoom - a browser-based desktop where AI Agent operates every app through natural language, from MiniMaxAI
Browser Automation
- puppeteer - a JavaScript API for Chrome and Firefox
- playwright - a framework for Web Testing and Automation
- browser-use - make websites accessible for AI agents
- firecrawl - turn entire websites into LLM-ready markdown or structured data
- stagehand - the AI Browser Automation Framework
- nanobrowser - open-source Chrome extension for AI-powered web automation
Memory Management
- mem0 - universal memory layer for AI Agents
- mempalace - an AI memory system that reports the highest scores yet on memory benchmarks
- letta - the stateful agents framework with memory, reasoning, and context management
- supermemory - an extremely fast, scalable memory engine and app
- cognee - memory for AI Agents in 5 lines of code
- LMCache - supercharge your LLM with the fastest KV Cache Layer
- memU - an open-source memory framework for AI companions
- reasoning-bank - a memory mechanism for agents that learns from both successful and failed trajectories, with reasoning stored as memory content
Testing, Evaluation and Observability
- langfuse - an open-source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more
- opik - debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards
- openllmetry - an open-source observability for your LLM application, based on OpenTelemetry
- garak - the LLM vulnerability scanner from NVIDIA
- giskard - an open-source evaluation & testing for AI & LLM systems
- agenta - an open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place
Research
- Perplexica - an open-source alternative to Perplexity AI, the AI-powered search engine
- gpt-researcher - an LLM based autonomous agent that conducts deep local and web research on any topic and generates a long report with citations
- SurfSense - an open-source alternative to NotebookLM / Perplexity / Glean
- open-notebook - an open-source implementation of Notebook LM with more flexibility and features
- RD-Agent - automate the most critical and valuable aspects of the industrial R&D process
- local-deep-researcher - fully local web research and report writing assistant
- local-deep-research - an AI-powered research assistant for deep, iterative research
- maestro - an AI-powered research application designed to streamline complex research tasks
Security and Sandboxing
- Guardrails - an open-source toolkit from NVIDIA for easily adding programmable guardrails to LLM-based conversational systems
- OpenShell - the safe, private runtime for autonomous AI agents from NVIDIA
- CubeSandbox - instant, concurrent, secure & lightweight sandbox for AI agents
- cai - Cybersecurity AI (CAI), the framework for AI Security
Miscellaneous
- speakr - a personal, self-hosted web application designed for transcribing audio recordings
- presenton - an open-source AI presentation generator and API
- OmniGen2 - an exploration of advanced multimodal generation
- 4o-ghibli-at-home - a powerful, self-hosted AI photo stylizer built for performance and privacy
- Observer - local open-source micro-agents that observe, log and react, all while keeping your data private and secure
- mobile-use - a powerful, open-source AI agent that controls your Android or iOS device using natural language
- gabber - build AI applications that can see, hear, and speak using your screens, microphones, and cameras as inputs
- promptcat - a zero-dependency prompt manager/catalog/library in a single HTML file
Tags:
ai
Last modified 07 May 2026