Website | Source

Tools:

Distributed Systems class (Source) (Outline): this is a really good outline.

What makes a thing distributed?

Lamport, 1987:

A distributed system is one in which the failure of a computer
you didn't even know existed can render your own computer
unusable.

Nodes and networks

Nodes

Networks as message flows

Causality diagrams

Synchronous networks

Semi-synchronous networks

Asynchronous networks

When networks go wrong

Low level protocols

TCP

UDP

Clocks

Wall Clocks

Lamport Clocks

Vector Clocks

GPS & Atomic Clocks

Availability

Total availability

Sticky availability

High availability

Majority available

Quantifying availability

Consistency

Monotonic Reads

Monotonic Writes

Read Your Writes

Writes Follow Reads

Causal consistency

Sequential consistency

Linearizability

Transactional Models

Does any of this actually matter?

Tradeoffs

Availability and Consistency

Harvest and Yield

Hybrid systems

Avoid Consensus Wherever Possible

CALM conjecture

Gossip

CRDTs

HATs

Fine, We Need Consensus, What Now?

Paxos

ZAB

Humming Consensus

Viewstamped Replication

Raft

What About Transactions?

Review

Systems which only add facts, not retract them, require less coordination to
build. We can use gossip systems to broadcast messages to other processes,
CRDTs to merge updates from our peers, and HATs for weakly isolated
transactions. Serializability and linearizability require consensus, which we
can obtain through Paxos, ZAB, VR, or Raft. Now, we'll talk about different
scales of distributed systems.
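As a minimal sketch of the CRDT idea (my own illustration, not code from the class notes): a grow-only set only ever adds facts, and its merge is set union, which is commutative, associative, and idempotent, so replicas that gossip their states in any order, and any number of times, still converge without consensus.

```python
# Hypothetical grow-only set (G-Set) CRDT. Replicas only add facts;
# merge is set union, so gossiped states can arrive in any order,
# repeatedly, and every replica still converges.

class GSet:
    def __init__(self, elements=None):
        self.elements = set(elements or [])

    def add(self, x):            # local update: record a new fact
        self.elements.add(x)

    def merge(self, other):      # combine state received via gossip
        self.elements |= other.elements

    def __contains__(self, x):
        return x in self.elements


# Two replicas accept writes independently, then gossip their states.
a, b = GSet(), GSet()
a.add("alice-follows-bob")
b.add("bob-follows-carol")

a.merge(b)
b.merge(a)
assert a.elements == b.elements  # replicas converge without consensus
```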

Characteristic latencies

Multicore systems

Local networks

Geographic replication

Review

We discussed three characteristic scales for distributed systems: multicore
processors coupled with a synchronous network, computers linked by a LAN, and
datacenters linked by the internet or dedicated fiber. At the multicore scale,
the consequences are largely performance concerns: knowing how to minimize
coordination. On LANs, latencies are short enough to allow many network hops
before users take notice. In geographically replicated systems, high latencies
drive eventually consistent and datacenter-pinned solutions.
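A back-of-envelope sketch of why this follows, using rough assumed latencies (real figures vary widely by hardware and network; these are order-of-magnitude placeholders only):

```python
# Rough, assumed characteristic latencies for the three scales above.
round_trip_s = {
    "cross-core memory access":   100e-9,  # ~100 ns
    "LAN round trip":             500e-6,  # ~0.5 ms
    "cross-continent round trip":  75e-3,  # ~75 ms
}

user_budget_s = 0.100  # ~100 ms before a delay becomes noticeable

for name, rtt in round_trip_s.items():
    print(f"{name}: ~{user_budget_s / rtt:,.0f} fit in a 100 ms budget")

# LAN: a couple hundred round trips fit in the budget, so chatty
# protocols are fine. Cross-continent: only one or two, which pushes
# geo-replicated designs toward eventual consistency or pinning data
# to a single datacenter.
```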

Common distributed systems

Outsourced heaps

KV stores

SQL databases

Search

Coordination services

Streaming systems

Distributed queues

Review

We use data structure stores as outsourced heaps: they're the duct tape of
distributed systems. KV stores and relational databases are commonly deployed
as systems of record; KV stores use independent keys and are not well-suited to
relational data, but offer better scalability and finer-grained partial failure
than SQL stores, which in turn offer rich queries and strong transactional
guarantees. Distributed search and coordination services round out our basic
toolkit for building applications. Streaming systems are applied for
continuous, low-latency processing of datasets, and tend to look more like
frameworks than databases. Distributed queues, their dual, focus on the
messages rather than the transformations.
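A toy sketch of the KV contract (an assumed, generic interface, not any particular product): each operation touches a single key, so invariants that span keys are the application's problem; that restriction is what buys scalability and independent failure of keys.

```python
# Hypothetical single-key KV interface: no cross-key transactions.
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


# A "transfer" between two accounts is two independent puts; if the
# second fails, the store won't roll back the first. A SQL database
# would wrap both updates in one transaction.
store = KVStore()
store.put("account:alice", 100)
store.put("account:bob", 50)
```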

A Pattern Language

Don't distribute

Use an existing distributed system

Never fail

Accept failure

Recovery First

Reconciliation Loops

Backups

Redundancy

Sharding

Independent domains

ID structure

Immutable values

Mutable identities

Confluence

Backpressure

  1. Consume resources and explode
  2. Shed load. Start dropping requests.
  3. Reject requests. Ignore the work and tell clients it failed.
  4. Apply backpressure to clients, asking them to slow down (sketched below).
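A minimal sketch of options 2 through 4, assuming a hypothetical in-process server with a bounded work queue: shedding drops work silently, rejection tells the caller, and backpressure makes the caller wait.

```python
import queue

# Hypothetical bounded work queue illustrating the overload options above.
jobs = queue.Queue(maxsize=100)

def shed(job):
    """Option 2: silently drop work when the queue is full."""
    try:
        jobs.put_nowait(job)
    except queue.Full:
        pass  # dropped on the floor

def reject(job):
    """Option 3: refuse the work and tell the caller it failed."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False  # caller sees the failure and can retry later

def backpressure(job):
    """Option 4: block the caller until there is room, slowing producers
    down to the rate the consumer can actually sustain."""
    jobs.put(job)  # blocks while the queue is full
```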

Services for domain models

Structure Follows Social Spaces

Cross-service coordination

Migrations

Review

When possible, try to use a single node instead of a distributed system. Accept
that some failures are unavoidable: SLAs and apologies can be cost-effective.
To handle catastrophic failure, we use backups. To improve reliability, we
introduce redundancy. To scale to large problems, we divide the problem into
shards. Immutable values are easy to store and cache, and can be referenced by
mutable identities, allowing us to build strongly consistent systems at large
scale. As software grows, different components must scale independently,
and we break out libraries into distinct services. Service structure goes
hand-in-hand with teams.
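One way to picture the immutable-values/mutable-identities point (a hypothetical sketch, not the class's code): store each version of a value under a hash of its content, so it can be cached and replicated freely, and confine mutation to a small pointer naming the current version, so only that pointer ever needs coordination.

```python
import hashlib
import json

# Immutable values: keyed by a hash of their content, so a name never
# points to different data and blobs can be cached and replicated freely.
blobs = {}

def put_value(value):
    encoded = json.dumps(value, sort_keys=True).encode()
    digest = hashlib.sha256(encoded).hexdigest()
    blobs[digest] = encoded
    return digest

# Mutable identity: one small pointer naming the current version. Only
# this pointer needs coordination (e.g. a compare-and-set somewhere);
# the bulk of the data never changes.
heads = {}

v1 = put_value({"balance": 100})
heads["account:alice"] = v1

v2 = put_value({"balance": 120})
heads["account:alice"] = v2  # mutation touches only the identity
```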

Production Concerns

Distributed systems are supported by your culture

Test everything

"It's Slow"

Instrument everything

Logging

Shadow traffic

Versioning

Rollouts

Automated Control

Feature flags

Chaos engineering

Oh no, queues

Further reading

Online

Trees


Tags: distribution   safety   liveness   theorem-proving   place  

Last modified 14 December 2025