I reimplemented a CAP-aware total-order broadcast protocol in Rust and verified it six ways
I co-invented a total-order broadcast protocol in the 1990s for power-plant control. I rebuilt it in Rust, checked it six ways — and a model checker found a bug I'd shipped 30 years ago.
In the early 1990s, in a research lab at Cegelec near Paris, a small team I was part of built a way to keep industrial control systems running through hardware failures — the kind that supervise power plants, where "the screen froze" is not an acceptable outcome. The mechanism was a protocol we called the data train: messages assembled into "cars", coupled into a train, and sent round and round a ring of machines so that every machine saw the same events in the same order. If every replica receives the same operations in the same order, every replica stays in the same state. It was patented in 1996 (US 5,483,520 A), and the idea behind it had come from Flaviu Cristian, who advised the project in its early days.
I left that work behind for a long career in software and cloud. Last month I came back to it — not out of nostalgia, but because I wanted to know something I never got to find out at the time: was it actually correct?
In 1992 the honest answer was "we tested it and it seemed fine." That is not the same as correct. So I rebuilt the protocol from scratch in Rust, wrote a formal specification of it, and handed that specification to a model checker.
Both pieces are now open source:
trains-rust — the protocol, formally verified.
trains-valkey — a practical payoff: loss-free failover for Valkey/Redis, built on top of it.
This post is the story of what I found.
What the protocol does, and where it belongs
A total-order broadcast is the quiet primitive underneath a lot of distributed systems: a group of machines agree on one order for a stream of operations and apply them identically, so the group behaves like one reliable machine that doesn't lose its memory when a part of it dies. The famous way to do this today is to elect a leader — Paxos, Raft — and route everything through it. TRAINS does it without a leader: the right to order travels the ring as a circulating train, and every node does an equal share of the work. (If you like taxonomies: it's in the privilege-based family of Défago, Schiper and Urbán's survey — the token that grants the right to order is the train.)
It helps to say what this is for, because it's easy to judge it on the wrong axis. This is a control-plane primitive — the same job as etcd or ZooKeeper, not Kafka. Control planes move small, order-critical messages — configuration, membership, leases, locks, "who is primary" — across a handful of nodes, where getting everyone to agree on the order matters far more than raw bandwidth, and where the right answer under a network partition is to stop, not guess. That is exactly what TRAINS is good at, and exactly what it was born doing: keeping a power-plant control system consistent through failures. It is not a bulk data pipe, and if you benchmark it like one it will disappoint you — which is the honest framing for the numbers later on.
A small piece of history I'm fond of
Two details from the 1990s connect that work to this one.
The first: we verified the protocol with the formal tools we had then — coloured Petri nets, using software from a professor at the CNAM. Thirty years on I'm doing the same thing with TLA+ and an SMT solver. The instinct is identical: don't ship a broadcast protocol you have only tested.
The second: in testing, the system would simply stop when the network lost coherence — a partition, a loss of quorum. We had no vocabulary for that in 1992. The vocabulary arrived in 2000, when Brewer named the CAP theorem: under a partition you cannot have both consistency and availability. TRAINS had quietly chosen consistency — the same call etcd makes today — eight years before there was a name for the choice. For a power-plant controller that is exactly right: "available" should mean correct when healthy, silent when not.
What the model checker found
I wrote the spec in TLA+ and ran it through TLC (exhaustive state enumeration), Apalache (symbolic checking with an SMT backend), and Kani (bounded model checking of the Rust itself), plus differential testing of the production code against a deliberately naive reference implementation.
It found five real bugs. Four were the ordinary kind — an operator precedence mistake, a missing guard. The fifth was the one that justified the whole exercise: a genuine ordering violation that arose only when two concurrent trains advanced their clocks at just the wrong relative moment, so one node could deliver message B before another node delivered message A that should have come first. It passed every hand-written test and every randomized one. Only the exhaustive search found it. I had shipped the idea of this protocol decades ago; it was sobering, and a little funny, to watch a tool in 2026 surface a corner I'd never have reached by hand.
The fix is in both the spec and the Rust. The full report, the TLA+ files, and the benchmark study are in the repo.
The part the patents never covered
The original protocol could survive a crash — it kept going on the machines that were still alive — but it could not recover one. A node that failed and came back stayed out; the system ran one fault away from disaster forever. The 1990s patents were explicit about leaving this open.
So I added it: a node lifecycle. A crashed node is excluded by an agreed membership change; when it returns it first rejoins as a passive read-replica, catches up from a survivor's snapshot, and is then re-admitted as a full voting member by a virtually-synchronous view change — the membership change ordered inside the broadcast stream itself, so consistency is never broken to get the node back. This is the direct descendant of Cristian's group-membership work and of Isis virtual synchrony.
And — the point of the whole project — I re-verified it. The model checker confirms the uniform-ordering property survives membership changes that both shrink and grow the cluster (6.28 million distinct states, no violation; it also caught a naive first attempt that diverged). The membership layer was put through the same machine, not taken on faith.
The payoff: Valkey without losing your writes
A formally-verified ring is satisfying but abstract, so I built something concrete on it: trains-valkey, a proxy that gives an unmodified Valkey (or Redis) loss-free failover. Redis Sentinel, the standard HA story, can lose acknowledged writes during a failover — it has no formal model, and Jepsen has measured it discarding over a thousand acknowledged writes in a single partition, because asynchronous replication makes that loss a property of the design, not a bug. trains-valkey doesn't: replication runs over the verified ring, so a write that was acknowledged survives partition, double-kill, and rejoin. I validated it on AWS EC2 — zero acknowledged-write loss across the fault scenarios that lose data under Sentinel. It is "Sentinel without losing your acknowledged writes."
Where it honestly wins and loses
No silver bullets, and the benchmark study (laptop, a real lossy Wi-Fi link, EC2) is candid about it — but the candour only makes sense once you fix the axis. On the control-plane workload it's built for — small messages, a handful of nodes, total order, consistency first — it has a clean, predictable envelope: small-message throughput is its strong suit, latency stays in the low milliseconds through clusters of five, and it never sacrifices correctness under load or fault. On the data-plane workload it isn't built for — pushing large payloads at line rate — a leader-based log wins, and a single message waits two ring laps. That second result isn't a defect; it's the wrong axis. It halts under partition rather than diverge, by design. So the honest one-liner is not "faster than Raft" — it's "a small, ordered, consistency-first coordination primitive, proven to a degree its peers aren't."
Why I bothered
Partly to close a thirty-year-old open question for myself. But also because of a gap I keep noticing. The consensus protocols we run in production are tens of thousands of lines of code we mostly trust; the proofs of those protocols, when they exist, usually live in separate research artifacts, not in the code you deploy. A few teams do better — etcd's raft ships a TLA+ spec and even checks running execution traces against it; MongoDB and ZooKeeper ship specs in their repositories too — but they are the exceptions.
In the particular family this protocol belongs to — ring-based, token-passing total-order broadcast — I went looking, and the exceptions run out entirely. Corosync's Totem engine (the thing under a lot of Linux HA clusters) and Michel Simatic's own open-source TrainsProtocol are both battle-tested and both ship no formal specification; Totem's correctness lives in papers from the 1990s. As far as I can tell, trains-rust is the first ring total-order broadcast with a machine-checked spec, a reference implementation, and differential tests sitting in the same repository as the code. A 2,400-line kernel is what makes that maintainable — the distance between "proven" and "what runs" stays short enough to keep. That's the part I think is worth copying, whatever you make of trains itself.
The code, the TLA+, the verification report, and the benchmarks are all here:
Both are MIT-licensed. The patents expired years ago; the techniques are in the public domain. If you read the spec and find a sixth bug, I'd genuinely like to know.
— Yves Eychenne
Sources: Redis Sentinel documents Sentinel as eventually consistent with “last failover wins”; Jepsen measured 1,126 acknowledged writes lost out of 1,998 in one partition. See Redis Sentinel docs and Jepsen.
