On the Feasibility of Matrix as a Tactical MANET Messaging Substrate

Abstract

We ask whether the Matrix protocol can serve as the messaging substrate for a tactical radio mobile ad-hoc network (MANET): a mesh of vehicle-borne homeservers joined by intermittent, very-low-bandwidth (1–9 kbit/s) federation links. The central observation is that Matrix federation performs no multi-hop routing — if vehicle A cannot reach C directly, it does not relay through B. Cross-node delivery therefore happens only opportunistically, via room-DAG backfill, whenever two servers share a room and one discovers it is missing events. The study measures whether, and under which link and partition regimes, that directed acyclic graph (DAG) behaves as a usable delay-tolerant, store-carry-forward (SCF) layer. We build a single-host testbed — k3s + Calico + Chaos Mesh shaping only the server-to-server links, tuned Synapse homeservers on one Postgres, and an async orchestrator providing a single measurement clock — and run a measurement campaign (257 runs) across a bandwidth sweep, a fleet-size sweep (N=3,5,10,20,30,40,50,70), and a two-convoy partition/heal scenario.

Headline verdict: Matrix is infeasible for a 70-node tactical operational room — and for two independent reasons. As fleet size N grows from 3 to 70 we find two distinct feasibility barriers. Barrier 1 (bandwidth / fan-out). Where the room can converge (N ≲ 10), each message fans out into (N − 1) point-to-point federation transactions — each carrying a measured ~762 B signed-PDU envelope, N-independent — through one shared radio, so the per-radio bandwidth required for delivery grows with N. The fluid byte law B_req=r(N − 1)(S+O) sets the shape, but the measured requirement sits far above it and the inefficiency factor grows with N — from ≈8× the fluid floor at small N to ≈17× by N≥20 — because federation is request/response- and round-trip-bound (chatter-bound), not throughput-bound. The bandwidth for 50% delivery rises from 2.6 kbit/s (N=3) to 15.7 (N=10) and 110.6 kbit/s (N=30), already far beyond the 1–9 kbit/s tactical budget. Barrier 2 (the convergence ceiling) — the more fundamental one, and bandwidth-independent. Beyond ~N=15 a single 70-server room cannot converge its membership/state at all. Maximum achievable delivery, taken over every bandwidth tested (up to 460 kbit/s), falls from 100% at N≤10 to 65.8% (N=20), 51.5% (N=30), 40.8% (N=40), 38.2% (N=50), and 37.3% (N=70); adding bandwidth does not raise these plateaus. Synapse logs trace the cause to auth-gated federation backfill: a server is refused the room's prev_events (DAG ancestors) with HTTP 403 Forbidden because it is not yet a recognised member, yet it cannot become a member without that history — a chicken-and-egg that forks room state permanently, leaving most servers seeing only ~30–40 of the 70 members. This directly refutes the study's central hypothesis: the room DAG does not function as a store-carry-forward layer at scale, for a precise protocol-level reason independent of link bandwidth. The net verdict is sharp: even where bandwidth could suffice, the room cannot converge. We also report the testbed design and several genuine closed-mesh methodological findings (trusted-key-server isolation, anti-SSRF IP whitelisting, Chaos Mesh rate-unit calibration, stable signing keys, rate-limiter and chaos-injection confounds), and we revisit the working hypotheses against the measured data.

1. Introduction

Tactical radio networks for vehicle-borne units are the archetypal challenged network: links are slow, asymmetric, and frequently partitioned as vehicles move in and out of range. Messaging in such an environment is fundamentally a delay-tolerant, store-carry-forward problem rather than an end-to-end-path problem. Matrix is an attractive candidate substrate: it is an open, federated, eventually-consistent messaging protocol with mature server software (Synapse), a per-room event model that is already a replicated DAG, and a rich set of federation-tuning knobs.

The routing gap. Matrix federation, however, does not route. A homeserver synchronises a room's event graph only with the other servers it can currently reach; it does not forward another server's events on its behalf. When a server reconnects after a partition and notices it is behind, it pulls the missing history via backfill. This makes the room DAG an implicit store-carry-forward mechanism: events are stored in the DAG, carried by whichever servers hold them, and forwarded opportunistically on the next contact. Whether that implicit mechanism is good enough — whether backlog drains faster than partitions and slow links create it — is an empirical question, and it is the question this paper studies.

Contributions. (1) A faithful single-host testbed that shapes only the federation links, so every emulated radio edge is fully synthetic and controllable. (2) A set of closed-mesh methodological findings required to make Synapse federate at all on an air-gapped, throttled mesh — we argue these are reusable results, not mere configuration trivia. (3) A 257-run measurement of delivery ratio, latency, goodput, and partition-gated cross-convoy delivery across an extended bandwidth sweep and a fleet-size sweep spanning N=3 to 70 (the full design scale, measured directly). (4) The identification of two distinct feasibility barriers, both quantified: Barrier 1, a required-bandwidth that grows with fleet size at a widening 8–17× multiplier over the fluid byte floor (the protocol's chatter tax); and Barrier 2, a bandwidth-independent convergence ceiling — beyond N≈15 a single room never converges its membership, capping delivery at ~37% by N=70 regardless of link rate — which we trace to auth-gated federation backfill (HTTP 403) and which refutes the working hypothesis that the room DAG is a usable store-carry-forward layer at scale.

2. Background & related work

2.1 Matrix federation and the room DAG

In Matrix, homeservers exchange room history using the Server-Server (federation) API [1]. Each room's history is a directed acyclic graph of events whose partial order encodes causal ordering; the graph is synchronised between participating servers with eventual consistency, and each server validates incoming events against signature, hash, and authorization checks [1]. Crucially, a server only exchanges events with servers it can currently contact, and recovers missing history through backfill rather than through any in-network relaying. Synapse [2] is the reference homeserver and the one we use, chosen specifically for its federation retry/backoff controls and Admin API.

2.2 Delay-tolerant networking and store-carry-forward

The delay-/disruption-tolerant networking (DTN) literature formalises exactly the regime we operate in: networks where a contemporaneous end-to-end path may never exist, so messages are stored at intermediate nodes and forwarded on later contacts. Fall's architecture for challenged internets [3] and the IETF DTN architecture RFC 4838 [4] establish the message-oriented, custody-based forwarding model. Epidemic and store-carry-forward routing for partially connected ad-hoc networks, introduced by Vahdat and Becker [5], shows how pairwise contact and message exchange can deliver data across a network that is never simultaneously connected. The Matrix room DAG is, in effect, a constrained instance of this idea — but without active relaying, so its reach is bounded by which servers share a room and meet directly.

2.3 MANET messaging

Classical MANET routing protocols [6] assume a path can be discovered and maintained between endpoints; they degrade badly when the network is chronically partitioned, which is precisely why DTN/SCF approaches exist. Our work sits at the intersection: we take an off-the-shelf federated messenger that does no MANET routing and ask how far its eventual-consistency layer alone carries us under MANET-like partitioning and tactical link rates.

2.4 Emulation tooling

We emulate the cluster with Kubernetes (k3s) [7] and impair only the federation links with Chaos Mesh [8], whose NetworkChaos resource provides bandwidth shaping, delay, loss, and partition as Kubernetes custom resources — one tool for both the radio link and the scheduled topology.

3. Method / system design

3.1 Mapping and the testbed stack

One node = one vehicle = one Synapse homeserver. Users are modelled as clients permanently attached to their own vehicle's homeserver over a reliable intra-vehicle link that is never impaired; the constrained, dynamic radio links are the server-to-server (federation) edges. The whole mesh runs on a single host on purpose: co-locating all servers makes every federation edge synthetic and fully under the emulator's control, avoiding real-network variance on the very links we are trying to shape. The stack is k3s (single-node Kubernetes) with Calico providing enforced NetworkPolicy, Chaos Mesh for link emulation and partitioning, Synapse homeservers backed by one PostgreSQL cluster (one database per node), and a single async orchestrator that holds one client session per node, generates randomized traffic, and timestamps sends and receipts against one clock (so there is no inter-node skew).

3.2 What is and is not shaped

Two layers gate server-to-server traffic and are kept distinct: Calico NetworkPolicy defines the static allowed-edge graph (which servers may peer at all), while Chaos Mesh NetworkChaos applies the dynamic per-edge bandwidth shaping and scheduled partition/heal windows. Shaping is applied only to the federation port (server-server, 8448). Postgres and the client↔own-server links are local and unthrottled by design. We verified on two probe pods that bandwidth shaping, partition, and NetworkPolicy enforcement all take effect and compose correctly before scaling up.

3.3 Synapse tuning for flapping, low-rate links (§5)

Default Synapse is built for the public internet and silently mis-measures a tactical link. We therefore baked the following into the homeserver template, verified against the Synapse v1.122 config schema rather than from memory:

Federation retry backoff. The default minimum retry interval (10 min), multiplier (5), and maximum (~1 week) are far too slow for links whose up-windows are seconds-to-minutes: a server would record "not delivered" simply because it never tried inside the up-window. We lower destination_min_retry_interval to ~30 s, destination_retry_multiplier to ~2, and cap destination_max_retry_interval low. Better still, the topology operator calls the Federation Admin API to reset a destination's backoff and force an immediate reconnect the instant a link heals, removing the backoff confound at exactly the moments that matter.
Presence off. presence.enabled: false kills presence and its federation EDUs — the worst chatter offender on a 7 kbit/s link. Typing and read-receipt EDUs have no clean federation kill-switch, so the orchestrator simply never emits them. Federation profile and device-name lookups are disabled to trim incidental queries.
Cache trims. caches.global_factor ≈ 0.5 and a small top-level event_cache_size, monolith mode, URL preview off — RAM savers that matter at 70× and keep idle per-Synapse RSS to ~100 MiB.

3.4 Closed-mesh methodological findings

Making Synapse federate on an air-gapped, throttled mesh surfaced four findings that we present as genuine methodology, because each one would otherwise masquerade as a delivery failure and contaminate the headline metric:

D6 — trusted key servers must be empty. Synapse defaults to fetching peers' signing keys via matrix.org. On an air-gapped mesh that means every node tries to reach the public internet to verify keys and hangs — exactly the confound the backoff tuning exists to remove. Setting trusted_key_servers: [] (with suppress_key_server_warning: true) forces each node to fetch keys directly from the origin server over the shaped link. A realistic side effect: first-contact key fetches now ride the throttled link too, which we keep and note when interpreting first-contact latency.
D7 — anti-SSRF IP whitelist. Synapse's ip_range_blacklist defaults to filtering RFC 1918 ranges after DNS resolution as an SSRF guard. In-cluster pod IPs (10.42/16) are exactly those ranges, so make_join failed with "no results for hostname lookup" despite working DNS and reachable TLS. The fix is an explicit ip_range_whitelist: [10.42.0.0/16, 10.43.0.0/16].
D5 — Chaos Mesh rate-unit calibration (critical). The NetworkChaos bandwidth rate field's "bps" suffix denotes bytes per second, not bits: rate: "1mbps" sustained ≈ 1 MByte/s (~7.6 Mbit/s). The schedule generator therefore converts target kbit/s to a byte rate as kbit/s × 125 (e.g. 1 kbit/s → 125bps, 9 kbit/s → 1125bps). Getting this wrong would inflate every link by 8×.
D8 — stable signing keys. Boot-time key generation into an ephemeral volume gives each pod a fresh identity on every restart, breaking federation key caching across runs. We instead mint one ed25519 signing key per node ahead of time and ship it in the config Secret, giving stable identity across restarts.

A federation TLS chain is provided by a private test CA (federation requires valid TLS; in a closed mesh the servers trust a shared test CA). We also stagger link-up events in the schedule and log per-run CPU/run-queue saturation, discarding any run whose measurement window pegged the box, because synchronized "heal storms" would otherwise inflate convergence latency through CPU contention rather than link dynamics.

3.5 Scenarios

s0 — full mesh. All federation edges permitted and shaped at a single bandwidth; isolates the pure low-rate effect with no partitioning. s1 — two-convoy partition/heal. Nodes are split into convoy A and convoy B; intra-convoy links stay up while the inter-convoy link flaps on a staggered schedule, so cross-convoy delivery is partition-gated and can only progress during heal windows — the canonical store-carry-forward test.

3.6 Metric definitions (§6)

Per intended recipient we record delivery ratio (delivered receipts / intended receipts), end-to-end latency (receipt minus send, on the single clock) and its distribution including the p95 tail and convergence-after-heal, and a non-convergence flag (set when a run's delivery ratio stays below 0.5 by run end). The analysis additionally splits delivery into intra- vs cross-convoy for s1. Goodput / overhead is the useful-payload ratio, payload / (payload + per-PDU overhead). A federation PDU carrying one room message costs payload + FED_PDU_OVERHEAD bytes. We measure this envelope directly by serialising the signed PDU (the auth/prev-event references, content hashes, ed25519 signatures, and transaction framing) at the Matrix-PDU layer: FED_PDU_OVERHEAD = 762 B for a single-room message with prev_events=1, auth_events=3, and one signature. Critically, this overhead is N-independent — the same 762 B is observed at N=3, 5, 10, 20, and on up to 70 (the per-message envelope does not grow with fleet size; what grows is the number of point-to-point transactions, one per peer). This figure is the Matrix-PDU layer only and excludes lower-layer TLS/TCP/IP framing, so it is a lower bound on true bytes-on-wire. At the representative ~220 B tactical payload (the campaign mean) the useful-payload ratio is ≈0.22; goodput reaches 0.5 only once the payload itself equals the 762 B envelope.

4. Experimental setup

Hardware. A single x86-64 host with 16 physical cores, 125 GB RAM, NVMe SSD, running Debian 13 (trixie). This deviates from the brief's Ubuntu 24.04 / 128–256 GB target: RAM is adequate for N=10 (idle per-Synapse RSS ≈ 100 MiB), and 16 physical cores give more heal-storm headroom than the 12-vCPU target. Software versions are pinned: k3s v1.35.5+k3s1 (flannel and built-in netpol disabled), Calico v3.29.1 (pod CIDR 10.42.0.0/16), Helm v3.21.2, Chaos Mesh 2.7.0, PostgreSQL 16, Synapse (element-hq image).

Fleet. Ten Synapse nodes (nodeNN.synapse.svc.cluster.local, federation on :8448, headless Service per node so each name resolves to its pod IP) on one Postgres cluster (ten UTF-8 / C-collation databases). Federation was smoke-tested unshaped: node02 joined node01's room and received a message in 1.08 s; the orchestrator's unshaped validation run delivered 144/144 (100%) with median latency ~185 ms.

Campaign matrix. The campaign comprises 257 valid runs. Scenario s0 (full mesh) sweeps bandwidth over an extended grid — from 1 kbit/s up to several hundred kbit/s (460 kbit/s at the largest fleets) — with replicates per cell, both to fill the delivery-vs-bandwidth curve and to locate B₅₀, the bandwidth at which delivery reaches 50%. A second sweep varies fleet size N ∈ {3, 5, 10, 20, 30, 40, 50, 70} at a fixed per-node message rate (~0.05 msg/s/node) to recover the required-bandwidth scaling law (§5.3) and the convergence ceiling (§5.2). Scenario s1 (two-convoy partition/heal) runs at bandwidth ∈ {2, 3, 5, 9} kbit/s. Run length is fixed across bandwidths for comparability — we deliberately do not extend low-bandwidth runs, since non-convergence within a fixed window is itself the result. Runs are executed serially on the one box (overlap would contaminate timing), with per-run CPU-saturation monitoring; saturated runs are flagged invalid. We additionally gate every measured phase on Chaos Mesh confirming the throttle is injected (D11) and reset room membership between runs (D12), after discovering that unverified injection and accumulated membership each produced artifacts. Crucially, this campaign measures the full N=3–70 range directly, including the 70-node design — so the headline verdict is measured, not extrapolated from a smaller pilot.

5. Results

Every figure is generated from the campaign's orchestrator logs by analysis/plots.py; the per-run table below loads data/summary.json produced by analysis/metrics.py. All numbers are means over the replicates of each cell from results/summary.csv (257 valid runs).

5.1 Two feasibility barriers

The extended N=3–70 sweep reveals two distinct barriers to using Matrix as a tactical room, and they bind in different regimes. Barrier 1 — the (N − 1) fan-out bandwidth tax (§5.5): where the room can converge (N ≲ 10), the bandwidth required for delivery grows with N because each message fans out into one point-to-point federation transaction per peer; the requirement runs far above the fluid byte floor and the gap widens with N. Barrier 2 — the convergence ceiling (§5.2): beyond ~N=15 the room cannot converge its membership at all, so the maximum delivery achievable over any bandwidth falls steadily with N and cannot be bought back with a faster link. Barrier 2 is the more fundamental: it caps delivery below 50% from N=40 upward, which makes the Barrier 1 bandwidth law moot at the largest fleets (there is no bandwidth at which 50% is reached). We present the ceiling first.

5.2 Barrier 2: the convergence ceiling (bandwidth-independent)

The dominant result of the extended sweep is a hard, bandwidth-independent convergence ceiling. Fig. 1 plots the maximum delivery ratio achieved at each fleet size over every bandwidth in the campaign. Through N=10 the room converges fully (100%); from N=20 the ceiling drops and keeps dropping — 65.8% (N=20), 51.5% (N=30), 40.8% (N=40), 38.2% (N=50), 37.3% (N=70). Critically the ceiling is flat in bandwidth (Fig. 2): at N=70 delivery plateaus near 37% whether the federation link carries 30 or 460 kbit/s — indeed the 460 kbit/s cell is no better than the 360 kbit/s cell. Bandwidth is simply not the binding constraint here; the room never assembles a consistent membership, so a majority of messages have no recipient set to be delivered to.

Root cause (from Synapse federation logs). Servers reject incoming events because the event's prev_events — its DAG ancestors — cannot be back-filled: federation backfill returns HTTP 403 Forbidden. Backfill is authorization-gated, and a server that is not yet a recognised member of the room is refused the history. This is a chicken-and-egg: to be accepted as a member a server needs the room's DAG/auth history, but to fetch that history over federation it must already be a member. The result is that room state forks and never re-converges; in the large-N runs most servers end up seeing only ~30–40 of the 70 members — which is exactly the ~37% delivery floor. This is consistent with the closed-mesh key-fetch caveat we flagged for the topology axis (trusted_key_servers: [] means a server may also be unable to fetch a non-adjacent origin's signing key), but the proximate, logged cause is the auth-gated backfill 403. The implication is qualitative, not merely quantitative: it refutes the study's central hypothesis. The room DAG does not act as a store-carry-forward layer at scale, and it fails for a precise protocol-level reason that no amount of link bandwidth, backoff tuning, or longer run window can fix — it is structural to how Matrix gates backfill on membership.

maximum achievable delivery ratio versus fleet size, falling from 100% to 37% — **Fig. 1.** *(Headline.)* Maximum achievable delivery ratio vs fleet size N, taken over *all* bandwidths tested (up to 460 kbit/s). The ceiling is 100% for N≤10 but collapses thereafter — 65.8% (N=20), 51.5% (N=30), 40.8% (N=40), 38.2% (N=50), 37.3% (N=70) — and is flat in bandwidth: no link rate we tested lifts it. Beyond N≈15 a single room cannot converge its membership, so a majority of messages are never delivered regardless of how fat the radio link is.

delivery ratio versus bandwidth, one curve per fleet size, plateauing below 100% for large N — **Fig. 2.** Delivery ratio vs federation bandwidth, one curve per fleet size N. For N≤10 the curve climbs to 100% once the link is fat enough (Barrier 1). For N≥20 each curve rises then *flattens onto a plateau well below 100%* (Barrier 2): adding bandwidth past the knee buys nothing because room state has forked and most servers never learn the full membership. At N=70 the plateau sits at ~37% even at 460 kbit/s.

5.3 Barrier 1: delivery, latency and the (N−1) fan-out tax

Within fleets that can converge, link bandwidth is the binding constraint, and the fixtures below establish the picture (illustrated at N=10).

delivery ratio versus federation link bandwidth, full mesh — **Fig. 3.** Delivery ratio vs federation link bandwidth (s0, full mesh, N=10, msg-interval 2.5 s, mean of 3 replicates). Delivery rises *monotonically* with bandwidth: ≈3.7% at 1 kbit/s, ≈11% at 5, ≈20% at 9, ≈33% at 12, ≈56% at 15 kbit/s (the 50%-delivery knee), then ≈98% at 20 and 100% at 30–45 kbit/s. At low rates the fixed-length window is too short to drain the backlog the slow link itself creates, so every cell at or below 12 kbit/s trips the non-convergence flag. Points are replicate runs; the line is the per-bandwidth mean.

convergence-latency CDF by bandwidth — **Fig. 4.** Convergence-latency CDF per bandwidth (s0, delivered messages, N=10). Among *delivered* messages, median latency sits in the tens of seconds across the throttled band (≈38 s at 1 kbit/s, ≈25 s at 3, with p95 tails of 80–100 s), and only collapses once the link clears: ≈24 s median at 20 kbit/s, ≈4.4 s at 30, ≈1.9 s at 45 kbit/s. Low-bandwidth curves carry a long right tail as backlog drains slowly across the link.

goodput ratio versus payload size — **Fig. 5.** Goodput vs application payload size. The curve is payload/(payload+762 B) using the *measured* signed-PDU envelope (§3.6); real per-message payloads are overlaid. Federation framing and signatures dominate at small sizes, so the useful-payload ratio collapses for short tactical messages — ≈0.22 at the ~220 B representative payload, reaching 0.5 only at a 762 B payload. This is the Matrix-PDU layer and excludes TLS framing, so it is a lower bound.

two-convoy partition gating of cross-convoy delivery — **Fig. 6.** Two-convoy MANET (s1, N=10): intra-convoy delivery (links always up) vs cross-convoy delivery (partition-gated). Both are starved at low bandwidth — at 2 kbit/s intra ≈1.8% / cross ≈1.0%; at 3 kbit/s ≈3.9% / ≈4.4%; at 5 kbit/s ≈18.9% / ≈16.0%; at 9 kbit/s ≈28.8% / ≈32.6% — and every s1 cell in the sweep trips the non-convergence flag. Cross-convoy traffic can only progress during the staggered heal windows (store-carry-forward through the room DAG); at these rates the window simply never delivers the backlog, so cross-convoy delivery is strongly gated and both streams remain far below convergence.

Fig. 3 and Fig. 4 establish the within-fleet picture at N=10. Delivery is a clean monotone function of link rate, and the run is fully usable (≈98–100% delivery, single-digit to ~20 s latency) only from ~20 kbit/s upward (≈100% at ~30 kbit/s) — well above the 1–9 kbit/s tactical envelope, within which delivery never exceeds ≈20%. Fig. 7 plots median and p95 latency directly against bandwidth as a companion view.

median and p95 latency versus bandwidth — **Fig. 7.** Median and p95 end-to-end latency vs bandwidth (s0, N=10, delivered messages). Latency among delivered messages stays in the tens of seconds throughout the throttled band and only drops to a few seconds once the link clears (≈30 kbit/s and above).

5.4 Federation overhead is fixed per message and N-independent

We serialised the signed federation PDU for a single-room message and measured an envelope of 762 bytes (auth_events=3, prev_events=1, one ed25519 signature, plus transaction framing) — markedly below the 1100 B we had previously estimated. The decisive property (Fig. 8) is that this 762 B is invariant in fleet size: it is the same from N=3 up to N=70. The per-message envelope does not grow with the fleet; what grows is the number of point-to-point federation transactions a sender must emit, one per peer, i.e. (N − 1) copies. This is exactly why the bandwidth cost scales with fleet size even though the per-PDU cost does not. The 762 B is the Matrix-PDU layer and excludes TLS/TCP/IP framing, so true bytes-on-wire are somewhat higher.

measured federation PDU overhead versus fleet size, flat at 762 bytes — **Fig. 8.** Measured signed-PDU overhead vs fleet size. The envelope is flat at 762 B across N=3–70; the cost of scaling is the (N − 1) fan-out of transactions, not a larger PDU.

5.5 Required-bandwidth scaling law and the widening chatter tax

To turn the per-fleet curves into a scaling statement we measure B₅₀, the link bandwidth at which delivery reaches 50%, as a function of fleet size N at a fixed per-node rate (~0.05 msg/s/node). The result (Fig. 9) grows super-linearly in N, and runs out of meaning past N=30 because of Barrier 2:

N=3 → B₅₀ ≈ 2.6 kbit/s (k ≈ 8.4× the fluid floor)
N=5 → B₅₀ ≈ 7.1 kbit/s (k ≈ 9.3×)
N=10 → B₅₀ ≈ 15.7 kbit/s (k ≈ 6.1×)
N=20 → B₅₀ ≈ 34.2 kbit/s (k ≈ 16.1×)
N=30 → B₅₀ ≈ 110.6 kbit/s (k ≈ 16.7×)
N≥40 → unreachable: delivery never reaches 50% at any bandwidth tested (Barrier 2 caps it at ≤40.8%), so B₅₀ does not exist.

The fluid byte-budget law B_req = r·(N − 1)·(S + O) — sender rate r times the (N − 1) fan-out times payload-plus-overhead (S≈220 B, O=762 B) — predicts the correct shape (∝ N − 1) but massively under-predicts the magnitude, and the gap widens with N. The inefficiency factor k = B₅₀/B_fluid is ≈8× at N=3 and ≈9× at N=5, dips to ≈6× at N=10, then climbs to ≈16× at N=20 and ≈17× at N=30 — i.e. it grows with fleet size rather than holding constant. (This supersedes our earlier pilot estimate of a roughly constant ~4× multiplier; the fuller N=3–70 sweep shows the tax compounding with N.) The growing multiplier is the protocol's RTT / request-response / chatter tax (see §6): federation is round-trip-bound, and the round-trips multiply with peers, so achievable throughput falls further below the raw link rate as the room grows. A naive byte count for N=70 gives only ~27 kbit/s (the fluid prediction); the measured requirement is far higher — and ultimately moot, because Barrier 2 caps delivery at ~37% before 50% is ever in reach.

A second cut holds N fixed (N=10) and varies the per-node message rate; B₅₀ again rises with offered load (from ~9 kbit/s at the lowest rate measured to ~19 kbit/s at 1 msg/s/node), confirming the law is driven by aggregate offered transactions rather than fleet size alone (Fig. 10).

bandwidth for 50% delivery versus fleet size, super-linear, undefined beyond N=30 — **Fig. 9.** B₅₀ (bandwidth for 50% delivery) vs fleet size N at fixed per-node rate. Measured points (2.6, 7.1, 15.7, 34.2, 110.6 kbit/s for N=3,5,10,20,30) grow *super*-linearly in N − 1, while the fluid prediction lies far below and the gap widens (k rising from ≈8× to ≈17×). Beyond N=30 the curve simply ends: at N≥40 delivery never reaches 50% at any bandwidth tested (Barrier 2), so B₅₀ is undefined — the bandwidth law is overtaken by the convergence ceiling.

bandwidth for 50% delivery versus per-node message rate — **Fig. 10.** B₅₀ vs per-node message rate (N=10 fixed). Required bandwidth rises with offered load, confirming the scaling is driven by aggregate offered federation transactions.

delivery-ratio heatmap over fleet size and bandwidth — **Fig. 11.** Delivery-ratio surface over the N × bandwidth grid (s0). The 50%-delivery contour (B₅₀) marches rightward as N grows — a visual statement of the super-linear scaling law — and then disappears entirely at large N, where the whole row is capped below 50% by the convergence ceiling no matter how far right (how much bandwidth) one goes.

5.6 Three non-convergence regimes

The non-convergence flags now span three qualitatively different mechanisms, which must not be conflated. (i) s0, small N (≤10) — window-limited. Absent partitions a finite backlog drains at any bandwidth B > 0, so the low delivery at throttled cells is merely the fixed-length run ending before the slow link finishes draining — not a permanent inability to deliver; given enough bandwidth (or time) these fleets reach 100%. (ii) s0, large N (≥20) — structural convergence ceiling (Barrier 2). Here delivery plateaus below 100% and the plateau is bandwidth-independent (§5.2): the logged HTTP 403 backfill failures show the room never assembles a consistent membership, so this is not a too-short window but a structural fork that more bandwidth cannot heal. (iii) s1 — structural partition starvation. Cross-convoy backlog can only drain inside heal windows and, at low bandwidth, never catches up. The honest summary is therefore not "a single bandwidth threshold" but: small rooms are window-/bandwidth-limited, while large rooms hit a hard ceiling that bandwidth does not move.

5.7 Per-run summary

run	scenario	bw (kbit/s)	delivery	median lat (s)	goodput	non-conv.	status
loading data/summary.json…

6. Discussion

The quantified verdict: two barriers, one conclusion. Matrix is infeasible for a 70-node tactical operational room, and the infeasibility is over-determined — it fails for two independent reasons that bind in different regimes. Barrier 1 (bandwidth / fan-out). Matrix sends one point-to-point federation transaction per peer per message, so a sender in an N-node room emits (N − 1) copies, each carrying a measured ~762 B signed-PDU envelope plus TLS framing. Where the room can converge, the per-radio bandwidth required for delivery therefore grows with fleet size — B₅₀ measured at 2.6, 7.1, 15.7, 34.2, 110.6 kbit/s for N=3, 5, 10, 20, 30 — already an order of magnitude beyond the 1–9 kbit/s tactical link budget by N=10. Barrier 2 (the convergence ceiling), the more fundamental one. Beyond ~N=15 the room cannot converge its membership at all, so the maximum delivery achievable over any bandwidth falls to 65.8% (N=20), 51.5% (N=30), 40.8% (N=40), 38.2% (N=50), 37.3% (N=70) and stays there no matter how fat the link is. The two compose into a blunt conclusion: even where bandwidth could in principle suffice, the room cannot converge — so adding radio capacity does not rescue the design. This is the headline result.

Why Barrier 1 is worse than the byte count. The fluid byte-budget B_req = r·(N − 1)·(S + O) captures the right shape but under-predicts the required bandwidth by a factor that grows with N — ≈8× at N=3, ≈9× at N=5, ≈6× at N=10, ≈16× at N=20, ≈17× at N=30. (Our earlier pilot read this as a roughly constant ~4×; the fuller N=3–70 sweep shows the multiplier compounding.) The gap is structural: Matrix federation is request/response- and RTT-bound — make_join, backfill, transaction acknowledgement, and key fetches are all round-trips, and they multiply with peers — so over the high-latency links emulated here the achievable goodput is far below the raw link rate, and increasingly so as the room grows. A naive byte count for N=70 would have said ~27 kbit/s and badly understated the problem.

Why Barrier 2 is the deeper finding. Bandwidth is a budget you can buy; the convergence ceiling is not. The logged root cause is auth-gated federation backfill: a server is refused an event's prev_events with HTTP 403 because it is not yet a recognised member, yet it cannot become a member without that history. Room state forks and never re-converges, leaving most servers seeing only ~30–40 of 70 members — the ~37% floor. No backoff tuning, longer window, or fatter link addresses a membership-gated authorization check. This is the qualitative result of the study, and it is a protocol-level limitation of Matrix federation for a single large room, not a property of our emulation.

We revisit the three working hypotheses against the measured data.

H1 — opportunistic store-carry-forward governed by link rate × partition dynamics. Confirmed at small N, refuted at scale. For N≤10 the room DAG does carry cross-node messages with no Matrix-level routing, delivery is a clean monotone function of link rate (Fig. 3), and in s1 cross-convoy delivery tracks the heal windows rather than any protocol logic (Fig. 6). But the hypothesis's premise — that delivery is governed by link/partition dynamics and "not by Matrix logic" — fails for the target fleet: at N≥20 Matrix logic (membership-gated backfill) becomes the binding constraint, and the room DAG is not a usable store-carry-forward layer at 70 nodes. The study's central hypothesis is thus refuted at the design scale.

H2 — goodput collapse for short messages. Confirmed, and grounded in a measured 762 B envelope rather than an estimate. The useful-payload ratio is ≈0.22 at the representative ~220 B tactical payload and reaches 0.5 only at a 762 B payload (Fig. 5). Because the envelope is N-independent (Fig. 8), this collapse is the same at every fleet size; the 762 B is the Matrix-PDU layer and excludes TLS, so real goodput is somewhat worse.

H3 — non-convergence below a bandwidth threshold. Superseded. The hypothesis anticipated a low (~1–2 kbit/s) bandwidth threshold below which backlog drains slower than partitions create it. We instead find that the most important non-convergence is bandwidth-independent — precisely the opposite of a threshold. There are three regimes (§5.6): s0 small-N is window-/bandwidth-limited (a faster link does fix it); s0 large-N hits the structural convergence ceiling that no bandwidth moves (Barrier 2); and s1 exhibits partition starvation gated by heal windows. So the honest statement is not "a single bandwidth threshold" but "beyond N≈15 the room fails to converge for reasons bandwidth cannot address."

7. Limitations & threats to validity

Pilot scale & replication. Two replicates per s0 cell and a two-point s1 sweep; convergence latency is noisy, so confidence intervals are wide. Trends are indicative, not statistically settled.
N=70 is now measured, not extrapolated — but the ceiling may be config-sensitive. This campaign measures the full N=3–70 range directly, including the 70-node design, so the convergence-ceiling verdict (Barrier 2) is observed rather than projected. What we have not isolated is whether that ceiling could be raised by Synapse configuration. The Barrier-2 plateau could interact with Synapse's state-resolution behaviour and with our closed-mesh key-fetch choice trusted_key_servers: [] (D6): with no trusted key server, a server that cannot reach an event's origin may also be unable to verify its signing key, which would compound the auth-gated-backfill 403. We attribute the ceiling to the logged backfill 403, but a configuration that pre-distributes state/keys, or uses a shared key server, might shift it. This is an open threat to the precise numbers, not to the qualitative finding.
The Barrier-1 bandwidth law is moot at large N. B₅₀ (and any "B₁₀₀") is only meaningful where the room can converge; for N≥40 delivery is capped below 50% by Barrier 2, so B₅₀ does not exist and the bandwidth figures for the largest fleets describe a target that is never reachable. The required-bandwidth scaling law should be read as a within-convergence result (N≲30).
Short, fixed runs. A fixed window is the right choice for comparability, but it means s0 small-N "non-convergence" is relative to that window, not proven asymptotic (§5.6). The structural claims — Barrier 2 (bandwidth-independence + logged 403) and the s1 partition regime — do not rely on the window length.
Overhead measured at the PDU layer, not on the wire. The 762 B envelope is serialised from the signed federation PDU (auth/prev events, hashes, signatures, transaction framing) and is N-independent, but it excludes lower-layer TLS/TCP/IP framing and backfill/key-fetch round-trips. True bytes-on-wire are higher, so the goodput figures are an upper bound on real goodput; a packet capture would tighten them downward.
Synthetic links on one box. Bandwidth, delay, and partitions are emulated via Chaos Mesh tc/netem on a single host. This is deliberate (control beats realism here), but it omits real radio effects — fading, MAC contention, asymmetric/jittery rates, and bit-level corruption interactions — and shares one kernel network stack.
Single 16-core host, shared resources. Despite the CPU-saturation guard, all 70 Synapses, Postgres, and the orchestrator contend for one 16-core machine; residual contention could perturb tail latency, and a synchronized backfill/heal burst at large N stresses the box more than at the validated N=10 scale.

8. Future work

Attack the convergence ceiling (Barrier 2) directly. Pin down whether the auth-gated-backfill 403 can be worked around — e.g. pre-seeding room state/keys at join, sharding the fleet into many small rooms instead of one 70-server room, a shared key server, or a homeserver/protocol variant with a different backfill-authorization model — and measure how far each lifts the ~37% ceiling. This is the highest-value follow-up, since Barrier 2 dominates the verdict.
Deepen replication on the medium/thorough campaigns (more bandwidths, scenarios, and 3–5 replicates) to turn indicative trends into statistically supported ones, and confirm the convergence ceiling is stable across Synapse state-resolution settings.
Complement the measured 762 B PDU envelope with a real packet capture of the federation link to add the TLS/TCP/IP framing and measure backfill/key-fetch/reconvergence bytes per heal directly — quantifying the (growing 8–17×) RTT/chatter tax byte-for-byte.
Enrich the link model toward real radio: add delay/jitter/loss/corruption alongside bandwidth, asymmetric rates, and contact patterns derived from vehicle mobility traces.
Explore additional topology scenarios (more than two convoys, chained partitions, intermittent relays) and quantify how DAG reach depends on shared-room structure.
Compare against a homeserver with a thinner federation footprint and against an explicit DTN/epidemic overlay to bound how much an active SCF layer would add over Matrix's implicit one.

References

The Matrix.org Foundation. Matrix Specification — Server-Server (Federation) API. spec.matrix.org. [TODO: verify exact spec version cited]
Element / The Matrix.org Foundation. Synapse: a Matrix homeserver. element-hq/synapse, GitHub. [TODO: verify version]
K. Fall. "A Delay-Tolerant Network Architecture for Challenged Internets." Proc. ACM SIGCOMM 2003, pp. 27–34. (SIGCOMM Test-of-Time Award.)
V. Cerf, S. Burleigh, A. Hooke, L. Torgerson, R. Durst, K. Scott, K. Fall, H. Weiss. Delay-Tolerant Networking Architecture. IETF RFC 4838 (Informational), April 2007.
A. Vahdat and D. Becker. Epidemic Routing for Partially-Connected Ad Hoc Networks. Technical Report CS-2000-06, Duke University, 2000.
E. M. Royer and C.-K. Toh. "A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks." IEEE Personal Communications, vol. 6, no. 2, pp. 46–55, 1999.
The Kubernetes Authors. Kubernetes (k3s lightweight distribution). kubernetes.io / k3s.io. [TODO: verify citation form]
The Chaos Mesh Authors (CNCF). Chaos Mesh: A Chaos Engineering Platform for Kubernetes. chaos-mesh.org, GitHub chaos-mesh/chaos-mesh. [TODO: verify citation form]