Background
MariaDB is a popular open source SQL database which began as a fork of MySQL. MariaDB Galera Cluster is an active-active replication system for MariaDB which allows reads and writes on every node. In 2015 this author analyzed MariaDB with Galera Cluster and found that although Galera claimed to offer Snapshot Isolation, Codership Oy intentionally designed their system without a key Snapshot Isolation property called first-committer-wins. This allowed MariaDB with Galera Cluster to lose money, or create it out of thin air, in a simulated bank account transfer workload. In 2025 MariaDB acquired Codership Oy, bringing Galera Cluster under the MariaDB umbrella.
Galera Cluster is based on a virtual synchrony group
communication framework called gcomm. Transactions are
initially executed optimistically
on any node. When a transaction commits it is synchronously
replicated to other nodes, which certify the transaction based on
the primary keys it wrote. Conflicts with other transactions are
identified based on a sequence number, or seqno.
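The certification step can be thought of as a write-set intersection check against a totally ordered stream of transactions. The following Python sketch is a hypothetical simplification, not Galera's actual implementation: the function and variable names are invented, and real certification also handles purged seqno ranges, streaming replication, and more.

```python
# Hypothetical sketch of Galera-style certification (not the real code).
# A transaction carries the set of primary keys it wrote and the last
# seqno it had observed when it was replicated.

def certify(txn_keys, last_seen_seqno, certified):
    """certified: list of (seqno, keys) for already-certified transactions,
    in seqno order. The incoming transaction conflicts if any transaction
    it did not observe (seqno > last_seen_seqno) wrote an overlapping key."""
    for seqno, keys in certified:
        if seqno > last_seen_seqno and txn_keys & keys:
            return False  # conflict: every node deterministically aborts
    return True  # no conflict: every node deterministically commits
```

Because every node runs the same deterministic check against the same totally ordered stream of write sets, all nodes reach the same commit/abort decision without further communication.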
The MariaDB Galera Replication Guide says that Galera uses unanimous replication:
Unlike traditional asynchronous or semi-synchronous replication, Galera ensures that transactions are committed on all nodes (or fail on all) before the client receives a success confirmation.
This is obviously wrong. If Galera actually required transactions to
commit on all nodes, it would not tolerate a single node failure.
MariaDB’s documentation often repeats this claim, saying “a
transaction is not truly considered committed until it has passed
certification on all nodes”, or “when
a transaction COMMITs, all nodes in the cluster have the
same value”, or “only
after Node A gets an ‘OK’ from all other nodes does it tell the client,
‘Your transaction is committed.’” In reality, Galera Cluster
continues to operate when a minority of nodes has failed. This is
consistent with MariaDB’s
claims about fault tolerance: if a quorum of nodes are online
and connected, that component can make progress.1
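The quorum rule can be sketched as a weighted majority check. This Python sketch is a simplified illustration with invented names; Galera's actual primary-component logic also considers membership history and partition handling.

```python
def has_quorum(weights, reachable):
    """weights: node -> voting weight (Galera exposes a tunable weight
    per node). reachable: the set of nodes this component can see.
    A component may make progress only with a strict majority of weight."""
    total = sum(weights.values())
    live = sum(w for node, w in weights.items() if node in reachable)
    return 2 * live > total
```

With equal weights, a three-node cluster tolerates one node failure. Weighting one node more heavily lets a smaller component retain quorum, at the cost of losing more nodes' worth of fault tolerance elsewhere.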
Galera used to require a manual recovery procedure when quorum was lost: an operator would have to log in to every node, identify the node with the highest sequence number, and use it to bootstrap the cluster. However, newer versions of Galera can recover from failures automatically.
Safety
“Data is consistent across all nodes at all times,” says the Galera Cluster Replication Guide, “preventing data loss upon node failures.” Galera “essentially transforms a set of individual MariaDB servers into a robust, highly available, and consistent distributed database system.”
This system should provide a real-time consistency model like Strong Snapshot Isolation. MariaDB’s Galera Cluster Guide says that Galera Cluster’s synchronous replication means that changes are “instantly replicated to all other nodes, ensuring no replica lag and no lost transactions.” The “no lost transactions” claim is repeated in MariaDB’s Galera Cluster README.
The Galera
Cluster Usage Guide promises that “Standard SQL transactions
(START TRANSACTION, COMMIT,
ROLLBACK) work as expected.” From this one might assume
that MariaDB with Galera Cluster supports the same consistency models as
a single MariaDB node. Is this true? It is surprisingly difficult to
find out! MariaDB’s Galera documentation does include a section on known
limitations. Some kinds of explicit locking are unsupported, and
MariaDB must use the InnoDB storage engine. However, this list makes no
mention of isolation levels or consistency anomalies. In fact, the sole
reference to isolation levels Jepsen found in MariaDB’s Galera
documentation is buried in the Management section, under Installation
and Deployment, on the Tips
on Converting to Galera page, under the “Transaction size”
heading.2 It says:
Galera’s tx_isolation is between Serializable and Repeatable Read. tx_isolation variable is ignored.
Repeatable Read is a remarkably strong consistency model. In most formalisms it is equivalent to Serializability so long as objects are selected by primary key, rather than predicates. In MariaDB, “Repeatable Read” used to allow non-repeatable reads but now prohibits them; per MDEV-35124, MariaDB “Repeatable Read” should actually provide Snapshot Isolation. We therefore expect MariaDB Galera Cluster to provide a consistency model no stronger than Serializable, but at least as strong as Repeatable Read, Snapshot Isolation, or both.
Test Design
We adapted Jepsen’s existing test suite for MySQL & MariaDB to set up three-node clusters of MariaDB with Galera Cluster, running on Debian Trixie. We used MariaDB’s official Debian repositories to install MariaDB 12.1.2 through 12.2.2, and Galera 26.4.13 through 26.4.25. We used MariaDB’s official Java client at version 3.5.6 to submit transactions to the cluster. While testing we introduced a variety of faults, including network partitions, process pauses, and process kills.
As in our previous MySQL analysis, our main workload used Elle’s list-append checker for transactional isolation. In a nutshell, Elle infers Adya’s write-write, write-read, and read-write dependencies between transactions, then looks for cycles in the resulting dependency graph, as well as a few other phenomena.
To infer these dependencies, our append workload performed randomly
generated transactions over lists of integers, with each list identified
by a unique primary key. Each micro-operation within a transaction could
either read a list, or append a unique integer element to a list. As in
previous work, we encoded these lists as a text column of
comma-separated elements, and used SQL concat to append
elements to a specific row. We split rows across multiple tables with a
structure like:
```sql
create table "txn0" (
  id int not null primary key,
  val text
);
```

Since a row was only changed by appending a unique integer to its
val column, any read of a row told Elle exactly which
transactions wrote to it, and in which order. From this Elle inferred
the version order for each row, which allowed inference of all three
types of transaction data dependencies.3 It
also inferred session and real-time dependencies based on the
concurrency structure of the recorded history. Elle then found strongly
connected components in that graph, and searched for cycles with
particular shapes to find counterexamples to a variety of consistency
models. For example, a cycle involving only write-write and write-read
edges would constitute G1c, a violation
of Read
Committed.
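The inference step described above can be sketched in a few lines. This Python sketch is a toy illustration of the idea (with invented names); Elle itself is far more thorough.

```python
# Toy sketch of dependency inference from a list-append history.
# A read of [1, 2, 3] on some key means: the transaction that appended 1
# wrote the key before the one that appended 2, and so on.

def ww_edges(observed, writer):
    """observed: list read from one key; writer: element -> transaction id.
    Returns write-write dependency edges implied by the observed order."""
    return [(writer[a], writer[b])
            for a, b in zip(observed, observed[1:])
            if writer[a] != writer[b]]

def has_cycle(edges):
    """Depth-first search for a cycle in the dependency graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set())
    state = dict.fromkeys(adj, 0)  # 0 = new, 1 = on stack, 2 = done

    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and dfs(n) for n in adj)
```

For example, if transaction T1 appended 1 and 3 while T2 appended 2, a read of [1, 2, 3] implies T1 → T2 → T1: a cycle of write-write edges, which is G0, a violation of even Read Uncommitted.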
Results
Write Loss on Coordinated Process Crash (MDEV-38974)
When all nodes crashed at approximately the same time, MariaDB with Galera Cluster regularly lost committed transactions. For example, in this one-minute test run, the cluster lost nine values appended to three different rows. Reads of row 112 around the time of a process crash observed:
| Time (s) | Elements |
|---|---|
| 50.63 | … 38, 45, 51, 53 |
| 50.64 | … 38, 45, 51, 53, 56, 57, 58 |
| 50.64 | … 38, 45, 51, 53, 56, 57 |
| 50.65 | … 38, 45, 51, 53, 56, 57, 58 |
| 50.65 | … 38, 45, 51, 53, 56, 57, 58 |
| 50.65 | … 38, 45, 51, 53, 56, 57, 58, 71 |
| 50.65 | … 38, 45, 51, 53, 56, 57, 58, 71 |
| 50.66 | … 38, 45, 51, 53, 56, 57, 58, 71 |
| 65.73 | … 38, 45, 51, 158, 159 |
| 65.73 | … 38, 45, 51, 158, 159, 160 |
All of the transactions which wrote these values were acknowledged as successfully committed. However, when the cluster restarted the appends of 53, 56, 57, 58, and 71 to row 112 were lost, and new elements were appended in their place: 158, 159, 160, and so on. The lost elements never appeared in any later read.
This behavior seemed to be caused by setting
innodb_flush_log_at_trx_commit = 0; setting it to
1 dramatically reduced the frequency of data loss. We
initially chose 0 because MariaDB described it as “a safer,
recommended option” in the documentation on Configuring
MariaDB Galera Cluster:
innodb_flush_log_at_trx_commit=0— This is not usually recommended in the case of standard MariaDB. However, it is a safer, recommended option with Galera Cluster, since inconsistencies can always be fixed by recovering from another node.
This works when failures are uncoordinated, but coordinated failures do sometimes happen! Flooding, lightning, cooling, network bugs, and other failures can cause all nodes in a cluster to fail in rapid succession, and when this occurs, unsynced data can be lost. Jepsen reported this issue as MDEV-38974.
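For operators who prefer durability over speed, the safer setting goes in the server configuration. An illustrative fragment (the file path and section placement vary by installation):

```ini
# /etc/mysql/mariadb.conf.d/galera.cnf (path is illustrative)
[mysqld]
# Flush and sync the InnoDB log at every commit, so committed
# transactions survive a coordinated crash of all nodes.
innodb_flush_log_at_trx_commit = 1
```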
More Write Loss (MDEV-38976)
Setting innodb_flush_log_at_trx_commit=1 significantly
reduced data loss, but did not eliminate it. Infrequently, MariaDB
Galera Cluster lost the effects of committed transactions when tests
involved process crashes and network partitions. For example, at roughly
141 seconds into this
test run, the cluster lost approximately nineteen seconds of writes
across four separate objects: 0, 285,
410, and 446. Some, like key 0,
lost only a short postfix of elements. Key 410, on the
other hand, lost all twenty-five elements and began afresh:
| Time (s) | Elements |
|---|---|
| 141.36 | 17, 19, 26, …, 91, 92, 97 |
| 152.79 | 175 |
| 153.21 | 175, 176, 177, 179 |
| 154.46 | 175, 176, 177, 179, 180 |
Note that the transactions which wrote 17,
19, and so on were successfully committed; their effects
definitely should not have been lost. This issue appeared only once
every few hours of testing, and seems unlikely to affect production
users. Nevertheless the loss of committed writes is concerning, and
Jepsen reported this to MariaDB as MDEV-38976.
Lost Update (MDEV-38977)
Even when write loss did not occur, Galera Cluster allowed P4 (Lost Update) and other forms of G-single. These anomalies occurred even in healthy clusters, without faults. For example, consider this test run, which contained the following pair of transactions:
The top transaction read key 468 and found nothing, then
appended 3 to it. The bottom transaction appended
6 to key 468. However, later reads of key
468 all found values beginning with
[6, 3, ...], which implies that the bottom transaction
(apparently) modified key 468 between the top transaction’s
read and write of it. This is a straightforward example of Lost Update,
from Berenson et al.’s paper
defining Snapshot Isolation:
P4 (Lost Update): The lost update anomaly occurs when transaction T1 reads a data item and then T2 updates the data item (possibly based on a previous read), then T1 (based on its earlier read value) updates the data item and commits. In terms of histories, this is:
P4: r1[x]…w2[x]…w1[x]…c1.
The problem … is that even if T2 commits, T2’s update will be lost.
P4 violates Snapshot Isolation. Since all operations here involved access by primary key, rather than predicates, this cycle is also G2-item: a violation of Repeatable Read.
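The P4 pattern above can be checked mechanically. This Python sketch is a toy detector over simplified histories (an invented representation; it ignores commit events and aborted transactions):

```python
# Toy detector for the P4 (Lost Update) pattern: r1[x] ... w2[x] ... w1[x].
# A history is a list of (txn, op, key) events in apparent order.

def has_p4(history):
    reads = {}   # (txn, key) -> index of the transaction's first read
    writes = []  # (index, txn, key)
    for i, (txn, op, key) in enumerate(history):
        if op == "r":
            reads.setdefault((txn, key), i)
        elif op == "w":
            writes.append((i, txn, key))
    for (t1, key), ri in reads.items():
        own = [i for i, t, k in writes if t == t1 and k == key and i > ri]
        other = [i for i, t, k in writes if t != t1 and k == key]
        # Lost update: someone else wrote the key between T1's read and write.
        if any(ri < oi < wi for wi in own for oi in other):
            return True
    return False
```

In the transactions above, the top transaction read key 468, the bottom transaction wrote it, and then the top transaction wrote it and committed: exactly this pattern. Under a register (overwrite) model the bottom transaction's update would be destroyed outright; with appends, the interleaving remains visible to Elle.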
We also observed more complex cycles involving multiple keys, or more than two transactions, all of which had a single read-write dependency edge. These cycles are examples of G-single, which is a more general violation of Snapshot Isolation. They similarly violate Repeatable Read. Jepsen reported this issue as MDEV-38977.
Stale Read (MDEV-38999)
Finally, under normal operation, MariaDB Galera Cluster occasionally allowed Stale Reads: a transaction could commit, be acknowledged as successfully completed to the client, then a second transaction could begin and fail to observe the first transaction’s writes. For example, take this test run, which contained the following pair of transactions:
The top transaction appended 9 to key 17693, then committed and was acknowledged to the client. The bottom transaction began after that acknowledgement, hence the real-time (rt) dependency edge from top to bottom. However, the bottom transaction read key 17693, and failed to observe the top transaction’s append of 9; hence the read-write (rw) dependency. This is a stale read, which is inconsistent with Galera Cluster’s claims of instant, lag-free replication.
This behavior occurred every few minutes in our testing, even without fault injection. Jepsen reported this issue to MariaDB as MDEV-38999.
| № | Summary | Event Required | Fixed in |
|---|---|---|---|
| MDEV-38974 | Loss of committed writes | Coordinated process crashes | Unresolved |
| MDEV-38976 | Loss of committed writes | Process crashes and network partitions | Unresolved |
| MDEV-38977 | Lost Update | None | Unresolved |
| MDEV-38999 | Stale Read | None | Unresolved |
Discussion
MariaDB Galera Cluster claimed to offer an isolation level “between Serializable and Repeatable Read”, and that transactions were “instantly replicated to all other nodes, ensuring no replica lag and no lost transactions”. However, when configured with MariaDB’s recommended settings, it lost committed transactions when multiple nodes failed in rapid succession. It also occasionally lost committed transactions under process crashes and network partitions. Even in healthy clusters, MariaDB Galera Cluster exhibited Lost Update and Stale Read; it provided neither Snapshot Isolation nor Repeatable Read, nor their stronger real-time variants. Indeed, the loss of committed transactions suggests MariaDB Galera Cluster was weaker than Read Uncommitted.
Users should set innodb_flush_log_at_trx_commit=1 to
reduce the probability of write loss on coordinated failure. MariaDB
should revise their documentation to make it clear that changing this
setting to 0 allows data loss in Galera Cluster.
Even with innodb_flush_log_at_trx_commit=1, users should
expect MariaDB Galera Cluster to lose committed writes when node
failures and network partitions occur. Thankfully, this behavior does
not appear to be common. It also exhibits Stale Read, Lost Update, and
other forms of G-single in healthy clusters, when no faults occur.
Transactions may (apparently) modify data in the interval between a
single transaction’s reads and writes; read-modify-write patterns, like
those used in many ORMs, are likely unsafe. Users should also assume
that committed transactions may not be visible to later
transactions.
MariaDB’s documentation makes it difficult to tell what consistency models Galera Cluster supports. It seems likely that Galera Cluster is supposed to provide Strong Snapshot Isolation or Strong Repeatable Read, but in practice, it appears weaker than Read Uncommitted. We suggest MariaDB update the documentation to make it clear what consistency models Galera Cluster is intended to (and actually does) provide.
These results are from a brief exploration of Galera Cluster—there may be other behaviors not documented here. As always, Jepsen takes an experimental approach to safety verification: we can prove the presence of bugs, but not their absence. While we make extensive efforts to find problems, we cannot prove correctness.
Future Work
While our tests used CONCAT to append to strings, it
seems likely that MariaDB Galera Cluster would also exhibit Lost Update
with blind writes to registers, and would therefore fail the simulated banking
workload used in earlier Jepsen tests—money could be destroyed or
created out of thin air. We have also not explored predicates, slow
networks, clock skew, or disk faults; all might prove fruitful avenues
for future research.
Jepsen wishes to thank Gordan Bobic and Teemu Ollakka from the MariaDB mailing list. Our thanks to Irene Kannyo for her editorial support. This research was performed independently by Jepsen, without compensation, and conducted in accordance with the Jepsen ethics policy.
Quorums are determined by tunable weights.↩︎
… in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying “Beware of the Leopard”.↩︎
Technically, we infer a mostly complete prefix of the version order, such that any “real” version order must be compatible with it. For more details, see the Elle paper.↩︎