Byzantine Fault Tolerance – Distributed Agreement

Lately, I’ve been thinking a lot about Byzantine agreement protocols. The Byzantine generals is a famously unsolvable problem in computer science. Except now we have ways to solve it, and that might mean we need to rethink some things.

The Byzantine generals problem

Imagine there are two generals, each of which commands an army on the top of a different hill. One of the two generals realizes an opportunity for a combined attack — if both generals attack together, they will prevail. However, if either of them attacks alone, they will lose.

So the general has an idea, he will send a messenger to deliver a message to the other general, “we both attack, tomorrow at dawn”. But what if the message doesn’t get through? If he attacks, he will fail horribly if the messenger is captured or killed.

He could instruct his messenger to deliver the message and then return with confirmation. But what if the messenger is captured on his way back? Then the other general will attack alone.

It is not that difficult to see that this problem cannot be solved. Any scheme you try to design will always contain at least one point at which one general is committed to attack even if no further messages go through but the other is not — if no further messages go through after that point, one will attack and the other will not.

Why this matters

A significant number of fairly routine things that you might think were easy to accomplish actually require solving the Byzantine generals problem.

For example, imagine coordinating a bank’s accounting system with an ATM. We want the bank to record a debit to the account if and only if the ATM successfully dispenses cash. This is a Byzantine generals problem.

Or imagine shutting down a TCP connection. We might want to have the property that one end will see no errors in the shut down of the connection if and only if the other end also sees no errors. That too is a Byzantine generals problem.

A significant number of protocols and systems were designed with the understanding that the Byzantine generals problem cannot be solved. And if we had an off-the-shelf solution to this problem, they might have been designed differently.

The solution

As I said, the Byzantine generals problem is provably unsolvable. But that doesn’t mean that we can’t do anything at all about it.

Let’s frame a simple example of the problem: We have two machines. One of those machines wants to deliver a message to the other machine. It wants to consider the message delivered if and only if the other machine has accepted custody of the message. And the other machine wants to accept custody if and only if the first machine understands that is has accepted custody.

The initial part is easy. The first machine sends the message to the second machine and the second machine tells the first machine that it is willing to accept custody of the message. But now the two machines need a way to agree that the first saw the second’s agreement and the second saw the first’s acknowledgement of that agreement, and so on.

To see that this is solvable, consider first the following scheme: The second computer indicates its willingness to accept custody of the message by posting an agreed upon Bitcoin transaction to the blockchain. The first computer considers the second computer to have taken custody of the message as soon as it sees the signed message get sufficient confirmations.

This solves the problem given a few assumptions. If the Bitcoin network breaks somehow during this process, we do not have agreement until it begins working again. If the transaction, for some reason, does not ever get confirmed, we will not have reliable agreement. Note that if the second computer loses connectivity as it broadcasts its transactions, it may not know what to do until connectivity is regained.

But we can easily fix most of these things. For example, we can build a network of nodes that run a Byzantine agreement protocol (like PBFT) and simply ask that network to perform an agreement operation each time we need such an agreement. We can build such networks using any number of nodes whose sole purpose is to provide “consensus as a service”, operated by reliable operators whose business revolves around their services being functional. The cost to provide a single agreement is absurdly low, on the same order of magnitude as responding to a DNS query.

So what?

Imagine if we had a standard for forming such networks, discovering them, and using them to reach agreement. We could easily drop Byzantine agreement into any Internet-connected protocol or system any place we needed it.

What would that be good for? Stay tuned…

Author: JoelKatz

CTO at Ripple and one of the original architects of the XRP Ledger. Known in many online communities as "JoelKatz". View all posts by JoelKatz

2 thoughts on “Byzantine Fault Tolerance”

Brett C. says:

July 22, 2016 at 11:50 pm

Still excited to see the follow-up. Just dropping a note to remind you you’re not blogging in a vacuum.

Jonathon Supper says:

August 23, 2016 at 3:31 pm

Good read man! I think i have solutions for Byzantine problems, i hope to prove and release one day, thanks for getting the word out there 😀