It seems that many readers overlook the checkpoint mechanism described in the PBFT paper. In this blog, I will share my thoughts about this mechanism. Let us start with a problem.
I. A Problem: an ever-increasing number of messages
As we discussed in the last blog, when the view is switched, each replica (including the leader) has to judge whether an index \(i\) has been committed in the old view. In some blockchain-style consensus protocols (e.g., Tendermint and HotStuff), an index considered committed is also said to be locked. To make such judgements, some consensus algorithms rely only on their local status (local judgement for short), while others rely on opinions collected from other replicas (interactive judgement for short).
To carry out a local judgement, a replica has to maintain the messages it has collected for each index \(i\). To carry out an interactive judgement, a replica has to not only maintain the messages collected for each index \(i\), but also broadcast them when the view is changed. PBFT takes the route of interactive judgements.
As time goes by, \(i\) keeps increasing, so the messages each replica has to maintain (the "storage overhead") keep growing. When the view is changed, the broadcasting (the "communication overhead") and reprocessing (the "computation overhead") of messages for old indexes also keep growing.
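To make the growth concrete, here is a minimal sketch (the class and method names are mine, not PBFT's): without checkpoints, a replica's per-index message log only ever grows, and all of it has to be rebroadcast and re-judged at a view change.

```python
from collections import defaultdict

class NaiveReplica:
    """Sketch of a replica that never discards anything (no checkpoints)."""

    def __init__(self):
        # Every PREPARE/COMMIT-style message ever collected, keyed by index i.
        self.log = defaultdict(list)

    def receive(self, index, msg):
        self.log[index].append(msg)

    def on_view_change(self):
        # Interactive judgement: everything collected so far gets rebroadcast.
        return [m for msgs in self.log.values() for m in msgs]

rep = NaiveReplica()
for i in range(1, 1001):             # indexes keep increasing...
    for sender in range(4):          # ...and so does the stored message set
        rep.receive(i, ("msg", sender, i))
print(len(rep.on_view_change()))     # 4000 messages to rebroadcast
```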
Therefore, a natural problem in designing PBFT is: how can we reduce the storage overhead in a replica and the communication/computation overhead during a view change?
II. How does PBFT deal with this problem?
To deal with it, PBFT introduces three notions: checkpoint, stable checkpoint, and low-water mark. A replica produces a checkpoint every time \(k\) (e.g., 100) more requests have been executed. Once a replica generates a checkpoint for an index \(i\), it broadcasts a CKP message in the format {CKP, \(i\)}. If a replica collects matching CKP messages for the same index \(i\) from 2/3+ of the replicas (i.e., \(2f+1\) out of \(3f+1\)), the corresponding checkpoint becomes a stable checkpoint. The low-water mark is the index \(i\) of the latest stable checkpoint.
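Here is a minimal sketch of this bookkeeping (the name `CheckpointManager` and the `broadcast` callback are mine, not the paper's), assuming \(n = 3f+1\) replicas so that 2/3+ corresponds to \(2f+1\):

```python
from collections import defaultdict

class CheckpointManager:
    """Sketch of checkpoint / stable checkpoint / low-water mark bookkeeping."""

    def __init__(self, f, k=100):
        self.f, self.k = f, k
        self.ckp_votes = defaultdict(set)   # index i -> replicas that sent {CKP, i}
        self.low_water_mark = 0             # index of the latest stable checkpoint

    def on_execute(self, index, broadcast):
        # Produce a checkpoint every k executed requests and announce it.
        if index % self.k == 0:
            broadcast(("CKP", index))

    def on_ckp(self, index, sender):
        # Collect {CKP, i}; with 2f+1 matching messages the checkpoint is stable.
        self.ckp_votes[index].add(sender)
        if len(self.ckp_votes[index]) >= 2 * self.f + 1 and index > self.low_water_mark:
            self.low_water_mark = index     # advance the low-water mark

mgr = CheckpointManager(f=1, k=100)         # n = 3f + 1 = 4 replicas
mgr.on_execute(100, broadcast=print)        # would broadcast ('CKP', 100)
for sender in range(3):                     # 2f + 1 = 3 matching CKP messages
    mgr.on_ckp(100, sender)
print(mgr.low_water_mark)                   # 100
```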
A. How to reduce overhead?
PBFT allows a replica to discard all the messages with indexes lower than the low-water mark (storage discarding), thus reducing the storage overhead. This mechanism is named "garbage collection" in the PBFT paper. Besides, during a view change PBFT only requires a replica to broadcast messages with indexes higher than the low-water mark, thus reducing the communication overhead. Likewise, a replica only needs to reprocess (make judgements for) indexes higher than the low-water mark, which reduces the computation overhead.
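In terms of the sketch above, garbage collection and the trimmed view-change payload could look like this (again only a sketch; the function names are mine):

```python
def garbage_collect(log, low_water_mark):
    # Storage discarding: drop all messages with indexes lower than the low-water mark.
    return {i: msgs for i, msgs in log.items() if i >= low_water_mark}

def view_change_payload(log, low_water_mark):
    # Only indexes higher than the low-water mark are rebroadcast and re-judged.
    return {i: msgs for i, msgs in log.items() if i > low_water_mark}

log = {99: ["..."], 100: ["..."], 101: ["..."]}
log = garbage_collect(log, low_water_mark=100)       # keeps indexes 100 and 101
print(view_change_payload(log, low_water_mark=100))  # only index 101 is broadcast
```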
Further, during the process of view change, each replica sends its latest stable checkpoint to the leader. The leader then packages all these stable checkpoints and broadcasts them in a new-view message. If a replica finds a stable checkpoint in the new-view message with a higher index than its own, it will try to synchronize the state from a replica possessing that stable checkpoint. The corresponding contents can be found in Section 4.4 (View Changes) of the PBFT paper.
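A rough sketch of this exchange (simplified: a "stable checkpoint" here is just an index plus its CKP proof, and the message formats and function names are mine, not the paper's):

```python
def build_new_view(view_change_msgs):
    # The leader packages the stable checkpoints reported by the replicas.
    checkpoints = [(m["stable_ckp_index"], m["ckp_proof"]) for m in view_change_msgs]
    return {"checkpoints": checkpoints}

def on_new_view(new_view, my_stable_index, fetch_state):
    # If someone proves a stable checkpoint newer than mine, sync state up to it.
    best_index, best_proof = max(new_view["checkpoints"], key=lambda c: c[0])
    if best_index > my_stable_index:
        fetch_state(best_index, best_proof)   # state transfer from a replica that has it
        my_stable_index = best_index
    return my_stable_index

vc_msgs = [{"stable_ckp_index": 200, "ckp_proof": "..."},
           {"stable_ckp_index": 100, "ckp_proof": "..."}]
new_view = build_new_view(vc_msgs)
on_new_view(new_view, my_stable_index=100,
            fetch_state=lambda i, proof: print("sync state up to index", i))
```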
B. Will the above mechanism compromise safety or liveness?
1. As for safety
Almost all existing leader-based consensus algorithms adopt a time-out mechanism: if a replica does not receive the required messages in time, it triggers a time-out and strives to change the leader/view. Under such a mechanism, from a replica's point of view, the situation where some correct replicas are silent is indistinguishable from the situation where the messages sent by these correct replicas are merely very slow. A correct consensus algorithm guarantees safety in the latter situation, so it also guarantees safety in the former.
Therefore, we can draw a conclusion: a correct consensus algorithm can always guarantee safety even if some correct replicas are silent.
Now consider that a replica \(rep\) discards the storage for an index \(i\) once it has executed/committed \(i\). Afterwards, \(rep\) will be silent to messages for \(i\) from other replicas. By the above conclusion, this does not compromise the safety property.
Therefore, the stable checkpoint and storage-discarding mechanisms of PBFT will not compromise safety.
2. As for liveness
However, it seems that the storage-discarding mechanism of PBFT could possibly compromise liveness. The reason is that if a replica \(rep\) discards the storage for an index \(i\), it can no longer respond to messages for \(i\) from other replicas, while some of those replicas may not have committed \(i\) yet and need the messages from \(rep\) to finish the commitment.
We can take a specific example to demonstrate the liveness problem as follows (a concrete instance with \(f = 1\) is spelled out after the list):
- Suppose the commitment of a request \(r\) at index \(i\) is finished by exactly \(f+1\) correct replicas and \(f\) Byzantine replicas (together a quorum of \(2f+1\)), with the remaining \(f\) correct replicas being unaware of \(r\);
- After committing and executing \(r\), the \(f+1\) correct replicas discard their storage for \(i\). When they become leaders later, they no longer know \(r\) and cannot propose \(r\) at \(i\);
- The Byzantine replicas will decline to propose \(r\) at \(i\);
- The other \(f\) correct replicas, being unaware of \(r\), cannot propose \(r\) at \(i\) either;
- Consequently, those \(f\) correct replicas will never commit \(r\) at \(i\), thus violating liveness.
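For concreteness (this instantiation is mine, not from the paper): take \(n = 3f+1 = 4\) replicas \(A, B, C, D\) with \(f = 1\), where \(D\) is Byzantine. Suppose the commit quorum of \(2f+1 = 3\) for \(r\) at \(i\) is exactly \(\{A, B, D\}\), so the correct replica \(C\) never learns \(r\). If \(A\) and \(B\) discard their storage for \(i\) after executing \(r\), then no correct replica can later supply or propose \(r\) at \(i\): \(A\) and \(B\) have forgotten it, \(C\) never saw it, and \(D\) may simply refuse to help. \(C\) is then stuck at index \(i\) forever.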
A simple idea to settle the above problem is to synchronize the state from other replicas. However, a new question arises: how can a replica guarantee that the state obtained from a remote replica is correct?
This is exactly what the stable checkpoint mechanism in PBFT aims to do. A stable checkpoint must be accompanied by a proof consisting of {CKP, \(i\)} messages from 2/3+ of the replicas. Such a proof guarantees the correctness of the state.
However, in my opinion, it seems that {CKP, \(i\)} messages from 1/3+ of the replicas (i.e., \(f+1\), which already guarantees that at least one correct replica vouches for the state) rather than 2/3+ would be enough to form a proof for a stable checkpoint.
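A sketch of the proof check, with the quorum size left as a parameter (the function is mine, and signature verification is assumed to have happened elsewhere); PBFT uses \(2f+1\), while the observation above argues \(f+1\) would already suffice for this particular proof:

```python
def verify_checkpoint_proof(index, ckp_msgs, quorum):
    # ckp_msgs: (sender, checkpoint_index) pairs, assumed already signature-checked.
    senders = {s for s, i in ckp_msgs if i == index}   # distinct replicas vouching for `index`
    return len(senders) >= quorum

f = 1                                                  # n = 3f + 1 = 4 replicas
msgs = [(0, 100), (1, 100), (3, 100)]                  # three replicas vouch for index 100
print(verify_checkpoint_proof(100, msgs, quorum=2*f + 1))  # True: PBFT's 2/3+ threshold
print(verify_checkpoint_proof(100, msgs, quorum=f + 1))    # True: the f+1 threshold argued above
```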
References
- Castro, M., & Liskov, B. "Practical Byzantine fault tolerance." In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation (OSDI'99), pp. 173-186, 1999.
- Buchman, E. "Tendermint: Byzantine fault tolerance in the age of blockchains." Master's thesis, University of Guelph, 2016.
- Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., & Abraham, I. "HotStuff: BFT consensus with linearity and responsiveness." In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing (PODC'19), pp. 347-356.