
Thoughts About the Checkpoint Mechanism in PBFT

It seems that many readers overlook the checkpoint mechanism described in the PBFT paper. In this blog, I will share my thoughts about the checkpoint mechanism. Let us start with a problem.

I. A Problem: an ever-growing set of messages

As we discussed in the last blog, when the view is changed, each replica (including the leader) has to judge whether an index \(i\) has been committed in the old view. In some blockchain-style consensus protocols (e.g., Tendermint and HotStuff), an index considered committed is also said to be locked. To make such judgements, some consensus algorithms rely only on their local status (local judgement for short), while others rely on opinions collected from other replicas (interactive judgement for short).

To carry out local judgements, a replica has to maintain the messages collected for each index \(i\). To carry out interactive judgements, a replica has to not only maintain the messages collected for each index \(i\) but also broadcast them when the view is changed. PBFT takes the interactive-judgement approach.

As time goes by, \(i\) keeps increasing. The messages maintained by each replica (i.e., the "storage overhead") keep piling up, and when the view is changed, broadcasting (the "communication overhead") and reprocessing (the "computation overhead") the messages for old indexes become more and more expensive.
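
To make this growth concrete, here is a minimal sketch (the class and its names are illustrative, not from the paper) of the per-index message log a replica keeps for the judgements above; without any pruning, both the stored log and the payload rebroadcast at a view change grow without bound as \(i\) increases:

```python
from collections import defaultdict

class MessageLog:
    """Per-index log of consensus messages kept for view-change judgements."""

    def __init__(self):
        # index i -> list of (phase, view, sender, digest) tuples seen for i
        self.entries = defaultdict(list)

    def store(self, index, phase, view, sender, digest):
        # Every PRE-PREPARE/PREPARE/COMMIT observed for index i is retained.
        self.entries[index].append((phase, view, sender, digest))

    def view_change_payload(self):
        # Without garbage collection, the whole log would be rebroadcast and
        # reprocessed on a view change, so the cost keeps growing with i.
        return dict(self.entries)
```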

Therefore, a key problem in designing PBFT is: how to reduce the storage overhead in a replica and the communication/computation overhead in the process of view change?

II. How does PBFT deal with this problem?

To deal with it, PBFT introduces three notions: checkpoint, stable checkpoint, and low-water mark. A replica produces a checkpoint every time \(k\) (e.g., 100) more requests have been executed. Once a replica generates a checkpoint for an index \(i\), it broadcasts a CKP message in the format {CKP, \(i\)}. If a replica collects {CKP, \(i\)} messages from 2/3+ replicas for the same index \(i\), the corresponding checkpoint becomes a stable checkpoint. The low-water mark is equal to the index \(i\) of the latest stable checkpoint.
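
A rough sketch, under simplified assumptions (no state digests, no signatures, and the replica's own vote handled like any other), of how a replica could do this bookkeeping; the class and constant names (CheckpointTracker, CHECKPOINT_INTERVAL, QUORUM) are mine, not from the paper:

```python
from collections import defaultdict

CHECKPOINT_INTERVAL = 100          # "k" in the text
N = 4                              # total number of replicas, n = 3f + 1
QUORUM = 2 * ((N - 1) // 3) + 1    # 2f + 1, i.e. {CKP, i} from 2/3+ replicas

class CheckpointTracker:
    def __init__(self):
        self.ckp_votes = defaultdict(set)  # index i -> senders of {CKP, i}
        self.low_water_mark = 0

    def on_execute(self, index, broadcast):
        # After executing request `index`, produce a checkpoint and announce
        # it with a {CKP, i} message every CHECKPOINT_INTERVAL requests.
        if index % CHECKPOINT_INTERVAL == 0:
            broadcast(("CKP", index))

    def on_ckp(self, index, sender):
        # Collect {CKP, i}; once 2f + 1 replicas have sent it, the checkpoint
        # at i becomes stable and the low-water mark advances to i.
        self.ckp_votes[index].add(sender)
        if len(self.ckp_votes[index]) >= QUORUM and index > self.low_water_mark:
            self.low_water_mark = index
```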

A. How to reduce overhead?

PBFT allows a replica to discard all the messages with indexes at or below the low-water mark (storage discarding), thus reducing the storage overhead. This mechanism is named "garbage collection" in the PBFT paper. Besides, during a view change a replica only needs to broadcast messages with indexes higher than the low-water mark, which reduces the communication overhead, and it only needs to reprocess (make judgements for) indexes higher than the low-water mark, which reduces the computation overhead.
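
As a minimal sketch of the two savings, assuming the log is simply a dictionary keyed by index and the low-water mark is already known (the helper names garbage_collect and view_change_payload are illustrative):

```python
def garbage_collect(log, low_water_mark):
    # Storage saving: drop every stored message whose index is at or below
    # the low-water mark.
    for index in [i for i in log if i <= low_water_mark]:
        del log[index]

def view_change_payload(log, low_water_mark):
    # Communication/computation saving: only indexes above the low-water mark
    # are broadcast and reprocessed during a view change.
    return {i: msgs for i, msgs in log.items() if i > low_water_mark}
```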

Further, during the process of view change, each replica sends its latest stable checkpoint to the new leader as part of its view-change message, and the leader packages all these stable checkpoints into the new-view message and broadcasts it. If a replica finds a stable checkpoint in the new-view message with a higher index than its own, it will synchronize the state from a replica possessing that stable checkpoint. The corresponding details can be found in Section 4.4 (View Changes) of the PBFT paper.
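
The decision a lagging replica makes when it processes the new-view message might look roughly like the following; maybe_sync_state, new_view_stable_checkpoints, and fetch_state_from are hypothetical names, and verification of the attached proofs is omitted:

```python
def maybe_sync_state(my_stable_index, new_view_stable_checkpoints, fetch_state_from):
    # new_view_stable_checkpoints: replica id -> index of its latest stable
    # checkpoint, as packaged by the leader in the new-view message.
    best_replica, best_index = max(
        new_view_stable_checkpoints.items(), key=lambda kv: kv[1]
    )
    if best_index > my_stable_index:
        # We are behind: fetch the checkpointed state from a replica that has it.
        fetch_state_from(best_replica, best_index)
        return best_index
    return my_stable_index
```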

B. Will the above mechanism compromise safety or liveness?

1. As for safety

Almost all existing leader-based consensus algorithms adopt a time-out mechanism: if a replica does not receive a specific number of specific messages in time, it triggers a time-out and strives to change the leader/view. With the time-out mechanism, as far as a replica's actions are concerned, the situation where some correct replicas are silent is equivalent to the situation where the messages sent by these correct replicas are merely very slow. A correct consensus algorithm guarantees safety in the latter situation, so it also guarantees safety in the former.

Therefore, we can draw a conclusion: a correct consensus algorithm always guarantees safety even if some correct replicas are silent.

Now consider that a replica \(rep\) discards the storage for an index \(i\) once it has executed/committed \(i\). From then on, \(rep\) is silent to messages for \(i\) from other replicas. By the above conclusion, this does not compromise safety.

Therefore, the stable checkpoint and storage discarding mechanisms of PBFT will not compromise safety.

2. As for liveness

However, the storage discarding mechanism alone may compromise liveness. The reason is that if a replica \(rep\) discards the storage for an index \(i\), \(rep\) can no longer respond to messages for \(i\) from other replicas, while replicas that have not committed \(i\) may still need messages from \(rep\) to finish the commitment.

We can take a specific example to demonstrate the liveness problem, as follows:

  • Suppose the commitment of a request \(r\) at index \(i\) is finished by exactly \(f+1\) correct replicas and \(f\) Byzantine replicas, with the remaining \(f\) correct replicas being unaware of \(r\);
  • After \(r\) is committed and executed by the \(f+1\) correct replicas, they discard the storage for \(i\). When they become leaders later, they no longer know \(r\) and cannot propose \(r\) at \(i\);
  • The Byzantine replicas will decline to propose \(r\) at \(i\);
  • The other \(f\) correct replicas, being unaware of \(r\), cannot propose \(r\) at \(i\) either;
  • As a result, the other \(f\) correct replicas will never commit \(r\) at \(i\), thus violating liveness.
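
For concreteness, instantiate the example with \(f = 1\) and \(n = 3f + 1 = 4\): a commit quorum of \(2f + 1 = 3\) can consist of the \(2\) correct replicas plus the \(1\) Byzantine replica, leaving \(1\) correct replica unaware of \(r\). Once the two correct quorum members execute \(r\) and discard their storage for \(i\), no replica is both willing and able to re-propose \(r\) at \(i\), so the last correct replica can never commit it.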

A simple idea to settle the above problem is to synchronize the state from other replicas. However, a new question arises: how can we guarantee that the state held by a remote replica is correct?

This is exactly what the stable checkpoint mechanism in PBFT aims to do. A stable checkpoint must be accompanied by a proof consisting of {CKP, \(i\)} messages from 2/3+ replicas. This proof guarantees the correctness of the state.
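
A small sketch of checking such a proof, assuming the proof is just a list of (sender, index) pairs whose signatures have already been verified (the function name is illustrative):

```python
def is_valid_stable_checkpoint_proof(proof, index, n):
    # The proof is valid if {CKP, index} comes from more than 2/3 of the n
    # replicas, i.e. at least 2f + 1 distinct senders with n = 3f + 1.
    f = (n - 1) // 3
    senders = {sender for sender, i in proof if i == index}
    return len(senders) >= 2 * f + 1
```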

However, in my opinion, it seems that {CKP, \(i\)} messages from 1/3+ rather than 2/3+ replicas would be enough to generate a proof for a stable checkpoint: any set of 1/3+ (i.e., at least \(f+1\)) replicas contains at least one correct replica, and that correct replica vouches for the state.

References

  1. Castro, M., & Liskov, B. "Practical byzantine fault tolerance." In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation (OSDI'99), pp. 173-186.
  2. Buchman, E. "Tendermint: Byzantine fault tolerance in the age of blockchains." Master's thesis, 2016.
  3. Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., & Abraham, I. "HotStuff: BFT consensus with linearity and responsiveness." In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing (PODC'19), pp. 347-356.