P.S. Medium is the most terrible platform for math — If your eyes are as scalded by LaTeXIT as mine are, there is a link to a Dropbox Paper version of this article at the end (but please upvote here if you liked this article!). I was just committed to finishing this series on the platform that I started on :)
In the first two parts of this series, we explored the interplay between the probabilistic nature of distributed block data structures (DBDS) and how one has to modify the traditional analysis of Byzantine Fault Tolerant (BFT) systems for DBDS. Recall that our goal with this series is to demonstrate that there are a small number of common security primitives that underlie most (if not all) distributed consensus algorithms and DBDS. We first evinced this by discussing the seeming universality of analysis techniques for DBDS and distributed consensus protocols, most of which appear to do the following: list a set of axioms, construct an environment for a game to be played between honest agents and a malicious adversary, and then provide probabilistic bounds of the adversary’s success in this game. We also found that the set of axioms that are desired by these systems often boil down to two things:
- Liveness: The ledger continues to grow and accepts new transactions or state transitions
- Persistence/Safety: Honest transactions eventually make it into the ledger and cannot be removed (without gargantuan expense) after a sufficient amount of time
Given this level of similarity between the analyses of seemingly distinct protocols, it seems obvious to ask if there is a way to describe a single framework that can encompass all known distributed consensus methodologies. In this article and the sequel, we will look at two vignettes that suggest this might be possible:
- [Part 3] Avalanche can be viewed as a randomized realization of the Stellar Consensus Protocol
- [Part 4] Reductions to “ideal functionality” work for asynchronous models of Bitcoin, Oasis Labs’ Ekiden, and Zexe
But how does a generic framework for analysis relate to what makes blockchains secure? While a general framework for evaluating security is nice, it doesn’t quite provide insight into security itself. What I hope to convey is that by generalizing security (e.g. figuring out which pieces of security are common to all protocols), one can get a better understanding of what the minimal necessary & sufficient conditions for a secure blockchain look like. Moreover, the fact that certain protocols can be implemented with the primatives of other protocols suggests that there is some underlying commonality to the more specialized protocol. In this vein (and unlike protocols discussed in the previous articles), the Stellar Consensus Protocol (SCP) provides a general framework as opposed to a concrete implementation — we will first dive into this protocol and illustrate that it is a general framework for a variety of consensus mechanisms. Next, we will give an overview of Avalanche, describe how it can be implemented as an instance of the SCP, and provide a sketch of a proof and numerical evidence for this. We should note that this will implicitly endow Avalanche with much better security properties than those proved in the original paper. Finally, we will conclude by looking forward to ideal functionality and how probabilistic equivalences between abstract state machines (such as the equivalence between Avalanche and an instance of the SCP) provide a way for unifying the analysis of consensus and to bootstrap strong guarantees from weak ones.
The Stellar Consensus Protocol was written by David Mazières, a professor at Stanford and Chief Scientist at the Stellar Development Foundation (which maintains the client code for the Stellar network). The SCP is implemented in stellar-core and in my opinion, is one of the best written pieces of blockchain client code that exists within the ecosystem. Mazières’s previous work includes Kademlia, a peer-to-peer distributed hash table (and mesh networking algorithm) and limits on the trade-offs between safety and liveness in BFT systems. Given this background, it is unsurprising that this work focuses on a protocol that has BFT-like guarantees with blockchain-esque communication complexity. In particular, the protocol tries to toe the line between a fully permissioned BFT system and a permissionless blockchain system. It does this by removing the all-to-all validity set of traditional BFT and replaces it with a localized set of “trusted nodes” that are used for validation. Let’s try to describe this visually:
At a high-level, the SCP achieves this by having users choose subsets of users that they trust and a user only amends a transaction if it gets unanimous agreement among these subsets. Why does this provide an improvement?
Recall that in traditional BFT, we require at least 2/3 of validators to agree before a transaction can be amended. In the first drawing, we see a network of five nodes that communicate to each other via the graph in the drawing. In order for any node to decide whether or not to add a transaction to it’s local copy of a ledger, it must receive valid messages from all other nodes (e.g. at least 2/3 of nodes). This means that in order for consensus to be reached, there needs to be a network path from any pair of nodes, which implies that the communication cost is quadratic in the number of nodes [0]. Instead of focusing on the explicit communication network, as BFT protocols do, the SCP focuses instead on what Mazières terms a quorum slice [1]. Intuitively, a quorum slice is a set of subsets of nodes that we need to reach agreement with before we can amend a transaction to our local ledger. Importantly, this subset does not need to be large (relative to the number of nodes). Moreover, a single node usually participates in multiple quorum slices and this is specified via a quorum selection function:
for a node set N. For instance, in the lower portion of the diagram, we see that for node 1, there are two subsets that it can reach consensus with: {1,3,5} and {1,2,5}. The SCP protocol provides an algorithm for a node to only have to reach consensus with one element of its quorum slice. This means that node 1 only needs to reach consensus with {1,3,5} or {1,2,5} to make progress within its own ledger. The idea is that if the size of the set of subsets we have to reach consensus with is sublinear, then we can reach consensus while having subquadratic communication complexity. This seems like a free lunch!
Before continuing onto some of the other assumptions and constraints that are necessary to execute the SCP, let’s first look at how traditional BFT can be represented within this framework (and is a subset of the SCP). Suppose that we have a BFT protocol on a node set N with |N| = n. For each element n of N, define the following set of subsets:
These sets are the smallest sets that n can reach BFT consensus with, allowing it to add a transaction to its local copy of a ledger. Then, any valid quorum selection function Q_{BFT} , must satisfy the following constraint:
Less formally, this says that in order for a node n in N to reach consensus, it must have at least reached consensus with a set of size at least 2n/3 in order for it to add a transaction to its ledger.
At this point, a reader versed in distributed systems theory should be skeptical of the idea that a single node need only agree with a sublinear number of people in order to reach consensus. Indeed, there is no free lunch here — instead of explicitly quantifying the number of nodes that we need to agree with to reach consensus, the SCP instead provides necessary and sufficient conditions on the set of quorum slices such that consensus can be achieved. These conditions are topological in nature and correspond to connectivity properties of quorum slices. We will present visualizations of these conditions. Firstly, let’s consider the concept of a quorum, which is a set of nodes and quorum slices that have a sort of “transitive closure” property. In a sense, this means that U is a self-consistent set — the nodes do not need input from any other nodes to reach consensus. Here’s a picture to illustrate this:
In the left panel, we see three nodes and a set of quorum slices such that each slice is contained within the set of three nodes. On the right, we see a set of five nodes such that the same set, {1,2,3} is not self-consistent. We need to add nodes 4 and 5 in order to create a quorum. Consensus protocols all specify (sometime implicitly) a minimum quorum size for consensus [e.g. 2n/3 for BFT and n/2 for Nakamoto]. However, the key to SCP is that each node chooses its own notion of ‘size needed to reach consensus’. Moreover, since each node can be contained within multiple quorum slices, there is also redundancy — we don’t rely on a single set of nodes for agreement, but rather we rely on just achieving consensus with one. This gives us two parameters to play with: average size of a quorum slice and the number of slices associate to a single node. In BFT and Nakamoto, both of these parameters are fixed; the SCP provides minimal conditions for reaching consensus so that a protocol developer can tune these two parameters.
The topological conditions that the SCP provides are reminiscent of definitions in elementary point-set topology. In particular, the most important property that is required for consensus is called the quorum intersection property (QIP), which says that any two quora have non-empty intersection (reminiscent of the finite intersection property of compactness, no?). This strong condition implies that one can get from any node to another node via quora. If we rephrase this in human terms and let sets of friends represent a quorum, it would mean that if I agree with my friends and one of my friends agrees with your friends, then I will agree with your friends. The beauty of this type of formalism is that it provides a mechanism for agreement to propagate between nodes, regardless of the underlying network graph. It also formalizes how nodes build in redundancy to avoid Eclipse attacks without resorting to describing the explicit peer-to-peer network graph. With this framework, one can reason about security assumptions, such as Algorand’s liveness assumption, in a much cleaner manner as the quora size distribution and overlap controls security. However, there is still no free lunch — we have to construct a quorum selection function that satisfies all of these constraints in order for the protocol to be useful. This has been the main criticism of the protocol, as it is seems extremely hard to construct such a quorum selection function (indeed, doing it algorithmically might be #P-complete)
Finally, we can speak to the protocol itself: the SCP is a modified version of the “single decree synod” of Paxos and provides a way for transactions to be nominated, tentatively accepted, and then finalized via local agreement with one’s quora. The main assumption needed to ensure that this modified Paxos algorithm achieves liveness and persistence is that the set of quora still satisfies the QIP when we remove the set of Byzantine nodes. There are a variety of details in the paper about the precise state transitions that take place within the system — the only thing we will need to know is that Avalanche (defined below) also has a notion of a bivalent state, which is when a network partition causes part of the network to believe that statement a is true and another part to believe a is false. Avalanche and Stellar both provide explicit solutions for the network to regain consensus, even if the network is trapped within a bivalent state. Now onto Avalanche!
Avalanche is a clever reuse of the heavy-hitters algorithm (e.g. generalization of cardinality estimation algorithms like HyperLogLog) to generate a probabilistic consensus algorithm [2]. The core consensus idea is very simple and has far less complexity than the SCP when trying to agree on whether a statement A is either true or false (represented as nA):
- Initialize a table that stores your confidence in either A or nA
- Repeatedly query k other users for their decision on A or nA
- Increment your confidence in A, nA when a k-sample votes for A, nA (respectively)
- Once you have overwhelming confidence in either A or nA, accept that choice
This methodology achieves consensus by relying on ‘metastability’ — once you get enough samples, you are ‘forced’ into a decision of truthiness, just as you are forced into a minima of a potential energy function when you are at a metastable critical point (e.g. ball 2) [3]:
Although Avalanche cites the SCP paper, they do not mention that metastability is exactly the same as bivalency in Stellar.
To go from consensus to a protocol, Avalanche stores transactions in a DAG and decides on whether a new path in the DAG is the one accepted by most participants via this consensus mechanism. If we ignore a few technical details, this is starting to look reminiscent of the SCP:
- We have randomized quorum slices (e.g. our queries to k other participants)
- Each query to k participants corresponds to a new quorum slices
- With enough samples from the set of k-element subsets, we get close to covering the whole space which should generate quora (with high probability)
- At some point, we should have a high-enough density of quora that we satisfy the QIP with high probability
Let’s enumerate the set of technical hurdles that one needs to surmount in order to make the above analogy exact:
- We need to prove that Avalanche’s sampling methodology generates quora with high probability in a sample number of samples
- We need to prove that Avalanche’s sampling methodology yields a quorum selection function that satisfies the QIP with high probability
- We need to ensure that the liveness and persistence guarantees of the SCP hold when one has a dynamic quorum selection function
For the first two pieces, I’m going to provide a sketch of a proof that gives the techniques needed to prove that quora are generated and the QIP is satisfied. The main source of randomness within Avalanche’s consensus mechanism is the hitting time — how many k-samples do I have to take before I have sufficient confidence to accept/reject a statement A? We will model this via a simplification called Poissonization that is often used in distribution testing [4]. Here is our generative model for creating the Avalanche quroum selection function Q_{ava}, where λ is the expected number of samples that we have to take [5]:
The Poissonization allows for us to more easily compute expectations related to Quora. For instance, it is an elementary calculation to show that for a set U ⊂[n], we have:
where the last step used:
and (n)_k is the falling Pochhammer symbol. As even this simple calculation shows, there is inherently a tradeoff in quorum probability between the size of a quorum and the expected number of samples (represented via λ). The exact relationship is unknown, but the functional form is reminiscent of critical percolation — quora correspond to clusters within a percolation set up [6]. There is a sense in which the elements of a dense enough quorum slice form a random intersection graph for which we have the following subcritical and supercritical percolation results for a fixed number of samples [7]:
We need to do a little more work — adapt the Poisson process and use a martingale limit theorem to make sure that our Y \wedge 2^n construction isn’t causing any issues — but we basically have a result showing that we get quora with high probability and there is a sharp phase transition, if the above moment conditions are satisfied. For the second point regarding the QIP, it is quite unlikely that we cannot construct analytic bounds for the likelihood of the QIP being satisfied. However, just as one solves the famous ‘k rats to find m poisoned bottles’ interview question, we can use the Frankl-Füredi Theorem for union-free sets to provide explicit lower bounds on the minimum size of a set of quora that satisfy the QIP. Once we have a handle on the size distribution of quora when Y is in the tail of Poisson(λ), we can combine these two results to get bounds on the probability of the QIP in the subcritical and supercritical regimes.
Thus, we only need to show that the results of the SCP hold for randomized quorum selection functions in order to show that Avalanche is an instantiation of the SCP! We’re going to discuss one method that I think will work for this — using ideal functionality. My understanding of ideal functionality, at least in the blockchain world, is that it serves as a way to show that there is a type of probabilistic universality between different protocols. One should think of this as being akin to the Wigner or the Tracy-Widom laws of Random Matrix Theory — as long as our protocols satisfy some simple conditions (akin to moment conditions in probability), the difference between statistics for these protcools converges (perhaps weakly) in distribution to zero. This allows for us to create an equivalence relationship between protocols that are statistically similar and then transfer proofs from simpler protocols to more complicated protocols. For instance, Pass, Seeman, and Shelat use this methodology to transfer security proofs for Bitcoin in synchronous settings to Bitcoin in asynchronous settings. But more on this next time!
Even though the SCP seems to have been received with little fanfare — most articles that cite it are like Avalanche and just mention it as a competitor/alternative — I hope that I have illustrated that it is actually a useful consensus algorithm. The flexibility of the SCP lets a protocol designer interpolate between the extremes of a BFT protocol and a Nakamoto protocol via topological constraints (instead of constraints on the percentage of honest participants). It also lets one bootstrap the security of an algorithm (e.g. to give Avalanche more sharp bounds, as per the phase transition of the last section) from an abstract description. We have also seen that when we formulate Avalanche as an instantiation of the SCP, we can derive sharper concentration bounds for when consensus can be reached.
Hopefully, this has whet your appetite for understanding the final piece of the consensus classification puzzle — ideal functionality and it’s relation to universality classes.
I want to thank Uthsav Chitra, Arthur Breitman, Yi Sun, Rei Chiang, and Zaki Manian for helpful comments and suggestions.
[0] In the original Byzantine General Problem paper, the authors provide a more explicit necessary and sufficient condition on the network graph:
Note the intersection avoidance clause in the definition of regularity — this condition implicitly provides constraints on the graph’s conductance (via Cheeger’s Theorem). In particular, it can be shown that all cuts of a regular graph $$G = (V,E)$$ have size $$\Omega(|V|²)$$, which implies, via the max flow-min cut theorem, that the min flow has quadratic cost.
[1] Interestingly, quorum slices are simply hypergraph edges on the set of nodes. The quorum selection function just returns the set of hypergraph edges incident on a specific node.
[2] For some reason, people in the blockchain space do not seem to know much about randomized algorithms and view Avalanche as made up of ‘comprehensive low-level details and proofs that are not easily comprehended by non-academics.’ However, if you look at notes on the heavy hitters algorithm such as this one from an undergraduate class taught by Tim Roughgarden and Gregory Valiant, you will basically see the same base algorithms as Avalanche analyzed in far more rigorous detail. Moreover, people have proved much better hitting time bounds than what Sirer / Team Rocket prove in the original paper and these much better bounds have even gotten popular science press coverage!
[3] If you have seen Emin Gün Sirer talk about Avalanche, he will inevitably mention that it is a ‘physics-based’ approach. This is actually exactly codified in the picture above in that if you construct a Markov Chain (e.g. via the Metropolis Algorithm) to sample the metastable potential above, then you will have a similar hitting time result as Lemma 3 / Theorem 4 of the Avalanche paper.
[4] See this STOC 2011 paper by Valiant and Valiant for a good example
[5] This is, of course, assuming that the majority of Avalanche users are honest and there are not adversaries attempting to perform Eclipse-style attacks that bias our sampling distribution. It is possible to use a VRF/VDF to synchronize randomness, but this will likely destroy the claimed communication complexity benefits of Avalanche
[6] In order to get some numerical evidence for a phase transition, I made a plot of a slice of this function:
The heat value represents the value of $$|U|/n$$ where the probability is maximized (e.g. $$argmax_{|U|} Pr[U\text{ is a quorum}]$$) for $$n = 128$$. I wasn’t totally able to eliminate numerical artifacts in the falling Pochhammer computation (e.g. using ratios of $$\Gamma$$ functions, $$(n)_k = \frac{\Gamma(n)}{\Gamma(n-k-1)}$$), but I used the higher precision computation within scipy to estimate this [if you look there, you can see the tricks they do to get higher precision; note that the definition for scipy is the rising Pochhammer symbol, $$\mathsf{poch}(n, k) = \frac{\Gamma(n+k)}{\Gamma(n)}$$]. Note that the upper part of the y-axis corresponds to $$\log_{10} \lambda = -2$$
[7] These results stem from Theorems 5 and 6 of this paper; they adapt standard edge percolation results on Erdös-Renyi graphs by using Galton-Walton branching processes to generate spanning trees
Blog