a16z: Fundamental Principles for Evaluating Blockchain Performance
Written by: Joseph Bonneau, member of a16z crypto research
Compiled by: Amber, Foresight News
Performance and scalability are the subject of one of the most enduring debates in the crypto world.
The debate over the merits of layer one versus layer two solutions is ongoing, but because the field lacks standardized metrics and assessment criteria, the numbers each side cites are often inconsistent, which only deepens the disagreement.
In simple terms, we need a more careful and thorough approach to comparing performance: break performance down into several dimensions, compare each separately, and then weigh the trade-offs among them. In this article, I will start with basic terminology, outline the challenges the field currently faces, and lay out fundamental principles to keep in mind when assessing blockchain performance.
Scalability & Performance
First, let’s define two terms: scalability and performance. Both have standard meanings in computer science but are often misused in the blockchain context. Performance measures what a system currently achieves, for example the number of transactions it can process per second or the time needed to confirm a transaction. Scalability measures a system's ability to improve performance by adding resources.
The reason we need to clarify these definitions first is that many methods of improving performance do not enhance scalability at all. A simple example is adopting a more efficient digital signature scheme, such as BLS signatures, which are roughly half the size of Schnorr or ECDSA signatures. If Bitcoin switched from ECDSA to BLS, the number of transactions per block could increase by 20-30%, improving performance overnight. But we can only do this once: there is no even more space-efficient signature scheme to switch to afterward (BLS signatures can also be aggregated to save more space, but that too is a one-time trick).
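To make the arithmetic concrete, here is a rough back-of-envelope sketch in Python. All the sizes are illustrative assumptions (a 1 MB block budget, two 72-byte ECDSA signatures per transaction, 48-byte BLS signatures); the actual gain depends on what fraction of a typical transaction is signature data.

```python
BLOCK_BYTES = 1_000_000        # assumed 1 MB block budget
NON_SIG_TX_BYTES = 160         # assumed non-signature bytes per transaction
SIGS_PER_TX = 2                # assumed number of inputs, one signature each
ECDSA_SIG_BYTES = 72           # typical DER-encoded ECDSA signature
BLS_SIG_BYTES = 48             # BLS signature over a 381-bit curve

def txs_per_block(sig_bytes: int) -> int:
    # How many whole transactions fit in the block budget.
    return BLOCK_BYTES // (NON_SIG_TX_BYTES + SIGS_PER_TX * sig_bytes)

ecdsa, bls = txs_per_block(ECDSA_SIG_BYTES), txs_per_block(BLS_SIG_BYTES)
print(f"ECDSA: {ecdsa} txs/block, BLS: {bls} txs/block, "
      f"one-time gain: {bls / ecdsa - 1:.0%}")
```

With these assumptions the gain comes out just under 20%, in the ballpark of the figure above; the point is that no further gain is available once the switch is made.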
In reality, many one-time tricks can improve blockchain networks (SegWit is one example), but what we truly need is a scalable architecture that delivers continuous performance improvements as resources are added. This is already the standard approach in the Web2 world: when building a service, we might start with a single sufficiently fast server, but we generally end up moving to a multi-server architecture and keep adding machines to meet growing storage and processing demands.
Understanding this distinction also helps avoid the category error in statements like "this blockchain is highly scalable; it processes so many transactions per second!" However impressive the number, transactions per second is a performance metric, not a scalability metric.
Scalability fundamentally requires exploiting parallelism. In the blockchain domain, layer one scaling generally requires sharding or something that resembles it. The basic idea of sharding is to split the state into pieces so that different validators can independently process different portions, which fits the definition of scalability exactly. Layer two offers even more options for adding parallel processing, including off-chain channels, Rollups, and sidechains.
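A minimal sketch of the core sharding idea, assuming a fixed shard count and hash-based routing (real sharding protocols must also handle cross-shard transactions, which this toy example omits):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count

def shard_of(account: str) -> int:
    # Deterministically route each account to a shard by hashing its ID.
    digest = hashlib.sha256(account.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for account in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_of(account)].append(account)

# Each shard's validators process only their slice of state; adding
# shards adds parallel capacity, which is what scalability means here.
print(shards)
```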
Latency and Throughput
In the past, we typically evaluated blockchain performance along two dimensions: latency and throughput. Latency measures how quickly a single transaction can be confirmed, while throughput measures the total number of transactions confirmed over a period of time. These measures apply to both layer one and layer two networks, and equally to other kinds of computer systems outside blockchain.
Unfortunately, both latency and throughput turn out to be difficult to measure and compare. Moreover, individual users do not actually care about throughput (a system-wide measure); they care about latency and transaction fees. Fees are an important dimension of blockchain systems that has no real counterpart in traditional computing.
Challenges in Measuring Latency
Measuring latency seems straightforward: how long does a transaction take to be confirmed? But the answer depends on where we start and stop the clock. Do we start counting when the user clicks the submit button locally, or when the transaction reaches the mempool? And do we stop the timer the moment the block is confirmed? Different choices yield different results.
The most common approach takes the validator's perspective: measure from the time a client first broadcasts a transaction until the transaction is reasonably confirmed, in the sense that a real-world merchant would consider the payment received and ship the goods. Of course, different merchants may apply different acceptance criteria, and even a single merchant may vary the standard with the size of the transaction.
The validator-centric approach overlooks some important practical considerations. First, it ignores latency on the peer-to-peer network (how long after the client broadcasts a transaction do most nodes hear about it?) and client-side latency (how long does it take to prepare the transaction on the client's local machine?). For simple transactions like signing an Ethereum payment, client-side latency may be small and predictable, but it can be significant for more complex cases, such as producing the proof of correctness for a privacy-preserving transaction.
Even if we standardize the time window for measuring latency, the final answer remains situational. No cryptocurrency system has ever guaranteed constant transaction latency. The basic rule of thumb to remember is that latency is a distribution, not a single number.
The networking research community has long recognized this and emphasizes that the long tail matters: even if only 0.1% of transactions experience high latency, the end-user experience can suffer severely.
For blockchains, confirmation latency can vary for several reasons:
Batch Processing: Most systems batch transactions in some way, which makes latency variable, since some transactions must wait for the batch to fill. A transaction lucky enough to catch the last spot in a batch is confirmed with almost no extra delay, while one that entered the queue earlier in the cycle must wait longer (see the toy simulation after this list).
Uncertain Congestion: Most systems experience congestion, meaning the number of published transactions exceeds what the system can process immediately. When transactions are broadcast at unpredictable times (often abstracted as a Poisson process), or when the rate of new transactions changes over a day or week, or in response to external events, the level of congestion may vary.
Consensus Layer Differences: Confirming a transaction on layer one usually requires a distributed set of nodes to reach consensus on a block, which adds variable latency independent of congestion. Proof-of-work systems discover blocks at unpredictable times, and proof-of-stake systems introduce their own sources of delay.
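The following toy simulation (parameters assumed, not drawn from any real chain) combines two of the effects above: Poisson arrivals feeding fixed-interval batches. Even in this idealized setting, with no congestion or consensus delay, confirmation latency is a distribution rather than a single number:

```python
import random
import statistics

random.seed(1)
ARRIVAL_RATE = 5.0      # assumed mean transaction arrivals per second
BATCH_INTERVAL = 12.0   # assumed seconds between batches ("blocks")
N_TX = 10_000

t, latencies = 0.0, []
for _ in range(N_TX):
    t += random.expovariate(ARRIVAL_RATE)       # Poisson arrival times
    next_batch = (t // BATCH_INTERVAL + 1) * BATCH_INTERVAL
    latencies.append(next_batch - t)            # wait until the next batch

latencies.sort()
print(f"mean   {statistics.mean(latencies):5.2f} s")
print(f"median {latencies[len(latencies) // 2]:5.2f} s")
print(f"p99    {latencies[int(0.99 * len(latencies))]:5.2f} s")
```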
For these reasons, a good guideline is that statements about latency should present the distribution of confirmation times rather than a single number like an average or median.
While summary statistics such as averages, medians, or percentiles can reveal patterns, accurately assessing a system requires looking at the entire distribution. In applications where the latency distribution is simple, the average can provide good insight, but that ideal situation is rare in cryptocurrencies: confirmation-time distributions typically have long tails.
Payment channel networks (like the Lightning Network) are a good example. As a classic L2 scaling solution, these networks often provide very fast payment confirmation services, but sometimes they require channel resets, which can lead to latency increasing by several orders of magnitude.
Even with good statistics on the exact latency distribution, the distribution may shift over time as the system and the demand on it change, and it is far from obvious how to compare latency distributions between competing systems. For example, consider a system that confirms transactions with latency uniformly distributed between 1 and 2 minutes (an average and median of 90 seconds). A competing system confirms 95% of transactions within 1 minute and the remaining 5% within 11 minutes (an average of 90 seconds and a median of 60 seconds). Which system is better? The answer is that different categories of applications may choose differently.
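The quoted summary statistics check out with two lines of arithmetic, which is exactly why they are misleading on their own:

```python
# System A: latency uniform on [60, 120] seconds.
mean_a = (60 + 120) / 2            # 90 s; for a uniform distribution, median = mean
# System B: 95% of transactions confirm in 60 s, the other 5% in 660 s.
mean_b = 0.95 * 60 + 0.05 * 660    # 57 + 33 = 90 s; the median is 60 s
print(mean_a, mean_b)              # identical 90 s averages, very different tails
```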
Finally, note that in most systems not all transactions have the same priority: users can pay higher fees for higher inclusion priority, so on top of everything above, latency also depends on the fee paid. In summary: latency is complex, and the more context a measurement comes with, the better. Ideally the full latency distribution should be measured under different congestion conditions, and breaking latency down into components (local, network, batching, consensus) also helps.
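One way to act on that advice is to record each component separately and only derive the total. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    local_s: float      # building, signing (and proving, if applicable)
    network_s: float    # broadcast until most nodes have seen the transaction
    batch_s: float      # waiting in the mempool for a batch/block
    consensus_s: float  # block proposal until accepted as confirmed

    @property
    def total_s(self) -> float:
        return self.local_s + self.network_s + self.batch_s + self.consensus_s

print(LatencyBreakdown(0.3, 1.2, 7.5, 4.0).total_s)  # 13.0 seconds end to end
```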
Challenges in Measuring Throughput
Throughput seems simple at first glance: how many transactions can a system process per second? But again the problems lurk beneath the surface. The difficulty comes down to two questions: what counts as a transaction, and are we measuring what a system does today or what it could do?
While transactions per second (TPS) is the common yardstick for blockchain performance, the transaction is a problematic unit of measurement. For systems offering general programmability (smart contracts), and even for Bitcoin with its multi-input and multisig transactions, the fundamental problem is that not all transactions are equal.
In Ethereum, a transaction can include arbitrary code and modify arbitrary state. Ethereum's notion of gas quantifies (and charges fees for) the total work a transaction performs, but gas is specific to the EVM execution environment. There is no simple way to compare the total work done by a set of EVM transactions with that of a set of Solana transactions running in its BPF environment, and comparing either to a set of Bitcoin transactions makes even less sense.
Blockchains that separate consensus from execution make this point clearer. At a (pure) consensus layer, throughput can be measured in bytes added to the chain per unit of time. The execution layer is far more complex.
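A sketch of measuring each layer in its natural unit, with made-up block data: bytes per second for a pure consensus layer, gas per second for an EVM-style execution layer. Note that neither number is a TPS figure:

```python
blocks = [
    {"bytes": 90_000, "gas_used": 14_500_000},
    {"bytes": 110_000, "gas_used": 15_200_000},
    {"bytes": 95_000, "gas_used": 13_800_000},
]
BLOCK_INTERVAL_S = 12.0  # assumed average seconds per block

span = BLOCK_INTERVAL_S * len(blocks)
print(f"consensus throughput: {sum(b['bytes'] for b in blocks) / span:,.0f} bytes/s")
print(f"execution throughput: {sum(b['gas_used'] for b in blocks) / span:,.0f} gas/s")
```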
Simpler execution layers, such as rollup servers that support only payment transactions, avoid the difficulty of quantifying computation. But even there, the number of inputs and outputs per payment can vary, which affects throughput, and a rollup server's throughput may also depend on how far a batch of transactions can be "reduced" to a smaller set of summary updates.
Another challenge is assessing theoretical capacity rather than just empirically measuring today's performance, which introduces all sorts of modeling questions. First, we must fix a realistic transaction workload for the execution layer. Second, real systems almost never reach theoretical capacity, blockchain systems especially: for robustness, we want nodes in practice to be heterogeneous and diverse (rather than every client running a single software implementation), which makes accurate simulation of blockchain throughput even harder.
Overall, measuring throughput requires careful treatment of both the transaction workload and the validator population. Absent clear standards, comparisons often fall back on historical workloads from popular networks like Ethereum.
Comprehensive Consideration of Latency and Throughput
After analyzing latency and throughput separately, we also need to weigh them against each other. As Lefteris Kokoris-Kogias has noted, the trade-off is often not smooth: latency can spike dramatically as system load approaches maximum throughput.
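A standard single-server queueing model (M/M/1, a textbook illustration rather than anything specific to blockchains) shows why the trade-off is not smooth: expected latency grows like 1/(capacity - load) and blows up as load approaches capacity.

```python
SERVICE_RATE = 100.0  # assumed capacity: transactions per second

def expected_latency(arrival_rate: float) -> float:
    # M/M/1 expected time in system: 1 / (service rate - arrival rate).
    assert arrival_rate < SERVICE_RATE, "unstable: load exceeds capacity"
    return 1.0 / (SERVICE_RATE - arrival_rate)

for load in (0.5, 0.9, 0.99, 0.999):
    lam = load * SERVICE_RATE
    print(f"load {load:6.1%} -> expected latency {expected_latency(lam) * 1000:8.1f} ms")
```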
ZK Rollup systems provide a natural example of the throughput/latency trade-off. Larger batches of transactions mean longer proving times, which increases latency. But the proof size and on-chain verification cost are largely fixed, so larger batches amortize them over more transactions, which increases throughput.
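A toy model of the batch-size trade-off, with all constants assumed rather than taken from any real prover: larger batches amortize the fixed on-chain verification cost (throughput up, per-transaction cost down) but take longer to prove (latency up).

```python
FIXED_VERIFY_GAS = 500_000    # assumed on-chain cost to verify one proof
PROVE_S_PER_TX = 0.05         # assumed marginal proving time per transaction
PROVE_S_FIXED = 30.0          # assumed fixed proving overhead per batch

for batch_size in (100, 1_000, 10_000):
    gas_per_tx = FIXED_VERIFY_GAS / batch_size
    proving_time = PROVE_S_FIXED + PROVE_S_PER_TX * batch_size
    print(f"batch {batch_size:6d}: {gas_per_tx:7.1f} gas/tx, "
          f"~{proving_time:6.1f} s to prove")
```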
Transaction Fees
Understandably, end users care about the trade-off between latency and fees rather than between latency and throughput. Users have no direct reason to care about throughput; they simply want their transactions confirmed quickly and as cheaply as possible (some users weigh fees more heavily, others latency). Overall, fees are influenced by several factors:
1. How much market demand is there?
2. What total throughput can the system achieve?
3. How much total revenue does the system pay to validators or miners?
4. How much of that revenue comes from transaction fees versus inflationary rewards?
In simple terms, all else being equal, higher throughput should lead to lower fees. But points 3 and 4 above are fundamental questions of blockchain system design. Despite many economic analyses of blockchain consensus protocols, we still have no consensus model of how much revenue validators actually need. Today most systems are built on an educated guess about how much revenue is enough to incentivize validators to act honestly without pricing out users. In a simplified model, the cost of mounting a 51% attack scales with the rewards paid to validators.
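A back-of-envelope version of this reasoning, with hypothetical numbers: if validators must collect some target revenue and fees were the only source, the average fee is roughly that revenue divided by transaction volume, which ties fees, throughput, and the security budget together.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def avg_fee(target_revenue_usd: float, tps: float) -> float:
    # Revenue needed per transaction if fees must cover the whole budget.
    return target_revenue_usd / (tps * SECONDS_PER_YEAR)

print(f"${avg_fee(1e9, 15):.2f}")   # ~$2.11/tx: $1B/year budget at 15 TPS
print(f"${avg_fee(1e9, 150):.2f}")  # ~$0.21/tx: same budget, 10x throughput
print(f"${avg_fee(5e8, 15):.2f}")   # ~$1.06/tx: half the budget, half the fee
```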
Raising the cost of attack is a good thing, but we also do not know how much security is "enough." Imagine you are choosing between two amusement parks, and one claims to spend 50% less on ride maintenance than the other. Should you go to it? Maybe it is more efficient and achieves the same safety for less money. Maybe the other park spends more than necessary to keep the rides safe, with no benefit. But it could also be that the first park is simply dangerous. Blockchain systems are similar: holding throughput constant, lower-fee blockchains have lower fees because they pay out less in rewards, and we currently lack good tools to judge whether that is sustainable or whether it leaves the system more vulnerable to attack.

In general: comparing fees across systems can be misleading. While transaction fees matter to users, they depend on many factors beyond the system design itself. Throughput is the better metric for analyzing the system as a whole.
Conclusion
Fair and accurate performance evaluation is hard. Measuring blockchain performance is a bit like deciding whether a car is worth buying: different people care about different things. Some buyers care about top speed or acceleration, some about fuel economy, others about cargo capacity. That is why the U.S. Environmental Protection Agency issues guidelines for standardized fuel-economy ratings.
In the blockchain space, we are far from having such standardized guidelines. Someday we may settle on a standard workload and publish "standard charts" of a network's throughput and latency distributions, but for now the best practice for researchers and builders is to collect as much data as possible and to describe the measurement environment in as much detail as possible before drawing conclusions, since only then can comparisons be made with any objectivity.