Celer: Pantheon - ZKP Development Framework Evaluation Platform
Author: Celer
In the past few months, we have invested a significant amount of time and effort in developing cutting-edge infrastructure built on zk-SNARK succinct proofs. This next-generation innovative platform enables developers to create unprecedented new paradigms of blockchain applications.
During our development work, we tested and utilized various zero-knowledge proof (ZKP) development frameworks. While this journey has been rewarding, we have also realized that the diversity of ZKP frameworks often poses challenges for new developers trying to find the most suitable framework for their specific use cases and performance requirements.
In light of this pain point, we believe there is a need for a community assessment platform that can provide comprehensive performance testing results, which will greatly facilitate the development of these new applications.
To meet this demand, we are launching the Zero-Knowledge Proof Development Framework Evaluation Platform "Pantheon" as a public community initiative. The first step of this initiative will encourage the community to share reproducible performance testing results of various ZKP frameworks. Our ultimate goal is to collaboratively create and maintain a widely recognized testing platform to evaluate low-level circuit development frameworks, advanced zkVMs, compilers, and even hardware acceleration providers.
We hope this initiative will provide developers with more performance comparison references when selecting frameworks, thereby accelerating the adoption of ZKP. At the same time, we aim to promote the upgrade and iteration of ZKP frameworks themselves by providing a set of universally referable performance testing results. We will invest heavily in this plan and invite all like-minded community members to join us in contributing to this work!
First Step: Performance Testing of Circuit Frameworks Using SHA-256
In this article, we take the first step towards building ZKP Pantheon by providing a set of reproducible performance testing results using SHA-256 across a range of low-level circuit development frameworks. While we acknowledge that other granularities of performance testing and primitives may also be viable, we chose SHA-256 because it is applicable to a wide range of ZKP use cases, including blockchain systems, digital signatures, zkDIDs, and more.
It is also worth mentioning that we use SHA-256 in our own systems, so this is convenient for us as well!
Our performance testing evaluates the performance of SHA-256 across various zk-SNARK and zk-STARK circuit development frameworks. Through comparison, we strive to provide developers with insights into the efficiency and practicality of each framework. Our goal is for the performance testing results to serve as a reference for developers in selecting the best framework, enabling them to make informed decisions.
Proof Systems
In recent years, we have observed a surge in zero-knowledge proof systems. Keeping up with all the exciting advancements in this field is challenging, and we have carefully selected the following proof systems as test subjects based on maturity and developer adoption. Our goal is to provide a representative sample of different front-end/back-end combinations.
++Circom++ + ++snarkjs++ / ++rapidsnark++: Circom is a popular DSL for writing circuits and generating R1CS constraints, while snarkjs can generate Groth16 or Plonk proofs for Circom circuits. Rapidsnark is another prover for Circom that generates Groth16 proofs; it is generally much faster than snarkjs because it uses the ADX extension and parallelizes proof generation.
++gnark++: gnark is a comprehensive Golang framework from Consensys that supports Groth16, Plonk, and many more advanced features.
++Arkworks++: Arkworks is a comprehensive Rust framework for zk-SNARKs.
++Halo2 (KZG)++: Halo2 is Zcash's zk-SNARK implementation with Plonk. It features highly flexible Plonkish arithmetization and supports many useful primitives, such as custom gates and lookup tables. We use the Halo2 fork with KZG support maintained by the Ethereum Foundation and Scroll.
++Plonky2++: Plonky2 is a SNARK implementation based on PLONK and FRI techniques from Polygon Zero. Plonky2 uses the small Goldilocks field and supports efficient recursion. In our performance tests, we target 100 bits of conjectured security and use the parameters that yield the best proof time. Specifically, we used 28 Merkle queries, a blowup factor of 8, and 16 proof-of-work challenge bits. Additionally, we set num_of_wires = 60 and num_routed_wires = 60.
++Starky++: Starky is a high-performance STARK framework from Polygon Zero. In our performance tests, we target 100 bits of conjectured security and use the parameters that yield the best proof time. Specifically, we used 90 Merkle queries, a blowup factor of 2, and 10 proof-of-work challenge bits.
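As a sanity check on the FRI parameters quoted above, the sketch below applies a commonly used heuristic for conjectured FRI soundness: each Merkle query contributes log2 of the amplification (blowup) factor in bits, and the proof-of-work challenge adds its own bits on top. The function name and the formula's use here are our assumptions for illustration; they are not taken from the benchmark repository.

```python
def conjectured_security_bits(num_queries: int, blowup_factor: int, pow_bits: int) -> int:
    """Heuristic estimate: queries * log2(blowup) + proof-of-work bits.

    This is an assumed rule of thumb for conjectured FRI security,
    not a proven bound and not the authors' own calculation.
    """
    if blowup_factor < 2 or blowup_factor & (blowup_factor - 1):
        raise ValueError("blowup factor must be a power of two >= 2")
    rate_bits = blowup_factor.bit_length() - 1  # exact log2 for powers of two
    return num_queries * rate_bits + pow_bits


# Parameters as stated in the text for each framework.
plonky2_bits = conjectured_security_bits(num_queries=28, blowup_factor=8, pow_bits=16)
starky_bits = conjectured_security_bits(num_queries=90, blowup_factor=2, pow_bits=10)
print(plonky2_bits, starky_bits)
```

Under this heuristic both configurations land on the same 100-bit target, which illustrates the trade-off: a smaller blowup factor needs many more queries (and hence larger proofs) to reach the same security level.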
The table below summarizes the frameworks mentioned above and the relevant configurations used in our performance tests. This list is by no means exhaustive, and we will also explore many cutting-edge frameworks/technologies (e.g., Nova, GKR, Hyperplonk) in the future.
Please note that these performance testing results apply only to circuit development frameworks. We plan to publish a separate article in the future to perform performance testing on different zkVMs (e.g., Scroll, Polygon zkEVM, Consensys zkEVM, zkSync, Risc Zero, zkWasm) and IR compiler frameworks (e.g., Noir, zkLLVM).
Performance Evaluation Methodology
To benchmark these proof systems, we computed the SHA-256 hash of N bytes of data for N = 64, 128, …, 64K. (Starky is an exception: its circuit computes the SHA-256 of a fixed 64-byte input while keeping the total number of message blocks the same.) The benchmark code and SHA-256 circuit configurations can be found in ++this repository++.
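To make the input schedule concrete, here is a minimal sketch that enumerates the sizes N = 64, 128, …, 64K and the number of 64-byte SHA-256 message blocks each one produces after standard padding. The all-zero test message and the helper name `sha256_block_count` are assumptions for illustration; the actual inputs, and whether each circuit counts the padding block, are defined in the benchmark repository.

```python
import hashlib


def sha256_block_count(message_len: int) -> int:
    """Blocks after SHA-256 padding: message + 0x80 byte + 8-byte length,
    rounded up to a multiple of 64 bytes."""
    return (message_len + 9 + 63) // 64


# 64, 128, ..., 65536 bytes (11 sizes, doubling each time).
input_sizes = [64 * 2 ** i for i in range(11)]

for n in input_sizes:
    message = bytes(n)  # assumed all-zero input, for illustration only
    digest = hashlib.sha256(message).hexdigest()
    print(f"N={n:>6}  blocks={sha256_block_count(n):>5}  digest={digest[:16]}...")
```

Because every N here is a multiple of 64, padding always adds exactly one extra block, so the block count grows almost exactly linearly with N.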
Additionally, we used the following performance metrics to evaluate each system:
Proof generation time (including witness generation time)
Peak memory usage during proof generation
Average CPU usage percentage during proof generation. (This metric reflects the degree of parallelization during the proof generation process)
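The three metrics above can be collected for any prover that runs as a separate process. The following is a minimal sketch using only the Python standard library on a Unix-like system; it is not the harness used for the published numbers, and the command passed to `measure` is a stand-in for a real prover invocation.

```python
import resource
import subprocess
import sys
import time


def measure(cmd: list) -> dict:
    """Run cmd as a child process and report wall time, peak RSS, and average CPU %."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU seconds (user + system) consumed by the child, summed over all its threads.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS. It is the
    # maximum over all children waited for so far, so measure one prover per run.
    scale = 1 if sys.platform == "darwin" else 1024
    return {
        "wall_seconds": wall,
        "peak_rss_bytes": after.ru_maxrss * scale,
        "avg_cpu_percent": 100.0 * cpu / wall,  # above 100 means several cores were busy
    }


# Placeholder workload; replace with the actual prover command line.
stats = measure([sys.executable, "-c", "sum(range(10**6))"])
print(stats)
```

Dividing `avg_cpu_percent` by the number of cores gives the per-core utilization, which is a rough indicator of how well a prover parallelizes.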
Please note that we do not focus on proof size or proof verification cost here, because both can be mitigated by composing the proof with Groth16 or KZG before going on-chain.
Machines
We conducted performance tests on two different machines:
Linux Server: 20 cores @2.3 GHz, 384GB RAM
Macbook M1 Pro: 10 cores @3.2GHz, 16GB RAM
The Linux server was used to simulate a scenario with many CPU cores and ample memory. The Macbook M1 Pro, typically used for development, has a more powerful CPU but fewer cores.
We enabled optional multithreading, but we did not use GPU acceleration in this performance test. We plan to conduct GPU performance testing in the future.
Performance Evaluation Results
Number of Constraints
Before discussing the detailed results, it is useful to understand the complexity of SHA-256 by looking at the number of constraints in each proof system. It is important to note that constraint counts under different arithmetization schemes cannot be compared directly.
The results below correspond to an input size of 64KB. While results may vary with other input sizes, they can be roughly linearly scaled.
Circom, gnark, and Arkworks all use the same R1CS arithmetization, with the number of R1CS constraints for computing SHA-256 over 64KB roughly between 30M and 45M. The differences between Circom, gnark, and Arkworks may be due to configuration differences.
Halo2 and Plonky2 both use Plonkish arithmetization, with the number of rows ranging from 2^22 to 2^23. Halo2's SHA-256 implementation is significantly more efficient than Plonky2's due to the use of lookup tables.
Starky uses the AIR arithmetization, where the execution trace table requires 2^16 transition steps.
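For a rough sense of scale, the totals above can be divided by the number of 64-byte message blocks in a 64KB input. This is a back-of-envelope sketch, not a figure reported by the benchmark: it ignores the extra padding block, and Plonkish row counts are typically padded up to a power of two, so the per-block row figures are upper bounds.

```python
INPUT_BYTES = 64 * 1024
BLOCK_BYTES = 64
data_blocks = INPUT_BYTES // BLOCK_BYTES  # 1024 blocks, padding block ignored


def per_block(total: int, blocks: int = data_blocks) -> float:
    """Average cost per 64-byte SHA-256 block."""
    return total / blocks


r1cs_low = per_block(30_000_000)    # lower end of the R1CS range
r1cs_high = per_block(45_000_000)   # upper end of the R1CS range
plonkish_low = per_block(2 ** 22)   # rows per block at 2^22 rows
plonkish_high = per_block(2 ** 23)  # rows per block at 2^23 rows
air_steps = per_block(2 ** 16)      # trace steps per block at 2^16 steps

print(f"R1CS constraints per block: {r1cs_low:,.0f} to {r1cs_high:,.0f}")
print(f"Plonkish rows per block:    {plonkish_low:,.0f} to {plonkish_high:,.0f}")
print(f"AIR steps per block:        {air_steps:,.0f}")
```

The units differ across arithmetizations (an R1CS constraint, a Plonkish row, and an AIR step are not equivalent amounts of prover work), so these per-block figures are only meaningful within one scheme.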
Proof Generation Time
[Figure 1] shows the proof generation time for SHA-256 across each framework tested on the Linux server at various input sizes. We can draw the following conclusions:
For SHA-256, Groth16 frameworks (rapidsnark, gnark, and Arkworks) generate proofs faster than Plonk frameworks (Halo2 and Plonky2). This is because SHA-256 primarily consists of bitwise operations, where wire values are either 0 or 1. For Groth16, this reduces most of the computations from elliptic curve scalar multiplication to elliptic curve point addition. However, wire values are not directly used in Plonk's computations, so the special wire structure in SHA-256 does not reduce the computational load required in the Plonk framework.
Among all Groth16 frameworks, gnark and rapidsnark are 5 to 10 times faster than Arkworks and snarkjs. This is due to their superior ability to parallelize proof generation across multiple cores. Gnark is 25% faster than rapidsnark.
Among the Plonk frameworks, Plonky2's SHA-256 is 50% slower than Halo2's at larger input sizes (>= 4KB). This is because Halo2's implementation primarily uses lookup tables to accelerate bitwise operations, resulting in roughly half the number of rows compared to Plonky2. However, if we compare Plonky2 and Halo2 at the same number of rows (e.g., SHA-256 over 2KB in Halo2 versus SHA-256 over 4KB in Plonky2), Plonky2 is 50% faster than Halo2. If SHA-256 were implemented in Plonky2 using lookup tables, we would expect Plonky2 to be faster than Halo2, even though Plonky2's proof size is larger.
On the other hand, when the input size is small (<= 512 bytes), Halo2 is slower than Plonky2 (and the other frameworks) because the fixed setup cost of the lookup table dominates the constraint count. As the input size increases, Halo2 becomes more competitive: its proof generation time stays roughly constant for input sizes up to 2KB and scales almost linearly beyond that, as shown in the graph.
As expected, Starky's proof generation time is much shorter than any SNARK framework (5x-50x), but this comes at the cost of a larger proof size.
It is also worth noting that even though the circuit size grows linearly with the input size, proof generation time for SNARKs grows superlinearly because of the O(n log n) FFT (although this is not obvious in the graphs due to the logarithmic scale).
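The superlinear FFT term mentioned above is easy to quantify. The sketch below uses a simple n * log2(n) cost model (an idealization we assume for illustration; real provers also spend time on multi-scalar multiplication, hashing, and memory traffic) and shows how much the cost grows when the circuit size doubles.

```python
import math


def fft_cost(n: int) -> float:
    """Idealized FFT cost model: n * log2(n) operations."""
    return n * math.log2(n)


def doubling_ratio(n: int) -> float:
    """Factor by which the modeled FFT cost grows when n doubles."""
    return fft_cost(2 * n) / fft_cost(n)


for k in (16, 20, 24):
    print(f"n = 2^{k}: doubling n multiplies FFT cost by {doubling_ratio(2 ** k):.3f}")
```

In this model, doubling n from 2^k to 2^(k+1) multiplies the cost by 2(k+1)/k, which is only slightly above 2 at realistic circuit sizes. This is consistent with the growth looking nearly linear on a log-scale plot.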
We also conducted proof generation time performance tests on the Macbook M1 Pro, as shown in [Figure 2]. However, it is important to note that rapidsnark was not included in this performance test due to a lack of support for the arm64 architecture. To use snarkjs on arm64, we had to generate witnesses using WebAssembly, which is slower than the C++ witness generation used on the Linux server.
There are also several additional observations when running performance tests on the Macbook M1 Pro:
All frameworks except Starky encounter out-of-memory (OOM) errors or use swap memory (leading to slower proof times) as the input size increases. Specifically, Groth16 frameworks (snarkjs, gnark, Arkworks) start using swap memory at input sizes >= 8KB, while gnark encounters out-of-memory issues at input sizes >= 64KB. Halo2 runs into memory limitations at input sizes >= 32KB, and Plonky2 starts using swap memory at input sizes >= 8KB.
FRI-based frameworks (Starky and Plonky2) are about 60% faster on the Macbook M1 Pro than on the Linux server, while other frameworks have similar proof times on both machines. As a result, even without lookup tables, Plonky2 achieves proof times nearly identical to Halo2's on the Macbook M1 Pro. The main reason is that the Macbook M1 Pro has a more powerful CPU but fewer cores. FRI primarily performs hashing operations, which are sensitive to CPU clock speed but do not parallelize as well as KZG or Groth16.
Peak Memory Usage
[Figure 3] and [Figure 4] show the peak memory usage during proof generation on the Linux Server and Macbook M1 Pro, respectively. Based on these performance testing results, the following observations can be made:
Among all SNARK frameworks, rapidsnark is the most memory-efficient. We also see that Halo2 uses more memory when the input size is small due to the fixed setup cost of the lookup table, but consumes less overall memory when the input size is larger.
Starky uses over 10 times less memory than the SNARK frameworks. Part of the reason is that it uses fewer rows.
It should be noted that due to the use of swap memory, the peak memory usage on the Macbook M1 Pro remains relatively stable as the input size increases.
CPU Utilization
We evaluated the degree of parallelization of each proof system by measuring the average CPU utilization during the proof generation of SHA-256 with a 4KB input. The table below shows the average CPU utilization on the Linux Server (20 cores) and Macbook M1 Pro (10 cores) (with the average utilization per core in parentheses).
The main observations are as follows:
Gnark and rapidsnark exhibit the highest CPU utilization on the Linux server, indicating their ability to effectively utilize multiple cores and parallelize proof generation. Halo2 also shows good parallelization performance.
For most frameworks, CPU utilization on the Linux server is twice that on the Macbook M1 Pro, with snarkjs being the only exception.
Although it was initially expected that FRI-based frameworks (Plonky2 and Starky) might struggle to effectively utilize multiple cores, their performance in our performance tests was not worse than that of some Groth16 or KZG frameworks. Whether there will be differences in CPU utilization on machines with more cores (e.g., 100 cores) remains to be seen.
Conclusion and Future Research
This article comprehensively outlines the performance testing results of SHA-256 across various zk-SNARK and zk-STARK development frameworks. Through comparison, we gained insights into the efficiency and practicality of each framework, aiming to assist developers who need to generate succinct proofs for SHA-256 operations.
We found that Groth16 frameworks (e.g., rapidsnark, gnark) are faster at generating proofs than Plonk frameworks (e.g., Halo2, Plonky2). The lookup tables in Plonkish arithmetic significantly reduce the constraints and proof times for larger input sizes of SHA-256. Additionally, gnark and rapidsnark demonstrate excellent capabilities in utilizing multiple cores for parallel operation. On the other hand, Starky's proof generation time is much shorter, but at the cost of a significantly larger proof size. In terms of memory efficiency, rapidsnark and Starky outperform other frameworks.
As the first step in building the zero-knowledge proof evaluation platform "Pantheon," we acknowledge that these performance testing results are far from sufficient to serve as the comprehensive testing platform we hope to build. We welcome and are eager to receive feedback and criticism, and invite everyone to contribute to this initiative so that developers can more easily and accessibly utilize zero-knowledge proofs. We are also willing to provide funding for individual independent contributors to cover the computational resource costs of large-scale performance testing. We hope to collectively improve the efficiency and practicality of ZKP for the broader benefit of the community.
Finally, we would like to thank the Polygon Zero team, the gnark team from Consensys, Pado Labs, and the Delphinus Lab team for their valuable reviews and feedback on the performance testing results.