15 Top Zero-Day Vulnerabilities Found: The Consensus Protocol Debugging Agent Framework Built by 0G Labs with the National University of Singapore, Peking University, and Beijing University of Posts and Telecommunications

2026-06-11 14:24:48


Agora, a brand-new multi-agent testing framework, has uncovered 15 deep vulnerabilities in leading consensus protocols, decisively outperforming stock large models at very low cost and pointing toward a new era of automated security auditing.

Consensus protocols, the "Holy Grail" of distributed systems, have long been a "bug hell" for top infrastructure engineers. Because their state is extremely complex and interleaved across many nodes, traditional testing and monolithic LLMs are almost powerless against hard-core deep bugs.

Recently, in a new ICML 2026 paper, researchers from 0G Labs, together with academic and industry teams from the National University of Singapore, Peking University, and Beijing University of Posts and Telecommunications, proposed the first automated testing framework that deeply integrates domain knowledge with LLM-based multi-agent collaboration: Agora.

With its novel architecture, the framework goes straight at the pain points of consensus protocols, identifying 15 previously unknown protocol-level deep bugs in core industrial and academic protocols such as Raft, EPaxos, HotStuff, and BullShark. In contrast, powerful stock models such as GPT-5.2 and Claude Sonnet 4.5 failed to find any of them. As multi-agent systems and "Agentic Quality Control" become the hottest tracks of 2026, Agora offers not just a paper but a practical industrial solution.

Paper: Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

1. Background: 0G and NUS Join Forces, Combining Long-Accumulated Systems Knowledge with the Multi-Agent Paradigm

The evolution of distributed consensus protocols is a history of brilliant innovation and, equally, a painful history of top engineers falling into pitfalls. As Turing Award winner Lamport said, ensuring the correctness of a distributed protocol implementation is as hard as navigating a constantly shaking maze blindfolded. On this "hell-level" track, the market is quietly shifting: according to Gartner, enterprise inquiries about multi-agent systems have surged more than tenfold in just over a year, and the multi-agent platform market is entering a period of nearly doubling every year. Using multi-agent collaboration to verify the most demanding low-level systems is turning from a cutting-edge concept into an industrial necessity.

On this hell-level track, the best-known tech giants have led the way with resource-heavy approaches. For example, industry leader Anthropic has recently been advancing its Glasswing project internally in Claude Code, attempting to use agents for low-level infrastructure testing. Its architecture still relies heavily on top-tier commercial large models, project details remain vague, and cooperation is limited to closed-door work with a handful of large tech institutions and multinational giants. More critically, giant-scale solutions of this kind may consume enormous numbers of tokens in operation, creating a high compute barrier that shuts out budget-constrained startups and SMEs.

Are small companies and open-source communities destined to be priced out of top-tier automated vulnerability auditing tools?

Engineers from 0G Labs, along with Liu Xiang from the National University of Singapore, Song Sa from Beijing University of Posts and Telecommunications, and doctoral student Zhang Zhao and researcher Zhang Ceyao from Peking University's School of Intelligence, have drawn on their deep expertise in agents to mount a "small against big" challenge. Their work has been submitted to ICML 2026.

When academia's long-accumulated systems knowledge meets industry's keen sense for pain points, how does that ignite the next revolution in system security?

The research team has deep academic foundations in high-performance distributed systems, low-level concurrency control, and formal verification of systems. They are well aware that traditional methods such as fuzzing are often limited by state space explosion when applied to industrial-grade codebases. The researchers therefore decided to inject their long-accumulated knowledge of reasoning about global invariants in distributed systems, as the "soul," into the multi-agent collaborative paradigm and an automated test harness architecture, launching the open-source, widely accessible Agora framework.

Meanwhile, 0G, a modular AI infrastructure and high-performance decentralized data availability network, brings rich production-level attack-and-defense experience and real-world protocol defect samples from deploying blockchain consensus protocols and high-concurrency BFT architectures in industry.

This cross-disciplinary integration changes the game. It is neither blind brute-force testing nor the "blind men touching an elephant" approach of large models that lack domain knowledge. Instead, through a division of labor among specialized agents, it turns the reasoning intuition that seasoned systems experts have built up over decades into contest and collaboration between agents, which gives it the strength to outperform traditional testing tools.

Unlike the resource-heavy route of Glasswing, which consumes massive numbers of top-tier model tokens, Agora offers a far friendlier alternative for SMEs. It shows that even a "slightly inferior" but more cost-effective base model, paired with a carefully designed domain-aware multi-agent architecture, can still uncover hard-core deep bugs.

2. Pain Points: Monolithic LLMs Struggle to Cross the Threshold, While a Sword of Damocles of Deep Logic Bugs Hangs over Distributed Systems

In a world dominated by big data, blockchain, and distributed databases, consensus protocols (such as Paxos, Raft, and PBFT) form the foundation of the entire digital world. However, implementing consensus protocols is notoriously, "hellishly" difficult. Even industrial benchmarks like etcd, refined by countless top engineers worldwide and in production for years, still hide deep bugs that can make one break into a cold sweat.

These types of bugs differ from ordinary low-level implementation bugs (such as memory leaks and integer overflows); they span multiple execution phases and depend on complex concurrent states. Once maliciously triggered, they can not only lead to core data corruption but can also cause catastrophic financial losses.
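
To make the distinction concrete, here is a minimal sketch of a protocol-level check, as opposed to a local one. It is not taken from Agora or from any of the tested codebases: the `Snapshot` type, its field names, and the two invariants (a node's commit index never decreases, and no two nodes commit different entries at the same log index) are illustrative assumptions in the spirit of Raft-style protocols. Neither property can be judged from a single function or a single node; both need the whole trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """State of one node at one point in a recorded trace (hypothetical format)."""
    node: str
    commit_index: int
    log: tuple[str, ...]  # log[i] is the entry stored at log index i + 1


def check_trace(trace: list[Snapshot]) -> list[str]:
    """Return human-readable violations of two global invariants."""
    violations = []
    last_commit: dict[str, int] = {}             # highest commit index seen per node
    committed: dict[int, tuple[str, str]] = {}   # log index -> (entry, first committer)

    for step, snap in enumerate(trace):
        # Invariant 1 (monotonicity): a node's commit index never moves backwards.
        prev = last_commit.get(snap.node, 0)
        if snap.commit_index < prev:
            violations.append(
                f"step {step}: {snap.node} commit index went {prev} -> {snap.commit_index}"
            )
        last_commit[snap.node] = max(prev, snap.commit_index)

        # Invariant 2 (agreement): every node commits the same entry at a given index.
        for idx in range(1, min(snap.commit_index, len(snap.log)) + 1):
            entry = snap.log[idx - 1]
            first_entry, first_node = committed.setdefault(idx, (entry, snap.node))
            if first_entry != entry:
                violations.append(
                    f"step {step}: index {idx} is {entry!r} on {snap.node} "
                    f"but {first_entry!r} on {first_node}"
                )
    return violations


trace = [
    Snapshot("n1", 1, ("x=1",)),
    Snapshot("n2", 1, ("x=1",)),
    Snapshot("n1", 2, ("x=1", "y=2")),
    Snapshot("n2", 2, ("x=1", "y=3")),  # diverges from n1 at index 2
    Snapshot("n1", 1, ("x=1", "y=2")),  # commit index regresses
]
for violation in check_trace(trace):
    print(violation)
```

Each snapshot in this toy trace looks plausible on its own; the two violations only appear once the states of different nodes and different moments are compared.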

In recent years, large language models (LLMs) have performed brilliantly on ordinary code analysis, but they flounder when facing distributed consensus. At best they identify shallow defects in local code. Confronted with protocol-level logic bugs that depend on global state, monolithic LLMs tend to get bogged down in local details and cannot carry out global temporal reasoning.

3. Breaking the Deadlock: Agora's Three-Agent Transformation and Core Harness Architecture

To break this deadlock, Agora is the first to introduce the classic academic paradigm of Hypothesis-Driven Testing (HDT) into large model Agent systems. To achieve efficient global reasoning, Agora completely abandons the traditional "lone soldier" model, elegantly decoupling the workflow into three highly specialized Agents, each with its own responsibilities:

Orchestrator Agent: maintains global state and makes use of known vulnerabilities;
Strategy Agent: injects distributed-systems domain knowledge and generates highly adversarial anomalous scenarios for CFT (crash fault tolerant) and BFT (Byzantine fault tolerant) protocols;
TestGen Agent: the hands-on implementer, which turns those scenarios into tests.

The key to Agora's ability to generate effective tests in a closed loop lies in its core automated testing architecture.

Agora's architecture is illustrated as follows:

In Agora's overall design, the "small against big" effect does not come from nowhere. It derives from a carefully designed agent interaction mechanism and its deep integration with the testing harness architecture.

The research team designed a minimal, efficient communication and memory mechanism (Succinct Memory & Communication) for the framework, so that each agent focuses on its core task while redundant context passing is kept to a minimum. Under this tight communication budget, the Orchestrator Agent (global coordination and state control), the Strategy Agent (generation of distributed anomalous environments and scenarios), and the TestGen Agent (test code generation and dynamic evaluation) work closely together to drive the harness architecture:

A two-pronged automated closed loop: once the Strategy Agent has derived an abstract distributed attack scenario, the highly decoupled interaction framework lets the TestGen Agent start the underlying tests immediately. The architecture adapts well to different environments, working across programming languages such as Go and Rust and converting attack hypotheses into real, runnable unit tests, and it also incorporates an efficient Reflection-Loop technique.
When a test runs in the environment and hits an error, the system captures the call stack and execution logs in real time and relays them back to the agent in succinct form for targeted self-correction. This combination of minimal multi-agent interaction and a dynamic closed-loop harness lets Agora capture the most deeply hidden logic bugs at very low token cost, while producing detailed analysis reports with a low false positive rate.
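
The reflection loop described above can be sketched as follows. This is a toy under loud assumptions, not Agora's harness: `generate_test` is a hard-coded stand-in for an LLM-backed generator, the "tests" are Python snippets run with `exec` rather than Go or Rust unit tests, and keeping the last three traceback lines is an arbitrary choice standing in for succinct feedback.

```python
import traceback
from typing import Optional


def generate_test(feedback: Optional[str]) -> str:
    """Stand-in for an LLM-backed test generator (hypothetical)."""
    if feedback is None:
        # First draft calls a helper that does not exist in the harness.
        return "cluster = start_cluster(3)\nassert cluster == 3"
    if "NameError" in feedback:
        # Revised draft, written after seeing the trimmed error report.
        return "cluster = 3\nassert cluster == 3"
    return "assert False, 'gave up'"


def run_test(source: str) -> Optional[str]:
    """Execute a test; return None on success or a trimmed error report."""
    try:
        exec(source, {})
        return None
    except Exception:
        lines = traceback.format_exc().strip().splitlines()
        return "\n".join(lines[-3:])  # keep only the tail to save context


def reflection_loop(max_rounds: int = 3) -> tuple[bool, int]:
    """Regenerate the test from error feedback until it runs or rounds run out."""
    feedback = None
    for attempt in range(1, max_rounds + 1):
        feedback = run_test(generate_test(feedback))
        if feedback is None:
            return True, attempt
    return False, max_rounds


print(reflection_loop())
```

In this sketch the first draft fails with a `NameError`, the trimmed report is fed back, and the second draft runs. A real loop would also have to tell a broken test apart from a test that fails because it found a genuine bug, which this toy does not attempt.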

Its final operational overview is illustrated as follows:

4. Results: 15 Top Zero-Day Deep Bugs Found, While Every Baseline Large Model Failed

The evaluation results are striking. The research team evaluated Agora on four well-known consensus protocol codebases (including production-grade etcd and core components underlying the emerging public chain Sui) and compared it with the strongest available models, including GPT-5.2, Gemini 3.0 Pro Preview, Claude Sonnet 4.5, and Qwen3 Coder.

The results not only made the consensus system operated by 0G itself more secure but also showed an overwhelming advantage over the baselines:

15 New Deep Logic Bugs Surface: Agora discovered 15 previously unknown protocol-level deep logic vulnerabilities. They span high-risk areas such as execution divergence, monotonicity violations, topological defects, and signature vulnerabilities.
Stock Large Models All Failed: In contrast, the baseline models, even when equipped with advanced ReAct dynamic toolchains, found none of the deep logic bugs (0/15). They consumed large numbers of tokens but could only circle around low-level implementation bugs.
Low False Positive Rate and High Cost-Effectiveness: Among all bug reports produced by Agora, 73.9% were real logic vulnerabilities (a false positive rate of 26.1%). Even more striking, each of these top-level logic bugs, the kind that make senior architects lose their hair, required on average only about 5.32M tokens (approximately $40).
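
As a back-of-the-envelope check on these figures, the snippet below derives a few implied quantities from the numbers quoted above. The inputs come from the article; everything computed from them is an estimate, and the variable names are ours. In particular, the article does not say whether the per-bug average already includes tokens spent on reports that turned out to be false positives, so the totals should be read as rough.

```python
TOKENS_PER_BUG_M = 5.32   # million tokens per deep bug (from the article)
USD_PER_BUG = 40.0        # approximate cost per deep bug (from the article)
BUGS_FOUND = 15           # deep bugs reported (from the article)
PRECISION = 0.739         # share of reports that were real logic bugs

# Implied blended price per million tokens.
usd_per_million_tokens = USD_PER_BUG / TOKENS_PER_BUG_M

# Implied totals if the per-bug averages hold for all 15 bugs.
total_tokens_m = TOKENS_PER_BUG_M * BUGS_FOUND
total_usd = USD_PER_BUG * BUGS_FOUND

# Share of reports that were false positives, and reports needed per real bug.
false_positive_share = 1 - PRECISION
reports_per_real_bug = 1 / PRECISION

print(f"~${usd_per_million_tokens:.2f} per million tokens")
print(f"~{total_tokens_m:.1f}M tokens and ~${total_usd:.0f} for {BUGS_FOUND} bugs")
print(f"{false_positive_share:.1%} false positives, "
      f"~{reports_per_real_bug:.2f} reports per real bug")
```

Under these assumptions the whole campaign comes to roughly 80M tokens and about $600, and a reviewer would triage about four reports for every three real bugs.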

The results across multiple LLMs are shown as follows:

5. Future: High Scalability, Expanding into More Hardcore "No-Man's Land"

Agora's success not only gives a strong boost to the security of distributed systems but also points the way for deploying large models in vertical industrial applications.

Crucially, Agora's architecture is highly scalable and general. The research team emphasizes that Agora can be quickly reproduced and used by a wide range of users in the form of plugins or skills; the released code (github.com/0gfoundation/agora) provides corresponding skills to assist in reproduction. Moreover, Agora's "large model + multi-agent collaboration + hypothesis-driven" paradigm is not limited to consensus protocols. Because it cleanly decouples low-level workflow control from the domain knowledge base and testing layered on top, the architecture can help users debug consensus protocols and can also be carried over, in a "plug-and-play" manner, to other demanding fields plagued by the same "deep logic bug hell":

Database Concurrency Control: testing complex transaction conflict defects in distributed databases under strict isolation levels (such as Serializable).
Operating System Kernel / Concurrent Systems: uncovering hidden deadlocks and race conditions in multithreaded infrastructure.
Web3 Smart Contract Auditing: exploring the security boundaries of cross-chain protocols and DeFi logic involving complex economic models. The blockchain security market is expected to reach approximately $8.5 billion by 2026, and commercial products have emerged that use "multi-agent security systems" for smart contract auditing, compressing audit cycles from weeks to hours as market demand explodes.

The era of AI automated security for industrial-grade underlying infrastructure may officially begin with Agora and its Harness architecture.

We have reason to believe that, by discovering more deep bugs across different fields, Agora can help test the capabilities of coding LLMs more rigorously, and that the deep-bug cases it uncovers can in turn help coding LLMs improve their code understanding.

Agora can significantly strengthen the security of the code repositories that underpin secure financial transactions, such as consensus protocols, concurrency control, and smart contracts. It can also help more tech companies discover deeper logic bugs while consuming fewer tokens, cutting costs and improving efficiency at the same time.

More importantly, this aligns with two of today's hottest tracks. First, multi-agent systems are moving from experimentation to production: Gartner predicts that by 2028 over 30% of enterprise software will incorporate agentic AI, and the multi-agent platform market is set to surge from tens of billions to hundreds of billions in just a few years. Second, "agents auditing agents" through Agentic Quality Control is becoming the industry standard for 2026.

With the Veracode 2025 report indicating that approximately 45% of AI-generated code contains security vulnerabilities, and the agentic AI security market growing at a compound annual rate of about 42%, Agora enables tech companies to uncover deeper logic bugs at lower token cost, upgrading security auditing from "human labor billed by the week" to "automated capabilities delivered by the hour."

As the landscape of this track gradually clarifies, those who truly seize the opportunity are often not the loudest giants but the teams that first successfully implement and can continuously replicate the methodology.
