NodeSmith: AI-Driven Automation for Blockchain Node Upgrades

TL;DR: NodeSmith is Coinbase’s AI-driven system for automating blockchain node upgrades across 60+ blockchains. It has reduced engineering upgrade effort by 30%, eliminated missed mandatory upgrades, and added automated analysis of every upgrade’s code changes.

By Mark Landgrebe, Faizaan Madhani, Kidus Negesse, Jeff Gilbert

Engineering, September 16, 2025

Running Blockchain Nodes at Coinbase Scale

Operating blockchain infrastructure at Coinbase means running nodes for more than 60 chains with a small, specialized team. Each protocol announces upgrades and breaking changes in its own way. Some, such as Aptos, Bitcoin, Celo, Ethereum, and Story, publish detailed documentation well in advance. Others announce changes in last-minute Discord posts.

In the last three months alone, we have processed more than 500 upgrades. Critical details are often scattered across GitHub release notes, Discord announcements, onchain governance proposals, and dense Telegram discussions, making it challenging to track and act on every change in time.

Missing a mandatory upgrade directly impacts customers. When a node falls behind or stops syncing, it can delay transactions for millions of customers using Coinbase products. Preventing these disruptions while keeping pace with hundreds of upgrades is essential to maintaining trust and reliability at scale. Manual processes consume significant engineering time on repetitive upgrade tasks, limiting focus on platform improvements and new blockchain integrations.

We needed systems that could:

Extract deadlines and requirements from sparse release notes and governance proposals

Analyze code modifications for breaking changes across our crypto stack

Adapt code intelligently to new protocol changes and requirements

Resolve build and runtime errors and execute deployments across environments

NodeSmith Architecture

NodeSmith manages blockchain upgrades through a two-phase AI automation system. The Triage Agent handles intelligence gathering, using LLMs to research and analyze unstructured data. The Upgrade Orchestrator handles execution, combining AI reasoning with deterministic code for problem solving and reliable action. This architecture follows three core design principles:

Separate intelligence gathering from execution to preserve context and enable independent improvements in each component.

Use LLMs for reasoning and analysis, and deterministic Python code for critical operations such as builds, deployments, and CI/CD integration.

Fully automate routine upgrades, while providing quality checks and detailed context for complex cases that require human review.

Figure 1. NodeSmith System Diagram

Phase 1: The Triage Agent

The Triage Agent gathers upgrade intelligence from GitHub (commits, PRs, code changes), governance systems (voting results, activation parameters), and community channels (Discord, Telegram discussions), then synthesizes this information into a unified view. The system turns code changes into semantically searchable documents by parsing diffs into logical units called “hunks” and using lightweight LLMs to generate descriptions of each change.
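As a rough illustration of the hunk-splitting step, the sketch below parses a unified diff into one record per "@@" section and pairs each with a description. It is not NodeSmith's code: the names Hunk, split_into_hunks, and to_document are hypothetical, and the describe callable stands in for whatever lightweight LLM call generates the description.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Matches unified-diff hunk headers such as "@@ -10,3 +10,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


@dataclass
class Hunk:
    file_path: str
    header: str
    lines: list[str] = field(default_factory=list)


def split_into_hunks(diff_text: str) -> list[Hunk]:
    """Split a unified diff into one Hunk per "@@" section."""
    hunks: list[Hunk] = []
    current_file = ""
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            if line.startswith("\\"):  # "\ No newline at end of file"
                continue
            hunks[-1].lines.append(line)
            if not line.startswith("+"):
                old_left -= 1  # context and removed lines exist in the old file
            if not line.startswith("-"):
                new_left -= 1  # context and added lines exist in the new file
            continue
        match = HUNK_HEADER.match(line)
        if match:
            # A missing count in the header means one line.
            old_left = int(match.group(1) or 1)
            new_left = int(match.group(2) or 1)
            hunks.append(Hunk(current_file, line))
        elif line.startswith("+++ "):
            # "+++ b/path" names the file after the change.
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
    return hunks


def to_document(hunk: Hunk, describe: Callable[[str], str]) -> dict[str, str]:
    """Pair a hunk with a natural-language description for indexing."""
    diff = "\n".join(hunk.lines)
    return {"file": hunk.file_path, "diff": diff, "description": describe(diff)}
```

Counting lines from the hunk header, rather than scanning for the next "@@", keeps removed lines that happen to start with "--" from being mistaken for file headers.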

Figure 2. Knowledge Base Ingestion and Retrieval

For example, when analyzing a consensus mechanism change where the raw diff shows a variable update from NO_ACTIVATION_HEIGHT to 2726400, the system adds context: "This change activates the NU6 network upgrade at block height 2726400. Previously it was set to NO_ACTIVATION_HEIGHT, meaning it would never activate automatically. This is a mandatory consensus change."

This semantic layer enables powerful queries. When determining if an upgrade is mandatory, the agent can search for concepts such as activation height, consensus change, or hard fork, and still retrieve relevant code changes even when those exact terms do not appear in the diff.

The semantic layer is enhanced with a Retrieval Augmented Generation (RAG) system built on Coinbase’s Knowledge Embedding Service (KES) with vector search. Each supported protocol has its own knowledge base for targeted searches of protocol-specific code changes and historical patterns. Code diffs are chunked by individual hunks, enabling precise retrieval while managing the high information density of blockchain releases without losing context or accuracy.
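The retrieval idea can be sketched with a toy index. KES is an internal Coinbase service, so nothing below reflects its API: KnowledgeBase, HunkDocument, and the word-count "embedding" are assumptions made for illustration, and a real system would use learned embeddings that also match synonyms. What the sketch does show is the key design choice: the index is built over the generated descriptions, not the raw diffs, and each protocol has its own collection.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class HunkDocument:
    protocol: str
    description: str  # LLM-generated summary of the hunk
    diff: str         # raw hunk text


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class KnowledgeBase:
    """One collection of documents per protocol, searched independently."""

    def __init__(self) -> None:
        self._docs: dict[str, list[tuple[Counter, HunkDocument]]] = {}

    def add(self, doc: HunkDocument) -> None:
        # Index the description, so queries match concepts absent from the diff.
        vector = embed(doc.description)
        self._docs.setdefault(doc.protocol, []).append((vector, doc))

    def search(self, protocol: str, query: str, top_k: int = 3) -> list[HunkDocument]:
        q = embed(query)
        scored = [(cosine(q, vec), doc) for vec, doc in self._docs.get(protocol, [])]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]
```

With this layout, a query such as "mandatory consensus change" can rank the activation-height hunk first even though none of those words occur in its diff text.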

The Triage Agent produces structured intelligence that feeds directly into the Upgrade Orchestrator and is distributed through notification channels to the engineers responsible for the upgrade. Automation tracks rollout progress against deadlines identified by the Triage Agent and escalates alerts as those deadlines approach. This ensures that actionable intelligence is both machine-readable for automation and immediately useful to human operators.
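One way to picture this structured output and the deadline escalation is the sketch below. The field names, the Urgency levels, and the seven-day and one-day thresholds are all assumptions chosen for the example; the article does not specify NodeSmith's actual schema or alerting policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Urgency(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class UpgradeReport:
    """Hypothetical machine-readable triage result for one release."""
    protocol: str
    version: str
    mandatory: bool
    deadline: Optional[datetime]  # None when no activation deadline was found
    summary: str


def escalation_level(report: UpgradeReport, rolled_out: bool, now: datetime) -> Urgency:
    """Raise urgency as a mandatory upgrade's deadline approaches."""
    if rolled_out or not report.mandatory or report.deadline is None:
        return Urgency.INFO
    remaining = report.deadline - now
    if remaining <= timedelta(days=1):  # assumed threshold
        return Urgency.CRITICAL
    if remaining <= timedelta(days=7):  # assumed threshold
        return Urgency.WARNING
    return Urgency.INFO
```

The same record can be rendered into a notification for engineers and passed to automation unchanged, which is the dual use the paragraph above describes.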

Phase 2: The Upgrade Orchestrator

Classifying upgrades is only the beginning. The Upgrade Orchestrator turns this classification into action by coordinating five specialized agents that blend AI reasoning with deterministic execution. The orchestrator follows a structured workflow, triggering agents sequentially based on results while maintaining context and data flow across steps. The diagram below shows the orchestrated flow between agents:

Figure 3. Flow of the Upgrade Process

The orchestrator follows a multi-agent supervisor-with-tools model. It operates as a ReAct (Reasoning and Acting) agent with access to specialized tools, each corresponding to an agent with focused capabilities:

Repository Analysis Agent: Maps the codebase and identifies relationships between Dockerfiles, Helm charts, scripts, and configuration files.

Helm Agent: Updates Kubernetes deployment configurations within strict boundaries that preserve critical infrastructure.

Script and Config Agent: Manages auxiliary files that make blockchain nodes operational, working both proactively and reactively to resolve deployment issues.

Docker Agent: Manages the build process, iteratively debugs failures, and oversees infrastructure operations in GitHub Actions.

Deploy Agent: Performs progressive deployments to testnets and then mainnet, validates sync status, analyzes logs, and delegates fixes when needed.
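The supervisor-with-tools pattern can be sketched in a few lines. This is a simplified illustration, not NodeSmith's implementation: in the real system an LLM chooses the next tool through ReAct-style reasoning, whereas here choose_next is any callable and sequential_policy is a deterministic stand-in. All names (SharedContext, StepResult, run_orchestrator) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StepResult:
    ok: bool
    notes: str = ""


@dataclass
class SharedContext:
    """State visible to every agent across the whole run."""
    facts: dict[str, object] = field(default_factory=dict)
    history: list[tuple[str, StepResult]] = field(default_factory=list)


Tool = Callable[[SharedContext], StepResult]
Policy = Callable[[SharedContext], Optional[str]]


def run_orchestrator(
    tools: dict[str, Tool],
    choose_next: Policy,
    ctx: SharedContext,
    max_steps: int = 10,
) -> SharedContext:
    """Repeatedly pick a tool, run it, and record the result."""
    for _ in range(max_steps):
        name = choose_next(ctx)
        if name is None:  # the policy decided the run is finished
            break
        result = tools[name](ctx)
        ctx.history.append((name, result))
    return ctx


def sequential_policy(order: list[str]) -> Policy:
    """Stand-in for LLM reasoning: fixed order, stop on first failure."""
    def choose(ctx: SharedContext) -> Optional[str]:
        if ctx.history and not ctx.history[-1][1].ok:
            return None
        done = len(ctx.history)
        return order[done] if done < len(order) else None
    return choose
```

Because each tool receives the same SharedContext, a fact recorded by an early step, such as a Dockerfile path, stays available to later steps without being repeated in a prompt.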

The orchestrator uses pattern matching and LLM analysis to map relationships between Dockerfiles, build scripts, Helm charts, and configuration files across each repository. These mappings remain accessible to all agents through a shared in-memory context store, ensuring that repository details identified early in the process are preserved as the orchestrator’s context grows. Future iterations will embed repository data and enable agents to semantically search codebases for more intelligent change handling.

The Docker Agent is designed for iterative problem-solving at scale. Building blockchain nodes requires compiling complex software with many dependencies, where even minor version changes can cause failures.

Figure 4. Build Execution and Debugging Flow of the Docker Agent

The Docker Agent operates in three phases:

Update Phase: Updates Dockerfiles with new version details, preserving structure while applying required upgrade changes.

Build and Debug Loop: Runs builds through GitHub Actions and monitors progress. On failure, it analyzes error patterns to identify common issues, such as missing dependencies or incompatible libraries, and generates targeted fixes.

Iteration Management: Repeats this cycle until the build succeeds or a set attempt limit is reached, gathering more insight into the upgrade’s requirements with each iteration.
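The three phases above amount to a bounded retry loop. The sketch below shows one possible shape, assuming two injected callables: run_build, which would trigger a CI build and return its log, and propose_fix, which would wrap the error analysis and return a revised Dockerfile or None. Both are placeholders, and the default limit of three attempts is an arbitrary choice for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BuildResult:
    succeeded: bool
    log: str = ""


def build_with_retries(
    dockerfile: str,
    run_build: Callable[[str], BuildResult],
    propose_fix: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[Optional[str], list[BuildResult]]:
    """Build, and on failure apply a proposed fix and try again.

    Returns the Dockerfile text that built successfully (or None if the
    loop gave up) together with the result of every attempt.
    """
    attempts: list[BuildResult] = []
    for _ in range(max_attempts):
        result = run_build(dockerfile)
        attempts.append(result)
        if result.succeeded:
            return dockerfile, attempts
        fixed = propose_fix(dockerfile, result.log)
        if fixed is None or fixed == dockerfile:
            break  # no new idea, so stop and hand off to a human
        dockerfile = fixed
    return None, attempts
```

Returning every attempt, not just the last, preserves the logs a reviewer would need when the loop exhausts its limit.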

For example, when a blockchain upgrades from OpenSSL 1.1 to 3.0, the agent may update the base image from Ubuntu 20.04 to 22.04 to ensure compatibility, then rebuild and validate the fix.

Finally, the Deploy Agent applies a philosophy of careful, validated rollouts. Deployments occur first in sandbox testnet canary environments, enabling easier recovery and minimizing impact on downstream services.

Validation includes monitoring node status endpoints, tracking block height progression, verifying peer connections and consensus participation, and detecting configuration issues for remediation. When nodes fail to sync despite successful builds, the agent recognizes patterns and remediates by updating configuration flags, adjusting peer settings, or modifying startup sequences.
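A minimal version of the sync check might compare a few status samples taken over time. The NodeSample fields, the diagnosis labels, and the thresholds below are assumptions for illustration; real nodes expose this data through protocol-specific status endpoints that the sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    """One reading from a node's status endpoint (fields are hypothetical)."""
    block_height: int
    peer_count: int


def diagnose_sync(
    samples: list[NodeSample],
    min_peers: int = 1,
    min_blocks_advanced: int = 1,
) -> str:
    """Classify node health from samples ordered oldest to newest."""
    if len(samples) < 2:
        return "insufficient_data"
    if samples[-1].peer_count < min_peers:
        return "no_peers"  # likely a peer or network configuration issue
    advanced = samples[-1].block_height - samples[0].block_height
    if advanced < min_blocks_advanced:
        return "stalled"   # connected to peers but not importing blocks
    return "healthy"
```

Separating "no_peers" from "stalled" matters because the remediations differ: the first points toward peer settings, the second toward configuration flags or the startup sequence.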

Results & Key Learnings

Since deployment, NodeSmith has reduced engineering effort by 30% and prevented any missed mandatory upgrades. Engineers now focus on platform improvements rather than routine upgrades, and consistent processes have eliminated deployment failures while codifying tribal knowledge across upgrades.

NodeSmith’s architecture demonstrates that separating LLM reasoning from deterministic execution delivers both adaptability and reliability. LLMs handle error analysis, upgrade planning, and targeted fix generation, while deterministic Python code executes precise, repeatable actions. This design keeps hallucinated output away from critical operations and ensures stable outcomes in production.

Figure 5. Diagram of Abstraction & Validation Layer between the LLM and its tools

Infrastructure complexity is abstracted from agents, enabling them to operate with unified views of processes such as builds, deployments, and log retrieval.
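One common way to build such a layer is to validate every tool call an LLM proposes before any deterministic code runs. The sketch below assumes tool calls arrive as dictionaries with "tool" and "args" keys; that shape, and the names ToolSpec, ToolCallError, and dispatch, are hypothetical and not taken from NodeSmith.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolCallError(ValueError):
    """Raised when a proposed tool call fails validation."""


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: dict[str, type]   # argument name -> expected type
    handler: Callable[..., str]  # deterministic code behind the tool


def dispatch(registry: dict[str, ToolSpec], call: dict[str, Any]) -> str:
    """Validate an LLM-proposed call, then run the matching handler."""
    name = call.get("tool")
    if not isinstance(name, str) or name not in registry:
        raise ToolCallError(f"unknown tool: {name!r}")
    spec = registry[name]
    args = call.get("args", {})
    if not isinstance(args, dict):
        raise ToolCallError("args must be an object")
    if set(args) != set(spec.required):
        raise ToolCallError(
            f"{name} expects arguments {sorted(spec.required)}, got {sorted(args)}"
        )
    for key, expected in spec.required.items():
        if not isinstance(args[key], expected):
            raise ToolCallError(f"{key} must be {expected.__name__}")
    return spec.handler(**args)
```

The agent sees only the tool name and its arguments. How the handler reaches CI, a cluster, or a log store stays behind the registry, and a rejected call can be returned to the LLM as an error message to correct.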

Figure 6. Internal View of the Build Dockerfile and Download Logs Tool

Failure recovery draws on patterns learned from thousands of upgrades, allowing the system to resolve more than half of build failures without human intervention. Novel errors are addressed by combining pattern recognition with LLM analysis for context-aware fixes.
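The combination of known patterns with an LLM fallback could look like the following. The two patterns and their fix labels are illustrative examples only, not NodeSmith's catalogue, and llm_fallback is a placeholder for a model call.

```python
import re
from typing import Callable

# Illustrative (pattern, fix label) pairs; a real catalogue would be
# built up from past build failures.
KNOWN_PATTERNS: list[tuple[re.Pattern[str], str]] = [
    (
        re.compile(r"libssl\.so\.1\.1.*cannot open shared object file"),
        "provide_openssl_1_1_or_rebuild_against_3",
    ),
    (
        re.compile(r"Unable to locate package \S+"),
        "refresh_package_index_or_rename_package",
    ),
]


def suggest_fix(log: str, llm_fallback: Callable[[str], str]) -> tuple[str, str]:
    """Return (source, fix): try known patterns first, then ask the LLM."""
    for pattern, fix in KNOWN_PATTERNS:
        if pattern.search(log):
            return ("pattern", fix)
    # Novel error: defer to LLM analysis of the full log.
    return ("llm", llm_fallback(log))
```

Checking cheap, deterministic patterns first keeps common failures fast and reproducible, and reserves model calls for errors the system has not seen before.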

Four principles stand out:

Maintain a clear contract between LLM reasoning and deterministic execution, with well-defined handoffs that preserve context and prevent errors.

Invest in high-quality tools that give agents precise control while abstracting away unnecessary complexity.

Provide domain-specific knowledge to improve LLM accuracy for blockchain upgrades.

Design automation to give operators complete, actionable context when human review is needed.

Conclusion

As blockchain protocols evolve with new consensus mechanisms, governance models, and technical requirements, NodeSmith evolves alongside them. With each upgrade processed and each edge case addressed, the system becomes increasingly capable and resilient. NodeSmith provides a reliable, scalable foundation for managing blockchain upgrades, freeing the team to focus on higher-value engineering work.

Interested in building AI-powered infrastructure? The Coinbase team is hiring! Check out our careers page for open positions: www.coinbase.com/careers

For questions about our blockchain infrastructure, reach out on GitHub or follow us on X @CoinbasePltfrm.
