162 changes: 138 additions & 24 deletions README.md
@@ -11,7 +11,7 @@

</div>

Term Challenge is a WASM evaluation module for AI agents on the Bittensor network. It runs inside [platform-v2](https://github.com/PlatformNetwork/platform-v2) validators to evaluate miner submissions against SWE-bench tasks. Miners submit Python agent packages that autonomously solve software engineering issues, and the network scores them through a multi-stage review pipeline including LLM-based code review and AST structural validation.

---

@@ -22,6 +22,7 @@ flowchart LR
Miner[Miner] -->|Submit Agent ZIP| RPC[Validator RPC]
RPC --> Validators[Validator Network]
Validators --> WASM[term-challenge WASM]
WASM --> Storage[(Blockchain Storage)]
Validators --> Executor[term-executor]
Executor -->|Task Results| Validators
Validators -->|Scores + Weights| BT[Bittensor Chain]
@@ -31,30 +32,97 @@ flowchart LR

---

## Evaluation Pipeline

```mermaid
sequenceDiagram
participant M as Miner
participant V as Validators
participant LLM as LLM Reviewers (×3)
participant AST as AST Reviewers (×3)
participant W as WASM Module
participant E as term-executor
participant BT as Bittensor

M->>V: Submit agent zip + metadata
V->>W: validate(submission)
W-->>V: Approved (>50% consensus)
V->>LLM: Assign LLM code review
V->>AST: Assign AST structural review
LLM-->>V: LLM review scores
AST-->>V: AST review scores
V->>E: Execute agent on SWE-bench tasks
E-->>V: Task results + scores
V->>W: evaluate(results)
W-->>V: Aggregate score + weight
V->>V: Store agent code & logs
V->>V: Log consensus (>50% hash agreement)
V->>BT: Submit weights at epoch boundary
```
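
The ">50% hash agreement" step in the diagram can be read as a strict-majority vote over the hashes that validators report. The sketch below illustrates that reading only: `majority_hash` and the use of plain strings for hashes are assumptions made here, not names taken from the term-challenge source.

```rust
use std::collections::HashMap;

/// Returns the hash reported by a strict majority (>50%) of validators, if any.
/// Hypothetical helper for illustration; not part of the module's API.
pub fn majority_hash(reported: &[&str]) -> Option<String> {
    // Count how many validators reported each hash.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for hash in reported {
        *counts.entry(*hash).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Strict majority: count * 2 > total, which avoids floating-point math.
        // At most one hash can satisfy this, so map iteration order is irrelevant.
        .find(|(_, count)| count * 2 > reported.len())
        .map(|(hash, _)| hash.to_string())
}
```

An exact 50/50 split yields `None` under this rule, which matches the "greater than 50%" wording rather than "at least 50%".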

---

## Validator Assignment

```mermaid
flowchart TB
Sub[New Submission] --> Seed[Deterministic Seed from submission_id]
Seed --> Select[Select 6 Validators]
Select --> LLM[3 LLM Reviewers]
Select --> AST[3 AST Reviewers]
LLM --> LR1[LLM Reviewer 1]
LLM --> LR2[LLM Reviewer 2]
LLM --> LR3[LLM Reviewer 3]
AST --> AR1[AST Reviewer 1]
AST --> AR2[AST Reviewer 2]
AST --> AR3[AST Reviewer 3]
LR1 & LR2 & LR3 -->|Timeout?| TD1{Responded?}
AR1 & AR2 & AR3 -->|Timeout?| TD2{Responded?}
TD1 -->|No| Rep1[Replacement Validator]
TD1 -->|Yes| Agg[Result Aggregation]
TD2 -->|No| Rep2[Replacement Validator]
TD2 -->|Yes| Agg
Rep1 --> Agg
Rep2 --> Agg
Agg --> Score[Final Score]
```
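
One way to get the deterministic 3 + 3 split shown above is to hash the `submission_id` into a seed and use it to drive a partial shuffle of the validator set. This is a minimal sketch under stated assumptions: the FNV-1a seed, the SplitMix64 generator, and the names `Assignment` and `assign_reviewers` are all illustrative choices, and the README does not say how the real module derives its seed or orders validators.

```rust
/// FNV-1a hash of the submission id. A stand-in for whatever seed derivation
/// the real module uses (assumption).
fn seed_from_submission_id(submission_id: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in submission_id.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// One SplitMix64 step: a small deterministic pseudo-random generator.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Reviewers chosen for one submission (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Assignment {
    pub llm: Vec<String>,
    pub ast: Vec<String>,
}

/// Picks 6 distinct validators: the first 3 act as LLM reviewers, the next 3 as AST reviewers.
/// Returns `None` when fewer than 6 distinct validators are available.
pub fn assign_reviewers(submission_id: &str, validators: &[String]) -> Option<Assignment> {
    // Sort and deduplicate so every node starts from the same ordering.
    let mut pool: Vec<String> = validators.to_vec();
    pool.sort();
    pool.dedup();
    if pool.len() < 6 {
        return None;
    }
    let mut state = seed_from_submission_id(submission_id);
    // Partial Fisher-Yates shuffle: only the first 6 slots need to be drawn.
    for i in 0..6 {
        let remaining = (pool.len() - i) as u64;
        let j = i + (next_u64(&mut state) % remaining) as usize;
        pool.swap(i, j);
    }
    Some(Assignment {
        llm: pool[0..3].to_vec(),
        ast: pool[3..6].to_vec(),
    })
}
```

Because the seed depends only on the submission id and the pool is sorted first, any two nodes that see the same validator set compute the same assignment without exchanging messages.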

---

## Submission Flow

```mermaid
flowchart LR
Register[Register Name] -->|First-register-owns| Name[Submission Name]
Name --> Version[Auto-increment Version]
Version --> Pack[Package Agent ZIP ≤ 1MB]
Pack --> Sign[Sign with sr25519]
Sign --> Submit[Submit via RPC]
Submit --> RateCheck{Epoch Rate Limit OK?}
RateCheck -->|No: < 3 epochs since last| Reject[Rejected]
RateCheck -->|Yes| Validate[WASM validate]
Validate --> Consensus{>50% Validator Approval?}
Consensus -->|No| Reject
Consensus -->|Yes| Evaluate[Evaluation Pipeline]
Evaluate --> Store[Store Code + Hash + Logs]
```
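
The two cheap gates in this flow, package size and the epoch rate limit, can be sketched as a single pre-check. The names `precheck` and `Rejection` are hypothetical, signature verification is left out, and reading "≤ 1MB" as 1 MiB (1,048,576 bytes) is an assumption.

```rust
/// Maximum package size. The README says "≤ 1MB"; treating that as 1 MiB is an assumption.
pub const MAX_PACKAGE_BYTES: usize = 1024 * 1024;
/// Minimum number of epochs that must pass between two submissions from one miner.
pub const EPOCHS_BETWEEN_SUBMISSIONS: u64 = 3;

/// Reasons a submission is turned away before review (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Rejection {
    PackageTooLarge,
    RateLimited { epochs_to_wait: u64 },
}

/// Checks package size and the epoch rate limit for one miner.
/// `last_submission_epoch` is `None` for a miner's first submission.
pub fn precheck(
    package_len: usize,
    current_epoch: u64,
    last_submission_epoch: Option<u64>,
) -> Result<(), Rejection> {
    if package_len > MAX_PACKAGE_BYTES {
        return Err(Rejection::PackageTooLarge);
    }
    if let Some(last) = last_submission_epoch {
        // saturating_sub guards against a stale `current_epoch` underflowing.
        let elapsed = current_epoch.saturating_sub(last);
        if elapsed < EPOCHS_BETWEEN_SUBMISSIONS {
            return Err(Rejection::RateLimited {
                epochs_to_wait: EPOCHS_BETWEEN_SUBMISSIONS - elapsed,
            });
        }
    }
    Ok(())
}
```

For example, a miner who last submitted at epoch 8 is rejected at epochs 8 through 10 and accepted again from epoch 11.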

---

## Decay Mechanism

```mermaid
flowchart LR
Top[Top Score Achieved] --> Grace[72h Grace Period]
Grace -->|Within grace| Full[100% Weight Retained]
Grace -->|After grace| Decay[Exponential Decay Begins]
Decay --> Half[50% per 24h half-life]
Half --> Min[Decay to 0.0 min multiplier]
Min --> Burn[Weight Burns to UID 0]
```
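
Read literally, the diagram describes a weight multiplier that stays at 1.0 for 72 hours and then halves every 24 hours. A minimal sketch of that curve follows; the function name and the choice of hours as the unit are assumptions, and how the decayed share is routed to UID 0 is not modeled.

```rust
/// Hours during which the top agent keeps its full weight.
pub const GRACE_HOURS: f64 = 72.0;
/// Time for the multiplier to halve once decay has started.
pub const HALF_LIFE_HOURS: f64 = 24.0;

/// Weight multiplier for an agent that reached the top score `hours_since_top` hours ago.
/// Illustrative only; the real scoring code may clamp or quantize differently.
pub fn decay_multiplier(hours_since_top: f64) -> f64 {
    if hours_since_top <= GRACE_HOURS {
        return 1.0;
    }
    // Number of half-lives elapsed since the grace period ended.
    let half_lives = (hours_since_top - GRACE_HOURS) / HALF_LIFE_HOURS;
    0.5_f64.powf(half_lives)
}
```

Under these constants the multiplier is 1.0 at 72h, 0.5 at 96h, and 0.25 at 120h, approaching the 0.0 minimum without reaching it in finite time.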

---

## CLI Data Flow

```mermaid
@@ -101,11 +169,37 @@ flowchart TB

---

## Route Architecture

```mermaid
flowchart LR
Client[Client] -->|JSON-RPC| RPC[RPC Server]
RPC -->|challenge_call| WE[WASM Executor]
WE -->|handle_route request| WM[WASM Module]
WM --> Router{Route Match}
Router --> LB[/leaderboard]
Router --> Subs[/submissions]
Router --> DS[/dataset]
Router --> Stats[/stats]
Router --> Agent[/agent/:hotkey/code]
LB & Subs & DS & Stats & Agent --> Storage[(Storage)]
Storage --> Response[Serialized Response]
Response --> WE
WE --> RPC
RPC --> Client
```
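
The "Route Match" node can be pictured as a small function from a request path to a typed route. The sketch below covers only the five paths in the diagram; the `Route` enum and `match_route` are hypothetical names, and the actual `handle_route` implementation in `routes.rs` may be structured quite differently.

```rust
/// Routes named in the diagram (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Route {
    Leaderboard,
    Submissions,
    Dataset,
    Stats,
    AgentCode { hotkey: String },
}

/// Maps a request path such as "/agent/<hotkey>/code" to a `Route`.
/// Returns `None` for paths the diagram does not list.
pub fn match_route(path: &str) -> Option<Route> {
    // Split on '/' and drop empty segments so leading or trailing slashes are tolerated.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segments.as_slice() {
        ["leaderboard"] => Some(Route::Leaderboard),
        ["submissions"] => Some(Route::Submissions),
        ["dataset"] => Some(Route::Dataset),
        ["stats"] => Some(Route::Stats),
        ["agent", hotkey, "code"] => Some(Route::AgentCode {
            hotkey: (*hotkey).to_string(),
        }),
        _ => None,
    }
}
```

Matching on slices keeps the `:hotkey` parameter extraction and the literal segments in one place, so an unknown path falls through to `None` instead of panicking.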

---

## Features

- **WASM Module**: Compiles to `wasm32-unknown-unknown`, loaded by platform-v2 validators
- **SWE-bench Evaluation**: Tasks selected from HuggingFace CortexLM/swe-bench datasets
- **LLM Code Review**: 3 validators perform LLM-based code review via host functions
- **AST Structural Validation**: 3 validators perform AST-based structural analysis
- **Submission Versioning**: Auto-incrementing versions with full history tracking
- **Timeout Handling**: Unresponsive reviewers are replaced with alternate validators
- **Route Handlers**: WASM-native route handling for leaderboard, submissions, dataset, and agent data
- **Epoch Rate Limiting**: 1 submission per 3 epochs per miner
- **Top Agent Decay**: 72h grace period, 50% daily decay to 0 weight
- **P2P Dataset Consensus**: Validators collectively select 50 evaluation tasks
@@ -137,22 +231,29 @@ This repository contains the WASM evaluation module and a native CLI for monitor

```
term-challenge/
├── wasm/                      # WASM evaluation module
│   └── src/
│       ├── lib.rs             # Challenge trait implementation (validate + evaluate)
│       ├── types.rs           # Submission, task, config, route, and log types
│       ├── scoring.rs         # Score aggregation, decay, and weight calculation
│       ├── tasks.rs           # Active dataset management and history
│       ├── dataset.rs         # Dataset selection and P2P consensus logic
│       ├── routes.rs          # WASM route definitions for RPC (handle_route)
│       └── agent_storage.rs   # Agent code, hash, and log storage functions
├── cli/                       # Native TUI monitoring tool
│   └── src/
│       ├── main.rs            # Entry point, event loop
│       ├── app.rs             # Application state
│       ├── ui.rs              # Ratatui UI rendering
│       └── rpc.rs             # JSON-RPC 2.0 client
├── docs/
│   ├── architecture.md        # System architecture and internals
│   ├── miner/
│   │   ├── how-to-mine.md     # Complete miner guide
│   │   └── submission.md      # Submission format and review process
│   └── validator/
│       └── setup.md           # Validator setup and operations
├── AGENTS.md                  # Development guide
└── README.md
```

@@ -162,11 +263,15 @@ term-challenge/

1. Miners submit zip packages with agent code and SWE-bench task results
2. Platform-v2 validators load this WASM module
3. `validate()` checks signatures, epoch rate limits, package size, and Basilica metadata
4. **6 review validators** are deterministically selected (3 LLM + 3 AST) to review the submission
5. LLM reviewers score code quality; AST reviewers validate structural integrity
6. Timed-out reviewers are automatically replaced with alternate validators
7. `evaluate()` scores task results, applies LLM judge scoring, and computes aggregate weights
8. Agent code and hash are stored on-chain for auditability (≤ 1MB per package)
9. Evaluation logs are proposed and validated via P2P consensus (>50% hash agreement)
10. Scores are aggregated via P2P consensus and submitted to Bittensor at epoch boundaries
11. Top agents enter a decay cycle: 72h grace → 50% daily decay → weight burns to UID 0
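
The replacement of timed-out reviewers described in the steps above can be sketched as a pass over the assigned reviewers that swaps each non-responder for the next unused alternate. Everything here is an assumption for illustration: the function name, the `(name, responded)` tuple shape, and the choice to keep the original reviewer when no alternate is left.

```rust
/// Returns the final reviewer list after replacing reviewers that timed out.
/// `assigned` pairs each reviewer with whether it responded in time;
/// `alternates` is an ordered list of candidate replacements.
/// Hypothetical helper; not taken from the term-challenge source.
pub fn replace_timed_out(assigned: &[(&str, bool)], alternates: &[&str]) -> Vec<String> {
    // Skip alternates that are already assigned so no validator reviews twice.
    let mut spare = alternates
        .iter()
        .copied()
        .filter(|alt| !assigned.iter().any(|(name, _)| name == alt));
    let mut result = Vec::with_capacity(assigned.len());
    for &(name, responded) in assigned {
        if responded {
            result.push(name.to_string());
        } else if let Some(alt) = spare.next() {
            result.push(alt.to_string());
        } else {
            // No alternate left: keep the original slot (a design choice of this sketch).
            result.push(name.to_string());
        }
    }
    result
}
```

Keeping the list the same length as the input preserves the 3 + 3 reviewer layout, so downstream aggregation does not need to special-case missing slots.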

---

@@ -190,6 +295,15 @@ term-cli --hotkey 5GrwvaEF... --tab leaderboard

---

## Documentation

- [Architecture Overview](docs/architecture.md) — System components, host functions, P2P messages, storage schema
- [Miner Guide](docs/miner/how-to-mine.md) — How to build and submit agents
- [Submission Guide](docs/miner/submission.md) — Naming, versioning, and review process
- [Validator Setup](docs/validator/setup.md) — Hardware requirements, configuration, and operations

---

## License

Apache-2.0
51 changes: 51 additions & 0 deletions cli/src/rpc.rs
@@ -214,4 +214,55 @@ impl RpcClient {
.map(|r| ChallengeInfo { id: r.id })
.collect())
}

    /// Calls the `/agent/{hotkey}/journey` route and returns the raw JSON response.
    pub async fn fetch_agent_journey(
        &self,
        challenge_id: &str,
        hotkey: &str,
    ) -> anyhow::Result<serde_json::Value> {
        let params = serde_json::json!({
            "challengeId": challenge_id,
            "method": "GET",
            "path": format!("/agent/{}/journey", hotkey)
        });
        let result = self.call("challenge_call", params).await?;
        Ok(result)
    }

    /// Calls the `/agent/{hotkey}/logs` route and returns the raw JSON response.
    /// Note: despite the method name, the path requested is `/logs`.
    pub async fn fetch_submission_history(
        &self,
        challenge_id: &str,
        hotkey: &str,
    ) -> anyhow::Result<serde_json::Value> {
        let params = serde_json::json!({
            "challengeId": challenge_id,
            "method": "GET",
            "path": format!("/agent/{}/logs", hotkey)
        });
        let result = self.call("challenge_call", params).await?;
        Ok(result)
    }

    /// Calls the `/stats` route and returns the raw JSON response.
    pub async fn fetch_stats(&self, challenge_id: &str) -> anyhow::Result<serde_json::Value> {
        let params = serde_json::json!({
            "challengeId": challenge_id,
            "method": "GET",
            "path": "/stats"
        });
        let result = self.call("challenge_call", params).await?;
        Ok(result)
    }

    /// Calls the `/decay` route and returns the raw JSON response.
    pub async fn fetch_decay_status(
        &self,
        challenge_id: &str,
    ) -> anyhow::Result<serde_json::Value> {
        let params = serde_json::json!({
            "challengeId": challenge_id,
            "method": "GET",
            "path": "/decay"
        });
        let result = self.call("challenge_call", params).await?;
        Ok(result)
    }
}