R2E-Gym

Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents

¹UC Berkeley  ²Australian National University
*Equal contribution
Synthetic Data Generation
Hybrid Test-time Scaling
Open-weights SOTA Performance

Figure 1: (Left) SWE-Gen creates executable training environments from commits using automated test generation and backtranslation. (Middle) Our hybrid approach achieves 51% success rate through complementary strengths of execution-based and execution-free verifiers. (Right) Our approach establishes a new state-of-the-art for open-weight SWE agents.

🚀 Key Contributions

🔮

SWE-Gen Data Engine

Generates 8.1K executable environments directly from commits with automated tests and natural language descriptions

🔍

Hybrid Verification System

Combines execution-based testing with execution-free verification for superior solution ranking

🏆

SOTA Performance

51% success rate on SWE-Bench Verified with just 26 rollouts — 19% higher than previous open-weight models

Introduction

Autonomous software engineering (SWE) has made remarkable progress in solving real-world programming challenges. While LLM-based SWE agents have demonstrated impressive capabilities, current state-of-the-art performance is predominantly achieved by proprietary models, with open-source alternatives lagging significantly behind. Closing this performance gap requires addressing two core challenges: First, we need scalable methods to curate diverse, high-quality execution environments for training. Second, we need efficient strategies for scaling test-time compute.

R2E-Gym addresses these challenges with the largest procedurally generated environment for training real-world SWE agents, comprising over 8.1K problems with executable environments and problem statements produced by our SWE-Gen pipeline. Next, we introduce a hybrid verifier that combines the strengths of execution-based and execution-free verification methods, enabling significantly better performance at test time.

R2E-Gym: Procedural Synthetic Data Generation

The SWE-Gen Approach

SWE-Gen breaks traditional dependencies on human-written issues through a novel synthetic data pipeline:

1. Automated Test Generation: Creates reproduction tests that validate patch correctness when no human-written test exists

2. Issue Generation: Converts code changes into natural-language problem statements via execution-assisted back-translation for agent training

3. Executable Environments: Packages over 8.1K unique tasks across 13 repositories with pre-built Docker images
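The three stages above can be sketched as a small pipeline. This is a minimal illustration with hypothetical names (`Commit`, `TrainingTask`, the stage functions, and the image-tag scheme are all our own stand-ins, not the released SWE-Gen API); the real pipeline prompts an LLM and executes code at each stage.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    repo: str
    sha: str
    diff: str          # code change extracted from the commit
    message: str       # original (often terse) commit message

@dataclass
class TrainingTask:
    repo: str
    sha: str
    issue: str         # back-translated natural-language problem statement
    test_patch: str    # generated reproduction test
    image: str         # pre-built docker image tag for the environment

def generate_reproduction_test(commit: Commit) -> str:
    """Stage 1 (hypothetical): prompt an LLM for a test that fails before
    the commit's diff is applied and passes afterwards."""
    return (f"def test_regression():\n"
            f"    # exercises behavior changed in {commit.sha[:7]}\n"
            f"    ...")

def backtranslate_issue(commit: Commit, test: str) -> str:
    """Stage 2 (hypothetical): turn the diff plus the failing test's
    execution trace into a natural-language issue."""
    return f"[{commit.repo}] Bug reproduced by a generated test: {commit.message}"

def build_task(commit: Commit) -> TrainingTask:
    """Stage 3: package everything into an executable training environment."""
    test = generate_reproduction_test(commit)
    issue = backtranslate_issue(commit, test)
    return TrainingTask(commit.repo, commit.sha, issue, test,
                        image=f"r2e-gym/{commit.repo}:{commit.sha[:7]}")

task = build_task(Commit("numpy", "a1b2c3d4e5",
                         "--- a/core.py\n+++ b/core.py", "fix dtype promotion"))
print(task.image)  # r2e-gym/numpy:a1b2c3d
```

The key design point is that every task ships with both a machine-checkable signal (the reproduction test) and a human-readable one (the back-translated issue), so the same artifact serves training and evaluation.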


Figure 2: Dataset composition showing the diversity of repositories and problems in our synthetic dataset, enabling significantly larger and more diverse training environments than previous methods. Left: Full R2E-Gym dataset. Right: R2E-Gym subset (non-overlapping with SWE-Bench) used for training.

Training SWE-Agents

We use the R2E-Gym dataset to train R2E-Gym-32B, a 32B-parameter Qwen-based agent fine-tuned on high-quality trajectories collected in R2E-Gym environments. Our model achieves SOTA 34.4% Pass@1 performance on the SWE-Bench Verified benchmark.



Figure 3: Performance scaling with increasing training data volume shows consistent improvement as more synthetic examples are added, demonstrating that the SWE-Gen methodology enables more effective scaling than previous approaches.

Hybrid Verifiers: Inference-Time Scaling

We introduce Hybrid Test-time Scaling, a novel paradigm for scaling test-time compute. We show that while both execution-based and execution-free verifiers elicit inference-time gains, they exhibit complementary strengths and weaknesses. Leveraging the strengths of each approach allows significantly better performance when scaling test-time compute, resulting in a 51% success rate on SWE-Bench Verified — a new state-of-the-art for open-weight SWE agents.

Execution-Based Verifier Limitations

While execution-based verifiers provide a functional evaluation of patch semantics, they face the following challenges:

1. Low Distinguishability: Often only 20% of generated tests effectively distinguish between top-ranked correct and incorrect patches

2. Test Toxicity: Some tests pass incorrect patches while failing correct ones, further degrading ranking effectiveness
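One way to make these two failure modes concrete is to classify each generated test by its pass/fail pattern over candidate patches with known labels. The sketch below is our own formalization for illustration, not the paper's exact metric definitions; note that toxicity is checked first, so a test that both helps and hurts is counted as toxic.

```python
def classify_test(results: dict[str, bool], labels: dict[str, bool]) -> str:
    """results[patch] = did this test pass on the patch;
    labels[patch] = is the patch actually correct."""
    pass_correct   = any(results[p] for p in results if labels[p])
    fail_correct   = any(not results[p] for p in results if labels[p])
    pass_incorrect = any(results[p] for p in results if not labels[p])
    fail_incorrect = any(not results[p] for p in results if not labels[p])
    if pass_incorrect and fail_correct:
        return "toxic"           # rewards wrong patches, punishes right ones
    if pass_correct and fail_incorrect:
        return "distinguishing"  # separates correct from incorrect patches
    return "uninformative"       # treats every candidate alike

labels = {"p1": True, "p2": False, "p3": False}
print(classify_test({"p1": True,  "p2": False, "p3": False}, labels))  # distinguishing
print(classify_test({"p1": False, "p2": True,  "p3": True},  labels))  # toxic
print(classify_test({"p1": True,  "p2": True,  "p3": True},  labels))  # uninformative
```

Under this framing, the paper's observation is that on most problems only a small fraction of generated tests fall in the "distinguishing" bucket, and a rare but damaging fraction is "toxic".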

Distinguishability rates: for most problems, fewer than 20% of generated tests provide a meaningful discriminative signal.

Toxicity rates: a small fraction of generated tests pass incorrect patches while failing correct ones.

Execution-Free Verifier Limitations

Our analysis also reveals limitations of execution-free verifiers:

1. Trajectory Dependence: Removing trajectory information from verifier inputs reduces accuracy by 3-5%, indicating over-reliance on signals beyond the patch itself

2. Heuristic Dependency: Verifiers' attention patterns show heavy reliance on agent thoughts rather than on actual patch quality

| Method | Accuracy (%) | Best@26 (%) |
|---|---|---|
| Final Patch + Traj. | 71.82 | 42.8 |
| Patch Only | 68.01 | 37.6 |
| Traj. - Thoughts | 68.77 | 41.4 |
Example agent trajectory excerpt (Figure 5, right):

> 1. Successfully reproduced the issue
> 2. Implemented a fix [...]
> 4. Ensured edge cases are handled
> 5. Maintained backward compatibility [...]
> `<function=finish>submit</function>` [...]
> Great! The fix works. Let's see what we did to fix the issue:
> 1. We identified that the original code was failing because it was trying to use the `.inverse()` method directly on permutations, which [...]

Figure 5: Limitations of execution-free verifiers: (Left) Our quantitative ablation shows that removing patches or thoughts reduces performance, indicating that verifiers rely heavily on heuristics beyond just the final code. (Right) Example of misleading heuristics where the verifier attends to agent thoughts with confident language ("Successfully", "Great", "works") when predicting a patch is correct, despite the patch being incorrect.
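The input-ablation setup behind the table above can be sketched as follows. This is a hypothetical reconstruction (the function name, the variant labels, and the trajectory record format are our own assumptions): each ablation assembles the verifier's prompt from a different subset of the (patch, trajectory) components before scoring.

```python
def build_verifier_prompt(patch: str, trajectory: list[dict], variant: str) -> str:
    """Trajectory steps are {'thought': ..., 'action': ...} records.
    Each variant drops one component of the verifier's input."""
    if variant == "patch_only":
        steps = []                                       # no trajectory at all
    elif variant == "traj_minus_thoughts":
        steps = [s["action"] for s in trajectory]        # keep actions, drop thoughts
    elif variant == "patch_plus_traj":
        steps = [f"{s['thought']}\n{s['action']}" for s in trajectory]
    else:
        raise ValueError(f"unknown variant: {variant}")
    history = "\n".join(steps)
    return f"{history}\n--- FINAL PATCH ---\n{patch}".strip()

traj = [{"thought": "Great! The fix works.",
         "action": "<function=finish>submit</function>"}]
print(build_verifier_prompt("diff --git a/core.py ...", traj, "traj_minus_thoughts"))
```

The "Traj. - Thoughts" row in the table corresponds to the middle variant: the verifier still sees what the agent did, but not its confident self-assessments, which is exactly the misleading signal Figure 5 highlights.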

The Hybrid Advantage

Our hybrid verification approach combines the complementary strengths of both methods:

1. First-Stage Filtering with Execution Testing: Quickly eliminates non-functional solutions using binary pass/fail signals, reducing the candidate pool to only working solutions

2. Second-Stage Ranking with Execution-Free Evaluation: Provides fine-grained continuous scoring among solutions that pass execution tests, enabling finer disambiguation

3. Complementary Strengths, Minimized Weaknesses: Overcomes the binary limitations of execution testing and the misleading-heuristics problem of execution-free evaluation
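The two-stage scheme reduces to a short reranking function. A minimal sketch, assuming each candidate already carries its execution-test outcome and an execution-free verifier score (the `Candidate` type and fallback behavior are our illustrative choices, not the paper's exact implementation):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    patch: str
    passes_tests: bool      # stage 1: binary execution-based signal
    verifier_score: float   # stage 2: continuous execution-free score

def hybrid_rank(candidates: list[Candidate]) -> list[Candidate]:
    # Stage 1: keep only candidates that pass the generated execution tests;
    # if none pass (e.g. all generated tests are too strict), fall back to
    # the full pool rather than returning nothing.
    survivors = [c for c in candidates if c.passes_tests] or candidates
    # Stage 2: rank survivors by the execution-free verifier's score.
    return sorted(survivors, key=lambda c: c.verifier_score, reverse=True)

pool = [
    Candidate("patch_a", passes_tests=True,  verifier_score=0.61),
    Candidate("patch_b", passes_tests=False, verifier_score=0.93),  # fluent but broken
    Candidate("patch_c", passes_tests=True,  verifier_score=0.74),
]
print(hybrid_rank(pool)[0].patch)  # patch_c
```

Note how `patch_b` illustrates the complementarity: the execution-free verifier alone would rank the fluent-but-broken patch first, while execution filtering removes it before the verifier ever scores it.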


Figure 4: Best@K performance with increasing agent rollouts. Our hybrid approach achieves superior scaling, reaching 51% - a significant 7-8% improvement over either method alone.

Conclusion

R2E-Gym establishes a new state-of-the-art for open-weights SWE agents with 51% success rate—a significant improvement over previous approaches:

Key Innovation: Synthetic Data Pipeline

  • Created 8.1K executable training environments
  • Eliminated dependency on human-written issues
  • Enabled 34.4% Pass@1—a 14% improvement over previous models

Key Innovation: Hybrid Verification

  • Combined execution-based and execution-free verifiers
  • Required only 26 rollouts vs. 500+ in other methods
  • Showed competitive performance with some proprietary models

Our open-source tools and datasets are available to enable further research and innovation in this area.