
Vibe Fine-tuning LLMs

How I reproduced cutting-edge LoRA research from Thinking Machines Lab by just prompting Orchestra with natural language.

November 17, 2025
8 min read

I'm a physics PhD who has done some AI research, but GRPO and large-scale RL fine-tuning weren't my main area. So when I saw Thinking Machines Lab's blog post with striking claims about LoRA fine-tuning, I was fascinated. How could rank=1 LoRA really outperform full fine-tuning on math reasoning? I wanted to find out for myself.

The challenge wasn't conceptual—I understand the theory. The challenge was engineering: to validate their findings, I'd be looking at 2-3 weeks of setting up GRPO implementations, debugging training code, provisioning GPU infrastructure, and wrangling large datasets. Even with AI research experience, that overhead means interesting ideas outside my main area often stay untested.

But I built Orchestra specifically to eliminate this barrier. So I decided to test it: could I skip weeks of engineering overhead and quickly validate cutting-edge research in unfamiliar territory (GRPO RL fine-tuning) using just natural language conversation?

What I Wanted to Validate

I wanted to see if I could reproduce these findings—particularly the claim that rank=16 achieves 99% of rank=256's performance on supervised fine-tuning, and that rank=1 LoRA beats full fine-tuning on reinforcement learning tasks. But instead of spending weeks setting up infrastructure, I decided to use Orchestra and just have a conversation about what I wanted to test.

From Curiosity to Results: How Orchestra Made This Happen

Here's the thing that still feels a bit surreal: I didn't write a single line of training code. I didn't provision GPUs. I didn't debug CUDA errors at 2am. I just had a conversation with Orchestra and explained what I wanted to test.

The Conversations: Setting Up Two Experiments

I ran two separate experiments with two different Orchestra agents. For the supervised fine-tuning experiment, I asked:

"Fine-tune Llama 3.2 1B on Tulu3 dataset. Compare LoRA rank=16 vs rank=256 on MLP layers only."

For the reinforcement learning experiment, I started a new agent and said:

"run a GRPO RL algorithm on the Qwen2.5-0.5B instruct model with the GSM8k dataset. let's compare full fine-tuning vs. lora"

We went back and forth for about 20 minutes. I clarified which hyperparameters to use, what metrics to track, which baselines to compare against. The agent asked clarifying questions about dataset sizes, training duration, evaluation strategy. It felt more like collaborating with a research engineer than using a tool.

Conversation with Orchestra - Part 1
Conversation with Orchestra - Part 2

Conversation with Orchestra: defining experiments, clarifying hyperparameters, and discussing baselines

What the Orchestra Agent Did: The Full Workflow

1. Writing the Code

Orchestra generated complete training scripts for both experiments: SFT with LoRA at different ranks, and GRPO with LoRA vs full fine-tuning. It structured the code with proper experiment tracking, checkpointing, and evaluation loops, and set up configurations for Llama 3.2 1B on Tulu3 and Qwen2.5-0.5B on GSM8k.
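
To give a sense of what that generated code looks like, here's a minimal sketch of an MLP-only LoRA SFT run using the Hugging Face peft/trl stack. This is my reconstruction, not Orchestra's actual output; the dataset path, hyperparameters, and trainer settings are illustrative assumptions.

```python
# Minimal sketch of an MLP-only LoRA SFT run (illustrative, not Orchestra's actual code).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumed Hugging Face Hub path for the Tulu 3 SFT mixture.
dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

# LoRA on the MLP projections only; swap r=16 for r=256 to reproduce the comparison.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(output_dir="sft-lora-r16", per_device_train_batch_size=4),
)
trainer.train()
```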

2. Debugging and Testing

Before running the full experiments, the agents did short test runs on GPU with small sample sizes and just a few training steps to catch bugs early. This caught issues in the GRPO reward function—the kind of thing that would have caused silent failures during the full training runs.
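
To make that concrete, here is an illustrative GSM8k-style reward sketch (hypothetical; the real reward functions and weights were worked out in the conversation with the agent). A regex that never matches, for example, silently returns zero for every rollout, which is exactly the kind of failure a few test steps expose.

```python
import re

# The GSM8k convention puts the final answer after "#### ".
ANSWER_RE = re.compile(r"####\s*(-?[\d,.]+)")

def correctness_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the completion's '#### <number>' answer matches the reference, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "")
    return 1.0 if predicted == ground_truth.replace(",", "") else 0.0

def format_reward(completion: str) -> float:
    """Partial credit just for producing an answer in the expected format."""
    return 0.5 if ANSWER_RE.search(completion) else 0.0
```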

3. Provisioning GPUs

Once tests passed, Orchestra automatically provisioned the necessary GPUs via Modal: 4x H100s for the SFT experiment, 1x H100 for the RL fine-tuning. It handled all the infrastructure setup—Docker containers, environment configuration, dependency installation. I didn't touch a single config file.
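
For readers who haven't used Modal, the launch step looks roughly like the sketch below, using Modal's public Python API with illustrative app and package names; Orchestra generated and ran the real version without me touching it.

```python
import modal

app = modal.App("lora-grpo-repro")  # illustrative app name

# Container image with the training dependencies installed.
image = modal.Image.debian_slim().pip_install(
    "torch", "transformers", "peft", "trl", "datasets"
)

@app.function(gpu="H100", image=image, timeout=12 * 60 * 60)
def run_experiment(config_name: str):
    # The training entrypoint would go here: load the config, train, log metrics.
    ...
```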

4. Running GPU Experiments in Parallel

Both experiments ran overnight in parallel. SFT with rank=16 and rank=256. GRPO with rank=1 LoRA vs two full FT baselines (low LR and high LR). All metrics were logged through Orchestra's internal SDK and rendered in real time.

5. Monitoring Progress

The agent continuously monitored training metrics. It detected when full FT with high LR flatlined at 0% correctness and flagged it. It noticed when LoRA hit 100% format compliance and predicted strong correctness would follow (it did).

Real-time monitoring of training metrics

Real-time monitoring of training metrics showing the agent detecting and flagging issues

6. Plotting Results

When experiments finished, Orchestra generated 11 publication-ready plots: training/eval loss curves for SFT, correctness over time for GRPO, format compliance trends, total reward progression.

Automated generation of publication-ready plots

The agent automatically generating publication-ready plots with side-by-side comparisons

7. Writing the Analysis Report

Finally, it generated a comprehensive markdown report with statistical analysis, key findings, comparison tables, and recommendations. It even offered to create a PowerPoint presentation.

Automated report generation and PowerPoint offer

The agent generating analysis reports and offering to create PowerPoint presentations

The Results: We Reproduced Their Findings

Experiment 1: Supervised Fine-Tuning (Rank=16 vs Rank=256)

Our LoRA Rank Comparison

Our experiment: Llama 3.2 1B on Tulu3 (rank=16 vs 256)

Thinking Machines Lab SFT Results

Original paper: Multiple ranks on Llama 3.1 8B & 3.2 1B

Our curves basically overlap with the original paper's findings. Rank=16 achieves 99.4% of rank=256's performance with 16x fewer trainable parameters—exactly validating what Thinking Machines Lab found.
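
The 16x figure is just how LoRA parameter counts scale with rank: a rank-r adapter on a d-by-k weight matrix adds r(d + k) trainable parameters, so the ratio between two ranks is simply the ratio of the ranks.

```latex
\#\text{params}(r) = r\,(d + k)
\quad\Rightarrow\quad
\frac{\#\text{params}(r{=}256)}{\#\text{params}(r{=}16)} = \frac{256}{16} = 16
```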

Experiment 2: Reinforcement Learning (Rank=1 LoRA vs Full Fine-Tuning)

Our GRPO correctness over training

Our experiment: Qwen2.5-0.5B on GSM8k (rank=1 LoRA vs Full FT)

Thinking Machines Lab RL Results

Original paper: LoRA ranks vs Full FT on RL tasks

Final correctness on GSM8k:

  • LoRA (Rank=1): 52.1%
  • Full FT (low LR): 33.3%
  • Full FT (high LR): 0%

That's a 56% relative improvement over the best full fine-tuning baseline.

Rank-1 LoRA absolutely demolished both full fine-tuning configurations—exactly matching what Thinking Machines Lab found. Full FT with high LR flatlined at 0%. Full FT with low LR peaked at 43.8% then degraded to 33.3%. Meanwhile, LoRA shot up to 56% by step 50 and stayed stable around 52%. Same story, different model and dataset.

Format compliance over training steps

Format compliance—LoRA hit 100% by step 100, while Full FT maxed at 82.3%

Total reward over training steps

Total reward progression (correctness + format + reasoning quality)

Here's a demo of the actual RL fine-tuning experiment running in Orchestra—watching the agent set up the environment, provision GPUs, and monitor training in real-time:

Video: GRPO fine-tuning on GSM8k with Orchestra—from setup to results

The Timeline Comparison

Traditional Approach

  • Day 1-3: Rent compute, queue for cluster access, configure GPU environments
  • Day 4-7: Write training code, implement GRPO, debug reward functions
  • Day 8-10: Run experiments, realize something's misconfigured, rerun everything
  • Day 11-14: Generate plots, write analysis, make comparison tables
~2-3 weeks

With Orchestra

  • Evening: 20-minute conversation explaining what I want to test
  • Overnight: Agent writes code, debugs, provisions H100s, runs experiments in parallel
  • Morning: Complete results with plots and analysis ready; I spot 4 reward-function issues
  • Next day: Fixed experiments validate the paper's claims, with a detailed analysis report
~2 days

Here is what I got: Production-ready experiments that ran to completion overnight. The code worked. The infrastructure provisioned correctly. The metrics tracked properly. The only "issue" was me changing my mind about reward weights halfway through—and the agent handled the rerun without complaint.

This eliminated weeks of engineering work that would normally block me from exploring areas outside my main expertise. I went from "I'd love to test this but don't have 3 weeks for setup" to "let me validate this today" in a single evening. That's the difference between ideas staying theoretical and actually testing them.

What I Learned About LoRA

The paper's claims hold up

Rank=16 really does get you 99.4% of rank=256's performance on SFT. Apply LoRA to the MLP layers, use a learning rate roughly 10x higher than for full fine-tuning, and you're good. The conventional wisdom of "attention-only LoRA" is leaving performance on the table. I wouldn't have believed this without seeing it myself.

Rank=1 for RL isn't a typo—it's actually optimal

As a physicist, I was deeply skeptical. But the results are undeniable: rank=1 LoRA demolished full FT (52.1% vs 33.3%) on math reasoning. The information-theoretic argument makes sense—policy-gradient RL supplies only about one bit of signal per episode, so even a rank-1 adapter has plenty of capacity, and the low rank acts as a regularizer. This deserves to be standard practice.
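
A back-of-the-envelope version of that argument, with my own illustrative numbers rather than the paper's: each episode's reward injects at most about one bit of information into the policy, while a rank-1 adapter over every MLP matrix of a ~0.5B-parameter model still has on the order of 10^5 trainable parameters, far more capacity than the training run can fill.

```latex
I_{\text{absorbed}} \;\lesssim\; N_{\text{episodes}} \times 1~\text{bit}
\qquad\text{vs.}\qquad
\#\text{params}_{\text{rank-1}} \;=\; \sum_{W \in \mathbb{R}^{d \times k}} 1 \cdot (d + k) \;\sim\; 10^{5}
```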

Format compliance predicts final performance

LoRA hit 100% format compliance by step 100, and correctness followed immediately. For structured outputs, this is a leading indicator worth tracking. The agent caught this pattern before I did.

What I Learned About How Research Should Work

Researchers can now quickly explore adjacent fields

This is the big one for me. GRPO RL fine-tuning wasn't my specialty, but I was curious about the paper's claims. Instead of 2-3 weeks of engineering setup, I validated them in 1 day. The barrier between "interesting idea" and "empirical validation" just got dramatically lower.

Focus on the research question, not the infrastructure

I spent my time thinking about experimental design, interpreting results, and understanding the information-theoretic arguments. Not debugging GPU drivers, provisioning GPUs, or writing boilerplate training loops.

AI agents excel at iterative research workflows

The workflow that typically takes weeks (code → debug → provision → run → analyze → plot) is exactly what agents are good at: structured, multi-step processes with clear success criteria. Orchestra didn't just "assist"—it executed the entire pipeline autonomously while I slept.

Trust but verify

The back-and-forth with Orchestra felt natural. I didn't need to know a specific ML framework or GPU cluster setup details—I just explained what I wanted in plain language and let the agent translate that into working code and infrastructure setup. The agent's code worked, but I still reviewed the experimental setup, checked the metrics, and validated the conclusions. Agents handle execution; you handle scientific judgment.

We are redefining how research will be done.

This experience crystallized an important shift: the bottleneck in science is moving from "can we run the experiment" to "what should we test."

Previously, many ideas remained untested because execution costs were too high. A paper would spark curiosity about generalization, but validating it required 2-3 weeks of infrastructure work—time most researchers don't have.

When you can articulate a research question clearly, you can now get empirical answers. This enables researchers to follow their curiosity, run validation studies that would otherwise be deprioritized, and accelerate the scientific process by removing artificial friction—not by cutting corners. Everyone can be a scientist ✨.

References

  1. LoRA Without Regret — John Schulman and Thinking Machines Lab, Sep 2025
  2. LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021
  3. Group Relative Policy Optimization — Shao et al., 2024
  4. GSM8K: Training Verifiers to Solve Math Word Problems — Cobbe et al., 2021
  5. Tulu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024

Experiments conducted using Orchestra Research, an AI-powered research platform for accelerating scientific discovery. All code, configurations, and experimental logs are available for reproducibility.

Citation

If you reference this work, please cite:

BibTeX

@article{zhang2025lora_reproduction,
  title={Reproducing "LoRA Without Regret" with Orchestra},
  author={Zhang, Zechen and Liu, Amber},
  journal={Orchestra Research Blog},
  year={2025},
  url={https://www.orchestra-research.com/perspectives/LLM-with-Orchestra}
}

Acknowledgements

We'd like to thank Modal for their generous support in providing cloud compute infrastructure for these experiments. Their platform made it seamless to provision H100 GPUs on-demand, which was essential for running both the supervised fine-tuning and reinforcement learning experiments at scale.