Staying sane down the rabbit hole

If you have done some kind of research (be it academic or in industry), you have had to figure out how to experiment properly. Since I'm a software engineer by trade with a focus on algorithms, experiments for me typically take the shape of some kind of benchmark. This often means measuring performance, coming up with a hypothesis for an improvement, implementing it and then quantifying the effect. This is very repetitive work that can become a gigantic rabbit hole. To stay sane, I've kept a "research log", usually in the form of a small notebook on my desk. Recently I made a change to digitize this process and run small experiments in a more automated fashion via LLM-based agents.
I replaced my notebook with a markdown file `EXPERIMENTS.md` in the root of the repository. Each entry has an experiment name, a hypothesis and eventually an outcome. It can be as simple as the following:
```markdown
## workgroup_size
Hypothesis: The shader workgroup size impacts the performance of the nearfar/deltastep algorithms.
Outcome: The workgroup size is largely irrelevant.
```

These entries are kept fairly brief; I either write them by hand if I think of something worth investigating, or let an LLM agent create them for me based on profiling results. Each experiment name maps to a branch `experiment/{name}`, e.g. `experiment/workgroup_size`, which contains the commits for setting up the experiment. I prefer this branch-based approach over, say, keeping all experiments on the `main` branch, because it makes it natural to set up small variations as individual commits and to work with changes on the `main` branch via rebasing.
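To make the branch setup concrete, here is a minimal sketch of how such a branch-creation step could be scripted (hypothetical helper names; the actual tooling may differ). It refuses to branch off a dirty tree, so every experiment starts from a known, committed state:

```python
import subprocess

def git(*args, cwd="."):
    """Run a git command and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

def create_experiment(name, repo="."):
    """Create and check out the experiment/{name} branch off HEAD."""
    # Refuse to branch off uncommitted work.
    if git("status", "--porcelain", cwd=repo).strip():
        raise RuntimeError("working tree is dirty; commit or stash first")
    branch = f"experiment/{name}"
    git("checkout", "-b", branch, cwd=repo)
    return branch
```

Because the branch is just a regular git branch, rebasing it onto an updated `main` later needs no special support.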
For each variant, I mark the state where I want to perform a measurement with special commits that instruct an experiment runner to execute a suite of benchmarks with given parameters. For the example above this looks something like this:
```
738d4f9a2 result:xp:workgroup_size
30ea164df xp:workgroup_size run:deltastep run:nearfar param:delta=900,1800,3600 param:data=berlin
853dd8231 Set workgroup size to 256
21f830ab0 xp:workgroup_size run:deltastep run:nearfar param:delta=900,1800,3600 param:data=berlin
cd6315f95 Set workgroup size to 128
60dd7f45a xp:workgroup_size run:deltastep run:nearfar param:delta=900,1800,3600 param:data=berlin
651371fae Set workgroup size to 64
b1641ebeb xp:workgroup_size run:deltastep run:nearfar param:delta=900,1800,3600 param:data=berlin
7a35badec Set workgroup size to 32
```

The experiment runner looks for `xp:{name}` commits, checks them out and runs the experiments. When all experiments have been run, the results are committed in a single `result:xp:{name}` commit. The instrumentation commits are simply empty commits; only their commit message carries meaning. Using commits works better than git tags here because we don't actually want to attach data to a specific commit SHA; we are more interested in marking a position in a branch. If you want to redo a measurement after finding some other, unrelated improvement, you can simply rebase the branch on `main`, drop the result commit and rerun the experiment.
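As a sketch of how a runner can decide what still needs to run, here is one way to parse the commit messages and collect the `xp:` commits that are not yet covered by a `result:xp:{name}` commit (hypothetical helper names; the actual runner may differ):

```python
def parse_xp(message):
    """Parse 'xp:name run:a run:b param:k=v1,v2' into its components;
    returns None for ordinary commits."""
    tokens = message.split()
    if not tokens or not tokens[0].startswith("xp:"):
        return None
    return {
        "name": tokens[0][len("xp:"):],
        "runs": [t[len("run:"):] for t in tokens if t.startswith("run:")],
        "params": {
            key: values.split(",")
            for t in tokens if t.startswith("param:")
            for key, _, values in [t[len("param:"):].partition("=")]
        },
    }

def pending(log):
    """Given (sha, message) pairs, newest first, return the xp commits
    that appear above the last result:xp:{name} commit, i.e. still
    need to be run."""
    done, todo = set(), []
    for sha, message in log:
        if message.startswith("result:xp:"):
            done.add(message[len("result:xp:"):])
            continue
        xp = parse_xp(message)
        if xp and xp["name"] not in done:
            todo.append((sha, xp))
    return todo
```

Because the log is traversed newest-first, anything older than the result commit is treated as already measured, which is exactly what makes the rebase-and-drop-the-result workflow cheap.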
Once we have this basic process in place, we can automate it! The low-hanging fruit is to create simple tools that orchestrate the process:
- Creating dedicated experiment runners. For example, the binary `experiment_nearfar` runs an experiment that captures the performance of this specific algorithm on a dataset.
- Creating a tool `xps.py` that makes it easy to create these special instrumentation commits, then run them and capture the results.
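Under the hood, the `add` step only needs to compose the instrumentation message and record it as an empty commit, since only the message carries meaning. A sketch (hypothetical helper names; the real `xps.py` may differ):

```python
import subprocess

def xp_message(name, runs, params):
    """Compose an instrumentation message such as
    'xp:workgroup_size run:deltastep param:delta=900,1800'."""
    parts = [f"xp:{name}"]
    parts += [f"run:{r}" for r in runs]
    parts += [f"param:{k}={','.join(map(str, v))}" for k, v in params.items()]
    return " ".join(parts)

def add_instrumentation(name, runs, params, repo="."):
    """Record the message as an empty commit on the current branch."""
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", xp_message(name, runs, params)],
        cwd=repo, check=True, capture_output=True)
```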
For our `workgroup_size` example this would be used as follows:
```shell
# Creates the `experiment/workgroup_size` branch and checks it out
./xps.py create workgroup_size
# Creates an XP commit that runs both the deltastep and the nearfar experiments with the given parameters
./xps.py add deltastep,nearfar data=berlin delta=900,1800,3600
# Runs all the experiments since the `result:xp:{name}` commit (if there is one)
./xps.py run
# Prints results for each variant and compares them (e.g. selects the winner and shows the speed-ups)
./xps.py compare workgroup_size
```

Since the tools are doing the heavy lifting of enforcing the process,
we can use this reliably with a fuzzier tool like an LLM. I use `opencode` for my personal agentic-coding needs. Opencode makes it simple to define custom agents, so I created a `labrat` agent that has the following agent prompt:
```
You are `labrat`, GPUSSSP's experimentation agent. Follow this playbook every time:

1. **Research first** – Use the given context to provide sensible defaults, for example (if applicable) inspect the commits provided - be brief.
   - `experiments/README.md` is a good starting point.
   - Unless said otherwise we want to use `berlin_zorder` as dataset.
   - Unless said otherwise only run experiments for the mentioned algorithm, or the algorithms affected by the changes.
2. **Clarify** – If the previous step was not sufficient then pin down the following:
   - experiment name + hypothesis (what metric should change and why)
   - datasets / cache inputs, algorithms (`deltastep`, `nearfar`, etc.), and tunables (`delta`, `gpu`, repetitions)
3. **Scope documentation** – once clarified:
   - Edit `EXPERIMENTS.md` by adding/refreshing a section `## <experiment_name>`.
   - Describe the hypothesis briefly. Don't use more than 2-3 sentences. Be short and precise.
   - Create a dedicated git commit for this documentation update (no binaries or build artifacts). Do not proceed until this commit succeeds.
4. **Experiment setup** – perform `experiments/xps.py create <name>` (only after confirming you are on a clean tree)
   - If non-trivial code changes are required, pause and explicitly ask the user to invoke the primary `build` agent (or switch agents) to perform those modifications before you continue.
   - Instrumentation commits via `xps.py add` using the clarified run targets + params. Surface each git change to the user as you go.
5. **Pre-run confirmation** – summarize the planned branch, run targets, params, and expected duration. Ask the user for an explicit go/no-go before launching `xps.py run`.
6. **Run + collect** – after receiving approval:
   - Ensure Release build configuration exists, rebuild required experiment binaries, then execute `./experiments/xps.py run`.
   - Collect logs/metrics from `experiments/results/<name>` and (if available) `experiments/xps.py compare` output.
7. **Result logging** – append a concluding sentence to the existing section in `EXPERIMENTS.md` stating whether the hypothesis was validated or invalidated (e.g., "Outcome: hypothesis invalidated – fixed dispatch slowed DeltaStep by 3% on berlin"). Include key evidence (dataset + metric) in that line.
8. **Commit etiquette** – after updating results, stage the modified artifacts (`EXPERIMENTS.md`, results directory, relevant compare output) and create two commits if needed: one for docs/results, another for instrumentation outputs. Never rewrite or drop user commits. Do not push unless explicitly asked.
9. **Safety + delegation** – if you encounter merge conflicts, build failures, or repo dirt unrelated to your work, stop, report the issue, and wait for instructions.

Maintain a concise running log to the user: research ➜ clarify ➜ document ➜ commit ➜ confirm ➜ run ➜ summarize.
```

With that in place, running the current very basic experiment is as
simple as telling the `labrat` agent:
```
I want to investigate the effect of the `workgroup_size` on the performance of the
deltastep and nearfar algorithms. Take a look at @include/gpu/deltastep.hpp and
@include/gpu/nearfar.hpp and implement variants 32, 64, 128 and 256.
```

For these simple tasks, using a cheap model like
`gemini-3-flash` is sufficient. If the changes are larger, I would switch to the `build` agent based on `sonnet-4.6` or `opus-4.6`, and steer it to a sufficient result. Often I already have multiple commits on the `main` branch where I tried something and can tell `labrat` to build a dedicated experiment on top of that.
This is a far cry from completely automated experimentation, but it has tremendously reduced the toil of trying out different variations. That alone has led to much better outcomes, because there are many things I would previously not have bothered to validate. Completely automated tools like OpenEvolve sound cool, but they only work for really isolated problems. In a complex code base you will still need extensive guidance to get meaningful results.
I think we are in an interesting age where the actual code to implement an approach like this is kind of irrelevant. Most of the actual experiment orchestration was generated using agentic-coding with minimal guidance. You can most likely get something up and running just by giving this blog post to an LLM and asking it to implement something similar for your code base.