Below are some general pieces of advice I would give to anybody conducting LLM evals / behavioral LLM science research. Take it or leave it -- you may find that you work more effectively doing things differently -- these are just some points to consider.
(Note: I expect some of this advice to become outdated as AI technology advances, particularly the lines I draw on what AI can and can't be trusted with.)
Logging is important for two reasons. First, it facilitates reproducibility. Second, it saves the data from your experiments in a form that makes downstream analysis easier.
All of your experiments should log:
- global_params.csv
- A logs.csv file where every row is a different LLM query.
Even when heavily relying on an AI coding tool, you should always have a good sense of what inputs are being fed into the LLM. It's easier than you think to waste money on completely meaningless experiments, e.g. because there is something wrong with the prompt. In particular, if a prompt is confusing or underspecified, LLMs tend to behave more incoherently, which can make your job more difficult.
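Going back to the two log files above, here is a minimal sketch of what writing them could look like. The column names, the per-run directory layout, and the idea that global_params.csv holds run-level settings (model name, temperature, and so on) are assumptions for illustration, not a fixed schema.

```python
import csv
import os
import time
import uuid

# Hypothetical column set for logs.csv; adapt to your experiment.
LOG_FIELDS = ["run_id", "query_id", "timestamp", "model", "prompt", "completion"]


def start_run(log_dir, global_params):
    """Create a run directory, write global_params.csv, and start logs.csv."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = os.path.join(log_dir, run_id)
    os.makedirs(run_dir, exist_ok=True)
    # One (param, value) row per run-level setting.
    with open(os.path.join(run_dir, "global_params.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["param", "value"])
        for key, value in sorted(global_params.items()):
            writer.writerow([key, value])
    # Write the logs.csv header once, up front.
    with open(os.path.join(run_dir, "logs.csv"), "w", newline="") as f:
        csv.DictWriter(f, fieldnames=LOG_FIELDS).writeheader()
    return run_id, run_dir


def log_query(run_dir, run_id, model, prompt, completion):
    """Append one row to logs.csv for a single LLM query."""
    row = {
        "run_id": run_id,
        "query_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "completion": completion,
    }
    # Appending after every query means a crash loses at most one row.
    with open(os.path.join(run_dir, "logs.csv"), "a", newline="") as f:
        csv.DictWriter(f, fieldnames=LOG_FIELDS).writerow(row)
    return row
```

Logging the full prompt and completion text (rather than only a score) is what makes it possible to reread the raw data later.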
Personally, I like to have a dedicated file prompts.py where all prompts and prompt templates live. That way, it's easier to know what prompts the LLM is seeing than when the prompt is built in a scattered way across multiple methods/files.
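As an illustration, a prompts.py along these lines might look like the sketch below. The prompt text, the template, and the `build_prompt` helper are all made up for the example; only the idea of keeping everything in one file comes from the advice above.

```python
# prompts.py (hypothetical contents)
from string import Template

SYSTEM_PROMPT = "You are a careful assistant. Answer concisely."

QUESTION_TEMPLATE = Template(
    "Question: $question\n"
    "Give your final answer on the last line, prefixed with 'Answer:'."
)


def build_prompt(question):
    """Return the exact prompt string the model will see for one question."""
    # substitute() raises KeyError if a placeholder is left unfilled, so a
    # half-built template fails loudly instead of silently reaching the model.
    return QUESTION_TEMPLATE.substitute(question=question)


if __name__ == "__main__":
    # Print one rendered prompt so you can eyeball it before spending money.
    print(build_prompt("What is 200 / 0.2?"))
```

Running the file directly prints a fully rendered prompt, which is a cheap way to check what the LLM is actually being shown.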
In addition to looking at whatever output metrics are of interest, you should always have a good sense of what the raw LLM completions look like. This way you can make sure the LLM is behaving in a way you expect.
Personally, I like having an AI coding tool write to a .txt file some example complete transcripts, to make sure everything looks right. For larger datasets, it's also very easy to ask an AI coding tool to make you a dashboard visualizing the raw data (I would not trust it to do a good job at interpreting the data).
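A small script for dumping example transcripts to a .txt file could look like this. It is only a sketch: it assumes a logs.csv with `prompt` and `completion` columns, which are hypothetical names, and the function name is likewise invented.

```python
import csv
import random


def write_sample_transcripts(logs_csv, out_txt, n=5, seed=0):
    """Write up to n randomly chosen prompt/completion pairs to a text file."""
    with open(logs_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Fixed seed so rerunning the script shows the same sample.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(n, len(rows)))
    with open(out_txt, "w", encoding="utf-8") as f:
        for i, row in enumerate(sample, start=1):
            f.write(f"===== transcript {i} of {len(sample)} =====\n")
            f.write("--- PROMPT ---\n" + row["prompt"] + "\n")
            f.write("--- COMPLETION ---\n" + row["completion"] + "\n\n")
    return len(sample)
```

Sampling at random, rather than taking the first few rows, reduces the chance of only ever looking at one experimental condition.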
It's also important to read some outputs carefully, not just skim them. Here's an example from my own research. I ran a small experiment, and it seemed like a change in the prompt had a certain effect on LLM behavior. Then I read the raw LLM outputs. It turns out, the bulk of the variation in the LLM outputs was due to it making math mistakes (like writing 200/0.2 = 400 instead of 1000), rendering the measured effect on LLM behavior meaningless.
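For this particular failure mode, a crude automated check can help decide which outputs to read first. The sketch below flags simple "a op b = c" claims whose arithmetic is wrong. It is a heuristic of my own construction for illustration: it will miss multi-step expressions, and it misreads numbers with thousands separators such as "1,000", so it is a prompt for careful reading, not a substitute for it.

```python
import re

# Matches simple claims like "200/0.2 = 400" or "3 * 4 = 12".
CLAIM = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_arithmetic_errors(text, rel_tol=1e-6):
    """Return (claim_text, recomputed_value) for each wrong 'a op b = c' claim."""
    errors = []
    for match in CLAIM.finditer(text):
        a, op, b, claimed = match.groups()
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "/" and b == 0:
            continue  # skip division by zero rather than guess
        actual = OPS[op](a, b)
        # Tolerance absorbs floating-point noise and minor rounding.
        if abs(actual - claimed) > rel_tol * max(1.0, abs(actual)):
            errors.append((match.group(0), actual))
    return errors
```

Calling `find_arithmetic_errors("so 200/0.2 = 400")` flags the claim from the example above, while a correct statement like "3 * 4 = 12" produces an empty list.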
Testing on cheap LLMs is better because, if you are iterating quickly, most of your experiment data will be garbage anyway. (For example, if you notice a problem with the prompt, all the data with the old prompt is now worthless.)
Testing on two LLMs from different providers is often helpful so that you can make sure your behavior of interest replicates, and is not just a quirk of a specific LLM.
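One way to check replication is to compute the same behavior rate for each model on the same prompts. The sketch below assumes you supply two things yourself: `query_fn(model, prompt)`, a hypothetical wrapper around whatever provider APIs you use, and `shows_behavior(completion)`, a hypothetical classifier for the behavior of interest.

```python
def behavior_rate_by_model(models, prompts, query_fn, shows_behavior):
    """Return {model: fraction of prompts whose completion shows the behavior}."""
    rates = {}
    for model in models:
        hits = 0
        for prompt in prompts:
            completion = query_fn(model, prompt)  # hypothetical API wrapper
            hits += bool(shows_behavior(completion))
        # Avoid dividing by zero when no prompts were given.
        rates[model] = hits / len(prompts) if prompts else float("nan")
    return rates


if __name__ == "__main__":
    # Stub standing in for real API calls, just to show the call shape.
    def fake_query(model, prompt):
        return "I refuse." if model == "model-a" else "Sure, here you go."

    print(behavior_rate_by_model(
        ["model-a", "model-b"],
        ["prompt 1", "prompt 2"],
        fake_query,
        lambda completion: "refuse" in completion,
    ))
```

If the rates differ wildly between the two models, that suggests the effect may be a quirk of one model rather than a general behavior.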
Depending on the nature of the project, "thinking" can mean: reading, brainstorming, running tiny (<2hr) pilot experiments, etc. And by engineering, I mean any nontrivial software design.
Reasons to postpone engineering (until it's clear you need it):
I think reading is very important and highly underrated (by the applied AI safety community) as a part of the research process.
Some reasons to read in general:
Some further reasons to read human writing specifically:
Personally, my preferred workflow is as follows:
If you are skeptical that reading is worth the time, you might consider trying out the following. Ask your favorite AI to critique a paper you are extremely familiar with (e.g. are an author of), or write a lit review on an area you are extremely familiar with. My guess is you will notice issues with what the AI says: maybe it nitpicks things that are not actual problems, maybe it doesn't notice or emphasize the real problems, maybe it overlooks/misunderstands certain aspects entirely, etc.
This refers to the phenomenon that AI outputs are often less diverse / creative / novel than high-quality human thought.
One big place this crops up is experiment design. AI coding tools will design experiments with "default" choices, which has two problems. First, these defaults are often bad, sometimes for subtle reasons. Second, even when the defaults are good, research is about doing something different, so if you are deferring to the AI coding tool for everything, you're less likely to find something new and interesting.
In particular, if you rely on an AI coding tool too heavily, it's likely that your work will have essentially zero counterfactual impact, because what you did is likely to be easily replicable by a slightly more advanced AI coding tool acting completely autonomously. So, there is a balance to strike between doing work with counterfactual impact, and doing work that is bitter lesson compatible. I don't think anybody knows how to get this right; I'm just pointing it out as something to keep in mind.
I also think that AI cognitive monoculture can be a problem when you learn too much from AI, rather than from humans. The main mechanism through which this happens is unknown unknowns, which are more of a problem with AI due to sycophancy. (That is, the AI "code-switches" according to its model of you, and explains things from a perspective compatible with its existing model of your worldview, leaving out possibly important things you would never have considered.) Even putting personalization aside, I think AI systems can have specific perspectives/worldviews/opinions that shape how they explain things.