How to Run an Experiment

What is an Experiment?

An experiment, also known as an A/B test or randomized controlled trial, is a research method used to answer questions of cause and effect: does some factor “X” affect outcome “Y”?

Imagine you’ve noticed that when you’re wearing a hat, people tend to compliment you. You may think the hat “causes” the compliments. But people may also be complimenting you when you’re not wearing a hat, and you just don’t notice. Or it could be that you only wear a hat when you’re with a certain group of friends who just compliment more in general (a confounding factor). Or maybe you only wear a hat when it’s sunny, and people are in a better mood when it’s sunny.

An experiment takes care of these issues by providing a comparison or “control” condition (days when you’re not wearing a hat) and “randomizing” the thing of interest (hat vs. no hat). Let a coin flip decide whether you wear a hat on certain days. If you still get more compliments on hat days, the hat is likely why, because now neither the sun nor your friends decide when you wear the hat. The coin does.

The experiment process is pretty simple once you understand the two basic ingredients of control and random assignment. So we’ve broken it down into 7 basic steps.

Steps to Design and Run an Experiment

1. Start with a question: Does some factor “X” cause outcome “Y”?
Example: Does wearing a hat make me look better?

2. Choose how you will measure outcome “Y”.
Example: Social media likes.

3. Decide how you will present your factor “X” to people.
Example: Take a photo of you wearing a hat.

4. Make a Control condition identical to #3 above, but without factor “X”.
Example: Take a photo identical to the first, but without the hat (same shirt, pants, smile, etc.).

5. Randomize your causal factor “X”.
Example: Take your list of Facebook friends and randomize which photo they see, so that only half can see your photo with a hat, and only the other half can see your photo without a hat.

6. Launch your experiment and collect your outcome data.
Example: Post each photo on Facebook to the designated friend list, and let the likes accumulate for a few days.

7. Analyze your results with the appropriate statistical test, comparing the outcomes between your two conditions.
Example: Chi-squared test for binary outcomes (e.g., like-dislike), or t-test for a 1-5 scale.
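
Step 5 is the one people most often do by hand, so here is a minimal sketch of random assignment in Python. The friend names, the group labels, and the function name assign_conditions are all hypothetical and only for illustration.

```python
import random

def assign_conditions(participants, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the shuffle repeatable
    shuffled = list(participants)  # copy so the original list is untouched
    rng.shuffle(shuffled)          # the shuffle plays the role of the coin flip
    half = len(shuffled) // 2
    return {"hat": shuffled[:half], "no_hat": shuffled[half:]}

# Hypothetical friend list, for illustration only.
friends = [f"friend_{i}" for i in range(1, 11)]
groups = assign_conditions(friends, seed=42)
print("Sees hat photo:   ", groups["hat"])
print("Sees no-hat photo:", groups["no_hat"])
```

Shuffling the whole list and cutting it in half keeps the two groups the same size, which a separate coin flip per person would not guarantee.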

It’s that simple. Experiments can take on more complex designs, such as multiple “X”s or outcome measures, but those are a bit beyond our scope here. Nevertheless, there are a few other important considerations you should pay attention to if you want to design a quality experiment.

Sample Size

The number of people who will participate in your experiment is very important, as this affects how confident you can be that your experiment results are not just due to chance. The more people who participate in your experiment, the more likely you’d find similar results if you ran the experiment again.

Consider flipping a coin. If you do it only 4 times, you may end up with 75% or even 100% of one side winning. But if you flip a coin 1,000 times, the outcome is much more likely to be around 50-50.
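
To put rough numbers on the coin example, the sketch below computes the exact chance that one side of a fair coin comes up at least 75% of the time. The function name prob_lopsided and the 75% cutoff are illustrative choices, not part of any standard tool.

```python
from math import comb

def prob_lopsided(n_flips, share=0.75):
    """Chance that either side of a fair coin wins at least `share` of the flips."""
    # Count the heads/tails sequences in which the more frequent side
    # reaches the cutoff, then divide by all 2**n_flips possible sequences.
    count = sum(
        comb(n_flips, heads)
        for heads in range(n_flips + 1)
        if max(heads, n_flips - heads) >= share * n_flips
    )
    return count / 2 ** n_flips

print(prob_lopsided(4))     # 0.625
print(prob_lopsided(1000))  # vanishingly small
```

With 4 flips a lopsided result is more likely than not. With 1,000 flips it essentially never happens.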

So how many people should take your experiment? Here are some general guidelines:

60 people is usually enough to detect large effects between two conditions.
(e.g., a 1-point difference on a 1-5 survey scale).

120 people is usually enough to detect medium effects.
(e.g., about 2/3 of a point difference on a 1-5 scale).

200-450 people or more are generally required for small effects.
(e.g., about 200 people for a 1/2 point difference, about 450 people for a 1/3 point).

There are a variety of free sample size calculators (e.g., G*Power) that can help you determine a reasonable sample size, given your required level of confidence in the results and the estimated effect size.

Effect Size

Hand-in-hand with sample size is the effect size that your causal factor has on the outcome. This is basically the average difference in your outcome between your control condition and causal factor condition.

For example, if survey takers rate your photo with a hat on average at about 4.0 on a 1-5 scale, and the average rating of your photo without a hat is 3.0, your effect size is 1.0.

How do you know whether that’s large, medium, or small? We can compute a standardized effect size, which makes our result more comparable to other experiments.

For example, the formula for Cohen’s d is simply the difference in average outcomes between your two conditions, divided by the standard deviation of your sample, pooled across both conditions (a measure of how spread out your data is). In our hat example, if that standard deviation is 1.25, simply take 1.0 divided by 1.25, and you have a standardized effect size of 0.80, which is Cohen’s threshold for a large effect.
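
As a rough sketch, Cohen’s d can be computed from raw ratings as shown below. The ratings are made up for illustration and are not data from a real experiment. The standard deviation here is pooled across the two conditions, which is a common convention but still an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical 1-5 ratings, invented for this example.
hat_ratings = [5, 4, 4, 3, 5, 3]
no_hat_ratings = [3, 2, 4, 3, 2, 4]
print(round(cohens_d(hat_ratings, no_hat_ratings), 2))

# The article's own numbers: a 1.0-point difference with a standard deviation of 1.25.
print(1.0 / 1.25)  # 0.8
```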

But wait. If we’ve never run the experiment before, how would we know the effect size? You may not. And that’s ok. You can provide your best guess, or decide what effect size is the smallest you’d be comfortable with and then plan your sample size accordingly.

For example, if you don’t care about any effect of wearing a hat that’s smaller than a 1/2 point difference on a 1-5 scale, a sample size calculator would tell you that recruiting 200 people would give you an 80% chance of finding the 1/2 point effect, if it actually exists. If you run the experiment and you don’t find a significant result, it’s likely because the difference is smaller than half a point, maybe even closer to zero.
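
For readers curious about what a sample size calculator does, here is a minimal sketch using a standard normal approximation for comparing two group averages. It assumes a two-sided test at the 0.05 level, 80% power, equal group sizes, and the standard deviation of 1.25 from the example above. Calculators based on the exact t distribution give slightly larger numbers, and the function name sample_size_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate people needed per condition to detect a standardized effect."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for a two-sided test
    z_power = z.inv_cdf(power)          # cutoff for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# A half-point difference with an assumed standard deviation of 1.25.
d = 0.5 / 1.25
n = sample_size_per_group(d)
print("Per condition:", n)
print("Total:", 2 * n)
```

Under these assumptions the total comes out close to the 200 people mentioned above.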

Analysis

If you hate math, you’ll love experiments. Well, at least relative to other research methods. Unlike big data, experiments rely on only a handful of basic statistics, all of which can be computed using free online software.

Sample Size (number of people in your experiment)
Mean (average) outcome in each condition
Standard Deviation (how spread-out your outcome data is)

These numbers can be plugged into a t-test calculator, which will analyze the difference in averages between your control and treatment groups. The test will spit out a probability value (“p-value”), which tells us the probability of finding results at least this extreme if the real difference were zero. The smaller the p-value, the stronger the evidence that the difference is real. If the p-value is below 0.05, we conventionally call the result “statistically significant,” since a difference this large would show up less than 5% of the time if the real difference were actually zero.
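
To show how those three numbers combine, here is a simplified sketch of the comparison. It computes the usual test statistic, then gets the p-value from a normal approximation rather than the exact t distribution that a t-test calculator would use, so treat it as a rough check that works best with about 30 or more people per condition. The group size of 30 and the function name compare_means are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def compare_means(mean1, sd1, n1, mean2, sd2, n2):
    """Return (test statistic, approximate two-sided p-value) from summary statistics."""
    # Standard error of the difference between the two averages.
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = (mean1 - mean2) / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# The hat example: averages of 4.0 vs. 3.0, standard deviation 1.25,
# and an assumed 30 people per condition.
t_stat, p_value = compare_means(4.0, 1.25, 30, 3.0, 1.25, 30)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Statistically significant" if p_value < 0.05 else "Not significant")
```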

Conclusion

From thinking of your cause-effect question to analyzing your results, you now have the basic tools to design an experiment. Of course, there are other considerations to truly master the art and science of experiments. But mastery isn’t required to benefit from this research tool. Following the steps above will put you well on your way to gaining valuable insight into cause-effect relationships you care about.
