How T-Tests Can Be Useful Beyond A/B Testing (Data Science in the Wild)
Making imperfect data work for you
A couple of weeks ago at work, the growth team reached out to me asking for help determining if a new app feature they were planning to launch would help boost conversions.
I said of course, and we started planning the A/B test.
Now let me tell you what actually happened…
The growth team reached out to me a week after the flow was released to ask if I could help determine its impact on conversions.
The first thought on my mind was: "Why didn't we do this as an experiment?" 🤦‍♂️
But of course, I still offered to help out as much as I could.
Fortunately (or unfortunately), I've been down this road many times in my career, so I already have a few strategies for handling this type of analysis, and one of them involves the t-test.
The real world is messy, which is why as data scientists, we need as many tools in our toolbox as possible to handle the unpredictability of real-world data and the inevitable human tendency to take shortcuts.
First, what is a t-test?
In case you didn’t know or need a refresher, a t-test is a statistical test used to compare the averages (means) of two groups and determine if the difference between them is statistically significant.
For an A/B test, these groups would be your control group (group A) and your variant group (group B). The t-test helps calculate the famous p-value, which answers the question: How likely is it that these results happened by random chance?
Doing a t-test (and calculating a p-value) in Python is extremely easy using the SciPy library:
from scipy.stats import ttest_ind
# Example data: Group A and Group B
group_a = [2.3, 1.9, 2.8, 3.4, 2.5]
group_b = [3.1, 3.4, 3.8, 4.2, 3.9]
# Perform independent t-test
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat}, p-value: {p_value}")
💡 If you need a full refresher on the p-value, check out this article
What most people overlook is that the t-test (or similar statistical tests) can still be used even if you don't have a randomized control group and a variant group.
But of course, there are some caveats…
Using it in practice
In my situation, the team had introduced a new call to action (CTA) button in the app meant to facilitate conversions.
Since my time was limited, I started with the simplest approach: conducting exploratory data analysis (EDA) to identify any trends.
This is what I did during EDA (a rough code sketch follows the list):
Conversion analysis: Calculated conversion rates for users who clicked vs. didn't click the CTA.
Cohort analysis: Performed a weekly cohort analysis to visualize trends (heatmap and line chart).
Baseline comparison: Included conversion data from the same period last year for additional context (e.g., seasonality).
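Here's a minimal sketch of the first two steps, assuming a pandas DataFrame with hypothetical columns (user_id, clicked_cta, converted, signup_date); your schema will differ:
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical user-level data: one row per user
df = pd.read_csv("users.csv", parse_dates=["signup_date"])
# Conversion analysis: conversion rate for clickers vs. non-clickers
print(df.groupby("clicked_cta")["converted"].mean())
# Cohort analysis: weekly signup cohorts and their conversion rates
df["cohort_week"] = df["signup_date"].dt.to_period("W").astype(str)
df.groupby("cohort_week")["converted"].mean().plot(
    kind="line", title="Conversion rate by weekly signup cohort"
)
plt.show()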
At this point in the analysis, it was evident that there had been a substantial increase in conversions on iOS, while the increase on Android was negligible.
But of course, it wasn’t enough to stop there.
💡 Remember, the goal of exploratory analysis is to collect as many signals as possible to reinforce confidence in your findings.
So as a way to gain more confidence in my findings, I did the following:
Created groups: Segmented users into two groups: one with users who signed up in the two weeks before the CTA was added, and another with those who signed up in the two weeks after.
Statistical test: Performed an independent t-test to compare the conversion rates of the two groups (sketched below).
Result evaluation: Interpreted the p-value to determine statistical significance.
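A minimal sketch of that comparison, reusing the hypothetical df from the EDA sketch and an assumed launch date (this is illustrative, not the team's actual code):
from scipy.stats import ttest_ind
import pandas as pd
cta_launch_date = pd.Timestamp("2024-11-01")  # assumed launch date for illustration
window = pd.Timedelta(weeks=2)
# Binary conversion outcomes (1 = converted) for each signup window
before = df.loc[
    (df["signup_date"] >= cta_launch_date - window)
    & (df["signup_date"] < cta_launch_date),
    "converted",
].astype(int)
after = df.loc[
    (df["signup_date"] >= cta_launch_date)
    & (df["signup_date"] < cta_launch_date + window),
    "converted",
].astype(int)
# Independent t-test on the two groups' conversion outcomes
t_stat, p_value = ttest_ind(before, after)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
With binary outcomes and reasonably large groups, this behaves much like a two-proportion z-test.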
In the end, the results of the t-test reinforced my observations, and I could say with confidence that there had been a significant increase in conversion rates on iOS after the release of the CTA.
Checking for assumptions
Now, this is probably the most important thing to know before diving into an independent t-test, and unfortunately, it's something many junior data scientists completely overlook: Checking for assumptions.
Here’s what to check:
Are the data groups independent?
The groups should not influence each other. If they do (like comparing before-and-after data for the same users), you'll need a paired t-test instead.
Is the data approximately normal?
Strictly speaking, it's the sampling distribution of the mean that needs to be approximately normal, not necessarily the raw data. With large samples, the central limit theorem usually takes care of this; with smaller ones, it's worth checking your data with tools like a Q-Q plot or the Shapiro-Wilk test.
Is the variance equal across groups?
If the variance differs significantly between groups, a standard t-test may not be accurate. In such cases, use Welch's t-test, which accounts for unequal variances.
Validating these assumptions only takes a few minutes but ensures your conclusions are reliable.
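Here's a minimal sketch of those checks with SciPy, reusing group_a and group_b from the earlier example (the 0.05 threshold is a convention, not a rule):
from scipy.stats import shapiro, levene, ttest_ind
group_a = [2.3, 1.9, 2.8, 3.4, 2.5]
group_b = [3.1, 3.4, 3.8, 4.2, 3.9]
# Normality check (most useful on small samples like these)
print("Shapiro-Wilk A:", shapiro(group_a))
print("Shapiro-Wilk B:", shapiro(group_b))
# Equal-variance check
stat, p = levene(group_a, group_b)
print(f"Levene's test p-value: {p:.4f}")
# If variances look unequal, fall back to Welch's t-test
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's t-test: t={t_stat:.3f}, p={p_value:.4f}")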
Correlation doesn’t imply causation
In the end, we always have to remember that correlation doesn't imply causation. While the results of my analysis gave me some confidence in the impact of the new button, I couldn't say with certainty that it drove the increase in conversions (not without a proper experiment, at least).
But there are two key takeaways I want to leave you with:
Be ready to handle imperfect analysis: In the real world, data is rarely perfect, and you won’t always have clean experiments or clear causation. The key is to focus on actionable insights that can still drive business impact, even if they aren’t definitive. For example, my analysis may not have proven causation, but it provided enough evidence to guide the next steps confidently.
Educating stakeholders is crucial: Even when we can draw reasonable conclusions, it’s just as important to educate stakeholders about why setting up proper tests (like A/B testing) upfront is critical. It saves time, reduces guesswork, and leads to more robust decisions. This is especially important when trying to establish causality rather than relying on correlations.
In later articles, I’ll share more advanced strategies for dealing with scenarios like this one, such as causal inference or propensity score matching, to better understand causation in real-world data.
Thank you for reading! I hope these tips help you become a more well-rounded data scientist.
See you next week!
- Andres
Before you go, please hit the like ❤️ button at the bottom of this email to help support me. It truly makes a difference!