The theme of this Think Tank has been to discuss why we believe it is important to be aware of various issues with psychology and applied linguistics research methods and analyses. I decided to write my story about how I became aware of these issues and what I understand about them. In focusing on how I learned about conducting quantitative research and interpreting the results of published studies, my aim is to give an example of how a deeper understanding of other people’s research begins with doing your own. My journey is just beginning and is one I have undertaken with the help of textbooks, articles, and online resources, but few taught classes or discussions. I do not pretend to be an expert, but I will share some of the important lessons I’ve learned over the last five years as I continue to work on getting a PhD.
About a year after I was accepted onto a PhD programme in Applied Linguistics, it suddenly dawned on me that the research I had chosen to do would require some fairly heavy data analysis, and that I also needed to work out which statistical tests I should use and how to conduct them. I had taken two basic statistics courses as part of my undergraduate degree in psychology many years ago (based on SPSS software, at a time when the results were churned out on reams of continuous form paper rather than in Word-ready tables). The experience led me to ensure that both my undergraduate and MA theses involved analysis of only qualitative data. So, the realisation that I would now be forced to perform statistical analyses of quantitative data was scary, to say the least. But, knowing I had to at least try to do things right, I set out to learn about the dreaded S-word I’d avoided for so long.
I started by working through an introductory statistics book like this one, learning things like how to perform t-tests by hand on simple data sets. Okay, I lie, on the first attempt I read the text and dodged doing the actual maths because it takes so much longer if you do it. When I realised I hadn’t really understood the analyses, I embarked on my second attempt and this time I was a good student and completed all the exercises. After I’d worked through that book properly, I started looking for more stuff to read, to develop my rookie skills. I stumbled across a couple of great books by Cumming (2012) and Plonsky (2015), which improved my understanding. One of the most important lessons I learned about data analysis from these authors is to always visualize your data, even if you don’t intend to publish the graphs or figures. When you can see it, you can understand it. And the converse is also true.
At the same time, I also discovered that a lot of people were talking about something called “R.” In fact, almost every time I read anything about data analysis, R was mentioned. R is free software for data analysis (in contrast to SPSS, which comes with a price tag but is easier to use). There is huge online support for R, such as a series of introductory cookbooks, as well as various YouTube channels. Given such popularity, I downloaded it and signed up for a couple of MOOCs (online courses) to learn how to use it. How naïve I was!
The first thing I learned about R (often used through the friendlier RStudio interface) is that it should come with a health warning for those without a maths or programming background. The second thing I learned was to copy and paste every single line of code that actually did what I wanted it to do into a document for future reference. Other researchers may find R easier to use than I do but, for me, performing an analysis in R provides no guarantee I will be able to successfully repeat it ever again. However, since I started learning, new books on R for linguists have been released, which are easing my pain (Levshina, 2015; Winter, 2019). I keep battling with R because, with the aid of the various packages that have been created for it (such as “tidyverse”), R can produce any analysis you need, along with wickedly funky visual graphics that make me cheer like I’ve won the lottery after hours of cursing. Take a second to look at the R graph gallery, if you want to be amazed by what R can do (of course, this is way more than I can do in R).
 For readers familiar with R, I am now trying to conduct my analyses in R Markdown.
One of the most important lessons I learned about R, though, came from a PhD maths student, who kindly volunteered to help statistically-challenged people like me with their data analysis. When I showed him what I’d been doing and what I wanted to do next, his response was, “Why have you been trying to learn R? Why don’t you use JASP instead? It’s much easier and quicker to use.” He was 100% correct; I now use both software programs. R may be able to do more, but only if you can write accurate code to tell it what to do. JASP visualises and analyses your data with just a few mouse clicks. (It’s also free, and new versions are constantly being released with increased analytical capabilities.)
Overall, the learning curve has been hard. The thing that kept me going in the first few years was that, at this stage in my quest, I also read articles such as Cohen’s “The earth is round (p < .05)” (1994). Papers like this made me realise that statistics wasn’t some monolithic beast but actually a vibrant research field with its own disputes and controversies just like any other. To my surprise and near-horror, I started to get interested in statistics. Sure, I still found the maths as difficult to deal with as confronting [insert your own worst fear] but the discourse around it all, the stories and arguments, that was something I could get into. I even discovered what has become my favourite statistical catchphrase in an awesome, short, and maths-free book by Abelson (1995): “chance is lumpy” (p. 19).
These three innocent-looking words bring me to what I’ve learned about understanding published research, as they pretty neatly sum up what has been called the replication crisis in psychology and many other fields, as explained by Curtis Kelly earlier in this issue. Basically, many research studies did not include enough participants to reliably determine whether any difference between groups was due to chance or reflected a systematic difference. The more I learn about statistical analysis, the better I can appreciate why and how this crisis happened, without any intentional wrongdoing. Many people underestimate the lumpiness of chance. For example, when I was growing up in the UK, everyone knew that you waited 30 minutes for a bus and then three came along at once. This was not public transport being deliberately frustrating, but the result of chance variation in traffic, passenger numbers, and so on.
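To get a feel for how lumpy chance can be in small studies, here is a minimal simulation sketch in Python (chosen only for illustration; it is not one of the tools discussed in this article). Everything in it is an assumption made up for the example: the function name `share_of_large_gaps`, the 2,000 imaginary studies, and the cut-off of half a standard deviation. Both groups are always drawn from the same population, so any gap between their means is pure chance.

```python
import random
import statistics


def share_of_large_gaps(n_per_group, n_studies=2000, threshold=0.5, seed=1):
    """Simulate studies in which both groups come from the SAME population
    (mean 0, SD 1) and return the share of studies in which the two group
    means still differ by more than `threshold` SDs, purely by chance."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    large = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        gap = abs(statistics.mean(group_a) - statistics.mean(group_b))
        if gap > threshold:
            large += 1
    return large / n_studies


small = share_of_large_gaps(10)   # 10 participants per group
big = share_of_large_gaps(100)    # 100 participants per group
print(f"n = 10 per group:  {small:.1%} of studies show a gap above 0.5 SD")
print(f"n = 100 per group: {big:.1%} of studies show a gap above 0.5 SD")
```

With 10 participants per group, a sizeable share of these imaginary studies (in theory, about a quarter) shows a gap of more than half a standard deviation even though no real difference exists. With 100 per group, such gaps almost never appear.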
As I start to understand this issue statistically, I’m becoming able to question the analyses reported in published papers and, equally importantly, the authors’ interpretations of their results. And I’ve come to realise that this ability to think more critically about analyses is really useful. By learning even just basic statistics, I’m not only able to analyse my own research findings, I’m also becoming a better consumer of research. Which is why I now think everyone reading about quantitative research, even if they don’t intend to conduct their own such studies, should have some conceptual understanding of statistics. Even if they don’t fully understand the maths. To illustrate, I’m going to discuss aspects of one major issue, which does require the use of some statistical terms. If you need to check, they are briefly explained in a glossary at the end of this article.
So, what’s been going wrong in the research? It seems to me the biggest issue is an over-reliance on deciding whether two or more groups are different by conducting a t-test or ANOVA, without understanding why or when to use these statistical tests, coupled with the difficulty of interpreting their results as expressed by the p value. For now, I’m going to focus on the p value produced by these (and other) tests. Within the social sciences, the p value conventionally has to be less than .05, meaning that if there were really no difference between the groups, a result at least as large as the one observed would be expected to occur by chance less than 5% of the time. This value is the miracle point at which the difference detected can be declared “statistically significant.” It is a difference that is large enough that you can declare it a systematic (interesting) difference and publish your results in an academic journal (the fact that studies that fail to reach this threshold are rarely published is known as the file drawer problem).
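One way to see what a p value is counting is a permutation test, sketched below in Python. This is a swapped-in illustration, not the t-test or ANOVA mentioned above: it simply shuffles the group labels many times and asks how often chance alone produces a difference in means at least as large as the one observed. The function name `permutation_p_value` and the two sets of scores are invented for the example.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=1):
    """Estimate how often randomly relabelling the scores gives a difference
    in group means at least as large as the one actually observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend group membership was random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Made-up test scores, for illustration only
treatment = [78, 85, 90, 72, 88, 95, 81, 86]
control = [70, 75, 80, 68, 74, 83, 77, 71]

p = permutation_p_value(treatment, control)
print(f"Estimated p value: {p:.3f}")
```

The number printed is the proportion of shuffles in which a chance split of the same scores looked at least as different as the real groups did. It says nothing about why the groups differ.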
As far as I can tell, there seem to be three main critiques of how p values are interpreted. First, a p value below the .05 threshold tells you that results like yours would be unlikely if only chance were at work, but it neither supports nor rejects alternative (non-chance) explanations for patterns in the data, for example that the hypothesis you were testing (that the treatment or teaching intervention is effective) is true. Or, to use a movie example, if you hypothesise that there are more vampires in horror movies than in romantic movies, even if the results of your analysis indicate that it is very unlikely that there is no systematic difference between the two genres, this does not mean that the difference is necessarily one of the presence or absence of vampires. You need to use your knowledge of such movies to decide how likely vampires are to be a cause of the difference you observed. This problem is explained clearly and concisely on the Minitab Blog.
 For simplicity, and because I still need to learn much more about them, I’m ignoring Bayesian arguments against null hypothesis testing here, although I acknowledge their importance.
Second, p values exist on a sliding scale, but are often reported as if they were an all-or-nothing judgement, as though a value less than .05 meant there’s a significant difference between the groups and a value greater than .05 meant there’s no difference at all. Put in plainer English, it becomes clear just how ridiculous this is. If you are trying to find a difference between groups, a 4% probability of the difference being due to (the lumpiness of) chance is accorded much more significance than a 6% probability. However, the difference between 4% and 6% is not so huge in most fields (there are exceptions, such as medical research). Yet, due to the dominance of this number in deciding whether research is important or not, this difference has become a publish-or-perish career-breaker within fields such as psychology and linguistics. (Fields such as physics demand much lower p values for publication.)
The third critique of p values comes from what Cumming (2012) refers to as “the new statistics,” which offers ways to improve analyses by supplementing p values with the size of the difference between groups (known as the “effect size”) and a measure of your confidence in the results obtained, as represented by confidence intervals. This has been recommended practice by the American Psychological Association for over 20 years (Wilkinson & Task Force on Statistical Inference, 1999), yet it is still not standard practice. The first issue is that the p value only addresses whether a difference is likely to be non-random; it provides no information about the size of the difference. Especially if the data set is very large, a difference can be statistically significant without being of any practical significance. The effect size is a measure of whether the difference is large enough to make a difference.
 Cumming also includes meta-analysis in the new statistics, which involves the calculation of the probable effect size of any research finding from the results of multiple studies researching the same topic. Two examples in SLA research are Norris & Ortega (2000) and Nicklin & Vitta (2021). For an overview, see Oswald & Plonsky (2010).
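The point that a very large data set can produce a significant p value for a trivially small difference can be sketched with a few lines of Python. The summary statistics here are hypothetical (two groups scoring 50.5 and 50.0 with a standard deviation of 10), as are the function names. The effect size is Cohen's d with a pooled standard deviation, and the p value uses a large-sample normal approximation rather than a proper t-test, so treat it as a rough guide only, especially for the small-sample case.

```python
import math
from statistics import NormalDist


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardised mean difference, using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def approx_p_value(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Two-sided p value from a normal (z) approximation.
    Assumption: a stand-in for a t-test, reasonable only for large samples."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical summary statistics: the same tiny difference, two sample sizes
d = cohens_d(50.5, 50.0, 10, 10, 20_000, 20_000)
p_large = approx_p_value(50.5, 50.0, 10, 10, 20_000, 20_000)
p_small = approx_p_value(50.5, 50.0, 10, 10, 20, 20)

print(f"Cohen's d: {d:.2f}")
print(f"p with 20,000 per group: {p_large:.2g}")
print(f"p with 20 per group:     {p_small:.2g}")
```

The effect size is the same tiny 0.05 standard deviations in both cases. Only the sample size changes, and with it the p value moves from nowhere near .05 to far below it.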
The other issue is that when data sets are small, they are subject to the lumpiness of chance, which means the p value is difficult to interpret in any meaningful way, as you cannot be very confident that the average value or mean of your sample(s) is similar to the true mean (the mean for all possible participants, such as the whole population of high school students in your country). The whole point of statistics is to make an inference from your sample to the population, so this is a very big problem. It is common in SLA research (Plonsky, 2013), as well as in some areas of psychology and neuroscience, for a variety of reasons. How close an obtained mean is likely to be to the true population mean can be gauged with confidence intervals, as you can see on this visualisation. Confidence intervals tend to be wider with smaller sample sizes, indicating less certainty regarding the accuracy of the results obtained. Reporting and understanding effect sizes and confidence intervals improves data interpretation.
 Change the sample size to 100, hit return, and see what happens.
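The link between sample size and the width of a confidence interval can also be worked through numerically. The sketch below is in Python and uses made-up numbers (a sample mean of 60 and a standard deviation of 15); the function name `ci_for_mean` is likewise just for illustration. It relies on the normal approximation, so for very small samples a t-based interval would be somewhat wider than the one shown.

```python
import math
from statistics import NormalDist


def ci_for_mean(sample_mean, sample_sd, n, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: same mean and SD, three different sample sizes
widths = {}
for n in (10, 100, 1000):
    low, high = ci_for_mean(sample_mean=60, sample_sd=15, n=n)
    widths[n] = high - low
    print(f"n = {n:>4}: 95% CI from {low:.1f} to {high:.1f} (width {widths[n]:.1f})")
```

Because the margin shrinks with the square root of the sample size, each tenfold increase in participants narrows the interval by a factor of only about 3.2.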
For the reasons above and others, I now believe that, even though statistics is hard, it is important to understand something about data analysis to be able not only to conduct quantitative research but also to interpret it. Far greater minds think so, too, which is why science and statistics are evolving to improve the quality of research. Results based on small sample sizes are unreliable (Ioannidis, 2005), which is a huge factor in the replication crisis in psychology (Open Science Collaboration, 2015). But there are solutions, such as collaboration and more systematic replication. This involves not only forming research teams but also sharing ideas and analyses, known as “open science.” For example, the journal Language Learning is promoting replication studies and registered reports, in which the hypotheses and data collection and analysis are approved before the study is undertaken and publication is not dependent on obtaining significant results. The IRIS Database is a repository for tools and materials in SLA research, to promote open science within the field.
The next step in my journey is to learn how to conduct analyses using mixed effects models. I’m actually looking forward to it, too, as I know that a key advantage of such models for SLA research is that they account for facts such as students’ tendency to learn in classrooms with other students. Without the aid of statistics, I know that how any student acts and learns in my classroom is partly predicted by how all their classmates act and learn. Through the reflection involved in writing this story I’ve realised that the most important lesson I’ve learned about data analysis is that I’ve reached the stage where not everything feels like an uphill battle. I’m starting to think that possibly, probably, I can do this. And with the support of that belief, I no longer need stories and controversy to keep me interested; I want to keep moving forward because I want to be able to do more and understand more. And if you’ve stayed with me up to this point, I hope that maybe you feel that way, too.
Abelson, R. P. (1995). Statistics as principled argument. London, UK: Routledge. https://www.routledge.com/Statistics-As-Principled-Argument/Abelson/p/book/9780805805284
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. London, UK: Routledge. https://www.routledge.com/Understanding-The-New-Statistics-Effect-Sizes-Confidence-Intervals-and/Cumming/p/book/9780415879682
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
JASP Team. (2020). JASP (Version 0.14.1) [Computer software]. Retrieved from https://jasp-stats.org/
Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. Amsterdam, Netherlands: John Benjamins. https://benjamins.com/catalog/z.195
Nicklin, C., & Vitta, J. P. (2021). Effect‐driven sample sizes in second language instructed vocabulary acquisition research. The Modern Language Journal, 0, 0. https://doi.org/10.1111/modl.12692
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417–528. https://doi.org/10.1111/0023-8333.00136
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1–8. https://doi.org/10.1126/science.aac4716
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and challenges. Annual Review of Applied Linguistics, 30, 85–110. https://doi.org/10.1017/S0267190510000115
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4), 655–687. https://doi.org/10.1017/S0272263113000399
Plonsky, L. (2015). Advancing quantitative methods in second language research. London, UK: Routledge. https://www.routledge.com/Advancing-Quantitative-Methods-in-Second-Language-Research/Plonsky/p/book/9780415718349
R Core Team. (2016). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.r-project.org/
RStudio Team. (2015). RStudio: Integrated development for R [Computer software]. Boston, MA: RStudio, Inc. Retrieved from http://www.rstudio.com/
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilkinson, L., & Task Force on Statistical Inference, American Psychological Association, Science Directorate. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604. https://doi.org/10.1037/0003-066X.54.8.594
Winter, B. (2019). Statistics for linguists: An introduction using R. London, UK: Routledge. https://www.routledge.com/Statistics-for-Linguists-An-Introduction-Using-R/Winter/p/book/9781138056091
Caroline Handley, the BRAIN SIG Coordinator, is an English lecturer at Seikei University. She is currently pursuing a PhD in Applied Linguistics at Swansea University, where she is researching the relation between conceptual and linguistic knowledge in lexical processing, using an embodied cognition perspective.