Thursday, June 7, 2018

Why I Love Bayesian Statistics for Developmental Research

"Have you heard about Bayesian statistics?"

Let's start by acknowledging one thing: People who say you should use Bayesian statistics are super annoying. The first few times some guy at a conference told me I should really be using Bayesian methods, I treated him like a Jehovah's Witness handing me a copy of The Watchtower. "I'll think about it," I'd say, and mentally slam the door in his face.

But then I got some very nice colleagues who started teaching Bayesian methods to my grad students,* and over the years, my grad students have shown me how these methods really can be useful in developmental research.

And now it's really easy, because in the last couple of years, people have made free statistical software programs like JASP (see No. 6, below) that make Bayesian methods as easy to use as frequentist ones, even if you're not a 'stats head.' Here are some of my favorite things about Bayesian data analysis for developmental studies.

1. Optional stopping. (Saves time and money!)


With frequentist statistics (i.e., the regular old kind), you're not supposed to look at your data before you've finished collecting them. For example, you're not supposed to test 16 babies, see if there's an effect, test a few more, look again, etc. Nope, that's cheating. It's one of the statistical no-no's that fall under the label of 'p-hacking.'  It's a no-no because frequentist tests depend on the assumption that you decided ahead of time how many kids you were going to test, and that's how many you tested, and then you stopped testing and analyzed the data all at once.
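If you like seeing this with numbers, here's a little simulation sketch in Python (I know, I know--see what I say about coding in No. 6). The sample sizes, peeking schedule, and data are all made up for illustration: it generates studies where there is truly no effect, peeks at the p-value every few kids, stops as soon as p dips below .05, and the false alarm rate comes out noticeably higher than the advertised 5%.

    # A toy simulation of why peeking at p-values is a no-no: even with no real
    # effect, checking after every few kids and stopping at p < .05 produces
    # 'significant' results far more often than 5% of the time.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, max_n, false_alarms = 2000, 60, 0
    for _ in range(n_studies):
        scores = rng.normal(0.0, 1.0, size=max_n)   # no effect at all: true mean is 0
        for n in range(16, max_n + 1, 4):           # peek at n = 16, 20, 24, ...
            if stats.ttest_1samp(scores[:n], 0.0).pvalue < .05:
                false_alarms += 1                   # we 'found' an effect that isn't there
                break
    print(f"false alarm rate: {false_alarms / n_studies:.2f} (nominal rate is .05)")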

But with Bayesian methods, you can look at your data as much as you want. You can look at the data after every dang kid if you want to, and it won't mess anything up! So you can keep testing until you find an effect, and then stop. Think about how much time and money that saves!
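And here's a rough sketch of the Bayesian version of the same idea. It assumes a toy design (not any particular study) in which each baby either shows the effect or doesn't, and it compares 'the rate is 50/50' against 'the rate could be anything'; the stopping thresholds of 10 and 1/10 are just illustrative, and the outcomes are fake. (The Bayes factor it monitors is explained in No. 2, below.)

    # A toy 'peek after every kid' procedure using a Bayes factor instead of a p-value.
    # H0: each baby shows the effect with probability .5 (chance).
    # H1: the probability could be anything from 0 to 1 (a flat prior).
    # Both marginal likelihoods have simple closed forms for this binomial setup.
    from math import comb

    def bf10_binomial(k, n):
        """Bayes factor for H1 over H0, given k of n babies showing the effect."""
        p_data_h0 = comb(n, k) * 0.5 ** n   # likelihood of the data under chance
        p_data_h1 = 1.0 / (n + 1)           # binomial likelihood averaged over a flat prior
        return p_data_h1 / p_data_h0

    outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1]  # fake data
    k = 0
    for n, x in enumerate(outcomes, start=1):
        k += x
        bf = bf10_binomial(k, n)
        print(f"after {n:2d} babies: BF = {bf:6.2f}")
        if bf > 10 or bf < 1 / 10:          # stop once the evidence is conclusive either way
            print("That's conclusive--stop testing.")
            break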

2. You can find evidence for the null hypothesis. (New superpower!)


When you were reading the point about stopping above, maybe you thought, "Well, that's fine if you see an effect, but what if you don't see one? What if you test 16, 20, 24, 30 children and there's still no effect . . . how do you decide when to stop?"

This is the second great thing about Bayesian methods: Not only can you find evidence that there is an effect, you can also find evidence that there isn't one.

For example, say you are studying empathy in preschoolers, and you want to know whether boys and girls perform differently on some empathy task. So you test a bunch of boys and a bunch of girls on this task, and now you want to know whether the mean scores of the two groups are the same or different.

If you do a frequentist t-test, there are basically two possible outcomes:

  • p<.05, which means you have found a difference between the groups. ('Reject the null.')
  • p>.05, which means you have not found anything. ('Fail to reject the null.')

If p>.05, have you shown that the two groups are the same? Nope, not at all. (People sometimes think you have, but you haven't.) Basically, it means you can't say anything: The groups might be different or they might not.

If you're asking, "Well, then how can I use a p-value to show that the two groups are the same?"--that's just it: you can't.

That's one of the reasons for p-hacking. With frequentist tests, if you don't have a significant p-value, you don't have a finding, which means you don't have a publication. So people are desperate to find a significant p-value, and they start doing questionable things.

But with Bayesian statistics, you can actually find evidence for the null. When you do a Bayesian t-test instead of a frequentist one, the result you get is not a p-value, but a number called a Bayes factor. And it can show evidence for your effect, evidence against your effect, or it can say you don't have enough evidence to decide.  For example, let's say you set up your Bayesian t-test like a frequentist t-test, where the null hypothesis is that there's no difference between the boys' and girls' scores, and the alternative hypothesis is that there is a difference (in either direction). So you will get a Bayes factor for the alternative hypothesis.

  • If this Bayes factor is greater than 1, you have evidence that the groups are different: The higher the number, the more sure you can be. A Bayes factor of 3 means the data are three times more likely if the groups are different than if they are the same, so your odds in favor of a difference should now be three times what they were before you saw the data. A Bayes factor of 10 means those odds should get ten times stronger; a Bayes factor of 115.8 means they should get 115.8 times stronger, and so on. (This is evidence for an effect, like when you have a p-value of <.05.)
  • If the Bayes factor is close to 1, that means your evidence is inconclusive: You can't tell from these data whether the groups are the same or different. (This is an absence of evidence, like when you have a p-value of >.05.)
  • If the Bayes factor is less than 1, that means you have evidence that the groups are the same. A Bayes factor of 1/3 means your odds in favor of 'the groups are the same' should be three times stronger than when you started; 1/10 means they get 10 times stronger; 1/115.8 means they get 115.8 times stronger, and so on. (This is evidence of absence--evidence for the null--something you can never get with a p-value.)
In other words, whereas a frequentist t-test just lets you say either, 'The groups are different' or 'I don't know,'  a Bayesian t-test lets you say that this evidence makes you more sure the groups are different, or that it makes you more sure they are the same, or that it doesn't really tell you anything either way. So cool, right?
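If you're curious what's under the hood, here's a rough sketch of a Bayesian two-group comparison. Fair warning: it uses a quick BIC-based approximation to the Bayes factor, not the default Cauchy-prior Bayes factor that JASP reports, so the numbers won't match JASP exactly, and the 'empathy scores' are simulated just for illustration.

    # A rough sketch of a Bayes factor for 'boys and girls differ' vs 'they don't.'
    # This is a quick BIC-based approximation, NOT JASP's default Bayes factor,
    # and the data are made up.
    import numpy as np

    def bf10_two_groups(x, y):
        """Approximate BF for 'the group means differ' over 'the means are the same.'"""
        data = np.concatenate([x, y])
        n = len(data)
        sse_same = np.sum((data - data.mean()) ** 2)                          # one shared mean
        sse_diff = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)  # separate means
        delta_bic = n * np.log(sse_diff / sse_same) + np.log(n)               # H1 has one extra parameter
        return np.exp(-delta_bic / 2)                                         # BF10 ~ exp(-delta_BIC / 2)

    rng = np.random.default_rng(0)
    boys = rng.normal(10.0, 2.0, size=24)    # fake empathy scores
    girls = rng.normal(10.5, 2.0, size=24)
    bf = bf10_two_groups(boys, girls)
    print(f"BF = {bf:.2f}  (above 1 favors a difference, below 1 favors 'no difference')")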

3. No more worrying about power. (Yay!)


Someone told me a long time ago that p-values took power into account, and that if I didn't have enough data to answer a question, the p-value wouldn't be significant. So I thought that a p-value of <.05 automatically meant I had enough power. Ummm . . . nope.

Turns out that plenty of studies in psychology, including developmental psychology, are waaaaay underpowered, even though they report significant p-values. That means we can't really tell if the findings are real or just a fluke.  (We can only believe in those findings after they've been replicated a few times. But direct replication is pretty rare in our field-- reviewers and journals seem to think it's boring. I hope that will change.)

Well, Bayes factors do tell you whether you have enough data, in pretty much the way I (incorrectly) thought p-values did. If you don't have enough data, the Bayes factor will stay close to 1. So if you get a Bayes factor far above or below 1, you automatically know that you have enough data to answer the question. That's why you can stop when you see an effect (see No. 1, above).
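To see what "the Bayes factor stays close to 1" looks like, here's a tiny illustration using the same toy binomial Bayes factor from No. 1 (so again, a sketch, not a real study). The observed proportion is the same each time--75% of kids show the effect--but with only 8 kids the Bayes factor is about 1 (no idea either way), while with 80 kids it's in the thousands.

    # Same toy Bayes factor as in No. 1 (H0: rate = .5, H1: flat prior on the rate).
    # Same observed proportion, different sample sizes: only the bigger samples
    # give a Bayes factor far from 1.
    from math import comb

    def bf10_binomial(k, n):
        return (1.0 / (n + 1)) / (comb(n, k) * 0.5 ** n)

    for k, n in [(6, 8), (15, 20), (30, 40), (60, 80)]:
        print(f"{k}/{n} kids show the effect: BF = {bf10_binomial(k, n):8.2f}")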

4. Bayes is more intuitive. (No joke.)


At first I thought Bayesian statistics were hard to understand. But that's because I was used to doing things the frequentist way. When I tried to explain things to my kids, I found it was really easy to explain Bayesian reasoning, and hard to explain frequentist reasoning.

True story: A month or two ago, I was driving with both of my sons (ages 12 and 18) to drop the older one off at the train station. One of them asked me to explain Bayesian statistics.

I said, well, we're driving to the train station, right? And we're not there yet, but we know that the train might be there when we arrive, or it might not. It might just be late, or maybe some crazy thing happened, like it got blown up by terrorists.

So before we even get to the train station, based on our previous experiences, let's say we think there's a 50% chance the train is already there, a 49% chance that it's not there because it's running late, and a 1% chance it's not there because it was blown up by terrorists. That's called our 'prior' distribution.

So say we arrive at the station, and the train is not there. That's evidence-- it's something we observed. So based on that evidence, we update our set of probabilities. Now we think there's a 0% chance that the train is there, a 98% chance that it's running late, and a 2% chance that it got blown up by terrorists. This is called the 'posterior' distribution. The way Bayesians say it is that you 'update' from the prior to the posterior distribution, based on the evidence.

So the chances that it was blown up are still really small, but they're bigger than before, because we have a little bit of evidence that would be consistent with the train getting blown up. (If it got blown up, it wouldn't be here on time.) The fact that the train isn't here (the evidence) is equally consistent with the train running late or being blown up. It's our prior belief (that running late is common and terrorist attacks are rare) that makes us think now there's still only a 2% chance that the train was blown up.

My kids were like, "Yep, that makes sense. So how would the other kind of statistics, the not-Bayesian kind, do this problem?" And. . . I couldn't figure out what to say. I had to think about it for a long time, and I eventually came up with some story that involved multiple trains, and averaging over their arrival times . . . but it was really difficult, and I can't even remember the story now, and the kids were bored.

Maybe someone reading this could have come up with a much better train story to illustrate frequentist reasoning, but I couldn't. And when you consider that I've been using frequentist statistics for 20 years and Bayesian statistics only recently, that's pretty amazing.
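For anyone who wants to see the arithmetic, here's the train story as a few lines of code, using the same numbers as above:

    # The train example as arithmetic: multiply each prior probability by how
    # likely the evidence ('the train is not at the station') is under that
    # hypothesis, then re-normalize so the probabilities add up to 1.
    prior = {"train is there": 0.50, "running late": 0.49, "blown up": 0.01}
    likelihood_not_there = {"train is there": 0.0, "running late": 1.0, "blown up": 1.0}

    unnormalized = {h: prior[h] * likelihood_not_there[h] for h in prior}
    total = sum(unnormalized.values())
    posterior = {h: p / total for h, p in unnormalized.items()}

    for h, p in posterior.items():
        print(f"P({h} | train not there) = {p:.2f}")   # prints 0.00, 0.98, 0.02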

5. You can build in what you already know. (Fancy!)


In the train example, the prior distribution was whatever we believed about the trains before we got to the station. If we were placing bets, we might have said, "OK, I'll put $50 on 'Train is there,' $49 on 'Train is late,' and $1 on 'Terrorist attack.'" 

But if we had different ideas going in, we could have started with different priors. For example, if we had been taking this train every week for the past year and it was usually at the station when we got there, we'd put a high probability on it being there. Maybe we'd say there was a 90% chance it would be there, a 9.9% chance it would run late, and a 0.1% chance it would be blown up.

Or if we knew for sure that it would arrive within an hour of the scheduled time, but we didn't want to predict anything more specific than that, we could say that our prior puts equal probability on every arrival time from one hour before the scheduled time to one hour after it, but nothing outside of those times. In other words, we think there's zero chance that the train will be more than an hour early or more than an hour late, but all the times in between those two limits are equally likely. 

Frequentist tests don't have priors, which means there's no way for you to build what you already know into your test. (Of course, in a Bayesian analysis, if you know absolutely nothing about what you're testing ahead of time, you can set a prior that makes equal bets on all possible outcomes. That's called an 'uninformative prior.')

Being able to set a prior is great. It lets you test a whole range of different hypotheses. To take a totally fictional example, let's say you are studying the effects of chronic ear infections on language development. These ear infections basically make children deaf so they don't hear language for a few weeks or even months at a time, and you want to study the effect on their language learning.  

Previous studies in this literature going back 20 years are a mixed bag: Some say that chronic ear infections have a slight negative effect on language development; others find no effect. Importantly, no studies find really big negative effects, and of course no studies have reported positive effects. In other words, ear infections might cause a slight delay in language learning or they might not, but they definitely don't cause huge problems, and they definitely don't help.

If you were asking this question using frequentist methods, you could avoid looking for a positive effect by using a one-tailed test. But how could you look for a small negative effect without also looking for a big one? A one-tailed frequentist test gives you no way to say that some negative effects are more plausible than others: a delay of 2 months and a delay of 200 years are both just 'a negative effect.'

Using Bayesian tests, you can set your priors in a way that reflects what you believe. You can specify that the possible answers range from no effect at all to whatever you think the largest effect could realistically be. Let's say you think the biggest effect possible is that ear infections delay language development by five years. You actually think the true average delay is somewhere in the range of a few months. But in any case, you feel sure it's not more than five years.

So when you set up your priors, you place zero prior probability on ear infections having a positive effect (so your analysis will not consider the possibility that ear infections might make language development better), and you also place zero prior probability on any negative effect causing more than a five-year delay.

Why is this better than just using an uninformative prior, which is more or less the situation a frequentist test leaves you in? Because limiting the space of possibilities makes the test more powerful. The narrower and more specific your priors are, the more evidence you can get from the same number of data points.
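Here's a sketch of that idea with completely made-up numbers. To keep it simple, it pretends we measure each child's language delay in months, that we magically know the spread of the scores, and that the prior is flat over whatever range we allow. With the informative prior (a delay somewhere between 0 and 60 months), the very same data give roughly twice as big a Bayes factor as with a vague prior that also allows ear infections to help--because the informative prior doesn't waste any of its bets on outcomes we already know aren't going to happen.

    # A sketch of how an informative prior buys you more evidence from the same data.
    # Everything is made up: the delays, the 'known' sigma, and the prior ranges.
    import numpy as np
    from scipy import stats, integrate

    delays = np.array([9.2, 4.1, 11.5, 6.8, 7.3, 12.0, 3.5, 8.9, 10.3, 7.7])  # months
    sigma = 6.0                                   # pretend we know the spread of scores
    n, xbar = len(delays), delays.mean()
    se = sigma / np.sqrt(n)

    def bf10_uniform_prior(lower, upper):
        """BF for 'there is some effect, flat on [lower, upper] months' vs 'no effect.'"""
        marginal_h1, _ = integrate.quad(
            lambda delta: stats.norm.pdf(xbar, loc=delta, scale=se) / (upper - lower),
            lower, upper, points=[xbar])          # tell quad where the likelihood peaks
        marginal_h0 = stats.norm.pdf(xbar, loc=0.0, scale=se)
        return marginal_h1 / marginal_h0

    print("informative prior (0 to 60 months):", round(bf10_uniform_prior(0.0, 60.0), 1))
    print("vague prior (-60 to +60 months):   ", round(bf10_uniform_prior(-60.0, 60.0), 1))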

This is actually how we reason in real life all the time. If my 12-year-old son starts looking fat, my prior knowledge tells me that it's probably because of the sugary drinks and pastries he's been buying every day from the Starbucks next to his new school.  I'm not 100% sure of that, of course . . . for example, there's a small chance that he might have developed some rare endocrine disorder. But I don't rush him to an endocrinologist right away; first we cut back on the Starbucks, because that's the most probable explanation. If he stops eating junk but keeps gaining weight, or if he develops other symptoms, then I might take him to a doctor. 

And here's an important point: No matter how fat he gets or what his symptoms are, I never think he might be pregnant. The prior probability of his being pregnant is zero, so it's not even in the space of possibilities I consider. Other examples of zero-probability explanations include (a) he has been possessed by evil spirits, (b) he got fat because he started using wireless headphones, and (c) he is being deliberately fattened up by NASA scientists who sneak into our house at night to inject him with chemicals.

In other words, prior to him getting fat, I already had a set of hypotheses about things that could happen to him. Bad diet? Absolutely. Sudden endocrine disorder? Eh, maybe. Pregnancy? No way. I observed some evidence (he gained weight) and I combined that information with my prior beliefs to come up with a new set of beliefs: I think he's probably getting fat because of a bad diet; probably not because of some rare disease, and definitely not because he's pregnant. That's what Bayesians mean when they talk about first specifying a prior distribution, then adding evidence (the data) and then updating to a posterior distribution.

I'll be honest, changing the priors for your analysis isn't quite as easy (yet) as just picking the Bayesian t-test instead of the frequentist t-test in JASP. And if the idea of setting priors freaks you out, you don't have to worry about it. You can just use the uninformative prior that JASP automatically assigns, which is sort of like saying, "I have no prior beliefs."

And speaking of frequentist tests, they have assumptions built into them too. In a frequentist analysis, you're always making choices: whether to treat the data as normally distributed, whether to assume equal variances, whether to treat the observations as independent, and so forth. Maybe you're used to just letting the stats program make those decisions for you, but don't kid yourself-- decisions are being made.

With Bayesian statistics, once you get comfortable with them, you can make the decisions yourself. 

6. JASP is free and makes it easy to do Bayesian stats. (YouTube tutorials for everything!)


So for a few years now, I've known that Bayesian statistics had advantages over frequentist ones, but they had one big disadvantage, which was that there was no easy, user-friendly statistics program like SPSS or SAS that would do Bayesian tests. You had to use something like R or Matlab, and write code, and who wants to bother with that? (I can see the stats guys smirking when I say I don't want to write code. Listen, Imaginary Stats Guy, last time I counted, my lab had tested over five thousand kids. How many have you tested? Oh, none? Yeah-- that's what I thought. I guess that's why you have so much time to write code.)

But a year or two ago, some lovely and not-at-all smug people made a free, super easy-to-use statistics program called JASP that makes doing Bayesian tests as simple as doing frequentist ones. No coding necessary. And whenever I have a question about something, I watch one of the gazillions of tutorial videos on their YouTube channel. They are so helpful and friendly and fun to watch. It makes me really happy. :-)

7. All reviewers are familiar with Bayesian statistics. (Not.) 


Yes, this one is a joke. The one awkwardness left about using Bayesian statistics in developmental papers is that some reviewers are unfamiliar with them. I worry that they'll think I must be hiding something, or else why wouldn't I just report p-values and effect sizes like a normal person? (Sometimes we report both frequentist and Bayesian tests, and reviewers seem okay with that.)

But I'm not the only one in our community already using these methods (see other examples here and here), and I'm hoping that if we start talking about them more, others will join us. If you want to read more about Bayesian methods for psychologists, here are a whole bunch of great articles to get you started, and here's one specifically for developmental researchers.  Or you could just download JASP and start messing around with it. Have fun!



*I am grateful to my current and former grad students, Emily Slusser, James Negen, Meghan Goldman, Ashley Thomas and Emily Sumner, for teaching me The Ways of Bayes, and to my wonderful colleagues Michael Lee and Joachim Vandekerckhove for teaching all of us.

16 comments:

  1. Super informative and compelling to someone who actually uses both JASP and R but hasn't got a proper introduction to Bayesian analyses and why I need them. I'm glad I got to read this! Thanks for sharing!

  2. I am one of those possibly annoying statisticians, so apologies in advance. But in my defence, I do experimental work as well.

    Some corrections:

    1. Frequentist statistics allows multiple peeks, as long as the alpha correction is done accordingly. Clinical trials do this all the time, to avoid wasting patients' time and to avoid damaging their health. See Pocock's book on clinical trials.

    2. It's not true that the null hypothesis cannot be argued for in frequentist statistics. Two one-sided t-tests are routinely used to establish bio-equivalence of generic vs brand name drugs. This procedure is no different from a regular t-test.

    3. You're right that you don't have to worry about power, but you do have to worry about imprecision in your estimates. Even in Bayes, when estimates have large imprecision (large SEs) the estimates will tend to fluctuate wildly, and will sometimes be overestimates. If you do want accurate estimates (and I always do), you have to worry about precision.

    4. Bayes factors should always be computed with a range of priors on the target parameter. They can vary a lot depending on the prior, so it's never a plug-and-play affair with Bayes. This is why I am wary of quick-fix software. I have a lot of respect for the JASP developers and I know where they are coming from. It makes a lot of sense to get people into Bayes like this. But eventually one has to get one's hands dirty. You sound contemptuous of the idea of coding; but perhaps if you reconsider this with an open mind, you may one day write a blog post talking about the empowering effect of coding :). I have been down that road, I also started with SPSS and Excel, and I remember wandering the corridors of the linguistics department at Ohio State asking random colleagues passing by what an ANOVA is. Coding ability gives you access to tools like JAGS and Stan, and these are pure gold. See some of our recent published work on my home page. We could have never done it with a point and click software.

    Replies
    1. Interesting that you present your comments as “corrections” rather than points of disagreement. (Are you familiar with the term ‘mansplaining’?) Although I’m not a statistician, I did seek feedback on this post from three actual statisticians before publishing it, and I’m happy with the content.

      Responding to your last point, I'm not contemptuous of coding at all; for many research questions, it's essential. But there are also people doing great experimental work where the statistics are the least interesting part of it. The innovation in their work is in the theory and design, not in the analysis--which may require nothing more than t-tests, ANOVAs, tests for inter-rater reliability, etc. If your research requires you to code up experiments or models, then coding is a necessary professional skill . . . for YOU. But if JASP does everything I need, and lets me answer the questions I want to ask, then what's the problem? (And why on earth do you care what statistics package I use anyway? Don't you have your own work to do?)

    2. I found both the post and Shravan's response really informative! Congrats to both Barbara and Shravan for sharing their points of view.

    3. First off, I apologize if my comments came across as `mansplaining.' I'm familiar with the feeling; I look and sound like a third-worlder (that's because I am one), and even though I am male, I often have white people `whitesplaining' stuff to me, because of course I couldn't know anything. So I know the feeling! I've been handed 50 cents outside a toilet in Hamburg while waiting for my wife; the guy who gave me the money thought I was the toilet attendant. I've also been mistaken for a taxi driver in Berlin, and taxi drivers are routinely surprised when they ask me what I do for a living ("you don't look like a professor"). So I'm quite used to being talked down to by white people in general, and I know how it feels. It's not pleasant, and I am unhappy that I sounded like that to you in my comment.

      I won't argue with you on any of these points but I would like to point you to my tutorials. E.g., I discuss the point about Bayes factors (among other things) here: https://osf.io/g4zpv/

      I must say I don't feel like an expert on these matters, not yet anyway. There are many gaps in my understanding and I am happy to be corrected. The osf site I link to above gives an opportunity for comments and corrections, and anyone should feel free to post corrections there. I have an online supplement where I can add corrections to the paper, if they are needed.

    4. Barbara, thank you for writing this. I think many of us studying development who have read about and used Bayesian Statistics have wanted to share something like this with others, but I'm grateful you got to it first because I don't think I could have done nearly as good of a job.

      Shravan, your 2nd point about testing a null hypothesis under NHST with two one-sided tests is a very interesting idea, one that I hadn't come across before. Thank you for pointing this out.

      Another argument for accepting the null under the NHST approach comes from Edouard Machery (although I think the original argument traces back to J. Cohen), who suggests that setting beta = .05 (power = .95) and deciding to accept H0 is logically equivalent to setting alpha = .05 and deciding to reject H0. Machery also directly addresses some of the concerns raised by Hoenig and Heisey (2001) about using power to defend negative results. For the full argument:
      Machery, E. (2012). Power and negative results. Philosophy of Science, 79(5), 808-820.


    5. Thanks to both of you for these thoughtful and helpful comments. And thanks, Shravan, for making your tutorials available on OSF, and for your gracious reply. Let's hope that a generation from now, we will have to define both 'mansplaining' and 'whitesplaining' for the youngsters, because the experiences will be completely unfamiliar to them.

  3. Now I understand why Reviewer #2 suggested Bayesian analysis and why I should learn how to do it. Thank you!

    Replies
    1. Thank you! To know that someone read this post and felt less intimidated, and more empowered to learn something new, makes me really happy.

  4. I arrived at this post via the cogdev listserv. Thank you so much for this helpful information! It'll be incredibly useful as I pursue the next steps in academia.

    Replies
    1. You're very welcome, and good luck with your next steps!

  5. Great post -- informative and entertaining (such a good combo)!

  6. I'm bookmarking this one, Barbara. Thanks! :)
