Dream On: Playing Pinball In Your Sleep Does Not Make You A Better Person

(Note: this is more or less my first solo foray into unaided statistical and methodological criticism.  Normally I hitch a ride on the coat-tails of my more experienced co-authors, hoping that they will spot and stop my misunderstandings.  In this case, I haven't asked anybody to do that for me, so if this post turns out to be utter garbage, I will have only myself to blame.  But it probably won't kill me, so according to the German guy with the fancy moustache, it will make me stronger.)

Among all the LaCour kerfuffle last week, [...] do more damage to the cause of science than outright fraud.

I first noticed Hu et al.'s article in the BBC app on my tablet.  It was the third article in the "World News" section.  Not the Science section, or the Health section (for some reason, the BBC's write-up was done by their Health correspondent, although what the study has to do with health is not clear); apparently this was the third most important news story in the world on May 29, 2015.

Hu et al.'s study ostensibly shows that certain kinds of training can be reinforced by having sounds played to you while you sleep.  This is the kind of thing the media loves.  Who cares if it's true, or even plausible, when you can claim that "The more you sleep, the less sexist and racist you become", something that is not even suggested in the study?  (That piece of crap comes from the same newspaper that has probably caused several deaths down the line by scaremongering about the HPV vaccine; see [...] here; it's free and anonymous); you may be shocked by the results, especially if (like almost everybody) you think you're a pretty open-minded, unbigoted kind of person.  Hu et al.'s participants took the IAT (Implicit Association Test) twice, and their baseline degree of what I'll call for convenience "sexism" (i.e., the association of non-sciencey words with women's faces; the authors used the term "gender bias", which may be better, but I want an "ism") and "racism" (association of negative words with Black faces) was measured.

Next, Hu et al. had their participants undergo training designed to counter these undesirable attitudes. This training is described in the supplementary materials, which are linked to from the article (or you can save a couple of seconds by going directly here).  The key point was that each form of the training ("anti-sexism" and "anti-racism") was associated with its own sound that was played to the participants when they did something right.  You can find these sounds in the supplementary materials section, or play them directly here and here; my first thought is that they are both rather annoying, having seemingly been taken from a pinball machine, but I don't know if that's likely to have made a difference to the outcomes.

After the training session, the participants retook the IAT (for both sexism and racism), and as expected, performed better.  Then, they took a 90-minute nap.  While they were asleep, one of the sounds associated with their training was selected at random and played repeatedly to each of them; that is, half the participants had the sound from the "anti-sexism" part of the training played to them, and the other half had the sound from the "anti-racism" aspect played to them. The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author).

Now comes the key part of the article.  When the participants woke up from their nap, they took the IAT (again, for both sexism and racism) once more.  The authors claimed that people who were "cued" with the sound associated with the anti-sexism training during their nap further improved their performance on the "women and science" version of the test, but not the "negative attitudes towards Black people" version (the "uncued" training); similarly, those who were "cued" with the sound associated with the anti-racism training became even more unconsciously mild towards Black people, but not more inclined to associate women with science.  In other words, the sound that was played to them was somehow reinforcing the specific message that had been associated with that sound during the training period.

Finally, the authors had the participants return to their lab a week later, and take the IAT for both sexism and racism, one more time.  They found that performance had slipped --- that is, people did worse on both forms of the IAT, presumably as the effect of the training wore off --- but that this effect was greater for the "cued" than the "uncued" training topic.  In other words, playing the sound of one form of the training during their nap not only had a beneficial effect on people's implicit, unconscious attitudes (reinforcing their training), but this effect also persisted a whole week later.

So, what's the problem?   Reactions in the media, and from scientists who were invited to comment, concentrated on the potential to save the world from sexism and racism, with a bit of controversy as to whether it would be ethical to brainwash people in their sleep even if it were for such a good cause.  However, that assumes that the study shows what it claims to show, and I'm not at all convinced of that.

Let's start with the size of the study.  The authors reported a total of 40 participants; the supplementary materials mention that quite a few others were excluded, mostly because they didn't enter the "right" phase of sleep, or they reported hearing the cueing sound.  That's just 20 participants in each condition (cued or uncued), which is less than half the number you need to have 80% power to detect that men weigh more than women.  In other words, the authors seem to have found a remarkably faint star with their very small telescope [PDF].
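To get a feel for what 20 per group buys you, here is a rough sketch of the usual normal-approximation power calculation for a two-sided, two-group comparison.  The effect sizes (Cohen's d of 0.3, 0.5 and 0.8) are purely illustrative assumptions, not values taken from Hu et al., and the function names are made up for this example.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a standardized
    difference d with a two-sided test (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n participants per group (normal approximation,
    ignoring the negligible chance of significance in the wrong direction)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * math.sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    # Hypothetical effect sizes, chosen only for illustration.
    for d in (0.3, 0.5, 0.8):
        print(f"d={d}: n per group for 80% power = {n_per_group(d)}, "
              f"power with n=20 = {approx_power(d, 20):.2f}")
```

Under these assumptions, 20 per group only reaches 80% power when the true effect is large; for a medium effect (d = 0.5) the approximation asks for roughly three times as many participants.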

The sample size problem gets worse when you examine the supplemental material and learn that the study was run with two samples; in the first, 21 participants survived the winnowing process, and then eight months later, 19 more were added.  This raises all sorts of questions.  First, there's a risk that something (even if it was apparently insignificant: the arrangement of the computers in the IAT test room, the audio equipment used to play the sounds to the participants, the haircut of the lab assistant) changed between the first and second rounds of testing.  More importantly, though, we need to know why the researchers apparently chose to double their sample size.  Could it be because they had results that were promising, but didn't attain statistical significance?  They didn't tell us, but it's interesting to note that in Figures S2 and S3 of the supplemental material, they pointed out that the patterns of results from both samples were similar(*).  That doesn't prove anything, but it suggests to me that they thought they had an interesting trend, and decided to see if it would hold up with a fresh batch of participants.  The problem is, you can't just peek at your data, see if it's statistically significant, and if not, add a few more participants until it is.  That's double-dipping, and it's very bad indeed; at a minimum, your statistical significance needs to be adjusted, because you had more than one try to find a significant result. Of course, we can't prove that the six authors of the article looked at their data; maybe they finished their work in July 2014, packed everything up, got on with their lives until February 2015, tested their new participants, and then opened the envelope with the results from the first sample.  Maybe.  (Or maybe the reviewers at Science suggested that the authors run some more participants, as a condition for publication.  Shame on them, if so; the authors had already peeked at their data, and statistical significance, or its absence, is one of those things that can't be unseen.)
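The cost of peeking can be illustrated with a small simulation.  This is only a sketch under simplifying assumptions: there is no true effect at all, the outcome is normal with a known standard deviation of 1 (so a z-test is used), and the "researcher" declares success if the result is significant either at the first look (n = 21) or after adding 19 more participants.  Nothing here reproduces Hu et al.'s actual analysis.

```python
import math
import random
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value


def significant(sample):
    """Two-sided z-test of 'mean == 0', assuming a known SD of 1."""
    z = fmean(sample) * math.sqrt(len(sample))
    return abs(z) > Z_CRIT


def false_positive_rates(n_first=21, n_added=19, n_sims=20000, seed=1):
    """Simulate null data and return (rate with one test at the end,
    rate when a significant first look also counts as a 'finding')."""
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_first + n_added)]
        full = significant(data)
        fixed_hits += full
        peek_hits += significant(data[:n_first]) or full
    return fixed_hits / n_sims, peek_hits / n_sims


if __name__ == "__main__":
    fixed, peeked = false_positive_rates()
    print(f"single planned test: {fixed:.3f}")
    print(f"peek, then top up:   {peeked:.3f}")
```

With a single planned test, the false-positive rate stays near the nominal 5%; allowing one peek pushes it noticeably higher, which is why the significance threshold needs adjusting.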

The gee-whiz bit of the article, which the cynic in me suspects was at least partly intended for prompt consumption by naive science journalists, is Figure 1, a reasonably-sized version of which is available here.  There are a few problems with the clarity of this Figure from the start; for example, the blue bars in 1B and 1F look like they're describing the same thing, but they're actually slightly different in height, and it turns out (when you read the labels!) that in 1B, the left and right sides represent gender and race bias, not (as in all the other charts) cued and uncued responses.  On the other hand, the green bars in 1E and 1F both represent the same thing (i.e., cued/uncued IAT results a week after the training), as do the red bars in 1D and 1E, but not 1B (i.e., pre-nap cued/uncued IAT results).

Apart from that possible labelling confusion, Figure 1B appears otherwise fairly uncontroversial, but it illustrates that the effect (or at least, the immediate effect) of anti-sexism training is, apparently, greater than that of anti-racism training.  If that's true, then it would have been interesting to see results split by training type in the subsequent analyses, but the authors didn't report this.  There are some charts in the supplemental material showing some rather ambiguous results, but no statistics are given. (A general deficiency of the article is that the authors did not provide a simple table of descriptive statistics; the only standard deviation reported anywhere is that of the age of the participants, and that's in the supplemental material.  Tables of descriptives seem to have fallen out of favour in the age of media-driven science, but --- or "because"? --- they often have a lot to tell us about a study.)

Of all the charts, Figure 1D perhaps looks the most convincing.  It shows that, after their nap, participants' IAT performance improved further (compared to their post-training but pre-sleep results) for the cued training, but not for the uncued training (e.g., if the sound associated with anti-sexism training had been played during their nap, they got better at being non-sexist but not at being non-racist).  However, if you look at the error bars on the two red (pre-nap) columns in Figure 1D, you will see that they don't overlap.  This means that, on average, participants who were exposed to the sound associated with anti-sexism were performing significantly worse on the sexism component of the IAT than the racism component, and vice versa.  In other words, there was more room for improvement on the cued task versus the uncued task, and that improvement duly took place.  This suggests to me that regression to the mean is one possible explanation here.  Also, the significant difference (non-overlapping error bars) between the two red bars means that the authors' random assignment of people to the two different cues (having the "anti-sexism" or "anti-racism" training sound played to them) did not work to eliminate potential bias.  That's another consequence of the small sample size.
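Regression to the mean is easy to demonstrate with made-up numbers.  The sketch below simulates many small "studies" in which nothing changes between the pre-nap and post-nap tests: each score is a stable trait plus independent measurement noise.  The trait and noise spreads (0.3 each) and the sample size of 40 are arbitrary assumptions, and none of the names or values come from Hu et al.

```python
import random
from statistics import fmean


def simulate_study(rng, n=40, trait_sd=0.3, noise_sd=0.3):
    """One null study: bias scores before and after the nap for the cued and
    uncued topics, with no true change whatsoever. Returns the pre-nap gap
    (cued minus uncued) and the extra post-nap drop in the cued bias."""
    pre_c, post_c, pre_u, post_u = [], [], [], []
    for _ in range(n):
        trait_c = rng.gauss(0.5, trait_sd)  # stable bias, cued topic
        trait_u = rng.gauss(0.5, trait_sd)  # stable bias, uncued topic
        pre_c.append(trait_c + rng.gauss(0.0, noise_sd))
        post_c.append(trait_c + rng.gauss(0.0, noise_sd))
        pre_u.append(trait_u + rng.gauss(0.0, noise_sd))
        post_u.append(trait_u + rng.gauss(0.0, noise_sd))
    gap = fmean(pre_c) - fmean(pre_u)
    extra_drop = (fmean(pre_c) - fmean(post_c)) - (fmean(pre_u) - fmean(post_u))
    return gap, extra_drop


def mean_extra_drop_by_gap(n_studies=2000, seed=2):
    """Average 'cueing effect' in studies where the cued bias happened to
    start higher, versus studies where it happened to start lower."""
    rng = random.Random(seed)
    results = [simulate_study(rng) for _ in range(n_studies)]
    started_higher = [drop for gap, drop in results if gap > 0]
    started_lower = [drop for gap, drop in results if gap <= 0]
    return fmean(started_higher), fmean(started_lower)


if __name__ == "__main__":
    higher, lower = mean_extra_drop_by_gap()
    print(f"cued bias started higher: mean extra drop = {higher:+.3f}")
    print(f"cued bias started lower:  mean extra drop = {lower:+.3f}")
```

In this toy setup, studies where the cued bias happens to start out higher show, on average, an apparent "cueing benefit" even though the simulated cue does nothing.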

Similar considerations apply to Figure 1E, which purports to show that cued "learning" persisted a week afterwards.  Most notable about 1E, however, is what it doesn't show.  Remember, 1D shows the IAT results before and after the nap.  1E uses data from a week after the training, but it doesn't compare the IAT results from a week later with the ones from just after the nap; instead, it compares them with the results from just before the nap.  Since the authors seem to have omitted to display in graphical form the most direct effect of the elapsed week, I've added it here.  (Note: the significance stars are my estimate.  I'm pretty sure the one star on the right is correct, as the error bars just fail to overlap; on the left, there should be at least two stars, but I'm going to allow myself a moment of hyperbole and show three.  In any case, as you'll see in the discussion of Figure 1F, this is all irrelevant anyway.)


So, this extra panel (Figure 1E½?) could have been written up something like this: "Cueing during sleep did not result in sustained counterbias reduction; indeed, the cued bias increased very substantially between postnap and delayed testing [t(37) = something, P = very small], whereas the increase in the uncued bias during the week after postnap testing was considerably smaller [t(37) = something, P = 0.045 or thereabouts]."  However, Hu et al. elected not to report this.  I'm sure they had a good reason for that.  Lack of space, probably.

Combining 1D and 1E, we get this chart (no significance stars this time).  My "regression to the mean" hypothesis seems to find some support here.


Figure 1F shows that Hu et al. have committed a common fallacy in comparing two conditions on the basis of one showing a statistically significant effect and the other not (in fact, they committed this fallacy several times in their article, in their explanation of almost every panel of Figure 1).  They claimed that 1F shows that the effect of cued (versus uncued) training persisted after a week, because the improvement in IAT scores over baseline for the cued training (first blue column versus first green column) was statistically significant, whereas the corresponding improvement for the uncued training (second blue column versus second green column) was not.  Yet, as Andrew Gelman has pointed out in several blog posts with similar titles over the past few years, the difference between "statistically significant" and "not statistically significant" is not in itself necessarily statistically significant.  (He even wrote an article [PDF] on this, with Hal Stern.)  The question of interest here is whether the IAT performance for the topics (sexism or racism) of cued and uncued training, which were indistinguishable at baseline (the two blue columns), was different at the end of the study (the two green columns).  And, as you can see, the error bars on the two green columns overlap substantially; there is no evidence of a difference between them.
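A small worked example makes the fallacy concrete.  The numbers below (an effect of 25 with standard error 10, and an effect of 10 with standard error 10) are illustrative only and have nothing to do with Hu et al.'s data; the sketch also assumes the two estimates are independent and approximately normal, which would not hold exactly for a within-participant design.

```python
import math
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for 'estimate == 0' under a normal approximation."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))


def compare(est_a, se_a, est_b, se_b):
    """p-values for A vs zero, B vs zero, and A vs B.
    Assumes the two estimates are independent."""
    se_diff = math.hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2)
    return (
        two_sided_p(est_a, se_a),
        two_sided_p(est_b, se_b),
        two_sided_p(est_a - est_b, se_diff),
    )


if __name__ == "__main__":
    # Hypothetical effects and standard errors, for illustration only.
    p_a, p_b, p_diff = compare(25.0, 10.0, 10.0, 10.0)
    print(f"A vs 0: p = {p_a:.3f}")
    print(f"B vs 0: p = {p_b:.3f}")
    print(f"A vs B: p = {p_diff:.3f}")
```

Here A is "significant" and B is not, yet the direct comparison of A with B, which is the comparison that actually matters, is nowhere near significant.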

One other point to end this rather long post.  Have a look at Figure 2 and the associated description.  Maybe I'm missing something, but it looks to me as if the authors are proudly announcing how they went on a fishing expedition:
Neurophysiological activity during sleep—such as sleep spindles, slow waves, and rapid-eye-movement (REM) duration—can predict later memory performance (17). Accordingly, we explored possible relations between cueing-specific bias reduction and measures of sleep physiology. We found that only SWS × REM sleep duration consistently predicted cueing-specific bias reduction at 1 week relative to baseline (Fig. 2) [r(38) = 0.450, P = 0.005] (25).
They don't tell us how many combinations of parameters they tried to come up with that lone significant result; nor, in the next couple of paragraphs, do they give us any theoretical justification other than handwaving why the product of SWS and REM sleep duration (whose units, the label on the horizontal axis of Figure 2 notwithstanding, are "square minutes", whatever that might mean) --- as opposed to the sum of these two numbers, or their difference, or their ratio, or any one of a dozen other combinations --- should be physiologically relevant.  Indeed, selecting the product has the unfortunate effect of making half of the results zero - I count 20 dots that aren't on the vertical axis, for 40 participants.  I'm going to guess that if you remove those zeroes (which surely cannot have any physiological meaning), the regression line is going to be a lot flatter than it is at present.
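How much the number of attempts matters can be sketched with two textbook formulas: the chance of at least one false positive among k tests, and the Bonferroni-corrected threshold.  The number of predictors tried is unknown, so the values of k below are hypothetical, and the first formula additionally assumes the tests are independent, which correlated sleep measures would not be.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests,
    each run at level alpha, when no real effect exists."""
    return 1 - (1 - alpha) ** k


def survives_bonferroni(p, k, alpha=0.05):
    """True if a p-value still clears the Bonferroni threshold alpha / k."""
    return p <= alpha / k


if __name__ == "__main__":
    reported_p = 0.005  # the P value quoted for SWS x REM in the article
    for k in (1, 5, 11, 12, 20):  # hypothetical numbers of predictors tried
        print(f"k={k:2d}: chance of a fluke = {familywise_error(k):.2f}, "
              f"p=0.005 survives Bonferroni: {survives_bonferroni(reported_p, k)}")
```

Under these assumptions, a dozen candidate predictors would give close to a coin-flip chance of at least one spurious "significant" correlation, and the reported P = 0.005 would stop clearing a Bonferroni threshold once more than about ten predictors had been tried.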

Bottom line: I have difficulty believing that there is anything to see here.  We can put off the debate about the ethics of subliminally improving people for a while, or at least rest assured that it's likely to remain an entirely theoretical problem.




(*) Incidentally, each red- or green-coloured column in one of the panes of Figure S3 corresponds to approximately five (5) participants.  You can't even detect that men are taller than women with that.
