"Evidential Value in ANOVA-Regression Results in Scientific Integrity Studies"

Comments (27):

Peer 1:

( June 8th, 2015 9:51am UTC )

Sections 5 and 6 of the paper show that the author himself knows that what he is doing is constructing a frequentist test statistic. His own explanations show that his "likelihood ratio" must not be interpreted as a likelihood ratio. I think the term "evidential value" is misleading, since in forensic statistics LR and evidential value are almost synonymous.

He performs the statistical test on 12 different studies contained in the already retracted paper Förster and Denzler (2012). He gets a value in the upper 25% of the null distribution, **every time**. Certainly, something seems to be wrong there.

In the report by Peeters, Klaassen and van der Wiel, in which the same methodology is used to investigate further publications, the authors are again clearly operating as frequentist statisticians, computing test statistics and rejecting a rather technical null hypothesis if the p-value is too small. Apparently they consider an alpha of 8% enough to reject the null-hypothesis of "veracity". See their table in paragraph 1.5 (bottom of page three).

The interesting thing is that this small deviation is very systematically maintained across a lot of publications. And that is not because they tested 140 publications and found 12 which were significant at the 8% level. I suspect systematic QRPs. In fact, Denzler and Liberman explicitly explain that their statistical analyses do not usually reflect the actual statistical design of their experiments.

Unregistered Submission:

( June 8th, 2015 10:51am UTC )

Perhaps another team of statisticians should write a report about the problematic nature of the statistics used in the 'PKW report' which investigated the problematic nature of several studies of Förster (et al.)? Additionally, this counter-counter-report could suggest retracting the counter-report which suggested retracting Förster's studies.

Sarcasm aside, this discussion does emphasize the need for more reserve in publicly assessing the 'veracity' of any study. As such assessments can themselves be problematic (as many researchers seem to agree), they should be handled with great care; such investigations can be a death sentence for one's academic career. If the suspicions turn out to be well grounded, then this is of course acceptable, but as long as there is still significant room for discussion, the reputation of any researcher involved should surely be protected.

Innocent until proven guilty should be the modus operandi.

Peer 1:

( June 8th, 2015 12:08pm UTC )

Since there is a maximization over three parameters, over "half" of a parameter space, I would expect the null distribution of twice log V to be reasonably approximated by 50% point mass at zero and half of a chi-squared(3) on the positive half-line. The author likes the critical value 6, which would correspond to a p-value of 15%. The author suggests that the right answer is about 8%.
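This is easy to check numerically. The sketch below assumes the half-point-mass approximation just described (not Klaassen's exact null distribution), and reproduces the ~15% figure for the critical value 6:

```python
import math
from scipy.stats import chi2

def mixture_sf(x, df=3):
    """P(2 log V >= x) under the approximation: 1/2 point mass at zero
    plus 1/2 of a chi-squared(df) on the positive half-line."""
    if x <= 0:
        return 1.0
    return 0.5 * chi2.sf(x, df)

# the critical value V = 6 corresponds to 2 log V = 2 log 6, about 3.58
print(round(mixture_sf(2 * math.log(6)), 3))  # -> 0.155
```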

I'm not sure what likelihood ratio a value of V = 6 should represent to a Bayesian. Clearly a value of V equal to 1 should represent a likelihood ratio which is *favourable* to the "defence" hypothesis.

One could think of using Fisher's combination method with approximate p-values obtained from my 1/2 point mass at zero, 1/2 chi-squared(3) approximation. However, this will tend to give much too large values (i.e. it will reject the null hypothesis much too easily), since under the null hypothesis 50% of the individual tests will have a p-value of 0.5, which really stands for a whole range of values between 1.0 and 0.5.

It would be better to use a normal approximation to a sum of independent "1/2 point mass at zero, 1/2 chi-squared(3)" random variables.
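A sketch of that normal approximation (my own, assuming the mixture approximation above and that the number of studies is large enough for the CLT to be adequate):

```python
import numpy as np
from scipy.stats import norm

# Moments of the "1/2 point mass at 0, 1/2 chi-squared(3)" variable:
# E[X] = 0.5 * 3 = 1.5;  E[X^2] = 0.5 * (2*3 + 3**2) = 7.5, so Var[X] = 5.25.
MU, VAR = 1.5, 5.25

def combined_pvalue(two_log_vs):
    """One-sided p-value for the sum of k independent 2 log V values,
    via a normal approximation to the sum."""
    k = len(two_log_vs)
    z = (np.sum(two_log_vs) - k * MU) / np.sqrt(k * VAR)
    return float(norm.sf(z))
```

For instance, eleven values all equal to the null mean of 1.5 give a combined p-value of exactly 0.5, as they should.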

Right now, my feeling is that V should not be called "evidential value".

I would not call a value of V at least as large as 6 "substantial".

If a value of 6 corresponds to a p-value of 0.15, then the chance of getting at least 3 "substantial V's" in 11 independent experiments is about 22% (hope I got this right).

I would not call this "strong evidence for low scientific veracity". (See the table in section 1.4, page 2, of Peeters et al).
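These binomial tail calculations can be checked directly; the snippet below gives the chance of at least 3 "substantial" V's out of 11 independent experiments, for a few candidate per-experiment tail probabilities (0.15 and 0.08 from this discussion, 0.05 for comparison):

```python
from scipy.stats import binom

# P(at least 3 threshold exceedances in 11 independent experiments)
# for several per-experiment tail probabilities P(V >= 6):
for p in (0.15, 0.08, 0.05):
    print(p, round(binom.sf(2, 11, p), 3))  # sf(2, ...) = P(X >= 3)
```

With p = 0.15 this gives about 0.221, with p = 0.08 about 0.052, and with p = 0.05 about 0.015.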

Peer 2:

( June 8th, 2015 12:11pm UTC )

I've done some simulations under various settings. It seems that, across the tested conditions, about 55-60% of all V-values are 1 (and thus 2 log(V) = 0).

Peer 1:

( June 8th, 2015 12:39pm UTC )

Great - just what I expected. Please also check whether twice the log of the V-values larger than 1 follows a chi-squared(3) distribution.

Peer 2:

( June 8th, 2015 1:44pm UTC )

No, they don't.

I've simulated 400,000 samples, based on mu = (1, 2, 3), s = (1, 1, 1) and n = 3*ni = 30. Then, based on n, the sample means and sample sd's, I've computed the V-values.

Sometimes, rather than a V-value itself, the dataveracity.R script gives a lower and an upper bound. Consider two extreme cases: V always attains the lower bound, and V always attains the upper bound. The chi-squared(3) distribution should fall between these two extremes (apart from sampling variation, which we can assume to be minor, since I've drawn 400,000 samples).

For the lower- and upper-bound versions of V, the 25%, 50% and 75% percentiles of (V > 1), compared with the chi-squared(3) distribution, are:

                        25%    50%    75%
lower-bound version     1.08   1.42   2.52
upper-bound version     1.08   1.42   2.63
chi-squared(3)          1.21   2.36   4.11

Peer 1:

( June 8th, 2015 2:15pm UTC )

OK, I'm kind of glad that my guess is resoundingly disproved.

Can you tell us what percentile of the distribution the threshold for "substantial", V = 6, corresponds to?

If a value of V = 6 corresponds to a p-value of 0.08, as Klaassen suggests, then the chance of getting at least 3 "substantial V's" in 11 independent experiments is about 5% (hope I got this right). Personally, "significant at the 5% level" is not what I would call "strong evidence for low scientific veracity".

I hope that whoever reads these findings, also understands that "low scientific veracity" might just as well be caused by (innocently used) QRP's, as well as by actual (deliberate) fraud of some kind. And I wonder if anyone realises that the null hypothesis also assumes a whole lot of distributional assumptions (normal errors, constant variance etc etc) which, at best, are only very rough approximations?

The authors do write "these guidelines should be applied with care." I wonder if the typical university administrator or researcher in social psychology understands what these guidelines really mean?

Peer 2:

( June 8th, 2015 2:22pm UTC )

V = 6 corresponds to an upper tail probability of 3.15% to 4.59% (again, it's an interval because we've got upper and lower bounds), so considerably smaller than the 8% Klaassen suggests.

I've also looked at the distribution of V under various violations of the normality assumption, because I was thinking along the same lines: perhaps a large V can be explained by reasons other than "fraud". It turns out that V is very robust against non-normality (I simulated skewed data (Gamma distributed) and data on a 5-point Likert scale, both clearly non-normal). My guess is that this robustness can be explained by the fact that V depends only on n (given), the sample means and the sample sd's.

Peer 2:

( June 8th, 2015 2:35pm UTC )

Note: this was for n = 30 (thus, 10 per group). It seems as if V depends on n. With n = 90 (30 per group), I get somewhat larger V-values (though the number of replications is still too small to say anything definite about this).

Peer 2:

( June 8th, 2015 2:40pm UTC )

Replying to "Personally, "significant at the 5% level" is not what I would call "strong evidence for low scientific veracity".": I agree. It should be "beyond reasonable doubt" and "one in twenty" sounds unreasonable.

However: that 5% is only one piece in a larger puzzle. Especially the visual representations in the report seem to me to be quite convincing. (*If* V is indeed not a very good statistic (I write it between * because I still haven't figured out whether it is or not), then that doesn't mean that the data weren't fabricated (nor that they were).)

Peer 1:

( June 8th, 2015 3:06pm UTC )

I agree, Peer 2. And if V = 6 is about the 95 percentile, then the chance of three or more "substantial" V's in 11 sub-experiments is around 2% [if I did the calculation right].

Unregistered Submission:

( June 8th, 2015 5:01pm UTC )

I am not a statistician. Can someone summarize these discussions for a lay person? Can we confirm that misconduct took place based solely on this report? What about the other papers by Förster for which he was initially charged with misconduct? Was this same method used for those papers too?

Peer 1:

( June 9th, 2015 8:32am UTC )

(1) The very extensive report PKW (Peeters, Klaassen, van der Wiel) contains a whole lot more than just calculation of Klaassen's statistic "V". The authors do not blindly follow formal criteria.

(2) The report PKW does not conclude "misconduct". It concludes that results of a number of published papers should not be trusted.

QRPs ("questionable research practices") are highly prevalent in social psychology. Various such practices are extremely common and are not regarded as "misconduct" by many researchers in the field.

The discussion is presently "inconclusive". We discussed the terminology used by Klaassen in the arXiv preprint where he introduces his "V". The title page has "Preliminary Version" written in large letters next to the title. I think Klaassen's terminology is misleading, and the arguments around Bayesian updating incorrect (I think there are conceptual errors here). "V" is in fact "just another frequentist test statistic" and the Klaassen paper also gives a decent motivation for using it. Like all conventional statistical hypothesis testing, one has to use it wisely.

On the whole PKW seem to me to have done a careful and thorough job. I think the authors know what they are doing.

I think further research is needed to find out if "innocent" use of QRPs could be responsible for the anomalous patterns in the data which PKW have identified.

Peer 4:

( June 9th, 2015 10:45am UTC )

The official LOWI investigation studied the possibility that "innocent" QRPs could explain the linearity and dismissed that possibility. In fact, to date, no one has provided any honest explanation of the weird linear results other than deliberate manipulation or fabrication.

Unregistered Submission:

( June 10th, 2015 2:02pm UTC )

The official LOWI investigation studied only three QRPs. There are more. For example, they never examined the possibility that file-drawering (a very common practice) contributed to the extent of linearity.

Peer 4:

( June 10th, 2015 3:12pm UTC )

It is easy to show with a simulation that the extreme linearity in the 33 studies presented by JF cannot be explained by selective reporting of only significant results.

Peer 2:

( June 10th, 2015 3:51pm UTC )

Indeed. The LOWI report might not have considered file-drawering, but that by itself doesn't tell us whether this QRP is a plausible explanation. You really need an enormous number of drawers to file all the not-perfectly-linear studies before you reach 33 linear studies by fair play.

Unregistered Submission:

( June 11th, 2015 1:56am UTC )

Why didn't that suffice? Why this new method? Is it that some of the papers couldn't be analyzed by the previous method?

Peer 1:

( June 12th, 2015 6:03am UTC )

The "old method" used as a test statistic a familiar F-statistic, but where one rejects the null for *small* values of the statistic, not large. It is ad hoc, easy to use and understand, and has been around for a long time. The "new method" uses a test statistic carefully designed for the problem at hand. I suspect that it has higher power to detect unreliable data (though whether the data are unreliable because of QRPs or fraud is another matter).
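To make "reject for small values" concrete: the p-value for such a test is the left tail of the F distribution rather than the usual right tail. A minimal sketch, with illustrative numbers that are not taken from any of the papers:

```python
from scipy.stats import f

def left_tail_p(f_obs, df_between, df_within):
    """p-value when rejecting for *small* F values: P(F <= f_obs) under H0."""
    return float(f.cdf(f_obs, df_between, df_within))

# an implausibly small F of 0.03 on (2, 27) degrees of freedom:
print(round(left_tail_p(0.03, 2, 27), 3))  # -> 0.03
```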

I think the following research should be done: find out whether parametric bootstrap allows one to find a less conservative critical value; investigate power and compare with the standard (old) alternative method.

PS more discussion here: http://rejectedpostsofdmayo.com/2015/06/09/fraudulent-until-proved-innocent-is-this-really-the-new-bayesian-forensics-rejected-post and here http://errorstatistics.com/2015/06/11/evidence-can-only-strengthen-a-prior-belief-in-low-data-veracity-n-liberman-m-denzler-response

Unregistered Submission:

( June 11th, 2015 3:09pm UTC )

How about the new report? It seems that excluding the three papers examined in previous complaints, all the papers in the new report are "convicted" on the basis of the new V method. Can the V value be inflated by file-drawer effect?

Peer 2:

( June 11th, 2015 8:11pm UTC )

Yes and no.

Yes: file-drawering does inflate the V value.

No: the combined V-value for the newly "convicted" papers is so large that you would need very large drawers for this to be fully due to file-drawering. It's difficult to pin an exact number on it, but it isn't the case that this becomes "normal" again if for every one of the 12 published studies there are one or two file-drawered studies. You really need to think of something more like a 100-to-1 ratio. Something that extreme could be an explanation.

(But, if I had to choose whether I'd find "data fabrication" or "drawering 1200 studies" more likely (which is a subjective choice, not a matter of statistics), I wouldn't go for file-drawering).

Unregistered Submission:

( June 12th, 2015 3:43pm UTC )

Is it true that the V is also inflated when applied to control conditions with no effect, "flat lines"? There was a post on that in ISCON some time ago. Is that correct?

Peer 5:

( June 25th, 2015 1:04pm UTC )

Comment by Hannes Matuschek on Klaassen (2015) at arXiv.org:

http://arxiv.org/abs/1506.07447

Abstract

Klaassen (2015) proposed a method for the detection of data manipulation given the means and standard deviations for the cells of a one-way ANOVA design. This comment critically reviews this method. In addition, inspired by this analysis, an alternative approach to test sample correlations over several experiments is derived. The results are in close agreement with the initial analysis reported by an anonymous whistleblower. Importantly, the statistic requires several similar experiments; a test for correlations between 3 sample means based on a single experiment must be considered as unreliable.

Peer 1:

( July 6th, 2015 1:23pm UTC )

Klaassen, Peeters, van der Wiel respond

http://www.uva.nl/en/news-events/news/uva-news/content/news/2015/07/update-articles-jens-forster-investigated.html

Report by Peeters, Klaassen and van der Wiel: https://drive.google.com/file/d/0B5Lm6NdvGIQbamlhVlpESmQwZTA/view

Counter-report by Denzler and Liberman:

https://dl.dropboxusercontent.com/u/133567/Generalresponse.pdf

The R code of Peeters et al. has also been distributed. Two files:

https://www.dropbox.com/s/xzyhq5gir9sh4p8/Analysis.R

https://www.dropbox.com/s/qq6ymi9h6z9tzmz/DataVeracity.R

Thus the data can never support the hypothesis of innocence; it can only ever add support to the hypothesis of guilt.

Thus the approach is not actually Bayesian at all. It is classical statistical hypothesis testing. V is "just" a test-statistic. To evaluate it, we need to know (or bound) its null hypothesis distribution, pick a significance level, etc etc. We can't evaluate it by thinking of it as an honest likelihood ratio.

It is *not* an honest likelihood ratio. It is badly biased to the prosecution, since the prosecution is allowed to look at the data, come up with the best theory out of a huge class of theories to explain it (different values of rho), and then pretend that they had thought of that theory in advance.

Forensic statistics is presently an explosive mix of frequentist and Bayesian ideas and I think that someone has stepped on a mine, here.

Of course, this is not to say that Förster is innocent of (innocently used) questionable research practices, or even malpractice. It shows yet again that the correct procedure is: first do the science, in public. Then (perhaps) retract or correct papers. Then, if a scientist has not done the "right thing" at an earlier stage, perhaps some authorities will also get involved.

For a single-paper case (i.e. you want to get a V-value for just 1 paper), the procedure is:

1. Formulate H0: no fraud, H1: fraud (apologies for the sloppy formulation; see the Klaassen paper for a better one)

2. Set the nominal level alpha (e.g. 5%).

3. Compute V.

4. Compare it with critical value V*. If V > V*, then reject H0.

The problem lies in finding V*. For this, you need the distribution of V under H0. All that is given for the study of interest are n, x (the vector of sample means) and s (the vector of sample sd's), plus the assumption that the regular ANOVA assumptions (normality etc.) hold.

If you knew mu and sigma (or if they were fixed to certain values under H0), it would all be fairly straightforward. While computing the distribution of V under H0 might be difficult analytically, it is easy to estimate through simulation, up to any desired accuracy.

The problem is: you know neither mu nor sigma, and I don't think the distribution of V is independent of mu and sigma. A Bayesian way out would be to impose a prior on mu and sigma and integrate them out, etc., but this is the frequentist framework, so that's not an option.

For the multiple-paper case (i.e. you want to know whether e.g. n = 10 papers combined form evidence for fraud), adjustments are needed. Since V cannot be smaller than 1, the product of n V-values will, also under H0, tend to infinity as n does. Some kind of Bonferroni correction is needed. (But of course, this only helps if the single-paper problem is resolved.)
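The known-(mu, sigma) simulation route for a single paper can be sketched generically. The `too_linear` statistic below is a crude stand-in of my own invention, NOT Klaassen's V; only the Monte Carlo machinery is the point:

```python
import numpy as np

def mc_critical_value(stat, mu, sigma, ni, alpha=0.05, nsim=20_000, seed=1):
    """Monte Carlo estimate of the upper-alpha critical value of `stat`
    under H0: three normal groups with known means mu and common sd sigma.
    As in the V setting, `stat` sees only the sample means and sds (and ni)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=np.asarray(mu, float)[None, :, None],
                      scale=sigma, size=(nsim, 3, ni))
    xbar = data.mean(axis=2)          # shape (nsim, 3): sample means
    s = data.std(axis=2, ddof=1)      # shape (nsim, 3): sample sds
    return float(np.quantile(stat(xbar, s, ni), 1 - alpha))

def too_linear(xbar, s, ni):
    """Stand-in statistic (NOT Klaassen's V): large when the three sample
    means lie suspiciously close to an exact straight line."""
    contrast = xbar[..., 0] - 2 * xbar[..., 1] + xbar[..., 2]
    se2 = (s[..., 0] ** 2 + 4 * s[..., 1] ** 2 + s[..., 2] ** 2) / ni
    return 1.0 / (contrast ** 2 / se2 + 1e-12)

vstar = mc_critical_value(too_linear, mu=(1, 2, 3), sigma=1.0, ni=10)
```

The open problem flagged above remains, of course: the critical value obtained this way depends on the mu and sigma that are plugged in.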
