Thursday, October 24, 2013

10-31 Smucker, Mark D., James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. CIKM’07.


  1. 1. My first question is about Section 2.6, Summary. For the p-value of a t-test we usually set a threshold, 0.05 by convention. If we get a value slightly above 0.05, say 0.051, we should treat the result as not significant because it exceeds the threshold. However, a p-value this close to the threshold can easily be manipulated by carefully selecting which results to report. So I am wondering how we can ensure that the p-value is unbiased.

    2. My second question is about the null hypothesis. The tests used here all rely on a null hypothesis. However, whether null hypothesis testing is appropriate for IR remains an open problem, and this paper does not address it. So I am wondering how we can verify the validity of null hypothesis testing in IR.

    3. My third question is about the comparison of statistical significance tests. I believe these statistical significance tests must have been compared previously in other fields. I am wondering what differs when we apply them in the IR field. I think the null hypothesis is one difference, and the distribution of the results should also be considered. What other differences are there when we apply these significance tests in IR compared to other fields?

  2. 1. The authors state that although their experiment found little noticeable difference between the randomization, bootstrap, and t-tests, other experiments may find a difference. How generalizable are the results from this experiment? How might another experiment be structured to attempt to show differences between the three tests?

    2. The tests that the authors show to be better are more likely to distinguish differences between systems. Is it good that these tests are more prone to distinguishing significance? Wouldn't it be better to set a higher bar for significance?

    3. The authors say that they do not know how to estimate the accuracy of the bootstrap tests so to compensate they run many more samples than is traditionally recommended (p. 5). Is it a problem that they cannot estimate the accuracy? Could other issues with significance testing be overcome simply by running a lot of tests?

  3. In this paper, Mark has shown that the randomization, bootstrap, and t-tests largely agree with each other in his experiment; however, can we conclude that the three statistical methods can be employed correctly and reliably in information retrieval research? In other words, is there any benchmark to test which method is correct?

    In this paper, the number of topics is very large. If the number of topics is very small, will the randomization test still be stable?

    In this paper, Mark uses the difference in mean average precision; if precision is calculated in terms of GMAP, will the same results be obtained? Is it possible that the randomization, bootstrap, and t-tests are sensitive to the method of calculating precision?

  4. 1. I appreciate the pragmatism in the explanation of the randomization test: "Computing 2^50 permutations takes even a fast computer longer than any IR researcher is willing to wait." I was curious though, what's the difference between the 1-sided p-value (the "achieved significance level") and the 2-sided p-value? It seems like in the interest of getting published a 1-sided p-value would be preferable since it is more likely to report significant results, but what are the actual implications of choosing one over the other?

    2. For the randomization test, they computed 100,000 random permutations. How were they sure that the permutations were random? Did they just randomly choose between each pair and then check if that permutation had been generated? Or did they use some well-known algorithm?

    3. The calculations involved in significance testing of IR system results seem easy to parallelize (i.e., generating 2^50 permutations and calculating the difference in MAP). Has any work been done to verify these conclusions about significance testing on highly parallel hardware such as GPUs?
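The questions above about how the permutations are drawn, and about one-sided versus two-sided p-values, can be made concrete with a minimal sketch of a Monte Carlo paired randomization test. This is an illustration under stated assumptions, not the authors' C++ implementation: the function name `randomization_test`, its parameters, and the use of Python's standard `random` module are all hypothetical choices. It draws each sample independently and does not check whether a permutation has been seen before.

```python
import random


def randomization_test(scores_a, scores_b, num_samples=100_000, seed=0):
    """Paired randomization test on the difference in mean scores.

    Returns (one_sided_p, two_sided_p), both estimated by sampling.
    The name and signature are hypothetical, chosen for illustration.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired by topic")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    at_least_as_large = 0    # counts for the one-sided p-value
    at_least_as_extreme = 0  # counts for the two-sided p-value
    for _ in range(num_samples):
        # Under the null hypothesis the labels A and B are exchangeable
        # within a topic, so each per-topic difference keeps or flips
        # its sign with probability 1/2.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if permuted >= observed:
            at_least_as_large += 1
        if abs(permuted) >= abs(observed):
            at_least_as_extreme += 1
    return at_least_as_large / num_samples, at_least_as_extreme / num_samples
```

The one-sided value counts only relabelings that favor system A at least as much as the observed labeling; the two-sided value counts relabelings that are at least as extreme in either direction, so it is never smaller when A is observed to be ahead. Because each topic is an independent coin flip, the loop also splits naturally across workers, which is the parallelism the third question refers to.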

  5. The authors refer to randomizing their data and sampling from the randomized population several times (e.g., Section 2.1). However, they do not describe their randomization procedure. Do they use a built-in "rand" function? What is the process?

    I can understand their suspicion of the Wilcoxon signed rank test given their findings, but I cannot understand the elan with which they dismiss it. Their description of the Wilcoxon test, furthermore, is less than adequate in explaining why its results diverged so greatly. What are potential reasons why it produced such different results from the randomization, bootstrap, and t-tests? Without such reasons, how exactly can we know whether or not we should dismiss it in any particular circumstance?

    What was the bias referenced in Section 5.2 of the bootstrap toward smaller p-values, vis-a-vis the randomization and the t-tests? Is this inherent to the bootstrap, or to their construction?

    Other questions: On the topic of unreported information, how do the two error rates (type 1 and type 2) of the Wilcoxon and sign tests (Figure 6) compare to the other tests' error rates? Without a benchmark, I am not sure what to make of this Figure or the related argument against the tests. Lastly, has anyone looked at how NDCG or RBP results might be similar or different?

  6. Regarding the bootstrap test and shift method, the authors state that "The bootstrap's null hypothesis is that the scores of system A and B are random samples from the same distribution. This is different than the randomization test's null hypothesis that makes no assumptions about random sampling from populations" (2.4). I don't understand why random versus non-random sampling would matter in determining significance.

    Smucker et al. also state that the Wilcoxon and sign tests disagree with other tests and each other and should no longer be used. This seems to contradict the Sanderson paper we read, which seems to think the Wilcoxon test has "Great power". So which one is right?

    What happens when you run a Wilcoxon signed rank test and you get a difference of zero? The authors mention that this needs special handling but do not go into detail.

  7. 1. What would the authors of "Improvements that Don't Add Up" think of this article? It seems to be addressing the same issue- researchers are using methods (in this case the Wilcoxon signed rank test and the sign test) that are heavily flawed and therefore do not promote real advances in the field of IR. Would the elimination of these tests satisfy at least some of the concerns from "Improvements"?

    2. The Wilcoxon signed rank test and sign tests are said to have been useful to create approximations before computation was as efficient and affordable as it is today. If this is the case- why are they still being used by any researchers? Are researchers using them to make their work look more significant than it is?

    3. The authors of this article claim that researchers "may misapply a test by evaluating performance on one criterion and testing significance using a different criterion" (p. 6). Is there a source that lays out which tests and evaluation measures should be used for different kinds of experiments? What level of standardization is in place?

  8. 1. Since significance tests are widely applied in IR studies, it is very important for us to have a clear understanding of the power of the different tests and of where each applies. My first question is about the implementation of the sign test. The paper mentions that a minimum absolute difference of 0.01 is used for the sign minimum difference test. Why is this value chosen? It also mentions that with a different value, 0.05, the p-value drops. So what is the optimal minimum absolute difference for the test?

    2. The p-values were used to assess the performance of the different tests. It is shown that three tests (randomization, bootstrap, and t-tests) have much smaller p-values than the Wilcoxon and sign tests. Although the first three tests have smaller p-values than the other two, and we might publish more by using them, does that mean that the first three are better tests? The paper points out that there are type I and type II errors. Will the first three tests generate more false positive results? How can we be sure the significance we find is real?

    3. For testing the agreement of different tests, why were the run scores of trec_eval[3] used? Different test statistics have different power. As pointed out, the bootstrap test has consistently smaller p-values than the t-test. So is it fair to compare p-values from different tests, since they may have different intrinsic significance thresholds?
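To see why the minimum difference matters in the first question, here is a hedged sketch of a two-sided sign test with a minimum absolute difference. It is an assumption-laden illustration: the name `sign_test`, the `min_diff` parameter, and the choice to drop ties (rather than, say, splitting them between the two systems) are not taken from the paper.

```python
from math import comb


def sign_test(scores_a, scores_b, min_diff=0.0):
    """Two-sided sign test on paired scores.

    Pairs whose absolute difference is zero or below min_diff are
    treated as ties and dropped (an assumption of this sketch).
    """
    wins = losses = 0
    for a, b in zip(scores_a, scores_b):
        d = a - b
        if d == 0 or abs(d) < min_diff:
            continue  # tie: contributes to neither count
        if d > 0:
            wins += 1
        else:
            losses += 1
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied, so there is no evidence either way
    k = max(wins, losses)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for two sides, capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Raising `min_diff` changes which topics count as wins or losses at all, so the p-value can move in either direction depending on where the small differences fall. That sensitivity is what makes the choice of threshold hard to justify on statistical grounds alone.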

  9. In answering the research question “What statistical significance test should IR researchers use?”, the authors take the pragmatic position that if two significance tests report the same significance level, “the fundamental differences cease to be practical differences”. I am not comfortable with this answer, as significance test results also depend on the topics, the assessors, and the choice of document collection. Ignoring the underlying fundamentals of the tests is not a safe way to reach this conclusion.

    In describing the randomization test, the authors state “In fact, since there are 50 topics, there are 2^50 ways to label the results under the null hypothesis.” The authors then give an example that produced MAPs of 0.258 for system A and 0.206 for system B. In the paper, I could not find how this example was chosen, which is the key basis for calculating the one-sided or two-sided p-value.

    In comparing the randomization, bootstrap, and t-tests, since the experiments themselves do not show any difference, the authors recommend the randomization test based on a cited work and on simplicity. I am not very convinced of this, and the claim should be substantiated by more empirical studies or a user study.

  10. At the beginning of the article the authors say “A powerful test allows the researcher to detect significant improvements even when the improvements are small. An accurate test only reports significance when it exists.” But it seems that none of the tests that we currently have can be categorised as powerful tests. Has anyone even attempted to define what such a test would be able to accomplish?

    For these tests the significance threshold has by default been taken as 5%. But it is possible that the results given by the Wilcoxon test are less sensitive to changes between the systems and would therefore need a higher threshold to give the correct results. The authors themselves mention that "The Wilcoxon and sign tests are simplified variants of the randomisation test".

    The authors have not provided much detail on how randomisation was achieved for the randomisation test. Can this affect the test results in any way?

  11. 1. When we are implementing statistical significance testing - wouldn't we always have to face the problem wherein the null hypothesis gets rejected at some sample size x? Since all statistical testing methods rely on the sample size in their formulae - How can we hope to deal with the issue which makes it seem like all significance tests have a tautological logic?

    2. Although the Wilcoxon and sign tests have the same null hypothesis as the randomization test, these two tests utilize different criteria (test statistics) and produce very different p-values compared to all of the other tests. What could explain this phenomenon which seems to be a result of a variation between the test criteria and the area of interest?

    3. When dealing with 2 or more independent variables in the t-test it seems like we would be required to either collapse the categories or not run the analysis if the focus of our investigation is to test differences between the group means. Isn't this a major bottleneck if we want to use the t-test in retrieval strategies which include an A-B testing (or, any other form of comparative analysis)?

  12. 1. For the different significance test measures, the authors either used an existing implementation or they provided their own in C++. An infamous example I’ve heard in all my software testing classes was an error in the Array class released in the Java API. The bug went undetected for years and the code was published in several textbooks. Eventually, people discovered the code was actually faulty. The example shows that even publicly acceptable applications may contain code errors. Can errors in the significance test programs lead these researchers to draw incorrect conclusions? Specifically when comparing the randomization, bootstrap and t-test, which all contained closely related values. An incorrect implementation of any one of these could have influenced this evaluation.

    2. Three of the significance tests, randomization, bootstrap and t-test, were in close agreement with one another. The bootstrap test and t-test were the most closely related, which may be because, unlike the randomization test, they are based on the same assumptions. Although there is no overwhelming difference between the three, the authors conclude that randomization is the preferred significance test. In class, we discussed the purpose of looking at agreement between measures. Just because measures are in agreement does not mean the measures are good or bad. It simply means the measures tend to reach the same conclusions. With no real difference, is the authors’ justification for selecting the randomization test valid? The authors cite the nature of the information retrieval process as their justification. Their main point is that the bootstrap and t-test assume randomly drawn queries, which is not the case for test collections. However, does this really impact the evaluation power of the two tests?

    3. Is having the same data a good enough control to compare significance tests? Each measure has its own unique features, and the paper has demonstrated thorough examples of how tweaking these values can impact the results drawn from the test. The authors pick a common sample size, but then note that the factor doesn’t apply to all the tests. The authors also go on to outline the different controlled variables they established, such as significance values. The controls make sense, but are not equally impactful for all measures. Given the close agreement between randomization, bootstrap, and t-test, is the experimental design suitable for drawing conclusions?

  13. 1. The authors mention that for the two sign tests, the test statistic on which they are reporting significance is different from the test statistic on which they are measuring significance. To show that this can be problematic, the authors showed that significance testing results on Median Average Precision are different from those on Mean Average Precision. However, this doesn't convey the full picture, and is even slightly misleading, since the difference between the two test statistics used/reported on by the sign tests is (possibly) not as great as that between Median and Mean Average Precision. Has there been any work to justify that it is not always invalid to implicitly use one test statistic but report on another, if the two are highly correlated?
    2. From a passing comment made in 5.3, it seems choosing the right metric can alleviate the problems the sign tests have been shown to have. In particular, the authors pointed out that on MRR the Wilcoxon was more consistent. The question then is whether we can find metrics where the Wilcoxon actually performs better than Student's t, for example, and becomes a favored test. In other words, wouldn't a bigger picture consider pairs of metrics and significance tests in the IR context, since some metrics seem more adaptable to some tests?
    3. The authors point out that the randomization test is ideal because it does not 'incorrectly' assume that the test scores are random samples from a population. However, even if this is theoretically incorrect, in practice the t-test has been found to work quite well. So isn't it not completely incorrect to assume that samples have been drawn randomly from a population, even if the design of the experimental setup says otherwise?

  14. 1. When discussing the Wilcoxon Signed Rank Test and the Sign Test, the authors suggest that both tests might have “only disadvantages” compared to the randomization test. However, such tests can still serve to roughly estimate significance before turning to a computer. Thus, this conclusion does not seem solid.
    2. The paper mentions that the “bootstrap shift method significance test is a distribution-free test”. What does “distribution-free” mean? Such a test still requires a hypothesis based on the “same distribution”. In light of this definition, all four tests other than the t-test are “distribution-free”.
    3. In Section 2.6, the paper discusses how to select a test. Should that decision rely more on the features of the data and the scenario than on pure statistical significance?

  15. 1. It is said that “A powerful test allows the researcher to detect significant improvement … An accurate test only reports significance when it exists.” (p.1) Does it imply a significant improvement or any other improvement if the significance exists?
    2. It is mentioned in the discussion of the Sign Test that the “sign minimum difference test is clearly sensitive”. How should such a value be determined?
    3. The tests discussed in the paper were only verified with TREC data. Are these guidelines still valid for Web search engines?

  16. 1. The Section 1 in the paper states the different challenges posed during the evaluation of an IR system such as the topic hardness and assessor’s behaviors, choice of document collection. But while comparing the statistical significance tests, the authors have not discussed the applicability of the tests to these various scenarios and also have failed to analyze the performance of these different tests under different assessor behaviors. Will these cases affect the results of the test? How can one attribute these factors to these tests?

    2. For the randomization test, it was found experimentally that with 100K samples a two-sided 0.05 p-value was computed with an error of 2%, whereas with 20 x 10^6 samples the error was 0.01%. It is clear from these two results that the number of samples plays a significant role. Going from 100K to 20 x 10^6 samples changes the estimated error by a huge margin. Computing the test with 20M samples would incur high computational costs compared to 100K samples. They claim that the level of accuracy obtained with 100K samples is very good. How, then, do we identify the minimum number of samples required to test the null hypothesis?

    3. How do the authors generalize that IR researchers should not consider using Wilcoxon's test for evaluation? Compared to the randomization test, which depends on the number of samples (the more samples, the more accurate the estimate of the p-value), Wilcoxon's test gives a rapid approximate idea of the significance of the differences. It is mentioned in Section 2.1 that even with super fast computers it is highly unlikely that even the IR researcher, let alone the user, is willing to wait for the result of larger samples for the randomization test. Then why can’t Wilcoxon's test be considered a good trade-off between computational ease and accuracy?
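One common way to reason about how many samples are enough is to treat each sampled permutation as a Bernoulli trial, so that the estimated p-value has a binomial standard error. The sketch below makes that assumption explicit. It is not necessarily how the authors estimated their errors, and the function names are hypothetical.

```python
from math import ceil, sqrt


def p_value_standard_error(p, num_samples):
    """Standard error of a Monte Carlo p-value estimate.

    Assumes each sample independently lands in the rejection tail
    with probability p (a binomial model).
    """
    return sqrt(p * (1 - p) / num_samples)


def samples_needed(p, relative_error):
    """Smallest N with standard error <= relative_error * p.

    Solves sqrt(p * (1 - p) / N) <= relative_error * p for N.
    """
    return ceil((1 - p) / (p * relative_error ** 2))


# Example: precision of an estimated p-value near the 0.05 threshold.
se = p_value_standard_error(0.05, 100_000)
print(f"standard error at N=100,000: {se:.6f}")
print(f"samples for 1% relative error: {samples_needed(0.05, 0.01)}")
```

Under this model the required number of samples depends on the p-value of interest and the precision wanted, not on the number of topics. Going from 50 to 100 topics makes each sample more expensive to compute, but it does not by itself call for more samples.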

  17. Smucker et al., at the very end of their introduction, make it a point to mention that they “know of no other work that looks at all of these tests or takes our pragmatic, comparative approach.”(2) How could the IR field have continued to use the significance tests for so long without using the same kind of approach Smucker et al. have employed in this paper?

    On page 5, Smucker et al. state quite forcefully that testing using the Wilcoxon method should’ve been ended quite a while ago and on page 7 they mention that IR researchers using the Wilcoxon and sign tests could conceivably miss identifying IR techniques which are an improvement over other techniques. Would this in any way retroactively change any research? Could researchers go back and sift through experiments and apply the randomization test in an effort to show the actual advances in IR versus what could be considered the “assumed” advances?

    This is just a general thought, how likely are researchers to simply abandon the use of the Wilcoxon and sign testing completely as Smucker et al. have suggested given that they are easier tests to apply than the randomization? It seems like IR researchers, in some cases, prefer to stick to these methods for as long as possible with little regard for change. Especially if this article is the first time the different tests were compared in a significant way.

  18. 1. Given that the randomization and bootstrap tests are more flexible than other significance tests, could tests like these be used to come to some sort of test that would allow comparisons across systems that measure different metrics?

    2. The authors mention that these significance tests work particularly well with TREC style evaluations. How do non-TREC evaluation and more live evaluations affect the strength of the significance tests?

    3. Graded relevance has been shown to differentiate runs more substantially than using a simple binary relevance. Would these types of differences in relevance scales cause a similar difference in the significance test results between the better systems?

  19. 1. How do the t-test and bootstrap test perform in terms of the miss rate and false alarm ratio discussed in Figures 6 and 7? The authors' central claim that the Wilcoxon and sign tests produce more false alarms could have been supported strongly by the same evidence for the t-test and the bootstrap test.

    2. The authors argue that randomization testing is a better alternative to the other parametric and nonparametric tests. Are the authors not overlooking the fact that a randomization test is computationally very intensive and time consuming? For a sufficiently large data set, can 2^50 permutations be generated in a non-compromising way?

    3. Considering the importance that has been laid to significance testing to compare IR systems, does it not make sense to define a standardized test for the specific purpose? This might reduce the risk of researchers projecting better results using the tests that work for their data. Is there a recognized standard test which is widely accepted for the same?

  20. 1. The claim made about the reliability of the t-test is that a large number of samples validates the normality assumption. But how do we determine a sufficient sample size, and how sensitive is the test to it? It appears that the t-test is the go-to method, and hence more investigation in this regard would have been helpful.

    2. The bootstrap test is hinted to be the best compromise in related work but not so much here. How and when does the systematic bias toward lower p-values affect inference? Is this a serious concern?

    3. The existence of a population from which samples are drawn is the only advantage of using the bootstrap test. However, it is still not clear when such an assumption is unnecessary and the randomization test can be used.
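For readers trying to picture what the bootstrap test does with the per-topic scores, here is a rough sketch of a paired bootstrap shift test on the mean difference. Several details are assumptions made for illustration rather than the paper's exact procedure: the name `bootstrap_shift_test`, the use of the plain (unstudentized) mean difference as the test statistic, and centering the bootstrap distribution on its own mean.

```python
import random


def bootstrap_shift_test(scores_a, scores_b, num_samples=100_000, seed=0):
    """Two-sided paired bootstrap test using the shift method.

    Resamples topics with replacement, then shifts the bootstrap
    distribution of the mean difference so it is centered at zero,
    which stands in for the null hypothesis. Hypothetical sketch.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired by topic")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    boot_means = []
    for _ in range(num_samples):
        # Treat the topics as a random sample from a population:
        # draw n per-topic differences with replacement.
        resample = rng.choices(diffs, k=n)
        boot_means.append(sum(resample) / n)

    # Shift step: subtract the center of the bootstrap distribution.
    center = sum(boot_means) / num_samples
    extreme = sum(1 for m in boot_means if abs(m - center) >= abs(observed))
    return extreme / num_samples
```

The resampling step is where the "random sample from a population" assumption enters, in contrast to the randomization test, which only relabels the topics it was given. In a sketch like this, the spread of the resampled means reflects the sample variance computed with divisor n rather than n - 1, which may be one reason such a test leans toward slightly smaller p-values than the t-test on few topics.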

  21. 1. The difference between a powerful test and accurate test is still fuzzy to me. Based on the definition provided, don’t they seem to overlap, or doesn’t accurate seem to be a subset of powerful, which helps detect small (or any) significance?

    2. Rather than disregarding the Wilcoxon Signed Rank Test, what do you think of using it as a fast-tool, for researchers to do some quick work before they deep dive?

    3. The paper says that if the null hypothesis cannot be rejected, i.e. it cannot be shown that there is a difference between the two systems, then there might be noise in the evaluation. What are the criteria for this assumption? Is this true for all systems, topic sizes, and/or test collection sizes?

  22. 1) Given that the randomization, bootstrap, and t-tests all show similar behavior, if one uses all three tests and they disagree, wouldn't it be better to follow a more conservative approach and use a voting scheme to determine whether there is significance?
    2) Can you elaborate on the practical differences between the null hypotheses of the randomization, bootstrap, and t-tests, and on how these differences might sometimes not be negligible?
    3) Are there factors that influence the sample size for the randomization test, or is a sample size of 100,000 usually sufficient?

  23. 1. In this article the authors compare several different statistical significance tests. To do this they use a comparison of two different engines across 50 TREC topics. This leads to a large number of different permutations of the differences between the two on the order of 2 to the 50th power. Because they could not calculate the significance value for all of these permutations they selected a pool of 100,000 for their experiment and 2 million for the gold standard they eventually created. Do you think that the size of this pool was too large or too small or the right size?
    2. In this article the authors state that the sign and Wilcoxon signed rank statistical tests are not valid for testing significance in IR. They state that at one time, before computers, these tests were useful yet now they are obsolete. Yet in the other articles it is shown that they are often used in IR evaluation even if they are somewhat unreliable. What circumstances can you think of in which you would use these tests in IR research you would do?
    3. In this article the authors create a gold standard to compare the other testing methods against when discussing their error rates. To create this standard they use the randomization significance test for 2 million permutations of different runs rather than the 100,000 they used for the initial test. Wouldn’t using one of the methods that you are examining to create your gold standard cause some sort of bias? Do you think it was safe of them to do this?

  24. In Section 5.2, in the discussion about which of the randomization, bootstrap, and t-tests is suitable, the authors suggest that when the significance tests disagree, the randomization test is to be preferred. However, the exact randomization test is computationally infeasible. How representative is the sampled test of the original one? In this paper 100,000 samples were taken instead of dealing with the 2^50 cases corresponding to the 50 topics. What would have happened if the number of topics were 100? What would be a sufficient sample size then (the total number of cases would now be 2^100)? Would the sample size itself not have been computationally expensive? Thus I feel that even the sampled randomization test is limited by the input data size.

    As mentioned, the choice of significance test is still highly dependent on the input data. It was suggested that when the data come from a normally distributed population, the t-test is to be used, as it outperforms the other two. In the other scenarios the randomization test performs nearly as well as the t-test. Even with this knowledge, I don’t understand how IR researchers can pick the ‘best’ significance test, as they possibly cannot infer the data distribution. Thus it appears that inferring the data distribution is as important as knowing which test to apply when. Therefore it seems to me that the point made by the authors of the paper is instructive but not implementable as of yet.

    The related work section in the paper presents the work of various researchers who seem to have reached distinct conclusions. Hull advocates the use of the t-test due to its robustness to violations of the normality assumption. Wilbur recommends the bootstrap test over the other tests because of its greater generality. Box concludes that the t-test is an approximation of the randomization test. The authors of this paper advocate the use of the randomization test for many cases. Thus it appears that no method serves all data best and no method serves all data worst. One way to understand the implications is perhaps to look at the significance tests used by the search engine companies (if available!) and observe how they fare with user activity. Or researchers could report the results for all the different significance tests they employ, just to remove the doubt that they are only reporting their best results.

  25. This comment has been removed by the author.

  26. 1) In the introduction the authors state that if two significance tests give the same p, there is little practical use for having both of them. Doesn’t having this redundancy provide a sort of mutual reaffirmation? If two tests provide the same p for several comparisons, but different p’s for a new run, then this sudden distinction might tell us something interesting about the new system.

    2) I was a bit confused by the example of the randomization test. Specifically, if negative differences are used, it seemed that we could have just as easily made the same conclusion regarding System B being better than System A, if we used our primary labeling as one where System B has a higher AP than System A. Would we really get a much higher p value for the specific labels where B is better than A?

    3) The authors of this paper were very dismissive of the Wilcoxon and sign tests. This was also mentioned in the ch.5 reading, which stated that they determined these tests were simplified versions of the randomization test. Given that these papers are several years old, are these significance tests used at all anymore? Did anyone make an argument against Smucker? It’s interesting that these two tests perform so differently from the other three, so I was wondering if they were simply trying to explain away this anomaly as an “inaccuracy” in the tests.