Thursday, October 24, 2013

10-31 Mark Sanderson. CHAPTER 5 - Test collection based evaluation of information retrieval systems, Foundations and Trends in Information Retrieval, pp. 308-318.


  1. 1. My first question concerns section 5.1.1, Not Using Significance. This section mainly discusses alternatives to significance testing. One approach is to set a threshold on the performance difference; for example, any performance difference below 5% should be disregarded. This seems to be a feasible solution. One problem, however, is that no single threshold will ever be agreed on by everyone. How can we decide on the threshold according to specific requirements?

    2. My second question concerns section 5.1, Significance Tests. For the t-test, the results should be drawn from a normally distributed population. However, most papers that use the t-test do not show that the results are normally distributed. They simply assume normality and apply the t-test directly. Is it reasonable for them to do so? Or do they have to demonstrate normality first?

    3. My third question concerns section 5.1.3, Consider the Data in More Detail. This section looks at the data more closely and from different perspectives. In particular, it points out that even if the experimenter compared two runs using the collection and found p<=0.05, there remains the question of whether the result is practically significant. This gives me more insight into how to run and evaluate experiments. So I am wondering what methods we can apply to verify that a result is practically significant.

  2. 1. Sanderson proposes confidence intervals as a solution to some of the problems of significance testing (p. 318). Do you believe confidence intervals are a better way to evaluate differences? Why or why not?

    2. Sanderson states that it is of concern that researchers who fail to find significance do not examine their data to find out why this is the case (p. 317). Why is this a concern? How could you examine why there wasn't any significance?

    3. Sanderson writes that if the test set is a representative sample of the larger set of queries then the test can be reliable, but often the sample is not representative (p. 316). How do you determine if the test set is representative? What aspects would you need to examine?

  3. This comment has been removed by the author.

  4. 1. Zobel preferred the Wilcoxon test “given its reliability and greater power” (p. 312). Isn’t this statement contradictory? By definition, greater power means fewer Type II errors, and thus comparatively more assumptions. How is that more reliable?

    2. What is it about an operational setting that causes significant differences resulting from a t-test to fail for a small number of documents (p. 312)? Also, what number is considered small, for runs that include thousands of documents?

    3. I’m a little confused about the definition and/or ways of determining if a sample is representative (p. 316) of a test collection. Also, if the tester knows, at the onset, that the sample is not representative, what are the merits of testing it?

  5. In this part of the paper, it’s mentioned that the data gathered in information retrieval experiments can rarely satisfy the assumptions of the t-test, in which case using the t-test is unsuitable or even a mistake. So why is the t-test still employed in these experiments?

    In addition, a significance test is merely a tool for testing whether there is a difference between, for instance, system A and system B; it can hardly support the claim that system A is better than system B. In this sense, how can we determine statistically which system is better?

    In the evaluation of information retrieval systems, users, setting aside interface design, are likely to care about the accuracy of the results the systems provide and the speed with which they are provided. So how do users actually balance these two factors when judging systems?

  6. The author mentions that for the randomization test to be applied, no properties of the data must hold (similar to the case of the bootstrap test). How do you ensure that no properties hold for the data? Is that a necessary condition for applying the randomization test? Or does the author only mean that the randomization test makes no assumptions about the input data?

    The author mentions that the single tailed tests (in both directions) cannot be performed using the same data set. Why is that so? What would happen when the direction of the data is incorrectly predicted? I do not understand how the input data set is related to the direction of the single tailed test performed.

    As mentioned in the conclusion of the reading, confidence intervals are another measure of the reliability of input data. What are the advantages and disadvantages of using confidence intervals? It is surprising that none of the papers read in the course so far use confidence intervals. Is there any reason why this is so?
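
On the question of what the randomization test assumes, a minimal sketch may help. It is one common paired (sign-flip) formulation, not necessarily the exact procedure the chapter or Smucker et al. describe. The function name `randomization_test` and the per-topic scores are hypothetical, made up for illustration. The only thing the procedure relies on is that, under H0, the two system labels are interchangeable within each topic; no distributional shape is assumed.

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test on per-topic scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under H0 the system labels are exchangeable within each topic,
        # so each per-topic difference keeps or flips its sign at random.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    # Fraction of relabelings at least as extreme as the observed difference.
    return extreme / trials

# Hypothetical average-precision scores for two runs on the same 10 topics.
run_a = [0.42, 0.31, 0.55, 0.60, 0.28, 0.47, 0.39, 0.51, 0.44, 0.36]
run_b = [0.38, 0.29, 0.50, 0.61, 0.20, 0.40, 0.35, 0.45, 0.41, 0.30]
p_value = randomization_test(run_a, run_b)
print(p_value)
```

The returned value estimates how often a random relabeling of the two runs produces a mean difference at least as large as the one observed.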

  7. 1. I know that precision is the portion of documents retrieved that are relevant, and recall is the proportion of relevant documents that are retrieved, so what does the text mean when it says that Type I errors (false positives) measure precision and Type II errors (false negatives) measure recall? I know that we went over this in one of the first couple of lectures, but I don't remember the reasoning. It kind of makes sense that Type I errors measure precision, since zero Type I errors indicates perfect precision, but if there are no Type II errors, you can still have poor recall (for example if you don't retrieve any documents).

    2. The writing cites Saracevic as urging cautious use of many statistical tests because the data from effectiveness evaluations "does not satisfy the rigid assumptions under which such tests are run". Does this explain anything? It seems kind of dismissive (though the writing goes on to state that Saracevic did not imply that there was no more work to be done in this area).

    3. The writing cites Spärck Jones as saying "in the absence of significance tests, performance differences of less than 5% must be disregarded," which Voorhees and Buckley later clarified as referring to absolute percentage difference. What is absolute percentage difference? And what are the alternatives?
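
As a rough illustration of the distinction asked about here, assuming the usual reading in which "absolute" means a difference in percentage points and the alternative is a difference relative to the baseline score, the arithmetic can be sketched as follows. The function names and the scores are hypothetical.

```python
def absolute_difference(score_a, score_b):
    """Difference between two effectiveness scores, in percentage points."""
    return (score_a - score_b) * 100

def relative_difference(score_a, score_b):
    """Difference expressed as a percentage of the baseline score_b."""
    return (score_a - score_b) / score_b * 100

# Hypothetical mean average precision for a new run and a baseline.
new_run, baseline = 0.25, 0.20
abs_diff = absolute_difference(new_run, baseline)  # roughly 5 percentage points
rel_diff = relative_difference(new_run, baseline)  # roughly 25 percent
print(abs_diff, rel_diff)
```

Under these assumptions the same pair of runs sits right at a 5% threshold on the absolute reading and far above it on the relative reading, which is why the clarification matters.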

  8. Without having read all of the papers discussed in this meta-review, I am a little confused conceptually as to how the different tests are being compared. Consider a quote such as "Zobel concluded that use of all three tests (paired t-test, paired Wilcoxon's signed rank test, and ANOVA) resulted in accurate prediction of which was the best run, though he expressed a preference for the Wilcoxon test 'given its reliability and greater power' " (312). In order to evaluate these tests in such general terms, it seems that there would need to be a rigorous strategy for comparing them across several datasets and under different, controlled experimental conditions. Perhaps Wilcoxon's signed rank test has better predictive accuracy for particular types of runs, while other statistical tools do better for other types. What are the particulars, and how heavily evaluated were these tools, in order to produce such findings? Only when we are sure of this can we be sure that the comparisons given herein are legitimate.

    As discussed in class, while it might be to the benefit of theory to have a better grasp of metrics, it is not necessarily to the benefit of researchers seeking to publish. Is it possible that having a multitude of tests serves the interests of the researchers in getting published, and that this may be part of why there does not seem to be much agreement on which tests to use at certain times? In addition, although it may ostensibly serve the researchers as just mentioned, until we DO know exactly which statistical tools best serve particular situations, would it not also be the best that journals can do to consider a multitude of tools as potentially equally valid for similar experiments, so as not to discredit good research?

    I cannot recall seeing error costs seriously discussed in many of the IR papers we have read so far. Do researchers ever weight the costs of Type I and Type II errors differently in order to adapt models to particular applications, or is this generally considered to be outside the domain of conceptual IR research and its top journals?

  9. 1. Throughout the chapter a number of studies are discussed which try to establish significance at a specific number of sampled topics. How appropriate a sampling unit is the query topic? It seems as though different topics would measure very different aspects of IR systems. Why, when assessing the significance of results, is it not appropriate merely to show significance at the level of particular topics? Creating a sample that properly characterizes the population of topics seems far more problematic to me than creating a sample that properly characterizes the population of documents for a topic.

    2. Two of the newer significance tests for IR that were mentioned are the bootstrap test and the randomization test. Sanderson however remarks that the IR community does not use these new measures all that commonly. Why is this the case?

    3. When proposing a new IR approach, there may be computational cost reasons for not processing a statistically meaningful sample of topics or documents. What is an appropriate way to interpret such studies?

  10. The author says that according to Zobel, there is a preference for the Wilcoxon test 'given its reliability and greater power'. What is this greater power that the author is referring to? Accuracy? Stability?

    In section 5.1.1, the author quotes Spärck Jones, who says a difference is notable at 5-10% and material at over 10%. Is this to account for unreliable use of statistics to determine significance? Or to account for false positives and false negatives? These numbers seem higher than the ones in other papers.

    In section 5.1.2, the author explains the difference between one- and two-tailed tests. While I agree that one-tailed tests do not seem to show the entire picture, and according to the author are "almost always inappropriate", is there a case where a one-tailed test would be appropriate?
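
To make the one-tailed versus two-tailed distinction concrete, here is a minimal sketch using an exact sign test, chosen only because its arithmetic is easy to follow. The function name `sign_test_p_values` and the per-topic scores are hypothetical, and dropping tied topics is an assumption of this sketch.

```python
from math import comb

def sign_test_p_values(scores_a, scores_b):
    """Exact sign test on paired scores; tied topics are dropped."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses
    # One-tailed (H1: A beats B): chance of at least `wins` wins when each
    # topic is a fair coin flip under H0.
    one_tailed = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    # Two-tailed (H1: A differs from B): chance of a split at least this
    # lopsided in either direction.
    k_extreme = max(wins, losses)
    tail = sum(comb(n, k) for k in range(k_extreme, n + 1)) / 2 ** n
    two_tailed = min(1.0, 2 * tail)
    return one_tailed, two_tailed

# Hypothetical scores: run A beats run B on 7 of 8 topics.
run_a = [0.42, 0.31, 0.55, 0.60, 0.28, 0.47, 0.39, 0.51]
run_b = [0.38, 0.29, 0.50, 0.61, 0.20, 0.40, 0.35, 0.45]
one_tailed, two_tailed = sign_test_p_values(run_a, run_b)
print(one_tailed, two_tailed)  # 9/256 and 18/256
```

With these made-up numbers the one-tailed p value falls below 0.05 while the two-tailed value does not, which illustrates why the choice of tail has to be fixed before looking at the data.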

  11. 1. Significance tests give us more information about the performance of different IR systems than the standard IR metrics. As mentioned, a significance test assumes that the data we are studying is representative of the population. How do we find a representative data collection for the population?

    2. In the comparison of different statistical tests, it is mentioned that a study divided the test collection into one half as a mini test collection and the other half as an operational setting. What is an operational setting, and why do the authors need two different data sets, the mini test collection and the operational setting?

    3. The effect of the number of topics was studied to find a suitable number of topics for obtaining a significant p value. Does this effect vary across different statistical tests? If so, which types of tests need fewer topics and which need more? Since the 5% significance threshold is a social convention, is there a systematic study of the most appropriate p value for significance tests?

  12. 1. We have seen how the Wilcoxon test is capable of producing false detections of significance, as it is in effect a simplified version of randomization. So how can we account for the percentage of results that have been reported as significant but are in fact not? Have there been any methodologies focused on eliminating these false detections of significance?

    2. Owing to the mathematical construction of ANOVA, the underlying assumptions of the test include homogeneity of variance, normality, and independence of observations. How can we establish that each of these assumptions in fact always holds? Isn't it possible that relying on such assumptions may bias the experimental results?

    3. On the choice between a one-tailed and a two-tailed test, the paper does not take into consideration the fact that it is easier to reject the null hypothesis in one-tailed tests. Would neglecting this factor cause any difference in the observations made? If so, what correction could be introduced to deal with this issue?

  13. The author states that if the creators of a test make more assumptions about the underlying data being examined, the test will have more power and be more supportive of H1. Why is this? My second question is: what if the underlying assumptions are inaccurate or even misleading?

    The author states that “the sign test is known for its high number of Type II errors whereas the t-test is known for producing Type I”; later he suggests that both tests be used in IR experiments. Since these two tests are very different and produce different types of errors, I am wondering how IR researchers should reconcile the differing results from the two tests. This is not mentioned in the paper.

    In Section 5.1.1, the author cites others' work on not using significance at all. Instead, these researchers investigate how many topics are needed in a test collection for a 5% difference to accurately predict which run is better. I think determining the number of topics required is quite subjective, and the 5% difference itself is very subjective as well. Neither comes close to telling us whether the test results are due to chance or to properties of the different runs. Significance tests cannot simply be replaced by some measurement of absolute values, whether it is the number of topics or the percentage difference required.

  14. 1. When discussing the alternative hypothesis H1, the author outlines two possibilities: the one-tailed test and the two-tailed test. For the two-tailed test, the hypothesis states that there is a difference between system A and system B. It does not state which system performs better. Instead, the conclusion that can be drawn is that the difference between the performance of system A and system B is significant. When is the purpose of an experimental design just to know that two systems are different? Given that the goal of research is to make a contribution, knowing your system outperforms some standard seems like a worthy evaluation. However, just knowing your system is different from the baseline does not mean you have contributed to the field. As a result, are two-tailed experiments performed when the researcher is trying to mask the performance of his system? If he has the resources to perform a two-tailed test, then he should have the resources to perform a one-tailed test that can conclude whether his system is better.

    2. In the same discussion of the H1 hypothesis, the author comments on the actions of researchers using the one-tailed and the two-tailed test. Depending on the data, a researcher can incorrectly hypothesize which system is performing better. As a result, he cannot reject the null hypothesis and is forced to conclude there is no difference in performance between his system and his tested baseline. However, the researcher can perform the one-tailed test again with the opposite assumption for the alternative hypothesis. In the end, he may be able to conclude the baseline performs better. The author mentions that switching the hypothesis and performing the one-tailed test again requires a new data set. Why is this restriction required? If you have experimental data, why is it not possible to perform multiple significance tests against it? Is this data set referring to something specific? I do not see why this is a necessary requirement.

    3. As a criticism of significance tests, the author demonstrates how outside factors can influence which system may actually be the “best.” The author points primarily to financial considerations a corporation may take into account, such as the cost of new hardware or installation requirements. Is this a valid criticism to throw at significance testing? It is true that these outside factors are not considered at all in the evaluation process. However, they were never intended to be considered. The purpose of the significance test is simply to evaluate the results of two systems to determine whether the difference arose by chance or reflects a true difference. It was never meant to be the only consideration in choosing to implement one system or the other.

  15. 1. This article claims that sign tests, t-tests, and Wilcoxon are the most widely used tests currently in use. I am wondering if anyone else is alarmed by this after having read the Smucker ("A Comparison of Statistical Tests...") article for this week. And, who would have the power to declare these tests unacceptable if the Smucker article is right?

    2. This article summarizes the Smucker article mentioned above, alluding to the problem with some of these tests that the "Wilcoxon and sign tests [produce] quite different results"(p. 313) from the others. But, are different results necessarily bad results? What if the other tests shared a flaw that Wilcoxon and sign don't have? What would we lose from getting rid of these tests?

    3. How trustworthy are studies that rely on one-tail tests and about what percentage of IR experiments rely on them? The authors say that they are frequently used to test against a baseline, but they also seem like they might not be as substantial as the two-tailed tests.

  16. 1. It’s stated that “tests that have fewer assumptions tend to generate more Type II errors and are said to have less power”. (p. 310) How many assumptions are needed in a test to guarantee the power of test result?
    2. Saracevic argues that no parametric statistical test could be used with confidence on data emanating from effectiveness evaluations because such “data does not satisfy the rigid assumptions under which such tests are run...conditions set for use with non-parametric tests were also likely violated by the data output from test collection evaluations”.(p.311) What’s the evidence of this argument?
    3. It is stated that if the test set is a representative sample of the broader population of queries, documents and judgements made by users in an operational setting, then the conclusions on whether H0 can be rejected or not should apply to the population.(p. 316) How do we decide whether a certain test set is representative or not? Are there criteria?

  17. 1. The author states that the 'choice of a one or two tailed test needs to be made before analyzing the data of an experiment and not after.' How true is this in practice? What if I did a two tailed test, found that the results are insignificant, but that if I chose a one tailed test instead, they are significant. Would it be unethical to do this and say that my results are significant according to a one-tailed test? How unethical is this compared to running a whole bunch of experiments on a whole bunch of metrics and hand picking the ones that show something was achieved?
    2. The author comes back to the result of one of the other papers that "in the absence of significance tests, performance differences of less than 5% must be disregarded." One could argue, however, that where this difference occurs is equally important. For example, if the 5% difference occurred in the 80th percentile range rather than in the 60s, wouldn't that make a difference? This issue seems not to have been mentioned at all in any of the papers thus far. Differences are considered uniformly across the full range.
    3. Given all the unanimous disavowal of the signed tests, why are they even in use? Has their use diminished at all since these papers got published, or have they been conveniently brushed under the mat?

  18. 1. It is said in this paper that “in IR parlance, Type I measures the precision of the test, Type II measures its recall”.(p.310) Why is Type I not used to measure recall and Type II not used to measure precision?
    2. Van Rijsbergen states that Wilcoxon and Sign tests can only be used if data are drawn from continuous distributions; yet retrieval experiments produce discrete distributions. (p. 311) Why does Van Rijsbergen suggest “conservative” use of the Sign test despite the violations? How is the sign test conducted?
    3. In this paper, the author mentions a lot of tests, such as the Wilcoxon signed-ranks test, the Sign test, the ANOVA test, the Friedman test, the Mann-Whitney U-test, the Chi-squared and the t-test. What are the features of each test? What are the advantages and disadvantages of applying each test?
    4. Savoy proposed use of the bootstrap test. But the test was little used by the IR community. What’s the disadvantage of this test?

  19. Sanderson notes on page 312 that Zobel ran an experiment using three significance tests in an effort to pinpoint the best one for use in information retrieval. Zobel found that each test used was able to predict the best run, but singled out the Wilcoxon signed-rank test as the best of the three. If all the tests used were able to come to the same conclusion, does the “power and reliability” of the Wilcoxon test matter, or is it simply a matter of individual preference?

    It seems like significance tests vary to such a small degree that they might be interchangeable between various runs. Even Smucker et al., including the randomization test showed the same basic differences between the tests and yet they champion the randomization test above the others. What justification might a researcher have for using a particular test in test collection runs?

    On page 315, Sanderson touches on this by exploring one and two-tail testing. Is there any way to detect when an experiment or researcher has used a two tail test and then a one tail test in order to “search out significance” or is it up to the conscience of the individuals?

  20. 1. The author states that Confidence Intervals can be used to analyze the data instead of significance tests. He also states that significance tests have been overused and that confidence Intervals are a popular alternative. From what has been described in the chapter and the references, it is not clear how CI can be used to compare two different IR systems.

    2. On page 310, the author states that tests with fewer assumptions tend to have more Type II errors and that non-parametric tests, such as Wilcoxon’s test, tend to make more assumptions about the data in order to reduce Type II errors. Is this the reasoning behind Sanderson and Zobel’s claim that Wilcoxon is a bad test (in one of the other readings this week)? If not, what are the other reasons why a t-test is better than the Wilcoxon test?

    3. The bootstrap test proposed by Savoy and discussed in this paper does not assume data normality. Does this mean that the test generalizes to all types of data distribution? If not, how does it remain relevant to the IR field, where the data we consider are mostly normalized to fit the standard distribution (or at least tend towards it)?
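
On the questions above about how a confidence interval could compare two IR systems and how a bootstrap avoids assuming normality, one possible approach is a percentile bootstrap interval on the mean per-topic difference. This is a sketch under stated assumptions, not Savoy's exact procedure or anything the chapter prescribes. The function name `bootstrap_ci` and the scores are hypothetical.

```python
import random
import statistics

def bootstrap_ci(scores_a, scores_b, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean paired difference (A - B)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Resample topics with replacement and record the mean difference each
    # time; no distributional shape is assumed for the differences.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(resamples)
    )
    lower_index = int((1 - level) / 2 * resamples)
    upper_index = int((1 + level) / 2 * resamples) - 1
    return means[lower_index], means[upper_index]

# Hypothetical average-precision scores for two runs on the same 10 topics.
run_a = [0.42, 0.31, 0.55, 0.60, 0.28, 0.47, 0.39, 0.51, 0.44, 0.36]
run_b = [0.38, 0.29, 0.50, 0.61, 0.20, 0.40, 0.35, 0.45, 0.41, 0.30]
low, high = bootstrap_ci(run_a, run_b)
print(low, high)
```

If the interval excludes zero, the runs differ at roughly the chosen level, and the width of the interval also shows how large or small the difference might plausibly be, which a bare p value does not.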

  21. 1. If the most commonly used significance tests typically produce type I and type II errors, is it beneficial to run several significance tests to ensure that your results aren't being affected by those errors?

    2. Why are the Wilcoxon, t, and sign tests the most commonly used? What about these tests appeals to researchers given the existence of alternative significance tests and evidence against the accuracy of the Wilcoxon and sign tests?

    3. How valid are the results of significance tests run on different runs on different topics? How accurate are these significance tests when compared to runs done on the same topics?

  22. 1. One has to make certain assumptions about the unknown data for analysis purposes, but these can result in Type I or Type II errors depending on the type and number of assumptions made. What is not clear is which assumptions lead to which errors. Considering that Smucker's paper stated that the Wilcoxon rank test and the sign test should be discarded and that only parametric tests have to be used for evaluation, how was it generalized that Type I errors are acceptable for all IR systems?

    2. On page 312, there is a section that describes Zobel’s experiment with three significance tests, which split the topics of a single test collection. Although many IR researchers have presented experimental evidence against the use of the Wilcoxon test, Zobel concludes from his experiments that he prefers the Wilcoxon test, given its reliability and greater power, over the paired t-test and ANOVA. Can this be attributed to the strategy of splitting the topics of the test collection? And what is Zobel referring to as the "greater power" of the Wilcoxon test?

    3. What were the methods, test collections, topics and metrics used by Voorhees and Buckley as opposed to Zobel that the number of topics required to accurately predict which pair of runs is better is found to be double in the first case? Did they both make the same assumptions that resulted in this huge difference? If yes, does that indicate that the accuracy is independent of the number of topics? If no, then does that imply that it is highly volatile in nature that during each run there might be differences in the resulting value of accuracy?

    4. In order to avoid Type II errors, researchers had suggested a method to incrementally add topics to a test collection until the required power was achieved to avoid such errors. Is this a proven approach and how reliable and robust can this approach be? Isn't it a computationally exhaustive and brute force method? What are the other ways of avoiding these errors?

  23. 1. Significantly better as determined by the user population vs statistically significant, how will the two correlate? Will the correlation be higher if a simple measure of minimum relative difference is observed?

    2. The author brings to light the purpose of significance tests and concludes that significance may not translate well to real world settings, since a single measure of performance does not enable a holistic inference. However, how often do we measure and compare systems holistically?

    3. The author suggests that the use of confidence intervals may encourage progress and investigation. Confidence intervals do seem to be less harsh as opposed to a binary decision. However, confidence intervals are sensitive to the sample size adopted.

  24. 1) Sanderson mentions Hull's claim that the t-test often performs well even when violating the normal assumption. Can you describe Hull's experiment?

    2) Are empirical tests required even after having a theoretical analysis of a test? I would assume that a theoretical analysis is always better but is difficult to develop.

    3) When describing one-tailed tests, Sanderson states that if a one-tailed test fails to reject the null hypothesis, then the only way to test in the reverse direction is to use a different data set. Why is this necessary?

  25. The author suggests researchers should decide what type of testing they will do (one-tailed vs. two-tailed) before starting, so as to remove bias from the test results. But what if the factors on which this decision was based change over time? Isn’t it better to change the direction of the testing once it has been identified that such a change is needed?

    The author mentions that “More powerful tests generate fewer Type II errors but make more assumptions about the data being tested.” But how many assumptions are needed? What are they? And how is it ensured that they are actually correct?

    The author writes: “One popular alternative is the confidence interval (CI) which can be used to compute an interval around a value, commonly displayed in graphs using an error bar.” This indicates that CIs have become a very popular tool for significance testing. But he also states “Confidence intervals are sometimes used in IR literature”, which seems to indicate that CIs have not yet seen much practical use in IR. Aren't these two statements contradictory? He also does not mention how exactly CIs have been used in the literature.

  26. 1. In this chapter Sanderson states that there is essentially one null hypothesis that significance tests in IR evaluation consider: that there is no difference between the two runs under evaluation. However, in the Smucker, Allan, and Carterette article they show that several significance tests, like the bootstrap test, have null hypotheses that are similar to and yet different from those of other significance tests. Do you agree with Sanderson’s statement, or do you think that the null hypotheses that Smucker, Allan and Carterette showed are really that different?
    2. In this chapter the author discusses the differences between a one-tailed and two-tailed significance test. He states that many IR researchers consider the one-tailed test to be very useful for IR, and yet that in most other experimental sciences the two-tailed test is more common. He goes on to state that each has its own benefits. Which of these two types of tests do you think is more useful for IR evaluation and why?
    3. At the end of this chapter Sanderson suggests that IR evaluation should consider using some other type of statistical evaluation other than significance testing to compare systems. He argues that confidence intervals, which are used widely in other fields of study, would be better tests to use. Given what we have read this week on significance tests in IR evaluation do you think that the field should give up all of the work they have put into significance testing to use another testing type that may give better results and allow for better analysis of the data?

  27. 1) Sanderson mentions that the t-test is known for producing Type I errors, but does not go into details regarding the test. What characteristics of the t-test cause Type I errors more so than the sign test? I’m wondering, because if the t-test produces so many false positives, I’m not sure how it can be used reliably in an IR context.

    2) Sanderson goes on to cite several authors who reach the conclusion that statistical tests can only be used if data is drawn from continuous distributions. Unfortunately, retrieval involves discrete distributions. This would imply that such tests should not be used in IR, but they go on to say that “conservative” use is okay. How do we go about making a test “conservative” or using it “conservatively”? This was not entirely clear to me.

    3) It is interesting that when discussing one-tailed vs two-tailed tests, the author mentions needing to choose the test before the data is analyzed. Does it actually affect the results to run every permutation (a two-tailed test and one-tailed tests in both directions)? As long as you are consistent throughout your data regarding which test you use, does it matter if you “test the waters” to see what yields the “best results”?

  28. 1- The author mentions two tests that I do not remember learning about: the bootstrap test and the confidence interval test. It is interesting that the bootstrap test does not make assumptions about the distribution of data. I am curious about how a test like that might work. In terms of the confidence interval test is this an actual test or just a visualization of data?

    2- When discussing the one-tailed test, the author mentions that if the null hypothesis is accepted in this test, some information may be missing. If the baseline was significantly better than the test system, this would be obscured. He said that the test could then be performed in the other direction (with the former baseline as the tested system) but that the test would have to be done on new data. Why is that? I am generally confused about his support for this one-tailed method because it seems to contradict our prior class discussion in which we said a two-tailed test is better. Is it much harder to conduct a two-tailed test? Surely a two-tailed test must be easier to do than a one-tailed test followed by a whole new experiment with new data for the tail to go in the other direction.

    3-The author mentioned almost indirectly what is the most interesting thing about significance testing: the idea of practical significance. Just because one system is so many percentages better than another does not mean that a real user would bother switching to it because of a variety of factors including what resources are needed to run the better system and what types of topics it does better on. Should real system engineers have to include a discussion of these other variables when declaring that one system is better than another? Merely being able to claim that one system is significantly better than another seems shallow if it is missing this other information.