Thursday, October 24, 2013

10-31 M. Sanderson and J. Zobel, Information retrieval system evaluation: effort, sensitivity, and reliability, SIGIR’05


  1. 1. My first question concerns Section 3.3, Examining past SIGIR results. This section discusses a selection of papers presented at SIGIR in 2003 and 2004; altogether, 26 papers that evaluated a well-defined retrieval task were chosen. Although the 26 papers were selected according to certain rules, that does not necessarily mean their goals can be treated as equivalent. I am wondering whether testing for significance, or the other issues this paper criticizes, are really critical for these papers.

    2. My second question is about the comparison of MAP and P@10. From my perspective, if we want to compare two evaluation methods, we need to distinguish between different applications. For example, if we are just handling navigational queries, then MAP is obviously better than P@10, since we only need the top answer. I think this problem also applies to the whole paper: it tries to illustrate and compare evaluation methods, but it fails to discuss them case by case for different retrieval tasks.

    3. My third question is about whether this paper meets its own standard. Since it criticizes previous papers for lacking sufficient experiments for evaluation, how can it show that its own comparison is sufficient and unbiased? For example, when discussing topic selection with and without replacement, how did the authors ensure that the data and corresponding topics were unbiased? How could they ensure that their methods covered all situations before drawing conclusions?

  2. 1. In both the Sanderson and Zobel article and the Smucker, Allan and Carterette article, the authors concluded that the sign and Wilcoxon tests are not reliable. Is there ever a case in which the sign and Wilcoxon tests should be used? Why do some authors still use these tests?

    2. Researchers reported a significant difference in less than half of the SIGIR papers that Sanderson and Zobel examined. How can this be improved? Should there be more stringent criteria for accepting papers into SIGIR? Or could researchers still learn from papers that didn't achieve a significant difference?

    3. The authors conclude that MAP is a more reliable measure than P@10, but that P@10 more efficiently uses assessor effort (p. 166-7). Since P@10 is more efficient, should researchers use this measure to provide a “good enough” result? When might it be useful to use P@10 and when would you want the more reliable measure of MAP?

  3. In this paper by Mark Sanderson, the 50 topics are divided into two disjoint sets, which are treated as independent samples. In this case, should we test the homogeneity of variance of the two samples before conducting independent-sample t-tests?

    The paper also studies error rates, and I find them confusing: can the error rates really predict the accuracy and reliability of significance tests?

    As for assessor effort, why is it discussed in a paper that mainly considers significance tests? Will it affect the tools used for testing significance? What other factors may actually affect the reliability of significance tests?

  4. Effort Sensitivity and Reliability:
    As each TREC run had just 50 topics, it is understandable that the authors and previous researchers divided the topics into disjoint sets of 25 each. The results were then extrapolated to sets of 50 topics. I am doubtful whether the extrapolation can be done based on such limited data. Moreover, the smaller trend values (for both absolute difference and relative difference) appear to lie on a straight line. It appears that the authors assumed, without any justification, that the data points lie on an exponentially decreasing function.

    As pointed out by many researchers, most of the data collected for IR research does not follow a normal distribution. However, we see multiple researchers concluding that the t-test performs well even though the data does not follow the normal distribution. Where does this discrepancy come from? Is it because the assumption that the t-test requires normally distributed data is too stringent, or because our assumption that the data is not normally distributed is possibly incorrect?

    It is interesting that P@10 gives a much lower error rate than MAP for a given amount of assessor effort. This supports the view that having more topics is preferable to having more judgments per topic (topics vs. depth). This appears counter-intuitive because MAP takes the ranking of the judged relevant documents into consideration, whereas P@10 remains oblivious to it. Thus MAP, contrary to the experimental results, would be expected to produce less error than P@10.

  5. 1. Do the researchers have an ideal technique in mind for how significance tests should be undertaken? Not only are they picking on papers that fail to (explicitly) report significance, but also on papers where it is reported but the test is stringent. Also, wouldn’t a ‘stringent’ test lead to increased reliability, or are there any repercussions?

    2. Agreed that the density of relevant documents is higher at the top of a ranking than at the bottom. But would you consider this top 11-14%, mostly relevant documents, as a representative sample of the test collection?

    3. The texts we read this week identify that statistical significance for small topic set sizes cannot be repeated for the other topic sets. Thus for future work, is there a recommended minimum set size (a number or percentage) that researchers should use to make it repeatable and representative?

  6. How would such shallow pools and large numbers of topics impact inter-rater agreement? How would it handle alternative interpretations of a query, which may be judged "somewhat relevant" by some assessors? In addition, wouldn't this impact the type of measures used and how we might need to consider the error costs of the measures?

    I am not sure I follow their argument re: topic selection in 50 and 25-topic datasets without replacement. While the conditional dependence issue is certainly valid, doesn't their argument about the respective probabilities of "a>b", "a=b", "a<b" hold whether or not the topic subsets are chosen to be disjoint? Furthermore, what is the alternative - selecting with replacement? They mention this, but I do not understand how this is a viable alternative, particularly when the probability of a large overlap is high (e.g., selecting two subsets of 25 topics). If selection with replacement is used to establish the "lower bound" of the error rates, how can we be sure that it actually is providing the lower bound? Furthermore, in what situations will this type of lower bound be useful?

    Why does the Wilcoxon test seem to favor Type I errors? The arguments in the papers so far have not exactly elaborated on the underlying issues of the Wilcoxon test, but rather commented on its low reliability or obsolescence. What might the cause be for its errors?

  7. 1. The paper cuts down on the time required to run the tests by measuring only 25 topics, and then extrapolating. Given the papers we've read on how some topics are much more effective in measuring effectiveness than others, doesn't this introduce the possibility of incorrect results? Has any work been done that characterizes the effect of using fewer topics on significance tests?

    2. One of their conclusions is that building test collections with shallow pools locates more relevant documents, which they believe results in more accurate measurement. This seems counter-intuitive, though. Are they saying that by taking fewer relevance judgements per topic into account, they are able to get a more accurate measurement of system effectiveness? How could this work?

    3. The paper uses selection with replacement of topics, which the authors view as a lower bound when calculating the significance of measurements. This seems like a much better measure for IR systems, since the gains from changing or implementing a new system must be substantially more than the cost of changing. Has this approach seen much adoption in the IR community?

  8. Sanderson and Zobel write in their introduction, "If the significance is omitted or the improvement is small, as is the case in many SIGIR papers- results are not reliable". When I first read this paper, I wondered what their definition of 'small' was, but then throughout the paper they keep referring to 0.05%. Is that 'small'? Or is that number small or large enough to determine statistical significance?

    What is the smallest number of query outcomes that can be used with the bootstrap method? The authors mention in section two that it is unclear whether the small number of queries used by Savoy was enough to assess significance. The authors don't mention how many queries Savoy used, but is there a general number that is 'accepted' by the IR community? In other papers I see mentions of 25/50 runs. Are those the norms?

    In section 4 (P@10 and assessor effort), the authors conclude that MAP is a more reliable measure than P@10 because MAP takes into account the location in a ranking of all known relevant documents. My question is, can you calculate MAP if you don't know the ranking of all known documents, such as in a dynamic or live setting?
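
As a rough illustration of the MAP question above, here is a minimal Python sketch (not taken from the paper) of how P@10 and average precision are commonly computed for one topic. The function names and the toy document IDs are hypothetical. It assumes binary relevance and the common convention of dividing average precision by the number of known relevant documents, which is why the full set of known relevant documents is needed, while P@k only needs judgments for the top k.

```python
from statistics import mean


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant document.

    Relevant documents that never appear in the ranking contribute zero,
    because the sum is divided by the number of known relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(topics):
    """MAP: mean of average precision over (ranking, relevant) pairs."""
    return mean(average_precision(r, rel) for r, rel in topics)


# Hypothetical toy topic: d9 is known to be relevant but was not retrieved.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
p_at_2 = precision_at_k(ranking, relevant, k=2)
ap = average_precision(ranking, relevant)
```

In this toy case the unretrieved relevant document d9 lowers average precision but has no effect on P@2, which is one way to see why the two measures depend differently on how complete the judgments are.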

  9. The authors criticise some studies by stating: “in two papers no numbers were given, with all results presented graphically.” But the authors themselves note that the amount of data was greatly reduced for the testing, without mentioning how much data was left. They are also vague about the statement “Not all bins were graphed, either to remove clutter or because insufficient data was available.”

    The authors suggest that: “These considerations and the results above suggest that, in contrast to the current TREC methodology, it is better to have larger numbers of topics (perhaps 400) and shallower pools (perhaps depth 10).” But won't testing with a shallow pool miss the chance to evaluate the system’s relative performance with regard to precision?

    The authors advocate the use of shallow pools for testing, but they do not mention the criteria that need to be taken into account when generating them. Because testing will be performed on a very limited set of documents for a query, wouldn’t it be important for the documents in the shallow pool to be diverse and distinct, and to complement the information being shown for a given query?

  10. 1. This paper is similar to one of the papers in which the authors try to identify the minimal number of topics needed for evaluation. A projection method is used throughout the whole paper. My first question is about the projection method that predicts the error rate for data sets with 50 topics. What projection method do they use to make such predictions? If it is a curve-fitting method, what model do they use?

    2. My second question is about the choice of significance range. Why were only p values within (0.04, 0.05) and (0.01, 0.05) used? What about p values in (0, 0.01)? The error rate is another important metric used in the paper. How is this error rate calculated? Also, why were IR metrics such as GMAP not used, since GMAP puts more emphasis on highly ranked bad results, which is probably more important for real users?

    3. The relation between MAP values and p values was studied. Intuitively, a higher MAP difference implies a higher chance of significance, but what exactly the relation is remains unclear. MAP measures the averaged performance of different systems, while statistical tests such as the sign or Wilcoxon test focus more on a paired comparison on each topic. So, based on their definitions, these different metrics won’t be well correlated. If that is the case, what is the point of studying the correlation between them? And how can we use the value of one of MAP and the p value to predict the other?

  11. 1. Statistical testing methods come with their own sets of assumptions. What would be the impact of violating these assumptions? Also, it seems that each of these testing methodologies requires us to make use of prior information. Doesn't this come across as a limitation that we would always have to deal with when using statistical testing methodologies?

    2. Since we have read in papers how simple metrics like P@10 and RR are capable of being representative of the user experience, do we need complex metrics like nDCG and RBP at all when we get reasonably low error rates with these simpler metrics, especially when we hope to apply these metrics to judging assessor effort in IR tasks?

    3. We have seen in previous papers how tedious it is to calibrate assessor effort. The primary investigation of the paper is aimed at proving that building test collections seems to play a more important role than a thorough assessment of effort. However, doesn't this seem contradictory to the entire purpose of IIR, wherein it is more important to calibrate user satisfaction?

  12. In Section 3, the authors state “As can be seen, according to the projection a relative difference of 25% must be observed to give confidence that the result for the first topic set is significant”. They actually imply that many previously observed differences between IR systems are insignificant, as most of them are based on a 5% relative difference. I think the statement rests on an underlying assumption that all those IR systems, including the ones used by the authors in the paper, share the same underlying conditions (e.g., test collection, assessors). Apparently this is not a safe assumption.

    In Section 3.1, because the authors only use pairs of runs where significance is observed, they admit that the quantity of data is greatly reduced. My question about this approach is whether, since the data are filtered in this way, the representativeness of the data is greatly reduced as well, which in turn would jeopardize the experimental results.

    In Section 3.2, the authors find that 57.3% of comparisons are significant for the t-test but 78.0% of differences are significant for the Wilcoxon test. For both tests, the differences are all in the band 20%-30%. What do the authors imply? Does the result imply that the Wilcoxon test is better? If so, the authors themselves neither confirm this nor explain why the Wilcoxon test is better.

  13. 1. The last paragraph of section 2 mentions that “there has been concern that the theory underestimates the rate of type I error”. Why does this concern arise?
    2. In section 3.3, the authors criticized some SIGIR papers which did not provide sufficient information. It is a tricky judgement, and there might be other reasons for the absence of such data or value.
    3. On page 166, left column, last paragraph, the paper says the reason why MAP is better than P@10 is that “the more relevant documents an effectiveness measure uses, the more accurate that measure has the potential to be”. However, MAP and P@10 do not differ only in data size. What will happen if we consider P@100, which has the same size as MAP in this paper?

  14. 1. In the related work, the authors do not seem to make any mention at all of the randomization significance testing, which was the favored method of choice in the other reading. Moreover, they mentioned bootstrap testing has not been used significantly in practice in the IR community. One could infer the randomization significance testing is used even less. Why is this the case? Or has this changed in recent years?
    2. It was mentioned that bootstrap significance testing is not popular because it is hard to implement or requires expert advice. It was also mentioned that the sign tests became so popular because they were proposed in an era when cheap computation was not available. So it seems we have an effort tradeoff even here, except that the effort differs from that of labeling samples (as in machine learning) or assessor effort (as in traditional IR): here it is the effort of properly implementing a good statistical significance test and determining its parameters.
    3. This is an inherently subjective question, but which kind of error would we most like to avoid in practice: Type I or Type II? The authors conclude that Type II errors are actually not that common, i.e., it is unlikely for a system to be rejected if it is truly good. Moreover, we've seen evidence that the sign tests have tended to overestimate the performance of many systems, and may have been misused to detect performance differences, leading to Type I errors. Is it really a bad thing to overestimate a system? What are the pros and cons of doing so?

  15. 1. The experimental setup described in section 3.1, which seems largely to be a recreation of the Zobel (1998) work, shows the effectiveness of different stringencies of t-test in terms of the reordering error produced. Does the setup not contradict the original reason for doing the significance test? (That is, to begin with, absolute differences in error were not to be trusted, so we needed significance testing; but 3.1 again evaluates the effectiveness of the significance tests with respect to the absolute differences in error they allow, thus repeating the pattern.)

    2. The authors advocate for more query topics over more assessments per topic. More query topics should have the effect of increasing reliability for significance tests. Nevertheless, might generalization over all topics actually tell us less about which of two systems is stronger? How, for example, can one ensure a random sampling of topics? Might it be more insightful to learn which algorithms perform significantly better on which topics?

    3. The results in figure 8, from the experiments without topic set replacement, do not make sense to me. Why is it that we see an initial decline in error when topic set size is increased followed by a steep increase in error as topic set size nears 50? What does this say about the effect of topic set size on the reliability of significance tests?

  16. 1. The fact that the authors "reconsider the error rates" by revisiting and attempting to repeat an earlier study by Voorhees leads me back to our discussion of the necessity of being able to repeat an experiment (with the example of the Google experiment where the authors could not reveal their methodology). Why did Sanderson and Zobel feel it was necessary to repeat this exact experiment instead of creating a new, similar study? What does it lend to their argument?

    2. The authors claim that 23% of their sampled studies' "reported improvements were small, no more than a few percent in relative MAP" (p. 4). Are they suggesting that these studies go unpublished? What responsibility should studies like these have to be straightforward and do a complete significance test? How is funding affected by a project's "success"?

    3. This article ultimately favors t-tests, while Smucker favors randomization tests, and previous works favor Wilcoxon. All of these articles seem convincing and are published by reputable sources. How do researchers decide who to trust and what to use?

  17. 1. When discussing figure 2, the authors say “higher relative differences were not considered”. Why did they ignore such differences?
    2. In section 3.2, it is mentioned that “for higher differences in MAP, results were similar to the t-test”. Why are the results not similar for the lower differences?
    3. The paper describes “assessor effort” as “the number of topics multiplied by assessor effort to assess a topic to pool depth 10 or 100” (p. 167). Does this make sense? Assessor effort is related to user behavior, and it cannot be reduced to such a number or formula.

  18. In section 3.3, Sanderson and Zobel note that many recent SIGIR papers lack a significance test and yet make claims of MAP differences. If significance tests were not performed for these papers, how can we trust the results moving forward, especially since the significance claimed by these papers can be called into question?

    I might just not be remembering correctly, or may have overlooked the TREC runs that have taken place since this article was published, but has the methodology changed to reflect the suggestion by Sanderson and Zobel to shift from deep pools and fewer topics to more topics and shallower pools? If so, how has that been embraced or rejected by the participants?

    For a field which seemingly embraces constant change and improvement, it seems as though IR clings to outdated practices, or perhaps to practices which lack the scrutiny required for true advancement, in a lot of different areas. Other than relying upon researchers to spearhead the use of methods such as the randomization test or t-test over other significance tests, and to actively use those tests within their papers, is it possible to have a discipline-wide set of guidelines that could promote an environment where actual advancement is being made? Or is that too ambitious?

  19. 1. This is one of the first papers in which we have looked at the Analysis of Variance (ANOVA) test. We have seen the Wilcoxon test and the t-test in the other papers that we have read. What led the authors to compare the existing tests with another statistical test which is not widely used? Smucker, Allan et al. conclude that the Wilcoxon test should be discontinued from IR study. Which tests are used currently?

    2. In section 6 as well as other sections, the authors state that the two disjoint sets of 25 taken from the 50 topics are assumed to be independent. Does disjointness guarantee independence? There might be disjoint topics which are highly dependent on one another. Even if the 50 topics were picked to be independent (the article simply assumes independence), how can they be representative of a real-world scenario?

    3. The authors conclude that test collections with shallow pools locate more relevant documents. Although they have tried to show that this claim holds true, a comprehensive explanation of how the shallow pool can be formed is missing. Considering that they have reduced the depths from 400 to 10 (page 6), it would have been clearer if they had explained the intuition.

  20. 1. The observation about the issue with random topic selection without replacement is interesting. While the paper states that this issue was approached, the approach was not presented. Maybe discretizing topic sets based on performance differences and then attempting sampling would have been a good idea? The conclusion does talk about investigating stratified sampling, though I'm not sure how this could be approached.

    2. If shallow pooling and increasing the number of topics enable stronger assessments, why hasn't this methodology been adopted? The conclusions drawn in this paper were not strongly supported; even so, how much traction has this approach received in the following years?

    3. While this paper and earlier studies suggest that the sign or Wilcoxon test is not as reliable when compared to the t-test, there does not appear to be any justification. What are potential reasons for such behavior to be observed consistently?

  21. 1. When the authors restricted their data to only "significant" runs, does the smaller data set that is left weaken the strength of the results they acquire?

    2. The authors considered MAP and P@10 in their significance tests. Given the number of metrics available to researchers, are there different metrics that could be used that would provide different results?

    3. Sanderson et al. mention that the Wilcoxon and sign tests were simplified permutation tests due to lack of computing power at the time. How might significance tests change as IR test collections grow to keep up with the growing information needs of users?

  22. 1) Can you explain the logic of Zobel's experiment? Is there a reason to partition the topic set in half?

    2) Isn't the idea of using shallower pools bad for numerous reasons? One being, metrics become unstable with shallower pools. Also, the reusability value of the collection diminishes with shallow pools.

    3) Can you elaborate more about the idea of topic selection with replacement? How is this approach establishing a lower bound? What is the impact as the topic overlap increases?
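
For readers who want to experiment with the split-half idea raised in the questions above, here is a simplified Python sketch. It is not the authors' procedure: the function name `swap_rate`, its parameters, and the reading of "with replacement" as drawing each topic set independently with `random.choices` are all assumptions made for illustration. It takes per-topic scores for two systems, repeatedly draws two topic sets, and counts how often the two sets disagree about which system is better.

```python
import random
from statistics import mean


def swap_rate(scores_a, scores_b, set_size, trials=1000, disjoint=True, seed=0):
    """Fraction of trials in which two topic sets rank systems A and B differently.

    scores_a, scores_b: per-topic effectiveness scores (same topic order).
    disjoint=True: the two sets share no topics (selection without replacement).
    disjoint=False: each set is drawn with replacement, so topics may repeat
    within a set and overlap across sets (an assumed interpretation).
    """
    n_topics = len(scores_a)
    if len(scores_b) != n_topics:
        raise ValueError("both systems need one score per topic")
    if disjoint and 2 * set_size > n_topics:
        raise ValueError("not enough topics for two disjoint sets")
    rng = random.Random(seed)
    topics = range(n_topics)
    swaps = 0
    for _ in range(trials):
        if disjoint:
            picked = rng.sample(topics, 2 * set_size)
            first, second = picked[:set_size], picked[set_size:]
        else:
            first = rng.choices(topics, k=set_size)
            second = rng.choices(topics, k=set_size)
        diff_first = mean(scores_a[t] - scores_b[t] for t in first)
        diff_second = mean(scores_a[t] - scores_b[t] for t in second)
        # A swap means the sign of the mean difference flips between sets.
        if diff_first * diff_second < 0:
            swaps += 1
    return swaps / trials


# Hypothetical scores: system A beats B by the same margin on every topic,
# so no split can ever disagree about the ordering.
always_better = swap_rate([0.5] * 10, [0.3] * 10, set_size=5)
```

Note that with disjoint sets of 25 drawn from 50 topics, the second set is fully determined by the first, which is the dependence that several of the questions here ask about.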

  23. 1. In this article the authors redo an experiment that was done by Voorhees and Buckley so that they can test it under different parameters. When describing what they did, they state that the results they obtained were close to the ones that Voorhees and Buckley reported. Why would their results be any different from the results that Voorhees and Buckley obtained? Since they never state exactly how large a difference exists between their data and Voorhees and Buckley’s, should they be able to compare their results to Voorhees and Buckley's?
    2. In this paper the authors talk about the difference between using selection with and without replacement. When splitting up a run to test significance, you can either make sure that there is no overlap between the topics chosen for each group or allow overlap. The authors suggest selection with replacement because selection without replacement causes some bias in the pools, since one pool is dependent on the other. However, wouldn’t it be the case that, if you selected with replacement, some topics would not be represented at all, and that the absence of these topics could change the results?
    3. In this article the authors suggest the creation of collections designed to be used with P@10. They argue that while P@10 is not as good as MAP, a collection designed for it would be much smaller and therefore require far fewer relevance judgments. This would make it easier and cheaper to create these collections compared to traditional test collections. However, wouldn’t a test collection designed for one metric like P@10 be biased, because the engines tested on it would focus on having good P@10 values? Is this good or bad?

  24. 1) The authors mention that using P@10 could be better than MAP, since it would allow for relatively reliable results using shallower pools. Since some test collections will likely not be reused it is beneficial to find a useful metric that reduces assessor effort. This seems great, but the authors don’t really back it up at all. They state that, “1.7-3.6 times more relevant documents would be found by using shallow pools.” Where are these estimations coming from? It seems incomplete to use an unfounded motivation for validating the use of P@10, and almost seems like it’s just chosen because t-tests worked well on it.

    2) Previous papers mentioned issues with the t-test and false positives. The authors determine that the t-test works better than ANOVA/Wilcoxon, but they do not directly address the issues mentioned in the other papers. Are these not actually major problems in practice? And if so, why has the t-test not been widely adopted?

    3) What exactly is the ANOVA test? Is this some version of the sign test, with additional parameters? Several of the papers discuss the sign test, t-test and Wilcoxon test in conjunction, which is why I was wondering if the ANOVA test is supposed to be a sign test.
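
Since so many of the questions above compare the sign test with the t-test, here is a minimal Python sketch of both, applied to per-topic score differences between two systems. It does not cover ANOVA or the Wilcoxon test, and it is not the authors' code; the function names and the toy differences are hypothetical. The sign test gives a two-sided p value from the binomial distribution, and the paired t statistic would still need to be compared against a t distribution with n - 1 degrees of freedom to obtain a p value.

```python
from math import comb, sqrt
from statistics import mean, stdev


def sign_test_p(diffs):
    """Two-sided sign test p value; ties (zero differences) are dropped."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_t_statistic(diffs):
    """Paired t statistic: mean difference over its standard error.

    Needs at least two differences that are not all identical.
    """
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical per-topic differences (system A minus system B).
diffs = [0.1, 0.2, 0.3, 0.4]
p_sign = sign_test_p(diffs)
t_stat = paired_t_statistic(diffs)
```

The sign test only looks at the direction of each per-topic difference, while the t statistic also uses the magnitudes, which is one reason the two tests can disagree on the same pair of runs.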