Wednesday, September 18, 2013

26-Sep Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management 36, pp. 697-716, 2000.


  1. 1. Unless I'm mistaken, the mathematics in the Voorhees paper is slightly off. This is in reference to the part where it's written that there are 3^48 possible qrels combinations. This is true...if we assume that no intersections or unions are being performed to produce new qrels. However, this is precisely what the author does, by considering two special cases where intersection/union of the qrels for each of the topics is done to form two new qrels (then again, they are denoted 'special'). The full space of qrels, assuming we may arbitrarily intersect/union over any of the topics, becomes 5^48. Sampling 100,000 qrels from this humongous set seems tiny, but the randomness of the sampling seems to cinch it. Still, I'm wondering what the statistical confidence of these results is, given that the number sampled is tiny compared to the full set?
    2. I asked this before, in reference to a similar reading, and I'm still confused: what is the difference between a manual information retrieval system and an automatic one? With sets of this size, it's hard to see how anything can be manual. What component needs to be handled by humans for the system to be denoted manual?
    3. The most interesting take-away from the paper for me was that 65% precision and recall seems to be the upper bound, from a human point of view, of what a retrieval system can achieve given these overlaps. Assuming that number holds for other studies and collections, can we exploit it as an alternate measurement of comparing retrieval systems? For example, we could try computing precisions of different systems at 65% recall, and see how they compare. At least that way, the comparison could be justified as robust against assessor disagreement. Are there any pitfalls in forming that conclusion?

    1. To answer your second question, Mayank, I believe there is an answer on page 10 of this reading, where the author mentions that a system is termed manual if 'any' human intervention occurred during processing. This need not cover all topics and documents, as that would be exhausting; it might be as minimal as clarifying a relevance assessment when automatic judgments conflict. This could be a very good point of discussion.

  2. This comment has been removed by the author.

    1. If the Waterloo assessors were allowed to use a 3-point relevance scale (relevant, not relevant, and iffy), shouldn't all the 'iffy' ones have been either re-evaluated by another set of assessors, or divided up evenly? I would imagine 'iffy' falls somewhere between relevant and not relevant, so forcing all iffy judgments into not-relevant seems like it would skew the statistics.

      Are there other retrieval strategies besides ones that focus on precision, focus on recall, or some percentage of the two?

      On page 713/17, the author points out that the averages of averages in topic-to-topic variation can hide individual topic differences. We also spoke in class about the importance of IR being only as strong as its weakest retrieval. Is it possible for a topic's retrieval to become obscured in the masses? How do search engine companies compensate/re-work their algorithms to aid in IR for topic differences? Are some topics ever 'left behind?'

  3. 1. TREC-4 had assessors judge different number of topics (and thus documents), based on their availability (pg. 700). We’ve read how issues like fatigue and boredom can affect relevance judgments. In that context, should all assessors have the same workload per study? Could subject-matter experts be allowed more work based on the assumption that they might be more efficient? Is there a way to tier assessors so that the setup is cost-effective, maximizes available resources, and yet makes their work more efficient?

    2. We see in Table 1 (pg. 701) that overlap is maximized when assessors have a similar background, go through the required training, and work under similar conditions. With the Internet increasing the possibility of remote assessment and crowdsourcing, how can we achieve a similar setup that streamlines assessors and helps maximize overlap? Is it possible?

    3. It's interesting that on one hand we have Carterette and Soboroff, who evaluated assessor errors from (purely) a human factors perspective, and on the other hand Voorhees concludes that test collections in a lab setting disregard the fact that user relevance judgment can change over time. Agreed that changing opinions and viewpoints are (almost) impossible to study. But is there something we can do, as information professionals, to account for these changing considerations? Information is changing at such an overwhelming pace that looking out for a methodology that evaluates how humans keep up with these changes seems important.
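As a side note on the overlap figures discussed above: overlap is the size of the intersection of the assessors' relevant sets divided by the size of their union. A minimal sketch follows; the document IDs and the `overlap` helper are made up for illustration and are not taken from the paper's data.

```python
def overlap(*relevant_sets):
    """Size of the intersection of the relevant sets divided by the size of their union."""
    sets = [set(s) for s in relevant_sets]
    union = set().union(*sets)
    if not union:
        return 0.0  # no assessor marked anything relevant
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Hypothetical judgments for one topic (illustrative IDs only).
primary = {"d1", "d2", "d3", "d4"}
secondary = {"d3", "d4", "d5"}

pairwise = overlap(primary, secondary)  # 2 shared documents out of 5 distinct ones
print(pairwise)
```

With these toy sets the two assessors share two of five distinct documents, so even apparently similar judgments give an overlap well below one.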

  4. In talking about TREC-4 relevance assessments, the author points out that the primary assessor judges more documents relevant than do the secondary assessors. Why were these primary assessors more likely to judge documents relevant here?

    In the part about Defining different qrels sets, why were three additional distinguished qrels added into the sample qrels of size 100,000?

    In the research of estimating the probability of a swap, manual systems and automatic systems have been considered. Here, I have a question: what are manual systems and automatic systems, and what’s the difference between them?

  5. 1. The authors substantiate the claim that although assessors tend to disagree on document relevancy, the relative ranking of IR algorithms remains constant for the early TREC studies. Does this necessarily render criticisms about low assessor agreement moot? How important are the absolute measures of Precision and Recall? Doesn’t the lack of trust in the absolute measure render it difficult to compare IR approaches from year to year, making it more difficult to know whether systems are getting better?

    2. The paper sometimes mentions a ‘baseline’ of assessor document agreement. What standard should we be holding IR systems to? If only about 40% of assessors agree on document relevancy, should we expect IR systems to get about 40% of documents right?

    3. While reading the paper I grew curious about whether there were commonalities in the documents that tended to be disagreed on. Can the notion of a controversial document be abstracted to a set of common features (i.e. long documents where information is buried, documents that only indirectly mention the topic)?

  6. This comment has been removed by the author.

  7. 1. To study how the variance of relevance affects the evaluation of different search engines, different assessors were used to judge the same documents. And to minimize the effect of shared backgrounds of assessors on this study, different groups of assessors (NIST, University of Waterloo) were used. It’s not addressed clearly in the paper but what are the backgrounds of these assessors? Since real users have drastically different backgrounds and preferences, does the combination of assessors from NIST and University of Waterloo capture this diversity? How many assessors were used for the judgment of each document for TREC-6 assessment? Are three assessors for each document enough to simulate diversity of the preferences from the real population?

    2. Up to 50 topics were used for all the studies presented in this paper, which is probably based on the previous finding that evaluating a search engine on 50 topics gives stable results. However, the focus of this paper is on evaluating the effect of variance in relevance judgments, so is it enough to use just 50 topics? Is there any study that uses more topics? Also, it's mentioned that some of the inconsistency in the ranking results was due to the fact that few documents (<5) were judged relevant for some of the topics, and eliminating those topics decreased this inconsistency. However, in the real world there are topics with few relevant results. In that case, how do we handle the problem of variance in relevance judgments?

    3. My last question is also about the methodology. Binary relevance judgments were used in this study, which greatly reduced the variance of relevance judgments among individuals. In the real world relevance is not binary; is there any study that incorporates non-binary relevance judgments and studies the effect of their variance on search engine evaluation? Only mean average precision and recall were used in this study. As we all know, it is often the worst performance that drives users away, and that's why metrics such as geometric mean average precision were invented. Is it necessary to incorporate these metrics in follow-up studies?

  8. 1. Voorhees cites Lesk and Salton on page 711:

    “1. Evaluation results are reported as averages over many topics.
    2. Disagreements among judges affect borderline documents, which in general are ranked after documents that are unanimously agreed upon.
    3. Recall and precision depend on the relative position of the relevant and non-relevant documents in the relevance ranking, and changes in the composition of the judgment sets may have only a small effect on the ordering as a whole.”

    These findings make good sense in reviewing the comparisons between union and intersection. Are there any interactive studies wherein assessors specify what they are thinking when they decide against the relevance of particular documents judged as relevant by other assessors? Are there any studies that take the topic into greater consideration in this regard? Voorhees mentions that some topics had null intersections, and that the intersections in general varied quite a bit. Do some topics lead to greater collective assessor ambivalence than others? What might some of the factors be?

    2. It also makes sense to me that the mean rank of the documents in the intersection (which are the same as “unanimously judged relevant documents”) is lower than the mean rank of the documents judged as relevant by only one assessor, with few exceptions. However, I am a little curious as to the variance values of these figures, which I did not see reported. Are there some with very high rank and then some with very low rank (could they be bimodal distributions)? Based on the individual assessment ranks, perhaps there are some with bimodal distributions – at least, there are some where one researcher’s qrels had a much lower average rank than those of his or her counterpart, with some even below the unanimous relevant set. What might be some other explanations for the patterns in the ranking of such documents?

    3. This examination uses judgments from TREC and UWaterloo analysts. How might the results compare to those of a similar study performed on independent samples of individuals employed by Google to assess relevance, using their Guide for Relevance Assessment? What about individuals employed through MTurk? For the latter, we might assume that agreement is lower, because users may incorporate more biases than trained analysts at TREC or UWaterloo. Would unions and intersections behave in the same manner? Why or why not?

    One more question (a repeat): How would results change using interval, rather than binary, relevance assessments?

  9. 1. This paper proposes the utilization of qrels which give importance to varied query aspects. Are we assuming that all queries are capable of polyrepresentation? And if so, wouldn't the task of mapping these multiple query representations to a single cohesive document ranking get tedious? We will still have to deal with an incomplete set of relevance judgments, so how can we justify the statement that the qrels are unbiased? Also, more importantly, how does an unjudged document affect the computation of the score?

    2. I am unclear about the use of Kendall's correlation, which is used for ranking objects. Many times in IR it is more important to highlight the discrepancies among higher-ranking documents than among lower-ranking documents. However, Kendall's statistic penalizes both equally without accounting for these ranking distinctions. Is this acceptable? What absolute value of Kendall's coefficient lets us conclude that two rankings are in fact equivalent? How does Kendall's coefficient do justice when the application concerns determining the difficulty of the query?

    3. I'm still confused about the concept of relevance feedback. Can we classify all feedback as either manual or automated? Given that a lot of tasks in IR require automation as well as user perspective, doesn't it make more sense to use a hybrid methodology? Determination of query quality and its assessment is what drives this relevance feedback, so what is the principle that the system uses? For instance, does the system reweigh the query and re-execute the search? Or does it provide the user with a bunch of words to choose from to augment the original query? Also, if a user retypes a query already used in the past, is there a mechanism that enables feedback systems to take into consideration the documents the user has already looked through and change the order of relevance of documents accordingly?
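To make the point about Kendall's statistic concrete, here is a minimal sketch of the tau-a form (concordant minus discordant pairs, divided by all pairs) applied to system rankings. The MAP values are invented for illustration, and the paper's exact variant of the coefficient is not assumed here.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1   # both qrels order this pair of systems the same way
        elif product < 0:
            discordant += 1   # the two qrels disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical MAP scores of four systems under three different qrels.
map_qrels_1 = [0.30, 0.25, 0.20, 0.10]
map_qrels_2 = [0.28, 0.29, 0.15, 0.05]  # the top two systems swap
map_qrels_3 = [0.30, 0.25, 0.05, 0.08]  # the bottom two systems swap

tau_top = kendall_tau(map_qrels_1, map_qrels_2)
tau_bottom = kendall_tau(map_qrels_1, map_qrels_3)
print(tau_top, tau_bottom)
```

In this toy case a swap at the top of the ranking and a swap at the bottom yield the same coefficient, which is the position-blindness raised in point 2.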

  10. 1. When discussing the second experiment involving the TREC-6 data, the author mentions that they ended up discarding topics with fewer than five relevant documents. The justification the author provides is the instability of the mean average precision measure when very few relevant documents are present. As an example, the author points to a TREC-4 system, gmu2. With only one relevant document, the topic’s average precision can be halved if the document is in location 2 instead of location 1. However, I don’t see how the author can say her hypothesis holds when extreme cases are removed. Otherwise, other studies could remove topics and systems until the data seemed to support their own conclusions. Is there a different measure that can account for the number of relevant documents and their positioning?

    2. The results of the author’s study seem to indicate that rankings between systems remain relatively constant despite changing the source of the relevance judgments. When determining a system’s rank, an average across all topics is calculated. As the author points out, using the average of averages measure will hide the performance of a system on individual topics. In class, we compared and contrasted the GMAP measure with MAP. In this paper, MAP is used to determine a system’s ranking. Given that masking topic performance is one of three criticisms mentioned, couldn’t the author have used GMAP instead? The first paper we read introducing GMAP did say the research standard is still MAP, but there was no evidence as to which measure provides the best rankings.

    3. In section three, the author plots a graph depicting difference in mean average precision on the x axis and the probability of a swap on the y axis. The author notes that points away from the origin are interesting cases. Since the author goes on to explain a handful of different plotted points, are there more conclusions that can be drawn outside of the obvious meaning? Do these systems swap often because they are considered relatively close in performance? Are some systems designed to handle lots of relevant documents better than a few? Since the study seems to indicate that system rankings remain the same despite differing relevance judgments, are there further research questions such as the above to be explored that give insight into the systems that behave differently?
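The two measurement points raised in this comment can be checked with a small sketch: average precision for a topic with a single relevant document, and MAP versus GMAP on the same per-topic scores. The per-topic values and the `eps` floor inside the geometric mean are assumptions made for illustration, not figures from the paper.

```python
import math


def average_precision(ranked_relevance, num_relevant):
    """Average precision of one ranked list; ranked_relevance holds booleans in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


def mean_ap(aps):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps, eps=1e-5):
    """Geometric mean of per-topic average precision; eps is an assumed floor to avoid log(0)."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))


# One relevant document: moving it from rank 1 to rank 2 halves average precision.
ap_rank1 = average_precision([True, False, False], num_relevant=1)
ap_rank2 = average_precision([False, True, False], num_relevant=1)

# Hypothetical per-topic scores: system B is strong on two topics and fails on one.
system_a = [0.5, 0.5, 0.5]
system_b = [0.9, 0.9, 0.03]
print(mean_ap(system_a), mean_ap(system_b))
print(geometric_mean_ap(system_a), geometric_mean_ap(system_b))
```

On these toy numbers MAP prefers system B while GMAP prefers system A, which is the kind of masking of a weak topic that point 2 asks about.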

  11. In order to show that the outcome of the experiments was robust against changes in the group of assessors, Voorhees used NIST assessors, who have a similar background. I think this raises a validity issue, or rests on an unfounded presumption that evaluation outcomes are not going to be affected when using different judge populations or different judging guidelines. The study would be more convincing if the assessors came from different backgrounds.

    In Section 3.1.2, Voorhees drew the conclusion that “These results demonstrate that the stability of the recall-based rankings is comparable to the stability of the mean average precisions rankings”. I am wondering how generalizable the statement is, considering she used a selected number of topics from TREC-4?

    In Fig. 4, it looks like automatic systems are more stable than manual systems. My first question is how applicable this statement is. For which specific topics does this statement hold true, or does it always hold true? If it always holds true, why not always use automatic systems to evaluate the relevance? Are there any research challenges which stop it?

  12. In utilizing secondary assessors in TREC-4 relevance assessments, Voorhees mentions that some assessors judged more topics than other judges. Given the multiple consequences involved in judging a large number of topics, does the workload have any bearing on the quality of relevance judgments? This might account for the 7% difference in overlap shown in Table 1 between the primary assessor and the others.

    In dealing with the TREC-6 assessments, the graded judgments put forth by the Waterloo assessors were changed to straight binary judgments. What kind of differences in relevance occur when, in analysis, graded relevance judgments are forced into binary situations?

    In dealing with the Waterloo/NIST comparisons, Voorhees states that both groups “produce essentially the same comparative evaluation results.” Aside from the change of graded judgments to binary, several other factors were different in the judgments made by Waterloo. How do the researchers and those doing analysis on this topic account for these differing factors in their study?

  13. It is mentioned that there were systems whose difference in mean average precision (for different q-rels) was greater than 5%. Does this mean such retrievals are accommodating more diversity? How does this study and findings apply to the diversity of search results? Does increasing diversity of search results mean that the mean precision value over all the q-rels decrease?

    Why is it that the primary assessor judges more documents to be relevant than the secondary assessors? It is unclear whether the documents are presented to the three assessors in the same order. It is understandable that there could be multiple relevant document sets. However, presenting them in different orders could bring in more differences in the inter-assessor relevance judgements.

    The number of different q-rels that were considered for testing each of the individual runs was around 100,000 (and the additional three). Given that the total number of q-rels possible is 3^48 (or 3^46), the sampled set of q-rels is thus about 1/10^18 of the total number of possible q-rels. Do you think this ratio gives a good representative set? With two similarly trained assessors the Kendall tau values shown were close to 0.5 (which means that one agrees with two-thirds of the other’s relevance judgements). At this rate, if there are 10 assessors, will the mean average precision as computed by the union and intersection q-rels still fall within the range of values observed in the chosen sample? (Can we extrapolate the current results to slightly bigger scenarios?)
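The ratio mentioned above can be checked with a short calculation. It assumes, as in the comment, three candidate judgment sets per topic over 48 topics and a sample of 100,000 qrels plus the three distinguished ones.

```python
# Assumption from the comment: 3 choices of assessor per topic, 48 topics.
total_qrels = 3 ** 48

# 100,000 random qrels plus the three distinguished ones.
sampled = 100_000 + 3

fraction = sampled / total_qrels
print(f"total qrels: {total_qrels:.3e}")
print(f"sampled fraction: {fraction:.3e}")
```

The sampled fraction comes out on the order of 10^-18, consistent with the "1/10^18" figure in the comment.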

  14. 1. On page 5, the author mentions that 65% precision and 65% recall is the level at which humans agree with each other. What is the significance of 65%? I tried looking it up and everything points to Voorhees’ work, but I could not find any reasoning behind it. It would be great if we could discuss the reasoning behind 65%. Also, if that is the magic number for precision and recall, should we be content with .65 precision and recall instead of focusing on improving the values?

    2. Through Figure 7, the author elucidates how unanimously judged relevant documents had lower ranks than the average. Is unanimous relevance not biased by assessor behavior? A spam/deceptive document might be unanimously judged as relevant if the assessors do not have enough expertise in the topic. Is it possible to make sure that the assessors come from as many disparate fields as possible, covering most of the topics?

    3. Why are we ignoring the queries for which there are few relevant documents? It is understandable that the evaluation measures are unstable when there are very few relevant documents. But the main objective of an information retrieval system is to improve precision and recall even if the number of relevant documents is really low. If we ignore them in our study of relevance and correlation, then are we not defeating our own purpose?

  15. 1. Voorhees mentions that in the TREC submissions, the participants were allowed to choose which runs they wanted to submit if resources limited run assessments. Wouldn't participants choose what they felt were their best runs and thus remove the results further from a natural search?

    2. Given that judges were allowed to put documents into an "iffy" category of relevance, doesn't forcing "iffy" documents into the nonrelevant category marginalize the judgments of the assessors? If the judges felt there was some relevance in the document, marking it as nonrelevant doesn't truly represent their judgments.

    3. Voorhees mentions that smaller collections have less stable results when tested. Does having a larger collection size actually produce more stable results, or does too large a collection simply downplay any errors that occur during the search process?

  16. The example given by the author on page 5, " the primary assessor judged 133 documents as relevant for Topic 219, and yet no document was unanimously judged relevant. One secondary assessor judged 78 of the 133 irrelevant, and the other judged all 133 irrelevant (though judged one other document relevant) ", indicates how even judges who are similar in so many respects, such as expertise in a field, might disagree completely on something. Doesn't this indicate that total reliance on manual judgment might be risky?

    The author mentions that "If a particular system gets a relatively high score with a particular qrels, then it is very likely that the other systems will also get a relatively high score with that qrels." But does the inverse hold true? That is, if a system doesn't perform well on a specific qrels where other systems have a relatively high score, does that imply that the system is behaving incorrectly, or might it just be that the system is different from the other systems?

    The author mentions at one point "The stability of the rankings is due in large part to the fact that the rankings are based on average behavior over a sufficient number of topics." and at another point she mentions that "As few as 25 topics can be used to compare the relative effectiveness of different retrieval systems with great confidence." Are these not contradictory statements? I mean, can 25 topics really establish the level of confidence needed to understand the behaviour of a system? Also, when can an evaluator know that everything intended has been evaluated, and that the evaluation can stop?

  17. 1. It appears reasonable to hypothesize that primary judges have better topic grounding and hence make more relevance judgments, but that information is lost when writing topic descriptions. If that were the case, I would have expected a higher overlap for A&B, which is not so. Hence, it would have been useful to understand the impact of topic articulation on the judges.

    2. Relevant documents which were not judged by the A&B assessors were added when computing metrics with 100% agreement. Does this not inflate results?

    3. Reliance on MAP to validate claims in the analysis section does not intuitively extend to other measures such as GMAP, esp. reasoning about averaging topic-to-topic variation.

  18. 1) When doing the analysis of relevance judgments, Voorhees describes that for topics with more than the 200 relevant documents limit, all the relevant documents were in the pools of the secondary assessors but no additional nonrelevant documents were added. I was wondering if this could have influenced the result since it is notorious when more relevant documents exist in the pool.

    2) I am not understanding Fig. 7. How is it possible to have a greater unanimous relevant document count than the Original, secondary A & B qrels?

    3) Can you please elaborate more about the probability of a swap and how Voorhees uses it?

  19. This comment has been removed by the author.

  20. 1. On page 701, it says that the documents judged relevant by the primary assessor but not included would be added as relevant documents to the secondary assessor’s judgments for analysis. It is clear that such documents were never judged by the secondary assessors. Is it reasonable to add them into the pool directly with a relevant tag?
    2. Section 3.2 mentions that the TREC-4 assessors were the topic authors and had similar backgrounds, which resulted in the overlap. However, in the UW case, though their assessors were different from the NIST experts, there is no information about their background. As mentioned in the course, these participants might share something very similar that affects their judgments. In such circumstances, does the TREC-6 comparison make sense?
    3. In Section 4.1, the number of topics per system is 50. However, this value was implied from TREC-6. Considering the different features of different data collections, is it sound to make this number a criterion for all cases?

  21. 1. Why “this is the level at which humans agree with one another”?(p.701) Is there any evidence or support for this claim?
    2. It is mentioned that the MAP did not perform well here. (p.707) Does it mean that MAP has some weakness or limitation?
    3. The conversion of “iffy” to non-relevant is mentioned in the paper (p.708). Is it reasonable? What is the definition of “iffy”? Is it possible that some documents could finally be judged as relevant with more detailed investigation? If so, what is the impact on the experiments?

  22. 1) One issue I had while reading the early portion of the paper was that all test collections were judged by various sets of experts. In section 3.2, the author addresses this by citing a study where one of the groups is outside NIST. However, the 3rd party was given fewer total documents and a higher percentage of relevant documents. Doesn’t this reduce the validity of the claims made by this study? Were there no other studies where the set of documents was constant between the groups?

    2) It is interesting that for larger topic sets the Kendall’s tau values are so much higher than in case where there are fewer topics. What might be the reason for this? Do Kendall’s tau values favor high correlations significantly more than low correlations?

    3) The Carterette and Soboroff paper for this week focuses on collections with large numbers of topics and relatively few judgments. Meanwhile, Voorhees concludes that in the TREC-6 environment, as long as the number of topics is large enough, there is a high correlation (as indicated by Kendall’s tau) of the evaluation results for different groups of assessors. Essentially, the disparity in the actual judgments between groups does not matter, if the topic set is large enough. To reiterate, the Carterette and Soboroff paper focuses on a collection with a relatively large number of topics. Does this suggest that the types of relevance assessors (and their effectiveness at determining relevance) do not matter at all in the evaluation scenario addressed by Carterette and Soboroff?

  23. 1. In the TREC-4 relevance assessments, “the author of a topic was the primary assessor for it”. If the author is asked to assess then it may not reflect the user’s perspective towards that topic. It may result in opinionated and biased assessment. How was the judgment considered valid? Would the judgment not be streamlined along the author’s thoughts and not give room for diverse comparative thoughts on the topic?

    2. Lesk and Salton have provided analysis showing that averaging results masks individual topic differences and may not represent the actual behavior of any one query. It rather illustrates the average performance. How can this be useful when judging the position or rank of the results of a query? The results of an average function may depict an overview of good and bad relevance judgments, but how precisely can they depict the position or ranking of the documents/results?

    3. In this paper Voorhees also mentions assessing a document/topic using multiple judgments, but doesn’t discuss the methods for choosing the topics for multiple judgments or the ways of combining the judgments. How can one obtain valid conclusions when the judgments made by different sources are disparate and wide-ranging?

  24. Voorhees
    1. In this article the author states that one of the problems from one of their studies was that one of the assessors did not tag any documents as relevant that the prime assessor did in one of the cases. Is it possible that this user was exhibiting the tired user model that Kazai et al. were explaining because he was exhausted from judging all day?

    2. In this article the author consistently states that it has been proven that improvements created using test collections have been beneficial to real world purposes. Is there any evidence to show that improvements made to the effectiveness of relevance could lead to the type of improvements that the author is talking about?

    3. In this article the author consistently states that the results of her experiments show that there is stability across the relevance judgments of several judges. However, is proving stability important? Isn’t it important to have a little variation in the results of relevance, because that is how real users act?

  25. 1. My first question is about 4.1 Effects of averaging. This section expands on Lesk and Salton’s first reason for the stability of system ranking - “Evaluation results are reported as averages over many topics”. Here I want to question whether averaging is a good way to evaluate system rankings. As noted in 4.1, averaging does mask individual topic differences. For users of search engines, it is hard to tell what the so-called “average” is. It is unlikely that each individual user would try to search for every topic and then get an overall impression of the search engine. It is usually the poor results that disappoint the users. My question is, is there a better way to represent the system evaluation without losing stability?

    2. My second question is about 4.2 Ranks of unanimous relevant documents. One interesting point they found is that there usually exists a set of documents that are unanimously judged relevant. According to their experiments and statistical information, this seems to be a universal situation. I think one main reason is that there is a certain overlap among existing relevance judgment methods. Can we figure out what they are? This may help us to better understand relevance judgment as well as improve the existing methods.

    3. My third question is about Section 3.2 TREC-6 relevance assessments. It talks about the method that was used by the University of Waterloo, whose system is ranked first using the NIST qrels. One method the Waterloo assessors used is a three-point relevance scale - relevant, not relevant and ‘iffy’. This seems to incorporate more relevance information into the judgment process by enlarging the relevance scale. Can we make the relevance evaluation better by enlarging the scale? Say we have a four-point scale: very relevant, relevant, maybe relevant, not relevant.