Saturday, August 24, 2013

Voorhees 2001

Ellen Voorhees. 2001. The Philosophy of Information Retrieval Evaluation.


  1. 1. The first question concerns Section 2, Completeness of Relevance Judgments. It says “Assuming a judgment rate of one document per 30 seconds”. Based on my previous experience, “30 seconds” sounded too long to me (my guess is that the 30 seconds covers a human reading the document and judging its relevance manually), which made me curious about how relevance judgments are made. So my first question is: can you list the existing mainstream judgment methods and compare them with one another?

    2. Section 2.1, Building Large Test Collections, says TREC has almost always used binary relevance judgments: either a document is relevant to the topic or it is not. Instead of treating documents simply as relevant or irrelevant, in most cases we want to know which document is more relevant, or which is the most relevant. So why does TREC still prefer binary relevance judgments to a more complex ranking mechanism?

    3. I was a little confused when reading Section 4, Cross-Language Test Collections. Is the basic idea of a cross-language test collection to incorporate documents in different languages into one big collection? Or does each document have translated versions in different languages? My understanding is that it is the former, but is it necessary to do that? Why can’t we just split the collection into several sub-collections and treat each of them as a monolingual collection? What is the benefit of combining them all?
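
To put the judgment rate quoted in question 1 in perspective, here is a rough back-of-the-envelope sketch. The 30-second rate is the figure quoted from Section 2; the 800,000-document collection size is an assumed round number used only for illustration.

```python
SECONDS_PER_JUDGMENT = 30   # rate quoted from Section 2 of the paper
COLLECTION_SIZE = 800_000   # assumed collection size, for illustration only


def exhaustive_judging_days(num_docs, seconds_per_doc):
    """Days of non-stop judging needed to assess every document for one topic."""
    total_seconds = num_docs * seconds_per_doc
    return total_seconds / (60 * 60 * 24)


days = exhaustive_judging_days(COLLECTION_SIZE, SECONDS_PER_JUDGMENT)
print(f"{days:.1f} days per topic")  # prints "277.8 days per topic"
```

Even halving the time per judgment would leave several months of round-the-clock work per topic under these assumptions, which is why exhaustive judging is impractical whatever the exact rate.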

  2. This comment has been removed by a blog administrator.

  3. This comment has been removed by a blog administrator.

  4. In discussing the effect of inconsistency, the author notes that the mean average precision score changes depending on the qrels, citing the result in Figure 3 that the difference between the minimum and maximum mean average precision values is greater than .05 for most systems. However, is a difference of .05 really significant here? Did the author run any significance test to support that? Is it possible that the differences for most systems are due to chance? Or, if the author did run a significance test, could the data in this experiment meet the assumptions of the test employed? For instance, the t-test assumes that the paired differences are approximately normally distributed, and the Wilcoxon signed-rank test assumes that they are symmetrically distributed.

    Having read this paper, I think testing under the Cranfield paradigm is very controlled, so I also wonder whether the results of this kind of system evaluation carry over to real settings. To establish the stability of comparative evaluation, the author shows through experiments that inconsistency and incompleteness may not affect its results. However, I think a retrieval system selected through this comparative test is less likely to meet users’ needs if the reliability of the relevance judgments is not considered. Is it possible that some documents desired by actual users would be judged less relevant by assessors? Perhaps a retrieval system that performs well in this comparative evaluation is not what actual users want, but what assessors want.

    Which is better, user-based evaluation or system evaluation? The author considers user-based evaluation extremely expensive and difficult to do correctly. Nevertheless, reading this paper, we find that system evaluation is also expensive and time-consuming if it is to produce reliable relevance judgments, and moreover its results may not transfer to operational settings. So which is actually better?
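
For readers who want to see how the mean average precision score discussed in this comment moves when the qrels change, here is a minimal sketch. It uses the standard definition of (non-interpolated) average precision; the run and both qrels sets are hypothetical toy data, not values from the paper.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """run: topic -> ranked doc ids; qrels: topic -> set of relevant doc ids."""
    scores = [average_precision(run[topic], qrels[topic]) for topic in qrels]
    return sum(scores) / len(scores)


# Hypothetical toy data: one system, two topics, two assessors' qrels.
run = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6", "d7", "d8"]}
qrels_a = {"t1": {"d1", "d3"}, "t2": {"d6"}}
qrels_b = {"t1": {"d1"}, "t2": {"d6", "d8"}}

map_a = mean_average_precision(run, qrels_a)
map_b = mean_average_precision(run, qrels_b)
print(round(map_a, 3), round(map_b, 3))  # prints "0.667 0.75"
```

The same ranked output receives a different absolute score under each assessor's judgments, which is the effect Figure 3 reports. Whether a given difference is statistically significant is a separate question that this sketch does not address.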

  5. 1. At the beginning of the article, Voorhees states that system evaluation is often used in place of user evaluation because of the cost of doing user studies “correctly” (p. 1). Do all the variables have to be controlled in a user test in order to get some workable improvements? What risks could there be in doing a small, less-controlled user test?

    2. In several places (p. 3, p. 15), Voorhees states that scores from one TREC test cannot be compared to scores from another year or another topic. If this is the case, is it possible to state that search algorithms and systems are improving? Is there any way to see year-to-year trends?

    3. Are you convinced by the study presented here, which showed that there was not a significant difference between using one assessor and a group of assessors? Would it depend on the quality of the assessors (i.e. experts in the field instead of those unfamiliar with the research) or the size of the collection? How would you determine if you had an adequate sample size of documents to negate the effect of a particular assessor?

  6. 1. As mentioned in this paper, it is important that a document set “reflects the diversity of the subject matter for that topic.” Is there a way to ensure that the selected pool accounts for this diversity? Do you agree that TREC test collections, which select a set of documents that allow “a wide range of queries”, represent a good pool? Or should the evaluation design focus on creating a non-biased pool versus a diverse one?

    2. Apart from causing a certain level of bias, doesn’t Zobel’s suggestion for judging a collection based on the number of associated documents limit the potential of an IR system and lead to inconsistencies? In this regard, is the number of documents classified under a topic reflective of its relevance, and in turn its importance with respect to other topics that need evaluation?

    3. The study indicates that different relevance judgments, based on multiple relevance assessors, cause inconsistencies. But don’t you think they in fact provide a more accurate evaluation, since they reflect multiple perspectives and are possibly more representative of the real user base than a single assessor or a very small assessor group?

  7. 1. In Figure 1 as well as Figure 2, it can be observed that manual groups performed very well in returning relevant documents compared to the other groups. Since there were just 50 topics, is it not possible that the manual groups tweaked their runs to retrieve more relevant documents only for the 50 topics in question? How indicative is the result of the general effectiveness of manual groups?

    2. In the section "Differences in Relevance Judgements", the author describes the basic reasoning behind using three different assessors for the same topic, with 200 random and 200 relevant documents (as assessed by the primary assessor) included in the pool for the other assessors. Given that this assessment resulted in a mean overlap of just .301 (Table 1), does this not indicate the importance of context-based assessment? The three assessors might have assessed the documents based on disjoint contexts. How does this result bear on contextual assessment?

    3. The author has not discussed Cross-Language Test Collections in detail. In the case of Cross-Language Assessment, how did the assessors tackle the problem of common keywords/tags across languages? If the same document existed in different languages, how would the system/assessor deal with it?
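
The overlap figure cited in question 2 is easier to interpret with a small worked example. The sketch below assumes overlap is measured as the size of the intersection of the assessors' relevant sets divided by the size of their union; the three relevant sets are hypothetical toy data.

```python
def overlap(*relevant_sets):
    """Size of the intersection divided by the size of the union."""
    union = set().union(*relevant_sets)
    if not union:
        return 0.0
    intersection = set.intersection(*map(set, relevant_sets))
    return len(intersection) / len(union)


# Hypothetical relevant sets from a primary and two secondary assessors.
primary = {"d1", "d2", "d3", "d4"}
assessor_a = {"d2", "d3", "d5"}
assessor_b = {"d3", "d4", "d6"}

pairwise = overlap(primary, assessor_a)             # 2 shared of 5 -> 0.4
three_way = overlap(primary, assessor_a, assessor_b)  # 1 shared of 6
print(pairwise, round(three_way, 3))  # prints "0.4 0.167"
```

Because the union grows with every additional assessor while the intersection can only shrink, a three-way overlap will normally be lower than any of the pairwise values, so a low three-way figure does not by itself show that the assessors worked from disjoint contexts.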

  8. 1. As mentioned but not discussed in detail in the paper, relevance in Cranfield is defined as a binary choice. Intuitively, documents differ in the extent of their relevance to a given topic, so what is the justification for using a binary choice for relevance, and what is the effect of that choice?

    2. The definitions of automatic and manual runs are given in the paper, but they are not 100% clear to me. What are the differences between these two methods, and why do manual runs have a larger influence on the incompleteness issue of Cranfield tests?

    3. In the experiments that test the effect of inconsistency, the researcher used 3 independent judgements for 49 topics. Is that enough? Is the high correlation between different qrels simply due to the fact that there is a large overlap (at least 1/3) between them? And even if this is true, does the consistency of the results across qrels validate the Cranfield method? An alternative explanation is that even if Cranfield returns consistent results under different qrels, they might all be wrong. Are there external methods that can verify the results of Cranfield evaluations?

  9. 1. All three papers claim “relevance judgments” as one of the biggest costs of evaluating an information retrieval system. This paper goes on to give a numerical perspective of the amount of time it would take to gather relevance judgments for a large sample size. If you are running an experiment every year, couldn’t you hold the topics constant and thus over time have more relevance judgments for each topic? Using the TREC experiments as an example, the paper does make it sound like each potential TREC topic has one person that performs the relevance assessment. Therefore, I could see validity questions arise in regards to having different people give relevance judgments for a topic. In addition, past relevance judgments might not apply for long if there has been a fundamental change to the topic being evaluated. However, if these concerns don’t apply wouldn’t it be more robust to eventually have complete relevance judgments instead of using pooling?

    2. When talking about building a test collection, the author mentions it is important to have a set of texts that reflect the type of documents the topic naturally lends itself to. At the same time, the experiments depicted seem to focus on different types of word documents. Have any of these experiments been performed when there is a mix of media such as photographs, audio recording, and video recordings that would all be applicable to a topic? How would one make a relevance judgment on these types of media? Is it possible that one search algorithm is more effective for a single type of media and another more effective for mixed media results? The end of the paper mentions cross-language test collections that require multiple judges for the different languages. Could a mixed media test collection have the same drawback, where a different judge is needed for different media types?

    3. The paper devotes a section to addressing one of the biggest concerns with this style of evaluation: relevance judgments are inherently subjective. The paper concludes that this is actually a red herring since experiments have shown that changing the relevance judgments has no real impact on the results. Although this subjective nature ends up not affecting the experiment in isolation, it still leads to a significant drawback. As the author mentions earlier, a system’s performance for the evaluation techniques depicted can only be compared with another system’s results that used the exact same test collection and relevance judgments. How can one look at the trend of information retrieval systems over time if the main source of evaluations cannot be compared with one another? For instance, if the same ten information retrieval algorithms are compared using the TREC format but the test collection and relevance judgments changed every year, can any conclusions be drawn from the ranking of the algorithms? It also does not seem like it would be possible to look at the effectiveness trend of one algorithm over the years, which I think could have made an interesting study.

  10. 1) When discussing incompleteness of judgments, Voorhees mentions Zobel's claim that the TREC collections can be used to compare retrieval methods because of the lack of bias against unjudged runs in the collections. However, is it possible to introduce bias against unjudged runs?

    2) When discussing incompleteness of judgments, Voorhees points out that pooling with diverse pools is a good approximation for unbiased judgment. Moreover, Voorhees also notes that Zobel's and Cormack's suggestions for finding more relevant documents would result in a larger but biased set. Yet throughout the paper, it is noted that there are topics with larger judgment sets. How can topics with smaller sets be increased in order to achieve more diverse pooling? If one attempts to increase their size, it would be similar to Zobel's and Cormack's suggestions, which would add bias to the set.

    3) When discussing assessor agreement in relevance judgment, Voorhees notes that secondary assessors judge a pool of documents (200 deemed relevant by the primary assessor and 200 not-relevant documents). Is there an explanation as to why the relevance judgments deviate so much? It is reasonable for the judgments to differ, but shouldn't the overlap be greater, especially since the not-relevant documents were randomly chosen?

  11. 1. What is an evaluation conference? Are there IR-specific conferences that
    are not evaluation conferences? If so, what's the difference between the two?

    2. The first assumption is that "relevance can be approximated by topical
    similarity", and the paper lists three implications of this assumption:

    - all relevant documents are equally desirable
    - the relevance of one document is independent of any other
    - the user information need is static

    How are these actually linked? In other words, if relevance couldn't be
    approximated by topical similarity, then why would it be a problem if, say,
    some relevant documents were more desirable than others?

    3. What has led the IR community to use binary relevance judgements over the
    graded scale originally used in the Cranfield experiments?

  12. 1. A key constraint in the TREC studies is that system evaluations done from one year are not comparable to the next due to difference in document sets, users, assessors, topics, and queries. If the study goal is to look at unguided, imprecise search then clearly the results are not comparable, but if the queries are similar to those in a straightforward Question/Answer setting, perhaps noise from relevant document sets is much smaller. How much of the variance in assessor document relevance can be explained by factors relating to the topic or query of interest?

    2. Voorhees mentions that NIST takes care to choose a document set consisting of “the kinds of texts that will be encountered in the operational setting of interest.” While selecting a representative sample for the setting maybe makes sense for some very constrained settings, how can sampling be done when the setting is everything (i.e. the whole internet)? Is it even possible to build representative samples given very diverse document collections?

    3. The Voorhees article discusses a concern at the number of ‘unique relevant documents’ the different system groups returned in the TREC studies, but ultimately dismisses the concern as not having a major bearing on system performance. While it does seem possible that each IR system could have chosen a unique but ultimately relevant set of documents, isn’t it likely that many of these document sets were better than others? How could changes to relevance judgments help researchers better understand large numbers of ‘unique relevant documents’?

  13. 1. Section 3.1 of this article discusses “Assessor Agreement” and explains that when different assessments produce partially alike sets of relevant documents, the assessments are said to be in agreement. But, they are in disagreement if these sets do not show an overlap of at least 50%. How are these agreement or disagreement levels used to determine which assessment is more successful?

    2. Is the fact that there is typically only one assessor making judgments for a topic (with the exception of cross-language test collections) a product of financial and practical constraints within studies? Voorhees claims that while the one-assessor system's “judgment set represents an internally consistent sample of judgments,” there is still the issue that “this assessor's judgments may differ from another assessor's judgments” (p. 13). Wouldn't it remove more bias and help increase the representation of the user population to have at least two assessors working to create any given judgment regarding a topic?

    3. Why have more recent experiments that compare different IR methods opted for a binary system as opposed to Cranfield's “five-point relevance scale” (p. 2), which Voorhees describes in this article? It seems as though viewing “relevance [as] a binary choice” would bring about a higher number of false positives among the relevant documents retrieved by a search engine, and lead to lower levels of precision.

  14. Section 2.1 says the document sets used in a test collection should be texts “that will be encountered in the operational setting of interest.” The example used is movie reviews having no relevance to a medical library setting. Following that train of thought, how would a document set function in a setting that draws on multiple collections across a wide range of subjects, as humanities or social science settings might?

    In Section 3, the primary assessor (the author of a topic) selected a sample of at most 200 documents that they viewed as relevant. However, once two more assessors look over the sample, the idea of what constitutes relevance shifts away from the primary assessor's. How, then, can the new reviewers judge the documents by the same criteria as the primary assessor, who as the author of the topic understands its relevance much better than the other assessors do?

    Because of the “noise” involved in evaluating retrieval systems it seems as though there will never be one true system which works the best in every situation. Would the only solution be to simply utilize multiple systems on a collection and allow the user to craft a sort of Frankenstein's monster in relation to their own specific queries or needs when dealing with a collection or document set?

  15. This comment has been removed by the author.

  16. In the Cranfield experiments, relevance judgements were made by domain experts. I am wondering how the experiments can control biased opinions and inconsistency from those experts as failure to do so would inevitably introduce confounding variables and invalidate the experiment results.

    NIST forms pools based on the participants' submissions, which seems an effective way of getting around the impossibility of complete relevance judgements. However, the list is not complete, so for a new retrieval algorithm that did not contribute documents to the original pool, will its effectiveness be underestimated (even if not significantly)?

    The author mentions that in user-based evaluation, each subject must be equally well trained and care must be taken to account for the learning effect. From previous experience, we have been told to use randomization as a way to mitigate the learning effect. Are there any other effective solutions (with examples)?

  17. In the Cranfield experiments, the experts on the subject matter of the test collection were asked to judge the relevance of documents. Even though these people were seen as experts, couldn't documents not seen as relevant to an expert in the field be relevant to someone with no previous expertise? Do they account for the bias of the expert in his judging of relevance?

    It seems like even though the Cranfield experiments used a 5 point relevance scale, more recent experiments have favored the binary scale. Wouldn't it be more representative of a real user session to see documents in a variety of relevance levels instead of strictly relevant or not relevant?

    The article mentions the difficulty of assessing cross-language collections. Similar to how the Cranfield experiments used experts in the fields of the documents, could researchers not bring in an expert in the languages they are using in their test collections to judge the documents?

  18. 1. In this paper, the effectiveness of a strategy is computed as a function of the ranks of relevant documents. Thus, high precision is what drives the effectiveness metric, and it is promoted at the expense of recall. So I am inclined to believe that a strategy of this nature will by default favor returning a greater number of shorter documents over verbose documents, since the indexing mechanism is structural rather than semantic. Wouldn't we be incorporating a bias in this case? Also, does this effectiveness metric guarantee comprehensiveness, or is this emphasis on high precision at the cost of recall justified, and by what factor?

    2. 'The judgement pools are sorted by a document identifier which results in the assessors getting no intuition on the specifics of every document as to whether it has been highly ranked by a system or by many systems'. I wonder if this implementation rests too strongly on the assurance that pooling as a mechanism would not be subject to bias. For instance, if a judgement pool is required to choose between a document which combines the functionality of 2 or more systems that have already been earmarked by the judgement pool as 'relevant' and another document which proposes a completely novel approach but does bear topical relevance, I am curious to know if the judgement pool would be able to make a completely unbiased assessment in such a case. Also, is averaging the pool size as stated in the paper a safe strategy? Wouldn't we be limiting judgements on certain queries through this proposition?

    3. What came across as an obvious shortcoming to me, in addition to the issues that the author has cited pertaining to Cross-Language Test Collections, is how we would be able to balance the task of ensuring high-precision queries while also generating highly relevant translated documents. Wouldn't curtailing the query length make the translation process cumbersome and ambiguous? So maybe the introduction of something along the lines of an interactive query tool, which would allow the user to expand his/her views, may be an appropriate way forward, as this would provide more insight and attempt to alleviate the ambiguity. However, wouldn't this method be labour intensive from the user's perspective? Additionally, how do Cross-Language Information Retrieval tools attempt to generate documents of relevance when the same word has a different connotation or a varied contextual reference in another language?

  19. 1. The 3 assumptions of the Cranfield paradigm are not generally true, which makes laboratory evaluation of retrieval systems a noisy process. How does such noise arise, and what are its impacts? It is important to know how the noise is introduced. Usually, some noise can be eliminated by careful design, but some cannot. So which sources of noise have actually been reduced?

    2. Refer to the last sentence of the last paragraph of Section 1. From a statistical perspective, we hold that if a subset sampled from a complete set is representative enough, the subset can be treated as a valid stand-in for the complete set to some degree. However, the statement here, “It is also invalid to compare the score obtained over a subset of topics in a ….”, differs from what we usually hold. Although getting a good sample is difficult, in theory such a statement is inaccurate. What is your comment?

    3. In Section 3.1, it is said “Documents that the primary assessor judged relevant but that were not included in the secondary pool … were added as relevant documents to the secondary assessors’ judgments for the analysis”. Here, it seems to be assumed that A and B would accept such documents as relevant directly. Why? Is there any impact on the final result?

  20. 1. Under the Cranfield paradigm, using a test collection of this kind for comparative evaluation is validated. However, if we were to move away from the Cranfield paradigm by breaking the assumption of independence in relevance judgments, the incompleteness of the test collection may no longer be acceptable, especially if we wish to penalize missing documents. Are there other approaches to building test collections?

    2. The paper makes a convincing argument that pooling enables comparative evaluation of retrieval methods; what is unclear is the effect of the size of the document pool. If the pool is large enough, there is a high probability that the manual and automatic pools will have a large overlap, but how important is this to evaluation?

    3. One of the key findings of this paper is the ranking robustness to “very” noisy relevance – this finding brings some form of equivalence to graded and binary judgments (graded being the noisy version) -- I am not fully convinced of this.

  21. In ‘building large test collections’, the methodology by which a document is marked relevant or non-relevant is described. It says: “If the assessors would use any information contained in the document in the report, the entire document is marked to be relevant”. If this is followed and the data set is obtained by pooling, then some documents that are relevant (according to the above definition) are certain to be missed because they were not pooled. A point to think about in such scenarios is how the evaluation measure differentiates between relevant and supposedly ‘not relevant’ documents, and between supposedly ‘not relevant’ and truly not relevant documents.

    The purpose of the Cranfield experiments is mentioned as to “investigate which of several indexing languages is best”. It is known that a single ‘indexing strategy’ cannot be suitable for all types of queries. How does Cranfield deal with the issue? Are there any experiments regarding the same and that can verify the Cranfield experimental setup?

    From the figures depicting the total unique documents retrieved by each of the groups, it is evident that manual runs dominated the other runs. Apart from the complexity of making an automatic run similar to a manual run, are there any other factors that made the groups more successful in manual runs only?

  22. 1. This paper is based on the Cranfield paradigm. Though the paradigm is discussed in the first section, the paper lacks a full picture of what it is and what its framework and/or key structure are. Besides the 3 assumptions, are there any other prerequisites or usage scenarios? It originally sought the best among indexing languages; how is “best” measured?

    2. In Section 2, it is mentioned that “Each document in the pool for a topic is judged for relevance by the topic author, and documents not in the pool are assumed to be irrelevant to the topic”. What is the justification for this assumption? If it does not hold, which may be common in the real world, what is the impact on “pooling” techniques?

    3. In Section 3.1, “Across all topics, 30% of the documents ….”. How is this value obtained? From Table 1, all we can tell is that only 30% of the documents are accepted as relevant by all 3 assessors.
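
The pooling assumption quoted in question 2 can be made concrete with a minimal sketch. The function names, the pool depth, the two runs and the set of truly relevant documents are all hypothetical and chosen only for illustration; this is not NIST's actual procedure.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each run for a single topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def qrels_from_pool(pool, judge):
    """Judge only pooled documents; anything outside the pool counts as not relevant."""
    return {doc_id for doc_id in pool if judge(doc_id)}


# Hypothetical runs for one topic, and a hypothetical ground truth.
runs = {
    "sysA": ["d1", "d2", "d3", "d9"],
    "sysB": ["d2", "d4", "d5", "d9"],
}
truly_relevant = {"d2", "d4", "d9"}

pool = build_pool(runs, depth=2)
qrels = qrels_from_pool(pool, judge=lambda doc_id: doc_id in truly_relevant)
missed = truly_relevant - qrels
print(sorted(pool), sorted(qrels), sorted(missed))
# prints "['d1', 'd2', 'd4'] ['d2', 'd4'] ['d9']"
```

Here `d9` is relevant but sits below the pool depth in both runs, so it is scored as not relevant. The resulting qrels are incomplete, and the open question raised in the comment is whether that incompleteness affects all systems equally.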

  23. 1. The top-k pooling method mentioned for trying to get appropriate relevance judgments without exhaustively going through every document seems like a great way to start (making relevance judgments) but I'm very surprised that designers of a track didn't consider combining domain knowledge with active learning techniques (for each query) to get a superset of results that search engines are likely to return (and hence aim for better completeness than what pooling by itself can provide), since most search engines would not possess domain knowledge that could have helped answer some queries better.

    2. Perhaps this is referenced elsewhere, but the difference between a 'manual' run and an 'automatic' run isn't completely clear. Aren't all runs 'automatic' since we're dealing with a system that is supposed to answer a query over a corpus of documents? Where does the manual component come in, and why is it that manual groups seem to contribute more unique documents when evaluated on a track (as compared to an automatic run)?

    3. The conclusion that the 'comparative evaluation of ranked retrieval results is stable despite the idiosyncratic nature of relevance judgments' seems more disturbing than reassuring. One would have intuitively assumed that with such grossly different relevance judgments among assessors (with not even 50% overlap in the best case) some systems would have outperformed the others and the Kendall's coefficient wouldn't have been so high. Although the statistics can't be disputed, I still can't grasp the reasoning behind this observation (abstractly or qualitatively) and would like a higher level explanation for why this should be so.
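
One way to build intuition for question 3 is to compute Kendall's tau on a toy example. The sketch below uses the basic definition (concordant minus discordant pairs, divided by the number of pairs, with no tie correction); the MAP scores for the four systems are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two orderings of the same systems (no tie correction)."""
    systems = list(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both score sets order this pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical MAP scores for four systems under two assessors' qrels.
map_qrels_a = {"s1": 0.40, "s2": 0.35, "s3": 0.30, "s4": 0.20}
map_qrels_b = {"s1": 0.33, "s2": 0.34, "s3": 0.25, "s4": 0.15}

tau = kendall_tau(map_qrels_a, map_qrels_b)
print(round(tau, 3))  # prints "0.667"
```

Every system's absolute score changes between the two qrels, yet only the closely matched pair `s1`/`s2` swaps order. Tau depends only on pairwise order, so a change in judgments that raises or lowers all systems in a similar way leaves it high even when the overlap between assessors is low.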

  24. 1. In Section 2, it is mentioned that pooling is a technique used by modern collections to judge a topic by creating subsets of documents. What are all the criteria that should be considered while choosing a subset? Who decides and how is the decision made while choosing the size of a pool and weighing the relevance of the documents in the pool?

    2. When consecutive runs of the same test collection could result in different outputs across topics there is an inconsistency evident in the system. So to avoid this inconsistency, in the later part of the discussion the author suggests the use of repeated runs of the collection to achieve stability. But would it not be apt to use a learning algorithm or a training program to achieve better reliability and accuracy?

    3. It is agreeable that a system based evaluation reduces the cost of a user based evaluation drastically but it comes with a huge trade-off in accuracy and user-relevance. But neither of the methods can give a holistic view of evaluation independently. Would this be a reason for the inconsistency in the test-beds of information retrieval field and lack of robustness? Apart from precision and recall what are the other measurable quantities that can be used to bridge this inconsistency?

  25. In section two, Voorhees, the author, speaks to collection size, stating that further investigation has shown size of the document collection is important, and I wonder if there are studies on how the type of document affects the outcomes of relevance judgements and recall and precision? It is easy to think about these concepts in text-based terminology, but what about sound, video, images? Are there different levels of recall and precision acceptance for different types of documents? How do sites like flickr and Pinterest handle these challenges? The author touches on this a little in the end with cross-language collections. There appear to be different judgements, the author says there are 'different sets of assessors for each language' which in part is why it is so difficult to compare/contrast them. Voorhees then says that one language is often slighted in favor of another. Can I assume that cross-document type assessments carry the same faults?

  26. In this article the author shows the results of a study testing whether differences between relevance judgments by separate assessors affect the evaluation of systems. The study analyzed data from the TREC conference using both MAP and the Kendall tau computation. However, in the Croft, Metzler, and Strohman chapter the authors argue that the tau computation has not been shown to be an effective measure of evaluating systems. They also state that a study of the effectiveness of a system should use several tests and measures. Should Voorhees have done more computations to confirm her findings, or is what she did enough?

    In this article the author briefly touches on the subject of cross-language collections. The article enumerates the various problems that exist with creating a collection that contains documents and queries in multiple languages. Would the use of modern translation software help to mitigate some of these problems or would the translation errors only hurt the effectiveness of any collection that was made?

    Near the end of this article the effectiveness of applying the Cranfield paradigm to an operational setting is examined. The article cites one study that says that the Cranfield paradigm does not translate to an operational situation. However the article rebuts this study by pointing out that the sample sizes used were too small and that several IR systems in use on the web today are based on laboratory studies like Cranfield and that they work. Is this a safe argument to make without any research to back it up?

  27. 1) Some seemingly obvious problems with test collections: How old are they? How often are new documents added to them? If the main test collections are mostly news reports and government documents, are they suitable for developing search engines that deal primarily with other topics or types of documents, such as audio files? Who curates these collections? Who engages the experts who make relevance judgements? How are different topics determined or defined? How do these collections deal with topics that overlap, such as chemistry and biochemistry, or even more closely related topics? Who is responsible for communicating this information to researchers who want to use the collections? What search methods can researchers broadly use to determine what collections and topics are available, and how often is that system critically examined?

    2) Some basic terminology issues: The word ‘groups’ is used frequently when referring to pools, both for contributing documents to the pool and for performing tests on it. Who are these groups, and are they actually testing their algorithms on data sets that they helped create? Is this problematic? What am I missing? Also, when searches were performed, the article references ‘automatic’, ‘manual’, and ‘mixed’ runs. I understand that each of those can produce different results, but I don’t understand what the search methods behind them are.

    3) The article says that for a test to be useful several different topics need to be searched. I am curious about how different topics are defined and how the degree of difference between topics is determined. For example, recipes for Chinese food and locations of Chinese restaurants are very different topics to a user who wants to go out to eat, but they are relatively the same to a user who is looking into the international space program. Along the same track: Are algorithms that interpret user need from a query developed in tandem with algorithms that can tell the difference between recipes and restaurants? It is not useful for a system to know the difference if it can’t also determine which one the user wants.

  28. 1) Voorhees does a good job summarizing how pooling works in the context of relevance judgments. Specifically, she discusses how pooling inherently violates Cranfield since it is not complete. She goes on to state that this is not actually an issue, since the purpose of completeness is to avoid bias. Hence, she dismisses studies that aim to improve the completeness of pooling, because they supposedly cause additional bias. Is this not a classic case of a tradeoff between precision and recall? If I have time on my hands, perhaps I want the option to sift through potentially irrelevant documents, as long as I have the assurance that if I cannot find what I’m looking for when I’m finished, it’s because it does not exist. It seems a little dismissive of Voorhees to always favor the more pure, unbiased set.
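For readers unfamiliar with pooling, the idea discussed in this comment can be sketched as follows. This is only an illustration under assumed toy data: the run names and document ids are invented, and a pool depth of 2 is used to keep the example small (TREC pools are much deeper). The pool for a topic is the union of the top-ranked documents from each submitted run. Only pooled documents are shown to the assessor, and everything outside the pool is assumed not relevant.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every run, for a single topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical ranked results from three systems for one topic.
runs = {
    "sysA": ["d1", "d2", "d3", "d4", "d5"],
    "sysB": ["d3", "d6", "d1", "d7", "d2"],
    "sysC": ["d8", "d1", "d9", "d2", "d3"],
}

pool = build_pool(runs, depth=2)

# Documents some system retrieved but that never reach the assessor.
# They are treated as not relevant, which is why judgments are incomplete.
all_retrieved = set().union(*runs.values())
unjudged = all_retrieved - pool

print(sorted(pool))      # judged documents
print(sorted(unjudged))  # retrieved but unjudged documents
```

The incompleteness is visible in `unjudged`: any relevant document that no contributing run ranked within the pool depth is silently counted as non-relevant.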

    2) Regarding the previous comment, I found it a bit inconsistent that Voorhees is trying to maintain an unbiased nature to the relevance judgments, yet the primary assessor in her experiments is the author of the topic. Perhaps the author seems well-positioned to make an accurate judgment about how documents fit into his/her topic. However, would not other raters serve as better judges of what the rest of the community might understand the topic to mean? This ended up being irrelevant to the final results, since the evaluation of ranked retrieval results was stable regardless of which set of relevance judgments was used, but I thought it was a bit of an odd choice for the base conditions of an experiment.

    3) This observation has less to do with the actual usage of Cranfield, but rather the author’s introduction, specifically the difference between system evaluation and user-based evaluation. I might be misunderstanding, but I’m not seeing how system evaluation is inherently separate from user-based evaluation. Is user satisfaction not tied to good rankings? Perhaps not direct user evaluation, but inferred user evaluation through clicks and time spent on page seems to be. I understand that the review is regarding Cranfield, but this motivation seems a bit weak.

  29. 1) What are automatic, manual, hybrid and 'other' types of runs?

    2) How do we infer from figure 1 that TREC collections contain relevant documents that have not been judged?

    3) Is mean average precision a sufficient measure on which to base the paper's conclusions about completeness and the role of relevance judgements? Are there other measures that could challenge the analysis and the reported results?

  30. a. In Completeness of Relevance Judgments the author mentions that the pool size needs to be restricted so that results can be obtained in a reasonable time frame. But she has not discussed how to evaluate the accuracy of the data set for various scenarios. It is possible that the data set on which search engines are being evaluated has insufficient data to test a particular scenario. How is such a situation recognized?
    b. New algorithms in the IR space are being designed to take personal choice into account. How can the effectiveness of personal choice be incorporated when testing on such data pools?
    c. The author has mentioned two ways of rating the relevance of the retrieved data, binary and graded. It has been indicated that graded relevance judgments would cost more but would be more beneficial. However, no scenario is illustrated in which graded relevance would be worth its higher cost. What would such a scenario be?