Thursday, September 26, 2013

10-3 Soboroff, I. and Nicholas, C. and Cahan, P. Ranking retrieval systems without relevance judgments.


  1. 1. This should be the most interesting but counterintuitive IR paper I’ve read so far. The author tried to construct a random model to represent the relevance judgment and use it for search engine evaluation. My first question is: why is only the ranking of search engines evaluated? Although it’s shown that there is a relatively weak correlation (0.36 ~ 0.48) between this evaluation and that from TREC, introducing random factors makes this model less stable and repeatable. Sometimes it’s more important to compare the preference between two specific systems; how does this model make accurate predictions for this kind of study?

    2. The fact that introducing duplicate documents benefits the evaluation greatly is actually understandable. Compared with the random and uncontrollable model of relevance judgment, introducing duplicate documents is another form of majority vote by systems, which counteracts the random model and contributes rational factors to the evaluation. So would it be better to go further, discard the random factor, and use a model dominated by “majority vote”? For example, we can rank the retrieved documents by the number of systems returning them, and define the top 100 documents as relevant. In my opinion this might return more consistent results compared with the random model.

    3. In the abstract the author mentioned that their model “could be useful in World-Wide Web search”, which is questionable as well. The model proposed in this paper depends heavily on knowing the average percentage of relevant documents or the individual percentage of relevant documents for each topic. Intuitively, only the latter will be useful for web search due to the vast variability of topics on the web. However, how can we measure the percentage of relevant documents for each topic in web search, given such a dynamic environment?
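
    The majority-vote idea in point 2 can be sketched in a few lines. This is only an illustration of that suggestion, not something from the paper: the function name `majority_vote_qrels`, the `runs` layout (system name mapped to a ranked list of document ids for one topic), and the toy document ids are all assumptions.

```python
from collections import Counter


def majority_vote_qrels(runs, depth=100, top_k=100):
    """Return the top_k documents ranked by how many systems retrieved them.

    runs: dict mapping a (hypothetical) system name to its ranked list of
    document ids for a single topic.
    """
    votes = Counter()
    for ranking in runs.values():
        # Count each document at most once per system, within the top `depth`.
        votes.update(set(ranking[:depth]))
    # Sort by descending vote count; break ties by document id so the
    # result is deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:top_k]]


# Toy data, invented for illustration only.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d3", "d4"],
    "sysC": ["d3", "d5", "d2"],
}
pseudo_relevant = majority_vote_qrels(runs, depth=3, top_k=2)
print(pseudo_relevant)
```

    In the toy data, d2 and d3 are each returned by all three systems, so they are the two documents marked as pseudo-relevant.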

  2. 1. The authors claim here that disagreements between assessors "probably do not concern documents which greatly affect the rankings of the systems in question"(p. 70), and that the results of this study prove that assessor disagreements are "concentrated on 'fringe' cases which don't affect many systems"(p. 70). But, I think most of our readings from last week would lead us to challenge this notion. For example, in "An Analysis of Systematic Judging Errors in Information Retrieval" by Kazai, Craswell, Yilmaz, and Tahaghoghi, the authors claim that discrepancies between assessors are important because "disagreements among judges may reflect the diverse opinions of real-world users"(p. 109). Can the real-world user, or the human aspect, be so removed from the idea of relevance as is suggested here?

    2. In the opening of this paper, the authors propose that pseudo-relevance judgments could possibly help to improve online web searches. But, this method seems reliant upon the use of small pools of documents in order to be effective. How would this be translated to an actual, online search engine? How would the small pools be created in an online collection (which is not as convenient or manageable as the TREC collection used in these experiments) before this method could be implemented?

    2.5. On the same topic of pooling, I was having trouble understanding how they created such successful, small pools of documents in the first place. From my understanding, deeper pools would have a wider range of actual relevance, but would include more obscure or less "obvious" documents which could be useful. However, the authors explain that in a shallow pool, there is a greater probability of finding "rare or unique relevant documents which other systems either don't find or don't rank highly"(p. 70). Maybe we can go over the process of how they came up with these kinds of shallow pools? They describe it briefly on p. 66, but it is still vague to me.

    3. I was interested in the idea of patterns of relevant documents retrieved, and the allusion to "Zipf's Law." Looking briefly into what this law entails, it seems to describe the frequency of words used in spoken language, and how a very clear, mathematical pattern can be identified from this phenomenon. The hypothesis of using pseudo-relevance judgments is based upon the notion that there is a similar pattern occurring in IR, but that pattern was not made clear to me in the findings presented in this paper. Is the point of the experiment just to show that a pattern does exist, as proven by the finding that the pseudo-relevance judgments tended to match those of the TREC assessors?

    1. A good description of the pooling process is in "Strategic System Comparisons via Targeted Relevance Judgments"- Alistair Moffat, William Webber, Justin Zobel p. 376 for anyone with a similar question!

  3. 1. In the introduction, the authors write that most experiments are focused on the “processing level,” or the level of comparing algorithms without looking at the user interface. What advantages or disadvantages are there to comparing systems at the processing level? Do you think that evaluation should always be done at the processing level before evaluating with the user interface? How might the user interface affect relevance perceptions?

    2. The experiment found that random relevance judgments worked best when used on the pooled set with duplicates left in, which relies greatly on the systems to identify the correct documents to include in the pool. Do you have any concerns about relying so much on the authority of the systems participating in the test? Could this perpetuate errors from “buggy” systems? How many good performers are needed to outweigh the bad?

    3. The authors write that human judges do not assess relevance randomly, but probably have the most disagreement on “fringe cases” (p.70). However, other articles we’ve read show low overlap between judges. Do you think further tests need to be done to look at which documents assessors disagree on? Did Soboroff et al. provide enough support for this claim that judges likely disagree on “fringe cases”?

  4. 1. I must appreciate this study’s attempt to replace human judgment and eliminate the various limitations and biases associated with it. What confuses me is that the new / proposed method comes into play only after the first four tasks from the official TREC evaluations, which are human assessor intensive, are completed. Don’t you think it’s ironical that it relies so heavily on a method it’s trying to displace?

    2. I think this experiment might actually have the answer to evaluating retrieval systems that deal with dynamic collections. For such collections, like the WWW, don’t you think selecting a pseudo-collection seems like a quick and cost-effective method, which reduces/negates human workload? With a disregard for duplicates and regard for a shallow pool depth, the preparatory work is minimal. This seems like a method that can keep up with the always-changing nature of these collections.

    3. I’m not sure if I fully agree with the shallow pools concept (or maybe I’ve misinterpreted it). How does choosing from a smaller set of documents increase the likelihood of choosing rare documents (pg. 70)? Similarly how does it not lower the chances of selecting obvious documents (pg. 70)? My knowledge of probability tells me that something is amiss in these statements – the probability of both, choosing a rare document, and selecting an obvious document should be directly proportional to the size of the collection one is choosing from.

  5. This comment has been removed by the author.

  6. 1. When proposing the idea of pseudo relevance, the author states that the basic idea is to extract expansion terms from the top-ranked documents and then make use of this information to create a new query for the second round of retrieval. Now, this method depends rather strongly on the expansion terms which get added to improve the overall performance. How will we deal with the cases when the expansion terms are not really related to the query? Also, doesn't the fact that most webpages, for example, contain heterogeneous data and information on multiple topics suggest that using simple pseudo relevance would not be thorough?

    2. If we do implement a Statistical ranking methodology - how do we propose to handle the massive size of the web given that statistical ranking is based on regression? Especially since we know that even a small regression error may result in a large ranking error. For instance, suppose we have to represent a minority class distribution where most of the class labels are 0. A ranking system that always assigns the class label 0, irrespective of the document whose relevance has to be judged, would cause a large ranking error. How do we hope to deal with this issue, which will still exist after incorporating statistical ranking?

    3. The paper suggests making use of smaller pools. I do not understand how the usage of a shallower pool could boost the probability towards finding a more 'obscure' document when nothing has been stated about the other factors which do affect the performance of pooling- like for instance the number of manual runs performed, the parameters involved in tuning the pool and whether these shallow pools were tested by making use of the exact same relevance judgements for every query as were used in the conventional pooling method. Till we get complete intuition on these aspects, isn't it naive for us to assume that shallower pools would culminate in better IR performance?

  7. 1. Including duplicates showed the best results because of agreement across runs – hence all systems indicating a higher probability of inclusion for these documents. What would have been interesting is to assign relevance judgments by having thresholds in prediction probabilities of the individual systems (maybe some way of sampling systems to remove system specific bias). The duplicate documents can be used to calibrate prediction scores. The key idea of having pseudo-qrels is that they are cheap to generate and don’t have to be reusable.

    2. As pointed out by the paper, pseudo-qrels do not enable stand-alone evaluation. While it is important to rank systems, subsequent improvement is often gained by analyzing true system performance over real judgments. To that end I do not understand the need for and the benefit of pseudo-qrels, since collecting true judgments will always be a requirement.

    3. I felt the paper to be misleading in the comparisons made to conclusions drawn in the variations paper by Voorhees -- especially the opening paragraph of the conclusion. A lot of the conclusions drawn from the study have been hypothesized by other papers we've read, and I didn't feel them being validated here either.

  8. 1. The author states "If we leave duplicate documents in our pool, a document is more likely to be chosen for a pseudo-qrels in direct proportion to the number of systems that retrieved it in their top 100 documents."
    I would expect a search engine to be able to find both copies of the document when searching for relevant data with its algorithm, and then to remove these duplicates from its rankings so as to provide the user with a distinct set of results in ranked order (or to do something like Google Scholar, which shows various links pointing to the same document at the same rank). But I have not been able to understand how adding duplicates to the pool increased the correlation between the data being retrieved. If the document is relevant enough, then both search engines should have been able to pick it up in the first test as well?

    2. In section 3.2 the author has not given any insight into why the shallow pools did not improve the correlation in TREC-5 and 8, which makes me wonder what led to the conclusion: "We found that by sampling from a pool of depth 10, our overall correlation (shown in Table 4) improved in TREC-3, 6, and 7, but not for TREC-5 or 8."

    3. The author mentions "This indicates that the ranking of systems is not nearly as affected by variation in the number of relevant documents as it is by which specific documents are selected." But this behaviour is not always expected of a search engine. For many queries users expect a diverse set of results. Does this not expose a gap in the relevance judgment? Judges might be prone to rank a document presenting some intent for a user query low even when there might be a big user group looking for that specific intent in the result set.

  9. In Section 2, the authors state the hypothesis that "human assessors can disagree widely without greatly affecting relative system performance". The hypothesis, from the context, is drawn from Voorhees's results. But there is a hole, neglected by the authors, which might undermine the very foundation of this paper. The hole is that in Voorhees's study, the human assessors selected were all NIST assessors who had been trained in the same system, and in the study they agreed on most things and disagreed only on small sets of topics. In my opinion, the authors' main research question "can we model the occurrence of relevant documents, and use sets of pseudo-relevant documents drawn according to that model to rank systems accurately?" is based on an inaccurate assumption or misunderstanding.

    In Section 2.2, the authors replace step 4 of the official TREC evaluation process, changing it from "using a trained human assessor to judge all the documents in the pool for those topics he or she created" to "using a statistical model to select a set of documents randomly to form a pseudo-qrels". The model selected is based on "a percentage value from the normal distributions with that year's mean and standard deviation as the fraction of documents to select from the pool". The pure randomness, without any heuristics or knowledge about each individual document's relation to the topics assessed, is in my opinion a gamble with non-deterministic results. A close analogy would be this imaginary story. The FSE (Foundations of Software Engineering) conference, as a top conference in SE, has a paper acceptance rate of 15%. The committee tries to eliminate the lengthy process of reviewing all the candidate papers (so the conference can be held monthly instead of yearly to encourage more novel ideas), so instead the committee randomly chooses 15% of the papers in the candidate pool as accepted. Obviously this method is not going to work. So, I don't think applying a random statistical model without any knowledge of document contents is going to replace human assessors. There must be better alternatives for automatic relevance assessment.

    In Section 3, the authors find that "the top systems are ranked far lower than they should be". To mitigate this issue, they use two methods: "Allowing duplicate documents in the pool" and "Limiting the pool depth". I think the finding itself points out exactly the problem with this random selection approach. The reason that the poor systems are less affected is most likely that those systems will not correctly judge relevance anyway, so wrongly matched qrels have little impact on their results, while these wrongly matched qrels from the random relevance model have a much higher impact on the top systems, which can make the right judgements. The two methods proposed are used to increase the chance of randomly hitting the right targets at the expense of pool quality. They are not good solutions in my opinion.
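
    For reference, the random selection step discussed above might look roughly like this. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the `runs` layout, the clamping of the sampled fraction to [0, 1], and the rounding rule are my own choices, and the mean and standard deviation passed in are placeholders rather than real TREC statistics.

```python
import random


def build_pool(runs, depth=100, keep_duplicates=False):
    """Pool the top `depth` documents of every run for one topic."""
    pool = [doc for ranking in runs.values() for doc in ranking[:depth]]
    if keep_duplicates:
        # A document retrieved by several systems appears several times,
        # so it is proportionally more likely to be drawn.
        return pool
    return sorted(set(pool))


def sample_pseudo_qrels(pool, mean_frac, std_frac, rng):
    """Draw a fraction from a normal distribution, then pick that share of the pool."""
    frac = rng.gauss(mean_frac, std_frac)
    # Assumption: clamp to a valid fraction; the source text does not say
    # how out-of-range draws are handled.
    frac = min(max(frac, 0.0), 1.0)
    k = round(frac * len(pool))
    chosen = rng.sample(pool, k)
    # Duplicates collapse here: a document is either pseudo-relevant or not.
    return set(chosen)


# Toy runs and placeholder statistics, invented for illustration only.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d3", "d4"],
    "sysC": ["d3", "d5", "d2"],
}
rng = random.Random(0)
pool = build_pool(runs, depth=3, keep_duplicates=True)
pseudo_qrels = sample_pseudo_qrels(pool, mean_frac=0.3, std_frac=0.1, rng=rng)
print(sorted(pseudo_qrels))
```

    Nothing in the sampling looks at document content, which is the objection above; with `keep_duplicates=True` the only signal that enters is how many systems retrieved each document.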

  10. The authors indicate that sampling according to topic statistics (hence modeling according to per-topic mean and std dev) should provide the best that one can do in terms of a model. However, one serious flaw is that the sampling and modeling are considered to be completely random and independent of the documents themselves. Would it be reasonable to expect that if we randomly sampled according to some probability distribution based on statistics extracted from a given query and document collection, we would get a correlation higher than the one obtained in this paper but lower than that of the runs compared against manual relevance judgments? That should be an interesting experiment to conduct.

    The authors mentioned that their findings could be applicable in a Web Search ranking context, where manual judgments are simply not possible due to size and scale. However, when I look at the figures, I see huge error bars and the authors conclude themselves that without a large number of averaged trials, results are not reliable or stable. Wouldn't conducting a large number of trials, especially considering the size and scale of the web, itself be an expensive task that repackages the solution of the problem as a problem itself? However, even without practical motivation, the findings of the paper were illuminating.

    Based on this paper and the literature we have read in the class thus far, it seems like there's a certain spectrum involving inter-assessor or evaluation research. Voorhees's study and this one both seem to show the robustness of inter-system rankings against radical changes in qrels sets. However, in other work, we have found that differences in assessor populations do have some effects, particularly in the previous class. Those studies were not random but did have populations that had different degrees of training/expertise. Given all these studies, can we state something 'universal' about the matter, that has always held, no matter how and on what qrels the assessment was carried out?

  11. Retaining the duplicate documents in the pooled documents will increase the probability of selecting a duplicate document. However, then we would implicitly be making the assumption that the importance of a document is linearly proportional to the number of times it has been found in the different system runs. Does the linear proportionality work or is a logarithmic scale better?

    Why was the sampling model selected as a normal distribution? Is it just because it has just two control parameters? How correct is the assumption that data on the web, which have no relevance judgements, can be modelled as a normal distribution?

    In the shallow-pool variant, it is stated that TREC-5 and 8 do not show any improvement. Why does the explanation given to support the increase in the correlation among runs in TREC-3, 6 and 7 not hold for runs in TREC-5 and 8?

  12. One of the goals as stated by Soboroff et al. is to potentially use their method for web search engines where the documents available are constantly changing. With such small pools being preferred for this system, how would that translate effectively over to the larger and varied selection of the web itself?

    Last class we touched on the idea of head and tail queries. How might a system like this one devoid of relevance assessor judgments interact with tail queries and the problems search engines face when dealing with them specifically?

    The paper mentions in the conclusions section that there is “a small probability that random chance could lead to different results.” If this is the case, how useful would this method be for search engines that would have to deal with the possibility of these differing results in their evaluations?

  13. 1. After 2001, what further progress has been made on the work initiated by this paper? How is this methodology applied in recent work?
    2. In section 3.3, it is mentioned that the greatest improvement in correlation was achieved by retaining duplicates in the pool. The question is, since duplicates are not common in a real IR system, what is the value of such a variation?
    3. This paper gave two reasons why it used a normal distribution. However, the second is not solid since it only mentioned one other sampling method. If another distribution or model were adopted, would this second reason still hold?

  14. When trying to create the pseudo-qrels, Ian tries to simulate the ad hoc task assessment pools via a normal distribution with that year’s mean and standard deviation. However, is this really valid? He offers little evidence to convince me that the distribution of the ad hoc task assessments is actually normal.

    In this study, the authors generate pseudo-qrels by running fifty trials for each TREC year. Is the number of trials too small here? This process is very similar to guessing a coin’s heads and tails; we need a very large number of trials to get a stable result.

    In this paper, Ian assumes that the disagreement of assessors on relevance judgments may not affect the rankings of the systems. However, I think this result is based on a very large and diverse test collection. So, considering the number of documents in his research, would this result still be reliable there?

  15. Explorations of the Random:
    1. I initially struggled to understand what "deep" and "shallow" pools were, but think that I gather the meaning now. If they mean a big pool vs. a small pool, then I suppose their findings that there is some improvement here make sense, but could not that be an artifact of shrinking the discrete set of potential relevant and irrelevant documents? (In other words, fewer possible permutations of judgment errors exist when there are fewer documents to make mistakes on)

    2. Aren't they just randomly sampling the top 100 documents according to a normal probability distribution (which seems a bit arbitrary...)? I am not sure what this brings to the discussion. If this is not the case, then it seems to me that my misunderstanding stems from their sampling method. If 10% of documents are relevant for pool A of 100 documents, so they randomly select 10% (or roughly 10%, using the mean and standard deviation to generate a normal distribution of the percentage of relevant documents in the pool), why would they have an accuracy level greater than 10-20%? Why would their mean average precision be of any use?

    3. While the authors initially favor using an exact fraction technique, they seem to be surprised in their analysis that this technique is outperformed by including duplicate documents in the pools. But from a theoretical perspective, it seems to me that document re-occurrence in a system's outputs (the "runs") suggests document relevance. In addition, duplicates then have a higher probability of random selection from the pool. In light of this, it sounds like including duplicate documents might actually be the best treatment (of their three) to add signal. As the number of systems/runs used to generate the document pool approaches infinity, would not a count or measure of the recurring documents provide roughly "crowd-sourced" relevance judgments? This does not solve the problem of automating aspects of the evaluation process, but it does reframe the question and the variables in a way that seems more intuitive.

  16. 1. The paper was finished in 2001. TREC has developed since then. Does the conclusion of this paper still hold?
    2. This paper is based on sampling from a normal distribution. Why did the author choose a normal distribution? Does the TREC data just happen to fit a normal distribution, so that it leads to the conclusion of the paper?
    3. It is said in section 3.2 that the overall correlation was improved in TREC-3,6, but not for TREC-5,8. Why was there such a difference?

  17. 1. The proposed model uses only information about average relevant document occurrences. However, it does not have any information about the individual systems in which they are run, the topics, or the individual documents. So will this model work for different kinds of topics (hard/soft) and queries? How well does it model the real-world scenario?

    2. The authors have mentioned the advantages of their method when shallow pools were used. The effect of a smaller pool depth of 10 documents is that they were able to find rare or unique relevant documents, unlike other systems. And this was attributed to using shallow pools. But doesn't this contradict the basic need of a retrieval system to balance recall and precision? Would a user not be interested in highly relevant documents that are viewed and rated by everybody rather than a rarely occurring relevant document? How do they substantiate the relevance and ranking of unique and rare documents when they have discarded the use of relevance judgments?

    3. The proposed random-sampling ranking method is founded on picking random documents; how do they perform ranking of these documents without relevance judgments? Can ranking be done? It seems capable only of distinguishing documents in the best pool from those in the worst, but this may not be useful when assessing a retrieval system, considering that users are interested only in top-ranked search results. Is there a way to measure the correctness or effectiveness of such random-sampling-based ranking techniques?

  18. 1. The author mentions that, for their experiment, they do have the benefit of already having relevance judgments from the TREC run. As a result, they don’t have to go through the process of trying to estimate the number of relevance judgments. In their experiment, this did give them the added bonus of being able to test certain cases such as using exact values to get a better handle on their approach. At the same time, if someone was going to use this approach to evaluate a web search scenario, they would not have advance knowledge of all the relevance judgments. The authors don’t mention how to make these estimates. How does this technique actually hold up when starting to explore without any prior knowledge available? Is it easy to make the required estimates or is there a hidden labor cost to this method?

    2. The author performs a number of variations of his methodology to try and pinpoint what variables lead to the best results. The first variation, keeping duplicate documents, had the biggest positive impact on the results. When multiple systems return the same document, there is a higher chance the document will be relevant. Keeping the duplicates increases the chance of this document being supported by both systems to be retrieved. In class last week, we talked about a paper that found an erroneous pessimistic judge has less of a negative impact than an erroneous optimistic judge. It’s interesting that, for this technique, the relevant documents are more beneficial. Is this perceived difference in effect because the nature of this technique is comparing IR systems? As a result, seeing how systems rank the same relevant documents has more of an impact than where non-relevant documents wind up.

    3. The final variation to his approach is to use the actual distribution. Since the author used previous TREC runs, he was able to use these relevance judgments as the ground truth and compare his result if he utilized this knowledge. Surprisingly, he found that keeping duplicates outperformed this approach. As a result, he concluded that the performance of his approach is influenced more by what documents are compared rather than the number of relevant documents. The optional reading brought up the issue of reusability of relevance judgments. With this in mind, I am curious how this dependence on what documents are used impacts how reusable or applicable this approach is in various settings?

  19. 1. The author performs tests to try and demonstrate the possible effectiveness of system evaluations without true relevance judgments, but they start their experiment with a fully judged set of documents from TREC runs to avoid having to do a truly random test. Does this tie to a judged collection influence their results? How can they talk about truly random judgments when they themselves aren't practicing their proposed methods?

    2. If a document is retrieved by many systems in their duplicate documents in pools section, it has a higher chance to be picked in the pseudo-qrel. Given this produced the more accurate results in their experiments, wouldn't this leave the random judging process more vulnerable to spam and tricks like the ones Google discussed in their rater documents? Something can be retrieved across many systems without being truly relevant to a search.

    3. In the exact-fraction sampling section, the author took the actual exact percentage of the pool that was judged relevant in TREC. How would this translate into a true application of the random relevance selection where the prior rate of relevance is not known?

  20. This comment has been removed by the author.

    1. This certainly has to be one of the most interesting papers we have read so far. I believe that the authors also introduce a lot of doubts in addition to the new insights that they bring in.

      1. The authors state that the pseudo relevant sets will be a great way to measure and rank web based retrieval systems. This could have been tested by using the Web Track available from the TREC collections. Can this study be proven to be equally effective in the case of a web track? Are there any other studies which show the same?

      2. The results show a marked improvement in the performance when the duplicates are allowed in the pool. This insinuates a need for a lot of reasoning. Duplicates of highly relevant documents in the pool are bound to increase the probability of a ‘relevant hit’ when a document is picked at random. How does this effectively test the system’s performance? Was this pseudo-improvement not the reason why duplicates were removed in the first place?

      3. The authors also show that adding duplicates, making the pool shallower, and exact fraction sampling fare poorly or show limited improvement in the results of highly ranked systems. The marked improvement in performance is consistently found only in the middle of the rankings. We have seen in the previous articles that the efficiency of highly ranked systems is way better than that of the low ranked ones. Does this not reiterate the fact that relevant documents can be picked by chance, but if the system has to be highly effective it has to do something more than that? Or does this have more to do with the way in which these values have been computed?

      4. Does the argument of the effectiveness of random relevance judgments hold good in the case of a very large pool? In a large pool, the probability of a relevant document is expected to go down significantly. Does this study hold good in that case?

  21. 1. The paper is interested in comparing two methods for generating relevance judgments: the normal NIST approach and an automated approach that randomly selects from the document pool. With the automated approach, the authors say that they randomly select a number of documents in the pool (number equal to avg. number of relevant docs for past TREC experiment) for each topic and randomly assign relevance. Why would there still be such a high Kendall’s Tau correlation since documents are both randomly selected from the pool and randomly assigned relevance judgments?

    2. The Kendall Tau correlation seems pretty high, but is it actually very high considering a baseline of binary relevance judgments (the TREC experiments evaluated use binary relevance, right?)?

    3. One claim made by the author in the conclusion concerns why judge disagreement doesn’t matter as much as we think it should. It is claimed that disagreement doesn’t matter too much because judges tend to agree on the documents that matter most (e.g. high ranked documents). Shouldn’t this motivate an alternative standard for measuring relevance agreement (i.e. a measure that specifies agreement at different levels of relevancy)?
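
    Since the questions above hinge on Kendall's tau, here is a small sketch of how the statistic compares two system rankings. It assumes both rankings contain the same systems with no ties; the function name and the toy system names are made up for illustration.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two orderings of the same items (best first, no ties)."""
    pos_b = {name: i for i, name in enumerate(ranking_b)}
    concordant = 0
    discordant = 0
    for x, y in combinations(ranking_a, 2):
        # x is ranked above y in ranking_a; check whether ranking_b agrees.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings: one from official qrels, one from pseudo-qrels.
official = ["sysA", "sysB", "sysC", "sysD"]
pseudo = ["sysA", "sysC", "sysB", "sysD"]
print(kendall_tau(official, pseudo))
```

    Here only the pair (sysB, sysC) is swapped, so five of the six pairs agree and tau = (5 - 1) / 6. Identical rankings give 1 and fully reversed rankings give -1, which is the scale the reported correlations sit on.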

  22. I enjoyed this paper the most out of all of the ones we've read, and that seems
    to be a common thread in these comments. My questions:

    1. Since relevance judgements assigned randomly according to some model do
    produce *some* positive, significant correlation to the actual ranking of
    retrieval systems, is it possible to somehow combine this technique with human
    relevance judgements to decrease the cost of creating a test collection?

    2. In order for this approach to work, the researchers need to model the mean
    and standard deviation of the occurrence of relevant documents in these pools,
    but in order to get those numbers, don't you need to do the relevance
    judgements in the first place? If so, then how would this be useful for their
    stated goal of automatically generating test collections on web-scale document
    collections?

    3. They use a normal distribution for modeling the occurrence of relevant
    documents in a test collection pool, and their justification is that they
    didn't notice a big difference between this model and using the exact
    percentages of relevant documents. That seems like poor research. They didn't
    spend any time addressing why this might have been the case. Is it valid to
    assume a normal distribution because you didn't see a difference between that
    and another model?
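
The two sampling models contrasted in question 3 can be put side by side in a small sketch. Everything here is an assumption for illustration: the function names, the clipping of the normal draw to [0, 1], and the mean, standard deviation, and per-topic fractions, which are made-up numbers rather than values from the paper.

```python
import random

def sample_fraction(mean, sd, rng):
    """Draw a per-topic relevant fraction from Normal(mean, sd).

    Clipping to [0, 1] is an assumption of this sketch, so that the
    draw can always be used as a fraction of the pool.
    """
    return min(1.0, max(0.0, rng.gauss(mean, sd)))

def pseudo_relevant_count(pool_size, fraction):
    """Number of pool documents to mark as 'relevant' for one topic."""
    return round(fraction * pool_size)

rng = random.Random(42)
pool_size = 1000

# Hypothetical per-topic fractions, standing in for known judgments.
exact_fractions = {"topic1": 0.02, "topic2": 0.08, "topic3": 0.15}

for topic, exact in exact_fractions.items():
    # Normal model: same mean and sd for every topic (made-up values).
    modelled = sample_fraction(mean=0.08, sd=0.05, rng=rng)
    print(topic,
          "normal model:", pseudo_relevant_count(pool_size, modelled),
          "exact fraction:", pseudo_relevant_count(pool_size, exact))
```

The normal model needs only two numbers for the whole collection, while exact-fraction sampling needs one number per topic, which is the dependence on prior judgments that question 2 points out.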

  23. In section 3.2, the authors speak to how sampling from a deep pool makes it less likely to draw rare relevant documents, while a shallow pool would increase the likelihood of finding these rare documents. I don't understand how that is possible. Wouldn't a deeper pool allow for more unique documents to surface? While there will also be an increase in duplicates, I don't understand how a deeper pool would have fewer unique items within it. I understand how this affects precision, but I'm not sure how this affects relevance.

    Earlier in the semester we talked about how the outliers can sometimes be the true measure of a system. We talked about how an algorithm is only as good as its weakest link (so to speak), and the authors here say in their conclusion that the differences between human judges are not randomly distributed with regard to their impact but primarily affect the 'fringe cases which don't affect many systems'. So are these fringe cases not important, or are they the crux of this method, to be judged like the algorithms, which are only as good as their weakest precision scores?

    In section 2.1, the authors talk about how the WWW is dynamic, how it's unrealistic to score against something that's always changing, and why the TREC collections are not useful in this case. What I'm wondering is, how do companies like Google, Bing, etc. test that their documents are relevant when they are live, growing collections? The dynamic structure seems insurmountable to test, but they must, right?

  24. 1) First off, I thought the idea in this paper was really cool, since after reading all those Voorhees papers that basically say “it doesn’t matter if the judges disagree,” I started to wonder if having judges even matters then. One key part of their technique is that the authors chose an “all or nothing” approach in terms of their choice of model. Specifically, they purely use a normal distribution for their pseudo-qrels. The reasons for this make sense, but would there be any benefit to taking a middle ground where parameters could be assigned using a minimal amount of human input?

    2) I realize that the judge-free technique here is a proof of concept, but how can they justify using the mean and standard deviation of relevant documents as the basis for their normal distribution? If this is to work in any real-life scenario, these values will not be known. How will they choose them then?

    3) The authors’ approach of using “artificial” relevance judgments seems cool, but has the work been expanded into anything else? I’m particularly curious since this is an older paper, and now with crowdsourcing techniques available for coming up with tons of cheap relevance judgments, their motivation does not seem to carry as much weight. Are any “artificial” relevance judgment techniques at the point where they would be preferred to crowdsourcing with noisy, human relevance judgments?

  25. 1- Where is the authors’ plan for how to apply this method to a web search or other topic with no existing relevance judgements to base statistics off of? Individual topics had to be closely analyzed to find the probability of a relevant document being found for each before an overall average was determined for the year, and since a strong correlation was not found from year to year, the issue of true independence from relevance judgements must be raised. What is their plan to turn this into an independent effectiveness measure? If that is not the ultimate goal, that needs to be explicitly stated and the methods used need to be validated.

    2- I would like to see how well multiple assessors’ judgements map to the random ones. How many assessors are needed before the total selected documents resemble random selection? Are human assessors generally more in agreement with each other than with the random set, as the article states? By what percent? What about 5 or 8 assessors? If, at a higher number of assessors, the assessments are not widely varied, does this argue against using the random selector method?

    3- How do the researchers propose to handle graded relevance judgements? Can their method be extended to include different probabilities for different levels of relevance? Is this relevant to their purpose?

  26. 1. The authors of this article, in their justification for their experiment, state that the traditional method for creating the TREC test collections could not be scaled up to the larger sample sizes that are needed to test search engines designed for use on the web. However, in recent years the TREC program has used very large test collections and has had several tracks that focus on web searching. Do you think that the work they did in this paper helped to create these collections, or that their assumption was wrong?
    2. In this article the authors state that the differences in the relevance judgments from different judges are mainly concentrated on “fringe” cases that do not cause problems for most systems. If this is the case, is there some way to identify these fringe cases and either eliminate them or create some model that would account for the problems they cause for judgments?
    3. In this article the authors explored three different variations that they applied to their results to help improve the correlations that they observed. These variations were re-adding the duplicate documents to the test pool, using a smaller pool of 10 results, and exact-fraction sampling. Can you think of some other variations that they could have explored to help their conclusions?
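
Two of the variations listed in question 3, keeping duplicates in the pool and using a shallower pool, can be illustrated with a toy pool builder. This is a sketch under assumptions: the `build_pool` function, its `keep_duplicates` flag, and the three runs are hypothetical and only meant to show how each variation changes what gets sampled.

```python
from collections import Counter

def build_pool(runs, depth=100, keep_duplicates=False):
    """Collect the top-`depth` documents of every run.

    With keep_duplicates=True, a document retrieved by several systems
    appears several times, so uniform sampling favours it.
    """
    pool = []
    for ranking in runs.values():
        pool.extend(ranking[:depth])
    if keep_duplicates:
        return pool
    return sorted(set(pool))

# Hypothetical runs for a single topic.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d1", "d3", "d5", "d6"],
    "sysC": ["d1", "d2", "d7", "d8"],
}

deep = build_pool(runs, depth=4)            # every document, once each
shallow = build_pool(runs, depth=2)         # only documents ranked near the top
with_dups = build_pool(runs, depth=2, keep_duplicates=True)
weights = Counter(with_dups)                # how often each document would be drawn from
```

In this toy case `d1` appears three times in `with_dups`, so a random draw is three times as likely to pick it as `d3`, which acts as an implicit vote by the systems.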

  27. This comment has been removed by the author.

  28. 1. This paper proposes an interesting topic: evaluating retrieval systems without relevance judgments, based instead on statistical information. Since this paper was published in 2001, I am interested in how many follow-ups or other research improvements have been based on or inspired by it. So I searched on Google Scholar and found that since 2001, this paper has been cited 161 times. The topics vary a lot: from “Retrieval evaluation with incomplete information” to “Information Retrieval: a health and biomedical perspective”. I am wondering whether this method is really reliable and applied in practice, since the hypothesis in this paper is so strong and the results may easily vary from case to case?

    2. Previous evaluation methods are based on relevance judgments, and this paper abandons relevance judgments entirely. So I am wondering about the case that lies in between. That is, suppose we have some relevance judgments, but their number is very small compared to the total number of documents. Can we make use of this method to aggregate more documents?

    3. Another question is about the experiments. I am not quite convinced by the experimental results here. The charts and tables shown in this paper mainly compare average values, so everything seems to be statistically related. Can we validate it more rigorously?