1. The authors state that the best experiments are done using “real or realistic” data and experiments that are designed to model the real world (p. 376). What makes one set of data more realistic than another? Should query logs or common sense be used to determine which topics are most realistic for a particular collection of documents? How does this assertion affect the experiments we read previously that used synthetic data?
2. The authors write that pooling does not take “buggy” systems into account; rather, it treats all systems the same. Why are buggy systems never mentioned in the other articles? What percentage of systems return wrong results like this? How does knowing about “buggy” systems influence your view of the reliability of TREC tests and the pooling method?
3. The alternate methods find more relevant documents than the standard TREC pooling method. Is it good that these alternate methods identified more relevant documents? How relevant are these documents, given that the authors are only working with binary relevance?
1. This paper presents an important study of document selection for relevance judgment based on Rank-Biased Precision (RBP). In the model, a value of 0.8 is mainly used for the parameter p. The effect of different p values is explored at the end of the paper, and it is shown that the composition of the retrieved documents changes drastically as p changes. How do we find the p value that maximizes performance? Since p is defined as the likelihood that the user goes further down the search results, would it be better to define p in a topic-oriented way?
2. The method described under “adaptive methods” is a little confusing. How is the estimated RBP value calculated? Also, the RBP value depends strongly on the number and composition of the runs compared in the study, since documents ranked highly by more runs will have a higher RBP value. If so, and since we know there are similar runs in TREC (contributed by the same search engine with different parameter tuning), will this bias the calculation of the RBP value?
3. The exploration of the number of documents necessary for relevance judgment is limited to 50 topics in this study. Are there similar studies using different numbers of topics, to see whether the conclusion generalizes? Also, is there any study that evaluates the combined effect of the number of relevant documents and the number of topics on the evaluation of search systems?
1. We’ve continuously discussed binary versus graded relevance judgments; Soboroff et al. (a reading from this week) say that relevance is not a binary relation but can assume a range of values (Soboroff et al., pg. 66). All the experimental judging methods we read about in this paper use binary relevance judgments. Do you think binary judgments are acceptable, in the interest of time, cost, resources, and simplicity? Or do you think that graded assessments would lead to a more valuable pool?
2. This research highlights the limitations of pooling, especially that it may not be very good at recall (retrieving relevant documents). Can you think of ways of improving the pooling strategy so as to increase the number of relevant documents retrieved per run? Maybe something like incremental pooling, which adds documents based on the success of previous rounds?
3. The researchers say they believe that, for dynamic selection, assessor bias would not have a strong effect on the result (pg. 375). We have discussed how important this factor can be to relevance assessments. Don’t you wish the researchers had elaborated on why they believe so, particularly with respect to this method, in which each outcome is influenced by the previous step?
1. I was interested in the section which discusses the shared nature of good and bad runs: "bad runs and good runs share a common propensity to introduce documents not proposed by other mechanisms, and that it is difficult in a static and pre-identified judgment pool to differentiate between the two" (p. 379). In other words, what good and bad runs have in common is their ability to deliver results that other, average ones would not. What can we learn from observing bad runs and implement in the creation of good ones?
2. The authors state that confidence tests are an "important part of any system comparison" (p. 380). What is a good way to conduct a confidence test? Why haven't our other articles discussed them? I just did a quick search (control F style) through all the documents we have read in this class, since I didn't have anything in my notes regarding these tests, and came up with no mention of "confidence test." Am I missing it, or is this the only article which discusses them? If so, what are some good references on this topic?
3. Are the authors proposing a similar notion here as the authors of "Ranking Retrieval Systems without Relevance Judgments", namely that a kind of pseudo-judgment can be made, or at least predicted, based on observed patterns? Specifically, I am referring to the section on RBP projections: "Another interesting possibility is to extrapolate from the known bases and rs assuming that the unjudged documents are found to be relevant at the same rate as the judged ones. This is a reasonable estimate, since the unjudged documents tend on average to be lower in the system ranking than the judged ones, and, unless a system has particularly perverse behavior, the probability of a system identifying a relevant document is non-increasing down the ranking" (p. 379). This may be a bit of a leap on my part.
1. When using Rank-Biased Precision, how can we justify not taking into account factors like the collection size and the number of documents that are relevant to each query? How does ignoring these factors affect the effectiveness of this metric? Since this method is based solely on user preference, the value for the persistence is completely user dependent. Isn't this a drawback, especially when there is no conventionally agreed-upon method for choosing it? And what about the lack of any kind of baseline?
2. The paper speaks of working towards minimal test collections but does not provide a methodology for implementing this principle. Since annotating the test collection is required, and we also need to collect documents and queries, how can we reduce the time spent building the test collection while still establishing the high confidence we get from complete judgments and test collections that are annotated a priori? And how can we get the best value in terms of 'qualitatively discriminating between systems and at the same time computing quantitative performance benchmarks for competitive systems'?
3. The Wilcoxon test has been used for significance testing in this investigation. But we have seen that the Wilcoxon test can produce false detections of significance, as it is in effect a simplified version of the randomization test. The absolute differences are replaced by their ranks, so there is a definite loss of information. So how can we account for the percentage of significant results that the user may consider insignificant? And, similarly, how can we account for the percentage of results which have been tabulated as significant but are in fact insignificant?
1. The paper skims over the claim that adjusting to query-related variability can significantly reduce the judging pool, but I did not see this addressed. I think exploring this would have added a lot of value (especially if there is a way to draw heuristics from previous collections).
2. I feel that the approximation made to the standard pooling approach, and its use as the baseline, is unfair (in terms of the number of documents considered per run). Most of the methods presented got better by enabling better coverage; while pooling does this across all runs, here it is more concentrated among a few runs.
3. TREC5 was often used to set parameters and to understand what works and what doesn’t in a trial-and-error fashion; the extent to which the development set was used is not clear, nor is its necessity.
The author is vague as to why it was decided to cube the estimated RBP values in Method C rather than, for example, squaring them. Was it just because cubing gave the most accurate results?
The value of p is used in multiple steps of the evaluation, making it a very important part of the whole activity; an error in judging its value makes it a single point of failure in the process. Isn't this very risky?
It might appear that this way of judging has an advantage over the usual way in which judgments are made, but even this process can introduce errors, as the author himself admits on page 4. It is more like coming out of one trap and entering another. What are the factors on which one can decide which of the errors can be ignored in a particular scenario?
In Section 2, the authors list problems with Mean Average Precision. I understand that MAP requires that all of the relevant documents for each query be identified, and that the relationship between MAP and user behaviour is also problematic. But I don't quite follow "there is no sense in which doing more judgement work guarantees a higher-fidelity approximation to the underlying behaviour of the system being measured". Is it not true that more judgement work can make MAP statistics more accurate? Why do the authors feel there is no sense in doing it? Maybe there is a sweet spot between judgement work and fidelity? If so, how do we know when we can stop doing more judgement work? The authors do not address this explicitly.
In proposing Rank-Biased Precision, the authors assume that "the likelihood of inspecting the ith document in the ranking is thus p^(i-1)". Where is the evidence that user behaviour can be modelled using this exponential formula? Why can the authors rule out other factors which might influence user behaviour, such as UI design or browser design? The authors do not give this evidence in the paper.
In Section 3, the authors list a method, "Summing contributions", for selecting documents. The method "computes the total weight of each of the documents, as a sum of their weights in the runs". I am wondering how applicable this method is. If there are a large number of candidate documents, what is the cost of evaluating each of them? Does it defeat the very purpose of selecting fewer documents for the test collection than the traditional pooling method does? Or did I miss anything?
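On the cost question for "Summing contributions", a minimal Python sketch may help. It assumes the per-rank weight is the standard RBP weight (1 - p) * p^(i-1) and that each run is simply a ranked list of document ids; the function names and the two toy runs are hypothetical, not taken from the paper. Under these assumptions, the total weight of every document comes from one pass over the submitted rankings, before any document is read by an assessor.

```python
from collections import defaultdict


def rbp_weight(rank, p=0.8):
    """RBP weight of the document at 1-based position `rank`."""
    return (1 - p) * p ** (rank - 1)


def rbp(relevance, p=0.8):
    """RBP of one ranking; `relevance` holds 0/1 judgments in rank order."""
    return sum(r * rbp_weight(i, p) for i, r in enumerate(relevance, start=1))


def summed_contributions(runs, p=0.8):
    """Total weight of each document, summed over every run that retrieves it."""
    totals = defaultdict(float)
    for ranking in runs.values():
        for i, doc in enumerate(ranking, start=1):
            totals[doc] += rbp_weight(i, p)
    return dict(totals)


# Hypothetical toy data: two runs, each a ranked list of document ids.
runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d1"],
}
weights = summed_contributions(runs)
# Candidates for judgment, most heavily weighted first.
judging_order = sorted(weights, key=weights.get, reverse=True)
```

The only expensive step, human judgment, would then be spent on the front of `judging_order`; the weighting itself needs only the rankings.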
That pooling attributes the same (apparent) fidelity to every system was portrayed as a disadvantage (on page 2). If that is assumed to be true, and different topics and systems are given non-uniform weights, can this not be categorised as experimental bias? Given that the topics could come from various backgrounds and difficulty levels, and that we cannot effectively quantify such parameters, can it still be considered a disadvantage of pooling?
In the cubed midpoint method, the expected RBP is cubed. Was it cubed (and not, say, raised to the fourth power!) only to better fit the data? Why would cubing be better than raising the term to other powers?
When the authors talk about RBP projections, they claim that assuming 'unjudged documents to be found relevant equally as judged documents' is a reasonable estimate. Does this not directly contradict Zobel’s finding (read in week 2 or 3) that the number of relevant documents from the next K-depth pool of the individual runs is small (an average of 1 relevant document per system) and that it would not affect the relative system ranking? Am I missing something?
In using the dynamic selection method, Moffat et al. address the issue of potential assessor bias. They describe it as being “not likely to be strong”, but as our previous discussions have pointed out, the impact of assessor bias has to be taken into account. Does this change anything in the way we perceive the conclusions reached by this paper?
Moffat et al. discuss “pooling insensitivity” (376) as one of the disadvantages of pooling. If pooling has this problem of placing all systems on the same playing field, as Moffat et al. suggest, is all previous data produced with pooling ambiguous for all systems? I guess what I’m thinking is that if they believe pooling “assumes all system should be scored to the same level of (apparent) fidelity”, then what does that mean for individual systems across the board that used pooling and might have scored lower?
I’m interested in what the article means by a “reasonable first-order approximation of actual user behavior” (376). Does RBP mimic the end result of the user interaction mentioned, or just the behavior described in Joachims et al. that Moffat et al. reference for their model?
The authors have made a very persuasive argument for using, for example, RBP and selective relevance assessments to achieve the same thing (comparative evaluation of systems) at a lower cost. Obviously, given the same amount of resources, we can conduct experiments on more queries and get even more reliable scores for systems. However, I get the impression TREC has not adopted this methodology yet. Why is that?
Even if we cannot directly use RBP for comparing systems, perhaps because we wish to continue using MAP for ad hoc retrieval evaluation, the first graph shows a clear correlation between RBP and MAP. I wonder whether using this metric to bound MAP results is an interesting area of study, especially since, if it is, it could be used to derive the same or very similar results to those TREC traditionally produces, but at a lower cost. This way, it seems like a win-win, both for proponents of this method and for those advocating sticking with the traditional metrics.
Subtle as it may be, we again find ourselves grappling with the thorny issue of topical vs user relevance in this paper. For example, the author states on page 2, when explaining the rationale behind choosing RBP, that 'The relationship between MAP and user behavior is also problematic. Is a user actually 100% satisfied if they examine the top ranked document for a query, find that it is relevant, and then look at another 999 irrelevant documents before they stop?'. In other words, the reason for not using MAP is that it is not as relevant to a user perusing those results as RBP. However, I find it encouraging that there is a correlation between MAP and RBP, as mentioned before. Does this mean, in conjunction with the other readings, that we can put this distinction behind us and assume it doesn't really matter which metric (topical or user relevance) we are discussing?
It seems like this kind of distinction would only be of significant importance in an IIR setting, and that these findings show they are not as important when discussing more objective IR evaluation results.
1. This paper refers to Carterette’s work as “judging documents uniformly from the top of ...”. Carterette made this point under the assumption captured by the term “uniformly”. What happens if this assumption does not hold? If it does not hold, how can the work in this paper handle the 2nd objective mentioned here?
2. The static method in this paper requires that a list of candidates for judgment, ordered by their importance, be chosen prior to inspection of any documents. How does the paper decide which run should be considered first and which second? In other words, the sequence of runs (systems) is unclear. Even if the order is chosen randomly, unless the random ordering is repeated enough times, it remains an open question whether the order affects the final result.
3. In the section “Establishing confidence”, the base-vs-base values are stable, as the paper mentions, but those of base-vs-top and base-vs-proj are not. What do the changes in these columns stand for?
When introducing Rank-Biased Precision, Alistair mentions that, with more persistence, users are more likely to read a larger number of results. However, this seems too simple, because users’ behavior may be influenced by other factors as well. In that case, the model is unreliable to some degree. And if more factors were included in the model, would the results of this research be different?
When trying to explain b_{s,d}, Alistair mentions that b_{s,d} will be unlimited if d does not appear in the ranking. Does this mean that document d is the only one that satisfies users and that the other documents they read do not?
In RBP, it seems that unjudged documents are all assumed to be relevant. In reality, this is unlikely. So would this bias affect the validity of the model?
1. In Section 4 (Method C), the variance is cubed as the estimated RBP value. Why? Is a larger value more efficient in later analysis?
2. The discussion of RBP projection assumes that the unjudged documents are found to be relevant at the same rate as the judged ones. What happens if this assumption does not hold?
3. Table 4 shows the average residual for each choice of p when document candidates are chosen. But an average alone does not tell us much. What happens if the residual varies greatly?
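To make the projection assumption in question 2 concrete, here is a rough Python sketch. The base and residual follow the usual RBP definitions (judged-relevant weight, and unjudged weight plus the tail beyond the ranking). The projection formula, base / (1 - residual), is only one reading of "relevant at the same rate as the judged ones" and is an assumption here, not the paper's stated formula; the function names and toy data are hypothetical.

```python
def rbp_base_and_residual(ranking, judgments, p=0.8):
    """Return (base, residual) for one ranking.

    `judgments` maps judged document ids to 0 or 1; documents missing
    from it are unjudged and add their weight to the residual.
    """
    base = 0.0
    unjudged = 0.0
    for i, doc in enumerate(ranking, start=1):
        w = (1 - p) * p ** (i - 1)
        if doc in judgments:
            base += w * judgments[doc]
        else:
            unjudged += w
    # Weight of everything below the end of the ranking is p ** depth.
    residual = unjudged + p ** len(ranking)
    return base, residual


def rbp_projection(base, residual):
    """Assumed projection: unjudged weight is relevant at the judged rate."""
    judged_weight = 1.0 - residual
    if judged_weight <= 0:
        return 0.0  # nothing judged yet, so there is no rate to extrapolate
    return base / judged_weight


# Hypothetical toy data: d3 is unjudged; p = 0.5 keeps the arithmetic simple.
ranking = ["d1", "d2", "d3", "d4"]
judgments = {"d1": 1, "d2": 0, "d4": 1}
base, residual = rbp_base_and_residual(ranking, judgments, p=0.5)
projected = rbp_projection(base, residual)
```

If the assumption fails, the true score can still only lie between `base` and `base + residual`; the projection is a guess inside that interval, which is why a large residual matters.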
1. MAP faces a key issue: it requires all of the relevant documents for each query to be identified. This means a MAP score computed over a subset of documents is only a partial effectiveness measure, which implies an improper assessment of the retrieval system. Since it is well known that it is largely impossible to construct a system capable of retrieving all relevant documents for a particular query, in which scenarios can MAP help? How can it be considered a useful metric for assessing relevance?
2. The paper mentions the need for a model of user utility or satisfaction. Is this realistic? How far can one model something as volatile as user satisfaction? Although metrics such as Reciprocal Rank and P@k indicate the amount of user satisfaction from examining the retrieved documents, are we not actually trying to address the problem the other way round, that is, retrieving documents which satisfy the user rather than evaluating how satisfied the user is with the retrieved documents? Would this not present a constrained result set to users looking for relevant answers to their queries?
3. An interactive judgment process is adaptive, in that the documents suggested by successful systems are favored. This implies a bias in the judgments. Although this bias may be intentional, so as to maximize the number of relevant documents found, it does not remove uncertainty about the effectiveness of the system. Would this not indicate a loss of diversity in the search results? Some highly relevant documents may be left unjudged because of this. How can targeted relevance judgments using comparisons across various retrieval systems address this?
1. When outlining the initial static judgment approach, the author describes using TREC8 results as the data for experimentation and TREC5 results as training data. Whenever an unjudged document is handled, it is treated as though it is not relevant. In class, we have mentioned the bias introduced when unjudged documents are handled in this fashion. To further support this point, we even had multiple readings in which researchers attempted to find the best approach to this problem. Given that this approach is geared towards estimating the number of relevant documents, is this concern not valid? Or could it corrupt the initial view of the system and negatively impact the results?
2. The author addresses the reusability of his approach by noting that a user is required to provide a p value. The author attempts to demonstrate reusability by testing three different p values across all of his approaches. The outcome is that very little difference is observed, and the author concludes that his approach does have a reusability aspect. The author does his reusability check using the “Method C” version of his metric. Does the reusability of the approach decrease significantly with the previous iterations of his metric? Or did the author just want to present the best version, since that would be the one most likely to be used outside of the paper? Also, the author used 10,000 judgments. How does the number of judgments affect reusability?
3. Towards the end of the paper, the author evaluates the confidence level of his estimation techniques. He lays out three different methodologies for deriving confidence levels and explores them all with varying numbers of judgments. The author considers 95% and above a good threshold for confidence, as it seems to be the industry standard. The results imply that, at best, just over 50% of the pairwise comparisons can be found using the technique.
This does not sound like an overwhelmingly successful evaluation method. At the beginning of the paper, the author does note this approach is for determining the best performing systems and not for exact system performance or comparison. Is this due to the fact that there seems to be an upper bound towards how many pairwise comparisons can be correctly estimated? Is there an assumption that is limiting the application or is it inherently not going to be able to make fine grained decisions?
The author provides a compelling argument on the need for reducing the number of relevance judgments made. However, I believe that the following questions need to be answered in order to make the argument more compelling.
1. The authors state that Rank-Biased Precision is better than MAP, as RBP is more reflective of the real-world scenario. But it is apparent that RBP is a lower-bound measure while MAP is an upper-bound measure of system performance. What do you think is a better way to evaluate system performance? Does a lower bound accurately reflect system performance in the case of a better run?
2. I believe that the RBP evaluation might not be as compelling when there are graded relevance judgments instead of binary relevance judgments. Since we have already discussed the advantages of graded relevance, I am interested to know about similar work done with graded relevance. How do you think the evaluation measure will fare with graded relevance?
3. The authors state that cubing the estimated RBP value gave an observable increase in performance on the TREC 5 data. Although it seems like a trial-and-error computation, it would have been even more compelling if the authors had generalized this study to other data collections as well. How might cubing of the predicted RBP fare in other collections?
Several of the downsides to pooling discussed in this paper sound like serious and legitimate concerns. For example, "pooling is insensitive, and assumes that all systems should be scored to the same level of (apparent) fidelity." (376) This strikes me as a mirror of the concern expressed in Guiver, Mizzaro, and Robertson regarding the weighting of topics: why not also weight systems? On the other hand, the authors also write on p. 376: "pooling has several disadvantages. One is its vulnerability to faulty systems – most TREC participants have suffered the embarrassment associated with a buggy system that causes the assessors to evaluate thousands of junk documents." How, and how often, does this happen? Is the root cause of the frequency of this error too difficult to manage, such that a new method is worth pursuing in order to avoid it rather than fixing the mistake?
If one were to operationalize the procedure of Guiver, Mizzaro, and Robertson to actually make predictions on other datasets, OR if one were to further investigate the supposed correlation between the MAP of their sampled topics and the overall MAP, would it be a good decision to implement RBP instead of or in addition to MAP? I kept feeling concerned, or at least uncertain, about the use of MAP in the Guiver et al. piece, and reading about it now, RBP seems to identify and answer some of those concerns or uncertainties. Is it widely used in evaluations today, or has it not caught on?
RBP supposedly provides an advantage over MAP in that it works from "the bottom up", being able to factor in unjudged documents as residuals. It also incorporates rank (obviously) in the form of p^(i-1). However, why should we believe that this is a) the most accurate probability calculation when no empirical or alternative measures are included in the paper, and b) generalizable to all topics and all types of queries? It shouldn't be.
1. The authors mention that documents retrieved by only one submitted run aren't likely to be relevant to their queries, and for the purposes of their experiment they focus on documents that were retrieved by many of the runs. Given our discussion on tail queries, is it wise to minimize the importance of any one document based on how trial runs retrieve it?
2. The authors mention erring on the side of caution and reporting lower bounds when claiming that something is better. That seems to be a pervasive attitude across the IR field as a whole. What does it say that so many researchers seem to downplay various results and aspects of their experiments? Being wrong, or reporting such results, can lead to future research and improvements in the field.
3. Along those same lines, the authors spend more time focusing on the performance of the well-testing systems, but don't the middle and low-testing systems give valuable information to the experiment as well?
The authors say that, "when systems are performing as they were designed to, pooling is insensitive, and assumes that all systems should be scored to the same level of (apparent) fidelity". What do they mean by fidelity? How is this term being applied? I understand that they're making the argument that pooling is not always the best approach, but I wonder how you could compare two systems without assuming the same level of measurement.
When the authors talk about base vs base and base vs top comparisons, what exactly are they talking about? The authors also talk about how the documents are chosen in a static selection, and how they are chosen based on their 'importance in the scoring regime'. Documents that all score highly are considered good runs, but ones that don't score highly, or that vary in their scores, are not considered good runs. In previous readings, we have learned that very few raters judge documents consistently (most coming in around 30% accuracy), so how can we really know what a good run is or isn't? In addition, what about documents that are rarer and more likely to be judged to varying degrees? How can we account for those using this method?
1. The paper looks at ways to reduce the number of judging decisions and uses RBP as a measure of how important a document is. The authors use a p value (the probability that a user will look at the next document) of .8 for the study. What is the motivation behind setting the parameter in this way? How accurately does the RBP equation model actual user search behavior?
2. The article represents yet another work in which absolute performance measures are ditched in favor of methods that only seek to give relative performance information between systems. What are the hazards of relying exclusively on one-time relative system comparisons? Could it inhibit progress?
3. At the end of the paper, the authors look at the discriminative power of different pooling methods (Table 5, pg. 381). Looking at Table 5, the alternative pooling Method C looks pretty powerful (it appears to have the same discriminative power at 5,000 documents that normal pooling has at 20,000 documents). Nevertheless, it is unclear from the table what the absolute difference in performance between the algorithms is (i.e. the projected ranges of RBP performance are different for every algorithm). Is it important simply to know whether systems are better than each other, or to know by what degree they are better?
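As a small worked aside on what the choice of p implies, assuming only the RBP user model in which the i-th document is inspected with probability p^(i-1): the expected number of documents inspected and the share of the total RBP weight falling in the top k ranks both follow from the geometric series. The numbers below are derived from that model, not taken from the paper.

```latex
\[
  \mathbb{E}[\text{documents inspected}]
    = \sum_{i=1}^{\infty} p^{\,i-1} = \frac{1}{1-p},
  \qquad
  \text{weight in top } k
    = \sum_{i=1}^{k} (1-p)\,p^{\,i-1} = 1 - p^{k}.
\]
\[
  p = 0.8:\ \ \frac{1}{1-p} = 5,\quad 1 - p^{10} \approx 0.89;
  \qquad
  p = 0.95:\ \ \frac{1}{1-p} = 20,\quad 1 - p^{10} \approx 0.40.
\]
```

So under this model p = .8 corresponds to a user who reads about five results on average, while p = .95 leaves roughly 60% of the score dependent on documents below rank 10.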
1) The authors of the paper mention a need for modeling user utility and satisfaction. From previous readings, I had the impression that, from a mathematical standpoint, this could be inferred from efficiency and effectiveness measures. How can we determine whether a user is more satisfied with fewer results (but potentially missing relevant ones) or more results (but potentially including many non-relevant ones)?
2) The authors state that one of the issues associated with pooling is its inability to compensate for query-related variability (some queries need more judgments). I am not entirely clear on how RBP accounts for this. Is there a reliable way to determine p for every query, so that we can correctly determine how many judgments are necessary for a reliable RBP score?
3) The authors propose multiple methods for achieving less error in differentiating between the top systems. They build on their initial method, which simply uses the RBP weight, and finally use a cubed midpoint technique to get the least error at the top end of the spectrum. My issue is that they seem to use a single collection as the basis for tweaking their model. Would their results still hold if they ran their tests on a different set of queries/judgments?
1- The ‘p’ measure is the probability that a user will view a document, and it can increase or decrease with the assumed user persistence. The authors used .8 (80% probability) as their p measure in the critical calculations for this paper but did not include a discussion of how they arrived at this number. The authors then demonstrated a clear difference in results when p was increased to .95. I would like to see a serious explanation for the use of .8 in light of their knowledge that this number can change results. No exploration of a p value lower than .8 was presented. Also, I think the authors should discuss the fact that different topics are guaranteed to elicit different p values in the same user, and the same topic will generate different p values across multiple users. I am not arguing for or against the .8 value, but I want to see it justified in terms of real users, especially since the authors claim that RBP is better at accounting for user behavior than other metrics.
2- I noticed that the adaptive method of pooling as described (judging systems overall on their highest ranked documents) would certainly open the door for bias toward systems that produce a higher quality document at a higher ranking. The authors mention that another paper, by Voorhees, also raises that concern. Arguably it is appropriate to bias towards systems that have better output early on; however, the authors did not make that argument. All they said was that MAP produces the same bias. This is an unacceptable argument. Their work should be defensible even when compared to other potentially problematic work. Or a source of potential error should be admitted. I would really like to see the authors address the issue seriously.
3- The authors appear to have developed a very clever new method of ranking that produces at least comparable results to prior methods.
What they do not do is discuss the obstacles to its widespread adoption or clearly outline future work to be done in refining their method. I get the feeling that it is actually a very good idea that has not been embraced immediately. What do the authors feel are its shortcomings? Where do they envision its use? What will they continue to work on?
1. In this article the authors explore two different ways that documents can be selected for judgment in an information retrieval experiment. The second method that they describe is the dynamic selection method. The authors state that this method contains the possibility of assessor bias but that the effect of this bias is not likely to be strong. Do you agree with this assumption about the strength of the assessor bias?
2. In this article the authors state that one of the problems with the MAP metric is that it gives an upper bound on the actual performance of a system, and that their metric, RBP, tends to give a lower bound on system performance. Is it possible to combine these two metrics in some way to get a range of possible system performance, and what use would this combined metric have?
3. In this article the authors explain several alternative methods for ordering judgments. In Method C they use the estimated RBP value, but they cube it to give better performance. However, they never give a good reason for why this works. What is a good reason for using the cube of RBP in this method, other than making the results better?
1. My first question is about Section 2, Effectiveness Measurement - Selecting documents for judgment. This section talks about a disadvantage of pooling: most TREC participants have suffered the embarrassment associated with a buggy system that causes the assessors to evaluate thousands of junk documents. This is quite true in my personal experience. I remember that back in 2009 (or 2010?), the team I used to work with participated in one track of TREC. We were exhausted from labeling the documents, and most of them were irrelevant. I notice that this section also says that pooling is unable to adjust to query-related variability. What does this mean? How should we understand the idea that some queries might require more judgments than others?
2. My second question is about Section 2, Effectiveness Measurement - Problems with Mean Average Precision. It gives an interesting counter-example about using MAP: if a user is actually 100% satisfied with the top-ranked document for a query, then it is unnecessary to look at another 999 irrelevant documents before he/she stops. It also talks about user satisfaction, and then about reciprocal rank. I think there is always a dilemma in choosing either method. For example, there is no universal criterion for deciding on the depth k, especially since another paper says P@10 is a poor measure. RBP naturally faces the same problem. Is there any better way to decide where to stop (such as statistical information from user logs on the average depth at which users stop)?
3. The authors give a step-by-step explanation by examining the disadvantages of existing methods. However, there is one part that I am not quite sure about, since I failed to find detailed information. How does the proposed method solve the existing problems of pooling? Also, we can see that the RBP formula has a probability parameter p, as well as termination with probability 1-p.
What about the documents that are after the termination? What about the documents that are not within the sampling pool? Are they taken as irrelevant?