1. This paper presents an important study on how many topics are necessary for evaluation of search engine, in which the author recruits a list of rigorous statistical tests. The author uses the evaluation results (i.e. MAP) of 50 topics as reference data set, which assumes that MAP or results of other metrics of 50 topics are close enough to representation the nature relations of these systems. Is there any study that justifies this assumption?2. Ranking of systems is the central result evaluated in this paper, and all test statistics (i.e. linear correlation, kendall’s tau) performed on the ranking results assume that the ranking of systems at different positions are equally important. However, in real world we might care about which system perform best, or which systems are among the top 10 best, which suggests that correct ranking of systems at the top of list is more critical. Is it possible or necessary to perform similar study for this purpose?3. My last question is about the heuristic search approach, which uses the optimal subsets of previous steps as start point and allows change of maximally three topics. As it is proposed in the paper that topics might be complementary, and they might also have contradicting effects on each other. A heuristic approach that includes a significant proportion of topics from previous steps might exclude some other topics from the optimal sets, and the bias might grow stronger as the increasing of steps of iterations. So is it better to adopt an alternative random sample procedure as mentioned in the paper?
1. Through this paper - we are making use of a premise wherein we have a uniform distribution of similarity judgements and so the confidence is proportional to the number of similarity judgements which are available. This would imply that all documents of a 'similar' nature are equally likely to be rendered. How can we justify this assumption in practise - especially since, we've seen how similarity does not follow a uniform distribution and is definitely skewed? Wouldn't we require a better approximation of this distribution in order to improve the performance of the system? 2. I'm unclear on the section that deals with the evaluation of the goodness of ranked documents which focuses on optimising this goodness measure and also comparing this goodness measure to another. The basic principle in IR evaluation is to optimize the ranking function. But, the metrics we use like MAP and NDCG for instance are not really related to the IR performance measures. Thus, aren't we limiting ourselves to a sub-optimisation always? Also, how much can be said through a comparison of the goodness measures? 3. The paper does begin with the hypothesis that some topics may be better than others for evaluation. It however does not address the concept of redundancy and overlap in the case of repetition of data. I am curious to know that when discovering these subsets - how is redundancy calibrated? Since, redundancy in fact is a similarity metric - what is the distinction between two documents being similar and one document providing duplicated information to another document? Also, since we have read in papers how a simple metric like P@10 and RR are capable of being representative of the user experience - do we require complex metrics like nDCG and RBP at all?
" At every point the `best' curve is a little better (lower) than, or coincides with, the corresponding `random' curve." But for a best set to be built out of a random set further evaluations are needed. Is the whole effort with it ? I mean it seems beneficial when one might want to evaluate various search engines, but in this era every search engine is different and targeting a different user set. So how can this idea be brought to use by every new search engine ? The analysis done by the author is based of testing which was done with a wider set of topics and then based of its result has shown that testing with a smaller query set too would have returned the same result. But how can a that ideal set of 25 topics be created. I mean what all factors need to be taken to account. For example, how can it be ensured that the query set is diverse enough? How can a tester gain confidence about his testing that whatever testing he has done was not biased in any way ?With more testing, the engineers might come to know about the weaknesses of the algorithm in some specific domain. Isn't that a risk associated with this evaluation ?Also in this type of testing wouldn't using GMAP be more beneficial that MAP? This idea would be best suited for groups working on lowering the costs of their evaluations. But there is no science behind how to create that best set in the first place, I mean big companies like Google won't be the ones who would want to do their testing on a small set of data. They have the funds and infrastructure to test every aspect. It is the small groups who would want to implement this. But they will lack the expertise and experience to determine what might constitute the best set for testing...
In Section 4, the authors choose to use TREC 8 as the main dataset for experiments. I am not sure choosing a subset of topics will work for other search domains. With 50 topics chosen in TREC 8 for the experiments, the authors do not mention how many topics in the TREC 8 and how they select these 50 topics. There might be issues of over-fitting.In Section 4.5, the authors states the concern of over-fitting issues and generalization. To mitigate this issue, they try to address with three ways. One way is to split systems. But in the end, they actually split the system runs. Since even distinct runs might be base on minor changes of a specific system (parameters tuning), it might be possible when splitting the system runs, the over-fitting issue may remain as a group of system runs from the same system are divided. In Section 4.6, the authors tries to resolve computation issues by proposing a recursive heuristic. The heuristic, however, is based on the assumption that the best sets overlap. I am wondering whether the best sets indeed overlap. In Section 5.4, the authors themselves find "the best at c=9 and the best at c=10 are completely disjoint". Does it provide the contradictory evidence to the authors' assumption. If the computation is not accurate, what is the impact to the experimental results?
1. It was experimentally found that the some topic subsets are much better than others at predicting effectiveness on the full set of 50 topics. Would this be a result of the structure of the relevant documents in the test collection and the retrieval algorithm or could it be owed to the reference relevant judgments for those documents in the topics? 2. In our previous week's readings, many papers stated that the results vary when used hard topics (typically those topics for which the relevance judgments were either ambiguous (without any consensus agreement/with very few relevant documents)) over easy ones. How can one measure the effectiveness of the retrieval system with a fewer good topics? What is the optimal number of topics that one should consider? and How to find that optimum value?3. In the Linear Correlation Analysis experiment, it is mentioned that the test was performed with 50 topics on 96 systems on 4 different metrics. Earlier the author stated that 'choice of effectiveness metrics matters accounts for accuracy of prediction'. It is contradictory to the results that were obtained in this analysis as it was observed that the results were the same irrespective of the metrics used. So in which cases and subject to which testbeds do the metrics play a role in different subsets of topics? In this analysis, were all the 4 metrics used to evaluate the same topics or subset of topics?
This article begins with the assumption that a small set of topics can be more useful for testing a system than the evaluation method that developed during the Cranfield experiment. While using a “best set” seems like an effective tool for evaluating a system, does the lack of variety in topics and distance from a real world scenario make this a viable method for testing systems as another aspect as noted in the implications section?The article mentions cost as as a major issue faced by the current methods employed for evaluating systems. Yet in section 6.2, multiple scenarios are presented which appear to involve “significant effort” which begs the question, given the complexity of the idea presented would this be a better alternative cost-wise vs. existing methods?Guiver notes the use of NTCIR topics as being noisier. Even with the noise Guiver points out the generalization that appeared within the runs. But it almost seems like they included the NTCIR as an afterthought since they neglected to give it the same set of runs as the TREC sets and maintained the same amount of topics. Why not treat both sets in the same manner in order to truly see how the “different sets give them different amounts of information?(2)”
1. This paper adopts several metrics in effectiveness evaluation. Compared to the paper of “Strategic System Comparisons via Targeted Relevance Judgments”, are these metric competent with those in the “Strategic ...” paper? Or, is there any relationship between these measurements?2. This paper is based on 2 experimental data sets. To define the ground truth is simple since the data is ready and prepared. Considering the scenario that we just have rough data, e.g. search results from various search engines. How to define the ground truth then? Or, in other words, if the topics or search terms is dynamic, how does the methods in this paper fit into?3. Section 4.6, it discussed heuristic method. Such method cannot guarantee there is no best/worst set missing, as mentioned in paper. Why would no more than 3 topics be replaced? Is there any influence from the data set that the feature of such data set make this heuristic method work? If yes, the generalization of this method is questionable.
Having read this paper, I feel very curiosity of why the MAP values of best subsets are more correlated to the MAP values of whole set. Is it because some topics are more difficult for system to judge? What’s more, why some topics are both in the best subsets and in the worst subsets? In 5.4 Subset Analysis, John tries to summarize this data in Fig.5 with a formula. So, how can we calculate counter min, counter max, and counter actual there? When introducing Voorhees-Buckley Experiments, it’s mentioned that there are 20 bins of width 0.01, going from 0.0 to 0.2, in experiment. Why did Alistair design experiments in this way? And for the lines in Fig.9, what does each represent actually?
1. I had some trouble understanding the high-level thesis of this paper. The authors state the hypothesis on p. 19: "some topics or topic sets are better than others for evaluating systems, in terms of their ability to predict absolute or relative system performance in other data sets." The next paragraph states: "We may summarise our more detailed results by saying that not all topics are equally informative about systems." Perhaps it is more accurately stated as, "not all sets of topics of equal cardinality are equally informative about system performance." Regardless, the second paragraph makes no reference to predicting performance in other data sets. Are MAP and the other metrics being used as proxies for this? The MAP_sigma correlation to the MAP_all correlation only reports the presence of some signal between the two. It does not suggest, however, that MAP_sigma has any specific predictive power over other topics. Is this correct? If so, I would like to further understand the useability of this method, which is not evaluated, as well as examine the internal validity of the claim that MAP correlation (or correlation by any of these other metrics) actually suggests predictive merit.2. The use of Kendall's tau in this paper is also confusing to me. Perhaps this is simply because I am not extremely familiar with its general use. But from what I understand, Kendall's tau is generally used to examine ranking similarities or differences between qrels. Here, it is examining the ranking of systems... but it is doing so in terms of what? How exactly is the tau being employed?3. The heuristic that the authors employ is fascinating! I admire the clever reduction of computational cost. The alternatives are also fascinating. As this heuristic seems to be at the heart of the optimal topic selection procedure, it would be nice to dive further into how it works and some other alternatives. How would a genetic algorithm-based heuristic fare by comparison? In what other ways might we be able to make an algorithm that is even cheaper by more tightly and intelligently controlling the ways in which the sets can mutate?
1. The validation of the heuristic is addressed in section 5.1. But in the data set section, it is mentioned that some runs are removed. Will these runs have impact on the result if they are not removed? Is such removal a bias in data preparation?2. It is mentioned in section 5.5 that another good set may be reached via heuristic. Though any of the good sets works, is there any practical preference in the set selection?3. In Figure 8, the lines of Best/Average T2 on T1 intersect with other line when the topic set size is above 13-16. What does this scenario imply?
1. The authors did a good job of evaluating their hypothesis along several dimensions. For example, they varied metrics, split queries and systems, and showed that in general, the hypothesis that fewer topics are necessary is valid. However, one assumption made throughout was that even 50 topics is enough to demonstrate system utility, and that in reality, fewer than 50 topics are necessary to achieve the same comparisons those 50 topic runs achieved. What if we increase 50 to 150 or even 1000 as older work has proposed as necessary for reliability? It would be interesting to conduct experiments along that dimension as well.2. I found the section on computational complexity on set selection quite insulting, frankly, as a computer scientist. The authors did a great job overall in the paper, but they were sloppy (at best) in that section. I know for a fact that there is a wide body of literature on set selection with many variations of the same theme explored by many renowned theoreticians. Since the authors did explore that issue (at least briefly) they could at least have tried to have cited some of those other papers instead of just proposing a heuristic off the back and justifying it on the basis of an experiment (the incremental greedy heuristic they proposed doesn't sound completely original either; check out this paper http://machine-learning.martinsewell.com/feature-selection/JohnKohaviPfleger1994.pdf). On the positive side, they did mention that they were not trying to come up with selection criteria; nevertheless they did dedicate a section (4.6) to it and cited Knight's algorithm (the only thing cited in that section). Shouldn't the authors have explored the literature a little more and cited some important papers in this area as avenues for future work? 3. I found Fig. 5 to be highly interesting, mainly because it shows that the topics in the best and worst sets of a certain cardinality intersect/overlap more as the cardinality increases. Although this is to be expected, the 'perfection' of it surprised me, that we don't see a complete overlap all the way till cardinality 25 (when there is only one candidate set since they have split the topics). The question then is, how much should we increase the topics so that the difference of scores between best, worst (and equivalently random) selections becomes statistically insignificant (without necessarily choosing the full set of topics)? It seems like that number would be the best bet in practice since the issue of set selection (and complexity) goes away.
1. This paper proposes an interesting idea, but how does it translate toactually reducing the work required to make a test collection? Sure it's coolto see in retrospect how little work could have been done to achieve the sameresults, but you can apply this to any endeavor. I've got a soft spot in myheart for the Corps, so I appreciate the reference in their title, but I wonderif a more appropriate one would have been "Monday Morning Quarterbacking."2. The paper didn't have suggestions on how to choose these more or lessimportant topics, but it seems like since these are TREC collections, thetopics are chosen by relevance judges, so it would be interesting to see ifcertain judges are better at picking topics than others.3. I thought it was interesting in figure 6 that several topics were present inall of the worst cardinalities, while there were no topics present in all ofthe best cardinalities. I would think this is just a result of there being morebad cardinalities than good, but there seems to be a higher frequency of topicsappearing in good cardinalities. That is to say, the bad cardinalities are morepolarized than the good. Why is this? Does it have something to do with thelarger cardinalities tending to be better (since they contain more of theoriginal data set)?
Can we review the difference between a system, a run, several runs, and a topic are? I thought a run was a topic running through a system, and it's results--but in reading this paper I question that? (pg 3-4)In the conclusion, the authors talk about how there are good and bad sets, and how their method could be used by TREC to offer up more 'hub' type subsets, and also allow TREC to offer up topics to exclude--wouldn't this be cherry picking your data sets? The authors admit that the results aren't 'intended for a realistic scenario' but then go on to say how these special subsets could be used to further research in other ways. My question is, is it truly furthering TREC research if the topics which are already and picked, become even more refined into smaller subsets? This seems like overfitting, which is something the authors talk a lot about, but where do we draw the line? In section 4.4.3 the authors talk about error rates in regards to 'some measure is defined for this purpose' and takes the form of curves and a n=50 topics /m=96 runs as the gold standard. Is that the 'some measure' that the authors are referring to? or are they referring to some other measure of precision? Or are they talking about relevance here?
1. Robustness of the optimum set is not tested across metrics – this is brought up in the paper’s initial sections, I eventually expected this to be addressed, but it’s left as is. It makes me wonder if such analysis was pursued and the results we not consistent. This is maybe the first step to understand what are good topics or bad topics.2. The paper does a great job in presenting a thorough analysis to validate claims – however a lot of the interesting aspects are left to future work. An attempt could have been made to present heuristics to assess topic difficulty or variability to validate experimental findings. An understanding of good and bad topic sets over evaluated TREC sets can help surface such heuristics. Especially aspects like what does it mean for topics to be complementary as far as evaluation is concerned. 3. I really liked the analysis presented in the paper – maybe we can filter through queries per topic in a similar way? This may surface notions of which queries are important and need reliable judgments – thus enabling better understands ranking invariance to judgment noise.
1) A thought this paper brought to mind, is it correct to define “bad” topics as those that fail to perform distinctions between systems? Perhaps, the assessor is bad, and the topic he came up with is not one that should be (or frequently would be) searched on that collection of documents, so the fact that some systems perform poorly on it, is an unfair evaluation of that system.2) One interesting conclusion is that different topics provide complementary evidence about the system. However, given the large size of the topic set they used, is not the case that you’d always be able to find some subset that performs ideally for the superset? I’d be interested to see if these results could be replicated with a smaller base topic set.3) After reading the paper, I was a bit disappointed that they were not able to perform any analysis that would determine how to characterize “useful” vs “bad” topics before selection for evaluation, since this is the main question I was hoping would be addressed at the onset. Has there been any work on promoting good topic selection for evaluation?
1- In attempting to deal with the issue of generalization the authors split the runs in half, identified the best subset of topics for one half of the runs and then ran the second half of runs on those topics to see if those topics would best predict the results of the second half of runs. The authors described this as a separate endeavor from the first experiment of just identifying the best subsets of topics as predictors as a whole. But surely this comparison was already done in first part? Wasn’t nearly every subset tested on each run? Was it just a matter of comparing already established data or was this work done twice or am I missing something?2- The authors describe in some detail a validation test of their data in which their sub-sets are measured against randomly generated subsets. Since their subsets preform better than the randomly generated ones they take this a some validation of their work. They call this the Voorhees-Buckley experiments and I completely do not understand them. Please can we generally explain the experiment and discuss its methodology. It was the only part of the paper that went right over my head. 3- I was extremely fascinated by the concept only discussed in the conclusion that there may be certain intrinsic qualities in a topic that make it a good or bad predictor of system performance. How can these topics be identified? It occurred to me that if this work on subsets was combined with the Moffat et al method of targeting judgments it is possible that some serious cross referencing could occur and new ideas of classification on ‘goodness‘ of topics might arise. What would the qualities of these topics be? The authors here mentioned that ‘goodness‘ often has to do with complementary topics in a subset. What work has been done on examining which topics are often found with each other in good subsets?
1. In this experiment the experimenters decided to follow Voorhees and Buckley’s example and eliminate from the data that they were using the lowest 25% of the runs from the TREC 8 data based on their performance. They did this so that they could compare their results to Voorhees and Buckley’s results. Does the elimination of these runs cause any concerns with their results? Is it better to include all of the runs in the data set or to have another set of results to compare with your own results?2. In this article the authors are trying to prove that some topics are better than others at predicting the effectiveness of an information retrieval system. However despite some suggestions the authors never give a complete list of possible factors that could affect the goodness of a topic. What do you think are some possible factors that could cause one topic to be a better predictor of the effectiveness of a system over other topics?3. In this article the authors use several different metrics to examine the effectiveness of topic sets as predictors of system effectiveness. These measures include MAP, RPrec, P@10, GMAP, and NDCG. In the Moffat et al. article we saw another metric being used, RBP or Rank-Biased Precision. Do you think that RBP would have been an effective measure to use in this experiment? Why or why not?
1. This paper mainly talks about reducing the number of topics needed for evaluation. From part 3 we can see that the main method it employs is to compress the metrics. Each item in the metrics is Average Precision (in part 4.3, it also refers other choices). This is mathematically feasible. However, this is only from math’s perspective. What if the topic we actually want is filtered by this method? Is there any way to incorporate some weight information to the topic? 2. In the motivation of this paper, it talks about one major issue before topic reduction is its expense, in particular with the major bottleneck requirement to make relevance assessments. However, there is no such experiments about how many efforts can be saved by this method. It is possible that this method will filter out some trivial topics with very little relevance documents. In that case it is not quite helpful in saving the efforts in relevance assessments.3. My third question is in Section 4.3. Firstly, I want to talk a little bit about RPrec. This is the first time I see this measure for retrieval effectiveness. What is the benefit for us to use it compared with others? Secondly, it talks about P@10 is a poor predictor of other measures in this paper. To take this one step further, how do we define goodness for these methods?