Thursday, September 5, 2013

12-Sep Sanderson. Test collection based evaluation of information retrieval systems, Ch. 4


  1. 1. In the chapter introduction, Sanderson discusses how in later TREC conferences the focus was on finding home pages—the “topic distillation” task. How does the difference between home and sub-pages play into systematic testing? Are different sections of a website each considered a separate document? In what ways is testing for sub-pages similar and different from testing for “nuggets” (discussed on p. 301-302)?
    2. Why are the traditional methods of measuring relevance insufficient for blogs and question answering? What do you need to take into account for measuring relevance? Do you need to consider the user’s personal opinions, and if so, can you create a test collection to measure this?
    3. Sanderson mentions several measures that address concerns about the unjudged documents left by the pooling method (p. 298-301). Which measure do you think best addresses the problem? Or do you think that the pooling method is still the best approach?

  2. 1. In Section 4.1.2, this paper talks about diversity. Based on this idea, there were a novelty track and a QA track, both of which encouraged systems to retrieve fragments of documents that were both relevant and had not previously been seen. This is a promising way to improve the diversity of search results. In Section 4.2.4, this paper also talks about how to add diversity as a factor in document ranking. Great ideas are introduced for covering all topics, or for ranking the documents based on the frequency of topics in the corpus. But the emphasis of people's attention for a certain topic changes over time. How can we dynamically decide what the current hottest topic for certain keywords is and add it into the ranking process?

    2. My second question is about Section 4.2.3, Relevance Applied to Parts of Documents. This section talks about passage-based retrieval. However, "passage" sounds a little vague to me. Sometimes one sentence serves as a passage; sometimes one passage may contain thousands of words. How do we heuristically decide on the length (or how much information) of the parts that we are going to rank?

    3. My third question is about unjudged documents in Section 4.2.2. Actually, I am not quite sure about this part. Unjudged documents mean that we are not able to tag the documents as relevant or irrelevant during pre-processing. Since these documents are not handled beforehand, any method applied afterwards seems like a guess based on statistical proportions. Can we do better in this part using some pre-processing with less manual effort?

  3. 1. It is mentioned that multimodal tasks such as the video track TRECVid and the image searching track ImageCLEF were started and explored following the development of the TREC web track. One thing I'm interested in: since web pages are usually composed of different types of data (i.e., text, images, video), is there any study on how search methods from different tracks can work collaboratively to improve search performance on web pages?

    2. A probability p is proposed as the probability that a user progresses from one document to the next. How is the value of p determined in a real study? Also, a geometric discount function is used in the Rank-Biased Precision (RBP) measure. What is the justification for using this function? Similarly, I'm a little confused by the theory of infAP. What is the meaning of using (k-1)/k in the equation?

    3. The last very interesting topic is the equality of topics. Intuitively, there are differences in the contributions of topics when evaluating a search engine. However, what is the justification for using the geometric mean of average precision (GMAP) or the arithmetic mean of the log values of AP (AL)? What are the differences, advantages, and disadvantages in choosing these methods? Will the variability and hardness of topics affect the selection of the evaluation metrics?
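
One difference worth noting: since AL is the arithmetic mean of log AP and GMAP is the geometric mean of AP, the two are linked by the identity GMAP = exp(AL), so they always rank systems identically; the real contrast is with the plain arithmetic mean (MAP). A minimal sketch, using hypothetical per-topic AP scores and a small epsilon to guard against zero AP (a common workaround, not Sanderson's notation):

```python
import math

def gmap(ap_scores, eps=1e-5):
    """Geometric mean of AP scores; eps guards against log(0)."""
    logs = [math.log(ap + eps) for ap in ap_scores]
    return math.exp(sum(logs) / len(logs))

def al(ap_scores, eps=1e-5):
    """Arithmetic mean of log AP scores."""
    return sum(math.log(ap + eps) for ap in ap_scores) / len(ap_scores)

# Hypothetical per-topic AP scores for one system
aps = [0.45, 0.02, 0.60, 0.10]

# GMAP is exactly exp(AL), so the two rank systems identically; both
# differ from the arithmetic mean (MAP) by weighting poorly performing
# topics much more heavily.
print(gmap(aps))            # geometric mean
print(math.exp(al(aps)))    # identical value
print(sum(aps) / len(aps))  # MAP, for contrast
```

Note how the one very hard topic (AP = 0.02) drags the geometric mean well below the arithmetic mean.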

  4. 1. How are the boundaries of a passage defined when evaluating search systems for their ability to “locate the best point in a document structure for a user to start reading” (pg. 292)? Is a passage a paragraph? If there are two consecutive paragraphs, is that considered one or two passages? If a system identifies two (or more) passages in a document, how is the material in between classified?

    2. Is it safe to assume that “unjudged documents are not relevant” (pg. 298)? Do you think conducting more than one round of pooling and testing can help increase accuracy of the test collection, and make it a more representative sample?

    3. Don’t you think that while testing a collection, it is important to take into account the number of unjudged documents? In my opinion that number reflects not only the accuracy of testing, but also the precision of the IR system as a whole.

  5. 1. In 4.2.1, Sanderson explains Rank-Biased Precision, where p is a probability reflecting user behavior (persistent/impatient). Is the RBP value a lower bound or an upper bound on the relevance of documents? And at p=0, the RBP value does not depend on p (and the user's persistence). What happens at this point?
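
For reference, RBP is defined as (1 - p) multiplied by the sum of r_i * p^(i-1) over ranks i. A minimal sketch with a hypothetical run shows the degenerate case asked about: at p = 0 only the first term of the geometric series survives, so RBP reduces to the relevance of the top-ranked document alone:

```python
def rbp(relevances, p):
    """Rank-Biased Precision: (1 - p) * sum_i r_i * p^(i-1), where r_i
    is the (binary or graded) relevance at rank i and p is the
    probability that the user moves on to the next document."""
    return (1 - p) * sum(r * p ** (i - 1)
                         for i, r in enumerate(relevances, start=1))

run = [1, 0, 1, 1, 0]  # hypothetical binary relevance down the ranking

# A persistent user (large p) gives weight deep into the ranking; at
# p = 0 the geometric series collapses to its first term, so RBP
# equals the relevance of the top document alone.
print(rbp(run, 0.95))  # patient user
print(rbp(run, 0.5))   # moderate user
print(rbp(run, 0.0))   # degenerate case: equals run[0]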

    2. On page 291, the author introduces the "topic distillation" task - finding a series of home pages relevant to a particular topic. Is it efficient to have a "homepage" as the fundamental unit of retrieval in this case? Would it not be more accurate if sub-pages were kept as the fundamental unit? The homepage might not be relevant to the query, but a sub-page linked to the homepage might be.

    3. Sanderson mentions the inferred AP measure and suggests that infAP might be better than BPref. Upon evaluating the formulas for infAP, AP, and BPref, is it right to assume that infAP might not be very different from AP when all the documents have been judged? What happens to the infAP value when there are very few unjudged documents?

  6. 1. The authors mention a study that reveals how algorithms which work well for retrieving highly relevant documents are not the same as the algorithms which are able to retrieve all relevant documents. When the TREC test format was developed, it was inspired by a library catalogue system, and several of the authors we have read have mentioned that it focuses on how many relevant documents are retrieved and does not focus on ranking. However, a new trend is breaking away from the initial TREC viewpoint and is focused on degrees of relevance. In the next paragraph, the author mentions that the actions of industry companies helped motivate the switch. Most research papers I have read could not be immediately transferred to an industry setting, and most are not inspired by what companies in the industry are doing. In this case, it seems like the research community used the corporations' activities as a justification for breaking away from the binary relevance philosophy. Although this reasoning is not listed explicitly for the other post ad hoc measures listed in the chapter, is it safe to assume these measures are motivated by the change in perception in the field due to the huge impact of the internet and the focus on web search applications for information retrieval?

    2. In our last class discussion, we talked about several issues with pooling. In particular, we mentioned the issue of unjudged documents and how they can impact the results negatively by being a source of bias. The author goes on to focus on a couple of different methods for trying to address unjudged documents. A few of the measures seem relatively simple. One idea was to look at the number of unjudged documents in a run, noting that if the number was different between two results then they can't be compared. This solution seems to have some inherent limitations, since not being able to compare all the results impacts the information one can deduce from a study. Other solutions, such as BPref, take a more complex mathematical approach. BPref changes the type of evaluation performed because it produces preference judgments. In a previous paper, an author listed some of the drawbacks of preference judgments in place of relevance judgments. In the end, there does not seem to be an end-all solution to the issue of unjudged documents despite a wide range of approaches to it. Is further research warranted, or should attention be focused on finding a different alternative to pooling?

    3. An interesting idea the author focuses on at the end of the chapter is: are all topics equal? The author mentioned GMAP as a measure that tries to break away from the historically used arithmetic mean. GMAP uses the geometric mean to report a summary evaluation value across multiple topics. Due to the fact that GMAP does not hold all topics equal, the author points out it can be leveraged to see how current results compare to the historical past given an algorithm and topic. Currently, GMAP is not widely used and has not been proven to average results better than the commonly used method. At the same time, GMAP has not been proven to perform worse. Therefore, why do people not use GMAP and gain the added benefit of the historical trend? Is there a commonplace measure that would be broken if GMAP were used, and thus extra work would need to be done anyway?

  7. 1) In the description of the blog track, Sanderson mentions that the main goal is to locate feeds about a particular topic. In addition, detection of opinion was added to the task. However, this is the only track where it is explicitly mentioned that part of the goal is to derive information about the results. Is there a reason for choosing blogs rather than other forms of documents?

    2) During the diversity discussion, Sanderson points out that Verhoeff et al., Fairthorne, and Goffman argue that relevance can be influenced by documents already retrieved; however, test collection topics continue to rely on independent relevance judgments. Yet, as a user, you already have some background information on the topic and are either seeking confirmation or additional information, the latter being the closer case. Is it possible to consider the document dependency as a sequence of independent runs (making the test collections acceptable)?

    3) When discussing ranking, Sanderson defines diversity and novelty as “coverage of different aspects of relevance in ranking” and “prevent repetition of the same relevant content” respectively. What is the difference between them?

  8. 1. In the section on binary and non-binary relevance, Sanderson describes that researchers had “a realization that degrees of relevance were commonly being used in the internal test collections of web search companies”(p. 293). Is it common that corporations are able to advance their technology at a more accelerated rate than academic researchers? If so, can this be attributed to funding, the stronger pressure of competition in the corporate world, or another factor?

    2. Sanderson seems to take issue with the TREC experiments in a similar manner to the authors of “Improvements that Don't Add Up,” as he states during his recap of TREC HARD: “Many of the measures were task specific and no single measure emerged that is used more than others or used beyond the evaluation exercises that created it”(p. 301). Are the methodologies surrounding IR evaluations, especially surrounding TREC, fairly widely criticized? The authors of “Improvements” seemed to think that they were the only ones addressing this issue.

    3. Sanderson, in the section entitled “Are All Topics Equal?” describes a view of certain researchers that a person only “'could do reliable system evaluation on a much smaller set of topics'”(p. 306) than even those controlled number included in TREC. Doesn't this create an issue directly in line with one Sanderson identifies earlier in this chapter, that “test collection topics continued to follow the tradition of being detailed unambiguous statements of an information need for which one view of relevance was defined”(p. 294)? If all of these elements of an experiment are so highly controlled, the conclusions will not hold up among unpredictable, multifaceted user searches. It is also a similar problem to the use of simulated users as described by Kelly in the section on “Wizard of Oz Studies” in “Evaluating IR Systems.”

  9. After reading this chapter, I find that most test collections and measures are for text searching. So, would the measures introduced in this chapter still be suitable if the search topics were media? Are there specific test collections for evaluating media retrieval?

    The author introduces passage retrieval in this chapter. I find it hard to understand why using passage retrieval can improve document retrieval, and why existing document test collections could be used unaltered.

    In discussing the management of unjudged documents, it is pointed out that some scholars, such as Buckley, Voorhees, Yilmaz, and Aslam, are all attempting to create measures that mimic MAP as closely as possible. Why is MAP selected as the standard? Does MAP have particular advantages for testing retrieval systems?

  10. Page 51 of Donna Harman's Information Retrieval Evaluation indicates that DCG is widely used in industry. But the underlying assumption in DCG, that the probability a user will click on a higher-ranked document is higher, seems incorrect. The probability that a user finds the returned information useful depends on many factors other than its rank. Additionally, the data returned for many queries nowadays is not just a ranked list of documents but a combination of documents, images, etc. So it might just be the placement of the information that affects the user's decision to click on a link. How can these features be included in the evaluation measure? Is RBP not a better measure of evaluation than DCG?

    The statement made on page 293 “retrieval techniques that worked well for retrieving highly relevant documents were different from the methods that worked well at retrieving all known relevant documents “ implies that based on document relevance, algorithms can be categorized into two groups: a) Algorithms which return highly relevant data b) Algorithm that returns all known relevant data.
    a. But how can a search engine determine from a query which type of result the user is looking for? Maybe longer query lengths imply that the user is looking for specific information, meaning he wants the result set to be highly relevant.
    b. Evaluation of an algorithm focused on retrieving highly relevant data should make use of graded relevance as opposed to binary relevance. But what other evaluation parameters will be different for such an algorithm evaluation?
    c. How can two algorithms that are so diverse, where (a) returns the most relevant data and (b) returns all relevant data, be compared against one another?

  11. 1. Discounted Cumulative Gain (DCG), Normalized DCG (nDCG), and Rank-Biased Precision (RBP) have been proposed as ways of measuring the effectiveness of IR systems with graded relevance measures. With DCG and nDCG, document-rank importance is modeled as a logarithmic function of rank, while RBP models document-rank importance based on user search behavior. While both methods have advantages and disadvantages, the idea of tailoring document-rank importance to user behavior is intriguing. How much do individual users vary in their search behavior given identical queries and results? Could a more general form of RBP (one that modeled average user search behavior) be faithfully used?

    2. There is an assumption with most IR performance measures that the documents that are not available in an initial “pooling” are not relevant (pg 300-301). While there is some disagreement as to whether this is appropriate (pg. 301), a reading of Harman suggests that a performance measure that does not make this assumption has not been offered. Might it be possible to create an effectiveness measure that doesn’t assume unjudged documents are irrelevant?

    3. One topic of concern in the Sanderson chapter is how to handle the effect of topic on system performance. In particular, there seems to be disagreement on how much systems should be penalized for especially poor performance with some topics. Do users tend to prefer IR systems with mediocre MAP but consistent AP or IR systems with mediocre MAP and a mixed AP?

  12. What are the pros and cons of binary relevance? What are the pros and cons of degrees of relevance? In which scenarios should binary relevance be used, and in which should degrees of relevance be used?

    Passage retrieval was part of the tasks in the TREC HARD track and was also included in the INEX evaluations of XML retrieval. It should be quite useful for returning the relevant parts of a big document (like a book) instead of returning the entire document, but how do we define a passage (by line, by page, or by chapter, which might also relate to Natural Language Processing)?

    As mentioned in the article, it was long assumed that the topics of a test collection contribute equally to measuring the effectiveness of a search engine. The assumption has validity issues: what steps have been conducted to ensure there is no duplicate content among the topics, and that there really are no differences among the topics (for instance, in level of difficulty and level of importance, which also relate to the specific queries raised)?

  13. 1. While managing unjudged documents, the idea proposed by Yilmaz and Aslam of splitting the unjudged documents into two sets is appreciable, but isn't its feasibility questionable? Since the pool of unjudged documents is vast and innumerable, what could have been the criteria to select documents from this vast list?

    2. I agree with the idea that redundant topics should be eliminated from the test collection, but Mizzaro and Robertson's statement that “one could do reliable system evaluation on a much smaller set of topics” talks only about the reliability of the evaluation; aren't relevance and performance more important for an IR system? A reliable, small set of topics vs. relevance, performance, and a large, diverse set of topics - which one is better?

    3. The re-discovery of evaluation ideas and practices from the past and the extensive use of previous years' test collections and query logs help in unearthing new topics, but using them to judge current IR systems, in my opinion, may not be that effective. Isn't topic relevance time-dependent? The logs and relevance judgments of the past may not be useful now, given the pace of change in the growing fields. So how can the logs, test collections, and source queries of the past be useful?

  14. 1. On page 302, P(r) and R(r) are discussed. These measures seem to involve document length, e.g. the word counts in a segment. When we want to compare two results with these measures, will document length impact the final result/judgment? In other words, are these measures sensitive to document length?

    2. When discussing GMAP, MAP, and AP, Robertson's words are cited. But they do not answer my question of when to use each. Cooper's suggestion of using the geometric mean and weighted average instead of the arithmetic mean did not provide any further hint, either. Are there criteria to determine which measure should be adopted in a given scenario?

    3. When this chapter discusses diversity, it mentions that Liu described building a web test collection of ambiguous queries. How is the level of ambiguity defined in such work? If the queries are ambiguous, this may result in a non-comparability issue. How is that handled?

  15. 1. An important point is mentioned (but not elaborated upon) in pg 297 when the author mentions how IDCG can give 1.0 in two cases where the quality of information is clearly poorer (in one of the cases) in that it received lower relevance judgments. This was cited as a counterintuitive result which it is, if we consider that one of the queries does not have very relevant answers to begin with. This directly brings up a measurement issue: do we evaluate an IR algorithm on how well it does based on the answers that are available (regardless of the quality of the answers) or on an absolute scale? When doing comparisons between two different systems, both scores would be expected to correlate; however, for a standalone system, using graded relevance judgments, we could have two different evaluations.
    2. On page 302, isn't the precision formula flawed? It seems like the ranks are being clustered together instead of each one being treated individually/independently. We're summing the relevant 'segment length' for all ranks up to r, then summing the total segment lengths, and then dividing these totals. What if we summed the individual precisions (up to rank r) and then averaged the results (with or without a discounting scheme)? That just seems more intuitively correct.
    3. I don't quite understand how k-call(n) on pg 303 can be used to determine diversity. Its definition only counts whether at least one relevant document was retrieved up to rank n. How do we determine if the results are diverse enough from this measure?
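
The contrast raised in question 2 above can be made concrete. A small sketch (with hypothetical passage lengths, not the chapter's notation) compares the pooled, length-based precision described on p. 302 with the proposed per-rank averaging:

```python
def length_based_precision(segments, r):
    """Precision at rank r as described above: total relevant character
    length over total retrieved length, pooled across ranks.
    `segments` is a list of (relevant_length, total_length) pairs."""
    rel = sum(s[0] for s in segments[:r])
    tot = sum(s[1] for s in segments[:r])
    return rel / tot

def averaged_precision(segments, r):
    """The alternative proposed above: average the per-rank ratios."""
    return sum(s[0] / s[1] for s in segments[:r]) / r

# Hypothetical passages: (relevant chars, total chars) per rank
segs = [(100, 100), (10, 1000), (50, 100)]

# Pooling lets one very long, mostly irrelevant passage dominate,
# while averaging treats each rank equally.
print(length_based_precision(segs, 3))
print(averaged_precision(segs, 3))
```

Here the long, mostly irrelevant passage at rank 2 drags the pooled precision far below the averaged version, which is exactly the sensitivity the question is probing.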

  16. 1. Diversity is mentioned in section 4.1.2. What is the focus and value of diversity research? What can we expect from such an area? How can such work contribute to or impact IR?
    2. In section 4.2.2, it is said that the pooling technique could not reflect updates to the collections. How does BPref successfully handle this problem?
    3. It is mentioned that Yilmaz and Aslam split the unjudged documents into two sets based on whether the documents would or would not have contributed to the test collection's pool. (p.299) It is unclear how they defined the contribution and how they knew that a given document would or would not contribute to the pool.

  17. In section 4.1.1, Sanderson mentions the use of multiple levels of relevance. This seems like a better alternative to binary or even ternary judgements. In creating these levels do researchers employ the same averaging type of methodology when utilizing the scales from Perfect/Good on down to bad?

    In dealing with relevance of document passages, do relevant passages only come from whole documents deemed as relevant or from a pool? Or does determining the relevance of passages encompass an entire collection which would look past the set of documents not included in a pool or deemed as irrelevant in previous judgments?

    In section 4.2.4, Sanderson presents an interesting idea in that users approach each document independently. Much like the different classes of measures, judging multiple documents creates a bias of sorts moving forward in judgments. Does this mean that evaluations done by judges after the initial judgment has been scaled or chosen become increasingly skewed in favor of the forming idea of relevance?

  18. 1. Sanderson mentions the inclusion of home pages in several tests. How would the home page be judged in a test collection? Does each relevant page from the home page get included in the test collection as well or is the home page itself the only thing judged?

    2. Passage relevance was discussed by Sanderson in section 4.2.3. If a section of a page is deemed relevant to the task, would it not have already been included under previous methods of relevance judging?

    3. Sanderson lists a number of different measurements and formulas that have been developed by researchers post ad hoc collections. Some of these have replaced earlier measures that were proven problematic, but others seem to be different ways of quantifying the same thing. Would it not be beneficial to researchers to unify the measures they use during research so as to be able to compare different systems more easily?

  19. The Burges version of nDCG, which emphasizes the high ranking of the most relevant documents, could prove counterintuitive in some cases. Consider multi-level relevance of 0 to 3 and a list of 9 documents. Say Ranking(a) has the relevance numbers 1, 2, 2, 1, 1, 1, 1, 1, 1 for the documents retrieved. Ranking(b) brings up the following relevance numbers for its top 9 documents: 0, 0, 0, 0, 0, 0, 3, 3, 3. DCG(n) for the first ranking is 6.49, and in spite of all the non-relevant documents in the initial positions, the second ranking has a DCG(n) value of 6.64! Therefore, for graded relevance ranking, a hybrid of DCG(n) [Burges version] and nDCG would perhaps be a better choice than either of the individual measures.
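
A short sketch reproduces the reversal described above, assuming the Burges-style gain (2^rel - 1) / log2(1 + rank); the exact values depend on the precise discount used, so they may differ slightly from the 6.49 and 6.64 quoted, but the ordering holds:

```python
import math

def burges_dcg(relevances):
    """Burges-style DCG: sum of (2^rel - 1) / log2(1 + rank)."""
    return sum((2 ** rel - 1) / math.log2(1 + i)
               for i, rel in enumerate(relevances, start=1))

a = [1, 2, 2, 1, 1, 1, 1, 1, 1]
b = [0, 0, 0, 0, 0, 0, 3, 3, 3]

# The exponential gain (2^3 - 1 = 7) lets three highly relevant
# documents buried at ranks 7-9 outscore a ranking with relevant
# material throughout, matching the counterintuitive result above.
print(burges_dcg(a))
print(burges_dcg(b))
print(burges_dcg(b) > burges_dcg(a))  # True
```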

    In 4.2.4, aspects or subtopics for diversity are introduced. It is not mentioned how subtopics are defined (nominal or operational definitions). We know that the subtopics constituting a topic share similarities. What level of similarity do they need to share to become subtopics? How do we deal with the modularity of domains or topics?

    It is unclear what Buttcher's RankEff looks at for textual similarity. It is stated that RankEff infers the relevance of an unjudged document based on its textual similarity to judged documents. A document is marked relevant because a few or all of its parts are considered to be in context. An unjudged document which has the same text as a judged document except for its relevant parts will certainly have a high similarity value. This does not mean the unjudged document should be marked relevant, nor that it should be marked not relevant. How does RankEff, or any other method that labels unjudged documents, work?

  20. 1. If we intend to retrieve information from XML documents, isn't there always the chance that we will lose information and therefore have lower retrieval performance? I understand how indexing the tags becomes crucial at this stage. However, since several XML documents will incorporate different user-defined document structures from disparate sources, wouldn't this be overwhelming? Also, the extension to retrieval from multilingual XML documents makes me curious: we would have to work with tag names that could be in different languages, so how will we be able to implement an approximate translation in such cases? Finally, what constitutes the retrieval unit in these XML IR systems?

    2. In this paper, under the section 'Managing Unjudged Documents', the author states that the reason for including a preference-based measure like BPref was to have a better estimate of the effectiveness of a system with a large number of unjudged documents. How would this measure work if our pool of relevant documents is rather small? Also, since the denominator of this measure is min(N, R), where N is the number of documents judged not relevant and R is the number of documents judged relevant for a particular task, we cannot use this measure until at least one relevant and one non-relevant document have been retrieved. Isn't this limiting the measure to the absolute counts of relevant and non-relevant documents? Wouldn't this therefore result in an uneven measure across topics, due to its basis on absolute counts?
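
For concreteness, here is a minimal sketch of one common formulation of BPref (the exact TREC definition has minor variants); both the min(N, R) denominator and the skipping of unjudged documents raised above are visible, as is the requirement of at least one judged relevant and one judged non-relevant document:

```python
def bpref(run_judgments, R, N):
    """One common formulation of BPref. `run_judgments` is the ranked
    run with 1 = judged relevant, 0 = judged non-relevant, and
    None = unjudged (unjudged documents are simply skipped, which is
    the point of BPref). R and N are the total counts of judged
    relevant / non-relevant documents; both must be nonzero."""
    denom = min(R, N)
    score, nonrel_above = 0.0, 0
    for j in run_judgments:
        if j is None:  # unjudged: contributes nothing either way
            continue
        if j == 1:
            score += 1 - min(nonrel_above, denom) / denom
        else:
            nonrel_above += 1
    return score / R

# Hypothetical run: judged documents interleaved with unjudged ones
run = [1, None, 0, 1, None, 0, 1]
print(bpref(run, R=3, N=2))
```

Each judged relevant document is rewarded for how few judged non-relevant documents rank above it; the unjudged documents at ranks 2 and 5 simply drop out of the computation.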

    3. I understand that GMAP works towards emphasizing the different parts that contribute to the effectiveness distribution of a document over a 'poorly performing topic'. But doesn't the fact that GMAP is extremely sensitive, uses an external collection, and tends to enlarge the emphasis, especially with outliers, reduce the stability of this measure? Does the fact that GMAP's stability is susceptible to change when making use of a combined data collection affect its robustness in any way? Also, is there any difference in the data pre-processing when making use of GMAP? And what are the specific criteria which will help determine when GMAP should be preferred over other methods?

  21. Which is a better measure, Cumulative Gain (CG) or Discounted Cumulative Gain (DCG)? It seems like a document shouldn't lose relevance solely based on its ranking, but it also seems unfair to only judge the top (n) documents in a retrieval; this all confuses me.
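
A tiny sketch with hypothetical graded judgments may help untangle the confusion: CG is blind to ordering, while DCG discounts each document's contribution (not its relevance as such) by its rank:

```python
import math

def cg(rels):
    """Cumulative gain: rank order plays no role."""
    return sum(rels)

def dcg(rels):
    """Discounted cumulative gain with the common log2(1 + rank) discount."""
    return sum(r / math.log2(1 + i) for i, r in enumerate(rels, start=1))

good_order = [3, 2, 1, 0]
bad_order = [0, 1, 2, 3]

# CG cannot tell the two rankings apart; DCG rewards putting the most
# relevant documents first, which is the whole point of the discount.
print(cg(good_order), cg(bad_order))     # equal
print(dcg(good_order) > dcg(bad_order))  # True
```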

    On page 304, the author says that 'they found sources of information to estimate a user's probable intended meaning when entering ambiguous topics'; this entire sentence and idea seems ambitious. What exactly are these sources of information? Are they based on log history? Is this what the author is referring to later when he speaks of run scores? (pg 305)

    If subtopics are standardized and put into runs, what happens when new subtopics emerge? Are these subtopics instantly at a disadvantage because they are so new, or does that work in their favor since they may be in the news/heavily searched? Is this something that can only be worked out through time?

  22. Does graded relevance help better evaluate the relative performance of search systems? It is clear from the discussed metrics that it can help evaluate individual systems thoroughly. The concern is that graded relevance can be viewed as noise on agreement between judges over binary relevance -- which is seen to marginally affect system-ranking measures.

    The author discusses metrics that measure diversity and novelty; an extension to this might be to weight subtopics/nuggets to better measure system performance. The concern is that the balance between novelty and diversity is not addressed. This line of thought follows from the section that discusses treating topics equally.

    The discussion regarding the stability of indAP, infAP, and BPref over other metrics is not clear. What does stability mean in this context, and what is the nature of the synthetic experiments that validate the claim?

  23. 1) The author discusses diversity as a new area lacking coverage in terms of test collections. Specifically, since “users’ definitions of relevance” are diverse, there is a need for collections which can address multiple notions of relevance. Since the number of “definitions” of “relevance” is potentially infinite, how do we determine where to stop, in terms of adding additional aspects of relevance to a specific topic?

    2) The author briefly mentions normalized discounted cumulative gain (nDCG) and Rank-Biased Precision (RBP) as methods for quantifying relevance measures beyond a simple binary scale. nDCG is characterized by adding the relevance values of each result while discounting those at lower ranks. RBP is characterized by summing relevance values and using a probability for discounting rather than a log scale. RBP seems to have the added benefit of seamlessly incorporating objective probabilistic data on a per-user basis. In what circumstances might nDCG be superior to RBP?
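
One circumstance worth noting is per-topic normalization. A minimal sketch with hypothetical judgments shows that nDCG scores any ideally ordered run as 1.0 regardless of the topic's absolute quality, which makes scores comparable across topics in a way that RBP's persistence-based discount does not attempt (though, as Sanderson notes, this can also be counterintuitive):

```python
import math

def dcg(rels):
    """DCG with the common log2(1 + rank) discount."""
    return sum(r / math.log2(1 + i) for i, r in enumerate(rels, start=1))

def ndcg(rels):
    """DCG normalized by the ideal DCG for the same set of judgments."""
    return dcg(rels) / dcg(sorted(rels, reverse=True))

rich_topic = [3, 3, 2]  # topic with highly relevant answers
poor_topic = [1, 0, 0]  # topic with only one marginal answer

# Both runs are in ideal order, so both score a perfect 1.0 even
# though their absolute quality differs; that is the normalization
# trade-off nDCG makes.
print(ndcg(rich_topic))  # 1.0
print(ndcg(poor_topic))  # 1.0
```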

    3) During the discussion on unjudged documents, the author mentions that a large concern is that the size of pools relative to the size of collections has been reduced over time. Further, some test collections were created from sets of documents that were later updated. The author goes on to discuss different measures, such as BPref, that are better suited for collections with large numbers of unjudged documents. Would time not be better spent finding ways to reduce the disparity between the numbers of judged and unjudged documents? I realize that BPref attacks the problem from another angle, but it seems that these collections are inherently limited in the reliability of their results if they are restricted to such an outdated set of data.

  24. 1- Compare the cumulative gain, discounted cumulative gain, normalized discounted cumulative gain, BPref, and infAP measurements. What are the relative merits of each? Under what circumstances is it best to use each one?

    2- Adjustments were made in metrics to account for graded relevance judgements. However, that still assumed that documents could be assigned a single measure of relevance, and that that measure was not relative to different users or to the other documents that were retrieved. What work is currently being done to account for the reality of relative judgments?

    3- Regarding the question answering track: many important questions have complicated answers that cannot be declared right or wrong. It is easy to judge whether a system answered the question “How many calories are in a Big Mac?” correctly. But what about “Are Big Macs healthy?” Are search engines obligated to offer a variety of opinions? Should the opinions of a nutritionist, the burger company, and a random person off the street be presented equally, with no distinction made between them? Since a powerful search engine (such as Google) might reasonably be assumed to be a person’s primary source of information, to what extent is the system ethically responsible for delivering the truth?

  25. In this article the author discusses the idea of relevance judgements that use more than two levels of relevance. One piece of evidence he uses to support moving away from the binary relevance model is that companies like Microsoft, Yahoo!, and Google have shown that they use non-binary methods of relevance assessment. Are the methods these companies use the best way to measure relevance? Is the fact that they are doing this a good reason to do it?

    This article discusses several of the tracks at TREC whose methods of creating test collections departed from the traditional ad hoc method used for the TREC collections, including the Question Answering track and the Blog track. However, in the Harman article the author states that these tracks were problematic: they were either limited in the scope of the collection or had unresolved issues. Which of these two analyses is correct? Does Harman's assertion of the problems with these tracks invalidate Sanderson's point about non-ad hoc collections?

    The author of this article discusses several methods of evaluating retrieval effectiveness with a test collection, including normalized discounted cumulative gain (nDCG), BPref, and inferred average precision (infAP). However, each of these measures focuses on a different part of the test collection and tells us different things. What is each of these methods better at measuring? Is any of these measures the best overall for measuring effectiveness with a test collection?

  26. In section 4.2.2, the author describes several methods for managing unjudged documents, but he doesn't compare the results of the algorithms. I'm quite curious about the results. Is there an absolute best solution for managing unjudged documents, or does it just depend on the test collection?

    After finishing this chapter, I'm still confused about why some link-based methods lead to little or no value. Couldn't we simply select the top k returned results as the retrieved documents each time?

    Also, in section 4.2.1, the author describes two methods for measuring retrieval relevance, DCG and RBP, but he doesn't compare the effectiveness of the two algorithms. Does a better solution exist for most cases, or does it just depend?

  27. 1. It seems that the creation of BPref, as a response to a growing proportion of unjudged documents, misses the question of whether or not those unjudged documents are relevant. Should researchers not be incorporating some type of parameter or function based on uncertainty to estimate the probability that there are additional relevant but unjudged documents?
    2. The infAP measure appears to improve upon BPref based on the parameters and variables used (this is a subjective and biased gripe, but it looks slightly more Bayesian to me). Yet it also appears incomplete in the data it brings to the probability judgment about the relevance of unjudged documents. Has there been any attempt to include additional features, such as content-based features (perhaps a rapid, sample-based bag-of-words/cosine-similarity comparison against the judged documents), or to test mutually exclusive samples of unjudged documents to develop a general value? It would be intriguing to learn more about efforts in this area.
    3. Does not the idea of different topic weights, or ranking TREC topics by the consistency of their relevance judgments, also depend on the purpose of the retrieval task? For example, searching for information about a developing news story versus searching for a particular datum from that story?
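To make the BPref concern in question 1 above concrete, here is a minimal sketch of how BPref sidesteps unjudged documents: it is computed only over judged documents, so an unjudged document is simply skipped rather than assumed nonrelevant (as average precision assumes). The document ids and judgments below are invented for illustration:

```python
def bpref(ranking, qrels):
    """BPref sketch: score each judged-relevant document by how few
    judged-nonrelevant documents outrank it; unjudged docs are ignored.

    ranking: ranked list of doc ids returned by a system
    qrels:   dict doc_id -> True (relevant) / False (nonrelevant);
             docs absent from qrels are unjudged.
    """
    R = sum(1 for rel in qrels.values() if rel)      # judged relevant
    N = sum(1 for rel in qrels.values() if not rel)  # judged nonrelevant
    if R == 0:
        return 0.0
    denom = min(R, N) if min(R, N) > 0 else R
    nonrel_above = 0
    score = 0.0
    for doc in ranking:
        if doc not in qrels:
            continue                                 # unjudged: skipped
        if qrels[doc]:
            score += 1 - min(nonrel_above, denom) / denom
        else:
            nonrel_above += 1
    return score / R

qrels = {"d1": True, "d2": False, "d3": True, "d4": False}
print(bpref(["d1", "dX", "d2", "d3", "d9"], qrels))  # dX, d9 are unjudged
```

The `continue` line is the whole point: the unjudged documents dX and d9 neither reward nor penalize the system, which is exactly why the measure says nothing about whether they were in fact relevant.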