Thursday, September 26, 2013

10-3 Optional alternative reading


  1. "Ranking Retrieval Systems without Relevance Assessments—Revisited"

    Summary: In this article, Sakai and Lin examine several previous studies that proposed methods of ranking systems without human relevance assessments, including the study by Soboroff et al. Their interest in the topic is driven by a desire to cut the cost and time of IR evaluation, both at evaluation conferences and in testing search engines. Their results showed that the simplest and best method for ranking systems without relevance assessors is to count how many systems returned each pooled document.

    1. Throughout the article, the authors discuss how NTCIR rankings are easier to predict than TREC rankings: NTCIR documents are presented to assessors in order of how many systems returned each document, while TREC documents are ordered by document ID. Could the order in which assessors receive the documents to judge bias their results? How could conference organizers reduce the possibility of assessor bias?

    2. The authors state that one of their long-term goals is to make methods of evaluating search engines faster and easier; however, in the conclusion they write that “majority vote” does not work for that purpose. How could such a system be used for search engine evaluation, since it is not as likely that you would be testing multiple systems on the same topic? Could some of the less-effective methods be used to provide some data to evaluate search engines, although they are not as accurate?

    3. Are you alarmed by the discovery that counting the number of runs is the best method of evaluating systems without relevance judgments? If such a test were used, would it make it easier to manipulate or “fix” the results of NTCIR or TREC? Is this a concern, since systems are ultimately held accountable for good results by users? Does using the number of runs as a metric bias the evaluation toward mediocre systems, excluding top performers that might find relevant documents no other system found?
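
    The nruns idea in the summary above can be sketched in a few lines of Python. This is only an illustration, not Sakai and Lin's implementation; the run data and pool depth are invented.

    ```python
    from collections import Counter

    def nruns_pseudo_qrels(runs, pool_depth=100):
        """Count, for each pooled document, how many runs returned it.

        runs: one ranked list of document IDs per system.
        Documents returned by more runs are treated as more likely relevant.
        """
        counts = Counter()
        for run in runs:
            for doc_id in set(run[:pool_depth]):  # each run votes at most once
                counts[doc_id] += 1
        return counts

    # Hypothetical runs from three systems:
    runs = [["d1", "d2", "d3"], ["d2", "d3", "d4"], ["d2", "d5", "d1"]]
    votes = nruns_pseudo_qrels(runs)
    # "d2" is returned by all three runs, so it gets the highest pseudo-relevance.
    ```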

  2. A Practical Sampling Strategy for Efficient Retrieval Evaluation - Javed A. Aslam and Virgil Pavlu

    In this article, the authors propose a "best of both worlds" kind of strategy that, in theory, combines the most efficient and accurate portions of other random sampling methods, while leaving behind the negative elements. Their approach is to use a non-uniform sampling method informed by estimates and expectations. The authors go into depth to describe how to work with the necessary elements of an experiment of this kind, and possible evaluation methods to judge its accuracy.


    1. It seems like the best way to employ the strategy proposed in this article is to use it on a set of documents which has already been examined, or which is highly similar to one examined in the past. This first became apparent to me in the section on "probability proportional to size," where the authors discuss the voting example in which a candidate may get more votes in rural areas than in urban ones; from this trend, a form of sampling which creates estimates can be implemented. The authors claim that given this kind of knowledge, one can "under-count where one over-samples and over-count where one under-samples" (p. 3). If this is the first time a subject has been sampled, how can one predict or estimate this? Must it be based on previous data? They later state that "if the size of the sample space is unknown...then one can estimate this quantity as well" (p. 4) with the generalized ratio estimator. But wouldn't this still require some basis of previous data?

    2. I had related questions about the ultimate evaluation of the results produced by this proposed strategy. The authors review their findings "compared with the actual 'true' values" (p. 7) from the TREC judgments, and later explain that they can evaluate their judgments by comparing them to ones that are "independent of the sampling picked judgments" (p. 8). What if you don't have a fully judged pool to compare your estimates to, as they did with the TREC depth-100 pool in this experiment? What is another way to judge your results on a set of documents which has never been examined? Do you just repeat the experiment using other methods?

    3. Early in the article the authors mention the "greedily chosen dynamic pools" (p. 1) and later refer to "a greedy strategy like depth-pooling" (p. 4), but they never explain what they mean by describing these pooling techniques as greedy. When I tried to find what "greedy" means in this context, the only relevant result was this article. What does "greed" signify here?
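
    The "over-count where one under-samples" compensation quoted in question 1 above can be sketched as a simple weighted estimator. This is a generic probability-proportional-to-size sketch under an assumed sampling distribution, not Aslam and Pavlu's actual method; the pool, probabilities, and judging oracle are all invented.

    ```python
    import random

    def estimate_total_relevant(docs, probs, judge, n_samples=1000, seed=0):
        """Estimate the number of relevant documents from a non-uniform sample.

        docs:  document IDs in the pool.
        probs: sampling probability per document (must sum to 1); documents
               believed more likely to be relevant get larger probabilities.
        judge: doc -> 1 if relevant else 0 (the costly human assessment).
        Each draw is weighted by 1 / (n_samples * prob), so heavily sampled
        documents are under-counted and rarely sampled ones are over-counted,
        which keeps the estimate unbiased.
        """
        rng = random.Random(seed)
        draws = rng.choices(range(len(docs)), weights=probs, k=n_samples)
        return sum(judge(docs[i]) / (n_samples * probs[i]) for i in draws)

    # Hypothetical pool: documents 0-2 are relevant and deliberately over-sampled.
    docs = list(range(10))
    probs = [0.2] * 3 + [0.4 / 7] * 7
    estimate = estimate_total_relevant(docs, probs, judge=lambda d: int(d < 3))
    # The estimate concentrates near the true count of 3 despite the biased sampling.
    ```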

  3. Robust Test Collections for Retrieval Evaluation

    Summary: The author outlines an approach to creating a reusable test collection that works with relevance estimates instead of relevance judgments. The framework is motivated by the need to address weaknesses of relevance judgments in IR, and by the desire for judgments that serve a purpose beyond the initial testing situation. Throughout the paper, the author places a strong emphasis on reusability and even provides a formal definition of it. The robust test collection is built by an iterative process in which a small set of judgments is evaluated and a confidence level is calculated; as more judgments are added, the confidence level grows toward 100%. In practice, the author suggests stopping once 95% confidence is reached; otherwise, the evaluation can be corrupted by over-fitting. As a comparison, the author evaluates his approach against competing methods and finds the results promising, while concluding that there is still work to be done to improve the technique.

    1. The author chooses to model expert judges as the basis for predicting relevance, taking judgments from experts and building the model as an aggregation of their performance across all topics. The author writes, "it seems reasonable to assume that if an expert makes good predictions for one topic, it will make good predictions for all other topics as well." Based on our class discussions, this seems like an unreasonable assumption: we have noted that people are better judges of topics about which they have intimate knowledge, and that otherwise their performance can differ noticeably. Is the author’s chosen model realistic enough to be considered a good choice? Even if the analysis is broken down by topic to avoid the differing performance of judges who have limited knowledge in some areas and expert knowledge in others, is modeling the decisions of expert judges the right choice from the start?

    2. When outlining the experiments, the author goes into detail about the data, mentioning that the experiments only consider the first one hundred ranked documents. The author notes that the approach would still work beyond that, but the intensive variance computation forced a cutoff. A big issue in software testing research is how well an approach scales. Given a Web-scale application, can this type of estimation technique actually scale to that setting or one like it?

    3. The author does point out that the worst case arises when the two systems being compared retrieve few documents in common. A primary goal of the author is to ensure that, for the technique he develops, the relevance judgments are reusable; he even devotes a section to explaining his view of reusability in an IR setting. Exploring the worst case, the author finds his metric to be robust. Even though performance doesn’t appear to suffer terribly in the worst-case scenario, is this something that happens often? If so, it would almost seem like you’d be comparing two systems designed for different purposes. The author calculates two other IR system comparison measures to demonstrate that his approach handles this situation better than the others. What aspects of his measurement allow it to perform reliably even in the worst case? With so many variables in flux, I feel the intuition behind the approach has captured real aspects of IR systems.
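
    The judge-until-95%-confidence loop described in the summary can be sketched as below. This is a toy version: confidence is estimated by a naive Monte Carlo over unjudged documents with an assumed 0.5 relevance prior, and the comparison uses plain precision, both far simpler than the paper's estimators; all names and data are hypothetical.

    ```python
    import itertools
    import random

    def confidence_a_beats_b(run_a, run_b, judged, p_rel=0.5, trials=2000, seed=0):
        """Monte Carlo confidence that run_a out-precisions run_b, simulating
        each unjudged document as relevant with prior probability p_rel."""
        rng = random.Random(seed)
        pool = set(run_a) | set(run_b)
        wins = 0
        for _ in range(trials):
            sim = {d: judged.get(d, rng.random() < p_rel) for d in pool}
            prec_a = sum(sim[d] for d in run_a) / len(run_a)
            prec_b = sum(sim[d] for d in run_b) / len(run_b)
            wins += prec_a > prec_b
        return wins / trials

    def judge_until_confident(run_a, run_b, oracle, threshold=0.95):
        """Request human judgments one document at a time, stopping as soon
        as the confidence in the comparison passes the threshold."""
        judged = {}
        for doc in itertools.chain.from_iterable(zip(run_a, run_b)):
            if doc in judged:
                continue
            judged[doc] = oracle(doc)  # the costly human judgment
            if confidence_a_beats_b(run_a, run_b, judged) >= threshold:
                break
        return judged

    # Hypothetical runs: system A retrieves all the relevant documents.
    judged = judge_until_confident(["d1", "d2", "d3"], ["d4", "d5", "d6"],
                                   oracle=lambda d: d in {"d1", "d2", "d3"})
    # The loop stops after four judgments instead of judging all six documents.
    ```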

  4. A Practical Sampling Strategy for Efficient Retrieval Evaluation - Aslam and Pavlu

    Summary: The researchers use two previous studies as a foundation for their research and propose an evaluation method for large-scale retrieval based on random sampling. They propose non-uniform sampling, wherein sampling probabilities are chosen in proportion to each document's expected contribution. This bias is compensated for by "under-counting where one over-samples and over-counting where one under-samples" (pg. 3).

    1. The researchers say that sampling without replacement is used in practice because it prevents individual draws from being repeated (pg. 2). However, didn’t Soboroff et al. advocate for keeping duplicate documents in a pool, stating that duplicates help reinforce the relevance of a document? (Maybe I misunderstood the researchers’ intent, but the voting analogy does seem like a slight misfit here.)

    2. Do IR researchers usually / always have a sense of what the result might be, or where the relevant documents might be located? Thus can a researcher always make informed decisions about when s/he might have over- or under-sampled? Don’t you think this method brings in a biased view from the onset?

    3. The researchers mention that with random sampling there comes a trade-off between exactitude and certainty on the one hand, and efficiency on the other (pg. 2). Aren’t IR systems about the former? Is this similar to the precision versus recall argument? As users, don’t we expect certainty from Google, Yahoo!, or any other search engine we use?

  5. Summary:
    The authors attempt to highlight the importance of replacing manual relevance judgments with pseudo relevance judgments for ranking retrieval systems, given the considerable cost of manual relevance assessment. They share the results of their study of existing system ranking algorithms that need no relevance assessments, clearly stating the methods they followed in the process. After this extensive study, the authors conclude that the simplest way to form pseudo-qrels, based on how many systems returned each pooled document, performs better than the others. The authors also note that NTCIR rankings were predicted more easily and accurately than TREC rankings, and attribute this to the method TREC follows for pooling documents before relevance assessment.

    My questions for discussion from this article are:
    1. The authors note that “nruns%30” turns out to be the most effective system ranking algorithm. Will this hold even when there are many novel runs, where ranking based on majority might not work well? Does this not depend more on the reliability of the test collections?

    2. As can be observed in Table 3, the method of Nuray and Can (BIAS + nuray in this article) seems to be the worst at predicting the run ranking. The authors attribute this to the ‘condorcet bias’, that is, the uniqueness of a system in producing results for a given run. Are the authors saying that the effectiveness of creating a pseudo-qrel for a system lies in the system’s ability to produce as many common runs as possible? Systems which produce unique runs are intuitively more likely to be ranked higher (as argued in Soboroff’s method, which we are discussing this week).

    3. While discussing the results of the two-sided test on nruns pseudo-qrels, the authors observe that the reliability of the nruns pseudo-qrels is 82-100% for NTCIR, while it is 68-81% for TREC data. The authors use this result to claim that TREC data are more likely to misrepresent the true rankings, as they are less reliable. Does this have more to do with the type of test collections in TREC Robust, or with the methods of ranking? I believe it would help to have further discussion on this topic.

  6. Article: F. Radlinski and N. Craswell. Optimized Interleaving for Online Retrieval Evaluation. WSDM 2013.

    The authors propose a new approach to developing interleaving algorithms for online evaluation. Their approach incorporates an explicit credit function, as opposed to the implicit credit of past interleaving types. They also begin by defining a set of properties they want the algorithm to satisfy and then work backward from those goals to arrive at the algorithm, in contrast to developing an algorithm in reaction to a problem observed in another one. The authors believe that their approach and explicit credit function can be taken up by future researchers and lead to better algorithms than the one they propose.

    1. The authors spend a section discussing the effects that random clicks have on the credit given to different rankers in the interleaving, but what qualifies a clicker as random? How can it really be determined that a user is clicking randomly on links, as opposed to merely working through the results in their own manner?

    2. The authors discuss the breaking cases of past interleaving algorithms and how their new algorithm proves correct in all of them. Seeing as these breaking cases were discovered through implementation and study of the past algorithms, couldn't it be that the optimized algorithm simply has not been tested enough for its own breaking cases to surface?

    3. The authors admit that their dataset was partially to blame for the poor performance of the probabilistic algorithm in their tests. Given their selection process, could the dataset not falsely bolster the appearance of their new algorithm?
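
    The explicit-credit idea can be sketched generically: given an interleaved list that records which ranker contributed each result, clicks are converted into per-ranker credit through a pluggable credit function. This is only the general shape of click-credit scoring, not Radlinski and Craswell's optimized construction, and the impression data below are invented.

    ```python
    def score_interleaving(interleaved, clicks, credit):
        """Assign click credit to each ranker in an interleaved result list.

        interleaved: list of (doc_id, ranker) pairs as shown to the user.
        clicks:      set of clicked doc_ids.
        credit:      explicit credit function (rank, doc_id) -> float.
        The ranker with more total credit wins the impression.
        """
        totals = {}
        for rank, (doc, ranker) in enumerate(interleaved, start=1):
            if doc in clicks:
                totals[ranker] = totals.get(ranker, 0.0) + credit(rank, doc)
        return totals

    # Hypothetical impression: rankers "A" and "B" alternate; the user clicks d2 and d4.
    shown = [("d1", "A"), ("d2", "B"), ("d3", "A"), ("d4", "B")]
    scores = score_interleaving(shown, {"d2", "d4"}, credit=lambda rank, doc: 1.0 / rank)
    # "B" earns 1/2 + 1/4 = 0.75 credit; "A" earns none, so "B" wins this impression.
    ```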

  7. I read "Relying on Topic Subsets for System Ranking Estimation" by
    Hiemstra et al.

    1. Unlike the Guiver paper, this paper never once mentions over-fitting to the data, but
    it explores the idea of removing topics for which the ranking estimations
    perform poorly. This seems to open the evaluation up to the possibility that a
    system which performs really poorly on some class of topics will be evaluated
    as being just as good as a system that performs well on those topics. Am I
    missing something, or does this approach make the evaluations susceptible to
    exactly that kind of error?

    2. This paper concludes that the best approach for estimating evaluations is
    that proposed by Soboroff, which aside from essentially randomly assigning
    relevance, also has the downside of requiring existing relevance judgements to
    develop the mean and standard deviation necessary to construct the model used
    to assign relevance. Does this indicate that in spite of all of the buzz around
    ranking estimation, there still hasn't been an approach that reduces the amount
    of work necessary to build a test collection?

    3. In the intro the authors state that their hypothesis relies on the "right"
    subset of topics. Soboroff seemed to do a good job of showing that such a subset
    can be used to reduce the work needed to create a test collection, and that the hard
    part was picking the subset; doesn't this paper then go in the wrong direction and
    focus on the easy part of the problem?
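
    The subset-versus-full-set comparison these questions revolve around is usually scored with Kendall's tau between the two system rankings. Below is a minimal sketch with invented per-topic scores; the systems, topics, and AP values are hypothetical.

    ```python
    from itertools import combinations

    def kendall_tau(rank_x, rank_y):
        """Kendall's tau between two rankings of the same systems (best
        first, no ties assumed): +1 for identical order, -1 for reversed."""
        pos_x = {s: i for i, s in enumerate(rank_x)}
        pos_y = {s: i for i, s in enumerate(rank_y)}
        concordant = discordant = 0
        for a, b in combinations(rank_x, 2):
            if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
                concordant += 1
            else:
                discordant += 1
        n = len(rank_x)
        return (concordant - discordant) / (n * (n - 1) / 2)

    def rank_systems(scores_by_topic, topics):
        """Rank systems by their mean score over the given topic subset."""
        means = {sys: sum(scores[t] for t in topics) / len(topics)
                 for sys, scores in scores_by_topic.items()}
        return sorted(means, key=means.get, reverse=True)

    # Hypothetical per-topic AP scores for three systems over four topics.
    ap = {"s1": {"t1": 0.9, "t2": 0.8, "t3": 0.7, "t4": 0.6},
          "s2": {"t1": 0.5, "t2": 0.5, "t3": 0.5, "t4": 0.5},
          "s3": {"t1": 0.3, "t2": 0.4, "t3": 0.2, "t4": 0.3}}
    full = rank_systems(ap, ["t1", "t2", "t3", "t4"])
    subset = rank_systems(ap, ["t1", "t2"])
    tau = kendall_tau(full, subset)
    # Here the two-topic subset reproduces the full ranking exactly, so tau is 1.0.
    ```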

  8. Minimal Test Collections for Retrieval Evaluation – B Carterette, J Allan, R Sitaraman.
    [In this paper the authors present a novel way of looking at average precision. Using it, they develop an algorithm to select documents for judging, guided by an estimate of the degree of confidence obtained by defining a distribution over the possible document judgements. The paper ends with discussions of the extent to which the results can be attributed to AP, and of the number of documents required to differentiate a pair of ranked lists with a given confidence. Furthermore, a discussion of the reusability of the documents so judged (for evaluating a new system) is presented.]

    Metrics like average judgements per minute and relevance percentage depend a great deal on the topics chosen. The annotators were asked to choose topics which presented neither too few nor too many documents. Since the TREC topics span much of the difficulty spectrum (some have many relevant documents and some few), does the chosen criterion qualify this method for comparison against TREC’s metric results?

    The eight retrieval runs were obtained from six retrieval systems (end of page 3). Which of the retrieval system(s) was used more than the rest? Why is each system using different retrieval models?

    In spite of the scepticism surrounding the small number of annotators, it appears that this method can be used for a quick evaluation of several retrieval rankings. It is surprising that the expected MAP values and the original MAP values of the different retrievals are proportional, and that the former’s distribution could be exploited to allow a higher judgement rate.
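
    The expected-MAP idea discussed above can be illustrated with a naive Monte Carlo: treat each document's relevance as an independent Bernoulli and average AP over sampled judgment assignments. This is only an illustration of "a distribution over the possible document judgements," not Carterette et al.'s closed-form estimator, and the probabilities below are invented.

    ```python
    import random

    def expected_ap(run, probs, trials=5000, seed=0):
        """Monte Carlo expectation of average precision when each document's
        relevance is an independent Bernoulli with the given probability.

        run:   ranked document IDs.
        probs: doc_id -> estimated probability of relevance.
        """
        rng = random.Random(seed)
        total = 0.0
        for _ in range(trials):
            rel = [rng.random() < probs[d] for d in run]
            hits = ap = 0.0
            for i, r in enumerate(rel, start=1):
                if r:
                    hits += 1
                    ap += hits / i
            total += ap / hits if hits else 0.0
        return total / trials

    # With certain judgments the expectation collapses to ordinary AP:
    # only "d2" is relevant, at rank 2, so expected AP is exactly 0.5.
    value = expected_ap(["d1", "d2"], {"d1": 0.0, "d2": 1.0})
    ```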