Wednesday, September 18, 2013

26-Sep Kazai, Gabriella, et al. An analysis of systematic judging errors in information retrieval. CIKM 2012.


  1. 1. In this experiment, the authors discovered biases in the ways that raters handled Wikipedia pages. Is it possible that the bias by the ProWeb judges is a good bias, in that they are giving users what they want? Or should experimenters focus on comparing raters to ‘super’ judges, as the authors did in the HQ set, which showed that ProWeb judges overrated Wikipedia? Should raters seek to find good information or get users what they want?

    2. Could search engines be customized and transparent to let users input their own biases and in a sense be their own assessors? Users could rate the level of depth and scholarly-ness they wanted and then give thumbs up or down to individual web pages. Has this been tried before in web search?

    3. In the introduction, the authors state that training procedures and guidelines lead to biased judging. The experiment showed that this was true, revealing biases in how the different groups of raters handled Wikipedia pages. However, the ProWeb raters, those with the most training, had the highest correlation with real web searchers. How does this study affect your thinking on the type of training that should be given to raters?

  2. 1. The conclusion from this paper seems to be that assessor deviation does start to matter once we take different populations of judges into account. This is in contrast to the conclusion in the Voorhees paper, which showed that rankings don't change so much if different groups of judges are involved but from the same population (retired intelligence analysts). What stands out though, is that the authors conducted an analysis on the types of errors made by different judge groups (like Wikipedia pages which were overrated by crowd source population). Have these authors covered all the important biases or are there some which they have not covered?
    2. Some of the findings clearly indicate that crowd sourced workers are a good approximation to the average users (this is succinctly mentioned in the conclusion as well). However, the average user is not the best assessor. Assuming the findings hold in future studies as well, doesn't this validate the use of crowd sourcing for modeling user intent, at the price of invalidating it as a good source for proper relevance judgments?
    3. The statement on page five that Wikipedia pages are 'inherently more controversial and are harder to judge consistently' seems to be a shallow conclusion, given the evidence that inter-assessor agreement on all the groups is low on the Wikipedia pages. Wouldn't an alternate conclusion be that the Wikipedia page disagreements are effects, not causes, of a deeper problem: that none of these groups, even the trained ones, have been informed about how to rate pages that are chock full of mostly reliable material relevant to the query but possibly irrelevant to user intent? It would have been interesting to see on which queries there was maximum agreement on the Wikipedia pages, and on which there was maximum variability.

  3. Who are web super judges? (pg8) How do you become one? So there are gold, silver and bronze, and super judges? This seems like a lot of levels of judging. It makes sense, but I just wonder what criteria and qualifications allow a judge to move from one category to another?

    The authors state that non-expert judges' ratings are 'shallow and inaccurate', but wouldn't most common search engine users be considered non-experts? Or perhaps only a domain expert, a silver or bronze at best? If non-expert judges are so shallow, but closer to the typical search engine user, what can we learn from these assessors, and how can we filter out the shallowness?

    What is a gold set of data and how is it created?

  4. In this paper, a labeled gold set is used to measure label quality. Here, I have a question about the gold labels: How are these gold labels created? Is it possible that this set itself has accuracy problems?

    According to the results of this paper, the crowd gives a better user model but is more likely to make mistakes compared to professional judges. So, whom should we recruit when trying to judge relevance: crowd workers or professional judges?

    Having read this paper, I have a question: what contributes to the trained judges’ high inter-assessor agreement, their accurate rating based on users’ needs or a training effect generated in the judging process?

  5. 1. The authors of this article present trained judges as inherently biased at the outset of this study. Later, however, it is revealed that non-professionally trained assessors and crowd workers are also subject to judgement biases, as they tend to create their own, internal rules for assessments. Ultimately, the authors conclude that trained judges are more consistent, discerning, and in line with real Web users. Furthermore, they point out the many flaws of non-professional and crowd workers, saying they are "skewed towards overrating results" (p. 112), and that "[n]on-expert judges tended to give shallow, inaccurate ratings compared to experts...also disagree on the underlying meaning of queries significantly more than experts, and often appear to 'give up' and fall back on surface features such as keyword matching" (p. 106). If this is true, why do sites like Mechanical Turk rely on crowd workers? Have they not discovered similar findings as the authors of this article? Or does a cost-benefit analysis show that crowd workers are more desirable than professionally trained assessors?

    2. This study shows that all groups overrated Wikipedia pages. The authors claim that different document formats can sway an assessor's performance (p. 111), and that this can lead to a "systematic judging error" (p. 114). Are the authors of this study taking into account all factors of usability, including satisfaction, efficiency, and users' feelings towards the documents (as described by Diane Kelly in "Methods for Evaluating Interactive Information Retrieval Systems with Users" Chapter 10)? Perhaps the Wikipedia format is more useful, and therefore judged to be more relevant. Could self-report data added to this study help to clarify why Wikipedia pages were unanimously overrated?

    3. There seems to be a similar concern here as Ellen Voorhees identified in "The Philosophy of Information Retrieval Evaluation" regarding the use of a single relevance assessor. The authors of this study state: "Although disagreement analysis is informative, it may well be that not all disagreements are bad. In some cases disagreements among judges may reflect the diverse opinions of real-world users" (p. 109). How can we trust relevance judgments which come from one judge at one point in time, given one set of instructions?

  6. 1. Different groups of judges (NIST, crowd workers, trained judges of a commercial Web search engine) were used for this study. Are there any reasons why these groups were chosen? 10 people were chosen as trained judges and 45 crowd workers were used. Why are the numbers of judges in the different groups different? Will the difference in numbers affect the study of within-group inconsistency? Intuitively, more people will have more disagreement with each other. Also, as we all know, disagreements exist among real users, so how do we assess the effect of these disagreements within the judge groups?

    2. In “GOLD SET ANALYSIS”, the gold data is created by a group of highly trusted judges. What are the training backgrounds of these judges? Is the comparison between the performance of different judge groups and the gold data more like a comparison of how close the preferences and backgrounds of the different judge groups are to those of the “highly trusted judges”? To judge errors using the gold label, the distance between relevance grades was used. However, the same distance value might have different meanings. For example, a distance of 2 might be due to a change from Ideal (4) to Happy (2), or Happy (2) to Unhappy (0), and intuitively the latter mistake is more serious. So is it better to calculate the frequency of these pair-changes (i.e. “Ideal-Happy”) to calculate the judge errors?

    3. Three different biases were examined in this study (WP/nWP, QiU/nQiu, RU/nRU). What is the importance of these different biases in the real world? Intuitively, users will be less tolerant of highly ranked unrelated documents, but will probably be OK if a wiki page related to the topic, which is not what they are looking for, has a high rank. If so, studies on these categories might be of less importance to real users. Similarly, for the final analysis on ranker training, why is NDCG used as the evaluation metric? And why are only the top 10 entries used? What are the differences between the training features BM25 and WP?
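
To make the second question in this group concrete, here is a minimal sketch of the two ways of summarizing judge errors against a gold label: by absolute grade distance, and by (gold grade, judged grade) pair. The function name `error_profiles` and the toy labels are made up for illustration; only the grade values Ideal (4), Happy (2) and Unhappy (0) come from the question above, and nothing here reproduces the paper's own analysis.

```python
from collections import Counter


def error_profiles(gold, judged):
    """Summarize disagreements between gold labels and a judge's labels.

    Returns two Counters:
      - distances: how often each absolute grade distance occurs
      - pair_changes: how often each (gold grade, judged grade) pair occurs
    Items where the judge matches the gold label are not counted.
    """
    if len(gold) != len(judged):
        raise ValueError("gold and judged must have the same length")
    distances = Counter()
    pair_changes = Counter()
    for g, j in zip(gold, judged):
        if g != j:
            distances[abs(g - j)] += 1
            pair_changes[(g, j)] += 1
    return distances, pair_changes


# Toy data (hypothetical): two errors, both at distance 2.
gold = [4, 2, 2, 0, 4]
judged = [2, 0, 2, 0, 4]
distances, pair_changes = error_profiles(gold, judged)
print(distances)     # both errors fall in the same distance bucket
print(pair_changes)  # the pair view separates Ideal->Happy from Happy->Unhappy
```

In this toy case the distance histogram lumps both errors together, while the pair counts keep the Ideal-to-Happy and Happy-to-Unhappy mistakes apart, which is the distinction the question is asking about.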

  7. The paper takes an interesting approach to the relevance problem at a group level (most of the approaches seen so far were at the individual level). If we assume that users can be modelled as a set of groups, then relevance could be defined specifically for each of the groups. This is because it is beneficial to classify users into categories based on certain features and to take advantage of the similarities and the biases of each of the groups. Perhaps experiments with a larger number of people (judges, crowd) will help us understand the scalability of this approach.

    From the Fleiss’ kappa agreement values, the authors conclude that Wikipedia pages are inherently more controversial and are harder to judge consistently. I understand that the conclusion is drawn from the data obtained in the experiments. However, how can this relate to the real-world web when most of the ‘know’ queries can be answered by Wikipedia? Does this mean there are a lot of better sources than Wikipedia for queries that only experts seem to distinguish?

    The authors argue that the ratings provided by the judges need to be not just consistent but also correct, and that correctness can be measured by using the gold set. Though the gold set judgements are made by super judges, it is unclear what parameters they take into account. Do only the vital or highly relevant documents make it into the decisions made by the super judges?

  8. 1. Are the relevance judgments that reflect the “middle of the road approach” (p. 109) potentially the results of “lazy assessors” (from Carterette and Soboroff) working in an interval framework? While Carterette calls theirs the lazy model, what other explanations are there? Based on the description in this paper, these assessors, for example, could be unsure of the relevance, or sure of secondary relevance, or even acknowledging that others may find the document relevant even though they do not.
    2. I was interested to see judgments of documents (and for particular types of queries) examined in more granularity, using classifications such as “WP” and “URL”. The findings point out that the types of documents themselves are a very important component of relevance assessment behavior. At the same time, the categories that Kazai et al examine are, for the most part, quite specific and mutually exclusive, if not structural (for example, queries that may seek specific URLs). What would be suitable next steps for expanding this analysis to take on broader query and document types (e.g., news, searches for files, and so on)?
    3. Rather than revealing a concerning phenomenon regarding ranking variance, perhaps the researchers are actually opening the door to more honed, user-specific search results. This paper finishes by suggesting the possibility of tracking agreement between judges and end-users. Assuming that both assessors and end-users have biases, then for specific types of queries, why might certain types of assessors be more germane for certain topics? To that end, what are other risks or dynamics to consider when evaluating variance between assessors on both assessor AND document type-by-type bases? How does intent mismatch differ between types of assessors?

  9. 1. As we have established that relevance isn't a static quantity, by simple logical extrapolation the threshold for determining the relevance of a document is also subject to variation, so I'm curious as to how we can account for these discrepancies while analysing judging errors. For instance, wouldn't a person who has judged a series of non-relevant documents have a lower relevance threshold than an individual who has come across a bunch of highly relevant documents? Currently, this variable threshold is unaccounted for in the various judging errors that have been enumerated in the paper. So, I'd like to know how this factor is taken into consideration when analysing judging errors.

    2. We've seen how it is important that an IR system makes use of a dynamic test collection and also seen how it is imperative to preserve semantics when evaluating an IR system. IR systems have made use of traceability tools to generate such information pertinent to users. However, the results we get continue to be just something we can base our assumptions on. Isn't this a causal nexus? In which case, would it be imprudent of me to believe that analysing the judging errors requires consideration only once we have been able to envision a nullifying factor against the bias that the judging error creates?

    3. The paper speaks of Fleiss' kappa agreement levels as a standard to calibrate inter-assessor agreement. What are the factors that this measure considers? Does it account for varied sample size? Different subjects? Different criteria that require inclusion? Is it capable of detecting any of these biases? Is this measure sufficient and yet substantial for this judgement? And how can we interpret this value when it has been stated to be a chance-corrected agreement?
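
As background for the third question, the following is a minimal sketch of the textbook Fleiss' kappa computation, not the paper's code. It assumes every item receives the same number of ratings (the standard formulation requires this, although the raters need not be the same individuals for each item). The function name `fleiss_kappa` and the toy count matrix are illustrative only.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Assumes every item has the same number of ratings n, with n >= 2.
    """
    num_items = len(counts)
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    num_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p = [sum(row[j] for row in counts) / (num_items * n)
         for j in range(num_categories)]

    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    observed = sum(per_item) / num_items

    # Agreement expected if raters labelled at random with the shares p.
    expected = sum(x * x for x in p)
    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical): 2 items, 3 raters, 2 categories.
kappa = fleiss_kappa([[2, 1], [1, 2]])
print(kappa)
```

The "chance-corrected" part is the subtraction of `expected`: a value of 1 means perfect agreement, 0 means agreement no better than chance given the category shares, and negative values mean less agreement than chance.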

  10. In Section 4, the authors chose to use the ad hoc task of the TREC Web Track for the case study. In that section, the authors did not mention at all why this track was selected. Also, how representative is the ad hoc task for establishing the authors' claimed hypotheses?

    The second question is about the analysis methodology adopted by the authors. The authors used consensus and gold sets based on labels and clicks to identify systematic errors in the judging behavior of those groups of assessors. The underlying assumption is that disagreement implies systematic errors. However, disagreement is very often caused by the diversity of user information needs, and the authors have not stated clearly how to find out what exactly leads to the disagreement.

    In Section 6.1, the authors state the controversial finding that “the ProWeb judges have the highest agreement with click preferences, they are the most successful at interpreting the real user needs”. Then the authors also indicate: “Focusing only on the top two clicked URLs per query, we find that crowd workers are actually the best at agreeing with real users”. I am confused about what conclusions can be drawn from these findings. Does this controversy represent the intrinsic conflict between topical relevance and user relevance (user satisfaction)?
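
For readers unsure what "agreement with click preferences" measures, here is a rough sketch of one way such a number could be computed. It is an assumption-laden illustration, not the authors' procedure: the function name `click_agreement`, the data layout, and the choice to count ties as disagreement are all hypothetical.

```python
def click_agreement(preferences, labels):
    """Fraction of click-preference pairs that a judge's labels agree with.

    preferences: list of (query, preferred_url, other_url) tuples, where
        click evidence says preferred_url is preferred over other_url.
    labels: dict mapping (query, url) to a numeric relevance grade.
    A pair counts as agreement only if the preferred URL has a strictly
    higher grade (ties count as disagreement, an assumption of this sketch).
    Pairs with a missing label are skipped.
    """
    agree = 0
    total = 0
    for query, preferred, other in preferences:
        a = labels.get((query, preferred))
        b = labels.get((query, other))
        if a is None or b is None:
            continue
        total += 1
        if a > b:
            agree += 1
    return agree / total if total else float("nan")


# Toy data (hypothetical queries, URLs and grades).
preferences = [("q1", "a", "b"), ("q1", "a", "c"), ("q2", "x", "y")]
labels = {
    ("q1", "a"): 3, ("q1", "b"): 1, ("q1", "c"): 3,
    ("q2", "x"): 0, ("q2", "y"): 2,
}
rate = click_agreement(preferences, labels)
print(rate)
```

Restricting `preferences` to different subsets (all clicked pairs versus only the top two clicked URLs per query) can change which judge group scores highest, which may be part of why the two quoted findings appear to conflict.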

  11. Kazai mentions training and judging procedures as potentially causing bias amongst the judges, in contrast to the idea behind MTurk. By applying a benchmark for taking part in specific Turk tasks, the possibility of better judgments increases but a bias might be present. How does she take into account the use of testing qualifications in this?

    One listed error in judging comes in the form of intent mismatching. But in the case of MTurk or any crowd-sourced judging, wouldn’t a qualification aspect (for example male/female, age of user, location) help narrow down possible intent for different queries? Furthermore, from that data would it be possible to create a range of possible intents?

    Given that ranker training can influence evaluation decisions, does this prevent rankers from participating in other relevance judging posts? In particular, how might crowd sourcing be influenced by individuals with ranker training in the midst of random users?

  12. 1. What makes a "label-based gold set?" This paper seems to be calling into
    question the validity of human-created relevance judgements, and they're
    differentiating this gold set from a click-based one, so where would this be
    coming from?

    2. In section 3 it talks about types of judging errors, and among the common
    errors it lists "intent mismatch," which would involve something like
    interpreting "circuit breaker" as the electrical device instead of, for
    example, the program by the same name run by the Illinois Department on Aging.
    The thing is, there doesn't seem to be a wrong answer here. It is entirely
    possible that the user intended to find the program and not the electrical
    device, or vice versa, so how could it be an error for a judge to interpret it
    one way or the other? How is this handled in the relevance judgements?

    3. This paper seems to find that expert judges do in fact achieve a higher
    level of agreement with the click-based gold set (for example in section 6.1).
    This seems kind of un-interesting, as this is the approach used by TREC already
    for making relevance judgements. Is one of the main contributions of this paper
    to just confirm practices that are already being used?

  13. 1. The authors mention selecting portions of the past TREC assessments in order to compare them to the judgments they did for their experiment. While they are interested in comparing how different types of judges rank pages like Wikipedia, doesn't selecting portions of the TREC assessments skew their own results? The TREC judges' assessments that they used for comparison seem to be taken out of the context of the whole assessment.

    2. Different trials and researchers train their judges for different tasks. When judges are trained to assess documents in the TREC runs or for companies like Google, how does their prior training affect their performance for other types of assessments? Wouldn't a rater for Google have a different view of a document than a judge for a conference?

    3. The authors mention that their "gold set" was assessed by so-called "super judges." What sets these judges' assessments above the other professional judges so much so that they could consider their relevance judgments as "gold"?

  14. 1. The article mentions gold standard consensus judgments generated from “super judges”. What is an appropriate way to create a gold standard of relevance judgments and what is a super judge?

    2. The author puts a lot of emphasis on the importance of distinguishing between Wikipedia relevance judgments and non-Wikipedia relevance judgments. I understand that the interest is in whether certain document types are more controversial to judges, but is Wikipedia vs non-Wikipedia a good way to characterize a document? Also, while Wikipedia pages tend to be less agreed-upon, the amount of agreement that is shown isn’t all that much lower than average agreement. How do we know that the difference is significant?

    3. In order to investigate the impact of bias in relevance judgments on ranking algorithm evaluation, the authors look at algorithms specially trained on “BM25” and “WP” features. What does it mean to train an IR system on BM25 or WP features? Is the variation in NDCG between different test sets (pg 113) explained purely by how these different groups viewed Wikipedia or is something else at work?

  15. The author mentions “Non-experts also disagree on the underlying meaning of queries significantly more often than experts, and often appear to give up and fall back on surface features such as keyword matching.” Then he also states “Thus, it seems that while NIST judges may have become better at rating popular Wikipedia pages in 2010, their accuracy in rating popular non-Wikipedia pages reduced. On the ST data, we observe that both” Does it not indicate the flimsy nature of the whole method? It also seems that judges are driving the result set rather than what users deem to be relevant. Is this the reason Google doesn't use the relevance judgments provided by Search Quality Graders in its search algorithm as mentioned in its guidelines?

    On page 2, the author mentions “We characterize and compare the judge groups according to their inter-assessor agreement, their agreement with click-based gold sets and with label-based gold sets.” But he has been very vague about how these gold sets are created.

    Then the author states "One way to assess correctness is to sample the labels and evaluate each one against a known "gold" rating. The generation of gold sets is however expensive and it may not be practical to test each judge on all the gold data." He then mentions a way to generate the gold set: "gold sets may be derived from click evidence. For example, one can create pairs of URLs for a query with known click preference relations and then check if the judgments agree with the click preference."
    But how can there be any certainty that the gold set we have created this way will be ideal/correct in all respects? I mean, if we have no perfect way to evaluate user behaviour for a given data set, how can we say this gold set will be ideal and that we can compare everything with respect to it? The gold set itself might indicate some form of bias if created in the way mentioned above...

    The author states "It is important that ratings provided by the judges be not only internally consistent, but also correct. One way to assess correctness is to sample the labels and evaluate each one against a known "gold" rating." Now if we are able to generate a gold set, then why do we need to go through the whole process of evaluation? Isn't having a complete understanding of the user, then being able to simulate his behaviour and thus being able to create a gold set, the goal everyone in this field is trying to achieve?

    Continuing the thoughts raised above, if we are not sure about the authenticity of the gold set, and we are just using it as a baseline for judgment, then why do we need a gold set anyway? Could we not use any data set, set that as the benchmark, and then continue with respect to it?

  16. 1. The crowd model adopted is not clear. How was the judging interface designed and how different is this from the rest? Why 45 workers, and how distributed is the worker effort? It may be an unfair setup for the crowd, since, as indicated by the authors, agreement has a positive correlation with training and calibration (which the other groups receive in some form).

    2. Clearly both gold sets show a certain bias – which is the better gold? How pure is this gold?

    3. The results presented in Table 7 are not clear: how is it that there is no training bias for the first two features? Training on the crowd labeled data is shown to consistently perform the best when using the third feature.

  17. 1. The author introduces us to different methods of identifying systematic errors made while judging. In the process, it is stated that relevance judging methods rely on a fixed group of expert judges who are trained to interpret user queries as accurately as possible and label documents appropriately. Is this even feasible for a small set of topics? With the web, which is vast and varied, is this a feasible and cost-effective approach for judging relevance?

    2. In the section about “Consensus Analysis”, it was observed that judge training on a specific judging task increases the assessor’s judging consistency. What level of training would be necessary to avoid the problem of over-fitting? Also does that observation mean that training had resulted in a biased judgment?

    3. In “Ratings for most popular URLs”, the observations show that for a particular data set (ST), the crowd workers tended to assign high grades while the professional web judges were more conservative in their ratings, and some judges trained in other tasks performed worst. This disparity models the real user scenario well. The main use case that portrays this situation doesn’t demonstrate how the different judgments affected the ranking of the search results. Isn’t the key problem in an IR system more about the overall ranking than about consensus within individual groups? How can consensus be achieved across different judging groups?

  18. 1. Section 4.2 says to use click data to build a gold set. How does it guarantee that what it gets is a “gold” set? Though the paper discusses the details of how to use the click data, there is no information about its quality.
    2. On page 109, in the last paragraph of the left column, it says “Wikipedia pages … harder to judge consistently”. However, in popular search engines, Wikipedia pages have now gained higher rankings. Can we treat Wikipedia as a reliable source for information retrieval?
    3. Overall, the problem raised in Section 1, that different groups have different intentions when searching, cannot be solved even by expertise. Without scope or constraints, we cannot say whether a judgment is an error or not. That is my concern about this paper: what is the value of this paper from this perspective?

  19. 1. This paper discusses three specific types of judgments. Are they representative enough? In other words, can these data sets fully cover the scenarios of judging errors?
    2. Three judging errors were discussed in section 3. Generally speaking, these three errors cannot cover all cases in search engines. Are there any other errors? If yes, how do they impact the final quality of the evaluation?
    3. In section 5.1, the analysis was based on “Fleiss’ kappa”. Why does the paper choose this measurement here? Is it possible to adopt other measurements?

  20. 1- I would like to see the gold standard of ‘super judges’ and the gold standard of click preferences compared to each other. This is not just a matter of curiosity (although I am curious); it is important to see how well super judges compare to real user expectations. If they do differ, in what ways do they differ? How far off are they from each other? Can we use that information in our assessor training? How do we want to weight click data vs. super judged data? Is it healthy to cater to what the users want or should we try and start making the results look more like the super judged data?

    2- One general conclusion of the article seemed to point to a higher agreement and accuracy of the ‘ProWeb’ judges, those specially trained to assess web searches. It would then be the preferred situation to use them as judges as often as possible (for web search engines), but this is obviously cost prohibitive. So is there some way to better train ‘the crowd’ to give more professional answers? Does this defeat the purpose of a crowd? Can some medium level be established, where some of the crowd are trained a little more but not as much as a professional assessor? Are Google rankers who use the guidelines we read considered professional assessors? How difficult is it to train a professional web query assessor?

    3- Isn’t it problematic that the NIST judges did not use the same ranking methods as the other judges or the same data sample and that none of the other groups of judges judged their data sample?

  21. 1. In this paper the authors used the top ten search results from Google and Bing, using topics from the 2009/2010 Web track, as a method for gathering a test pool for their experiment. Considering that we know that Google sometimes uses experimental algorithms that might be worse than their regular algorithm in real-time traffic without telling the user, how can the experimenters know if the documents that they are getting from Google or Bing are the best or most relevant documents? Would the use of a document from Google or Bing on an experimental algorithm that might not be as good as the usual result affect their study in any way?

    2. In this article the authors use a set of “highly trusted judges (super judges)” as one of the methods for testing the results of the relevance judgments gained from their different sources. They compare the judgments of these super judges against the results that they acquired from their groups. However they never explain why these highly trained judges are any better than the professional judges that they use. Isn’t this just another example of comparing the relevance judgments of two different judges? What makes these super judges any better at judging relevance than any other judges?

    3. In this article the authors use the NDCG as a method of judging the impact that the difference in judgments has on the effectiveness of search engines. What other metrics could they have used to do this? Is just one metric enough?
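
For reference when discussing the metric, below is a minimal sketch of NDCG cut off at rank 10. The paper's exact gain and discount choices are not given in these notes, so this sketch assumes the common exponential gain (2^grade - 1) with a log2 rank discount, and it computes the ideal ordering from the grades of the same returned list, which is a simplification. The names `dcg_at_k` and `ndcg_at_k` and the toy grades are illustrative.

```python
import math


def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the first k grades of a ranked list.

    Uses gain 2**grade - 1 and discount log2(rank + 1) with ranks starting
    at 1 (both are assumptions of this sketch).
    """
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG normalized by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# The same ranked list scored with labels from two hypothetical judge groups.
labels_group_a = [3, 2, 1]   # group A thinks the ranking is perfectly ordered
labels_group_b = [1, 3, 2]   # group B thinks the top result is overrated
print(ndcg_at_k(labels_group_a))
print(ndcg_at_k(labels_group_b))
```

The point of the two calls is that the ranking itself is unchanged; only the labels differ, so any systematic bias in a judge group's labels feeds directly into the score a ranker receives.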

  22. 1. My first question is about Section 6, Gold Set Analysis. In this paper the gold set is defined as two types: 1) a set of labels provided by a group of highly trusted judges, and 2) a click-based gold set reflecting real Web users. Suppose now I have one group of labels from judges, and another group of labels from click-based data. There may be some labels that are in both sets. Should we give them more weight than others while we are doing analysis on these labels? There may also be noisy information in the click-based data; how does the noise affect the experiment results?

    2. My second question is in Section 6.4 Regression analysis. In this part, they are trying to fit a linear regression model over their query-URL feature set to determine which features have significant impact in the two judge groups. From Table 6 we can see that linear regression is not a good indicator for this question (all the p-values are greater than the threshold). So I wonder whether there are some other models we can make use of. Maybe we can visualize the relationship first and then give a better estimation about the underlying model to fit the query-URL feature set.

    3. My third question is about Section 7, Impact on ranker training. Throughout the paper it talks about labels everywhere. As we know, supervised learning deals with labels a lot, and this paper in particular talks about the gold set. This is a learning-to-rank question and I am wondering what a proper way to make the most of the gold set is. How do we avoid over-fitting to the gold set? How do we eliminate noise in the gold set?