Saturday, November 9, 2013

11-14 Gabriella Kazai et al. User Intent and Assessor Disagreement in Web Search Evaluation


  1. 1. My first question is about how we can make use of this paper's conclusion. The paper concludes that awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data. So if we want to improve the quality of crowdsourced relevance data, we can provide some pre-training for the crowd workers. At the same time, however, if we add additional requirements to the task, fewer people will want to participate. How can we integrate the awareness of different possible intents into the task smoothly?

    2. My second question is about how to improve crowdsourcing quality. Currently we can remove workers who make mistakes above a certain threshold to improve the overall quality. Beyond this method, can we do more? For example, suppose we have a small portion of relevance data from editorial judges and a large amount of crowdsourced relevance data; can we use the editorial judges' data to bootstrap and improve the quality of the crowdsourced relevance data?

    3. My third question is about pairwise judgment. I am wondering what the authors would do if there is a cycle in the pairwise judgments. For example, for documents a, b and c, users identify a > b, b > c, and c > a, so the relevance judgments form a cycle. How can we resolve this kind of conflict to put the documents into a single ordered list?
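
    One common workaround (a sketch, not anything from the paper) is to score each document by its fraction of pairwise wins, Copeland-style. This also makes clear exactly when a cycle forces an arbitrary tie-break:

```python
from collections import defaultdict

def aggregate_preferences(pairs):
    """Turn pairwise judgments (possibly cyclic) into one ordered list
    by scoring each document by its fraction of pairwise wins
    (a simple Copeland-style tally; ties are broken arbitrarily)."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in pairs:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return sorted(appearances,
                  key=lambda d: wins[d] / appearances[d],
                  reverse=True)

# The cyclic example from the question: a > b, b > c, c > a.
# Every document wins once and loses once, so all scores tie at 0.5,
# and the "resolution" is nothing but an arbitrary tie-break.
print(aggregate_preferences([("a", "b"), ("b", "c"), ("c", "a")]))
```

    With an acyclic set of preferences the same tally recovers the intended total order; with a cycle it at least exposes that the data cannot support one.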

  2. 1. In their experiment, the authors drop mTurk workers after 3 inconsistencies with their gold standard of click data. Is it fair and accurate to drop workers if they disagree with click data? What does this say about trust in assessors and the rating process? Could this create issues if an item that appears high in search results is spam, or deceptively looks relevant from the snippet?

    2. The authors still have a problem with assessing tail queries, since it was easier to show assessor accuracy when there was a strong preference in the click data, and tail queries would have no data to compare against. Could the authors' crowdsourcing setup still be used to help system performance on tail queries? If so, how?

    3. In the conclusion the authors state that absolute judgments are better for system evaluation with test collections. Do you agree? How could a series of pairwise preferences be used in evaluating a system?

  3. To measure the strength of click preference, the authors presume that when a user clicks a link, the link is preferred over another link in the adjacent rank position, regardless of the order in which they are presented. I think this approach has an underlying assumption that the user actually sees both links, which is not always true because of different reading strategies (e.g., I always click the first link that is relevant to my query, even though the second link might be more relevant, simply because I choose not to continue reading the list).

    I find the method the authors used to block sloppy or dishonest workers quite useful. They randomly insert gold tests into the normal work; failing a gold test also increases the probability that the next HIT is a gold test, and workers are blocked if they fail 3 gold tests. I think this is an effective way of improving HIT response quality, which is a contribution of the paper in my opinion.
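
    The adaptive scheme described above can be sketched as follows. The base rate and the per-failure boost are illustrative assumptions of mine; the paper only says that a failure raises the probability of the next HIT being a gold test, and that 3 failures lead to blocking:

```python
import random

def next_hit_type(gold_failures, base_rate=0.1, boost=0.3, max_failures=3):
    """Quality-control sketch: decide whether a worker's next HIT is a
    hidden gold test. Each gold-test failure raises the probability of
    drawing another gold test; 3 failures blocks the worker.
    (base_rate and boost are assumed values, not the paper's.)"""
    if gold_failures >= max_failures:
        return "blocked"
    p_gold = min(1.0, base_rate + boost * gold_failures)
    return "gold" if random.random() < p_gold else "normal"
```

    The design intuition is that suspicious workers are tested more aggressively, so honest workers waste few HITs on gold tests while dishonest ones are identified quickly.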

    In the paper, the user click data is the foundation of the experiments. Though the data is taken from three months of click logs, its reliability remains questionable, as the paper does not discuss any steps to remove possible false data, noise, or impossible click paths (e.g., due to faulty JavaScript, bad cookies, or flawed data-gathering logic).

  4. 1. This paper provides an important study which compares click preferences from the query log with judgments from crowdsourced workers and internal judges. My first question is about the calculation of click preferences. To remove rank bias, the authors use click data containing both orderings of the compared documents u and v; in other words, they collect click data where u is ranked ahead of v and also data where v is ranked ahead of u, and assume that in this case rank is not important. However, as far as we know, users usually focus only on the top k results while searching. In this case, if the two documents happen to be at positions k and k+1, users are very likely to skip the latter one, which could lead to the false conclusion that these documents are similar. How do the authors handle this problem?

    2. In Figure 3d, inter-assessor agreement decreases at a large click volume (5k). What is the possible reason for that? Intuitively, click preference can be unreliable when the click volume is low, but agreement at click volumes of 50 and 500 increases with click preference strength, and a large click preference strength should indicate an obvious difference between the two results. So why does agreement decrease in this case? Is there any topic-specific study on this?

    3. My last question is about the purpose of comparing these three different sources: the query log, the judgments from crowdsourced workers, and the internal judges. What is the purpose of comparing these data sets? Also, is there a comparison of the effects of these three data sets on the ranking of IR systems, which is the ultimate comparison we are interested in?

  5. 1. The authors say that to guarantee users saw both documents while indicating a preference of one document over another (3.1.1), they used the heuristic that the documents be next to each other. This seems a little simplistic. Instead, based on the UI, wouldn't it have been more convincing to adopt a looser heuristic, i.e., that the documents be within c positions of one another rather than strictly adjacent, where c documents can fit on a single screen?

    2. While the formulae given in section 3 seem convincing, the authors seem to have borrowed them from previous and related work. I'm wondering if there are caveats we should be aware of in using those measures or if they have become standards. Are there alternate ways of formulating measures like intent similarity? The choices in the section seem reasonable but also a little arbitrary. No explanation seems to be provided on why those formulations are best for this experiment.

    3. Was it a good idea to choose a 5 point relevance scale for absolute judgments? Why would opting for binary relevance judgments not have been adequate?

  6. This paper compared click data to relevance judgments made by different types of assessors - trained and crowd-based. My question is whether it is viewed as a safe policy to use click data, in particular for low-click or low-preference queries and documents. Queries whose documents have small amounts of click data may be queries that never led to producing the "right" relevant documents for the information consumer. How can we determine if this is the case, and how can we control for it subsequently? If the crowd disagrees more with trained judges on the relevance of ambiguously relevant documents, is it safe to discount the judgments of the crowd users and follow the trained judges' input, essentially continuing to evaluate based on click data rather than pursuing a new strategy or considering and testing new documents?

    This is similar to, and expands upon, Jin's question above. The researchers control for rank by selecting adjacent results. However, this does not seem like it would fully remove rank bias. Would not rank need to be controlled for each rank value of each click for each document? The rank of both documents may change over time, which would impact click traffic for both documents in unknown ways. Why can we assume that, because one or the other document appeared on top in different results over time, the rank bias is controlled?

    If pairwise document comparisons permit users to make better comparisons in terms of relevance, why not move to triple-wise comparisons? Is there any research into what amount of information is best to present to the user for optimized "contextualization," as the process is labeled by the authors in Section 4.1.2?

  7. 1. I’m confused by the definition of substitutable (pg. 2) in the two-URL example. How does one judge / evaluate this – does this mean that when a user is presented with two URLs, s/he sees the first, then sees the second, but comes back and clicks on the first? So, is it that in such a case we assume that both the documents satisfy the same intent and thus the user clicked on the one s/he saw first?

    2. Don’t you think this study could use eye tracking to reaffirm some of its metrics? For example, how do the researchers define ‘observed results’ used in their preference strength metric? Is it based on mouse hover, click length, or simply observing the user’s interactions?

    3. In the click-agreement section, the researchers say that the editorial judges agree significantly more with the user click preferences than the crowd workers do. Who are the users that the two judge groups are being compared to; what are their backgrounds and experience? Don't you think that is important information when analyzing the gathered data?

  8. Click preference has been used as a criterion to test the relevance judgments, yet there is little evidence that the click preference data generated in this research represents the real needs of users. So, is it possible that the click preference data is actually unreliable?

    How were the tasks assigned to crowd workers? If the crowd workers are unfamiliar with some topic or passage, is it possible that the accuracy of their judgments will be impacted?

    Is there any benchmark for interpreting the agreement? For instance, how large would the Fleiss' kappa value need to be to indicate that agreement is acceptable or excellent in a study?
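
    There is a widely cited (if admittedly arbitrary) convention, the Landis and Koch benchmarks: kappa of 0.41-0.60 is "moderate", 0.61-0.80 "substantial", and above 0.80 "almost perfect" agreement. For reference, Fleiss' kappa itself can be computed from a ratings table, as in this sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters who assigned item i to category j (same rater count per item)."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # raters per item
    k = len(ratings[0])              # number of categories
    # Mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Chance agreement P_e from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

    Perfect agreement yields kappa = 1, while systematic disagreement drives it toward (and below) 0, so a single threshold always needs the Landis-and-Koch-style caveat that it is a rule of thumb, not a test statistic.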

  9. 1. In the other article we read for this week (Scholer, Kelly, Webber), the authors suggest that an assessor's judgment when determining the relevance of one document is heavily influenced by the relevance of the previous documents viewed, and that this bias causes mistaken relevance judgments. However, the authors of this study argue that viewing other documents (in the form of a pairwise display) provides an assessor with context that is useful for correctly determining relevance. Am I correct in assuming these studies are opposed in this way? If so, which study do we find more credible?

    2. In the conclusion of this study, the authors state: "For many IR applications, the absolute judgments are more useful, for example to create a reusable test collection." Previous studies we have read suggest that the high cost of creating a test collection is often justified by the fact that these collections can be used many times. This statement also brings to mind the issue of test reproducibility, which we have previously discussed. What would a cost-benefit analysis for building a pairwise test collection look like?

    3. Is the authors' definition of duplicates-- "two URLs [which] are likely substitutable if, when presented with both, users click whichever they see first..."(p. 2)-- what is typically meant by "duplicate" throughout other Information Retrieval studies? I had assumed duplicate meant documents which were (or were almost) word-for-word matches, or even the same URL altogether.

  10. 1. We have seen how click-through rate is not a sufficient metric for determining user satisfaction or even topical relevance. The implementation deals only with query analysis rather than investigating the user's information need. Doesn't this imply an oversight, especially when assessing user intent, where we need to treat relevance with respect to an information need rather than with respect to a query?

    2. Although the paper does propose multiple metrics to understand click preference, it does not seem to take into consideration the ranking of the relevant documents by the IR system, and the fact that the initial position (relative rank) of the documents rendered as 'relevant' would definitely impact the user's click preference. How would the proposed methodology take this initial ordering of relevant documents into account?

    3. The investigation makes use of stratified sampling when analysing the session log details. It is difficult to identify appropriate strata for any study, especially when the information cannot be exhaustively partitioned into disjoint subgroups. Since this is the case here, where it is extremely tedious to segregate user intents exhaustively, wouldn't it make more sense to use an F-test instead?

  11. Q. The gold set has been determined with a component of randomness embedded in it. Thus, the decision to block workers based on such a gold set seems a bit inappropriate.
    Q. The authors mention that "assessors agree more with each other and with users when click based evidence suggests stronger preference for one search result over another." Will the inverse be true as well? I think they were able to check this hypothesis because the data covered queries whose click volumes fell into a huge range. But testing the inverse would be a challenge, as the results would be hard to prove where the gathered click data is not significant, and intuitively the inverse does not always seem to hold.
    Q. The way judge calibration has been defined in the paper does not seem good enough to be used to filter workers in crowdsourcing tasks. An improvement could be to compare workers' labels with the judges' labels as well, and not just with other workers' labels. Using this as the reference for comparing workers might return better results.

  12. Kazai et al. bring up judge training several times throughout this paper as a possible explanation for the trained judges' higher agreement. Is there anything to suggest that the agreement between trained judges is not related to their shared training? Or might it be due to the overall experience they've gained as assessors versus crowd judges?

    In section 4.3, it is pointed out that judges disagree more when asked to judge web pages that are similar in topic. The word “random” is attributed to how the judges go about actually choosing one page versus another. Is it really a random choice or might there be some underlying reason for choosing one over another?

    Given the nature of crowdsourced workers would the suggestion made by Kazai et al. to utilize a pairwise UI for relevance judgments make sense? What I mean to say is how fatigued or possibly disinterested would crowdsourced workers be when having to deal with a pairwise UI to complete the HIT?

  13. 1) The authors run with the idea that preference judgments offer much lower assessor disagreement levels than absolute judgments. However, in terms of overall system evaluation are preference judgments anywhere near as useful as absolute judgments?

    2) When the authors discuss strength of click preference, they indicate measuring the proportion of times one result is clicked ahead of another. Why does this measure only take into account the case where u is above v? For completeness, shouldn't they also swap the ordering and combine both sets of results?

    3) It is interesting that for a volume of 500 clicks, judges actually agree less with higher click preference strength. They mention this briefly in the paper, but I could not figure out any rationale for this. 50 and 5k have similarly sloped lines, but 500 is the exact inverse? Why might this be?
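
    The symmetric version suggested in question 2 above is easy to state: pool the impressions from both presentation orders and take the fraction of preference clicks that favored u. A minimal sketch (the counts are hypothetical inputs, not the paper's exact formula):

```python
def preference_strength(clicks_u_over_v, clicks_v_over_u):
    """Symmetric click-preference strength for a document pair (u, v),
    pooling clicks from both presentation orders: the fraction of
    preference clicks that favored u. 0.5 means no preference."""
    total = clicks_u_over_v + clicks_v_over_u
    if total == 0:
        return 0.5  # no evidence either way: treat the pair as tied
    return clicks_u_over_v / total
```

    Pooling both orders is precisely what cancels position bias in expectation, since each document gets to occupy the favored slot in part of the data.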

  14. 1. For the single judging based HITs, the paper created gold labels by “labeling the preferred URLs as relevant and the randomly picked URLs as irrelevant”. However, the randomly picked one might also be relevant. Was their method reasonable here?
    2. To block the “bad” workers, this paper adopted a gold test: “Failing a gold test increases the likelihood that the next HIT assigned to the judge is also a gold test”. What value does this likelihood take, and by what criterion is it set?
    3. In Table 1, only the cost of crowdsourcing is listed. There is no cost information about the editorial judges. If their cost is higher than that of the crowd workers, is the high agreement among the editorial judges worth it?

  15. 1. The authors state that they collected data from 18 months of user clicks on Bing, but limit this data to a set of specific scenarios for the purposes of their study. How large was the original data set and how much did their filtering reduce the pool of user data they were drawing from?

    2. How were the crowd-workers instructed to perform these HITs? Was there any sort of special instruction or training given to the workers to prepare them for the judgment process?

    3. Related to the Scholer et al. article, would priming the judges and/or crowdworkers have aided inter-judge agreement? Perhaps giving the judges/crowdworkers a small set of trial judgments to work through would have given them a better idea of what they were trying to do.

  16. 1. The authors use Jaccard similarity to measure the relationship between two URLs. What is the Jaccard similarity? What are its properties?
    2. Equations 2 and 3 are very similar. What is the relationship between them? Can we say Iuv is a subset of Ruv? Or can they be derived from each other?
    3. It is mentioned in section 3.3.3 that workers failing 3 gold tests would be blocked. However, it is possible that a worker made just a few honest mistakes and unluckily failed those 3 tests. Is this bar reasonable?
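
    To the first question: Jaccard similarity measures the overlap of two sets as the size of their intersection divided by the size of their union, ranging from 0 (disjoint) to 1 (identical). A minimal sketch; applying it to the sets of clicked queries for two URLs is my illustrative assumption, not necessarily the paper's exact feature choice:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|.
    Two disjoint sets score 0; identical sets score 1."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# e.g. two URLs sharing 2 of 4 distinct queries that led to clicks
print(jaccard({"q1", "q2", "q3"}, {"q2", "q3", "q4"}))  # 2/4 = 0.5
```

    Its key property is insensitivity to set size per se: only the proportion of shared elements matters, which is why it is a common choice for comparing documents or URLs by their feature sets.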

  17. 1. In this article the authors discuss mathematical models that help to determine the similarity between two documents in a collection and if those two documents could be considered duplicates. However in the Scholer et al. article we read this week we saw that they used another method of determining if two documents were duplicates by using one as the query in a search engine. Which of these methods do you think is the better method of finding duplicate documents? What are the benefits and drawbacks of each?
    2. In this article the authors use a group of crowd-sourced workers from Mechanical Turk to judge documents. They attempt to eliminate the problem of lazy or dishonest workers by occasionally inserting known gold tuples in for judgment. If a worker fails three of these gold tests then they are not rewarded for their work. However what about the user that is lazy yet luckily passes the gold tests by guessing? How much of a problem could this user cause on their results? How would you go about modifying their technique so that lazy workers could not get by just by luck?
    3. In the conclusion of this paper the authors state that the lower effectiveness seen from crowd workers compared to professional judges is due to their lack of experience. They also state that crowd workers do not have as good an understanding of the user’s needs as professional judges do. Can you think of a way to explain a user’s needs to crowd workers better, and to help them benefit from the knowledge that experienced judges have, that would help to overcome or at least minimize the problems crowd workers face?

  18. 1. My first question is regarding the reliability of the click data. In the conclusion, the authors state that they have not discussed the reliability of the click data. Why might the data be unreliable, and why should such a question even arise? The article deals with click data from the query logs of Bing, which is a standard search engine with a normal distribution of users. Is there any way in which we can test the reliability of such data?

    2. In the related work section, the authors state that recent research reported random behavior by dishonest workers in crowdsourced methods, and that their work builds on and complements those works by studying why relevance judgments may disagree with user clicks. But the authors never convincingly explain how they mitigate or overcome the issues stated above. If crowdsourced workers tend to behave more randomly, then how does their work contribute to the research accurately? How have they taken that into account in their research?

    3. The research was conducted with 24 professional judges and 286 crowd workers assessing the same number of documents. The ratio seems confounding. Do professional judges differ so much in skill compared to crowd workers? If so, does it not make sense to hire more professional judges? Assuming they cost more than the crowd workers, does the additional cost benefit the research? Or do the crowd workers produce acceptable results at a cheaper cost?

  19. 1. Web users and crowd workers: is there really a distinction here? Aren’t the crowd workers a sample of the web user population? Why is it that we see differing trends?

    2. The way click preference is defined it encodes pairwise preferences. Then is it really surprising to see more agreement when measuring agreement in the pairwise interface?

    3. The quality assurance process adopted to enable worker blocking is quite naïve. I suspect this to be the reason for not seeing comparable results. Maybe, better modeling or task design can enable better inclusion strategies.

  20. I don’t understand how two URLs could be considered the same if users show no preference for clicking one over the other. In a ranked document list, users stop when they feel their information need has been satisfied. How is this related to two URLs being related or almost the same?

    The authors state that the URLs are randomly assigned to either side of the display. However, the authors also mention that the probability of a URL being shown on the left side is 0.54. Are these statements inconsistent with each other?

    Inserting some gold-set tasks as HITs is indeed a nice way of catching dishonest workers. However, I am not sure 3 is the right number in all situations; it depends on the number of HITs expected from each crowd worker. Statistics on how many workers were dropped would have given some insight into how effective the technique was.

  21. 1) The paper is not clear about the stratified sampling employed, nor is it clear whether multiple attempts were made in order to achieve a proper number of samples. In your opinion, how much does the lack of this information affect the validity of the results?

    2) When discussing the preparation of the data, Kazai et al. state that there is a 52% and a 54% probability of showing a page on the left side at different phases of their experiment. Their goal is to remove any bias induced by position, hence a reasonable thing to do would be to have equal probabilities of showing a page on the left and on the right. Why did they settle for a left-biased approach?

    3) I am a bit confused about the experimental setup: is each participant in the experiment given only one of the three possible tasks, or a series of them? They mention that the purpose of the serialized pairwise task is to lessen the consequences of having a participant judge two documents at the same time instead of one (as in task one). Also, how is task three supposed to lessen this problem?