Wednesday, September 18, 2013

26-Sep B. Carterette, I. Soboroff. The Effect of Assessor Errors on IR System Evaluation. SIGIR 2010.


  1. 1. Why did the authors choose to develop mathematical models of possible user behavior first instead of observing the actual behavior of Turk users and then creating models? What advantages or disadvantages are there to this method?

    2. One of the solutions the authors present is to have some documents judged multiple times (p. 7). However, in the Saracevic article we read last week, he discovered that when multiple assessors judged documents, there was not much overlap. What methods using multiple assessors do you think are effective to minimize the problem of assessor errors?

    3. After conducting their simulation, the authors conclude that “it is generally better to underestimate relevance than to overestimate it” (p. 5). Under what conditions is this true? Would overestimating relevance produce more “noisy” results and more “marginally relevant” documents? How would this affect the user’s experience?

  2. 1. It is interesting to see how different 'emotional' states like fatigue and pessimism are modeled using these probabilistic models. The models were simple yet predictive, but they do not account for one thing: the switching of states. For example, it is not hard to imagine that a worker starts out as random or bored (employing an alternating judgment pattern) but then gets tired of that as well and starts judging everything the same (most similar to an optimistic/pessimistic model). Assuming one were to build a more sophisticated model where such switching behaviors were observed, what would it take to determine when the switching occurred? Wouldn't that have to have a probabilistic model as well?
    2. The pessimistic assessor seems to be the most accurate, as observed by the findings. The logical explanation behind this seems to be that there are far more non-relevant documents than relevant and therefore, the pessimistic assessor performs better than the others. However, if the distribution is not skewed, the pessimistic assessor would be expected to perform no better than the optimistic assessor. This led me to wonder, are there any such tasks where there are a comparable number of relevant documents as non-relevant, and where the search problem becomes more involved, like returning a diverse set of relevant documents? Or is it always the case that nonrelevant documents outnumber the relevant ones by orders of magnitude?
    3. I also wonder if, once we fit a population of users or crowd workers to one of these models, it is possible to correct for their errors? For example, if we have reasonable confidence in a choice of priors, and from query logs and timestamps, we are able to 'fit' a set of users into one of these models, can we increase or decrease the relevance estimates and then correlate that with the true relevance judgments? It should be interesting to see if search engine rankings are as robust to crowd sourced variability as they are to variability among domain experts. If not, then would it be fair to conclude that Voorhees's findings are limited to cases where the judge groups are not only uniform but have some degree of expertise (at least as much as the University of Waterloo group, who were less expert than the original judges)?
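
On the second question above, a back-of-the-envelope sketch may help show why skew matters. It assumes a caricature that is not taken from the paper: a pessimistic assessor flips each truly relevant document to non-relevant with probability `flip_rate`, an optimistic one flips each truly non-relevant document to relevant with the same probability, and quality is measured by plain label agreement rather than by system rankings. The function name and the numbers are hypothetical.

```python
def expected_agreement(prevalence, flip_rate, assessor):
    """Expected fraction of labels that match the true labels.

    prevalence: fraction of documents that are truly relevant.
    flip_rate:  probability that the assessor flips a label it is biased against.
    assessor:   "pessimistic" (relevant -> non-relevant errors only) or
                "optimistic" (non-relevant -> relevant errors only).
    """
    if assessor == "pessimistic":
        # Only truly relevant documents can be mislabeled.
        return 1 - flip_rate * prevalence
    if assessor == "optimistic":
        # Only truly non-relevant documents can be mislabeled.
        return 1 - flip_rate * (1 - prevalence)
    raise ValueError(f"unknown assessor type: {assessor}")


# Hypothetical prevalences: a skewed pool and a balanced one.
for prevalence in (0.05, 0.5):
    pessimist = expected_agreement(prevalence, 0.5, "pessimistic")
    optimist = expected_agreement(prevalence, 0.5, "optimistic")
    print(f"prevalence={prevalence}: pessimist={pessimist:.3f}, optimist={optimist:.3f}")
```

Under these assumptions the two assessors tie exactly when half the documents are relevant, and the pessimist pulls ahead as relevant documents become rarer.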

  3. 1. We have discussed (debated) the significance of context and location of the assessor on relevance judgments. Studies have shown that a majority of MTurk evaluators are focused in one geographic region. Then, (how) can we extrapolate the results – their judgments – to represent another context?

    2. The paper describes ways to adjust / correct assessor errors (section 4). Can you think of ways studies can limit (drastically) varying assessor types from the onset? Is this a realistic stance? Do guidelines like the Google document we read last week push users to function as per expectations?

    3. To save costs, do you think many rejudgements on a single document, and using a smaller test collection (with fewer documents), is a way of ensuring better precision of relevance judgments?

  4. The authors talk about the types of assessor models: the optimistic, the pessimistic, the lazy/over-fitting and the Markovian assessor. I think I would start the day out as an optimistic assessor, then a lazy one around 1-3pm, and then after that, depending on my day, I'd either be a pessimistic assessor, or a lazy assessor again. And I probably wouldn't know if I was a Markovian assessor. Have there been any studies on how the time of day affects assessors? Do the optimists and pessimists cancel each other out?

    If the primary motivation is to reduce cost, then what effect does that have on the authenticity of the interactions between data collections, the users, and the systems' algorithms?

    Number four in the authors' 'broad trends of assessor behavior' is that across large sets of topics, assessors vary in the proportion of documents they judge relevant. Wouldn't that be contingent on the number of actual documents that are relevant, more so than the assessors? Or in these data sets is there a predefined, equal number of relevant documents? I don't understand why this is being put on the assessors, vs. being a result of the documents themselves.

  5. 1. The authors look for evidence of autocorrelation in the judgements, which is
    the tendency for a document to be judged the same way as the previous document.
    They found evidence of autocorrelation, and they defined it as:

    P(j_i = 1 | j_{i-1} = 1) = 0.22 → autocorrelation
    P(j_i = 1 | j_{i-1} = 0) = 0.18 → no autocorrelation

    There seem to be two problems with this:

    - This doesn't account for the probability that a document was *not* relevant,
    given that the previous *was* relevant, i.e. P(j_i = 0 | j_{i-1} = 1)

    - A difference of 0.04 doesn't seem like a lot -- but they say this is
    significant by a two-sample two-proportion test. What does that mean?
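
On the second point, a two-proportion test asks whether two observed rates could plausibly come from a single underlying rate, and its answer depends heavily on how many observations stand behind each rate. The sketch below is one common form of such a test, a pooled z-test in plain Python. The counts are hypothetical, chosen only to reproduce the rates 0.22 and 0.18, since the sample sizes are not given in this comment, and the function name is made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 the best estimate of the shared rate pools both groups.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: same rates (0.22 vs 0.18), very different sample sizes.
z_small, p_small = two_proportion_z_test(22, 100, 18, 100)
z_large, p_large = two_proportion_z_test(2200, 10000, 1800, 10000)
print(f"n=100 per group:   z={z_small:.2f}, p={p_small:.3f}")
print(f"n=10000 per group: z={z_large:.2f}, p={p_large:.2e}")
```

With these made-up counts, the same 0.04 gap is far from significant at 100 judgments per group and overwhelmingly significant at 10,000 per group, which is why a small difference can still pass the test on a large log.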

    2. Section 2.1 talks about model distribution and priors. My understanding of
    this section is that they take the correct relevance judgements, and then they
    look at the average numbers that would change (either non-relevant to relevant
    or vice versa) based on different patterns of behavior. If this is the case, I
    don't understand how that ties into the different distributions and parameters.
    Is my understanding correct? If so, how do the parameters relate to the
    experimental controls?

    3. This study seems to have done a lot of good work in characterizing models
    for different types of undesirable behavior among relevance judges, but the
    real question is whether these behaviors actually affect our assessment of
    search engines. Their conclusion and future work section don't seem to address
    the real question: are we incorrectly judging some search algorithms as better
    because of poorly designed test collections? Am I just missing something?

  6. In mining a large log of assessor interaction data from the TREC 2009 Million Query track, why did the authors exclude inter-judgment times in excess of 200 seconds?

    When attempting to explain the assessor models, the authors mention that topics completed faster than usual are more likely to be connected with assessor errors. How is the usual speed of judging defined here? Is it possible that some assessors judge topics faster than usual because of their own characteristics, such as their education level, IQ, or age? Moreover, is it possible that some assessors judge topics faster later on because of a learning effect, either because they become more familiar with some topics or because, with practice, they find a more efficient approach to judging topics?

    Having read this paper, I have a question: how many tasks should be assigned to assessors at a time when we use crowdsourcing tools to gather relevance judgments?

  7. The authors list several trends that are used to model assessor behaviour, and the first one says that the time between judgements is higher initially than when the assessors are judging the last documents. Though the assumption of independence of the retrieved documents is being violated, it has to be taken into account that the assessor learns to better distinguish between relevant and non-relevant over several documents. So attributing the decreased judgement time to factors like fatigue and mood will only be a partial justification. Perhaps modelling how users/assessors learn to distinguish between relevant and non-relevant documents implicitly (by looking through several related documents) will complement the aspects covered in the paper.

    It is surprising to see that pessimistic modelling leads to a better correlation value than optimistic modelling. This could be the reason why the raters were asked, in Google’s ‘Search Quality Evaluator Guidelines’, to choose the lower relevance value when in doubt.

    How are the alpha and beta values of the Beta distribution being selected? When the fatigued assessor is modelled the paper mentions alpha and beta values to be 0.05 and 1 respectively, whereas for the other models, alpha and beta are chosen to be 16 and 1 or 1 and 16. Why was this change necessary?
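
For intuition about what those parameter pairs do, the mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta), so swapping the two values mirrors the distribution around 0.5. The sketch below only evaluates that textbook formula and the matching variance for the pairs quoted in the question. It says nothing about how the authors picked them, and the helper names are made up for illustration.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Parameter pairs as quoted in the question above.
for alpha, beta in [(16, 1), (1, 16), (0.05, 1)]:
    print(
        f"Beta({alpha}, {beta}): "
        f"mean={beta_mean(alpha, beta):.4f}, "
        f"variance={beta_variance(alpha, beta):.4f}"
    )
```

A draw from Beta(16, 1) is almost always close to 1, a draw from Beta(1, 16) is almost always close to 0, and Beta(0.05, 1) piles nearly all of its mass very near 0.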

    In reference to – “non-expert assessors judging domain-specific queries make significant errors affecting system evaluation” there are two issues that have not been talked about: defining domain and defining expertise level. Though these can be simulated in the lab experiments (under controlled conditions), the question as to how this correlates to the general user population still needs to be answered.

  8. 1. As we read about all of these isolated studies of search engine assessors which focus on different variables, I find myself wishing that the studies were in better dialog with one another. It would be interesting to see a more complete study which synthesized various elements from different experiments into a more comprehensive analysis. For example, wouldn't it be beneficial to combine this study with that of Kazai, Craswell, Yilmaz, and Tahaghoghi, simulating intersections between the different assessor classifications (for example, examining one professionally-trained assessor who is unenthusiastic, one crowd worker who is Markovian, etc.)?

    2. The authors of this article are very critical of assessors, describing for instance the "unenthusiastic assessor that alternates between non-relevant and relevant judgments to stay amused" (p. 5). This is not what a real Web user would do when conducting research or other types of searches, so why would we bother to simulate this type of assessment? What are some ways in which we can increase assessor performance to reflect real Web users? Should we, for example, give unenthusiastic or topic-disgruntled assessors an option to quit their assessments early if they feel what they are doing is lazy or not going to be useful for a study? Or would this not be realistic for crowd workers who are simply trying to complete the job for financial motivations unrelated to the purpose of the study?

    3. The authors of this study are operating under the assumption that it is "generally better to underestimate relevance than to overestimate it" (p. 5), and under this assumption they conclude that the best results would be obtained if "we could require a supermajority of positives to call a rejudged document relevant. Thus it would take two of two judgments, or two of three...being relevant before we are confident in concluding that a document is really relevant." (p. 7). But if all of these judgments are so flawed (as I describe in my second question), that is going to create a high number of false negatives. I disagree with the authors' claim that a false positive is worse than a false negative: the worst thing that could happen with a non-relevant document being judged as relevant is that a user is annoyed (which is a concern for commercial search engines). The worst case scenario with a relevant document being judged as non-relevant is that information is not accessible (which is a concern for academic, professional, or specialized research search engines).
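
The trade-off described here can be put in rough numbers. The sketch below assumes, purely for illustration, that each judgment is independently correct with probability 0.8 (a made-up figure, not one from the paper) and computes how often a document collects at least two positive judgments under the two rules quoted above. The function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k positives among n independent judgments,
    each positive with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ACCURACY = 0.8  # hypothetical per-judgment accuracy, assumed for illustration

for n in (2, 3):
    # A truly relevant document is judged positive with probability ACCURACY.
    false_negative = 1 - prob_at_least(2, n, ACCURACY)
    # A truly non-relevant document is judged positive with probability 1 - ACCURACY.
    false_positive = prob_at_least(2, n, 1 - ACCURACY)
    print(f"2 of {n}: false negative {false_negative:.3f}, false positive {false_positive:.3f}")
```

Under these assumptions, requiring two of two positives loses roughly 36% of relevant documents while admitting about 4% of non-relevant ones, whereas two of three gives about 10% for both error types. The independence assumption is a strong one, and correlated assessor errors would change the picture.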

  9. 1. The paper provides evidence that certain assessor behaviors could have an effect on the relative ranking of IR systems. However, their assessor simulations appear to be based on very simplistic models of assessor behavior. On what basis were the assessor archetypes created for this study? I understand that there is some descriptive work on assessor behavior in chapter 2, but it doesn't really seem to be related to the categories created for the experiments.

    2. While it is true that the absolute ranks of each IR approach change with different assessor simulations, generally the best approaches stay at the top end of the pack and the worst at the bottom. Given this behavior, is it possible that simulated assessors could replace human assessors? What are the main impediments on the path to bringing simulated assessments up to the quality of human assessments?

    3. It seems to me that these types of experiments would be very vulnerable to certain biases in the underlying document pool (i.e. large numbers of non-relevant documents in the pool would benefit pessimistic assessors). What was the pooling methodology used for the TREC Million Query Track?

  10. 1. The paper presents a very important study on the effect of assessor errors in relevance judgments with crowd-sourcing. My first question is about the error model they created. The probability p that a document is relevant is simulated with a Beta or Gamma distribution. It’s understandable that a certain model has to be used for the study, but is there any reason why Beta and Gamma distributions are used? Both distributions have two to three parameters, so how are these parameters chosen? In the initial study they exclude inter-judgment times in excess of 200 seconds. Is there any reason for doing that? Or even, is there any reason why we care about time?

    2. Eight different error models have been proposed for this study. Is there any real study that supports these models? And since topics are judged by different people with different error models, is there any study that evaluates the combinatorial effects of different error models on the final results? Can we estimate the proportion of assessors with different error models in the real world? Since at the end of the paper it is shown that the calculation of statAP is fairly robust, is it possible that all these problems discussed in the paper will disappear as long as we use enough topics, assuming the proportion of irrational assessors is low?

    3. The error models assume a binary relevance judgment criterion. However, as is addressed at the very beginning of the paper, the real judgment has several levels: non-relevant, relevant, highly-relevant, and related. What would the results be if non-binary relevance judgments were incorporated in the error model? Would this make the evaluation better or worse? Also, since some of the errors are topic specific, will it help if we remove those topics from the test collections, or assign specific assessor groups for those topics? And will it help if we set up some tracking/feedback system for each assessor, which might help reduce the potential errors?

  11. 1. The authors suggest quantifying assessor behavior for better evaluation of IR systems. But how can accurate quantification be possible in the case of crowdsourced assessment of relevance? How accurate can this quantification be, as the assessor’s behavior might also depend on a lot of external factors?

    2. The authors show how pessimistic models have better correlation. It has also been observed that random assessments have an observable probability of correlation. Is this the reasoning behind why extensive efforts have not been made to resolve the ambiguity between multiple levels of relevance? Can this philosophy be scaled to a scenario where there are a large number of assessors assessing relevance for the same number of topics?

    3. The authors also fail to discuss which documents are judged multiple times. They mention that multiple judgements and majority opinion might be taken into account to improve the accuracy. Although it does make sense, it would be clearer if we could discuss what kinds of queries can be re-judged, so as to minimize the cost overhead.

  12. 1. Assessor Context: Carterette and Soboroff’s models overlook some of the complexities of assessor (e.g., human) behavior. This is necessary, of course – as Box (and Mooney) state, “All models are wrong, but some are useful”. How do they account for contextual matters, such as assessor expertise, in deciding whether they are “optimistic” or “pessimistic,” and in determining whether or not the assessor made an error? Side question: How do assessors judge documents if they do not know whether or not the document is relevant? It sounds like this may happen quite often in crowd-sourced work (See Kazai et al)… are such judgments predictable? It seems as though they could behave similarly to several of these models.
    2. Redefining errors: Perhaps we are debating the “right” way to do things too much. Going back to the “user satisfaction” aspect of search engines (to keep users coming back, rather than switching to other search engines): Should not search engines evaluate user biases against those of different “assessor models”? Why not just tweak results such that particular users will find them more useful – a pessimist may want only “pessimistically assessed” documents, whereas an optimist may be more eager to explore the many relevant findings displayed by a model built upon “optimistic assessment.” In such a situation, is “error” actually bad, or simply misallocated?
    3. This study addresses variance in relevance judgments caused by assessor error in binary judgments. The question comes up, again, of how their results might have differed if they had used an interval-based judgment framework. This might require more proactive behavior from assessors, and make assessments using MTurk more accurate and/or reliable. In addition to this, have scholars attempted to incorporate partial relevance in terms of secondary interpretations of queries in analyses of assessment variance?

  13. 1. The simulation procedure reassigns labels randomly, whereby there is no way to control correlation between the original and assigned label set. To account for this, the authors show a measurement interval over 25 trials. Is that enough? I would have liked a label correlation measure to be presented, or even incorporated in the simulation procedure.

    2. Parameter sweep on priors: how meaningful is such an analysis with cherry-picked cases? The contour maps of Figure 4 definitely seem to be a better representation; I would have liked to see similar plots for the other assessor models to better understand the generalizability of the cherry-picked cases.

    3. I understand the motivation to get redundant judgments on documents with low probabilities of inclusion; however, I think the conclusion of the analysis is questionable because of the unrealistic nature of the experimental setup.

  14. 1. The paper proposed making use of crowdsourcing for acquiring relevance judgements when no true relevance labels are available. Given that, the labels provided by the workers could be affected by their inexperience, and this would result in quality variation. How do we deal with these noisy labels that cause inaccuracy, especially when workers' judgements are bound to be varied? The paper states design costs as an issue; however, my immediate concern is the fact that simple majority voting continues to be used as a metric in this evaluation. Wouldn't we be compromising on data quality assurance through this practice?

    2. When making use of the Bernoulli distribution, only the presence and absence of quality terms are modelled. This distribution doesn't really capture the different frequencies of the query terms and ignores multiple occurrences of terms in text. Isn't this incorporating a bias, as a document may re-state terms only to emphasize the term's importance in the document? Further, the presence of different terms is expressed by probabilities which are independent of one another. Doesn't this affect the performance of the IR system, especially with document retrieval, as semantics and grammar are given a miss?

    3. Even if we use the varied assessor models (for instance the pessimistic, lazy, disgruntled, optimistic or Markovian model) to judge the documents, wouldn't the sheer increase in the number of diverse participants expose the system to spam infiltration, as there is close to no control placed on the user's environment? Also, how do we hope to reach a reasonable tradeoff between ensuring reliability and certifying scalability of data when we are also attempting to optimize time overheads of the Human Processing Unit and the CPU?

  15. 1. The authors hope to explore the effects of assessor error on the results of experiments. In the process, the authors develop numerous different models depicting the different assessor mentalities that can lead to errors. The models all seem to make sense and capture realistic behavior. For instance, if I get too many C answers in a row on an exam I begin questioning my results and may go back and change some of the answers. The authors provided a Markovian model that captures this type of human behavior. To judge the impact of each of these models, the authors introduce a number of probability-based equations, all of which need adjustments depending on the model. In the end, the authors concluded that it is possible for assessor error to have a significant impact on the results of a study. In past papers, authors have mentioned the cost of developing a test collection and the lack of funding for relevance judgment gathering tasks. To cut costs, IR evaluators have turned to crowd sourced platforms; however, there is a greater risk of incurring assessor error in an uncontrolled and unregulated crowd sourced environment. The authors do outline some steps to take to combat assessor error, but I would think this extra cost would counteract any cost saving benefits gained from using crowd sourced workers. Although it would not appear so at first, would it be more cost effective to go ahead and obtain relevance judgments in a controlled laboratory setting?

    2. The authors suggest using multiple relevance judgments to counterbalance errors by assessors. As a result of the evaluations the authors performed, they were able to discover that pessimistic assessor errors have less of an impact than optimistic assessor errors. They tie this conclusion into their suggestion by proposing a supermajority voting schema for multiple relevance judgments. The authors go on to outline how to select which documents could be candidates for multiple judgments. Given that IR evaluations started to use crowd sourced workers in part because of the cost, couldn’t the experimental standard be, for all documents, to receive multiple judgments? This would, however, open up the possibility that multiple assessor errors are made on a single document and the document could still be categorized incorrectly.

    3. The authors decided to use statAP to calculate system rankings. StatAP samples a set of documents and then estimates what the average precision is going to be. The authors chose statAP because they found the other measure, EAP, to be biased. When there is an erroneous judgment, then the number of relevant judgments statAP predicts can be noticeably off. When trying to combat assessor error, the authors suggest treating relevance judgments as a QA problem. To explore this idea, the authors introduced bad data and checked the impact on the system rankings. In the end, 40% of the topics were changed before the correlation fell below an acceptable amount. Since all systems would be evaluated using the same relevance judgments and statAP seemed to handle errors well, is there any need to explore methods of correcting judgments or approaches to gathering better relevance judgments?

  16. This comment has been removed by the author.

  17. This comment has been removed by the author.

  18. In Section 2, the authors examined the first 32 judgements per topic and excluded inter-judgement times in excess of 200 seconds. The authors have not explained why they chose the first 32 judgements instead of other alternatives which might be more representative of assessor behaviour (for instance, selecting 32 random judgements). Also, excluding those judgements in excess of 200 seconds is a selection bias in my opinion, as for some topics a user might need extra time to search for the meaning of the keywords involved.

    Another potential issue with the paper lies in Section 3.1. The authors created the test collection based on lightly-judged topics. I am wondering whether the findings are applicable outside the TREC Million Query Track. The paper would be more useful if it covered more test collections.

    In Section 3.1.1, the authors describe the statAP method, which takes inclusion probability as a parameter. My question is how these inclusion probabilities are obtained. I could not find this anywhere in the paper.
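
As general background rather than an account of the paper's procedure: in sampling-based evaluation, a document's inclusion probability is the chance that the sampling design selects it for judging, so it is fixed by whoever draws the sample rather than estimated afterwards. The sketch below is a generic inverse-probability-weighted (Horvitz-Thompson style) estimate of the number of relevant documents, not the statAP estimator itself. The function name and the sample values are invented for illustration.

```python
def estimate_relevant_count(sample):
    """Estimate the total number of relevant documents in the full pool.

    sample: iterable of (is_relevant, inclusion_probability) pairs, one per
    judged document, where is_relevant is 0 or 1.
    """
    # Each judged document stands in for 1 / inclusion_probability documents.
    return sum(is_relevant / prob for is_relevant, prob in sample)


# Invented sample: two commonly sampled documents and two rarely sampled ones.
judged = [(1, 0.5), (0, 0.5), (1, 0.1), (0, 0.05)]
baseline = estimate_relevant_count(judged)

# Same sample, but the judgment on the rarest document is flipped to relevant.
with_error = estimate_relevant_count(judged[:-1] + [(1, 0.05)])

print(f"baseline estimate: {baseline:.1f}")
print(f"estimate after one flipped judgment: {with_error:.1f}")
```

Because the weight is the reciprocal of the inclusion probability, a single mistaken judgment on a rarely sampled document moves the estimate far more than the same mistake on a commonly sampled one.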

  19. In section 2.2, the unenthusiastic assessor is characterized by either naming everything as non-relevant or approaching judgments with a set pattern in mind. Assuming the assessor is detached from assessments, how many different patterns are employed when not judging everything as non-relevant and do they make a difference?

    Carterette points out that the information gleaned from the simulation results does not mean that assessors should be trained to be pessimists. But if leaning toward a pessimist model would produce more accuracy in ranking systems, would that not be one method to explore in a non-abstract setting?

    Section 4.1 introduces the idea of multiple judgments and mentions the cost associated with having multiple judgments take place. With certain models such as the pessimistic assessor available, how feasible would it be to use the standard human judges for making relevance judgments and then utilize a model to settle possible assessor errors?

  20. 1. The authors discuss the process of creating computer judges for relevance, but how can a computer account for all of the various factors that go into a person's relevance judgement of a document?

    2. How did the authors arrive at their decision to include the 8 judge models in their tests? What factors led to them selecting these as examples of human judges?

    3. The authors mention the idea of quality assurance in relevance judgments. If models can be made for extreme cases of judgments, could models not be used as more of a baseline judgment to use in addition to more traditional human judgments? In essence acting as an additional judge.

  21. 1. The authors discuss various models to assess the assessor’s behavior and to quantify their errors. But these models are applied under highly controlled situations where the judgments that were altered for simulation purposes had no relationship to the actual documents. Thus the models seem to portray results in an unrealistic world of IR systems. How well will it work in a real case scenario? Will it scale up to work for at least optimal use-cases if not most of them?

    2. When analyzing the assessor’s behavior and modeling it, it was stated that there is an interaction between the assessor and the document and also between the assessor and the topic. Are these related mainly to the assessor’s domain expertise/topic knowledge? If yes, then how is the scaling done? If not, then what interaction is the author referring to?

    3. The final section that talks about “multiple judgments” discusses controlling error and improving the quality of judgments by assessors by re-judging certain documents and voting on the judgments in case of disagreements. This has left two questions open-ended. First, how do we find out whether a document requires re-judging? For a given topic and a query, how many such documents can be re-judged? Secondly, the authors haven’t discussed who has to participate in the majority voting technique. If it involves the same assessors then it would be biased; if not, then it incurs additional cost that again may not be preferred.

    4. Is simulation of the assessor’s model really helpful? Although it claims to reduce the cost involved in judging, it comes with an additional cost for controlling the error due to assumptions and modeling. Is it worth the effort and cost to build a simulation for assessing judgments? Simulation vs Manual effort – which one is better in which scenario?

  22. The authors have tried to categorize people into different sets like optimistic, enthusiastic etc. But every person is different and everyone is a mixture of all the types mentioned here. How can this model account for that? And how can a machine determine which of the characteristics might be behind the decision being made by a judge, and to what extent?

    The authors have not mentioned in the paper how the values of alpha and beta were derived for the different types of assessors. And again, how can these distinct characteristics mix with each other, and how would that affect the values of alpha and beta?

    The authors have tried to wrap human behaviour into a distinct set of characteristics, but human behaviour is subject to many things, not personality alone. What about the environmental factors affecting the decision of the judge?

    Then what about personal bias of individuals towards a single topic and not towards the whole document set? How can this be taken into account by a machine without knowing the person's background completely?

    Then if a machine is able to replicate human behaviour as a judge based on the characteristics defined here, then will it not be able to replicate human behaviour as an evaluator?

  23. This comment has been removed by the author.

  24. 1) Is there a reason why Carterette and Soboroff chose to do 25 trials and set the range of alpha and beta to be [1, 1024]?

    2) This experiment follows the TREC standard of only having a single assessor per topic. However, wouldn't a model that does not make this assumption be more interesting and realistic?

    3) To some extent I think the models presented are very “optimistic” in some sense. Most of the models seem to produce more negative noise than positive and as a result they tend to mimic the tendency of the relevance judgments. Wouldn't it be better to add a random element that would change the judgments of the model after X documents? This random variable would simulate the assessor's desire to not be caught.

  25. 1. This paper based its model on Bernoulli and Poisson distributions. Why did they choose those two distributions? Could those two distributions really reflect the features of the data, at least from the TREC MQ perspective?
    2. This paper introduced several behavior models. Is it possible that some assessors follow a hybrid model, which combines two or more models in their assessing work? If yes, what is the impact on the error analysis? Or, in other words, may an assessor be categorized into several models?
    3. The paper was totally based on the TREC MQ. Though it mentioned at the end that they will do further work on MTurk, the question is how they will get sufficient data, e.g. logs, from MTurk so that they can continue their work. Is the model in the paper applicable to assessors on MTurk?

  26. 1. What is the meaning of “two-sample two-proportion test” here? (p. 2) Is the data size large enough to support such a conclusion?
    2. The authors state that the unenthusiastic assessor follows a fixed pattern. Is this true? Such a pattern may be close to “fixed”, but there is still some variance and there are dynamic factors in the behavior. Do such factors have an impact on this model?
    3. When discussing the simulation results, the authors state that the models result in more accurate rankings. What is the evidence for this claim?
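
For readers unfamiliar with the term in question 1, a two-sample two-proportion test asks whether two observed proportions could plausibly come from the same underlying rate. The sketch below is the standard pooled z-test version, not necessarily the exact procedure the paper used; the function name and the example counts are made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for equality of two proportions.

    Returns (z, p_value). The inputs are plain counts; nothing here is
    taken from the paper's data.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled estimate of the common proportion under the null hypothesis.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical example: 60 of 100 judged relevant vs. 40 of 100.
z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is only reasonable when both samples are fairly large, which is the point of the second half of question 1.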

  27. 1) The authors conclude from their results that “it is generally better to underestimate relevance than to overestimate it.” Elsewhere it is mentioned that the Million Query track test collection is characterized by having a relatively small number of judgments per topic. Would they have achieved similar results if there were more judgments per topic?

    2) I might be misunderstanding their simulation, but their approach to simulating multiple judgments to adjust for assessor errors seemed flawed. They do not take into account that when a document is judged multiple times, the type of assessor could be different each time. I realize the focus of the paper is on simulating the effect of errors and not devising a solution, but would they have achieved better results if they accounted for different types of users judging the same document? This could potentially be more effective than “forcing” users to be pessimistic because that type of user yields better results.

    3) Continuing along the same lines, why is there an assumption that the probability of relevance associated with a document is the same for two users (even if they are both of the same “type”)? If we are both optimistic, we still have independent views of each document. Something that’s clearly irrelevant to me might be slightly relevant in your view.
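
One way to explore questions 2 and 3 is to compute how accurate a majority vote would be when each judgment comes from a different type of assessor. The sketch below is not the paper's simulation: the error rates in `ASSESSORS` are invented for illustration, and the judges are assumed to err independently, which is exactly the assumption question 3 challenges.

```python
from itertools import product

# Hypothetical assessor types, each given as
# (P(says relevant | truly relevant), P(says relevant | truly non-relevant)).
ASSESSORS = {
    "optimistic": (0.95, 0.40),
    "pessimistic": (0.60, 0.05),
    "random": (0.50, 0.50),
}


def majority_vote_accuracy(panel, prevalence):
    """Exact probability that the majority vote of `panel` matches the truth.

    `panel` is a list of assessor type names and `prevalence` is the assumed
    fraction of truly relevant documents. Ties (possible only with an
    even-sized panel) are counted as a non-relevant verdict.
    """
    accuracy = 0.0
    for truly_relevant, weight in ((True, prevalence), (False, 1 - prevalence)):
        # Enumerate every combination of votes and sum the probability
        # of the combinations whose majority agrees with the truth.
        for votes in product((True, False), repeat=len(panel)):
            prob = 1.0
            for name, vote in zip(panel, votes):
                p_yes = ASSESSORS[name][0 if truly_relevant else 1]
                prob *= p_yes if vote else 1 - p_yes
            majority_says_relevant = sum(votes) * 2 > len(panel)
            if majority_says_relevant == truly_relevant:
                accuracy += weight * prob
    return accuracy


# Compare a same-type panel with a mixed panel when 10% of documents are relevant.
for panel in (["optimistic"] * 3, ["pessimistic"] * 3,
              ["optimistic", "pessimistic", "random"]):
    print(panel, round(majority_vote_accuracy(panel, 0.10), 3))
```

Changing `prevalence` shows how much any advantage of the pessimistic type depends on relevant documents being rare.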

  28. 1- Can this predictive modeling be better developed to produce models of good behavior? What happens when models are created from the best aggregated assessor behavior? Surely if we can model inattentiveness we can model attentiveness, no? Is good behavior actually too random to model, and is that why it is good? Is it possible to derive a pattern for good assessing?

    2- A lot of discussion is given to rejudging documents thought to have been judged erroneously. In an age of nearly unlimited, low-cost crowdsourced judges, it doesn’t seem to make much sense to put a lot of work into testing each judgement and deciding which ones need to be re-judged, or judged multiple times so that a vote can be tallied or a formula applied to multiple results. Wouldn’t it be easier to just throw out the judgements believed to be made badly (according to empirical standards) and get another separate set of judgements that don’t look lazy, or optimistic, or unhappy? Unless you intend to collect multiple judgements for all the documents (which I think is a good idea) and average them all out, in which case it won’t matter too much what one fatigued assessor said.

    3- The conclusion mentioned some next steps that include doing an actual crowdsourced experiment to determine how common these models are. I think this is a fantastic idea, because as soon as we have those numbers we can control for them in future experiments and make crowdsourcing even more useful. But it occurred to me that we might want to go back and see how many of our professional TREC assessors displayed some of these characteristics, for two reasons. One, to establish how good our past results have been and how meaningful the data we have already been using is. Two, to establish a reasonable baseline to hold the crowd to. If even our best assessors were giving us judgements that weren’t that good, that should definitely lower the bar for accepting crowdsourced judgements, AND it will tell us how much better a crowd has to do to really be worth our time.

  29. 1. In this article the authors are conducting an experiment to examine the errors in lower-cost methods of making relevance judgments like crowdsourcing. They conduct this experiment by running simulated user models against documents in the TREC Million Query track. How can we trust these simulated users that they created to accurately show the behavior of real users? Is it possible that the results that they got from this study are flawed because the models they created were geared towards creating such results?

    2. In this article the authors use statAP as a method of targeting which judgments will be representative of a larger group of documents. statAP estimates the average precision of a group of documents by sampling a set of documents that represent sub-groups of documents within that group. However, statAP only accounts for a document being relevant or not relevant, while the TREC Million Query track has several levels of relevance. Would this discrepancy between binary and non-binary relevance judgments cause statAP to overlook documents that could be relevant? Would the problem with these documents be statistically significant?

    3. This experiment showed that being stricter in assigning relevance to a document caused fewer problems than being too generous. In a previous article we saw that assessors who are more educated about a topic are generally stricter in assigning relevance judgments to documents. Does this mean that the data from this study prove that more educated assessors are better assessors because they will cause fewer errors?

  30. 1. My first question is about Section 2.2, Assessor Models. This section lists different assessor models. It starts from a baseline model that makes random judgments and compares it with a series of more realistic models: the unenthusiastic assessor, optimistic assessor, pessimistic assessor, topic-disgruntled assessor, lazy/over-fitting assessor, fatigued assessor, and Markovian assessor. This seems a little weird to me. The intent of this paper is to measure the effect of assessor errors. If we define all these assessors on our own, it seems that we add too many subjective factors, and for each of the assessors we are trying to measure errors that are pre-defined and generated by ourselves. Is there a better way to make this more objective?

    2. My second question is about Section 3.3, Simulation Results. This section discusses some experimental results and draws some conclusions from observation. One conclusion the authors draw from these results is that it is generally better to underestimate relevance than to overestimate it. This sounds like a biased conclusion. It depends on how we are going to evaluate our relevance judgment system. Let’s take a simple example: intuitively, underestimating will potentially increase precision, and overestimating will increase recall. So it really depends on the evaluation methods.

    3. My third question is about the use of the abstract models. As noted in the experiments, these assessor models are abstract. That one abstract assessor model is better than another does not necessarily mean that we should choose one completely over the other. In reality, when we implement a relevance judgment system, we usually use a mix of methods instead of only one kind. How do we address this question when we take all these factors into consideration?
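
The intuition in question 2 can be made concrete by treating an assessor's judgments as predictions and scoring them against the true labels. This is only an illustrative sketch with made-up labels: it measures the precision and recall of the judgments themselves, not of the system rankings the paper evaluates, so it does not settle the question.

```python
def label_precision_recall(truth, judged):
    """Precision and recall of an assessor's labels against the true labels."""
    tp = sum(1 for t, j in zip(truth, judged) if t and j)
    fp = sum(1 for t, j in zip(truth, judged) if not t and j)
    fn = sum(1 for t, j in zip(truth, judged) if t and not j)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical topic with 3 relevant and 5 non-relevant documents.
truth = [True, True, True, False, False, False, False, False]
# An underestimating assessor misses two relevant documents.
underestimate = [True, False, False, False, False, False, False, False]
# An overestimating assessor marks two non-relevant documents as relevant.
overestimate = [True, True, True, True, True, False, False, False]

print(label_precision_recall(truth, underestimate))  # precision 1.0, recall 1/3
print(label_precision_recall(truth, overestimate))   # precision 0.6, recall 1.0
```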