Summary: In this paper, Sormunen describes the process of re-evaluating documents from TREC-7 and TREC-8 on a 4-point relevance scale, instead of the binary relevance system used in TREC. The reassessors distinguished among documents that were highly relevant, documents that were marginally relevant, and documents that merely contained the words of the query. About 50% of the documents originally judged relevant were determined to be only marginally relevant. Sormunen also describes some of the difficulties encountered in the judging process.

1. Sormunen writes that the idea of “degree of relevance” used in TREC brings with it the assumption that the user has some idea about the topic. Is this a fair assumption? Couldn’t the user have heard or seen the term without having any context for interpreting the results?

2. In rating TREC topics, the same people who submitted the queries rate the documents for that topic. In this study, the assessors rated many of the documents differently, which could be due in part to their not understanding the intent of the original topic. Does the method used in TREC provide meaningful results for applications to search engines? Does having the same person who submitted the topic do the relevance judgments bias the data?

3. In judging TREC documents, assessors are able to highlight keywords and terms on the screen and use these visual clues in making relevance judgments. Could this highlighting of terms contribute to the fact that many TREC documents are only marginally relevant (an assessor might see the terms but not notice their context)? What are the implications of this? What if the keywords are not present, but the document uses different terms to describe the same topic? How thorough a judgment does an assessor need to give a document for the rating to be accurate?
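To make the “about half are marginal” finding concrete, here is a minimal sketch (the grade values and document pool are hypothetical, not from the paper) of collapsing a 4-point graded scale into TREC-style binary relevance and measuring the share of binary-relevant documents that are only marginal:

```python
# Sketch: 4-point scale (0 = not relevant, 1 = marginal,
# 2 = relevant, 3 = highly relevant) collapsed to binary relevance.

def to_binary(grade, threshold=1):
    """A grade at or above the threshold counts as relevant in binary
    terms; TREC's liberal rule treats marginal documents as relevant."""
    return 1 if grade >= threshold else 0

def marginal_share(grades):
    """Fraction of binary-relevant documents that are only marginal."""
    relevant = [g for g in grades if to_binary(g) == 1]
    if not relevant:
        return 0.0
    return sum(1 for g in relevant if g == 1) / len(relevant)

# Hypothetical pool of re-judged documents:
grades = [0, 1, 1, 2, 3, 1, 0, 2, 1, 3]
print(marginal_share(grades))  # 4 of the 8 relevant docs are marginal -> 0.5
```

With a stricter threshold (`threshold=2`), the marginal documents would drop out of the relevant set entirely, which is exactly the distinction the re-judging exercise makes visible.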
The Crowd vs the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior

Summary: With crowdsourcing becoming a popular way of recruiting participants for IR studies, Smucker and Jethani study the extent to which crowd-sourced participants behave differently from the “accepted standard of lab participants”. The lab group included 18 participants, mostly graduate students, while the crowd-sourced group included 202 participants, recruited via MTurk. The researchers are interested in evaluating how the behaviors of the two groups differ, not necessarily in devising a methodology for getting “good” relevance judgments from noisy participants.

• Finding 1: Only 61 of the 202 crowd-sourced participants (about 30%) were included in the final group
• Finding 2: The crowd-sourced group worked much faster than the lab group
• Finding 3: The crowd-sourced group had a higher false-positive rate (more false alarms)

1. A fixed-time task is one where participants are asked to work for a fixed amount of time and are compensated a fixed amount for that time, regardless of the work they complete. A fixed-size task is one where participants are asked to complete a task of a certain size and are compensated based on the amount of work completed. Currently, crowdsourcing adopts the latter. I know this paper isn’t looking at solutions, but don’t you think a combination of fixed-time tasks and pre-screening could help reduce noisy workers?

2. The screening method for lab participants and crowd-sourced participants is different. While it was mandatory for lab participants to take a quiz about the instructions and do a practice tutorial (judge a set of practice documents), this was optional for the crowd-sourced participants. Do you think this is an experimental error? Should the researchers have made both groups go through a similar protocol to ensure a fair comparison?

3.
I agree that matching/comparing the results from each group to a single standard – the NIST relevance judgments – is a practical assessment, but don’t you think that from the outset, theoretically speaking, the lab group has a better chance of matching that standard? We have read in other papers this week that assessors with similar backgrounds, training, and working conditions have better overlap – and Voorhees indicated this about NIST assessors. Don’t you think this factor favors the group working in a lab setting?

4. The researchers repeatedly mention that the “judging behaviors” of both groups were almost similar. How is that true, considering that the crowd-sourced group had a much higher false-positive (false-alarm) rate? Don’t you think this is a big exception rather than a small factor? Doesn’t it indicate that they might be more “liberal” in their judging behavior and thus a negative-criterion group? Also, drawing the aforementioned conclusion by comparing accuracies seems questionable, since the sizes of the two groups were drastically different.
The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior – Mark Smucker, Chandra Prakash Jethani

Summary: This article focuses on a behavioral study of crowd-sourced workers vs. graduate students participating in a laboratory setting, both completing relevance judgments. The authors of this study immediately reveal that they believe "random crowd-sourced workers are not to be trusted" (p. 1). They go on to explain that most crowd-sourced workers are spammers, or lazy individuals who try to rush through work, or complete work while checking email or engaging in other distractions. The conclusion of this study is that laboratory participants made more discerning, useful judgments, but took twice as long to complete tasks and received much higher compensation. The authors do admit, though, that they "cannot conclusively say that the crowd-sourced environment caused these differences as the two groups were not trained and qualified in the same manner" (p. 5).

Questions:

1. The Mechanical Turk website addresses the factor of time under its Frequently Asked Questions: "Once the Worker accepts the HIT, a timer begins counting up to the HIT's allotted time. This timer is visible to the Worker on the Worker web site. When the timer reaches the HIT's allotted time, the HIT is made available for other Workers to accept and work on." Did the authors of this study take this timer into account, or provide a similar visible timer to the laboratory participants? One of their major findings was that "crowd-sourced participants judged documents nearly twice as fast as the laboratory participants" (p. 5). (https://www.mturk.com/mturk/help?helpPage=worker#how_time_work_hit)

2. How would this study have varied if there were no fixed task size? The authors claim that placing a limit on the amount of work at hand makes participants (at least the crowd-sourced workers) rush through the job.
If it were a matter of paying by the hour instead of by the judgment, would crowd-sourced workers and laboratory participants give a higher quality of effort and produce results more beneficial to users? Has this already been considered and deemed too inefficient?

3. All of the articles from this week agree that crowd workers are not a good source of relevance judgments. Even running experiments on crowd workers requires endless measures to check that the workers aren't cheating, including installing cookies, having assessors enter codewords, and asking gold questions. Must it therefore be concluded that Mechanical Turk work is motivated entirely by low wages, and that most judgments and information that come from the site are of poor quality?
Paper: The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior

Summary: Historically, IR evaluation experiments have used a laboratory environment to obtain relevance judgments from selected candidates. In recent years, IR experiments have turned to crowd-sourced workers to make relevance judgments. The goal of the paper is to highlight the differences between the relevance judgments received from the two environments, to get a better understanding of how to ensure crowd-sourced judgments meet historical standards. First, the authors outline their laboratory study. Then they describe how they attempted to recreate the study using Amazon's crowd-sourcing platform. The biggest difference was in the number of participants who qualified for inclusion in the final experiment: a majority of crowd-sourced workers did not evaluate an appropriate number of “gold” documents correctly. The second difference is that crowd-sourced workers had a higher false-positive rate. In the end, the authors admit the two experiments had noticeable differences in administration and not just performance. At the same time, the difference in performance can’t be ignored and needs to be a consideration in future studies where relevance judgments are obtained from crowd-sourced workers.
1. The authors go into detail on how both experiments were conducted. In the laboratory environment, the requirements to continue in the experiment included a 70% correct-judgment threshold, and the monetary incentive was comparable. On CrowdFlower, the authors tried to replicate these requirements. However, they did not force the practice quiz, and only one in five documents counted towards the 70% correctness value. In addition, the payment was substantially lower. In the end, only 30% of the CrowdFlower participants ended up in the experiment. The authors mention not being able to control the environment or force the users of CrowdFlower to take and ace the practice quiz. However, later in the paper, they mention having the users of CrowdFlower also report their results to the authors' website; the quiz could have been made a requirement before entering that website. The authors mention using a certain percentage of documents as gold, according to the recommended standards. Since all documents were used as gold in the original study, was it possible to make all documents gold on CrowdFlower?

2. Between this paper and a few of the other papers for this week, I am beginning to wonder what benefits motivate a study organizer to use crowd-sourced workers for relevance judgments. The authors mention that although 30% of the people were included in the evaluation, they did not qualify on all topics they were evaluating. On top of that, the authors found that crowd-sourced workers reported relevance judgments twice as fast. This would tend to support the idea that using crowd-sourced workers runs the risk of paying scammers or those who, for whatever reason, are not properly motivated to try their best. The results of this study, as well as “The Effect of Assessor Errors on IR System Evaluation”, do not seem to paint a positive picture.

3. One measure calculated across study participants is the false-positive rate.
The authors found crowd-sourced workers had a higher false-positive rate. They note that there was not a big difference in accuracy, but the false-positive rate was reflected in the criterion. In the end, laboratory participants appear to be more conservative in their judgments than crowd-sourced workers. The authors do note that one factor affecting these results could be the difference in training. Based on this study, it would seem a crowd-sourced worker is more likely than a laboratory participant to judge a document relevant when it should be non-relevant. Carterette and Soboroff’s paper mentions that optimistic assessors have a much stronger negative impact on study results than pessimistic assessors. Are the crowd-sourced workers more optimistic due to a lack of training on handling documents of questionable relevance? Can this information be leveraged to structure better instructions or training for crowd-sourced workers?
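Since the discussion above turns on the “criterion” and liberal vs. conservative judging, a minimal sketch of the standard equal-variance signal-detection calculation may help. The counts below are hypothetical, and this is not the paper's code; `NormalDist` requires Python 3.8+:

```python
# Sketch: signal-detection measures from binary relevance judgments
# scored against a gold standard. A more negative criterion c means a
# more liberal judge (more willing to call documents relevant).
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    tpr = hits / (hits + misses)                          # hit rate
    fpr = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf                              # probit transform
    d_prime = z(tpr) - z(fpr)                             # sensitivity
    criterion = -0.5 * (z(tpr) + z(fpr))                  # bias; < 0 is liberal
    return tpr, fpr, d_prime, criterion

# Hypothetical crowd judge: same hit rate as the lab judge below,
# but twice the false alarms.
crowd = sdt_measures(hits=80, misses=20, false_alarms=40, correct_rejections=60)
lab = sdt_measures(hits=80, misses=20, false_alarms=20, correct_rejections=80)
print(crowd[3], lab[3])  # crowd criterion is lower (more liberal)
```

Note how the two judges can have similar overall accuracy while the criterion cleanly separates them, which is why comparing accuracies alone, as questioned above, can be misleading.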
Paper: Voorhees, Ellen M. Evaluation by highly relevant documents. SIGIR 2001.

Summary: The article discusses a ternary evaluation of the TREC-9 Web Track test collection, classifying documents as relevant, highly relevant, or not relevant. The author discusses the motivation and intuition behind the need for multiple relevance levels, and compares the performance and correlation between using only highly relevant documents and using all relevant documents for evaluation. The assessors were also asked to pick out a single best document for each query, so as to observe assessor behavior and correlate it with the results. Performance measures like MAP and DCG are analyzed, and the author concludes that there is a significant difference between using all relevant documents and using only highly relevant documents for evaluation. The author also mentions the caveat that this study was performed only for informational queries in web search, and highlights the need to explore the methodology for navigational and transactional web searches.

Questions:

1. Figure 2 provides an insight into the comparison between MAP and precision with a cutoff of 10. Apart from run S performing well for highly relevant documents, it can also be noted that run H has a relatively low MAP value for R1 and R2, but performs the best on the P measure. What does this say about run H? Does it mean the run had both highly relevant and non-relevant documents in the first 10 results? Does this signify an inconsistency in the method used, or is there no significant information to read from this?

2. From Table 2 and the discussion that follows, it can be observed that there is a significant correlation when the ratios of the gain values are sufficiently high: tau(100:1, 1000:1) is almost 1 for DCG. But is the result not biased, in the sense that the number of topics here is very small?
Can this result be extrapolated to (informational) web search in the real world? Why is there a huge difference between the best possible runs and the average runs, although the Kendall's tau values do not seem to depict the same? Is there any specific way in which the other runs were picked?

3. In the case of navigational requests, there is a higher possibility of an increased number of highly relevant documents. (For a navigational request, I am assuming that if a document is relevant, it is more likely to be highly relevant, while for an informational request it can be relevant, highly relevant, or irrelevant.) It would have made sense for Voorhees to look at navigational requests in this study. Is there any reasoning behind why she chose TREC-9, although it deals with mostly informational requests?
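To make the tau(100:1, 1000:1) comparison in question 2 concrete, here is a small sketch in the spirit of the paper's methodology; the runs, their graded results, and the gain values are all made up. Each run is scored with DCG under two gain settings for highly relevant documents, and the two induced system rankings are compared with Kendall's tau:

```python
# Sketch: does the choice of gain for highly relevant documents
# (100 vs 1000, with relevant = 1) change the ranking of systems?
from math import log2

def dcg(grades, gains):
    """grades: graded judgments of the ranked docs (0/1/2).
    gains: dict mapping grade -> gain value."""
    return sum(gains[g] / log2(i + 2) for i, g in enumerate(grades))

def kendall_tau(rank_a, rank_b):
    """Tau over two orderings of the same systems (no ties):
    (concordant - discordant) / total pairs."""
    pairs = [(x, y) for i, x in enumerate(rank_a) for y in rank_a[i + 1:]]
    score = sum(1 if rank_b.index(x) < rank_b.index(y) else -1
                for x, y in pairs)
    return score / len(pairs)

runs = {  # hypothetical ranked-output grades for three runs
    "A": [2, 1, 0, 1], "B": [1, 1, 1, 0], "C": [0, 2, 1, 1],
}

def ranking(gains):
    return sorted(runs, key=lambda r: dcg(runs[r], gains), reverse=True)

tau = kendall_tau(ranking({0: 0, 1: 1, 2: 100}),
                  ranking({0: 0, 1: 1, 2: 1000}))
print(tau)  # both gain ratios induce the same ranking here -> 1.0
```

Once the gain for highly relevant documents dwarfs the others, runs are effectively ranked by where their highly relevant documents appear, so further increasing the ratio rarely reorders systems; that is one intuition for why tau approaches 1, independent of topic count.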
[sub: An analysis of systematic judging errors in information retrieval]

This paper is about an experiment in making graded relevance judgments on topics from TREC-7 and TREC-8. The assessors were Master's students in information studies. After the document assessments were complete, the students were interviewed and the new judgments were compared to the TREC originals. One of the main findings is that about half of the relevant documents in TREC are actually only marginally relevant.

My comment is about the Type A inconsistencies: when dealing with marginally relevant documents, another important factor that can explain such a large percentage is the background knowledge of the assessor. Since the assessors in this experiment were MS students, they had limited knowledge about the topics, which would make it difficult to decide on marginal relevance. In contrast, TREC assessors got to devise the topics, which means they were “always” right in their decisions.

1) Sormunen explains that throughout the experiment the assessors had meetings to discuss the problems they had at hand. Moreover, he notes that when the assessors were asked to reassess Type B inconsistencies, many of them were firm about their previous judgment. I wonder if the meetings to some extent calibrated the assessors, so that they were more consistent overall and, as a result, more confident about their work; thus they opted to keep their previous judgments when asked.

2) Don't Sormunen's results to some extent show that Voorhees' claim (2000), that the background of the assessors influenced the overlap, is incorrect? If we consider levels 2 & 3 and levels 0 & 1 as TREC's relevant and non-relevant judgments respectively, we can see that the agreement percentages are 39% and 61%. These values are within a reasonable distance of the overlap values Voorhees reports: 42%, 49%, and 42%.
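For question 2 above, the overlap being compared is Voorhees-style assessor overlap: the size of the intersection of two assessors' relevant sets divided by the size of their union. A minimal sketch, with made-up document IDs:

```python
# Sketch: Voorhees-style overlap between two assessors' relevant sets.

def overlap(rel_a, rel_b):
    """|A intersect B| / |A union B| over the sets of documents each
    assessor judged relevant; 0.0 if neither judged anything relevant."""
    a, b = set(rel_a), set(rel_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

primary = {"d1", "d2", "d3", "d4", "d5"}    # hypothetical primary assessor
secondary = {"d2", "d3", "d4", "d6", "d7"}  # hypothetical secondary assessor
print(overlap(primary, secondary))  # 3 shared of 7 total -> ~0.43
```

Note that overlap penalizes documents judged relevant by either assessor alone, so even assessors who agree on most documents can show overlap well below their raw agreement percentage, which is worth keeping in mind when comparing the 39%/61% agreement figures to Voorhees' 42-49% overlap values.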
Title: The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior

Summary: As a preface, I was drawn to this paper since I’m heavily in the crowdsourcing camp (more users, better results), and it offered a direct comparison between the two opposing approaches. Basically, it compares the effectiveness of generating relevance information from laboratory participants versus crowdsourced participants (Amazon Mechanical Turk, CrowdFlower). Participants in the lab setting were given qualification tests (to weed out potentially “bad” participants) and placed in isolated environments to perform their judgments. There were obviously fewer laboratory participants, and they were compensated significantly better ($25 × 18 participants, for a total of $450). The crowdsourced participants were obtained through CrowdFlower and were given optional qualification quizzes; gold documents were used to weed out “bad” participants. They were compensated much less per user, for a total of $313.14. Both groups performed equally in terms of true-positive rate, but the laboratory group did better in terms of false-positive rate. Crowdsourced users also worked twice as fast.

1) It might be interesting to profile the users in some way. Could the laboratory students be “smarter” at judging NIST topics than the types of users who would subject themselves to crowdsourcing experiments?

2) What was the reasoning behind requiring only the laboratory users to take the quiz, and also giving the answers to the crowdsourced participants? Was this financially motivated, or were they trying to add more quality control for laboratory participants? This seems like a glaring oversight that favors the laboratory participants.

3) The authors mention that 84% of users did not qualify for inclusion. They go on to say that they only rejected 70%, and that the 84% figure is overstated because it is inflated by users failing to qualify on all topics.
In my opinion, one of the benefits of crowdsourcing is that you have a large number of potential participants. What’s the harm in making sure that each user you use is able to judge each topic correctly? Based on all the discussion of issues with crowdsourced assessors’ reliability in general, I would think that you would want to weed out assessors as aggressively as possible and retain data only from the “top-notch” assessors. This sort of weeding process is unavailable with laboratory users, and as such it should be leveraged to its maximum potential when evaluating crowdsourcing as a technique.
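The weeding process described above can be sketched as a gold-document filter. This is a hypothetical illustration, not the paper's code: each worker is kept only if they score at or above the 70% threshold on the gold documents they actually judged.

```python
# Sketch: qualifying crowd workers against gold-standard judgments.

def qualifies(judgments, gold, threshold=0.7):
    """judgments and gold map doc_id -> 0/1. Only documents present in
    both (the gold documents this worker actually saw) are scored."""
    seen = [d for d in judgments if d in gold]
    if not seen:
        return False  # no gold evidence -> cannot qualify the worker
    correct = sum(judgments[d] == gold[d] for d in seen)
    return correct / len(seen) >= threshold

gold = {"g1": 1, "g2": 0, "g3": 1, "g4": 0, "g5": 1}
# Hypothetical worker: 4 of 5 gold documents correct, plus one
# non-gold judgment ("x1") that is ignored by the filter.
worker = {"g1": 1, "g2": 0, "g3": 1, "g4": 1, "g5": 1, "x1": 0}
print(qualifies(worker, gold))  # 0.8 >= 0.7 -> True
```

Tightening the threshold (or, as suggested above, requiring qualification per topic rather than overall) trades participant volume for reliability, which is exactly the lever a large crowd makes affordable.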
Smucker and Jethani: The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior.

Summary- The authors identified a need to examine the differences between the behavior of traditional student assessors in a controlled laboratory setting and the behavior of "the crowd" that supplies crowd-sourced judgments. Two parallel experiments were set up. The first was a fairly routine test that brought in 18 students, paid them each $25, and trained them carefully on the expectations and the methods they should use when judging documents. They were tested for accuracy and given training documents to practice on before they could qualify to participate. Then they each made binary relevance judgments on 2 (of a possible 6) TREC topics. The second experiment used a crowd-sourcing application to gather about 200 random participants. Some training and practice was offered, but it was not mandatory for the crowd to pass it or even attempt the practice topic. There were issues with the CrowdFlower application, both in its ability to accurately weed out participants performing at unacceptable levels and in communicating information back to the researchers. A similar judging task was offered, each participant working on a subset of the potential 6 topics. In the end, about 62 of the 200 participants provided usable data. They were paid on average $1.55. The data was compared with the lab results, and everything was compared to the standard of judgments established by the TREC judges. The quick conclusion is that there are some main differences between the groups: first, many fewer crowd participants provided usable data. Of the usable data, the crowd and the lab produced a similar true-positive rate, but the crowd was more prone to false positives and made judgments much faster.

1- I must critique the entire experiment. They have set up a situation where they are not comparing apples to apples at all.
The purpose of the experiment was to test subject behavior, but the subjects were given radically different tests, different standards for participation, different training and information, and different metrics for time. Not enough was the same for these two sets of data to be comparable to each other. The truth of the matter is that it sounds like CrowdFlower was not suitable for this experiment. It admittedly was used because of a SIGIR challenge grant, even though the researchers knew it wasn't the best tool. Either they needed to design an experiment better suited to that tool, or they needed to forgo it and use a different path to get the setup they needed. Something else that bothered me: 100% of their lab group produced results that qualified them to participate, but only about 30% of the crowd qualified. That sounds like a test that is unfairly skewed toward the lab group, especially when they added more ‘gold’ test documents to one of the crowd topics to make it harder, since too many workers were passing it. I really feel like they designed a poor experiment and set the crowd group up to fail.
2- They seem pretty down on the crowd results, but all I can see are good things: the crowd provided three times the participants (whose results ended up counting), 30% more judgments per topic, nearly twice as fast, at 70% of the cost, while differing from the lab group by a statistically significant amount only 20% of the time. Even if you have to throw out 2/3 of the judgments, that is still a good deal. They really should be crowing about these results, especially because it seems that with a little bit of mandatory training of the crowd they should be able to get results well within the league of the lab group. Another thing to try might be giving each document a mandatory wait time before a judge can move on to the next one; that may discourage snap decisions.

3- Something I noticed: all but one of the statistically significant differences between the lab group and the crowd group occurred on two topics, #310 and #336. This makes me wonder what a larger sample of topics would have shown us. Is that an anomaly, such that over 20 or 30 topics the crowd and lab groups would have been more similar? Or does it point to some topics being more suited to lab evaluators? If that is true, and if we can identify which ones they are, maybe we can make a better division of labor where the crowd judges the easier topics and experts only need to worry about the ones the crowd can't handle.