Friday, October 4, 2013

10-10 Harris, Christopher G., and Padmini Srinivasan. Applying human computation mechanisms to information retrieval. ASIS&T 2012.


  1. 1. My first question is about the comparison between crowdsourcing and GWAP. Across the paper we can see that crowdsourcing rates better than GWAP at almost every step of the IR model. In Table 3, all the ratings for crowdsourcing are high, while the ratings for GWAP are a mix of high, medium, and low. So I am wondering why we bother to research GWAP at all. For crowdsourcing, we already have Mechanical Turk, and the cost of using Mechanical Turk should not be higher than developing a game and then promoting it to users. So what is the advantage of continuing to research and develop GWAP for information retrieval?

    2. Since it is not explicitly discussed in this paper, I am not quite sure how we would implement a GWAP. For example, is it like an RPG, where we get rewards or level up when we finish small tasks, or is the game itself one big task for information retrieval? How can we actually connect the entertainment with the tasks? Is GWAP widely accepted, with some successful cases, or is it still being tested in experimental environments?

    3. My third question is about Table 5: Assessment of Applying Crowdsourcing and GWAP to Accomplish Preprocessing Steps. The table gives the corresponding assessed rating: High, Medium, or Low. The error rate confuses me. From my understanding, the error rate is the degree of errors encountered, so the lower the error rate, the better the system. But clearly that is not the case here. For example, for steps 3a and 3d, the assessment is low because of the low error rate for 3a, and the low error rate limits the human value added for 3d. So I am wondering why a low error rate is a bad thing here.

  2. 1. In the section on implementing the user interface, the authors propose that "the design could be crowdsourced if clear guidelines can be provided to several designers, who work on it in parallel" (p. 5). Where is the line drawn between private contractors and crowdsourced workers? This task seems inappropriate for the population they defined earlier for "crowdsourced" workers, in which there is an "open call" to an "undefined, generally large group of people" (p. 1).

    2. As far as I could see, there is no actual measurement or process that goes into determining the success of Criterion 4 (the possibility of making the task fun while still getting the necessary results). Instead, the authors just seem to decide, as if they had a private conversation while writing the paper and came to a consensus. Or they claim that GWAP has never been tried for tasks like choosing documents for a collection, and that it would therefore be inappropriate. If this is an adequate criterion, why develop this elaborate study?

    3. Criteria 1 and 2 similarly work to weed out viable points of intersection between IR tasks and crowd workers or GWAP players, with little empirical information as a basis. Shouldn't this have been tested by GWAP designers or crowdsourcing managers as well, even just at a consulting level?

  3. 1. This paper presents a very interesting and important topic in IR research: how to incorporate crowdsourcing and GWAP (games with a purpose) into IR studies. My first question is, have we considered using a combination of these approaches? Intuitively, a web-based GWAP could easily be plugged into crowdsourcing tools such as Amazon Mechanical Turk (AMT), and with a combination of these tools the gathering of information could be even more efficient.

    2. GWAP is a very interesting and exciting area. The author mentions that real-time scoring is a criterion for this kind of task, and that tasks requiring longer feedback times are not suitable for GWAP. However, there are many real games with delayed reward systems, such as games that require players to grow plants and wait 24 hours or even longer to get the fruit. Is it possible to implement an IR-related game, such as one for document collection, in which players collect documents every day and get feedback and rewards for the previous day's work?

    3. A great limitation for GWAP in the IR field is the need to include entertainment factors. On the other hand, video games are extremely popular these days, and players spend days and nights on them, repeating boring tasks most of the time to earn rare rewards such as a level upgrade or a new weapon or armor. Is there a way of incorporating IR problems into these pre-existing video games, making use of their entertainment value and large populations of existing players? That way, we would not really need to worry about the entertainment value of the IR problems themselves, and potentially more IR problems could be incorporated into games.

  4. In this paper, Christopher analyzes information retrieval systems with criteria that mainly focus on system aspects. However, humans play a very critical role in crowdsourcing and games as well. In that case, when we evaluate these systems, should we consider more criteria or factors about the crowd workers? For instance, can these systems motivate crowd workers to complete tasks efficiently, and can they meet the needs of these workers so that the workers finish their tasks well?

    In addition, Christopher mentions that suitability ratings can be created based on the scores achieved in the equation on page 4. How are the scores transformed into ratings here? Are there any other criteria involved in this process?

    Lastly, I have a question about the games. How could the tasks described in Table 5 be completed efficiently during games? For example, if I were playing a game and was asked to answer some questions, I would actually feel very annoyed. So I'm just wondering whether the results gathered through games would be reliable.

  5. 1. The authors rule out using crowdsourcing and GWAP to define the domain. Even when they bring up a hybrid idea, they still rule both out. The authors feel that defining the domain requires specific and/or location-related knowledge. I have never worked with a crowdsourcing program, but the papers I have read have mentioned restricting the location. So couldn't the hybrid approach be applied to defining the domain for crowdsourcing? I don't see how to make an interesting game out of it, so a GWAP approach doesn't seem to make sense. However, I don't see why crowdsourcing should be ruled out immediately, unless defining the domain cannot tolerate the risk of cheaters who are only out to make money.

    2. Every time the authors mention how GWAP could be used to tackle a step in the IR model, they mention the difficulty of creating real-time scoring. Even when the authors give GWAP a high rating for a step, they still harp on the difficulty of scoring. As a result, is it ever really highly applicable to use GWAP? The authors address the scoring issue frequently, but typically don't outline the issue of malicious crowdsourced workers. That may have been outside the scope of this paper, but it seems like a valid concern to mention, given how much attention GWAP scoring receives.

    3. Throughout the paper, the author evaluates whether each step of the IR model can translate into a crowdsourced or GWAP task. The author makes no comparison between the two beyond giving each category a rank: high, medium, or low. Most of the time, the crowdsourced task outranks the GWAP task. From the other readings assigned this week, the data collected from GWAP seems to outweigh the crowdsourced responses. When there is only a difference between a high and a medium rank, would it be better to make a GWAP? When the GWAP task is given a medium rank, is scoring really as important as making a fun task? Scoring could come from a number of different sources depending on how the game is set up and might not depend on the task the game encompasses. Currently, criteria 3, 4, and 5 all weigh equally in evaluating the applicability of each approach.

  6. 1. Do you agree with the steps in which the authors argue that crowdsourcing and GWAP are helpful (shown in the chart on p. 8)? What other criteria need to be considered? Do the participants in crowdsourcing and GWAP produce the same quality of input?

    2. The authors express concerns that crowdsourcing and GWAP are unable to obtain data in real time. Why do they want to evaluate all the queries in real time? Is it possible to evaluate a batch for commonly used queries to be incorporated into the algorithm later?

    3. The authors discuss the steps in pre-processing of a collection for IR evaluation (p. 2). Is it always necessary to pre-process when developing a test collection for IR? Does the way pre-processing is done affect the validity of the test collection’s results?

  7. When assessing CROWD and GWAP applicability to the steps of the Core IR Model, the authors list "Preprocess Documents" and "Index documents" as part of the process of designing the retrieval system. I understand that our main focus this entire semester is the evaluation of IR systems, but I would really like to know more about these two parts, as I feel they are crucial in establishing the connection between IR and our own research domains. It would be good to have some recommended readings shared with us.

    In the paper, the authors find many spots where GWAP can fit in. For instance, for Step 11, the authors state, "Integration of these refinements using GWAP would be more challenging than using the crowd, but it could be done". I am wondering why it is more challenging and how it could possibly be done. If possible, I am wondering whether we could be shown a short video in class of how a GWAP works, with commentary.

    In Table 5, the authors assess the application of crowdsourcing and GWAP to the preprocessing steps. The assessments of "Low", "Medium", and "High" are fairly subjective, and the majority of them are not even backed up by related work. I am wondering whether there are validity issues here, or whether these assessments are simply based on common sense (which I unfortunately did not have, being unfamiliar with crowdsourcing and GWAP, especially the latter).

  8. 1. Criterion 4, which asks about the capability of combining fun with the task's objectives, is not clear. On what scale is fun rated? What exactly is considered entertaining? Isn't the degree of entertainment relative to the user engaged in the game?

    2. In the core IR model, step 7 concerns identifying the user's information need, and it rates the use of crowdsourcing and GWAP as high and low respectively. How does one ensure that the local knowledge needed to satisfy the user's need is extensive enough? In the case of crowdsourcing, the author mentions constraints on getting workers from diverse locations around the globe. How, then, can the criterion be met? And in the case of GWAP, is it possible to feed extensive human intelligence about locale information into the games? If so, how can it be quantified? The author states that defining the source of information adds little value to human tasks and thus doesn't rate it on the criterion scale. Could we not personalize the games for different users in different locations and then collect their respective needs?

    3. “With GeAnn a set of categories is provided to describe the association of a given keyword with a text snippet. The categories can be used to assess the relationship between the text snippet and the keyword.”
    How does one evaluate the correctness of the mapping between the keyword, the text snippet, and a category? There might be ambiguity in choosing the right category, because there will most likely be cases where a particular keyword belongs to multiple categories. To what extent can this help in evaluating the assessments? Also, maintaining a fixed set of categories for use by crowd workers or in a GWAP might not be helpful, as it could limit the assessment criteria rather than letting humans rate the search result naturally.

    4. In the case of GWAP, how can one ensure that users don't tend to cheat to win the game? "Image Labeler", the game launched by Google as a GWAP initiative to label images used in Google Image Search, was shut down after extensive abuse: players hunted down other players' responses and could easily cheat to win the game. In cases like this, an IR system based on the GWAP would become highly unreliable. Could this be handled?
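    One standard countermeasure against this kind of cheating is output agreement, as used in the original ESP Game: a label only counts when two randomly paired, non-communicating players both produce it, and "taboo" words confirmed in earlier rounds are disallowed. A minimal sketch of the idea (the function name is hypothetical; nothing here comes from the paper):

    ```python
    def agreed_labels(answers_a, answers_b, taboo=frozenset()):
        """ESP-Game-style output agreement: a label is accepted only when
        both independently paired players enter it. Labels already confirmed
        in earlier rounds become 'taboo', which forces fresh labels and makes
        simple collusion or lazy-guessing strategies less effective."""
        return (set(answers_a) & set(answers_b)) - set(taboo)

    # Only the label both players agreed on survives:
    print(agreed_labels({"dog", "grass", "cute"}, {"dog", "park"}))  # {'dog'}
    # A previously confirmed label no longer scores:
    print(agreed_labels({"dog", "run"}, {"dog"}, taboo={"dog"}))     # set()
    ```

    Pairing players randomly and anonymously is what makes hunting down a partner's responses hard in the first place; the taboo list additionally prevents everyone from converging on the same easy answer.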

  9. In Step 7 (Identify the User's Information Need), how is the integration of the different crowd users' summaries or sentence-length descriptions done? The crowd workers write a sentence describing their understanding of the query, which leads to the discussion of the diversity of responses from the crowd. But what separates one response from another? How is the distinction between different summaries made?

    In using GWAP to evaluate query results against the user's information need, the author claims that if gold-standard judgments are used, then the judgments can be scored in "real time". But when we have pre-established judgments about the documents related to the user's need, why would we need to evaluate them again?

    Timeliness (response from crowd workers within the ephemeral user time) is mentioned in every step from 7 to 10. Even if the crowd workers are quick, can crowdsourced results keep up with the multitude of user queries received by a large search engine? If not, are the approaches mentioned in the paper limited to trials in evaluating search engines?

  10. 1. GWAP has been used to get answers to factual questions like “how tall was Abraham Lincoln”. Do you think this ability of GWAP can be translated to get (binary) answers from users regarding document relevance, thus using it to identify documents for a collection?

    2. I’m a bit confused about how an on-call crowdsourcing system works. Wouldn’t the system’s retrieval efficiency vary based on who is on call? Do on-call workers have to meet certain qualifications that help limit such fluctuation?

    3. Do you think that comparing GWAP and crowdsourced users is a fair comparison? One set of users is doing something they enjoy and does not necessarily know they are being recorded, while the other has monetary motivation. I wish the researchers had elaborated on recruiting methods, especially for GWAP.

  11. 1. What measure is used to compute utility in GWAP? There seem to be many factors to consider when analysing the utility model of a GWAP: for instance, evaluation of user requirements, which would require user feedback, and the throughput metric, which would provide intuition about the game's efficiency by computing the average number of problems solved (or a similar input-output mapping). How does the proposed methodology hope to take both of these aspects into consideration, and to partition the emphasis between the interactive application and the human computation performed by the human and the machine when deploying a GWAP?

    2. It seems to me that in the case of GWAP, players are rewarded when they think similarly, or in other words, when they think just like another player would. But isn't this introducing a bias, since it would work against games that require much more diversity and creativity, where the problem-template fit would need to be rather different from the template currently in use? What modifications would be required in these scenarios? And what kind of progress has GWAP made in games that are based on diverse perspectives instead?

    3. Do all the games designed using GWAP rest on the premise that they are easy to divide into subtasks? Is this a requirement for all games conceptualized through GWAP? If so, how do we go beyond the small size of these games? Could we work towards implementing GWAP even in a non-distributed game framework? Is there a particular reason why we endorse this architecture? And, additionally, have there been cases where GWAP has been used in a non-distributed game framework?

  12. 1. One of the criteria for judging whether a GWAP/crowdsourcing approach is useful is scalability. The document preprocessing steps are viewed negatively under this criterion, since there is simply so much data to be preprocessed. This, however, seems an odd way to apply a scalability criterion. Acronym resolution and POS tagging, for example, could benefit from having more crowd-sourced annotated data. Obviously, having every document hand-annotated is not realistic, but a sample of annotations can become part of a scalable process. In what sense is the scalability definition being applied?

    2. Criterion 4 (a task needs to be entertaining yet accomplish a meaningful objective) does not seem straightforward at all. For example, I think the IR step "Implement User Interface" could be operationalized into a game. In addition, why is scoring a necessary component of a GWAP? Plenty of games are popular and fun without having a scoring mechanism.

    3. While crowdsourcing and GWAP seem like good ideas to generate more data, isn’t there a fear that the quality of the data may be inferior? Can these tasks be operationalized in ways that prevent large amounts of trash annotation?

  13. Harris rates user interface design as high for crowdsourcing but low for GWAP, due to the lack of a score that could be applied to the GWAP method. But would it be possible to integrate UI design into a game that is scored in real time, attaching a point modifier to the UI portion (e.g., an increased modifier for UI design aspects that overlap with other users') in order to generate a real-time score?

    It seems like the lack of real-time scoring is what holds back the use of GWAP in several of the steps outlined by the article. Is real-time scoring that big of a deciding factor? Or would simply providing a score every few days or every week be just as viable?

    Harris rates a GWAP approach as low for obtaining a document collection. However, wouldn't a slowly developing game be beneficial for obtaining a document collection? I have a picture in my head of an almost FarmVille-type setup, a "Libraryville", where participants collect documents and line their shelves, then get scored over time on their whole collections.

  14. 1. Orson Scott Card novels (and the United States's drone program, which may arguably be a real-world extension of Ender's Game) aside, I struggle to think of an example of a successful, large-scale GWAP. There are few to no citations in this paper that favor them either. Why push for GWAP when researchers' understanding of crowdsourced IR work (in the traditional sense of the word) is still in its infancy, and its applications remain exceedingly limited outside the IR research domain? Is there any evidence that a GWAP will improve performance over paid work?

    2. Supposing that a GWAP is implemented, how might particular game styles, gamers, and mechanisms for implementing the GWAP within the game influence the performance of the gamers in the task? For instance, would embedding a relevance assessment mini-game within an RPG provide better results than a stand-alone, Pac-Man style relevance assessment game? If the GWAP is included as a mini-game or a side-quest, how might the plot of the game and the connection between the game and the GWAP influence gamer performance in the task? For example, how would results differ if the GWAP is presented as a light-hearted game where users piece together a puzzle in an arcade in between missions in Grand Theft Auto V, versus part of an avatar's job in Second Life?

    3. How do the authors identify the degree to which GWAP may be useful for a given task? They mention five criteria, such as "Can the mechanism handle the scale of the task" and "Can the mechanism be designed to be entertaining yet accomplish the objectives of the task". They then score responses as the percentage of "yes" answers among all answers for each criterion and each step of the IR model for document collection and searching. However, where do these responses come from? Whom did the researchers ask for responses, and how were the responses reached: did they perform a literature review, speak with GWAP experts, IR experts, or both?

  15. 1. The authors continue to bring up the issue of real-time scoring with GWAP mechanisms. While scoring is feedback that appeals to a lot of users, there are other methods of giving a user feedback in a game that would alleviate the need for these scores. Modern games turn to things like lifetime achievements and merit badges earned as players accomplish things within the context of the game. Couldn't a system of achievements or badges be applied to a GWAP?

    2. The authors mention using pre-established gold standards to score judgments made by users. Wouldn't a majority consensus be a cheaper alternative to acquiring gold judgments on all of the query results?
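    As a rough illustration of the trade-off, majority voting over a few cheap redundant judgments can stand in for a single expensive gold judgment, at the cost of occasional ties and correlated worker errors. A hypothetical sketch (not anything from the paper):

    ```python
    from collections import Counter

    def majority_judgment(worker_labels):
        """Aggregate redundant worker judgments for one query-document pair.
        Returns the majority label, or None on a tie, signaling that the
        pair needs another judgment rather than a purchased gold label."""
        top = Counter(worker_labels).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            return None  # tie: ask one more worker
        return top[0][0]

    # Three cheap redundant judgments in place of one gold judgment:
    print(majority_judgment(["relevant", "relevant", "not relevant"]))  # relevant
    print(majority_judgment(["relevant", "not relevant"]))              # None
    ```

    Whether this is actually cheaper depends on the redundancy needed: three or five crowd judgments per pair may still cost far less than an expert gold label, but it gives no protection when all workers share the same misunderstanding of the query.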

    3. A common problem with games and crowdsourcing is participants who game the system to maximize their output, whether that means rating as many documents as they can for more payment or attempting to get the highest score possible in a game. Given the tempting rewards these mechanisms offer, isn't there the potential for the spirit of the task to be ignored by players once they figure out how to get the best scores or make the most money?

  16. An important but subtle point that the paper only touches on briefly, but that I would like to know more about, is the 'cost' comparison between crowdsourcing, automation, and GWAP. The paper mentions that crowdsourcing is more expensive than automation but less expensive than traditional domain expertise. The cost merits of GWAP seem a lot more suspect, however. Designing a crowdsourcing task by itself doesn't seem to require the kind of skill that would go into designing and deploying a GWAP. The key question to ask, then, is: given that something is amenable to both crowdsourcing and GWAP, which one might we prefer, and which is cost effective once the design costs of GWAP are also taken into account?

    It seems to me that the task of determining query terms and operators fails to satisfy criterion 2. Specifically, wouldn't determining query terms and operators require specialized knowledge of the IR system? Perhaps the authors interpret this task in a simpler way than I do. However, an artificial running example would have helped to clarify exactly what they meant. Other places where examples would have been useful for a lay reader were POS tagging and term resolution.

    The authors say that there have been no works where crowdsourced workers are required to locate documents for a collection, given a query. Much of the work seems to have been in creating new data or metadata, such as labels and tags. However, it is difficult to see where the first scenario fits in. In big collections, how are workers supposed to systematically 'locate' documents? And how is that very different from a search engine first showing them a set of results which they then label as relevant for the query (as I suspect is the case)? The only difference between that and a traditional labeling scenario is that in the latter, documents are provided to the workers a priori, while in the former the workers must go out to the internet and look for the documents themselves on their favorite search engine. Is that really a big distinction? I could pipe the results of Google to the workers and simply ask them to provide labels, given my query.

  17. 1. An interesting point the author notes is that humans need to be given an incentive or inducement to become part of collective computation. An immediate question that popped into my head was whether "CAPTCHA" should be classified as GWAP or as crowdsourcing. We know that CAPTCHA data are used for several research problems, such as information retrieval and machine vision.

    2. With advancements in social media, which can be exploited to obtain information from users via social games as well as via incentives to provide information (like filling out a survey to view a complete document), does it still make sense to treat GWAP and crowdsourcing as disparate entities?

    3. It can be observed from Table 3 that crowdsourcing is more suitable for the core IR model than GWAP. Would one be wrong in inferring that the ratio of reliability to cost is higher for crowdsourcing than for GWAP? Since it is a 'game', the user base might not actually represent the targeted population, while we always have (relatively) more control over crowdsourcing.

  18. The authors talk about how easily the user interface could be designed by crowdsourcing, in which "several designers who work on it in parallel" produce designs and the best one is accepted. The authors argue that more can get done with several designers working on things at the same time. So are these designers working together, or in parallel with each other? The authors also say the best one will be chosen, and that if a designer doesn't meet a deadline, a backup design can be used. This entire section seems unethical and dismissive of the entire design process. Are these designers getting paid? And how can you create a consistent user interface with various designers all working independently and in parallel with one another? The authors talk about building many interfaces, and that seems like it would lead to a poor user experience. Would a project manager really want to pick a backup design after seeing a much better one?

    When setting up the criteria for rating crowdsourcing and GWAP, GWAP has two additional criteria applied to it, but there is no mention of any way to equalize these scores. This makes me suspicious of the seemingly equal comparison/rating between them.

    Criterion 4, 'Fun yet meets objective', seems like a simplification of a very difficult concept. Can something be made enjoyable? Yes, but that is what every modern software, app, and game creator is currently trying to do. There are millions of applications in the iTunes App Store, but how many get purchased or downloaded? What is even more difficult is measuring how often these applications get used. Often applications are downloaded, and their use fades over time. Very few applications get used daily (or multiple times a day). And as for GWAP, what happens when the game gets beaten? Or even worse, what if the game is unbeatable and users become frustrated with it? While this concept is sort of left in this paper as 'someone else's problem', I think it's a real Achilles' heel of the argument the authors are trying to make.

  19. a. GWAP doesn't take into account the fact that there are still many people who do not play games online that much. Won't diversity be lost in evaluations done by such processes? How can diversity be maintained when the crowd itself is not in the hands of the evaluators?

    b. Why is the step of collecting documents relevant to the task considered out of scope for GWAP? Maybe GWAP could be extended so that people contribute to the dataset for queries that return very few results. Taking the example we discussed in class: if a user queries for butterfly cake and the query is given as a challenge question to people, the search engine would be able to get more innovative data sets reflecting what the words can mean to others.

    c. The author has mentioned that "To score these terms and operators in 'real time' for a GWAP would likely be a challenge". But he has not considered, even supposing such an evaluation metric can be defined, to what extent the results can be used to evaluate search engines. Take the game PageHunt as one of the GWAP implementations: the applications of its results are not extendable. The result set doesn't give an idea of what exactly the shortcomings of the search engine are. Instead, it is the gamer who will change his query in order to get the desired result. It might be helpful in comparing two search engines, but it doesn't provide a very good picture of the shortcomings of each of them independently. Doesn't this seem like much more effort than setting up a simple controlled environment for evaluation?

  20. 1- Some work has been done comparing crowdsourced work to traditional work, especially when it comes to relevance judgments. Has there been any comparable work to see whether results from GWAP are better or worse than plain crowdsourced work? It would be potentially important to know if GWAP players produce inferior results because they are 'just playing a game' versus getting even a minimal payment for work, or whether their results are somehow better because being entertained kept them more engaged.

    2- According to the authors' ranking system, no task is identified as being more suited to GWAP than to crowdsourcing. Given this, why would time and effort be put into making GWAPs that are fun when the crowd can just be plainly asked? Unless the cost of paying a crowd is more than the cost of developing and supporting a fun game? Or the results of a GWAP are perceived to be somehow superior? (See my first question.)

    3- I am confused by the "timely manner" criterion for steps 7, 8, 10, and 11. The authors seem to suggest that identifying a user's information need, obtaining query terms, and evaluating search results would be appropriate tasks for GWAP or crowdsourced work, provided the contribution could be integrated in a timely way given a real-time query. Surely this is not the case? A user with a real query wants results faster than it takes a human to carefully consider the question, let alone play a game about it. If the authors mean to use a GWAP or crowd for system training or analysis, then the idea of timeliness is on a completely different scale, and exactly what is meant by 'timely' in this context should be established.

  21. In Table 3, how was "Fun yet meets objective" determined? After reading the paper, I noticed that all the positive instances already had a game developed; was this the determining factor for this column?

    My overall impression of this paper was that Harris mostly just gave a summary of existing GWAPs and in some parts overstretched the use of crowds with disregard for the cost-benefit. For example, in step 6 he proposes parallel development of interfaces. While development might be fast, the creation of detailed and complete instructions is not. Also, a cost is incurred when evaluating the interfaces as well.

    I find it interesting that crowds have a lot more potential than GWAP for being integrated into the IR model. This to some extent portrays GWAP as inferior to crowds. However, would it be acceptable to say that GWAPs have better-quality results than crowds because their players do not have a monetary incentive? If so, it might be better to focus more effort on making GWAP fit the model rather than on fixing the problems that crowds have because of the financial desire to maximize profit.

  22. 1. The proposed Core IR Model does not comply with modern system design diagrams; for example, in an agile process, more feedback flows should return to steps 1-6. If such flows were added to the model, what would the impact be on crowdsourcing and GWAP?
    2. The authors concluded that steps 3a-3g did not scale for either crowdsourcing or GWAP design. I don’t agree with this point. If the documents are too complex, e.g. audio or video, crowdsourcing may be a good choice for them.
    3. When discussing criterion 5, the authors said the scoring was only meaningful for GWAP and not applicable to crowdsourcing. Why did they make this point? Scoring could also be used in crowdsourcing as a measure of workers’ performance.

  23. 1. In this article the authors discuss how most GWAP participants are not paid but are instead compensated solely through the entertainment value of the game. What are the ethical and legal ramifications of putting someone through an experiment and relying on its entertainment value as the sole payment for their services?
    2. In this article the authors attempt to calculate the suitability of crowdsourcing and GWAP for different steps of the information retrieval process. After weeding out several steps that did not meet their first two suitability criteria, they calculate the suitability of the methods for the remaining steps using three secondary criteria. However, only one of these criteria applies to crowdsourcing while all three apply to GWAP. Should the authors have developed additional criteria for crowdsourcing so that each method had an equal number of criteria applied to it? If so, what criteria would you propose?
    3. One of the consistent problems the authors mention with using crowdsourcing and GWAP for the interactive steps of information retrieval is time: the crowdsourcing or GWAP method used to complete a query must be fast enough that the original query issuer does not have to wait long. Is it possible to perform some of these crowdsourcing tasks for common queries ahead of time to mitigate this problem? If not, what other methods would you use to mitigate the time problem?
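    The pre-computation idea raised above can be sketched in a few lines: gather crowd judgments for frequent queries offline and serve them from a cache, falling back to the slow live-crowdsourcing path only on a miss. All names here are illustrative; the paper does not describe such a system.

    ```python
    # Illustrative sketch: a cache of crowd judgments collected ahead of time
    # for common queries, so the query issuer is not blocked on crowd latency.
    precomputed = {}  # query -> crowd-assessed results, gathered offline

    def crowdsource_live(query):
        # Placeholder for a slow, real-time crowd task (minutes to hours).
        raise TimeoutError(f"no precomputed judgments for {query!r}")

    def answer(query):
        if query in precomputed:
            return precomputed[query]   # fast path: served instantly
        return crowdsource_live(query)  # slow path: real crowd latency

    precomputed["jaguar"] = ["animal pages ranked first", "car pages second"]
    print(answer("jaguar"))
    ```

    Of course this only helps for queries that recur; rare queries would still hit the slow path, which is exactly the timeliness problem the authors acknowledge.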

  24. 1. The authors only considered “yes” and “no” when computing the score, but there is also an N/A option in the table; it seems the authors skipped all the N/As. Would the suitability rating be more solid and reasonable if we counted yes as 1, N/A as 0, and no as -1?
    2. Judging from Table 3, GWAP received a much lower suitability rating than crowdsourcing. I think this is a result of the special requirement (entertainment). Is it fair to compare crowdsourcing with GWAP? Suppose mature techniques existed for turning a task into entertainment; would the results in the table still be valid?
    3. It is interesting that the ratings of crowdsourcing and GWAP are the same in Table 5, whereas they differ in Table 4. Why does that happen?
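    The alternative scoring scheme proposed in point 1 above is easy to make concrete: map yes to +1, N/A to 0, and no to -1, then average over the criteria. The criterion answers below are made up for illustration, not taken from the paper's tables.

    ```python
    # Hypothetical sketch of the proposed yes/N/A/no suitability scoring.
    SCORES = {"yes": 1, "n/a": 0, "no": -1}

    def suitability(answers):
        """Average the mapped scores over all criterion answers."""
        return sum(SCORES[a.lower()] for a in answers) / len(answers)

    # e.g. a step judged yes/yes/N/A/no on four criteria:
    print(suitability(["yes", "yes", "N/A", "no"]))  # 0.25
    ```

    Under this scheme an N/A neither helps nor hurts a step, whereas simply dropping N/As (as the authors appear to do) inflates the weight of the remaining answers.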

  25. 1. The reference to the crowd is often unclear – references are made to an untrained crowd and, elsewhere, to the need for a specialized crowd. Is it still crowdsourcing in the case of specialized collaborative work?

    2. Designing a GWAP that includes real-time scoring is very tricky – one proposition made is to test new interfaces on the basis of time and quality. The incentive here does not translate well, because it is possible to cut corners in order to score faster – but the key is to identify the components that enable better utilization of the interface.

    3. The paper sees potential for human value addition in the preprocessing stage – however, the argument made when assigning a high rating to document classification is not convincing. Is there value in post-processing? No such stage is indicated, but after results are viewed, a GWAP could be designed to make the classification decisions discussed, thus enabling a feedback loop.

  26. 1) Being relatively new to information retrieval, I thought this was an interesting paper since it pieces out all the steps involved in IR (from the query to the final results), while most papers we have read focus on evaluation. However, I found some portions a bit strange and unnecessary. In particular, why do they bother to piece out all the steps of preprocessing the documents, only to say that GWAP and crowdsourcing cannot handle any of them? They could have summarized preprocessing just as effectively and made the same argument that scalability is not possible with GWAP and crowdsourcing.

    2) To aid their analysis of the applicability of crowdsourcing and GWAP to IR, the authors establish multiple necessary criteria. Criterion 4 (associated only with GWAP) asks whether it is possible to make the task entertaining. How did they objectively assign Yes or No assessments to this criterion? Did they go out and look for examples like Foursquare, and aggregate them?

    3) I still feel unconvinced that any of the first six steps, which constitute the design process, can realistically be crowdsourced (or “GWAPed” for that matter, though the authors reach that conclusion themselves). Specifically, they state that both obtaining document collections and implementing a user interface can be crowdsourced. Regarding obtaining document collections, would not using a crowd require tons of validation and filtering of results? A computer seems much better suited. Now, if we are talking about generating new data, that is different, but using the crowd to find existing data just seems to add unnecessary difficulty and complexity. Regarding user interfaces, do they mean small groups when they say crowds? I can understand using crowds to get input on how well they like certain features, but actually implementing an interface this way seems like a logistical nightmare. Are they talking about combining 1000 designs or just choosing the best one? I realize they are trying to “think outside the box,” and a lot of interesting innovations have happened this way, but the suggestions they offer are weak, and the examples are nonexistent.