Thursday, October 17, 2013

10-24 Maria Eskevich et al. Creating a data collection for evaluating rich speech retrieval. LREC’12.


  1. 1. My first question is about Section 4, Development of an Effective HIT. In this part they demonstrate their design for the questions used in MTurk. One of the tasks for the workers is to label the specific times at which the speech act begins and ends in the video. Since it is impossible for every worker to pinpoint the same start and end points, I am wondering how they are going to do the alignment?

    2. My second question is about Table 2. In Table 2, they show examples of the two types of queries associated with speech acts, along with transcripts for the relevant segments. For each of the speech acts, they gave both a long query and a short query. However, the short query is not that short from my point of view; for the Apology speech act, for example, the short query contains 9 words. Plus, the short query seems to lose some key information compared with the long query. For example, in the Promise speech act, the long query is about Obama's promise of a cure for cancer, but the short query just states the promise part.

    3. My third question is about Section 5, Reward Levels. In order to motivate the workers to contribute, they allowed workers to choose their own bonus level, to earn trust and show appreciation. I am wondering how they can verify that the chosen bonus level corresponds to the actual contribution. In addition, suppose the bonus levels reported by workers are accurately assigned: can we make use of them, since the work seems to be naturally rated already?

  2. 1. In their experiment, the authors had a high failure rate for HITs, which they concluded was in part because of the subjective nature of “speech acts.” Why is crowdsourcing effective for subjective tasks like relevance judgments but not as effective for this experiment? How could the experiment have been run differently to try to obtain better results?

    2. In the experiment the authors found that it was much easier for MTurk workers to identify opinions than the other categories of speech acts. Are the categories they determined really equal in type? How could the categories be better defined? The authors also gave workers who identified a rare speech act a bonus; was this bonus fair?

    3. Why did the authors not choose to divide the experiments into smaller tasks, based on their observation from the pilot study that workers tended not to find segments after 3 minutes into the video? Do you think that dividing up the tasks could have produced better results?

  3. 1. Even if MTurk workers have high scores from previous tasks, if their previous tasks were too different from this one (for instance, all were very short tasks) then the scores may not be valuable metrics of worker quality. In addition, as mentioned last week, different types of workers search for tasks differently. Therefore my question is, how exactly did they control for who found and was selected to participate in their tasks?

    2. As a follow-up to the first question, how could one improve this procedure? Based on the fact that they found spam results and only identified "39.5% [that] were suitable for use in the dataset", it seems that there is much room for improvement. Even with the task being relatively straightforward, it seems that there are many issues with using MTurk for even slightly more advanced tasks that need working out. What might these issues have been? How might a better user review/rating system be implemented to ameliorate these issues? Would biographical information be an option (and would it necessarily help, or would it simply lead employers to drastically limit their potential labor population to the point that the crowd-sourced work becomes slow or impossible to perform)?

    3. How did they judge the transcripts provided by the workers for the test set? Did they manually review the transcripts and compare them to the ASR results? Did they ever actually compare either to real transcripts, or did they simply rely on word confidence? The entire evaluation component of their project is unclear to me from the paper.

  4. The research in this paper is very interesting and instructive; it shows some effective ways to motivate crowd workers and to control the quality of their tasks. I have some questions about the methods described in this paper. Firstly, Maria provides some examples for crowd workers in her research. Will these examples influence workers’ answers?

    In addition, the quality of the transcription task will be influenced by certain characteristics of the crowd workers; for instance, workers’ level of education may impact the outcomes of such tasks. In this sense, is there a need to give a pre-test to select the workers qualified for the tasks?

    In order to motivate the crowd workers, a reward is included in the tasks, which is actually very wisely designed. So, can we allow these workers to select their bonus level at the beginning of these tasks? I think it’s more likely to motivate them this way.

  5. It is interesting that crowd workers deal better with speech tasks than written tasks (Marge et al. 2010). This is probably because speech is inherently easier for us to understand than writing. It would be more instructive (to researchers of all fields) if human cognitive research validated this. Advancements in technology and the ease of speaking will probably make this a burgeoning area of IR research. If such a change takes place, will we still be able to differentiate IR and IIR?

    The HIT instruction page contains examples that describe what each of the speech acts could be. This is good because more workers will now be able to base their judgments on the examples, which might make the labels more consistent. However, it also induces bias because the workers may tend to find labels close to what they have seen. Procuring the services of advanced English speakers (or native English speakers) could produce more generalized results, but that would defeat the point of 'cheap crowd workers'.

    The authors mention that many crowd workers do not find any worthy speech act after 3 minutes into the video segments. Why is this so? Is this a common user trend? Or did the experimental methodology make it this way? They also mention that limiting the video segments to 3 minutes was not useful due to the non-uniform nature of the video (music and non-speech parts). However, the authors should still have curtailed the videos to segments of 3 minutes each, just to avoid problems with the external server on which the videos were posted, or to obtain a greater number of video segments.

  6. The authors mention that splitting the videos into 7-minute segments did improve the chances of viewers looking at the video, but issues still remained. But would breaking them down further help more?

    How were the videos split into segments so that they didn't stop making sense or break up the information that was meant to be captured? If someone is actually going to watch the whole video to judge the best length of document into which it needs to be broken, then the same person could actually judge the whole segment and find what it means, right?

    The way testing proceeded can be generalised to other media extractions too. Also, the fact that the instructions were phrased in a way that is easy for users to relate to is a very nice idea. But what if the contents are not interesting enough to be shared, but can still be categorised into the data segments needed? Didn't this introduce confusion with regard to what they wanted to gather data for?

  7. 1. The authors have divided the speech acts into 5 different categories: Apology, Definition, Opinion, Promise and Warning. This classification seems a little unintuitive. A significant part of the speech we encounter falls under opinion or definition. Would it have made more sense if the authors had introduced ‘conversational’ as another category? The results seem to be skewed, as most of the data fall under opinion, and the authors are forced to assume that it is an inherent property of the data, which might not be the case.

    2. In the study, the authors used only videos that have an ASR confidence estimate above 0.7, and they made the HIT available only to workers with a HIT approval rate >= 90%. Does this bias the study towards ‘perfect data’? Real-world data cannot be expected to have such a high confidence estimate, and most workers fall below a 90% HIT approval rate. Do you feel that the study remains relevant despite this restriction?
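The selection filter this question refers to can be sketched in a few lines. This is a minimal illustration assuming transcripts are available as (word, confidence) pairs; the paper only states the average word-level confidence threshold, so the data representation and function names here are hypothetical.

```python
def average_confidence(words):
    """Mean ASR confidence over (word, confidence) pairs."""
    if not words:
        return 0.0
    return sum(conf for _, conf in words) / len(words)

def select_videos(transcripts, threshold=0.7):
    """Return ids of videos whose average word confidence exceeds threshold."""
    return [vid for vid, words in transcripts.items()
            if average_confidence(words) > threshold]

transcripts = {
    "vid1": [("hello", 0.9), ("world", 0.8)],  # average 0.85 -> kept
    "vid2": [("noisy", 0.5), ("audio", 0.6)],  # average 0.55 -> dropped
}
print(select_videos(transcripts))  # -> ['vid1']
```

Note that such a filter says nothing about transcript correctness, only about the ASR system's self-reported confidence, which is part of what the question above is probing.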

    3. While summarizing the findings of the paper, the authors state that although the length of individual snippets of video was fixed at 7 minutes, the workers consistently watched only the first 3 minutes. After such an observation, shouldn’t they have re-established consistency, either by reducing the video length or by making sure that the workers watch the full video? Does this not bias the result towards the content of the first three minutes?

  8. 1. This paper presents a fairly novel study that uses crowdsourcing for complicated video-judging tasks. My first question is about the design of the study. The bonus level was chosen by the workers themselves and approved by the authors, which the authors claim greatly motivated the workers. However, it seems this will only work for such a small-scale problem. When a large data-collection study is performed, it becomes infeasible for the authors to manually approve all these bonus requests. So would it be better to create an automatic approval system, which approves a bonus request from a worker only if the requested amount is less than or equal to an automatically calculated bonus prize?
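The approval rule proposed in this question could be sketched as follows. This is purely an illustration of the idea, not anything from the paper: the per-segment rate and the quality-scoring rule are invented placeholders.

```python
def computed_bonus(num_valid_segments, rate_per_segment=0.05):
    """Bonus the requester's own quality check would award (invented rule)."""
    return num_valid_segments * rate_per_segment

def approve_bonus(requested, num_valid_segments):
    """Approve iff the requested amount is <= the calculated bonus."""
    return requested <= computed_bonus(num_valid_segments)

print(approve_bonus(0.10, 3))  # 0.10 <= 0.15 -> True
print(approve_bonus(0.30, 3))  # 0.30 >  0.15 -> False
```

The hard part, of course, is the quality check inside `computed_bonus`: if that still requires manual review, the automation buys little, which is exactly the scaling concern raised above.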

    2. It is mentioned that all the results were validated manually, which restricts the power of crowdsourcing. Although crowdsourcing can generate a huge amount of data in a short time at minimal cost, it is very time-consuming to verify these results. The authors give a solution, which is using a second crowdsourcing procedure to judge these results. However, this creates two new problems. Firstly, a second round of crowdsourcing will cost extra money and effort, in both designing the interface and assigning payments. Secondly, it raises the issue of which round of results we should trust more. Why did the authors not use multiple simultaneous judgments on these data to further confirm the results?

    3. It is not clear to me why only 30 and 50 queries were selected from the development and test sets when they have such a huge amount of data (562 and 3728 start points for all videos). What are the criteria for selecting these queries? Why were fewer queries for apology/promise selected for the data sets? Will this create bias in this data collection? For example, IR systems which are effective at retrieving apology speech acts will not be evaluated properly.

  9. 1. The paper does suggest how the implementation of a reward, a bonus, and the time for completing the work impacts the quality of the work produced by the MTurk workers. However, even after this suggestion, we are left unsure of how we should take these varied metrics into account when we want to use crowdsourcing to improve system performance. Also, how can we ensure that the MTurk assessors will be able to appropriately evaluate cross-modal and multi-linguistic frameworks?

    2. What seems most important when collecting data for rich speech retrieval is the adaptation of linguistic models according to the tasks, topics and speaking styles of the users. It is unclear, however, how MTurk workers would be able to bring in a comprehensive linguistic database for every new task that is published. How does the proposed methodology account for this variation in linguistics through the data construction for the IR system?

    3. Even though the paper has provided specifics on data collection and evaluation, many questions still remain on how and what to measure for interactive information retrieval systems. Currently, real users are used to test the systems. While this does serve as an ultimate test in a way, it also raises issues of comparing users' skills, standardizing test situations, and deciding the amount of training, help, and time allotted for the evaluation. And so, I am left wondering whether user modelling techniques might be sufficient for evaluation efforts.

  10. 1. The authors broke down types of data included in these videos into 5 categories (apology, definition, opinion, promise, warning). Given that the source of the data was a website of semiprofessional and unprofessional videos, how realistic was it for them to expect larger amounts of non-opinion speech?

    2. The authors discuss the difficulty that the crowdsource workers had in identifying relevant speech past the three-minute mark in videos that they had shortened into 7-minute segments. If crowdsource workers don't want to spend that long on a video, couldn't the HIT poster limit the length of the video segments and post more HITs for more workers to work on? I would imagine it would lead to more non-relevant clips of video and more work upfront to process these videos into logical segments, but you would get more results from workers, who are more willing to spend time looking at, say, a 3-minute clip as opposed to a 7-minute one.

    3. Could some task like the one posed by the authors to crowdsource workers be turned into some form of GWAP? The length of the videos would certainly be a limitation, but depending on the type of videos being shown it could be interesting to see if researchers could turn the task into some sort of game.

  11. 1. “In order to be included in the ME10WWW set, a video needed to have been transcribed by the ASR-system with an average word-level confidence score of > 0.7.” Here, where did the value of the word-level confidence score come from? Usually, a reference text is required to evaluate ASR quality. If such a reference text existed, why did this paper require another transcription?
    2. On page 2, last paragraph, the paper emphasized that all the tasks were “quite straightforward”. However, different workers will perceive difficulty differently; what was considered “straightforward” may still be challenging for some workers. Thus, this assumption should not be held.
    3. In Section 6, the authors said they did not want to decrease the HIT Approval Rate of workers who misunderstood due to the nature of the task. The question is how they identified such “misunderstanding”. Is it fair to forgo decreasing the HIT Approval Rate here?

  12. This comment has been removed by the author.

  13. 1. In the course of this HIT, the problem arose that MTurk does not permit playback, and workers had to go to an external server. It seems like this would be a huge gap in MTurk's business model. What would be a benefit for leaving out multimedia playback? What competitors are out there filling in this spot in the market?

    2. In the initial HIT, the authors found that there was too much jargon, and they revised it, instructing the crowdworkers to find within the video "something you'd want to share on your favorite social media network." Since this phrasing was more successful in generating useful results from the MTurk workers, what does it tell us about the demographics of these workers? It seems to appeal to a specific age/socioeconomic group. In the conclusion, they say that it is necessary to use terms crowdworkers are familiar with and that appeal to their everyday experiences, but how can this be done until there is a clearer vision of who the crowdworkers are? Do they suggest using a pilot HIT every time?

    3. In this study, crowdworkers must view a video, label the time of the speech act, transcribe the segment, and come up with a full-sentence query that would find it again. What is the benefit of a full sentence, when most people would probably use the short-form query? It seems anachronistic. How would the nature of this project have changed if the crowdworkers had been asked to tag the videos with keywords instead, as is the practice with archives or large host sites like YouTube and Vimeo?

  14. For each HIT the authors designed for the MTurk workers, I believe there is too much work involved, which might affect the quality of the relevance judgments and query formation. For instance, why ask the workers to write a full-sentence query along with a short web-style query? What is the point?

    In Section 4.1, the authors “prepared 562 and 3278 starting points for longer videos at a distance of approximately 7 minutes apart”. I think this makes the work much more difficult, as those workers are paid by the number of HITs, so time is very precious for them.
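For concreteness, generating starting points roughly 7 minutes apart might look like the following sketch. The 420-second interval and the function name are assumptions for illustration; the paper only says the points were approximately 7 minutes apart.

```python
def start_points(duration_s, interval_s=420):
    """Starting offsets in seconds, spaced interval_s apart, within one video."""
    return list(range(0, int(duration_s), interval_s))

# A 25-minute video yields segments starting at 0, 7, 14 and 21 minutes.
print(start_points(25 * 60))  # -> [0, 420, 840, 1260]
```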

    In Section 4.2, the authors make a very interesting discovery: when the request was expressed formally, MTurk workers were confused about the type of answers they had to find, but when the request was expressed as part of everyday experience (e.g., sharing something on Twitter), MTurk workers were better able to localize the answers. I think this is a contribution of the paper to the crowdsourcing community (especially to requesters) on how to post effective questions.

  15. In Figure 1, question 6 of the example explicitly states to avoid using “informal internet language” in determining what comment a HIT participant would use to comment on the video to get friends to see it. What limits does this place on the natural character of language over the internet as an informal venue, particularly among friends?

    On page 5, Eskevich et al. found that adding new questions to describe particular speakers' actions in the video segments led to some unintended consequences. Is this a case of overcoaching? Or are specific tasks focusing on physical cues just difficult to include in a study that should be centered on speech and intentions?

    Eskevich et al. bring up that there was a lack of Promise and Warning queries. Is that due to the time spent on a video, as suggested on page 4, or is it because promises and warnings themselves are difficult to pinpoint, or simply rarer than other speech acts in general speech? Table 2 kind of lends itself to the idea that they are ambiguous or lacking; both the long and short queries for these acts lack the depth and cohesion of the queries for the other acts.

  16. 1. The paper indicates that the HIT refinement failed because of the quality-assurance measure adopted. This makes sense, as the question posed had no instructions associated with it and was perhaps too vague. This raises the question of how related and meaningful to the task such questions should be.

    2. A multilevel design, or breaking the task into smaller pieces, would have been worth exploring. Refinement of the HIT was not approached with tweaking the structure as the objective.

    3. How can we approach aggregation of redundant responses when addressing such a task, considering the many free-response questions?
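One possible aggregation scheme, purely as an illustration and not the authors' method, is majority vote over the categorical speech-act labels and the median over the numeric start times, which tolerates a single outlier judgment; genuinely free-form text (transcripts, queries) resists such simple pooling, which is what makes the question hard.

```python
from collections import Counter
from statistics import median

def aggregate(labels, start_times):
    """Combine redundant judgments for one video segment."""
    label = Counter(labels).most_common(1)[0][0]  # majority speech-act label
    start = median(start_times)                   # robust central start time
    return label, start

print(aggregate(["opinion", "opinion", "apology"], [62.0, 65.0, 180.0]))
# -> ('opinion', 65.0)
```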

  17. 1) Since some workers provided doubtful comments about the quality of their performance, shouldn't this fact raise concerns about the quality of this data set? Thus, in order to improve the data set, shouldn't some form of calibration be put in place before the data collection process?

    2) When discussing data management, Eskevich et al. mention that most of the reviewers would find a relevant quote within the first 3 minutes of the video. Doesn't this mean that the 7-minute video clips were too long and that relevance could be further improved?

    3) In the HIT description, Eskevich et al. use the wording “something interesting”. Wouldn't this wording imply that workers were targeting more complex quotes (outliers), and thus that the more difficult (common) quotes were ignored?

  18. 1. When talking about the data set, it is mentioned that “shows with less than four episodes were not considered for inclusion in the data set”. Why were these data excluded?
    2. The data set covers 5 basic speech act types. Are there any other types not mentioned in the paper? Sometimes it is difficult to clearly identify the type in speech. Will there be overlaps?
    3. The authors mentioned in Section 4 that they use the HIT Approval Rate to evaluate the performance of a worker who could take the task. However, this number only reflects an overall history of the worker; perhaps professional skills are more critical for setting up the data collection. Why did the authors not consider such a measurement?

  19. The authors point out that one of the difficulties of this study of video/audio was in part due to the fact that Amazon did not support a video feature within MTurk, and in class we discussed how MTurk hasn't evolved much since its launch. Since this study was released, has Amazon made any progress toward supporting video streaming and various video features?

    The authors claim that the more advanced HITs that they created were successfully completed by participants, which differs from previous studies that focused on other MTurk workers' capabilities. However, later in the paper they state that only 39.5% of results were 'suitable for use in the dataset', which makes it seem like a stretch for them to claim that they were creating accomplishable complicated HITs.

    I wonder why, when the authors analyzed the data and discovered that the first three minutes of the videos had the most speech acts, with a dramatic drop-off after three minutes, they did not (after having done several rounds of pre-tests) consider cutting the videos down to less than three minutes a clip.

  20. 1. With respect to understanding the speaker’s intention during speech retrieval, is crowdsourcing a good technique? Accents aside, context plays a definite role in how one interprets things, and that varies among crowdsourced workers.

    2. Do you agree, as the researchers say, that “crowdsourcing provides an intuitively appealing solution” for establishing the relevance of available documents to a user search query?

    3. What do you think of the option that MTurk HIT owners can change the reward based on responses / work done? Do you think this is ethical? Is it a good way of regulating that participants do acceptable work?

  21. This comment has been removed by the author.

  22. 1. Using crowdsourcing to obtain relevance judgments for speech acts (in the sense defined in the paper) seemed to have been a good choice to try, but one has to wonder if there is any practical utility, given the not-so-confident results found by the builders of the test set. The authors noted some ways to reduce cheating and bad judgments, but it seems like noise in this domain can be very expensive. How practical, then, is using crowdsourced relevance judgments for speech acts as opposed to using them for other tasks?
    2. One experimental concern I had about the way the authors presented the HITs to the workers is that they had to enter unstructured text to complete the task, together with choosing from a fixed menu. What if the unstructured text is hampering the workers' performance, and the noise would go away or become acceptable if we only asked workers to choose from fixed options? While this does increase the scope for cheating, it could be that average performance would increase, since the authors are choosing highly rated (90%+) workers anyway. Is it an experiment worth trying out?
    3. The fact that the authors chose to award bonuses to the workers, and even asked workers to pick their own bonus, seems to be an interesting experiment in its own right. The authors noted that good, sincere workers were able to express uncertainty about their responses by honestly admitting when they were unsure about their answers and denoting a low or no bonus. The question, then, is whether we can use such a methodology, at an incrementally higher cost, to get not just relevance judgments but also practical measures of uncertainty from the workers themselves, which would automatically discriminate easy, deterministic tasks from the hard ones. This, of course, would only work if a significant subset of the worker population was not insanely greedy, but from the observations here it might just show something interesting.

  23. 1. In this article the authors created a HIT for Amazon’s Mechanical Turk designed to create an effective test collection for Rich Speech Retrieval. The HIT they created asked the worker to perform several different activities that were somewhat difficult. After they posted this HIT, they discovered that this was asking too much of the worker, and several of the results that they acquired were not useful. Is there some way you can think of to split these tasks into different HITs that would make them easier for the worker to perform and yet still produce results that could be useful to the researcher?
    2. After the first HIT that they created did not produce good results, the authors created a second, refined HIT. This HIT was designed to better explain to the worker the results they were looking for, through a different wording of the task and an example of how the questions could be answered. Do you think that focusing their HIT in this manner could have led to some type of bias in the results they were given?
    3. In the results of their experiment the authors saw that the majority of the results they acquired were not useful for the creation of a test collection. They were able to use fewer than fifty percent of the responses that they paid workers for, and this is after all the effort they expended to try to minimize any errors the workers would make. Do you think that the method they used for crowdsourcing is an effective method for building a test collection? What ideas do you have to raise the amount of useful results for a future HIT?

  24. 1) The paper focuses on using crowdsourcing to help with a retrieval task that involves audio/video instead of just text. Is there significant research into the efficiency of such retrieval systems? All the papers we have read focus on effectiveness, since computing resources for text retrieval seem well suited to these sorts of tasks. Are they finally a bottleneck in the case of audio/video analysis?

    2) Throughout this paper, I wondered what sort of results they would get if they had simplified the crowdsourcing task to just transcribing the conversations, and commenting on them. Then, they could do some direct processing on the text to determine the speech act (I don’t know much about language processing but I figure there must be some work in automatically identifying speech acts from parts of speech). Would it work better than using normal users who are more biased?

    3) I thought it was interesting how they framed the task, as “imagine what parts of this video, you would like to share.” While I understand that they need to make the HIT more relatable to the users, would not this cause them to be more biased in their selection? For example, I don’t think I’d ever find a promise, warning or definition that I’d want to share, but I’d be more likely to share a funny/controversial opinion or apology.

  25. 1- Assume there is a need to investigate ‘illocutionary speech acts’ (a phrase I am still not sure I understand). Why would the authors pick the ME10WWW archive? Of all the places on the internet, this seems like the most obvious place to find videos of cats. Was it an issue of availability? Could they not access YouTube videos, which conveniently come in lengths of less than 7 minutes, feature just about every kind of speech act imaginable, and are compatible with just about anything? In all seriousness, I’m interested in the justification for the use of this archive, which doesn’t seem well suited for the task.

    2- I am particularly impressed with the ability of the crowd to accomplish this relatively complicated task. Even a 50% acceptance rate seems high. The authors did mention improving the HIT, so I wonder if they considered breaking each question into its own part. Maybe one person would be responsible for transcription, another for act identification, another for query invention. Making considerably simpler tasks might eliminate the need for bonuses as well, although the authors seem particularly eager to offer them.

    3- The purpose of this research (as identified in the title) was to establish a test collection and to see if they could get a crowd to do it. The authors spent a lot of time discussing the process of creating a HIT, getting data, and interfacing with crowd workers. They have apparently come to the conclusion that this is a task suitable for a crowd. What they did not do was discuss the quality of the collection produced. How thorough is it? How confident are they in the judgements made by the crowd? Does 50% acceptable really mean 100% confidence in that 50%? What happens when search algorithms are run on the collection? Does it help produce consistent results?

  26. 1. Unlike the paper by Oard and others, in this paper the tasks were restricted to the selection of multiple-choice responses or numeric input within a fixed range. How could one account for a relevance scale in the case of expressing emotions, or if the search was about multiple categories of topics?

    2. The experiment was performed and compared with domain experts and non-domain experts. Their results demonstrated that the non-expert crowdsource workers could produce work of a similar standard to expert workers. Although this result was visible in the case of text retrieval too, would it not be more notable in the case of speech, which has more factors to take into account when designing and evaluating the relevance criteria? In this case, does it matter whether the assessors are trained or untrained?

    3. It is reported that the workers faced a technical challenge that prevented them from viewing the videos multiple times. If they were allowed multiple viewings, how could one judge the assessor's quality? Was restricting the workers to viewing the videos only once intentional? The assessor's quality versus the number of times the video was viewed (and eventually the increased time in completing the HIT) could play an important role in the evaluation of the system.