Peng Ye and David Doermann, "Combining preference and absolute judgements in a crowd-sourced setting." ICML '13 Workshop on Machine Learning Meets Crowdsourcing.

Summary: The authors explore the problem of obtaining relevance judgments for a group of documents that can then be used in evaluation. To conduct the experiment, they combine absolute relevance ratings and preference judgments (made by human subjects on Amazon Mechanical Turk) with machine learning of relevance preferences. They develop a statistical model to estimate the probability of a specific score given absolute and preference judgments, along with an additional model of cost effectiveness. Results were analyzed using the Spearman Rank Order Correlation Coefficient (SROCC) to show the relationship between predicted and actual scores, and the Wilcoxon-Mann-Whitney statistic (ACC) to evaluate ranking. The experiment showed that the hybrid method with machine learning was the best system for creating relevance judgments.

1. On page 1, the authors note that ranked ordinal scales do not account for differences in the intervals between categories such as "poor," "fair," "good," and "excellent." Can you determine a numerical value for distances on a ranked scale? Should these simply be given a 1-5 ranking, or should some other measure be used? How does the distinction between ordinal and interval scales affect the understanding of measures discussed in the other readings, such as Discounted Cumulative Gain and Rank-Biased Precision?

2. Do you think that the method combining absolute judgments, preference judgments, and machine learning is better than the pooling method? In what ways?

3. In constructing the experiment, the authors choose to test their method on synthetic data and rely on the assumption that noise levels across different objects and tests are constant. Does relying on such generalities increase the scope of applications for the method?
Does this make you question the results?
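The ordinal-versus-interval distinction raised in question 1 shows up concretely in Discounted Cumulative Gain: before DCG can be computed, the ordinal labels must be mapped onto numeric gains, and that mapping is a modeling choice rather than a given. A minimal sketch (the label values and both gain mappings are illustrative assumptions, not taken from the paper):

```python
import math

def dcg(gains):
    """Discounted Cumulative Gain with the usual log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

# Ordinal labels for one ranked list: 0=poor, 1=fair, 2=good, 3=excellent.
labels = [3, 1, 2, 0, 1]

# Two candidate mappings from ordinal labels to numeric gains:
linear = list(labels)                     # assumes equal intervals between categories
exponential = [2**g - 1 for g in labels]  # rewards "excellent" far more than "good"

print(dcg(linear))       # treats fair->good and good->excellent as equal steps
print(dcg(exponential))  # widens the gap between the top categories
```

The two mappings can rank the same pair of systems differently, which is exactly the sense in which treating an ordinal scale as an interval scale is a substantive assumption.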
"Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998" by Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel [substituted for the Donna Harman article]

Summary: The authors argue that IR evaluation should become more standardized and that IR researchers should be more communicative about their results, which would lead to greater improvements in the field. They find the current state of research on IR improvements unacceptable: surveying papers built on various TREC test collections, they found that researchers were measuring against weak baselines, yielding "improvements" of misleading import to the field. The authors propose a database website they have created (www.evaluateIR.org) as a solution to the isolated, irregular state of research in the field.

1. It seems to me that the findings of the TREC experiments are too universally dismissed by the authors, such as when they state early in the paper that there were "no improvements in retrieval effectiveness from 1994 to 2005" (p. 2). If this is true, then how were the TREC tracks allowed to continue? And does this necessarily mean that there were no useful findings at all, or only in the most direct sense of an improvement over the previous year's result? Perhaps there was an alternative improvement over a similar baseline of a previous year that had gone unexplored. Or perhaps TREC takes the position, opposed to the authors', that "What we care about is...demonstrating that there is an improvement, not achieving optimal performance" (p. 6).

2. I thought it was interesting that "[t]he papers studied includes four that had authors in common with this paper" (p. 4). Could this introduce a bias? Or does it instead offer a sense of deeper understanding of the field, proving the authors insiders? Or does it even exhibit an element of respect toward the TREC participants under scrutiny, since the authors were among them?

3. In the last section, which describes "A Public Database of Runs Data," the authors propose that their database will remove from researchers the burden of having to come up with a baseline of their own. Couldn't this be problematic in the sense that everyone will be working on the same narrow path, when before researchers could build experiments specific to myriad needs? The authors state that the best system would make the baseline of the current year out of the "best" experimental result of the previous year. But what if work builds along a path that is not suited to address the issues recognized by other researchers; should their baselines be discounted?
Kekäläinen, Jaana, and Kalervo Järvelin. "Using graded relevance assessments in IR evaluation." Journal of the American Society for Information Science and Technology 53.13 (2002): 1120-1129.

Summary: Kekäläinen and Järvelin propose new methods for grading the relevance of documents in test collections beyond the traditional binary judgments of the TREC test collections. They discuss earlier approaches to relevance judging and use a case study on a text collection of newspaper articles to demonstrate some of their proposed measures. The authors stress the importance of retrieving highly relevant documents in the search process and the use of graded relevance assessments.

1. The authors reassessed 38 topics from the TREC 7-8 test collections and found that they agreed with the previous relevance judgments in 95% of cases. Given that IR evaluation is so reliant on relevance judgments, is there any way the bias of different assessors can be accounted for and minimized in the system evaluation process?

2. The authors state that as the relevance levels in a test collection become more stringent, one retrieval method being tested will "stand out from the rest." If so much weight is put on the documents judged "highly relevant," do they not run the risk of minimizing the importance of other relevant documents and promoting an IR system built to retrieve only highly relevant documents, if that meant a better score?

3. In their case study, the authors had four judges assess the relevance level of the documents. When the assessors' judgments differed by one point (21% of the relevant documents), the researchers alternated which assessor's judgment they used. Does this alternating selection method not lessen the credibility of the relevance judgment process?
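On question 1, one standard way to quantify (though not remove) assessor disagreement is a chance-corrected agreement statistic such as Cohen's kappa, which discounts the agreement two assessors would reach by chance alone. The sketch below is purely illustrative; the graded judgments are invented and this is not a measure the authors report:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two assessors' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Probability the assessors agree by chance, from their label frequencies.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical 0-3 graded judgments from two assessors on ten documents.
judge1 = [3, 2, 2, 1, 0, 0, 3, 1, 2, 0]
judge2 = [3, 2, 1, 1, 0, 1, 3, 1, 2, 0]
print(cohens_kappa(judge1, judge2))
```

A raw agreement rate like the 95% the authors report looks different once corrected for chance, which is one concrete way assessor bias could be "accounted for" in evaluation.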
The following questions come from Kekäläinen and Järvelin's "Using Graded Relevance Assessments in IR Evaluation" (2002).

Summary: The article looks at using four grades of relevance judgments for IR evaluation. In particular, the authors are interested in whether graded relevance judgments yield different precision-recall (P-R) curves than binary relevance judgments. In the course of the study, three different query expansion techniques are used, which attempt to increase the robustness of the sample queries. The authors find that the different relevance levels yield significantly different precision-recall results, substantiating their claim that simple binary relevance measures may hide certain IR system behaviors.

1. The article is interested, among many other things, in whether query expansion (QE) techniques are beneficial when graded (non-binary) relevance assessments are used. Three query expansion techniques appear in the research: an average over all expanded terms (including synonyms and constituents of administrative areas), a "Boolean query" that finds phrase windows containing at least one of each term (or a synonym of it), and a SYN query that calculates a score from the sum of the synset of each query word. What are some alternative ways that query expansion could be done?

2. Kekäläinen (pp. 14-17) shows that the relative performance of the query expansion techniques can change at different levels of graded relevance (e.g., SSYN/e remains the best approach, but the degree to which the methods are separated changes with the specific relevance level). Knowing that a method might perform equal to another at one degree of relevance but significantly differently at another, which levels might be the best ones for characterizing system performance?

3. The authors state that the documents used in the study had their word tokens simplified: words were reduced to a morphologically basic form and contractions were split. This was said to be done to help term matching between query and document. What are some other information extraction tasks that could improve document IR?
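The article's central claim, that binary relevance can hide system differences that graded relevance exposes, can be illustrated with precision at k computed under two relevance thresholds. Everything below (document IDs, grades, and the two hypothetical runs) is invented for illustration, not data from the paper:

```python
def precision_at_k(run, relevant, k=5):
    """Fraction of the top-k retrieved documents that count as relevant."""
    return sum(doc in relevant for doc in run[:k]) / k

# Hypothetical graded judgments: document id -> relevance on a 0-3 scale.
grades = {"d1": 3, "d2": 1, "d3": 3, "d4": 1, "d5": 2, "d6": 0, "d7": 1}

run_a = ["d1", "d3", "d5", "d6", "d7"]  # puts highly relevant documents first
run_b = ["d2", "d4", "d7", "d6", "d1"]  # retrieves mostly marginal documents

binary = {d for d, g in grades.items() if g >= 1}  # liberal: any relevance counts
strict = {d for d, g in grades.items() if g >= 3}  # stringent: highly relevant only

print(precision_at_k(run_a, binary), precision_at_k(run_b, binary))
print(precision_at_k(run_a, strict), precision_at_k(run_b, strict))
```

Under the liberal binary collapse the two runs tie; counting only the highly relevant documents separates them. This is the kind of behavior the stringent-level P-R comparisons in the paper surface.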
Evaluating the Performance of Information Retrieval Systems Using Test Collections

Having read this paper, I find it very interesting to use crowdsourcing to gather relevance assessments. For this new approach, I have several questions.

Firstly, how can we control the quality of relevance judgments gathered through crowdsourcing? New assessors usually need some kind of training before they start to judge topics, but how can we train these participants through a website? Or can we break the standard way of judging relevance into several steps through which participants can be led to complete their judgment of each topic?

Secondly, how can we assign topics to participants? These participants usually come from different backgrounds. Do we assign topics according to their backgrounds? If so, more specifically, how do we define their backgrounds: by their education, their jobs, or their interests?

Lastly, the author mentions that using crowdsourcing in specialized domains is problematic. Why? Is this because there are few participants able to judge topics in these fields, or because the topics in these fields are very difficult?
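On the first question, the most common quality control in crowdsourced labeling is redundancy: collect several independent judgments per topic-document pair and aggregate them, for example by majority vote. A minimal sketch with invented labels (real deployments typically add trap questions with known answers and worker-accuracy weighting on top of this):

```python
from collections import Counter

# Hypothetical worker labels per (topic, doc) pair: 1 = relevant, 0 = not.
labels = {
    ("t1", "d1"): [1, 1, 0],
    ("t1", "d2"): [0, 0, 0],
    ("t2", "d1"): [1, 0, 1],
}

def majority_vote(votes):
    """Aggregate redundant crowd judgments; ties break toward non-relevant."""
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0

aggregated = {pair: majority_vote(v) for pair, v in labels.items()}
print(aggregated)
```

Redundancy trades judgment cost for reliability, which connects back to the cost-effectiveness concern raised in the crowdsourcing discussion above.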