Saturday, November 9, 2013

11-14 Optional alternative reading


  1. How Evaluator Domain Expertise Affects Search Result Relevance Judgments
    - By Kenneth A. Kinney, Scott B. Huffman, and Juting Zhai

    (Instead of Kazai paper on User Intent and Assessor Disagreement in Web Search Evaluation)

    The authors experimentally analyze how relevance judgments for web search results differ between generalists and domain experts. They found that for queries drawn from hard domains (hardness based on query clarity), domain experts make better relevance judgments than generalists. Disagreement about the intent of the query was also higher among generalists than among experts, and this was mainly attributed to the simple keyword matching followed by generalists. The experiments were performed in the computer programming and biomedical domains. Finally, the authors found that generalists performed better than in the earlier experiments when they were given expert statements about the query intent.

    1. In order to find out whether generalists outperform or at least match domain experts in judging relevance, would it not be appropriate to judge in a binary relevance mode rather than a graded one? Why was graded relevance used? In my opinion, a learning-to-rank methodology could have identified the margin of difference in relevance levels without employing graded relevance.

    2. The authors have stated that generalists’ ratings are prone to errors because they do not clearly understand the query’s intent. How do you ensure that the domain experts have understood the query's underlying intent well? Only the query owners could accurately identify the underlying intent of a query, so their ranking of document relevance would be more appropriate than that of domain experts. The experts' expertise in the domain might lead to a biased opinion rather than a versatile, diversified view of relevance for the query.

    3. How can search effectiveness be related to the domain knowledge of the assessor? What do the authors mean by search effectiveness? If it refers to how effective the search results were, is that not dependent on the test collection? On the other hand, if it refers to how effectively the user is able to search for relevant documents for his/her query, would it not involve refining query terms and other similar tasks on the user side? How can the latter be called search effectiveness?

    4. What is the difference between providing intent statements to generalists and using domain expert raters for assessment and evaluation? The probability of obtaining intent statements from users for every query is very low, so the task of obtaining intent statements depends completely on domain experts. In that case, would domain expertise not become necessary for making accurate or nearly accurate relevance judgments?
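
    One way to make "disagreement among generalists versus experts" concrete is a chance-corrected agreement score between two assessors. The sketch below computes Cohen's kappa, a standard agreement statistic; it is my own illustration and not necessarily the measure used in the paper. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors who judged the same items.

    Hypothetical helper for illustration; not the paper's analysis.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each assessor's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b)
    )

    if expected == 1.0:
        # Both assessors used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up binary judgments (1 = relevant, 0 = not relevant).
assessor_1 = [1, 1, 0, 0]
assessor_2 = [1, 0, 0, 0]
print(cohen_kappa(assessor_1, assessor_2))
```

    Computing this separately for pairs of generalists and pairs of experts would give one number per group to compare, under the assumption that both groups judged the same documents.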

  2. Summary: This article is a follow-up to the reading from week 9 titled “Models and Metrics: IR Evaluation as a User Process.” In this study, the authors conduct user tests to see if actual user behavior fits any of the evaluation metrics. They have 34 users conduct six searches each and use eye-tracking devices and click data to see which documents the users examine. When a user clicks a document, they also rate it as “useful” or “not useful”. The authors conclude that RBP is not a bad model, and that SDCG and AP do not approximate user behavior. They state that their metric INSQ is the best model of user behavior because it accounts for the different reasons why users search for documents, for the fact that users examine documents to an arbitrary depth, for users being more likely to continue when they are more invested in searching, and for users altering their behavior based on previous documents.

    1. The authors mention Smucker and Clarke's evaluation metric which considers the amount of time a user spends searching. How does this compare to this article's measure of T documents they expect to examine? Which one do you think is more effective?

    2. The authors briefly mention how when a user examines a document from a search engine they first read the snippet and decide whether or not to click (p. 8). How would you evaluate effectiveness if the user gains all the information they need from the snippet? How would you take the snippet into account when testing a system?

    3. Using eye tracking, the authors discover that a user doesn’t look at the items returned from a search engine in sequential order; rather, their gaze jumps around a bit (p. 6). How does this discovery that users don't simply look down the list affect traditional metrics? Since the eyes only jump 1-2 rankings, does it have a significant effect?
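
    To see why metrics such as RBP and AP imply different user behavior, here is a minimal sketch of the two using their standard textbook definitions, not code from the paper. RBP assumes a user who moves from one rank to the next with a fixed persistence probability p, while AP averages the precision at each relevant rank. The function names, the value of p, and the sample ranking are my own assumptions; INSQ is not sketched because the summary above does not give its formula.

```python
def rank_biased_precision(relevance, p=0.8):
    """RBP = (1 - p) * sum_i r_i * p**(i - 1), with ranks starting at 1.

    `relevance` holds binary judgments in rank order; `p` is the assumed
    probability that the user continues to the next result.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return (1.0 - p) * sum(r * p**i for i, r in enumerate(relevance))


def average_precision(relevance, total_relevant=None):
    """AP: mean of precision@k over the ranks k that hold a relevant document.

    `total_relevant` is the number of relevant documents for the query;
    if omitted, only the relevant documents in the ranking are counted.
    """
    if total_relevant is None:
        total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            precision_sum += hits / k
    return precision_sum / total_relevant


# Made-up ranking: relevant documents at ranks 1 and 3.
ranking = [1, 0, 1]
print(rank_biased_precision(ranking, p=0.5))
print(average_precision(ranking))
```

    Lowering p in RBP models an impatient user who rarely looks past the first few results, which is the kind of behavioral assumption the eye-tracking data can confirm or contradict.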

  3. Relatively Relevant: Assessor Shift in Document Judgments - Sanderson, Scholer, and Turpin

    Summary: This paper analyzes how assessors’ judgments can change over time and may be inconsistent. The paper also looks at patterns in relevance judgments and how documents judged as relevant appear in clusters because assessors’ conception of what might be relevant changes over time. The authors also put forth that assessors look back at a document that they have judged relevant and use it as a frame of reference while judging subsequent documents. Also, if there are a large number of documents to judge, the ‘reference document’ keeps changing, and thus the criteria for marking a document as relevant may change too.

    1. This paper refers to the Cranfield paradigm and says that to promote consistency of judgments, the documents to be judged for each topic are assigned to a single assessor (pg. 2). Do you agree with this method, or should documents of a topic be distributed among assessors to avoid aspects that can lead to inconsistencies like biases, lack of interest, and/or boredom?

    2. While identifying the relevance shift, the researchers assume that the time period between two documents being assessed is proportional to the distance between the documents in the list (pg. 3). Why only consider distance? What about characteristics of individual documents, like length and difficulty level?

    3. The researchers suggest that, most of the time, the assessment of relevance for topics in TREC is conducted by people involved in the topic development process, and that they have an a priori mental model of the number of relevant documents that should appear in the final qrels (pg. 7). Thus they adjust their criteria during the assessment to suit this model. Would it be interesting/relevant to have people who aren’t TREC-related judge the documents, and then study the judgment patterns to see if there is any clustering and/or relevance shift?

  4. User vs. Models: What Observations Tell Us About Effectiveness Metrics
    Summary: Currently, researchers employ a number of static and adaptive user models to evaluate the success of their systems. Each of these models makes assumptions about human behavior, and its mathematical constraints depict a specific type of search behavior. The authors of this paper feel these commonly used models do not accurately reflect human search behavior. The authors made five assumptions about the way people search and then set up an experiment to verify or disprove these assumptions. The authors found that there were some subtle differences between the assumptions. Based on these assumptions, the authors seek to quantify the behaviors and establish their own user model.

    1. For their study, the authors reported the demographics of the people who participated. They found that 50% of the participants were not native English speakers, although they all claimed to be fluent in the language. In my experience, I have met people who are fluent in English or would consider themselves fluent, but whom I would not consider fluent in informal English. At my last internship, I would have to explain different slang terms to my co-worker after each meeting, most of them phrases I would never have given a second thought to. Since the authors are so focused on capturing human behavior, could this demographic be misleading? The participants' opinions of relevance may reflect a lack of familiarity with informal English. In addition, the participants are all computer scientists and engineers.

    2. The authors seek to create a more realistic user model by evaluating human behavior in an experimental study. In their design, the authors restrict many of the ways people can interact with the system. As a result, they have ended up creating a search experience that does not directly reflect a real-life search experience. The authors made these decisions to reduce cost and the complexity of the evaluation. Does this contradict the authors' goal? How realistic a user model can the authors establish if they aren’t considering users in their real environment?

    3. The authors mention that the limitation of the user models is that they do not really reflect what the user is doing. However, it is easy to say that the limitation of any user model is that there will always be a user who behaves differently. If a model doesn’t accurately reflect human behavior, is it still helpful for comparing two systems? Some of the models mentioned are commonly used but do not realistically reflect human searching behavior. Since these models still allow for confidence in distinguishing between two systems, what is the added benefit of the effort it would take to create a more realistic model?