Wednesday, September 11, 2013

19-Sep Tefko Saracevic. 2007. Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part III: Behavior and Effects of Relevance


  1. 1. Some of the issues facing the IR community with respect to relevance are clearly interrelated: for example, the need to study relevance behaviors in different contexts, both in and outside of retrieval activities, and the need to study cognitive, affective, and situational factors as they dynamically affect relevance and are affected in turn (2141). How many of these issues could be solved by expanding IR experimentation outside of the university and into other environments, where students would no longer be the test subjects? To be clear, I do not ask this for the reasons cited by the authors in the "Beyond Students" section. The paper mentions studying other populations in order to understand what others find relevant and compare it to "student relevance" in order to see whether student relevance can be considered "the norm"... but it sounds a little bit like the authors are selling the diversity of relevance contexts short, after just mentioning their importance in the previous section of their paper. Beyond what students perceive as relevant vs. what others may perceive as relevant, is not the point that relevance changes under different contexts (for instance, at different types of jobs, or at home, or while performing research), and that, whether or not relevance is perceived similarly by different populations, behaviors may change under different conditions anyway?

    2. Regarding relevance judgment consistency: Have researchers tried taking the mean of the relevance judgments across multiple judges? The paper mentions several authors who take unions and/or intersections of different judged-relevant document sets, but what about ranking based on the mean of the judgments? Also, would the results differ if you took the mean of several binary judgments of the same document, versus the mean of several incremental (n>2) judgments on the same document?

    3. How do individuals' relevance judgments change as individuals judge more documents? Is it possible that they change for the worse, in some respects? Essentially, this is a question about the interaction of test and treatment and/or treatment and treatment. Suppose that someone gives you a query, "symptoms of pancreatic cancer". You develop a prior idea of what relevant results will look like. However, upon viewing results, you update this idea, gradually turning it into a "posterior" assessment of what is and isn't relevant. What you learn as you examine documents may either limit or expand your perspective on what is relevant, relative to your initial perspective when you began viewing and judging documents. Work has been done on performative evidence of such biases - for instance, the experiments in which documents are presented to different assessors in different orders - but what about the behavior itself? Is there any qualitative understanding of how it works and/or how to manipulate it? Does it change in a testing environment versus when users perform actual search queries at work or at home?
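
As a rough illustration of question 2 above (ranking by the mean of several judges' scores, and whether binary and graded judgments give the same order), here is a minimal Python sketch. The function name, the 0-3 graded scale, the binarization threshold, and all of the scores are made-up assumptions for illustration; none of them come from Saracevic's paper or from the studies it reviews.

```python
from statistics import mean


def rank_by_mean_judgment(judgments):
    """Rank documents by the mean of their judges' scores.

    judgments: dict mapping a document id to a list of numeric scores,
    one per judge. Returns (doc_id, mean_score) pairs, highest mean first;
    ties are broken by document id so the output is deterministic.
    """
    scored = [(doc, mean(scores)) for doc, scores in judgments.items() if scores]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical graded judgments on an assumed 0-3 scale, three judges per document.
graded = {
    "d1": [3, 3, 0],
    "d2": [1, 1, 1],
    "d3": [2, 0, 0],
}

# The same judgments collapsed to binary (assumed rule: any score >= 1 is "relevant").
binary = {doc: [1 if s >= 1 else 0 for s in scores] for doc, scores in graded.items()}

graded_order = [doc for doc, _ in rank_by_mean_judgment(graded)]
binary_order = [doc for doc, _ in rank_by_mean_judgment(binary)]

print("graded:", graded_order)
print("binary:", binary_order)
```

With these invented scores the two orders differ: the graded mean rewards d1 for its two strong judgments, while the binary mean rewards d2 because every judge found it at least marginally relevant. Whether this happens with real assessors is exactly the empirical question.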

  2. Beyond topics, the author mentions that there are other factors contributing to informative relevance, such as topicality, novelty, and reliability. In this case, if valid, is it unsafe to judge the relevance in the traditional way championed by TREC?

    When it comes to relevance dynamics, I have a question about Google. Google has devised many different specialized search engines, such as Blog Search, Patent Search, Scholar Search, and so on. So, does this mean that the criteria for judging relevance vary with the different realms in which users search?

    In the conclusion, the author mentions the importance of studying relevance. Here, my question is whether research on relevance will become increasingly vital in the future, if we wish to achieve a breakthrough in this realm?

  3. 1. Disclaimer: I have not read the Choi and Rasmussen (2002) study cited in this paper (page 2128, 2130).
    Do you agree that visual information "provides clues that make for faster inference than textual information" (pg.? What would convince you or validate that something is relevant - an abstract that summarizes the document, or a flowchart or image? So can visual information summarize / represent content better than text? Also doesn’t Saracevic say that of all the document representation formats, abstracts and titles produce the most clues (pg. 2130)? Then does he not think visual information is a type of document representation?

    2. Do you think changing location has any impact on the relevance judgments made by a user for the same query? Mobile devices are ubiquitous, and users are searching for things on the go. What role does current location, and context have, if any, in making relevance judgments? So when studying which documents are relevant for a particular query should researchers and conferences take that into consideration? Also, should they ensure that all users making these judgments are using the same type of device?

    3. We read in the Dong, Loh, and Mondry study (cited on pg. 2131) that a physician’s assessment was considered a “gold standard”. Should evaluators be classified like this, into different groups based on their backgrounds, and subject-knowledge of the documents being assessed? Should judgments made by the “gold standard” or subject-experts be given more weight / importance while marking relevance, as compared to evaluators with insufficient knowledge?

  4. 1. The author mentions that among populations with high expertise, the relevance overlap can be as high as 80%. However, in the earlier paper by Voorhees, the overlap was much smaller (although Voorhees concluded that this barely had any effect on ranking the systems). We assume that the different judgments mentioned in that paper were also made by different populations but with similar expertise (although I might be wrong on this). Was that just an outlier for that set of judgments? If so, how can we accept the conclusion that non-overlapping judgments do not have a significant impact on the ranking of different IR systems?
    2. The generalization that as time progresses and we go through different phases of a single task (becoming more informed about the material as we progress, like with a term project) we continue to use the same relevance judgment criteria but assign different weights, seems to be applying an objective premise to something inherently subjective. If not, how can we test such a hypothesis and/or discover such weights given enough funding and a population? Note that this is not just about user behavior: the Smucker and Clarke paper successfully objectifies a user property using a convincing model. I don't see how the above can be modeled though since the criteria we are trying to weight cannot be measured directly.
    3. The proposal by Saracevic that despite methodological flaws, the field of IR has seen considerable improvement seems to be (somewhat) at odds with the paper we discussed in class last week about how different systems compare against weak baselines and that the incremental improvements we see in systems don't really 'add up'. It might be true that IR systems have improved greatly if we consider the last three decades, but it's more debatable if we only consider the previous decade. On the other hand, it does seem like search engines like Google have done an impressive job of adapting to user preferences, through more finely tuned algorithms (possibly) and better interfaces (definitely). This comes back again to the question we discussed in class about whether research in IR needs to be more closely tied to corporate needs. Saracevic has called for more research funding into relevance, but is it the right idea to be spending money on a topic that seems to have been investigated on a mass scale by private companies but without their data going public?
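
To make the overlap percentages in point 1 above concrete: one common way to compute overlap between two assessors is the size of the intersection of their relevant sets divided by the size of the union. The Python sketch below assumes that definition; the function name and the document ids are hypothetical, and individual studies may have defined overlap differently.

```python
def overlap(relevant_a, relevant_b):
    """Overlap between two assessors' relevant sets: |A & B| / |A | B|.

    Returns 0.0 when neither assessor marked anything relevant, which is a
    convention chosen here to avoid dividing by zero.
    """
    a, b = set(relevant_a), set(relevant_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical relevant sets from two judges for one topic.
judge_a = {"d1", "d2", "d3", "d4"}
judge_b = {"d3", "d4", "d5"}

# Two shared documents out of five distinct ones.
print(overlap(judge_a, judge_b))
```

Under this definition two judges can agree on most of the documents either of them cares about and still show a modest overlap, because every document that only one of them marked counts against the score.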

  5. 1. It is mentioned in the paper that many different studies have been performed on the study of the relevance criteria. Are there any studies that summarize all these studies and propose a model which contains more generalized criteria shared by different studies? If not, what are the possible difficulties that prevent such kind of work? Are there differences in weight or importance for all those relevance criteria found from previous studies?

    2. Most of the studies mentioned in this paper used a limited number of students as study subjects, which, as is mentioned several times later in the paper, is mostly due to limited research funding. Is there any study that focuses on how differences in the number of participants affect the final results of relevance studies? This might give a justification for the choice of a small number of students. With the advancement of technology, we can make use of novel techniques to facilitate the search. For example, we may use Amazon Mechanical Turk to do a relevance study on a broader range of people at relatively low cost. Is there any study performed with these techniques?

    3. Lastly, the author arranges all these studies in an author-centric way, in contrast to the suggestion from the previous review paper that we had better build a review in a concept-centric way. Is it because most of the studies mentioned here focus more on methodologies than on conclusions, or does the author-centric way have its own advantages, so that it is still used routinely?

  6. 1. Saracevic almost directly addresses the concerns presented in “Improvements That Don't Add Up.” Saracevic finds synthesis to be the answer to problematic data. As he summarizes numerous studies of IR that he has examined throughout his long career in this field, he explains that “methods used in these studies were not standardized and they varied widely...[yet] it is really refreshing to see conclusions made based on data, rather than on the basis of examples, anecdotes, [etc.]...[T]he most important aspect of the results is that the studies independently observed a remarkably similar or equivalent set of relevance criteria and clues”(p. 5). This sentiment seems to fall in line with the discussion of seminars which we had in class-- what is important is that there are ideas being shared, studies being done, and discussions taking place. Is this the most ideal method for advancing and evaluating IR?

    1.5. On a related note, the author describes that he “will concentrate exclusively on observational, empirical, or experimental studies...that contained some kind of data directly addressing relevance. Works that discuss or review the same topics, but do not contain data are not included.”(p. 6)
    If this data is so flawed, as he seems to acknowledge (as in the first of my questions), and as the authors of “Improvements That Don't Add Up” insist upon, why is it so vital that data be present? Couldn't articles more focused on “discussing” and “reviewing” be just as useful, or possibly more so in terms of evaluating IR systems and examining work in this field?

    2. It seems like almost all of the studies summarized in this article are conducted using students, faculty, and librarians as the sole participants (with the exception of a couple of clinical studies). How does conducting experiments solely inside of an academic bubble create a bias in the findings? Isn't this a classist approach to identifying IR needs and potential improvements? Why is this practice so universal across studies? Is it motivated in part by laziness? Or is it because “[l]esser subject expertise seems to lead to more lenient and relatively higher relevance ratings-- lesser expertise results in more leniency in judgment”(p. 11)? When the author addresses this issue directly, he acknowledges that “it is an open question whether conclusions and generalizations can be extended to other populations in real life” and that these studies actually only reveal “student relevance”(p. 13). What should be done to measure “real life” populations? Is this a responsibility if we are invested in improving the field?

    3. It seems as though IR is very different from other sciences when it comes to controlled variables in an experiment. In chemistry, for example, there should be very strict controls in any given experiment to limit confounding variables, and isolate true causes of reactions or phenomena. But, with IR, the more highly controlled an experiment is, the less it reflects truth, or “real life” as the author of this article is fond of saying. He explains this in the section dealing with the TREC postulates: “The postulates are stringent laboratory assumptions, easily challenged”(p. 7). He then presents the other side of the coin- if all studies are done in “real life” scenarios: “How do you evaluate something solely on the basis of human judgments that are not stable and consistent?”(p. 10). Techniques that combine behaviorism with the scientific method of the laboratory are “harder to do, but much can be learned”(p. 16). Is the consensus of most researchers in this field to find a synthesis between empirical, controlled data and a more qualitative, narrative of human perception?

  7. 1. The author addresses an interesting point that has hindered relevance related research: funding. The author notes that most of the studies are either funded by the researcher himself or there is limited funding provided. As a result, most studies end up using students as their participants. As a consequence, we have a detailed view of relevance related to a student user. The question then arises whether this can translate into a view of an actual end user for a system. The author feels that the use of students alone limits what can be inferred since students are not representative of the population of end users. Is it possible to leverage students to provide a diverse and reflective participant group? At my undergraduate university, there are students of all ages and backgrounds. I feel like you would be able to get the type of users you would want from your study if you pooled the right people. Since not all students are coming right out of high school or have the same computer skills, I don’t see why students would have their own distinctive classification.

    2. Corroborating statements made in class, the author notes that studies have proven that the order in which documents are presented affects the degree of relevance a judge will assign to them. For instance, the higher ranked a document is, the higher the chance that the document will be viewed as relevant. Therefore, relevance is not a judgment that is made based solely on the document being viewed, but is dependent on what the assessor has already viewed. The only time a study found relevance to be independent was when the total number of documents was small. Other papers we have read have mentioned that relevance judgments were to be made under the assumption that all documents were independent. The authors of those papers go on to explain that this is a simplifying assumption and that they are aware that relevance judgments in real life situations are not so cut and dried. Why do the studies these authors perform choose to keep relevance judgments independent of each other? This paper and other author comments highlight the research community’s knowledge that this assumption typically does not apply to real life. Is this independent viewpoint still assumed because only one judge is used per topic and eventually documents would no longer contain new information and could be considered not relevant?

    3. Some of the first studies the author looked into focused on classifying the different criteria users employed to determine a document’s relevance. From study-to-study an array of classifications arose that were remarkably similar. The author concludes from the studies’ results that different users still invoked similar criteria. As a result, I can understand why work would arise in further classifying the decisions users make to mark a document’s relevance. However, past papers have mentioned that relevance differs significantly from person-to-person and even over the same person across time. Later studies mentioned in this paper also point out this discovery. Commonly, studies have found around 30% overlap in relevance judgments. Although studies have shown ways to improve this number, it is never going to lead to 100% agreement across all judges. In the end, is there a point to modeling users’ relevance judgments when the end product is never going to be systematically the same?

  8. 1. On p. 14, Saracevic states that “relevance is poor.” However, much research is being done on coming up with better search results by companies like Google and Microsoft. Would you consider work that Google, Microsoft, and other companies do to be research on relevance, since it comes from a profit motive? Or is it more consumer behavior studies? How does something qualify as reliable “research”?

    2. Saracevic discusses in-depth studies on overlap between assessors. On p. 12, he summarizes, “higher expertise results in larger overlap. Lower expertise results in smaller overlap.” Does more overlap just mean that experts all come from the same philosophical background or that the documents really are more relevant? Do issues with less overlap in relevance when there are more assessors pose a problem for Google?

    3. Is relevance something that is learned or more or less "common sense" (as the Google rater guidelines tell raters to use)? In what ways do we teach relevance (in schools, at home, etc.)? Do computers (search engines, etc.) also teach us what relevance is by the way that they display the results?

  9. 1. My first question is about Relevance Clues (actually it is not limited to this one section but applies across the paper). From this section we can see that people have worked on numerous experiments about information relevance over the last twenty years. However, each of the experiments seems to be independent of the others. This reminds me of the example Prof. Lease talked about in class: when we went over the research improvements and put them into one unified research stack, we found that there was no step-by-step improvement in one specific field that would lead to significant improvement. Here we have a similar problem: each experiment only serves its own corresponding paper, with no follow-up afterwards, which makes it hard for us to define the features and clues for information relevance in a unified way. How can we make use of all this information and reach a convincing conclusion based on these numerous experiments?

    2. My second question is about Relevance Dynamics. Relevance Dynamics sounds to me more like individual behavior than a statistical, general problem. For example, one person’s judgment of relevance changes as time goes by and more knowledge is gained. In the experimental environment, we can test this kind of change and observe factors for Relevance Dynamics. However, we can also think of another scenario that would reflect group relevance dynamics. For example, several years ago, if we talked about 4S and S4, they barely made any sense other than as combinations of characters. Now when we talk about 4S and S4, we naturally relate them to the iPhone 4S and Galaxy S4. This is interesting. How can we reflect this kind of Relevance Dynamics in modern search engines besides using keyword matching?

    3. My third question is about how well the samples reflect the population. For almost all these experiments, the number of participants is less than 100, and in some it is even less than 10. I can hardly accept that this would reflect the real situation of the population. Another problem, also raised in this paper, is that most of the participants were students, which was due to limited funding and the difficulty of testing on the general public. How do we scale the experiments up and make them more convincing? Nowadays we have cloud-based platforms such as Mechanical Turk to encourage more people to participate in experiments. But isn't it harder to control variables there than in experimental environments? What else can we do to make the experiments better?

  10. A wide range of clues or criteria were investigated, but I can hardly find any comparable metrics. It looks like the researchers have been working on their own without open standards or shared sets of hypotheses. This creates non-trivial issues in generalizing and synthesizing their research findings. I cannot rule out a real chance of errors and biases in the conclusions Saracevic draws regarding Relevance Clues.

    What exactly is relevance feedback? On page 2131, Saracevic states “relevance feedback is available in real life systems and conditions, users tend to use relevance feedback very sparingly – relevance feedback is not used that much”. I can't help wondering what real-life examples of relevance feedback are?

    On page 2132, Saracevic summarizes “When it comes to judgements, the central assumption in any and all IR evaluation using Cranfield and derivative approaches, such as TREC, has five relevance assumption”. Saracevic then basically points out that none of these five relevance assumptions holds. It is scary. I can't help asking this bold question: if the foundation is weak, what happens to the progress, theories, practices, and hypotheses built on that foundation? But in the next paragraph (on the same page, 2132), Saracevic states “However, using this weak view of relevance over decades, IR tests were highly successful in a sense that they produced numerous advanced IR procedures and systems”. Doesn't this statement contradict the previous one? Or did I simply miss something important?

  11. Saracevic says that relevance judgments are assumed to not change over time (pg 7) and are binary. This seems the exact opposite of how Google rates relevancy. Google seems to work on a 5-point Likert scale with all kinds of additional flags, notations, etc. In addition, Google seems to have layers of raters, and approaches its raters more along the lines of: 'try your best, don't spend too much time'. Raters are people, and people change. Even in this document, the author points to how subject matter experts tend to have greater consensus but select fewer documents as relevant. Assuming that raters don't change over time seems like a pretty large fallacy. Is one method/approach better than another?

    What about items that are saved/printed but are not relevant to the user but may be relevant to someone else? Does that still count as relevant? Say I see a blog post, or a job posting that I think would be perfect for someone else, is that still considered relevant to me? Or relevant to both of us?

    How does date affect users' relevance weights? I was slightly surprised to see that it's not a criterion for importance. Date seems like a double-edged sword. Since I do a lot of technology research, it's important to find articles that are working with the latest technologies. There were no touch screen papers 10-15 years ago, so date is very important sometimes when doing research. However, other topics such as human behavior don't change that much over time, and could easily go back to the 1930s. How does date affect what relevance weight is given to a particular topic?

  12. 1. On page 2128, the paper mentions image clues. Images are just one of the complex types of data/information. Many other kinds of objects are now being investigated. What is the status of the research on those data?

    2. Page 2134, the sample mentioned in the section of “Beyond consistency” is more likely to discuss the issues in “dynamic” and “stability”. What is the relationship between stability and consistency here?

    3. The author talked about “student relevance” at page 2138. It might be true then. However, with the development of WWW and social media, the participants can be easily expanded to other groups. Does such introduction of new groups change these remarks here? What are the new biases imported then?

  13. This comment has been removed by the author.

  14. 1. It is said that "Rees and Schultz (1967) pioneered this line of inquiry by studying changes in relevance assessments over three stages of a given research project in diabetes."(p.2128) What are those three stages? How do the cognitive state and task change in each stage? What are their influences?
    2. When discussing the relevance feedback, only the search with “term” is mentioned. (p.2129) However, recent work has been expanded to various data types. What is the state of search with other data types?
    3. What is the meaning of weak relevance on page 2132? Information retrieval has since brought in many other factors, such as cognition. Does this mean that the remarks made here are invalid now?

  15. 1. The author states that Manual Relevance Feedback is used to improve the query quality but doesn't elaborate on how the quality of the query is to be improved. I wonder if Manual RF continues to be based on a query independent collection model? Or, if it makes use of a query specific model? In case it is the latter, how are metrics like efficiency traded off against personalization? Also, how is the ambiguity between the user's actual information need and the query gap accounted for? Further, how does the re-ranking scheme that Manual RF uses hope to dynamically choose a superior user-specific set (while also determining its size) and what would be the parameters which require consideration given that all of this analysis is based on the initial retrieval performance?

    2. In the section on 'Effects of Relevance' postulated through Relevance Judges - the paper speaks of how it is vital to 'measure variation in a relevance assessment due to domain knowledge and develop a measure of relevance similarity'. I'm curious about what serves as the reference point in this measure of relevance similarity. For instance - would we be using a vector, a spatial or a temporal model to make this judgement? Since the relevance similarity measure needs to be judged not only on the basis of the topic covered but also on the perspective and the extent of coverage - how do we propose to compute this measure, say, between a newspaper article and a journal article which do in fact cover the same topic but would obviously still be rather disparate? Thus, wouldn't the relevance similarity measure in effect need to be an amalgamation of multiple similarity measures?

    3. The general tone projected throughout this paper is that different characteristics require consideration at different times and though investigations have been conducted in both realms - we still lack complete clarity on how much of a priority these aspects are. Could we go ahead and categorically state then that IR as a field needs to be built on variable premises? It has been stated repeatedly that Relevance by generalized conclusion is in fact measurable - however, I am still unsure how a quantity that is dependent on factors which do not have a unit of calibration could be measurable. Aren't we trapping ourselves in a vicious circle when we continue to treat every conclusion drawn as a hypothesis and, instead of getting rid of the assumptions and proving them as tautologies, incorporate them to 'enhance IR performance'?

  16. 1. Saracevic lists numerous studies that arrived at different numbers of relevance criteria. It seems really difficult to compare any two of these studies beyond the fact that relevance means different things to different users. How many of those studies had overlapping criteria and couldn't the IR field benefit from trying to have a foundation that all research could pull from?

    2. Only one study was mentioned when it came to the issue of image relevance. Is this because describing what makes an image relevant is more difficult than text-based documents?

    3. Saracevic points out the issue of using students as users in most of the studies listed. One of the explanations given for this use of students is the lack of funding. While not everyone can afford to use a vast number of raters or test changes on live population portions like Google does, couldn't researchers use volunteers outside of the student community to provide another type of user in their trials?

  17. Where do we use RF? I can see how it can help auto-calibrate relevance judgments. The paper makes an interesting statement that under an experimental setup, incorporating relevance is appreciated, but sparingly used in real life systems. An argument I can see here is the sense of accomplishment or productivity a subject experiences, when decisions made are actively seen on the system being tested, but, under a general search setting the user is trying to accomplish a task but is less concerned about the system itself.

    The paper goes into great depth to show that the assumptions made in the relevance definition adopted by the Cranfield paradigm are easily invalidated. While the definition is clearly objectionable, how much value is there in eliciting better relevance judgments? The point here is that systems in their current state may not be capable of processing the additional information provided by such careful labeling. Hence, a sign of progress may be to adopt a specific notion of relevance and perfect it before relaxing constraints. In some sense, isn’t this also reflected in the funding situation for research on understanding relevance?

    Scope for crowdsourcing to elicit relevance judgments – The paper claims that a policy often adopted is to limit each query to one judge; similar points are made across the paper. Under the crowdsourcing paradigm, it is often the case that redundant judgments are needed to aggregate dependable results.
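
One simple way to picture the redundancy point above is majority voting over several workers' labels for the same query-document pair. The Python sketch below is only an illustration under that assumption: the function names, the tie-handling rule (return None so the pair can be sent for more judgments), and the labels are all hypothetical, and real crowdsourcing pipelines often use more elaborate aggregation.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None if the list is empty or tied."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    top_count = ranked[0][1]
    winners = [label for label, count in ranked if count == top_count]
    return winners[0] if len(winners) == 1 else None


def aggregate(judgments):
    """Aggregate redundant labels per (query, document) pair by majority vote."""
    return {pair: majority_label(labels) for pair, labels in judgments.items()}


# Hypothetical binary labels (1 = relevant, 0 = not relevant) from several workers.
crowd = {
    ("q1", "d1"): [1, 1, 0],
    ("q1", "d2"): [0, 0, 1, 0, 0],
    ("q1", "d3"): [1, 0],  # tie: no majority with only two workers
}

print(aggregate(crowd))
```

With a single judge per query there is nothing to aggregate, which is why the one-judge policy and the crowdsourcing approach pull in different directions.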

  18. On page 2130, Saracevic is quick to point out that the studies he examined “were not standardized” and “varied widely” but applauds the conclusions reached that have data actually backing them up. Although the studies have that assorted aspect to them, does it still make them useful as that, in a way, mimics the reality of relevance in the hands of users?

    Saracevic mentions that users “infer relevance of information or information objects on a continuum and comparatively”(2137). As a result of this, would measurements of relevance based upon exposure to a subject be beneficial for creating a more complete understanding of judgments?

    Obviously funding has been a problem with regards to research and as a result HCI as well as relevance studies/research has lagged behind or fallen into the realm of computer scientists and others. Can “further exploration and expansion of bright ideas” come from funded research that won’t break the bank? Or is the cost of relevance judgments simply too large such that smaller locally funded research is the only feasible way to go?

  19. In the Beyond Binary section, an experiment performed by Eisenberg and Hu (1987) using 78 graduate and undergraduate students to judge 15 documents is mentioned. The volunteers had to mark the relevance level of the documents on a continuous 100mm line based on their perceived value of document relevance. Marking document relevance relative to each other may not pose a problem, but identifying the boundary between relevant and non-relevant documents could be problematic (unless the ranking is binary-relevant). Moreover, documents that are specific, succinct and relevant are rated higher by users than the documents which are verbose and relevant. Is the above taken into account in the experiment inferences? Is it acceptable to assume the binary relevance model when the relevance scale is continuous?

    It is surprising that none of the central assumptions about relevance (that it is topical, binary, independent, stable, and consistent) holds completely. This shows the huge role played by individual differences in relevance judgements. It is further supported by various studies concluding that general group agreement over a document is around 30%. Thus, it appears that search engines should focus more on the context of a query (and on user personalization) than on conventional methods to mitigate these problems.
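
    To make an agreement figure like "around 30%" concrete, here is a minimal sketch that assumes agreement is measured as overlap, i.e. the size of the intersection of two judges' relevant sets divided by the size of their union, averaged over all pairs of judges. The judges and document IDs are invented for illustration, and the studies cited in the paper may compute agreement differently.

```python
from itertools import combinations


def overlap(a, b):
    """Intersection over union of two sets of documents judged relevant."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as full agreement
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(judged_sets):
    """Average overlap over every pair of judges (needs at least two judges)."""
    pairs = list(combinations(judged_sets, 2))
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical relevant sets chosen by three judges for the same query.
judges = [
    {"d1", "d2", "d3"},
    {"d2", "d3", "d4"},
    {"d3", "d5"},
]

# Pairwise overlaps are 2/4, 1/4 and 1/4, so the mean is 1/3.
print(round(mean_pairwise_overlap(judges), 3))
```

    Note that every judge in this toy example agrees on d3, yet the mean overlap is still only about a third, because the measure penalizes every document that any one judge adds.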

    The author expresses doubt about whether students are a representative sample of the general user population, and I share the concern. Most Generation-X users’ behaviour cannot be captured by a student sample. The activity of much of the non-native-English user population cannot be simulated by students (as students may have higher proficiency in English, particularly in developing countries, than the general public). On the other hand, most users are of Generation-Y, and much of the internet demographic is made up of people aged 10-40. Perhaps it is better to perform some experiments (obviously this involves some cost!) with the general user population and then compare the results with their student-experiment counterparts to resolve this issue.

  20. 1- Some issues with the author's methods: in listing studies, the author does not comment on their relevance or validity. Some studies appear to be pretty questionable, especially the ones that use only one or two participants. By listing them all together the author implies equal relevance and validity, but how can we judge that for ourselves without separately investigating each one? Should the author have to defend or support the works he uses? The author identifies what he believes is ‘a fairly complete representation’ of the past 30 years of work on this topic. However, he groups the studies according to his own preference and for his own reasons. The original purpose of the studies is obscured, and whatever meaningful conclusions the authors of the original experiments believed they made are entirely steamrolled by this author's attempt to draw his own broad conclusions. The author admitted and repeatedly stated that his ‘conclusions’ were mere hypotheses and not solid groundwork. Wouldn’t it have been more helpful to present the actual conclusions of the studies in their original context and purpose, so that the reader would have an easier time formulating their own conclusions and assessment of credibility?

    2- One conclusion suggested by the study of the consistency of relevance judgements is that even though human relevance judgements vary wildly, system performance appears stable across groups of judgements. Does this mean that a system can be programmed to identify relevant documents more consistently than a human can? Or is the algorithm only as good at identification as its programmers are? Would we prefer a system that consistently gives us relevant documents only according to specific rules, or do we prefer it when an inconsistent human is there to help establish relative relevance? Will computers one day become so good at retrieval that they can establish even relative relevance better than humans can?

    3- One of the listed hurdles to the advancement of the field is proprietary research. The author describes how large search engine companies do not have to openly share their results, methods, and advancements. Large search engine companies are not in the business of academic advancement. Is it reasonable to expect them to abide by the same standards? No one suggests that Coke reveal its secret recipe to advance home cooking and the field of food science. How is asking Google to reveal its methods different?

  21. 1. The author complains that the relevance models being discussed or hypothesized are not actually being implemented in applications. But he has defined the scope of the paper such that only relevance studies are discussed. He states: “I concentrate exclusively on observational, empirical, or experimental studies, that is, on works that contain data directly addressing relevance. Works that discuss or review the same topics, but do not contain data are not included, with a few exceptions to provide a context.” How software engineers would apply these relevance studies, and how the results would be implemented, is outside the scope of this paper. He has also mentioned that search engines have been very protective of the work they are doing in this field. Therefore his statement seems to lack evidence. Additionally, the type of interaction between these two fields that the author wishes for would be very challenging.

    2. The author indicates on page 15 that research in the field of relevance retrieval is being done all over the world. Are the questions raised by the author in the section “Globalization of Information Retrieval—Globalization of Relevance” therefore not already being answered? All that is needed is perhaps some effort to collect the results of all these independent studies.

    3. Isn’t the author undervaluing the importance of students as the user test group in many of these independent studies? Students themselves form a very diverse group. This is one user group that would be informed enough to make the results informative, and yet not biased.

  22. 1) I found it interesting that in the beginning of the review, the author mentions that, “although related to relevance by assumption, studies on implicit or secondary relevance are also not included here.” While I understand that the review has to maintain a certain focus, is not the majority of relevance implicit? Since relevance is inherently subjective, it seems that the interpretation of implicit relevance would be a key part of relevance studies.

    2) One primary focus of relevance research involves investigating what criteria users use to determine what is or is not relevant. Saracevic goes on to summarize a plethora of articles on the topic. However, he’s only able to find a single article regarding image clues (pg. 2128). Is there still this little research on this topic in 2013? I recall reading and hearing that people respond more actively to image-based cues. Why would there be this little research into image-based relevance?

    3) I found it interesting that although the review summarizes research on binary relevance and independent relevance, there is no discussion of relative relevance. From the papers we read previously and the discussion we had in class, I thought that there would be lots of research regarding how users’ assessment of the relevance of a document is altered by surrounding results. Why is this not a topic in the review? It could potentially be that “relative relevance” is too broad of a research area or overlaps with other topics too much, and thus is better split amongst other topics. However, that could be said about many of the topics in this review (ex. Relevance Feedback could be seen as a special case of Relevance Dynamics).

  23. 1. In this article the author discusses the idea of Relevance Feedback (RF). He states that there are two different types of RF: manual RF and automatic RF. Manual RF comes from user responses, while automatic RF is carried out by an algorithm. What are the pros and cons of manual and automatic RF, and in what types of situations would you use one type of RF instead of the other?

    2. In this article the author states that as the level of expertise that relevance judges have in a subject increases, so does their level of agreement about the relevance of a document. However, it is possible that they agree on the relevance of a document merely because of a bias they share from all working in the same field. How would one go about compensating for this type of bias in an evaluation, and is it important enough to eliminate this bias by using a varied pool of relevance judges in exchange for a lower level of agreement on the relevance of documents?

    3. The author states, in the reflections section, that many of these studies were similar to, if not based on, the behaviorism model of psychology. However, he goes on to state that the behaviorism model went out of fashion in the scientific community because it made several assumptions about the simplicity of human nature. Should the field of IR evaluation research attempt to distance itself from this model because of its outdated ideas? Or does the fact that much of the research done under this model has been successful mean that IR evaluation research should continue to use it?

  24. 1. The author explains the fact that there is annotator disagreement as to document relevance by suggesting that there may be different pools of relevant documents. While this is probably true for many subjects, does it not ignore the possibility that humans may sometimes be unreliable judges?

    2. The consensus from the literature studying the effect of assessor disagreement on IR performance seems to be that it has little to no effect when multiple queries are averaged. However, the author does mention that particular queries can show significant performance changes when only highly relevant documents are considered. This makes me wonder: what are the common features of documents that produce mixed relevance rankings? Perhaps if this were known, some greater insight into the problem could be gained.

    3. The author mentions that a number of IR studies have attempted to use a notion of graded relevance. It is also mentioned that differences in graded relevance criteria depend on the output of the whole search (i.e. medium relevance cannot be understood without first knowing what a highly relevant and a low-relevance document are for a query). Might it be possible to develop graded relevance criteria that are not subject to somewhat arbitrary relations to document pools? Can the concept of medium relevance even be clearly defined?

  25. 1. This paper talks about how relevance is tangled up with the information
    systems that implement IR, and how it is also very deeply affected by human
    concerns such as our behaviours around relevance. It goes on to talk about how
    relevance judgements must be stratified to address these entanglements, but all
    of the test collections we have studied seem to have pretty strict
    definitions of relevance (and they must, since these definitions form the basis
    for any numerical analysis). How do test collections reflect the different
    layers of relevance? That is, the paper talks about different types of
    relevance that address different human and technological concerns, so how do
    the test collections reflect these different definitions of relevance?

    2. The paper discusses the problem of finding funding for relevance research.
    It cites the fact that money for IR research tends to flow towards the
    "computers and information" side of the field as opposed to the "humans and
    information" side. As a result, the agenda of the research is determined more
    by the computer science community than the social sciences communities. The
    author claims that scholarship on relevance has not progressed in a meaningful,
    comprehensive way. This seems like a pretty dire claim, and I wonder whether
    it's just a pessimistic view of the situation. Is it possible that because of
    the integral role that computers play in our lives, relevance research just
    doesn't make sense outside of the context of computers?

    3. In cases where the relevance judges were studied, the author seems to place
    a high degree of importance on agreement among relevance judges (and I assume
    this is a reflection of the field as a whole). Is there any understanding
    around the degree to which agreement actually implies correctness? In other
    words, if a group of relevance judges completely agree on the relevance
    judgements for some test collection, then how do we know that these judgements
    are actually correct, and not just a reflection of the judges' preferences?
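
    One partial handle on this question is to separate agreement from chance agreement. The sketch below computes Cohen's kappa for two judges, a standard chance-corrected agreement statistic that I am bringing in for illustration; it is not a method taken from the paper, and the judgment lists are made up. Even a kappa of 1.0 only says the judges agree with each other, not that they are correct.

```python
def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two judges over the same documents."""
    n = len(judge_a)
    if n == 0 or n != len(judge_b):
        raise ValueError("both judges must label the same non-empty document list")

    # Fraction of documents on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Agreement expected if each judge labelled at random with their own label rates.
    labels = set(judge_a) | set(judge_b)
    expected = sum(
        (judge_a.count(label) / n) * (judge_b.count(label) / n) for label in labels
    )

    if expected == 1.0:
        return 1.0  # assumption: both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgments (1 = relevant, 0 = not relevant) on four documents.
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]

# Raw agreement is 0.75, chance agreement is 0.5, so kappa is 0.5.
print(cohens_kappa(judge_a, judge_b))
```

    Two judges with the same background could score a high kappa simply by sharing the same preferences, which is exactly the concern raised in the question above.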