Saturday, August 24, 2013

Kelly FnTIR 2009 Ch. 2-4.

Diane Kelly. Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval 3(1–2) (2009). READ ONLY CHAPTERS 2-4, pp. 9-30.


  1. 1. For the Cognitive Viewpoint in IR part, I especially like one dimension of the cognitive viewpoint: “Information is situational and contextual”. This dimension tells us that information happens in specific situations and is affected by that particular time and space (this is also addressed in dimensions 3 and 4). So I wonder, besides making simplifying assumptions and abstractions about some parts of the process, what else can we do to incorporate environmental effects in IIR research? For example, when we analyze user logs, how do we take the situation of users (location, gender, age, etc.) into consideration instead of treating them as equivalent to each other?

    2. In the Laboratory and Naturalistic Studies section, this article gives the example of using user logs, as well as the study conducted by Anick with live trials of an interface for query expansion. It reminds me of the advertisement from Microsoft, Bing It On, which lets users test which search engine they prefer based on the results. Bing can also make use of the user data. For example, they can collect the queries and results for which users choose Google over Bing, and then analyze and revise their search methods as well as the presentation of the search results. So I wonder what else could serve as a naturalistic study, and what would motivate people to participate in such experiments?
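A common mechanism behind live trials like Anick's (or Bing It On) is deterministic traffic splitting between an original and an experimental interface. A minimal sketch, assuming hypothetical user ids and bucket names that are not from the reading:

```python
import hashlib

def assign_variant(user_id: str, variants=("control", "experimental")) -> str:
    """Deterministically bucket a user for a live (naturalistic) trial.

    Hashing the user id keeps the assignment stable across sessions
    without storing any extra state, so a user always sees the same
    interface for the duration of the trial.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same bucket:
assert assign_variant("user-42") == assign_variant("user-42")
```

Because users never opt in explicitly, their ordinary searching is the "participation", which is part of what makes such studies naturalistic.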

    3. The Wizard of Oz Studies and Simulations sound interesting. The article notes that besides systems being simulated in Wizard of Oz studies, another entity that has been simulated in IR studies is the user. However, since these users are simulated rather than real, how representative can the simulation be of real users’ behavior? For example, suppose the simulation of users is rule-based, and the rules are set by the researchers. How can we avoid over-fitting between the rules for the simulated users and the systems?
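A toy illustration of the concern, assuming a purely rule-based simulated user; the rules, parameter names, and values here are invented for illustration, not taken from Kelly:

```python
import random

def simulated_user(results, relevant_ids, patience=5, click_prob=0.8, rng=None):
    """A toy rule-based simulated user: scan results top-down, click a
    relevant result with probability `click_prob`, and give up after
    examining `patience` results.

    Every rule here (patience, click probability, top-down scanning) is
    a researcher-chosen assumption -- exactly where over-fitting between
    the simulated user and the system under test can creep in.
    """
    rng = rng or random.Random(0)  # seeded for reproducible runs
    clicks = []
    for doc_id in results[:patience]:
        if doc_id in relevant_ids and rng.random() < click_prob:
            clicks.append(doc_id)
    return clicks
```

If the system is then tuned until this user "succeeds", the tuning may only reflect the rules, not real searching behavior.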

  2. 1) In the introductory chapter about Interactive Information Retrieval, it is mentioned that IIR has historically emphasized effectiveness, innovation and application. Since IIR studies are said to lie in the middle of the continuum anchored by system-centred studies and user-centred studies, does it not make more sense to carry out IIR studies emphasizing interaction and refinement?

    2) In the TREC-9 track, it is mentioned that 8 fact-finding tasks were created, with 4 n-answer tasks and 4 specific comparison tasks. What is the significance of a fact-finding task, given that it results in the identification of a document and not information extracted from that document?

    3) In the HARD Track, what is the significance of the emphasis on single-cycle user-system interactions? What characteristics can be observed in a single-cycle user-system interaction compared to a multi-cycle one? A user trying multiple searches generates more information to analyze than a user with just a single interaction with the system.

  3. 1. Some of the TREC Interactive Tracks included multiple participant sites. There is no mention of the context of each of those sites, or whether it was an experimental design consideration. Context is usually a factor in most behavioral research, and IIR research includes a cognitive viewpoint. How, then, if at all, should IIR research that is not restricted to a single, controlled and consistent environment represent the participant’s context? Is it relevant when evaluating users’ search behaviors and associated interactions, especially when making “cross-site” comparisons?

    2. From their onset the TREC Interactive Tracks were supposed to develop an “evaluation framework for studying interaction and users” (pg. 17); however, don’t you think that, at least initially, there was a certain disregard for the human/cognition side and the corresponding analysis? For example, TREC-3 had no protocol for the type or number of participants as long as they could work the system (pg. 18). Similarly, TREC-4 had “no standard protocol for administering the experiments” (pg. 19) as long as participants could perform the required tasks. Does this demonstrate a system-focused method, and is it recommended for IIR research today?

    3. With reference to the diagram on page 10: log analysis is represented as more system- than human-focused, while the TREC Interactive Tracks are the mid-point. Don’t you think there is a conflict between the two, in that the Tracks, especially the initial ones, seem less human-focused than a log analysis, which is fundamentally a description of the user-system interaction? So, should this diagram distinguish between the different Tracks? Also, while a log analysis may not directly explain behavioral trends, it at least accounts for them. So do you think its position on the diagram needs to be revised, especially if the Tracks are split up?

  4. 1) When describing TREC-5, Diane notes the creation of a new task dubbed the “aspectual recall task,” whose goal is to “[require] subjects to find documents that [discuss] different aspects of a topic rather than all the documents relevant to a topic.” However, this explanation resembles the definition of precision rather than the recall measurement discussed in class. Why is this?
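For discussion, the distinction can be made concrete with a toy computation: aspectual recall is still a recall-style measure, because its denominator is the set of a topic's aspects rather than the retrieved set. The document and aspect labels below are invented for illustration:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def aspectual_recall(retrieved, doc_aspects, all_aspects):
    """Fraction of a topic's aspects covered by the retrieved set.
    Finding ten documents about one aspect scores no better than
    finding one document about it."""
    covered = set()
    for doc in retrieved:
        covered |= doc_aspects.get(doc, set())
    return len(covered) / len(all_aspects)

# Hypothetical topic with three aspects A, B, C:
doc_aspects = {"d1": {"A"}, "d2": {"A"}, "d3": {"B"}}
retrieved = ["d1", "d2", "d3"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall(retrieved, relevant))                                # 0.75 (3 of 4 relevant docs)
print(aspectual_recall(retrieved, doc_aspects, {"A", "B", "C"}))  # 0.666...: only aspects A and B covered
```

The task de-emphasizes "find all relevant documents" (ordinary recall) in favor of coverage of distinct aspects, which is perhaps why it reads like a precision-flavored goal at first.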

    2) When explaining TREC-6, Diane comments that participants wanted to create their own baseline systems and that this would make cross-site comparisons impossible. Other than increasing participation, what are some additional benefits of allowing participants to do this?

    3) During this semester, all of us will need to consider research approaches (exploration, description, or explanation) especially when we decide on our topic for the term project. Which research approach would be better suited for students with little or no research experience? Are there any drawbacks and benefits of picking one over the others?

  5. 1. Kelly writes that “although it is often claimed that the system is being evaluated and not the user, in practice this is difficult to do since a user is required for interaction.” To what degree do you think that pure system testing is possible? Do you agree with this claim that there is always an evaluation of the user, even if the user is the researcher?

    2. On page 16, Kelly discusses how once a user begins interacting with a system they start to learn the system. Can you ever have a pure user to test your system? Can you or should you design experiments to mitigate this learning bias?

    3. Kelly argues that single-system tests are “weaker” than comparison tests (p. 27). Do you agree with this claim? What would be the benefits to running a single system test where more qualitative and contextual data could be gathered from the user?

  6. In the second part of this paper, the author introduces several studies ranging from system focus to human focus. The discussions of these studies, however, are a bit general. Could you compare these studies in more detail, especially each study’s pros and cons?

    In the cognitive viewpoint, it is pointed out that information seeking and retrieval can be influenced by the cultural environment. So, is it really true that behaviors or patterns of seeking information differ with users’ backgrounds; for instance, will information-seeking behaviors differ between American people and Chinese people? And how about age? Are the patterns of seeking information different between young people and old people?

    The author mentions that case studies, like interviews, have been conducted by an increasing number of researchers. In a case study, we usually get information through conversations. However, the interpretation of conversations by different recorders may sometimes vary due to their unique backgrounds; in this sense, getting information through case studies is likely to be unreliable and subjective. So, are there any equipment or approaches that can overcome this disadvantage by recording and analyzing conversations more objectively and scientifically?

  7. 1. Compared with traditional IR, IIR is more user-oriented. The variability of personal preference and behavior will have a much stronger effect on the evaluation. How does IIR control these variables, and what is the difference between IR and IIR in terms of the control mechanism?

    2. The chapter briefly mentions that standard precision and aspectual recall were used as metrics for evaluating performance in the Track. Why are these metrics used, and how does the use of metrics differ between IR and IIR?

    3. The chapter gives a nice introduction to interactive information retrieval. One thing that is not mentioned is how IIR studies can be incorporated with strongly system-oriented and user-oriented studies. Which method is supposed to be used in what environment?

  8. 1. The evolution of the TREC Interactive task does not appear to have been smooth. While it seems to have moved through phases, from exploratory in design through explanation, I am not sure. (It is also not clear whether TREC-12 enables explanation, or whether only statistics were reported after comparison with automatic methods.)

    2. The claim of the TREC HARD track is that, while the experiment is not longitudinal, capturing user context qualifies it as IIR; this is still unclear to me. The metadata recorded can help measure many metrics to quantify the search, but why this task is less useful for cross-site evaluation yet very useful for single-site studies is not clear.

    3. Case studies seem to be ideal experimental setups when designing personalized search environments, or when optimizing for context-specific searches. What is not clear is the breadth-to-depth focus of these case studies and what qualifies as a case study.

  9. 1. Chapter two depicts the wide range of possible experiments in the information retrieval field. One end of the spectrum is system-focused studies and the opposite end is user studies. Most of the papers I read mentioned that information retrieval is inherently a user-oriented process, but that system studies tend to be the preferred method of evaluation. A range of reasons, such as cost and time, are mentioned for why studies have tended toward the more controlled system experiments. The author does a nice job in this chapter of outlining the different steps on the spectrum and the types of experiments typically associated with each. However, I am now curious about the frequency of these different experiments and whether the internet age has affected the types of evaluations. With the increase in web-based applications related to information retrieval, has the emphasis started to switch to user-based studies? In my software architecture class, we discussed how a company thrives based on its reputation. This point was, again, brought up in our in-class discussion. Therefore, I would think that the focus of companies such as Google would be on ensuring high user satisfaction.

    2. Most of the evaluations mentioned in this paper, as well as the others, compare two or more information retrieval systems. The author also addresses evaluations of a single system; the only example given is usability tests. As a developer makes changes to an existing system or develops a new one, what type of testing takes place in isolation? All the papers have mentioned the time and cost of conducting comparison evaluations, so I assume there is some standard developers use to determine whether their system is worth further evaluation.

    3. Chapter 4 mentions natural experiments and gives an example experiment performed by Anick, who had one subset of users interact with the experimental interface and a different subset interact with the original interface. In class, we focused on the steps that Google goes through before releasing a new version of Google Search; one of those steps was the same as Anick’s. Google using the natural experiment would seem to lend weight to investing in these types of studies in future evaluations. Looking back at the spectrum of IIR research depicted in chapter 2, is it possible to build best practices for evaluating IIR systems? For example, are there certain system- or user-oriented tests that are best performed earlier or later in development? Are system tests better at the beginning of development, and user tests better at the end, when tweaking the system to achieve better user satisfaction?

  10. 1. The Kelly article contrasts systems-interested IIR work with more cognitive-science-interested IIR work. With the cognitive science orientation the user is more the center of study, one key user-oriented study feature being the user’s “social and cultural” context. While I think it could be interesting to know how these factors might influence information-seeking behavior, doesn’t the relative nature of an individual’s context and perspective make these types of studies non-generalizable and effectively less useful?

    2. The Kelly article goes through considerable effort to point out some flaws in the past TREC tracks, yet the author does not propose new experiment designs to overcome them (at least in the chapters we read). In what ways can IIR experiments be redesigned to capture more user-centered factors?

    3. Throughout the paper emphasis is placed on the user-centered IR experience, but how much can the user be trusted to faithfully relay their true IR experience? How strong are people’s meta-cognitive abilities and how might that impact user-centered IIR research design?

  11. 1. Kelly makes clear that those conducting “Wizard of Oz studies” are aware of the pitfalls of using simulated users in an experiment, but some researchers still use this approach for the sake of time and economic concerns. What struck me, though, was the fact that the other benefits of using simulated users were their “carefully controlled characteristics”(p. 30). Isn't the fact that human users have behaviors that are impossible to anticipate, shape, or truly “control” their most desired attribute in an experiment? How can studies that rely on simulated users prepare a search engine for successful interactions with human users?

    2. Section 3.2 deals with the “Text Retrieval Conference,” and summarizes each TREC. I am wondering how closely tech companies and innovators follow these conferences. For example, did the introduction of n-answer tasks (“these tasks required subjects to find some number of answers in response to a question” [p. 21]), “specific comparison tasks,” and ciQA tasks (section 3.2.3) lead to the possibility of complex IR systems like AnswerBus? What is the timeline in terms of comparing emerging, mainstream search engines, and what IR strategies are incorporated into conferences like TREC?

    3. I was surprised by the difference between Kelly's description of what happens during TREC proceedings and the Voorhees description. Kelly claims early in chapter 2 that these studies were “directly related to the human” and that researchers “use interviews” and acquire “feedback” (p. 12) from subjects. Later, Kelly goes into depth about how participants were supposed to come up with ideal routing queries (Section 3.2.1), which culminated in an evaluation of human techniques vs. automatic ones, and then even further into “user-system interactions” (p. 22) in the section on TREC HARD. Perhaps it is because the Voorhees article was primarily focused on the original Cranfield TREC experiments, but I am surprised that Kelly presents these proceedings closer to the human-studies end of the spectrum than the systems-studies end.

  12. The work for the Interactive Track (TRECs 3-11) points to a general research problem involving human case studies: given human limitations (physical and cognitive), it is not possible to construct a perfect evaluation covering most validity issues. Instead, researchers must make tasks simpler (using only four n-answer tasks and four specific-comparison tasks in TREC-9, for instance) for a limited set of topics. This inevitably poses a validity issue (insufficient topic coverage), which can be addressed by requiring a larger number of subjects and resources (well, then the limitation possibly shifts to cost and time).

    The HARD Track focuses on single-cycle user-system interactions; it provides an opportunity for the participants to elicit feedback from the assessors. But the assessors themselves might complete a number of interaction forms for a given topic multiple times, which poses a validity issue of learning effects. I am wondering how much randomization can help here, and whether there are any better solutions to mitigate this threat.
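One standard alternative to pure randomization is counterbalancing the order in which assessors see topics, so that learning effects are spread evenly across topics rather than eliminated. A minimal sketch using a cyclic Latin square (the topic labels are hypothetical; fully balancing carryover effects would take a more elaborate design such as a Williams square):

```python
def latin_square(items):
    """Build a cyclic Latin square over `items`: each row is the
    presentation order for one assessor, and every item appears in
    every serial position exactly once across assessors."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

topics = ["T1", "T2", "T3", "T4"]  # hypothetical topic labels
for assessor, order in enumerate(latin_square(topics), start=1):
    print(f"assessor {assessor}: {order}")
```

With this design, no topic is systematically advantaged by always being judged last, after the assessor has "warmed up".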

    In the paper, the author mentions that in some types of IIR studies only a single system is evaluated, traditional usability tests being an example. I am wondering, since there is no point of comparison, how researchers are able to identify those usability issues that users have grown used to (but which are still problems). For instance, only by comparing Windows (before Windows 7) with Linux do users realize they have been deprived of multiple desktops and of switching between them easily with a shortcut key.

  13. In studies focused on transaction logs, at what point do the assumptions made about users move from being straight assumptions into universally accepted norms for the whole gamut of users interacting with the search engine? And how do these assumptions and eventual norms influence user actions in a sort of feedback loop?
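A concrete example of such an assumption baked into log studies is session segmentation. A minimal sketch, where the 30-minute inactivity timeout is exactly the kind of analyst-chosen convention the question describes; it silently defines what counts as "one search session" for every user:

```python
from datetime import datetime, timedelta

# Analyst-chosen assumption: a gap longer than this ends a session.
TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Split one user's time-ordered (timestamp, query) events into
    sessions whenever the gap between consecutive events exceeds TIMEOUT."""
    sessions = []
    for ts, query in events:
        if sessions and ts - sessions[-1][-1][0] <= TIMEOUT:
            sessions[-1].append((ts, query))
        else:
            sessions.append([(ts, query)])
    return sessions

log = [  # hypothetical single-user log
    (datetime(2013, 8, 24, 9, 0), "trec interactive track"),
    (datetime(2013, 8, 24, 9, 10), "aspectual recall"),
    (datetime(2013, 8, 24, 14, 0), "wizard of oz study"),
]
print(len(sessionize(log)))  # 2: the ~5-hour gap splits the log in two
```

Once every published log study adopts the same timeout, the convention starts to look like a fact about users, which is the feedback loop the question asks about.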

    Due to the five listed dimensions of cognitive viewpoints as outlined by Ingwersen and Järvelin, is it possible to find a true middle ground that encompasses the user experience as well as the experiences of the developers or researchers? Or is it best to simply continue hitting each point of the spectrum and studying interaction along the entire landscape while attempting to include the cognitive aspects of everyone involved?

    According to Kelly, “Very often researchers stop at prediction and do not pursue explanation, but it is actually explanation that is tied most closely to theoretical development” (p. 26). Wouldn’t explanations be a solid foundation for the advancement of understanding IR-related topics, particularly involving the different kinds of studies?

  14. During the TRECs that Kelly discusses, participants were allowed to bring in subjects to test that year's topics and provide data. Were the criteria for being a subject in these trials uniform across the participating sites?

    Kelly mentions that in TREC 6 Interactive, subjects spent 3 hours going through each topic. In TREC 9, participants had gotten the length of subject sessions down to 5 minutes. Was this an effort to provide more of a typical user session within the IR experiments?

    Kelly mentions Heine with regard to his discussion on simulation experiments where instead of actual users, researchers build simulated users to carry out the tests. I understand the idea behind being able to control as many variables of an experiment as possible, but isn't the point of IR to serve the needs of actual users?

  15. 1. How should we decide the weight of system and human factors in the different types of archetypical IIR study? Is there a study that may be classified as more than one of the types discussed in Chapter Two?

    2. "Fundamentally IIR is about humans, cognition, interactions, information and retrieval, and most IIR researchers would probably align their research with the cognitive perspective." (p.17) How should researchers in the technical field study IIR? How can they align their technical background with the cognitive perspective?

    3. One of the discoveries about IIR evaluation is that “assessors’ relevance judgements were not generalizable and using these judgements to evaluate the performance of others was fraught with difficulty.” (p.22) However, whether assessors’ judgements are generalizable is decided by who will use the results of the evaluation. If there are no target users of a certain IIR system, both users and assessors should be employed in the evaluation.

  16. 1. Conceptually, I don't fully understand the difference between a study and an experiment. From the section on 'naturalistic' studies, it seems like a study would be an investigation where the scientist is not controlling for variables. However, the term gets fuzzied when the author mentions it is possible to conduct 'natural experiments', and in a previous section describes things like 'exploratory studies' which seem to enter experimental territory. Clear definitions delineating the differing characteristics of 'study' and 'experiment' would go a long way towards clarifying these issues. For example, is a study simply a sequence of observations?

    2. Are Wizard of Oz studies symmetric with respect to systems and users? It seems to me that when systems are 'simulated' users are blind to it, whereas systems might be 'aware' that users are simulated, because the designers of 'simulated' users might also be the designers of the system (or know something about the system), leading to a subtle asymmetry or bias. Are conductors of Wizard of Oz studies required to explicitly control for this bias or is it glossed over in practice?

    3. In the section on TREC interactive task, I didn't quite get what was meant by a routing query, or even a participating site. Moreover, why did automatic techniques for constructing routing queries outperform humans? The lessons learned from this TREC task aren't completely clear.

  17. 1. In the study where users are employed to make relevance assessments of documents in relation to tasks, there is a lack of interest in users’ search experiences and behaviors, and their interactions with systems. (pp.9-10) What is the value of users in this type of study then? What’s the difference between assessors and users?
    2. From TREC-3 to TREC-12, subjects are asked to perform different tasks. The role of the subjects is changing from receivers of information service and evaluators to information organizers. The complexity of users’ behavior is increasing. The question is what are the goals of the TREC studies? Are they evaluating the users or the system?
    3. Why are only assessors involved in HARD and ciQA? Why are there no users in these two tracks?
    4. “Not all explanatory studies offer explanations – many just report observations and statistics without offering any explanation.” (p.26) In this case, what’s the difference between descriptive and explanatory studies?

  18. It appears to me that the cognitive viewpoint gives a lot of importance to the users’ current emotional state. The current state of the user certainly plays a very important role in the user-system interaction but it is unclear (and not mentioned) as to how it can help create models about the general user-system interaction. Unlike the click mining and dwell time parameters which help in understanding users’ intent, emotional context is difficult to define and thus cannot be a direct parameter in constructing a general model for IIR!

    The various experiment tracks and their sequences show us the differences in the methodologies employed over the years. Apart from introducing us to the complexity of user-system interaction, have there been any other practical insights?

    The whole point of IIR is to get a hybrid of user and system perspectives. In the Wizard of Oz process, it appears as if we are settling for a partial study. Say we have simulated users and we perform Oz experiments indefinitely (for a long time). I feel this might produce insights into IIR bounded by the constraints of the user simulation. Thus, to me, it appears that we are dealing with compromised interactive systems, which may not produce reliable practical insights.

  19. 1. I understand how the utilization of transaction logs would culminate in results that are 'descriptive rather than explanatory'. However, I'm curious to know how the IR system handles issues like user tracking if a user restarts a search session after getting disconnected from the network, or conducts multiple different searches in the same session, as keeping track of every user while also identifying every user doesn't seem like a trivial issue. The author also states that it would be possible to manipulate this process and study differences in performance, but doesn't elaborate on any potential methodology that can be used as a manipulation. What are some effective strategies used in such scenarios?

    2. The TREC tasks' purpose isn't completely clear to me. I would assume that TREC-3 and TREC-4, which focus on routing and ad hoc retrieval respectively, serve to improve the efficiency of IR systems. In TREC-3 there are no limitations on the search space, and given that the 'subjects did search the training database', doesn't this process get rather expensive given the enormous knowledge domain? Similarly, in TREC-4 it was the onus of the users to select relevant documents from their sub-collections; wouldn't this process get rather overwhelming given that it required parsing through several documents?

    3. As simulated users are derived from real users, and it is through the simulated task that all forms of situational relevance are hypothesized, aren't we placing several assumptions on the users who form part of the training set? And even if we do attempt to get rid of these assumptions, to ensure a realistic simulation we would need to understand and draw comprehensive conclusions about the users through an independent relevance assessment, by monitoring retrieval as well as feedback mechanisms. Further, we would then have to evaluate the system in such a way that it caters to the dynamic nature of information needs. Wouldn't we be compromising at some point in this execution?

  20. I can see why simulated users would be appealing due to time/cost factors, but how realistic are these simulated users?
    Is there a percentage of error allowed?
    How closely related are simulated users to real users results?
    Are the simulated users constructed by one individual, and thus possibly tainted by that individual's biases, or are they pooled from many prior individuals' logs? Is this data given freely, or is it something users don't realize they are handing over to search companies?
    Also, what are these 'characteristics and values' the simulated users are given? Who decides them? And to what degree is each characteristic and value represented? This is fascinating!

  21. This article describes a continuum of evaluations that goes from user-based to system-based. The other two articles focus mainly on these two approaches as binary opposites. Would the ideas and methods they propose change if they examined evaluation methods closer to the center of this continuum?

    In this article the author states that the TREC Interactive Track showed that the relevance judgments made by assessors could not be applied to evaluate other assessors. However, the article by Voorhees shows that inconsistency between assessments did not interfere with the overall stability of the data set. Does this finding by Voorhees negate the findings of the TREC Interactive Track?

    In this article the author briefly describes the use of simulated users to test the interface of a system in a manner that is cheaper than using actual people. However, the author only briefly mentions that these simulations can be problematic due to concerns over realism. What steps could be taken to make these simulated users better at mimicking real user behavior?

  22. 1. In the study of “information behavior,” the researcher controls the aspects of what results are retrieved in response to a user’s query. What does the author mean by “control the aspects of the search process”? If search results are broken up and isolated to study information behavior, does it not lose the main property of retrieval, which is to retrieve all relevant documents for a search?

    2. In the TREC Interactive Track it is said that the “ad-hoc task required subjects to find and save as many relevant documents as possible. Subjects were also asked to create a final best query for the topic.” In a real operating scenario it is rare that a user issues a query that is a best match for an ideal retrieval, nor is the user going to look through hundreds of documents. So what was the objective behind assigning this task to the subjects?

    3. The idea of separating experiment and evaluation while discussing IIR is stated in section 4.2. In my opinion they are correlated, as an experiment results in evaluating a system’s functioning. Even though the approaches to evaluating a system might vary, how can this separation assist in the study of IIR?

  23. 1) A main finding of TREC 4 in the TREC Interactive Track experiments was that relevance standards of users differed greatly from that of formal judgment assessors. What are some solutions for this?

    2) Brainstorm: Invent some studies that would be considered Exploratory, Descriptive, and Explanatory. Discuss also- is each study an evaluation or an experiment?

    3) As the TREC Track experiments progressed, the researchers became more concerned with examining logs, clarification forms, and other test-subject documentation. In what ways did this change the experiment from the Cranfield model, and what new purposes are there for this data?
    3a) Sub-discussion: what are the privacy rights/concerns when encountering user-specific data? (Less of a concern in this context because subjects knew they were subjects, but more broadly important.)

  24. 1) I’ll preface by saying, this might be my internal bias due to my computing background, and if I was a psychology major my perspective might be different. I’m having trouble buying that for any real-life scenario the “human-focus” approaches would yield more interesting or useful results than a corresponding “system-focus” approach. Specifically, with all the data available through search results and click-throughs, time seems to be best spent in developing algorithms to improve these types of systems. Perhaps TREC interactive studies could help build some sort of intuition to guide this development, but in what case would a human study, on its own, yield more useful improvement to an IR system?

    2) As far as human studies go, the TREC HARD track was unique in that it took the ad-hoc document retrieval experiment and focused specifically on a single interaction. Due to the inherent noise and large datasets involved with system approaches, no amount of data mining could reveal the information regarding a user’s specific search process. In what ways could information about an individual’s specific search behavior be used in conjunction with inferred patterns that are revealed through system studies?

    3) The Voorhees article mentions the heavy cost involved with user evaluation, and uses that to motivate system evaluation. However, Chapter 2 outlines a continuum from system to human that has a large grey area. What sorts of attempts have been made at reinforcing system-based evaluations with specific, focused user evaluation? This intuitively seems like the way to go, since even at a small scale the precision gained from getting data directly from a few users could serve to further validate a system, or bring up questions regarding it.

  25. 1. I think that automatic queries would not be able to generate a sample representative of human behavior. Many times the user is not even sure of what exactly he is looking for, and only after numerous tries and runs would he be able to generate the query that works for him. The search engine has the chance to learn from this. But won't a simulated query generator be very specific about what it needs to look for, rather than querying for a wider set of results?
    2. The TREC tasks didn't take personal identifiers of the users, like location, language, and gender, into account. For many queries the result set for a user might vary based on them. How can a search engine be evaluated on returning efficient results when it is affected by these factors?
    3. The author brings up various ways to evaluate a search engine later, in Chapter 4. I am curious how the effectiveness of the various evaluation processes can be defined for a search engine. How can it be determined which evaluation process makes more sense for a given search engine?

  26. 1) Does the archetypical IIR study also evaluate the 'impact of new IR technologies' on users? Given that devices like Google Glass and other virtual reality devices are revolutionizing the way users interact with information systems, isn't this a major factor to be evaluated?

    2) In naturalistic IR evaluation, what is the scope for automation of evaluating IR systems? In general, the approaches did not discuss anything about automation.

    3) I really liked the concept of "Wizard of Oz studies". It should give first-hand feedback on emerging HCI technologies without the huge cost of developing the system entirely. How well can this type of study be applied specifically to search engines?