Thursday, September 5, 2013

12-Sep Kelly Ch. 10


  1. 1. How thorough do you need to be when thinking about the user's context? What factors are most important and why? In thinking about the user’s context, does the researcher risk forming conclusions based on assumptions about the effects of the user’s context?

    2. Kelly describes how difficult it is to measure topic familiarity and expertise (p. 105). What do you think is the best measurement of topic familiarity: degrees, classes taken, or a seven-point ranking scale? What aspects should be taken into account in new metrics, or is topic familiarity something that can ever be compared mathematically? Or should efforts just be made to eliminate the bias of topic familiarity by having a wide, random sample?

    3. Kelly discusses many measures for assessing the performance of an interactive information retrieval system (pg. 108-113). What aspects need to be included in a metric for evaluating the “apple” Google search example? Is there a particular metric you would choose? Would the metric you choose be different if writing a paper for an American History class?

  2. 1. Section “10.1.1 Individual Differences” includes a section on cognitive style, breaking subjects' conceptualizations of problems into two categories: “[T]he wholist-analytic style characterizes users according to whether they tend to process information in wholes or parts and the verbal imagery style characterizes users according to whether they are verbal or spacial learners” (p. 104). What would be a better indicator of these cognitive classifications: qualitative information from subjects' reporting on their experiences within the search process, or quantitative information logged during a subject's interaction with queries and results in a controlled search?

    2. Since the general American population is so familiar at this point with viewing ranked documents on a results page, it has become standard to trust that the documents which appear higher on the results ranking page are more relevant to our needs and interests than those ranked and appearing lower. But, as I was reading the section on Discounted cumulated gain which “allows for multi-level (or graded) relevance assessments, which makes it more versatile and reflective of how most subjects make relevance assessments”(p. 111), I began wondering about this pattern of selection. Is discounted cumulated gain based upon, or created to “reflect” a natural human behavior, or is our behavior a learned reaction to a programming practice we have become very familiar with?

    3. I am wondering if universities and institutions commission studies concerning cost and utility measures. The article describes a shift in the popularity of utility measures, due to the fact that nowadays, costs of IR services “are usually incurred by the user's institutions, the price of information services are often out of users' awareness, despite continuing to be an important issue for institutions” (p. 115). Is cost of information still a major area of research for businesses, institutions, or places of learning which have to absorb this cost for their users?

  3. 1. This chapter gives a nice overview of different types of measures for characterizing subjects, their information needs, and so on. One question that is not discussed and might be interesting is whether there are any intrinsic connections among these types of measures (contextual, interaction, performance and usability). Do these measures contribute independently or collaboratively to the understanding of the user’s activity?
    2. It is mentioned in individual difference measurement that creativity and cognitive style are measured. How can creativity be quantified? There are two basic aspects of cognitive style, wholist-analytic style and verbal imagery style. What are the differences between these two styles, and what are the advantages and disadvantages of these two styles? Are they used separately or collaboratively?
    3. Satisfaction is measured in usability studies. How do we quantify satisfaction? What are the relationships between satisfaction and other measurements such as efficiency and effectiveness? Intuitively, high efficiency and effectiveness will lead to higher satisfaction; if so, what’s the point of measuring satisfaction, which is to some extent redundant with the previous metrics? Since different people have different preferences when evaluating a search engine, how does the measurement of satisfaction eliminate this bias through the choice of subjects? Lastly, if satisfaction is the ultimate goal we want to achieve for our search engine, why not just measure satisfaction for IIR evaluation? Why bother to spend extra effort on other measures?

  4. 1. I think that Riding and Cheema (pg.104) may have missed a third basic aspect of cognitive style, that explains how people use the information they process. Don’t you think it is important to evaluate how a user furthers / implements / expends any information s/he has processed? Doesn’t it provide more insight on how people think about and approach problems, aka their cognitive style?

    2. RHL indicates that a user has retrieved half of the relevant documents (pg. 112). In a fluctuating environment like the Web does this measure still hold valid? So, in the context of Web search, can we pin down a definite number (or a total number) of relevant documents? Isn’t it a continuously changing figure?

    3. Evaluating usability encompasses some subjective measures that “gauge subjects’ feelings about their interactions with the system.” Subjects (may) come in with a number of biases, which affect not only their evaluation of the system, but also the way they interact with it. For example, a person familiar with Google search may have a negative reaction to Bing solely because it's different from what s/he is used to. Thus his or her interaction with Bing may be tedious, and vice versa. How do information professionals cater to such biases while evaluating their systems? Should Google / Bing just disregard such biases or react to them in an attempt to increase their user base?

  5. The author, Kelly, provides a thorough review of the measures that are employed in evaluating IIR. Although the chapter is comprehensive, I believe that the following points are worth discussing:

    1. While discussing the contextual, interaction, performance and usability measures, the author mentions factors that might affect a study, like method variance, the inefficiency of binary relevance, and conflicting results based on performance. What effect will normalization of the parameters (relevance parameters, user interaction parameters) have on improving the efficiency of the system?
    (Note : Cheng et al ( introduce a new set of normalized parameters in addition to those mentioned here and have claimed to provide better results in user satisfaction.)

    2. While discussing about the performance measures, the author mentions Time Based Measures (10.3.4) evaluating the time taken for the search process (search speed, qualified search speed). Do the Time Based Measures really reflect what we expect them to reflect? Are Time Based Measures co-dependent on factors such as connectivity, query traffic etc.?

    2.a) How do these measures help us understand the context of a user with a very slow internet connection? It is only logical to believe that the user's behavior is also dependent on factors which have not been considered here (like time constraints and bandwidth constraints)

    3. Are there any objective measures to evaluate IIR usability? Effectiveness, satisfaction, ease of use, learning and usefulness, preference and cognitive load (all the way down to 10.4.6) deal with subjective measures. These subjective measures could vary across individuals. Does this not create a need for more objective measures for evaluation?

  6. 1) How is relative relevance used in practice? What is the next step after finding the overlap between assessor's relevance and user's relevance? How can the system's relevance ranking be compensated/adjusted based on these overlap findings?

    2) The tools provided for usability measurements are questionnaires. Why can't the usability of the IR system also be derived from the other categories of metrics? For example, the 'number of queries' parameter along with 'search result' (whether or not the user found a relevant document) could contribute a component to the usability value of the system. In this manner, several components could be combined to get the usability value instead of relying on questionnaires. Why not?

    3) A system always trades off different measures to optimally meet the needs of average user community. In that case, how can an IR system trade off the four categories of measures (contextual, interactive, performance, usability) to address all users? How are these categories of measures related to each other and how to prioritize them?

  7. 1. The author mentions that part of the measures used to evaluate satisfaction is through questionnaires. Typically, when I read reviews of a product, they are either extremely negative or extremely positive. When a web search engine is working properly, I honestly don’t think much of it. Can this translate into the questionnaire responses? If the most relevant link is within the first 10 results and not the top result would people give an average evaluation when that could be considered very successful? The author mentioned how multiple items are used to assess satisfaction but based on the author’s elaboration, it looks like it is all open ended questions asked to an end user. Are there any mathematical considerations that are factored in with the questionnaire responses to determine satisfaction?

    2. To evaluate the performance of an IIR system, the author mentions the fact that IIR only has a handful of performance measures of its own. As a result, researchers will use measures from IR even though they involve too many assumptions that do not translate into IIR. The author even goes as far as to say IIR researchers have to “suspend disbelief” when using some of these measures. The author focuses on the discovery of the discounted cumulative gain measurement that seems to address some of the assumptions. In particular, the measure acknowledges that an algorithm may return an abundance of relevant documents, but an end user isn’t going to look through all of them and will focus on the higher ranked documents. DCG seems to help bridge the performance measure gap, but are there underlying assumptions that limit its use in IIR?

    3. The author briefly mentions the informativeness measure that is based on the idea that users can supply a relative relevance judgment rather than an absolute judgment. The idea is interesting because past authors have mentioned issues related to the time cost tied to absolute relevance judgments and the bias of pooling, an alternative to complete relevance judgments. Validation of this measure never occurred due to the death of the creator, but at the time of this chapter’s publication, researchers were beginning to revisit the idea. With the nature of web searches, where the end user is realistically only going to want to look through the first page of results, this measure seems like an interesting avenue for research to look into. However, is the interest in relevance research motivated by a belief in the technique itself or by the lack of other avenues? The author made it sound like there was a significant lull between the original research and the new developments. Are there weaknesses that the author didn't address since it is not a widely accepted measure?

  8. 1) When describing individual differences, Kelly mentions that the level of search experience is not prominent in current research due to lack of variability. Does this statement reflect the user's difficulty in adopting a new system? Moreover, is the statement implying that a novel IIR interface will have a difficult time being accepted?

    2) From the description of DCG, it seems that it is one of the best evaluation functions since it accommodates multi-level relevance, accounts for the user's unwillingness to look through many results, and reflects the observation that documents ranked lower are less useful. Are there any drawbacks or incorrect assumptions in DCG?

    3) When discussing effectiveness, Kelly defines Hornbaek's recall function as the “subjects' ability to recall information from the interface”. What is the reasoning or justification for this change in definition? Moreover, how can this new definition be quantified for evaluation purposes?

  9. Doesn’t graded relevance need calibration? Calibration in the binary case may be of less use. Also, doesn’t graded relevance break the assumption of independent relevance judgments (especially with no calibration)? Binary relevance too might introduce dependence in judgments, but it may be less evident. What I am trying to understand is the applicability of graded relevance in the set of measures discussed.

    Reliability of results is a key concern raised by the paper, which, as the author indicated, is dealt with by measuring a quantity from different information sources to draw consensus. Readings till now indicate that IIR evaluation is expensive and generally conducted in small groups; the chapter does not discuss the effect of this when quantifying the reliability of the discussed measures.

    Doesn’t inflation introduce systematic bias? And hence can it not be calibrated? The author makes a statement of lacking calibration on several user measurements; I am not clear as to why this is not possible.

  10. In talking about informativeness, the author introduces a method, relative evaluation of relevance, in this paper; here, I’m wondering what sort of statistical methods can be employed to evaluate the relative relevance if two or more IR systems are compared?

    As to the usability test, I feel a bit confused. Is it really very subjective? Is the self-report measure the only method to measure usability? I think many other means, like eye tracking and fMRI, are also available to the usability test. These methods are actually very objective. As to the self-report, do you think the attitude to it, held by the author, is a bit too critical? Although self-report scale is a subjective method, in fact, a well-designed scale is surely of great validity and reliability to measure individuals’ psychological characters. If used correctly, like many objective methods, it can measure user need, satisfaction, and other psychological aspects of IIR accurately.

    After reading this chapter of the paper, I find that many scholars are more likely to take factors such as system performance, user performance, and user needs into consideration when evaluating interactive information retrieval systems. However, is it possible that other factors, like the role of emotion, may affect information retrieval as well? For instance, is it possible that negative emotions experienced by subjects during search tasks may have a negative influence on their performance and satisfaction?

  11. On page 100 the author has categorized the classes of measure as: contextual, interaction, performance, and usability. But none of these four classes evaluates the overall user experience. One might argue that user experience is a combination of these described factors, but there are many perspectives that are being missed by these classifications. A user experience also needs to take into account the following facts:

    a. Reliability of the data set being returned. For eg. If incorrect directions are returned by GoogleMaps the user experience would be very negative as compared to when an unreliable document is returned to a query on Google web search. How can the importance of reliable data being returned be judged?
    b. Each user might be looking for something different. Some might be looking for more credible data whereas for some only basic information on the topic might be the ideal result. For eg the results expected by a student would be different than ones expected by a scholar on the same topic. How can a measure be put on the expectations a user has from a search engine? Also how can an algorithm take into account the various expectations users have from the data being returned?
    c. The way in which the data is being represented might affect the way the users perceive the result set. For eg. A user might be delighted by the fact that when he queries for Apple, instead of just seeing a whole lot of text, he is able to see the share price of the company Apple, he is also able to see links to places where he might be able to buy an Apple product, and so on. So how can evaluation be quantified for the presentation of the results being returned?
    d. Sometimes specific web sites are sought by some, because they wish to avoid commercial sites or are looking for documents in other languages. How can these criteria be quantified in the evaluation of search engines?

  12. 1. How important are measurement metrics? Are they the only way to rate a system’s reliability? How can one measure the validity and reliability of measurement metrics?
    2. Is classification of metrics useful? How important is classification during evaluation? Isn’t there a lot of ambiguity in classification, as there can be more than one category where a metric can play a significant role? For example, “Task difficulty” is a measurement treated as a contextual measure while it is also an integral measure that determines efficiency and performance of the system. So to which category would it belong, and does that make any difference?
    3. The “Qualified Search Speed” is search speed for each relevance category. Usually while viewing search results, most users consider the entire result set as one and mostly end up looking into the top ones. Thus, by default, it renders the lower ranked documents of lesser or no use. So why is QSS considered as a time-based measure for evaluating the IIR system?

  13. 1. Have there been concrete studies (there was a mention in the reading, and suggestions, but I did not explicitly see examples of studies) that actually tailor their IR algorithms to simple contextual cues? For example, if the target audience is female (to take the simplest contextual information: gender) then relevance judgments for some queries might be different from that of males. An algorithm that is able to derive better results, given this contextual information, would be more tailored to a user's needs than one that does not. However, without quantitative data, this is hard to assess.
    2. An important point mentioned in the reading was how relevance judgments from domain experts often differed from those of users. In an earlier reading, we saw an analysis of how differing overlaps of relevance judgments (still from domain experts however) still reproduce consistent results and rankings of algorithms. Has there been even a single study that takes relevant judgments from users and domain experts and compares IR methods along this dimension to reproduce the same consistency in rankings? I suspect not.
    3. Although usability was found to have three primary components: efficiency, effectiveness and satisfaction, I didn't see anything on how we could combine these measures to come up with a unified usability score that would answer the basic question on whether one system was more usable than the other. If all this data could be quantified (perhaps has been), we would have a better idea of the relative importance of these components, and the tradeoffs. Have there been efforts in the research community to do so?

  15. In the contextual measures section it is mentioned that even the slightest changes in measurement methodology result in considerable variance (which is called Method Variance). Depending on the purpose of a study, different measurement methods might be used at different phases of the experiment. Does this mean that we have to limit ourselves to the 'Standardized' measuring methods? How do we overcome the above limitation?

    In the discussion of individual differences it appears that the level of Web Search expertise and Domain expertise of an individual play a key role in determining the context. Web Search expertise could be tracked from a user's search history. However, Domain expertise depends on whether a particular query belongs to a domain, and classification of web-space into domains is a tricky problem. Hence, my opinion is that the domain expertise parameter can only be realised in a laboratory setting, where the finite number of topics can be classified into domains, but not in a real user-system interaction.

    Both DCG and RHL take into account the multi-level relevance of documents. However the pros and cons of each of them are not mentioned. Are these two measures correlated? What are the contexts in which using one over the other proves to be an advantage?
    (I think DCG is more Precision-oriented and RHL is more Recall-oriented. Is this wrong?)

  16. Self-report measures are used to probe subjects about their attitudes and feelings about the system and their interactions with it. Are there any validity issues regarding this approach besides response biases mentioned? Also for response biases, are there any effective solutions to mitigate them?

    How do we define interaction? How do we quantitatively measure interaction? How do we interpret interaction measures? How do we determine whether one particular interaction is good or bad?

    In IIR research, systems and users' experiences can be evaluated based on flow and engagement. Considering both of them are abstract, what are the effective ways of measuring them and how practical are those measurements (if they do exist)?

  17. 1. Page 104, 2nd paragraph: the measures mentioned here, like intelligence, creativity and so on, are hard to measure in numerical form. How do they serve the purpose mentioned by Boyce as stated at the beginning of section 10.1.1?

    2. In section 10.3.3, several measures were mentioned. For DCG, how is the log base determined, or, which factors can impact the value of this log base? RHL is also an interesting one. What is its relationship/difference with DCG? Which measure is better? Are there any criteria to make a decision that which one is more suitable in a given scenario?

    3. Overall, this chapter listed many measures in different categories. But it seems the classification of each measure is not separated. For example, precision and recall may be considered in both Performance and Feedback. So, what are the basic or common measures that should be considered in IIR/HCI?
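On the log-base question in 17.2, a minimal sketch may help. It assumes the commonly cited Järvelin and Kekäläinen formulation of DCG, in which ranks below the log base `b` are not discounted and each later rank `i` contributes `gain / log_b(i)`. The function name `dcg` and the sample gain values are illustrative assumptions, not taken from the chapter.

```python
import math

def dcg(gains, base=2):
    """Return the discounted cumulated gain at every rank (1-indexed).

    Assumed formulation: no discount for ranks < base;
    from rank == base onward, each gain is divided by log_base(rank).
    """
    total = 0.0
    cumulated = []
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain                          # undiscounted early ranks
        else:
            total += gain / math.log(rank, base)   # discounted later ranks
        cumulated.append(total)
    return cumulated

# Hypothetical graded judgments (0-3) for a ranked list of six documents.
graded = [3, 2, 3, 0, 1, 2]
impatient = dcg(graded, base=2)    # steep discount: a user who gives up quickly
patient = dcg(graded, base=10)     # nothing discounted before rank 10
```

Under this formulation the base acts as a model of user persistence: a small base penalizes relevant documents found late in the list much more heavily than a large base does, so the choice is an assumption about the searcher rather than something derived from the data.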

  18. Kelly mentions user biases in various other fields that use self-reporting. In the case of IIR, the same sorts of biases would no doubt appear when subjects are questioned about topics such as ease of use and satisfaction with a particular search engine. Because of this, does a measurement system that includes self-reporting provide enough substantial feedback to validate the continued use of self-reporting?

    Given the disparity that can occur between a TREC assessor's relevant document choices and the document choices made by random subjects, relative relevance has come into play. In order to compensate for contextual differences and even the other classes of measurement, it seems that relative relevance offers a clearer solution, albeit not a perfect one, for creating a set of relevant documents. What I’m wondering is: do the documents in the overlap between the TREC assessor's choices and the subjects' choices automatically move up in a relevance ranking, or do multiple assessors and subjects get a crack at the documents to generate more data and get a clearer document set?

    In section 10.3.5 Kelly mentioned the work done by Tague in relation to informativeness. Would using informativeness as a benchmark for relevance make more sense in determining relevance with the knowledge that the same kind of potential problems would arise due to subject contextual differences? Or would the scale allow for a better filtering out of the difference and reveal a better evaluation method?

  19. 1. Individual differences are mentioned in section 10.1.1. It seems that these variables are useful to build a classifier. Will these variables be used in classification? Furthermore, what is the value of such classifier in IIR?
    2. It is mentioned in section 10.1.2 that “it is difficult to devise instruments for measuring them”. In such cases, how are these variables used in IIR measurement. Are they only considered in qualitative analysis? Is there any method to handle them with quantitative methods?
    3. One of the assumptions for DCG is “the number of topically relevant documents in a corpus is likely to exceed the number of documents a subject is willing to examine”. (p.111) What will happen if such assumption does not hold?

  20. 1. Kelly mentions the differences between IR and HCI with the definitions of terms like "usability." Seeing as IR is a cross-discipline field, wouldn't it make cooperation between the various parts of research easier if there were unified definitions of concepts?

    2. Kelly discusses the difficulty in determining a subject's expertise level in a field for the purposes of testing. The typical seven point scale doesn't seem to give enough information for most testing. Could researchers not provide a small quiz or series of questions on the subject before the session to help determine that level of knowledge?

    3. The bias of subjects has been looked at when it comes to self-reporting, whether it's the amount of time it took them to perform designated tasks or rate their experience with a system. Could there not be a psychological phenomenon behind these types of responses to post-session reports that could be accounted for if understood?

  21. 1. Since the assessor judgements are evaluated on the basis of a static, ad hoc paradigm and appraised through a binary assessment, are we leaving any room for effects due to dynamic phenomena which could affect the relevance of the document lists when they are judged by the user? For instance, the user's perception of the social desirability of the document at that instant, or the learning curve the user has been on since he started parsing through 'relevant' documents: the intuition he has gained on the topic so far should affect the relevance of the next 'relevant' document purely through redundancy. How do we account for this perceived self-efficacy and the state of mind of the individual when these parameters are variable and do not really have a value to attune them to? Also, how do we hope to resolve this critical gap caused by objective and subjective response biases?

    2. Even if we are accommodating multi-level relevance and rank, the assessor judgements may not be a right fit for the user, as both Dunlop's Expected Search Duration and Cooper's Expected Search Length are ultimately based on a predefined retrieval unit. For example, the Expected Search Duration is actually a product of the expected number of documents retrieved and the average duration. Further, while estimating these values, what approximation is used for the time that the user spends parsing through documents which are not in fact considered relevant for the user? And again, I think the issue of redundancy creeps in while computing a factor like the cumulative gain, as information can be marked 'less relevant' purely because the user weighed it less since it was generated later. How do IIR systems take these issues into consideration?

    3. When attempting to evaluate feedback from the subjects, the paper elaborates on research that has been conducted towards understanding the 'Ease of Use, Ease of Learning and Usefulness' of the IIR system. However, these metrics also depend in some way on the evaluation metrics of effectiveness and efficiency. Doesn't this imply a bias towards our current evaluation metrics? Also, wouldn't it be more tangible to work towards finding additional measures for evaluating IIR rather than introduce additional factors which need to be taken into consideration but do not have a unit for calibration? And so, to what extent does this domain knowledge of the system and the user influence the performance of IIR, especially when a sample set of people is never truly representative of every user on the web?
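To make the concern in 21.2 concrete, here is a small sketch of the product described there (expected number of documents times average duration), next to a hypothetical variant that gives non-relevant documents their own reading time. The function names, the split into two reading times, and all numbers are assumptions for illustration only; they are not Dunlop's definitions.

```python
def expected_search_duration(expected_docs, avg_seconds_per_doc):
    """Product form as described in the comment above:
    expected number of documents examined times average time per document."""
    return expected_docs * avg_seconds_per_doc

def split_search_duration(relevant_docs, nonrelevant_docs,
                          seconds_per_relevant, seconds_per_nonrelevant):
    """Hypothetical variant: relevant and non-relevant documents
    are allowed to cost the searcher different amounts of time."""
    return (relevant_docs * seconds_per_relevant
            + nonrelevant_docs * seconds_per_nonrelevant)

# Illustrative numbers only: 12 documents examined, 4 of them relevant.
uniform = expected_search_duration(12, 30)        # one average for everything
skimming = split_search_duration(4, 8, 60, 15)    # non-relevant docs skimmed quickly
careful = split_search_duration(4, 8, 60, 45)     # non-relevant docs read closely
```

With a single average, the first two cases are indistinguishable even though the searcher behaves very differently; the estimate only moves once the time spent on non-relevant documents is modeled separately, which is the approximation the question asks about.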

  22. 1. In section 10.4.2, Kelly says that subjects are asked to make holistic evaluations, basing their preferences on entire lists rather than individual documents. Why can't evaluations be based on individual documents? Evaluating individual documents may not be totally useless.

    2. In section 10.4.3, Kelly mentions the workload method, which uses six factors to measure workload. But it is still a very vague method of measurement. It is very hard to define workload requirements. For different individuals, different weightings of the instrument may lead to totally different results.

    3. Kelly mentions that IIR research could take flow and engagement into consideration (p. 123). However, the notion of flow is also a broad concept. How do we judge whether a person is fully immersed in what he is doing, as characterized by a feeling of focus?

  23. In 10.1.1 Boyce is quoted as saying, "The purpose of measuring user characteristics separately from the search process is to be able to use them to predict performance or to explain differences in performance". While I understand and agree with this concept, I also wonder how one scales user characteristics and ties those various scales to performance. User characteristics (as described earlier on in the paper) range from sex, personality type, age, geographic location, etc.

    In IIR, the user is taken into consideration when creating metrics and performance data. How do system based analytics such as described last week in Kelly's description of where there are 'simulated users' compare to these studies that evaluate the entire experience? Is there a place where the two versions of 'user' are averaged out?

    In regards to time measures, there are time-on-task (TOT) measures that compare a new user's TOT to an experienced user's TOT. Can these measures be a way to measure the learnability of a system?

  24. 1) During the discussion of performance measures, the author mentions informativeness (proposed by Tague) as a method that uses “relative evaluations of relevance” rather than “absolute measures.” I found this technique to be intuitively far superior to the “scaled relevance” technique (where each user would rank a document on a scale of relevance), since it has the potential to yield much more accurate assessment which is free of measurement bias. Specifically, if a user had just seen several documents that were “very” irrelevant, the user might be more inclined to rank a more relevant document even higher, simply because it was “so much more” relevant than the recent history of documents. Meanwhile, relative relevance inherently forces the user to compare all documents against all other documents, instead of just the results that are “nearby.” In what scenarios might the “relative relevance” technique prove less reliable than the more traditional “scaled relevance” method?

    2) Chapter 10 states that effectiveness, efficiency and satisfaction are the three primary components of usability. In previous readings, we read about effectiveness and efficiency and how they typically sit on opposite ends of the spectrum, in that more effectiveness can often mean less efficiency and vice versa. Where does satisfaction fit into this relationship? The author states that satisfaction is difficult to quantify due to its subjective nature. Moreover, it seems that satisfaction can be largely defined by efficiency and effectiveness (depending on the context and the user). Is there any value in measuring satisfaction, when that time and effort could be spent on effectiveness and efficiency?

    3) In Chapter 10, the author notes that “self-report” is the key feature distinguishing usability measures from performance measures. The author notes that studies using “self-report” data are far more susceptible to bias resulting from method variance than studies which do not involve user interaction. In what ways do usability measures make up for this added bias?

  25. 1- What do psychometrics have to do with IIR evaluation? What IIR class of measure do they most influence? What are some ways to control/account for or reduce these effects?

    2- A common theme among readings is the problem of using binary and static relevance judgments for Cranfield and TREC testing when in fact actual users in real systems use completely different systems to determine relevance. Are evaluations based on these relevance judgments valid/useful? A lot of work has been done on inventing new metrics to account for this, but what can be done in terms of experimental design? What can/should an end user contribute to experiment design?

    3- Idea that came to me that I would like to discuss: The article concerns itself with describing many of the contextual factors that can make developing and testing searching systems difficult. They basically boil down to: lots of different types of people are looking for different types of information for lots of different reasons. Instead of concerning ourselves with making a single system that can cater to them all at once, what about developing a search engine that can be customized by the user to perform different searches at different times? For example, a user can select if they want scholarly sources, or if they already have some background on the subject, or if they are searching for information for a younger person. These settings would be able to be toggled on and off per search so that the user is better able to affect the outcome of individual searches. Some search engines might offer personality tests or cognitive style tests that users can then opt into applying to their results. What would be the obstacles in implementing something like this? Is this even useful to do overtly since big search engines are already working hard behind the scenes at profiling users and tailoring results?

  26. This comment has been removed by the author.

  27. In this article the author describes several different contextual variables important in evaluating IIR systems. One of these variables is the information needs of the search. The author states that it is very difficult to create instruments to quantify this information need. What are some possible methods that could be used to measure these needs? Since it is so difficult to measure these needs is it worth the difficulty to use them?

    In this article the author discusses the idea of Informativeness as a measure of the relevance of the results of a search. This method, put forth by Tague, differs from the traditional method of assessing search results by TREC and other collections and has garnered much interest in recent years. Why would informativeness be a more effective measure for evaluating search results than the traditional method of assigning relevance judgments to each document?

    The author of this article talks about the different definitions of the term usability. Mainly she focuses on the definition used by the International Organization for Standardization (ISO). However, she mentions that Nielsen has another definition of usability. What are the major differences between the ISO and Nielsen definitions of usability? Which one seems to be a better definition?

  28. 1. This chapter is an excellent introduction to many of the pitfalls of researchers’ measurement procedures. However, the assessment of interaction-based research and the sections describing measures such as Utility, Informativeness, and Cost, are limited in the factors that they cover and the types of research examined. For instance, there is limited discussion of why users may or may not be using a system, or their purpose for doing so, and how it may impact the type of interaction sought and the type of measurement needed (and, perhaps, the definition of the related constructs). For example, one question that springs to mind is, how do user interactions and preferences (and perceived utility/usability) differ when employing an enterprise retrieval system as part of a sequence of tasks performed in a team at work, versus using a system for personal research at home? Do retrieval researchers incorporate constructs such as “group value” or “intention to use” vis-à-vis constructs such as team hierarchicalization or task interdependency, or team objectives, or current IT artifacts?
    2. The short “Preference” section (10.4.2) drew my attention and wonder. Design and aesthetics have recently become popular topics for discussion and debate in some technology and business communities, with tech firms such as Apple (and IR branches such as Bing) drawing user preference data from a wealth of knowledge found beyond the answers to broad prompts such as, “how satisfied were you with the search process?” or “please holistically evaluate your results/experience.” It may not touch upon the technical side of their work, but I am curious as to whether or not IR researchers in academia incorporate more specific and detailed factors related to user satisfaction and preference, such as responses to minimalist vs. maximalist layouts, or to particular color schemes. Are there particular branches of IR research that manipulate and examine user interface design and presentation?
    3. Such concepts as “flow” may be particularly crucial to individuals working in environments where research has shown that employees’ work patterns or habits are in fact built into the work processes (e.g., hospitals). How do IR researchers currently examine flow? Have there been prior efforts to connect it to task type, user or team objectives, task interdependency, or (broadly) the cost of process disruption in a user environment?