1. The authors mention that users do not always travel straight down from document 1 to document k, considering every document on the way. Are there engine-specific features (e.g., summary text) or content-specific features (text size, document length, etc.) that may lead users to skip over otherwise-relevant documents, both in the lab and in practice? 2. The authors also mention that users occasionally return to previous documents. The purpose of this seems to be to gain additional information, perhaps about the query topic for its own sake or perhaps in order to better judge other documents. Are there content-specific features that may lead users back to specific documents? For instance, graphics, conceptual frameworks, rules or principles...? Have there been any efforts to examine and measure this? 3. This study raises the question of how document summaries are generated by search engines. Are they created and presented in a manner that minimizes the time required to make a document relevance estimation? Are there several approaches used by search engines in generating such summaries, or do they generally just collect them from the web pages themselves? What might be some better ways to depict the document content to this end than the current summary approach?
What information is included in the summaries of the user study conducted by Smucker and Jethani? As the authors mention, the summaries should be short. If the summaries in this user study provide many cues for decision making, the decision times measured here are less likely to be accurate. In the formula T_D(l) = al + b, it is indicated that b is a constant amount of time for making a decision. So, how can we estimate the constant b more accurately? Is it safe and precise to generate this constant from a sample of only 48 participants? Why do the authors compute the amount of time per summary for a participant by summing all time spent on the summaries page and dividing this result by the maximum rank of a clicked summary?
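For readers puzzling over the same formula, the linear decision-time model can be sketched as follows; the coefficients a and b below are illustrative stand-ins for the fitted values reported in the paper, not authoritative numbers.

```python
# Sketch of the paper's linear model for document judging time,
# T_D(l) = a*l + b, where l is document length in words.
# The coefficients are illustrative assumptions, not the paper's exact fit.

def decision_time(length_words, a=0.018, b=7.8):
    """Estimated seconds to judge a document of the given length.

    b is the fixed per-document decision overhead the question asks about;
    a is the marginal reading time per word.
    """
    return a * length_words + b

# A 1000-word document takes a*1000 + b seconds under this model.
print(decision_time(1000))  # 0.018*1000 + 7.8 = 25.8
```

Note that b acts as an intercept shared by all documents, which is why a small sample (like 48 participants) directly limits how precisely it can be estimated.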
1. Rather than the paper itself, which was an interesting read, what really intrigued me was that this paper won the best paper award. Doubtless, this was a well-written paper with solid experimental results, but I'm wondering which one (or both) of these two reasons (or more?) contributed to its win: because it tried to objectively model user behavior and incorporate it into a systems-level gain function, or because of the novelty of a time-calibrated measure? 2. Time-biased gain is highly correlated with MAP by Kendall's tau, and MAP has been better explored and has more discriminative power. So what is the practical utility of using time-biased gain vs. MAP? 3. It would be interesting to apply log data to a modified version of time-biased gain and compute the correlation between the two. That way, we could also deduce a correlation between preference gain and MAP, and the use of log data for tuning algorithms would have a sounder foundation. I wonder if any studies are addressing that, or if it's futile to do so? It seems to be a natural fit since timestamps are usually included in search logs.
1. This paper presents an interesting and novel measure of effectiveness. Despite all the novelties, there is space for future work as well. The TREC 2005 Robust track was used for this study, and since this paper was published last year, is there any reason why a relatively old test collection is used? What is the background information (age, educational level, cultural background, etc.) of the 48 participants in the study? Since the participants determine the search time, which is the most critical factor in this study, an improper selection of participants will place a strong bias on the results. 2. The time to judge a document was fitted with a linear model on document length. Does that match the real situation? Intuitively, the time to judge a document is strongly determined by the content or the difficulty of the document. Also, people become tired and distracted after reading documents for a long time, and as a result, they will spend much more time finishing long documents. This is not reflected in Figure 1, but that might be because the documents are not long enough. The authors also mention later in the paper that the time to judge a document is unique to each document. So is it better to try different models for the time to judge a document? 3. Why is equation (6) used as the decay function? Why is the "half-life" of the initial users used in this function? Also, since there are individual differences in rates of judging documents, is it problematic to use the average "half-life" of users as a standard? And does this "half-life" strongly depend on the group of users we're looking at (i.e., a highly educated group will probably be more efficient in searching and have a lower h)? Since h measures the time when users stop scanning, it is probably determined by the hardness of topics as well, which means h is topic-specific. If h is topic-specific, then the decay function is topic-specific, which prevents the general usage of this measure.
Has there been any study of the relationship between the "half-life" and the hardness of topics?
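The decay function these questions concern, with its half-life parameter h, can be sketched as follows. A topic-specific or group-specific h would simply rescale the curve; the default value below is an illustrative assumption, not a prescription from the paper.

```python
import math

def decay(t, half_life=224.0):
    """Exponential decay D(t) = exp(-t * ln(2) / h).

    D(0) = 1 and D(h) = 0.5, so h is the time by which half of the
    (idealized) user population is modeled as having stopped scanning.
    The default half-life here is an illustrative value.
    """
    return math.exp(-t * math.log(2) / half_life)

# A "harder" topic (larger h: users persist longer) discounts
# late-arriving documents less; a smaller h discounts them more.
print(decay(0))                   # 1.0
print(round(decay(224.0), 6))     # 0.5
print(round(decay(448.0), 6))     # 0.25
```

The half-life parameterization just makes the exponential rate interpretable: instead of an abstract decay constant, h answers "after how many seconds has half the population stopped?"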
1. In the introduction, the author mentions the two different approaches to IR evaluation: system oriented and user oriented. User-oriented studies are able to reflect a more realistic view of the users' search process, but these studies are noted to be very costly and complex. On the other hand, system-oriented studies use a simplified user model and rely on relevance judgments to determine the quality of a list of documents. In comparison to user studies, system studies are simple and repeatable. The author introduces a new measure for system evaluation that uses time as a way to determine the gain of a document. In support of the decision to use time as the driving factor, the author mentions how it leads to a more realistic user model, since users do not take the same amount of time to review every document. The measure can be viewed as helping to bridge the gap between system-oriented and user-oriented studies. There is not one set formula, and calibrating the formula to fit different user needs appears to require several different analyses to get the right values. Given this added burden on the study organizer, can one still claim the typical benefits of a system-oriented study? It could still be considered easily repeatable as long as the type of user and his information need don't change. 2. When calibrating values for the formulas, the population average time is calculated using a weighted average. Each user's contribution to the average is based on their level of activity. The author justifies the weighted-average approach because a simple average is known to hide the variation in the time different users take to judge a document. In class, we discussed the difference between MAP and GMAP. Since GMAP is not a basic average, it is able to highlight cases of poor performance, whereas MAP might end up hiding poor results due to a single or a few well-performing queries.
For two more probability measures listed after the average population time, the authors use a weighted approach as well. In the evaluations, the authors depict how the use of weighted averages enables their measure to reasonably reflect user behavior, whereas a simple average would have missed the mark. Outside of modeling user behavior, can the weighted average highlight special cases similar to GMAP? Can this be used to help gauge the "hardness" of a topic? 3. After presenting the measure, the author goes on to use the TREC Robust track as a detailed example of how to use the time-based gain measure in a study. The TREC Robust track was geared toward a user who wanted to find all of the relevant documents he could about a topic within a given time frame. The authors explain their choice of a standard exponential decay formula. At the end of their explanation, the author mentions how detailed log data could instead be used to create a decay formula as well as the expected time to reach rank k. If it is possible to derive this information from user log data, why does the author choose to estimate using the formulas he selected? In addition, why did the author feel the TREC Robust track would be a good example? The author mentions that log data would be suited for a web search evaluation. Why did the author not use a web search study as an example - particularly since the track the authors used as an example places a focus on recall, when most users are not interested in finding all relevant documents?
1. As Smucker and Clarke state on p. 3, most effectiveness measures are based on the assumption that users move through results by scrolling down a list. Is this a reasonable assumption? Do you typically scroll straight through or skip around when looking at search results? How could metrics be calibrated to account for users skipping around the page? 2. Is time-based gain flexible enough to be adapted to evaluate general search engines? Do you think it would need to be calibrated using designated assessors for each topic, like the authors describe on p. 7? How often would it need to be calibrated? 3. On p. 4, the authors describe the data collection process and state that users had to find relevant documents while making "as few mistakes as possible". What do the authors define as a mistake here - does it refer to selecting an article that the user later discovers is not relevant? How would they know if the user did make a mistake? Why does this matter for the effectiveness of the measure?
1. My first question is about variables that would affect decision making in Section 3, Calibration. It says their user model is an idealized individual representing the population as a whole. Plus, they also treat all topics as the same. These are strong assumptions from my point of view. In reality, there are many situations that would violate them. For example, people who are searching for an academic paper tend to have more knowledge about their search. This violates both the idealized-individual representation and the same-topic assumption. Here the authors made the assumption for simplification. I think if we took these factors into consideration, there would be adjustments to the current approach. How can we classify and define users' knowledge differences as well as topic effects? 2. My second question lies in Section 3.2. In this section, the paper compares against a commercial web search engine on the probabilities of users clicking on documents of varying relevance. This comparison is not appropriate in my opinion. The commercial web search engine belongs to vertical search, while the experiment in this paper is general keyword search. Considering all the assumptions that have been made, these two experiments seem to be quite different from each other. It also makes me wonder: suppose we applied the method in this paper to vertical search (or a specific domain) - what would change? One main part I can think of is the difference in the summary. For example, if you search for something on amazon.com, the summary for each item is not pure text; instead there is a photo, user rating, price, etc. These factors will affect the users' decision making a lot. Besides this, what else would make the current method different? 3. My third question is about duplicated documents. Duplicates are referred to in several parts of this paper.
In Section 3.2, Calibration Values, it says the linear fit to the duplicates does not explain any of the variance in the time to judge a duplicate, so they simply treat duplicates as zero-length documents. First, we need to know which documents are treated as duplicates. The paper gives one approach: two documents are treated as duplicates if all shared shingles are identical. Based on this idea for detecting duplicates, is it reasonable to treat duplicates as zero-length documents? What if the main bodies of two documents are identical, but the users are more interested in the differing parts? For example, what if users are more interested in the discussions or comments on the same article from different websites? Another situation: if two documents are detected as duplicates, but the user clicks the lower-ranked one, does this tell us anything about the ranking?
The authors discuss the calibration and validation of time-biased gain against the test collection in the TREC 2005 Robust Track experiments. I am wondering how representative this TREC track is, and whether there are external and internal validity issues with this approach, as the findings might not apply to other test collections. The authors make a preliminary attempt to address this validity issue by pointing out that, in the context of web search, calibration might be based on interaction logs taken from the search engine itself, but without mentioning how to do the calibration in that case. In Section 3, the authors mention, "There are nearly endless variables that could be taken into consideration in the models of decision making for summaries and documents". But in the end, they chose to use user and topic. I am wondering whether there are any validity issues with this approach. For instance, why not take into account the context of search (location and time), which in my opinion is highly relevant to the search query and might change the results significantly? In Section 3.2, the authors mention, "Without eye-tracking we cannot know for certain how much time is spent on each individual summary. Thus, we compute the amount of time per summary for a participant to be the sum of all time spent on the summaries page divided by the maximum rank of a clicked summary". I believe the authors assume that the last summary users look at is the last summary users click. I think that assumption is not valid in many cases. For instance, many users will skim through the entire page and then decide which result to click. I am not sure about others, but for me, I was trained at a very young age to skim through an article and then read the important paragraphs thoroughly. I presume it is a recommended reading habit for many others.
If so, the methodology the authors adopt to replace real eye-tracking might have a non-trivial validity issue, and I am wondering how this can impact the end results.
1. Dunlop's techniques of using HCI methods to predict user performance depend on the user's skills (reading level and speed, comprehension, background knowledge, etc.), and thus this relativity can be questioned "while measuring the number of relevant documents found in a given amount of time" (pg. 3). However, don't you think such methods effectively address the efficiency of a system by evaluating how it responds to and interacts with different types of users? 2. For a study making time-based judgments that depend on time spent on summaries of a document (before clicking on it) and time spent on whole documents, don't you think this study is assuming aspects that actually need to be investigated and measured? For the purpose of this study, isn't it important to distinguish the time spent on summaries of documents that were clicked versus those that were not, and to assess clicks made by reading just the title of a result versus reading the summary? These factors seem crucial to understanding this system from a user-oriented perspective. 3. Can the time-based gain measure effectively assess image/media searches? Media searches are not necessarily supplemented by summaries. In that context, what factors would be considered in estimating time and gain? For media searches, would it still be important to see if the user goes on to look at the source document?
1. Section 1 mentions an implicit assumption of the user model that users view documents at a constant rate. If that assumption does not hold, what happens? Furthermore, from my point of view, "constant rate" is not an accurate term; do you think "the user views documents one by one" makes more sense, since that is what we actually do when browsing search pages? 2. It is interesting to discuss that the variable gk is defined under the assumption of viewing documents at a constant rate. Is that a real assumption the user model was built on, or does this symbol intrinsically carry the factor of time, which was ignored by previous work? 3. Overall, the calibration values came from a user study. What were the characteristics of the participants? Did those characteristics impact the final result? Also, if we need to scale the study with more participants and more data, what is the cost? How scalable is this method?
1. The idea behind the attenuated gain values in ERR is mentioned in the article (p. 2). What are the analogous ideas in RBP and nDCG, since the formulas used to calculate such values are different from that of ERR? 2. This paper introduced the time factor into the model. Are there any other factors that could follow this method? 3. It is supposed that D(t) "decreases monotonically" (p. 2). Is that true in real life? In some cases, if I see something exciting while browsing the results, I am stimulated to go much further than I expected.
1. Right at the outset, the author states that what the paper hopes to establish is that the gain associated with a document varies under a time-based calibration. But how can we accept this premise without taking into consideration the learning curve of the user over time? Also, doesn't the assumption that the discount can be thought of as independent of the document, without weighing the dependencies among the documents populated on a search page, impact this effectiveness metric? And finally, I do see how taking into account the user's browsing model would serve as an advantage - but then again, how does this calibration hope to solve the issue of incompleteness, which continues to be a persistent issue left to tackle? 2. On working with the time-based evaluation metric: how do we hope to account for the bias generated by the formula proposed in the paper, where 'gain is smeared equally across all documents'? How feasible would it be to go ahead with this metric of effectiveness when we have an enormous sample space and would have to parse and update so much log data? The paper also doesn't elucidate how we could approximate an individual's reading time. My predominant concern with a time-based evaluation metric is how we would handle noisy correlations. For instance, if the user is unsure whether a document should be marked relevant or irrelevant, he would still spend 'time' on it while trying to make a decision. And say he finally marks it as relevant. In all such cases, the gain attributed to this section of documents would be exaggerated even though the document in question was just borderline relevant. What provision can we add to incorporate these cases? 3. The paper does deserve appreciation for attempting to provide a comprehensive alternative effectiveness metric while maintaining succinctness in its validation.
I'm curious, however, whether we would ever be able to extend the concept of 'time-based gain' to evaluate IR systems that work with multimedia. The bottleneck here is obvious: every user will continue to view a video until he/she finds something of relevance in it. While this does add to the time component and thereby the gain factor, it does not in any way correspond with the fact that the user viewed the video clip because he/she had already earmarked it as relevant and therefore worth viewing. Since multimedia retrieval is an important field in IR and almost all search engines have shifted to handling heterogeneous searches, how would we transform this concept to see implementation here?
1. Smucker and Clarke present a new effectiveness metric that attempts to incorporate time-based measures into IR evaluation. They try to break away from the assumption that users move through documents at a constant rate, but did they take into account other user variables in their tests? Things such as a user's reading level and education level would affect the rate at which a user could move through a document. 2. The authors mention that longer TREC documents were more likely to be seen as relevant. While more information does make an article more relevant to users, couldn't having to read through a series of longer documents from a search result list degrade the user experience and make users less likely to continue searching the results? Doesn't more work make for a less enjoyable experience for the user? 3. If the authors are trying to make a more complete time-based effectiveness measure, why would they stick to using so many of the TREC guidelines for relevance when so many other researchers have found fault with the TREC approaches?
1. The user (both ideal and non-ideal) is more likely to stop looking for further documents once he finds a relevant document. In most queries, the top few results contain the most relevant documents, and the time spent by users to identify these documents in such a scenario is understandably negligible. So, is it safe to assume that time-biased gain deals with those searches/queries for which there is less probability of finding the relevant result among the top few results? 2. How does the TBG value fare in a non-ideal world? The authors state that they assume a patient and ideal user because of the difficulty of identifying how a topic will affect users' behavior. Would the above-mentioned problem be solved if users were ranked into 3 groups based on expertise on the topic (0 = novice, 1 = beginner, 2 = expert)? How relevant is TBG in actual search engines? 3. How does the parameter 'length' work for non-textual data? In the context of an image or video search, the term length does not make much sense. Is TBG a measure only for searches that retrieve text documents? The measures and formulations seem to hold even for non-textual searches, except for the length factor.
I think Smucker's study provides good insights about the potential of doing time-dependent measurements. Moreover, he shows how relevant this type of measurement is, particularly to web search. However, the study itself is not directly relevant to the user experience because of the comparison to trec_eval; both emphasize recall. This measurement, as previously discussed in class, is not representative of the user model, which prioritizes high precision. 1) One of the things that Smucker did to the collected data was to "filter out" duplicates by setting their length to 0. His justification was that users will immediately return to the result list once they recognize a duplicate result. Doesn't this also show our lack of understanding and the need for more work on the dependency relationships among results? 2) In the interpretation section, Smucker defines the proposed metric without normalization as a lower bound on the number of documents a user is expected to save in time T. In class we discussed that normalization is typically used for averaging different runs. That use seems fine for the metric as it is. In that case, what are other potential uses of normalizing the result?
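To make the "lower bound on documents saved" interpretation concrete, here is a minimal sketch of the unnormalized time-biased gain computation. The structure (summary time, click probabilities, linear judging time, exponential decay) follows the paper's description, but all calibration constants below should be read as illustrative assumptions rather than the paper's exact fitted values.

```python
import math

def tbg(relevances, lengths,
        t_summary=4.4,      # seconds per result summary (assumed)
        a=0.018, b=7.8,     # linear document judging time a*l + b (assumed)
        p_click_rel=0.64,   # P(click | relevant summary) (assumed)
        p_click_non=0.39,   # P(click | non-relevant summary) (assumed)
        p_save_rel=0.77,    # P(save | relevant, clicked) (assumed)
        half_life=224.0):   # decay half-life in seconds (assumed)
    """Unnormalized time-biased gain for a ranked list.

    relevances[k] is 1 if the document at rank k is relevant, else 0;
    lengths[k] is its length in words. The result can be read as a lower
    bound on the expected number of relevant documents a user saves.
    """
    total, t = 0.0, 0.0
    for rel, length in zip(relevances, lengths):
        # Expected gain at this rank: the user must click and then save it.
        gain = (p_click_rel * p_save_rel) if rel else 0.0
        total += gain * math.exp(-t * math.log(2) / half_life)
        # Expected time to move past this rank: read the summary, and
        # read the document only with the rank's click probability.
        p_click = p_click_rel if rel else p_click_non
        t += t_summary + p_click * (a * length + b)
    return total

ranking = [1, 0, 1]          # relevance by rank
lengths = [800, 300, 1200]   # document lengths in words
print(tbg(ranking, lengths))
```

Without normalization, each relevant document contributes at most its click-and-save probability, discounted by how long the user is expected to take to reach it, which is exactly why the sum reads as expected documents saved.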
The authors assume that the time to read result summaries is uniform. I believe the summary read time follows the same pattern as document attention: summaries at the top are read more carefully than ones at the bottom. It is also a common sight that several related URL links (from the same organisation) appear in the list of retrieved documents. In such cases, we simply skip over them just by glancing at the URL domain names. Hence I think TS needs to be more accurate and in accordance with user behaviour. Users definitely take more time judging longer documents than shorter ones. However, there could be situations where both documents provide the same information but the longer one is more comprehensible than the shorter one. The point to note is that judging documents for relevance also depends on the reading capabilities of the reader (the Lexile measure of readers). So the background and capabilities of the volunteers need to be explained in the experiments performed. 'We are forced to treat all topics to be the same', say the authors on pg. 97 (3rd page of the 10-page document). The judging-time model found as al + b in the paper will produce different results depending on topic detail. If the query area is broad, the answers might be found sooner and the relevance judgements made faster. Though not the topic itself, the topic detail (based on the query) will affect T(k) and thereby the discount value.
How does expected gain differ from actual gain? (pg. 8) Do these values have to be normalized? And why even calculate expected gain - why not just use actual gain? On page 6 the authors point out that not all movement is forward and that a participant's clicks can jump around. How does this affect binary relevance? Do algorithms take this into consideration? This paper takes note of how various topics might affect evaluation time, but then goes on to assume that all topics are equal because there is no standard, viable way to rate topic difficulty. In addition, what may be difficult for one user/participant might be easy for another user who is a subject matter expert. How does education level/intelligence affect the way users search? Papers are often judged/ranked for relevance by word count; couldn't topic difficulty and reading time be influenced by word length as well? Couldn't we apply some of those same measures to topic difficulty?
Time calibration was performed under a non-realistic experimental condition. The objective of the users was to mark as many documents as relevant as possible in a given timeframe. Real behavior is vastly different; to this end, it is difficult to see the value of this measure. I understand that the current formulation of the metric is still developmental, with numerous extensions proposed. Where should the focus lie? From the motivation for this measure I would have expected more work on the decay; this was just adopted and not validated. It is difficult to see the applicability of this metric to 'do queries', and even less to 'go queries'.
1. The author has indicated that longer documents have a higher probability of being relevant than shorter ones. But if the algorithm starts returning only long documents, then the user would need to spend more time judging whether each document is actually relevant or not, and his disappointment would be higher if a document turns out not to be relevant. So wouldn't an algorithm returning longer documents rather than shorter ones make for a bad user experience? 2. The author states: "If we used the participant averages, we would overestimate the amount of time it takes our population to reach rank k. This difference also means that most statistics reported by user studies cannot be directly used in metrics like ours." Isn't the whole point of a new metric that we want to be able to evaluate a search engine based on user behavior more precisely? So shouldn't the metric be refined, or perhaps benchmarked, in a manner that allows the statistics being reported to be taken into account? 3. The author states: "Given constraints on time and abilities, not all users will detect when a document is relevant." But isn't the fact that the user was not able to detect the right document a fault of the search engine itself? How can it be determined that the user didn't get what he was looking for?
1. The authors argue that traditional effectiveness measures should be augmented with an additional factor to better represent user behavior and preference. While a more user-accurate performance measure is certainly appealing, do systems ranked with a time-augmented nDCG, as opposed to a normal nDCG, produce different system rankings? Although absolute measures of performance for particular IR systems would change, doesn't the very small difference in 'discriminative power' between a time-biased method and nDCG suggest that relative comparisons between systems would be essentially the same? Does a time-based performance measure change what we think of any particular ranking algorithm? 2. One of the statistics used in the time-biased equation uses a linear model of the time it takes a user to read a document of a given word length. However, the authors admit that document length only explains 12% of the variation in time to read. What are some additional features, besides document length, that might account for how long it takes someone to read a document? 3. Does a generalized time-biased decay function make too many generalizations about what a user finds useful? Would it be possible, using this metric, for an IR system to tailor search results to, say, a user's reading speed or reading comprehension?
1- One of the major factors in this equation is the notion that it takes a user longer to determine the relevance of a longer document. I know from personal experience that isn't necessarily true. If I am trying to determine relevance in a constricted period of time, I scan about the same amount of text in a long document as in a short document. So would it be more accurate to say that the probability that a user identifies a relevant document increases when the document is short, since if a longer document is scanned it is more likely that something is missed? Or is it assumed for the purposes of this formula that each document is given an equally thorough examination? Is that a valid general user experience, or am I the only one who does that? 2- What do people think about the split that users look at summaries about 33% of the time and full documents about 67% of the time? At first I thought this was an odd split because I figured that a user making good judgments should spend more time looking at full documents. But then I wondered what could be done to shift the split the other way. What can be done to make summaries and other 'first look' information more helpful, so that a user can better determine relevance without actually examining the full text? 3- I am actually a little confused about what is being measured. I understand that for a given topic a formula was established to take into account the time it takes users to assess topics. This includes a variable for document length, some constants for reading summaries (based on the total number of summaries), a constant for time to make a decision, and calculations for the probability that a relevant document will be identified and saved. But what do we do with this information? Are we trying to determine what use a user got out of spending different amounts of time with one algorithm? Are we using it to compare algorithms?
I think how long a user has to spend to find what they need is an important factor in measuring the effectiveness of a system, but ultimately isn't that more dependent on what the user is doing than on what the search engine is producing? When we get the final time-based gain of a system, what do we now know?
1) Regarding the computation of parse times for single summaries, the authors note that we cannot know how much time is spent on an individual summary, since they are all displayed on a page at once. Instead, an average based on the maximum rank of the clicked summary is used. In general, I thought the time-based approach in this paper added great value to the accuracy of evaluation, but wouldn't taking averages in this manner potentially yield misleading results if a single summary took a long time to read/parse/skim? Instead, each summary could be labeled, and each set of search results could be cross-referenced against other sets to determine intersections of summaries, consequently weeding out summaries that take abnormally long to parse. 2) Another question regarding the computation of T(k) (which I found to be the most interesting part of their time-biased evaluation measure) involves using a weighted average of time to judge summaries. First, they weight the time of users who need to look at more summaries more highly than that of users who find their results closer to the top of the page. Second, they group together the time on a summary page across all views of the summary page in a session, which effectively increases time per summary if results do not immediately provide a relevant result (refer to equation 5). Since the authors do not go into detail, what benefits does this approach have over simply taking averages across all users? 3) When applied in the context of a system-based evaluation approach, the technique in this paper seems to suffer from the added burden of training/calibration effort through user interaction. However, the numeric parameters associated with equation 2 (specifically the decay function and the "time to relevance") could potentially be determined by exploring existing data sets of user behavior as well as simulation. What challenges are there in leveraging existing data for this purpose?
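As a concrete contrast for question 2, here is a minimal sketch of the two averaging schemes on hypothetical session data; the numbers are invented for illustration only.

```python
def simple_average(times, max_ranks):
    """Average of each participant's own per-summary time (equal weight per user)."""
    per_user = [t / r for t, r in zip(times, max_ranks)]
    return sum(per_user) / len(per_user)

def weighted_average(times, max_ranks):
    """Population per-summary time: total time over total summaries viewed.

    Users who scan deeper (larger maximum clicked rank) contribute more
    weight, which is the behavior the question asks about.
    """
    return sum(times) / sum(max_ranks)

# Hypothetical sessions: total seconds on summary pages, max clicked rank.
times = [30.0, 10.0]
max_ranks = [10, 1]
print(simple_average(times, max_ranks))             # (3.0 + 10.0) / 2 = 6.5
print(round(weighted_average(times, max_ranks), 2)) # 40.0 / 11 -> 3.64
```

The gap between the two (6.5 vs. about 3.6 seconds here) illustrates the overestimation the authors warn about: the simple average lets a shallow, slow session dominate the per-summary estimate.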
1. In this article the authors propose a new metric for evaluating the effectiveness of a search engine: the time-based gain metric. They employed this metric on the 2005 TREC Robust Track. Is it safe to trust the validity of this metric when the authors only tested it on one specific track of TREC? How would applying this metric to other TREC tracks affect the strength of its effectiveness rating? 2. The authors state that one of the assumptions of this metric, similar to other metrics used by TREC, is that the user goes down a list of results in a consecutive manner, from the top of the list to the bottom. However, in modern search engines it is not necessary for the user to view the results consecutively; users may sometimes view a lower document on the results list and then return to a higher document that they now think is better. How would you go about changing this metric to account for this type of behavior? Is it necessary to do so, or is the current model good enough? 3. The authors discuss the idea that one of the goals of evaluation research is to use system-oriented tests to help understand user behavior. They argue that their metric is an effective method of doing this because it incorporates time as the cost measure, which users have said is very important in determining whether a search engine is effective. Do you agree with their argument that this measure is an effective method of combining system-based tests with user-based research? Do you think that system-based tests will ever provide a complete model of user behavior?