Saturday, August 24, 2013

Croft, Metzler, Strohman Ch. 8

Bruce Croft, Don Metzler, and Trevor Strohman. 2009. Search Engines: Information Retrieval in Practice. Ch. 8: Evaluating Search Engines. 


  1. 1. This article mainly discusses different aspects of and methods for evaluating search engines. However, the methods described here seem aimed at evaluating general-purpose search engines, whose intent is to provide users with the most relevant information. There are other kinds of search engines, such as vertical search and entity search, whose aims differ somewhat from those of "general" search engines. For example, Amazon's website has its own search engine: when someone searches within the site, they want results that are not only relevant but also highly ranked and popular, while from the merchants' viewpoint, the results should bring in more revenue. So my question here is: what other factors should we incorporate when dealing with these different kinds of search engines?

    2. I am interested in Section 8.3, Logging. This section discusses how we can use query log information to improve or evaluate a search engine. I have a random thought about using logging to optimize a search engine. Suppose I am a competitor of Google and want to make Google's search results worse (which we should not do), and suppose I know Google uses its logs to optimize search results. If I had millions of machines, each randomly issuing queries to Google and automatically clicking the last result for every query, would this make Google's results worse after a period of time?

    3. My third question lies in Section 8.4.3, Focusing on the Top Documents. Each time I use a search engine I have the same question: since most of the time we only care about the top results, is it necessary to rank all of the returned results? For example, when we search for "sports" on Google, there are over 2,660,000,000 results in total. Is it necessary for Google's servers to score the 2,659,999,999th and 2,660,000,000th results and put them in order, given that we only care about the top results? Are there truncation strategies that save the time of ranking all of them (for example, ranking the top 1,000 in order and simply listing the rest)?
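    A minimal sketch of one such truncation strategy: keep only the top k of n scored documents with a bounded heap, which costs O(n log k) instead of the O(n log n) of a full sort (the document IDs and scores below are invented):

```python
import heapq

def top_k(scored_docs, k):
    # heapq.nlargest maintains a k-element heap while scanning all n items,
    # so the full candidate list is never sorted.
    return heapq.nlargest(k, scored_docs, key=lambda d: d[1])

# Hypothetical (doc_id, score) pairs in arbitrary order.
docs = [("d%d" % i, (i * 37) % 101) for i in range(10_000)]
best = top_k(docs, 10)  # only these ten ever need a total order
```

    Real engines push this further with early-termination techniques (stopping the scoring of a document once it provably cannot enter the top k), but the heap alone already avoids ordering the tail of the list.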

  2. This comment has been removed by the author.

  3. 1. The authors state that clickthrough rates have been shown to correlate with relevance judgments (p. 11). Are you convinced by this claim based on the evidence presented? What studies could be conducted to reinforce this claim?

    2. In discussing experiments for question answering (p. 22), the authors state that these questions only have one relevant document. In what sense is this true in a non-experimental setting (such as a Google search)? Are there limitations to stating that there is only one relevant document for question answering?

    3. What measure of improvement do you think is necessary to implement a change? If a large improvement is made do you risk losing users because the system is too different from what they're accustomed to using?

  4. 1. How does Click Deviation fare as a measure for preference generation? Let us assume that a user clicks every retrieved document and spends exactly 20 seconds on each before moving to the next one, in the given order. How well do Click Deviation and Click Distribution work as measures here, given the high probability that the user did not find what he wanted even after clicking on all the documents to check them?

    2. What is the accepted level of significance (α) when comparing two different ranking algorithms? The chapter mentions that the alpha level is small (ranging from .05 to .01), but it would add perspective to know what level of significance is actually used when two algorithms are compared.

    3. What is the rationale for using the t-test and the Wilcoxon Signed-Rank Test for statistical significance when other significance tests, such as the Chi-Square Test and the F-Test, exist? Is there a specific reason for choosing the Student's t-test and the Wilcoxon Signed-Rank Test?
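    As a concrete sketch of how two of the paired tests from the chapter operate, here are the t statistic and the sign-test counts computed over per-query effectiveness scores; the average-precision values below are invented for illustration:

```python
import math
import statistics

def paired_t_statistic(a_scores, b_scores):
    # t = mean(d) / (stdev(d) / sqrt(n)), where d is the per-query difference.
    diffs = [b - a for a, b in zip(a_scores, b_scores)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def sign_test_counts(a_scores, b_scores):
    # The sign test discards magnitudes entirely: it only counts which
    # system wins on each query (exact ties are ignored).
    wins = sum(b > a for a, b in zip(a_scores, b_scores))
    losses = sum(b < a for a, b in zip(a_scores, b_scores))
    return wins, losses

# Hypothetical per-query average precision for ranking algorithms A and B.
ap_a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
ap_b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]
t = paired_t_statistic(ap_a, ap_b)  # compare against a t table at the chosen alpha
wins, losses = sign_test_counts(ap_a, ap_b)
```

    One plausible part of the answer: the chi-square and F tests address different questions (independence of categorical counts, equality of variances), whereas paired location tests like the two above directly ask whether one system's per-query scores are shifted relative to the other's.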

    4. How do pooling and query logs work when there is a shift in the meaning of a particular query? For example, until the 1990s a query for "Apple" probably referred to the fruit, but since then it might target the fruit, the company, or one of the company's products. How does pooling work when there is a 'transition' in the meaning of a query, given that the same query can lead to different sets of relevant documents at different points in time?

  5. 1. In the section about 'Query Logging', it is mentioned that the user behind a particular search can potentially be identified only through the session ID and the time of logging. But how effective can this be? Is it possible to at least approximately map a session to a user when a single user does not always use just one instance of the search engine at a time? There are additional difficulties, such as multiple users issuing the same search at the same time and, most importantly, missing or out-of-order transaction logs when the user navigates between pages faster than the logging mechanism can record. This could result in incorrectly mapping queries to results, leading to incorrect relevance judgments and hence incorrect rankings. The query log is a vital source for evaluating a search engine's algorithm, but how effectively it can be used in analytics is a question to ponder.

    2. When the notion of relevance is itself a matter of debate in the world of information retrieval, how can one say that classifying a document as non-relevant is always sound? Classification can produce faulty results when used imprecisely. Is it based on the relevance of the document, or on its usefulness to the user? The crux of the problem lies in identifying the degree of relevance or usefulness. So how desirable is the use of classification algorithms in evaluating search engines?

    3. Multiple effectiveness measures and different tests are mentioned for evaluating a search engine, but deciding which test is most useful for which problem requires a trained model for accurate results. Using an SVM would provide more reliable results for a query, provided the feature extraction and training set have been properly devised. So what methods could be employed to decide how balanced the feature set is for a ranking SVM, or any classifier in general, so as to get optimal accuracy and query throughput?

    4. The main disadvantages of using NDCG, in my opinion, are identifying an ideal DCG (which may not be available at all times unless relevance feedback is provided) and its inability to distinguish among equally good results when rating ranking positions. Will this not be a problem when ranking the top results? So in which cases is NDCG useful for evaluating a search engine?
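    To make the ideal-DCG dependence concrete, here is a sketch of NDCG using the log2 discount from the chapter; the graded judgments are invented, and the ideal ranking is simply the same gains sorted in decreasing order:

```python
import math

def dcg(gains):
    # DCG = rel_1 + sum over i >= 2 of rel_i / log2(i), with ranks from 1.
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    # Normalize by the ideal DCG: the best possible ordering of the
    # same relevance judgments.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (0-3) for a ranked list of six documents.
ranking = [3, 2, 3, 0, 1, 2]
score = ndcg(ranking)
```

    Note that any ranking that sorts the gains perfectly gets NDCG = 1 no matter how ties among equally graded documents are broken, which is exactly the indistinguishability among good results described above.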

  6. In the ‘Why Evaluate?’ section, an attempt is made to define the effectiveness of a search engine. It says “For a given query and specific definition of relevance, we can more precisely define effectiveness as a …”. Given that relevance can be user- or topic-based, I don’t understand how a ‘specific’ (mathematical or notational) definition can be formed. How would it be denoted? If a notation defining relevance were possible, why would we need human graders to rate different algorithms? We could define the relevance formula for a query and then program a computer to rate it. So, is it even possible to define relevance in a practical manner?

    The different test collections or evaluation corpora, like CACM, AP, and GOV2, though from different parts of the world, are all in English. So how do search engine companies evaluate their performance in different languages? Is there a standardised evaluation corpus for every language? Also, since a search engine deals with far more than displaying static information (dynamic data like recent developments), how is an algorithm's performance over these ‘changing’ areas measured?

    In the section on basic guidelines for creating an evaluation corpus, the following appears: “For most applications, it is generally easier for people to decide between at least three levels of relevance, which are definitely relevant, definitely not relevant, and possibly relevant. These can be converted into binary judgments by assigning the possibly relevant to either one of the other levels”. For many queries there will be a huge number of documents that could be categorised as partially relevant, and without proper justification, converting these into binary judgments introduces bias. Under what conditions can this conversion be sound? With DCG we could avoid this problem, but is DCG a better measure for all applications?

  7. 1. Queries are an important component of a test collection. Driven by different demands, cultural backgrounds, educational levels, and personal preferences, queries cover various areas; queries with similar meanings can be expressed in different ways, and queries without similar meanings might still be connected to each other (e.g., football and sports). Including such queries in a test collection will cause redundancy in the evaluation and affect the statistics. How do these test collections maintain a broad sample of different areas while preserving the independence of queries from one another?

    2. The chapter discusses effectiveness and efficiency separately, but is there any intrinsic connection between the two? For instance, a high volume of returned results is likely to increase recall and decrease precision. If so, why is efficiency not considered in search engine effectiveness evaluation?

    3. The chapter gives a nice overview of metrics and methodologies for search engine evaluation. One thing I’m interested in that is not mentioned is whether there are studies on query-specific search engine evaluation. Intuitively, topics from different areas might be stored differently based on personal preference or the intrinsic nature of the data in that area. A made-up example: movies might be stored with tags for the names of directors and actors, and some search engines might be extremely good on queries related to these topics. Is there any study of query-specific evaluation of search engines, and is such evaluation necessary and valuable?

  8. 1. The author mentions analyzing query logs, which can contain a wide array of information, from the query itself to timestamps to click data. The main thing to note is that the authors are careful to state that this data correlates with relevance but is not used in place of relevance judgments, the biggest reason being users’ bias toward the top displayed results. We also mentioned in class the ambiguity associated with this type of data. The more information captured in these logs, the more storage is required, as well as computational power to sort through them. Depending on the application, is sorting through this data always worth it?

    2. The author readdresses the issue of user bias toward the top-ranked search results. Most of the evaluation techniques in the other papers seem to focus more on recall than precision; for instance, the TREC experiments were concerned with how many relevant documents were returned regardless of their ranking. Given the human psychology involved, the information retrieval field does need to address this user bias. The author goes on to explain different methods for measuring how well a system returns relevant results at top ranks. One such method is DCG, whose denominator has no justification other than that it produces a smooth curve. Since user satisfaction can affect return use of a system, is there a benefit to finding a better-justified denominator for this formula, or is DCG only used for comparisons between systems, so that holding the denominator constant is all that is needed?

    3. A common theme in all the readings, as well as our first in-class discussion, is that evaluating an information retrieval system is a subjective process even with formal procedures and mathematical formulas in place. The authors explain the process of using significance tests to compare two information retrieval systems. As with our in-class discussion of recall and precision, it is possible for a significance test to incorrectly reject or incorrectly accept the null hypothesis. In the steps of the test, a significance level is used as the bound for rejecting the null hypothesis. Although there is a mathematical foundation, the results of the test can be skewed depending on the significance level chosen. The authors list some common values used to reduce the test's chances of rejecting the null hypothesis when it should be accepted. Given that the significance level affects how often these errors occur, it seems there is a real chance for a tester to skew the results for personal gain, or for an inexperienced tester to accidentally skew them.

  9. 1) The “Skip Above and Skip Next” strategy assumes that a clicked result is more relevant than the results ranked above it and the one immediately below it. However, we know that users mostly scan the results from top to bottom (higher to lower ranks). Hence, wouldn't it be better to assume that a plain “Skip Above” strategy would yield more reliable preferences than the “Skip Above and Skip Next” strategy?
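    For concreteness, a sketch of how the plain Skip Above strategy would turn one result list into preference pairs (the document IDs and clicks are hypothetical):

```python
def skip_above(ranked_docs, clicked):
    # Skip Above: a clicked document is preferred over every unclicked
    # document that was ranked above it (the user saw and skipped those).
    prefs = []
    for i, doc in enumerate(ranked_docs):
        if doc in clicked:
            prefs.extend((doc, skipped)
                         for skipped in ranked_docs[:i]
                         if skipped not in clicked)
    return prefs

# User clicked d3 and d5 in this hypothetical result list.
results = ["d1", "d2", "d3", "d4", "d5"]
pairs = skip_above(results, clicked={"d3", "d5"})
# Each pair (x, y) asserts "x is preferred to y".
```

    Adding the Skip Next rule would also emit ("d3", "d4") here, the extra assumption the question challenges, since the user may never have scanned past d3 before clicking it.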

    2) Other than being consistent with the idea that precision decreases as recall increases, are there any other reasons why the standard interpolation formula in IR is P(R) = max{P′ : R′ ≥ R ∧ (R′, P′) ∈ S}, where S is the set of observed (recall, precision) points? Wouldn't it be more accurate to fit the points with a regression method such as least squares?
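    The max rule is a one-liner to implement; here is a sketch over invented (recall, precision) points. Unlike a least-squares fit, it guarantees by construction that the interpolated curve never increases with recall:

```python
def interpolated_precision(points, r):
    # P(r) = max precision over all observed points with recall >= r.
    candidates = [p for recall, p in points if recall >= r]
    return max(candidates) if candidates else 0.0

# Hypothetical (recall, precision) points S from one ranking.
S = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
curve = [interpolated_precision(S, r / 10) for r in range(11)]
# The interpolated curve is non-increasing even though the raw
# precision values bounce back up near recall 1.0.
```

    A regression line could cross above observed precision values or slope upward, so it would lose the "best precision achievable at this recall or better" reading that the max rule gives.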

    3) What is the logic, justification, or mathematical explanation for why the Wilcoxon and sign tests are acceptable (but not ideal) ways to do significance testing?

  10. 1. I am interested in the level of human interaction with these algorithms in the process of evaluating search engines. First, do human raters (like the Google raters we were discussing in the first class) do tests other than matched-pair experiments, as the authors of this article define them: “[When] the rankings that are compared are based on the same set of queries for both retrieval algorithms” (p. 29)? And second, in the article's closing summary, the authors describe programmers creating a graph to show the improvements made to the search engine and, to further illustrate the significance, setting “a threshold on the level of improvement that constitutes ‘noticeable’” (p. 37). How do researchers/programmers determine when improvements are not simply evident in testing data but also noticeable to people?

    2. The authors seem to show concern in section 8.4.4 “Using Preferences” that there is “no standard effectiveness measure based on preferences” and that “no studies are available that show that this effectiveness measure is useful for comparing systems”(p. 25). This lack of exploration does seem surprising, especially considering the authors’ earlier explanation of preferences in the section on query logs in which they claim that preferences are useful for removing bias in the examination of tasks with multiple levels of relevance. Are preferences becoming a more common area of research since the publication of this article, or is this an area that still needs much more exploration?

    3. In the section dealing with interpolated recall-precision, the authors state: “Because search engines are imperfect and almost always retrieve some non-relevant documents, precision tends to decrease with increasing recall...precision values always go down (or stay the same) with increasing recall” (p. 20). Does this imply that a search engine with a smaller collection of documents to search through would have a skewed average precision, appearing more precise than a huge search engine such as Google, which may have higher recall?
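    The quoted behaviour is easy to reproduce; here is a sketch that computes a (recall, precision) point after each retrieved document, which also makes explicit that recall is measured against the collection's total count of relevant documents (the ranking and judgments are invented):

```python
def recall_precision_points(ranking, relevant):
    # After each rank i: recall = hits / |relevant|, precision = hits / i.
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

# A collection with exactly four relevant documents for this query.
rel = {"d1", "d3", "d4", "d9"}
pts = recall_precision_points(["d1", "d2", "d3", "d4", "d7", "d9"], rel)
```

    Since len(relevant) is the denominator of recall, a small collection with few relevant documents reaches recall 1.0 quickly, which is one way comparisons of averaged precision across very different collection sizes can be skewed.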

  11. 1. In the Croft reading a distinction is drawn between studies that work with pre-annotated IR evaluation datasets like TREC and studies which use query log information across much larger document sets. Since true recall scores cannot feasibly be calculated when working with all documents on the Internet, researchers in the latter category have relied on precision at rank p as a replacement for the true effectiveness measure. While this measure has a certain pragmatic appeal, doesn’t it assume that relevant documents have already been ranked highly? What if there are very important documents that are either not ranked highly or not being returned at all?

    2. Croft et al. mention that while annotators don’t always agree on every single document, disagreement is not severe enough to cause major variations in error rates. Nevertheless, judging from the very low assessor relevance agreement in the Voorhees reading (Voorhees pg 10), it seems likely to me that the quality of documents returned is quite different. Given that relevance decisions are likely to depend heavily on certain factors inherent to the query (i.e. level of vagueness in the query), how can systems overcome this variance? While there is an obvious appeal to focusing on cut-and-dried queries for IR evaluation, how might systems prejudge the soundness/clarity of a particular query?

    3. Throughout the chapter the authors mention the importance of finding appropriate sample size and maintaining strict separation of train and test datasets. Yet the population of Internet documents seems extraordinarily diverse. How have researchers ensured that certain, perhaps more obscure, document formats aren’t neglected in IR system training and evaluation?

  12. In Section 8.5, the authors describe various efficiency metrics, and two that caught my eye were index size (index storage) and indexing temporary space (temporary storage used while indexing). I find these metrics rather arbitrary because temporary storage often comes from shared storage pools or volume groups, and the large enterprise systems that would house these indexes almost always have redundant primary and temporary storage. System admins often know how much storage and temporary storage they have allotted, but rarely have accurate performance data on shared/temporary storage, especially when these pools draw from different storage pools or system clusters, and the temporary storage can even span hardware. Knowing this, my questions are:

    1) Are these metrics tested on dedicated smaller systems, or run through larger test systems? Is there a graduated set of systems these new algorithms get run through?

    2) As faster and more efficient hardware, firmware, and software are created as the basis of these systems, are old algorithm tests ever re-run on new systems to ensure that hardware/software isn't affecting these efficiency metrics?

    3) Are there 'classic' tests that get run on every system, such as those using the GOV2 or AP collections? Is there a baseline, industry-wide set of algorithm benchmarks to compare against, or are these standards created specifically by each search engine/company?

  13. When creating a new test collection using “pooling” and later running a new algorithm against it, problems can arise in determining relevance. How do the studies involving TREC data decide whether the judgments are complete enough, given that a new algorithm might retrieve genuinely relevant documents that fall under the non-relevant umbrella simply because they were never judged? Wouldn’t the newly found documents benefit from being folded into the relevance judgments going forward?

    Query logs offer a useful tool for tracking user data and optimizing searching for a particular user. It almost seems as though the more data you feed into a search engine and query logs are built up, the search will funnel you toward a specific group of results on a regular basis. How would a search engine deal with a dynamic search pattern or even a significant shift in searching for specific documents whether they be about a group of similar subjects or vastly different subjects?

    Clickthrough data raises much the same problem I see with query logs: it forces the issue of relevance based on user clicks and preferences between two documents. By creating this hierarchy of rankings, will future search results reflect the clicks and relevance of the initial search until new clickthrough data is added to the pool of overall user data? Do preferences continue to rearrange themselves, or do they inevitably settle into a fixed set of search results until the user generates enough differences for the list to, in effect, reset itself?

  14. In the article, relevance judgments vary depending on the person making them, and even for the same person the judgment might differ depending on mood, context, and time. To counter this, the authors argue that it is relatively easy for people to decide among three levels of relevance, namely definitely relevant, definitely not relevant, and possibly relevant. I am wondering whether this really mitigates the validity threat. Might using more people for the judgments be a better solution?

    The authors mention that user preferences can be inferred from query logs. I believe that, in sufficient quantity, user preferences are a better alternative to relevance judgments from domain experts, as they are unbiased and objective. But from the article, it looks like there are not many studies in the literature measuring the effectiveness of this approach (even the established Kendall tau coefficient just seems reasonable, without empirical evidence to support it). I am wondering whether there are any related technical issues.
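    For reference, Kendall's tau over two rankings of the same items is (concordant pairs − discordant pairs) divided by the total number of pairs; a brute-force O(n²) sketch with made-up rankings:

```python
def kendall_tau(rank_a, rank_b):
    # A pair of items is concordant if both rankings order it the same way.
    pos_a = {d: i for i, d in enumerate(rank_a)}
    pos_b = {d: i for i, d in enumerate(rank_b)}
    items, concordant, discordant = list(rank_a), 0, 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two hypothetical rankings of five documents, differing by two adjacent swaps.
tau = kendall_tau(["d1", "d2", "d3", "d4", "d5"],
                  ["d2", "d1", "d3", "d5", "d4"])
```

    Tau ranges from 1 (identical orderings) down to -1 (reversed), so a high threshold such as 0.9 is sometimes used to treat two rankings as effectively the same; whether that cutoff matches what users perceive is precisely the kind of empirical question raised above.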

    In the article, there is a subsection dedicated to 'Setting Parameter Values' for search engines. While it is true that the values of these parameters can have a major impact on retrieval effectiveness, I think little study has been done on what 'client-side' parameters should exist to make retrieval more user-relevant (for instance, the spell auto-check option should be a client-side parameter instead of being forced upon users by Google).

  15. As mentioned in the paper, the information in query logs is likely to be less precise than explicit relevance judgments, so what approaches can we use to overcome this drawback when analyzing query logs?

    The author points out that dwell time and search exit action are two of the best predictors. But there is a problem: how can we be sure that users are actually reading the clicked result during the page dwell time recorded in the logs? Isn't it possible that some people aren't interested in the clicked result but don't return to the results page because, say, someone is talking to them at the time?

    Regarding the null hypothesis, my copy of the book says that values for α are small, typically 0.05 and 0.1, to reduce the chance of a Type I error. Is the number 0.1 a misprint? I remember the typical values for α being 0.05 and 0.01 in statistics. In addition, I think it is very difficult to meet the prerequisites of the t-test in information retrieval studies, so if we want to conduct a significance test but can hardly meet all of these prerequisites, should we use a randomization test or still use the t-test? Could you also introduce the randomization test, or other test methods that can be employed in information retrieval studies?
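    A sketch of the randomization (permutation) test for the paired case: under the null hypothesis each per-query difference is equally likely to carry either sign, so we flip signs at random and estimate how often the permuted mean difference is at least as extreme as the observed one. The per-query scores below are invented; with only ten queries the full 2^10 sign assignments could even be enumerated exactly.

```python
import random

def randomization_test(a_scores, b_scores, trials=10_000, seed=0):
    # Under the null hypothesis the A/B labels are exchangeable per query,
    # so each per-query difference may be flipped in sign with probability 1/2.
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(a_scores, b_scores)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            extreme += 1
    return extreme / trials  # estimated two-sided p-value

# Hypothetical per-query scores for systems A and B.
a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]
p_value = randomization_test(a, b)
```

    This also shows why the test is called expensive: the exact version enumerates all 2^n sign assignments, and even the sampled version recomputes the statistic tens of thousands of times, versus one closed-form computation for the t-test; in exchange it makes no normality assumption about the score differences.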

  16. 1. If we think of an IR evaluation metric that is focussed primarily on queries - high precision queries would seem the best bet towards elevating the system's performance. However, is working towards escalating the IR system's performance the most critical aspect of a query? Or, should we instead work towards introducing queries in order to help us gain more intuition on understanding and evaluating how the IR system works implying that we will now be tending towards high diversity queries instead of high precision? Additionally, how would we be able to execute this kind of trade off wherein we do not penalize the addition of queries which give us a sense of the test collection's scope through increment in breadth but also manage to reward the addition of queries which augment the relevance?

    2. As elucidated on in the paper, dealing with a single test collection automatically places us in a situation of compromise as providing optimal results for every user keeping personalization of the IR system as prerogative would be rather tedious. Given that, we intend to perform a thorough analysis of the implicit user characteristics and work comprehensively with panel data the obvious bottleneck to this hypothesis is the lack of training data. How would we be able to generate this training data? And further, would it be sufficient for us to continue to plough through using testing data as our premise and undermining the effect of overfitting in this situation as ultimately the idea is to maximize personalization?

    3. Given the mammoth sample space and data volume that we would be dealing with in order to construct evaluation metrics for IR systems - isn't it important for us to consider the effect that adversarial web search may have on the robustness of the learning algorithms? Through my project which deals with Spam Rankings, I've witnessed an exorbitant amount of spam which originates due to botnet vulnerabilities. Isn't it a logical extension that malicious clicks and fraud would also be negatively impacting our appraisal of IR systems? Would it be immature to neglect this issue? Or, are there any methodologies we can implement to deal with this exploitation?

  17. The Cranfield experiments are constantly referred to as the model for evaluating search performance. Having taken place in the 1960s, have there been no new methods that have proven more relevant given the changes in IR since then?

    The GOV2 collection, while an improvement over the AP collection that preceded it, by today's standards is a small and limited collection. Other than the cost of judging the relevance of websites, why has there not been (from what I can gather) a push for a larger scale collection to match the growing demands users place on search engines?

    The clips of Google employees explaining some of their process seems to clash a little with the strict experimental and technical nature of this paper. When we live in a world where people are constantly interacting with computers and information, doesn't it make more sense to evaluate an IR system using larger samples of the user base that the engine serves? Or is it too costly to perform such large scale tests?

  18. 1. Section 8.1 discusses the relationship among effectiveness, efficiency, and cost. Generally speaking we agree with this point, but we are not fond of the term “determined” here. It is plausible to determine cost from the other two factors, but it is very difficult to say how an effectiveness value could be derived from efficiency and cost. Is there any model or method to support such a calculation?

    2. Section 8.2 discusses the pooling technique. It says that the top k results from the different search engines are presented in some random order to the people doing the relevance judgments. There are many ways to randomize the order; which one is better, and is there any criterion for selecting a method? How do the different orders impact the final results?
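    A sketch of the pooling step itself: take the union of each system's top k, de-duplicate, and shuffle so that assessors cannot infer any system's ordering (the run contents are invented):

```python
import random

def build_pool(runs, k, seed=0):
    # Union of the top-k documents from each system's ranked run.
    pool = set()
    for run in runs:
        pool.update(run[:k])
    pooled = sorted(pool)                # deterministic order first...
    random.Random(seed).shuffle(pooled)  # ...then one random presentation order
    return pooled

# Top results from three hypothetical systems.
runs = [["d1", "d2", "d3", "d4"],
        ["d3", "d5", "d1", "d6"],
        ["d7", "d2", "d8", "d9"]]
pool = build_pool(runs, k=3)
```

    Because the pool is a set, the order in which systems contribute cannot change its contents; the randomization only affects the order assessors see the documents in, which is exactly the variable the question asks about.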

    3. The final paragraph of 8.2 discusses implicit relevance judgments derived from user actions. To what extent do those actions impact the results? Is there any measure to evaluate such impact?

    4. At the beginning of Section 8.3, it is mentioned that query logs are not as precise as explicit relevance judgments. Which factors affect their precision?

  19. 1. The author mentions a 'randomization' test, noting that it is more powerful than the t-test because it is nonparametric, but that it is expensive to compute. Since it seems to be the 'gold standard' statistical measure (being used to validate the t-test results), I would like to know more about it, since no further explanation is provided in the chapter. More generally, why is it expensive to compute, and what advantages can it offer?

    2. In the efficiency section, it seems like the prevalent approaches are to optimize for time and volume (throughput, latency). The issues of storage and space are only mentioned towards the end, with index inversion and query caching as examples. I would like to know more about how far it's acceptable (in current scenarios) to exploit the time-space tradeoff in IR. In database research, query caching is equivalent to 'materialized views', and the problem of deciding which set of views to materialize optimally is NP-hard. Given the cheapness of storage, how do we compare two systems (on efficiency) if one takes much less time for a specific set of user queries but at the expense of a lot more space? Can we come up with a viable cost model for such scenarios?

    3. It would be useful to classify the utility of different metrics (MAP, NDCG etc.) based on what search task is getting conducted and which sets of users are getting targeted. For example, which metric would be most preferable for the legal domain? A taxonomy or rule/tree based classification would really help to place all these different evaluation metrics in a practical context of how useful each of these really are.

  20. 1. Looking at Fig. 8.1, the TREC topic: for a given “TITLE” value, is it possible for the title to have multiple “DESC” fields? For example, when we type “Washington” into Google, it may refer to the state, the city, or the person.
    2. “Top k results … from the rankings obtained by different search engines…” (p. 7, paragraph 1). It is not clear whether each search engine provides its top k results or a total of k entries is obtained. If it is the latter, how are the k slots allotted among the several search engines?
    3. Throughout Section 8.3, the term “preference” derived from the logging data is discussed. What is the relationship between preference and relevance? By default, preference is measured on multiple grades, while relevance is a binary value. How do we match or map between them?

  21. 1. A clicked on and / or printed out document does not (even implicitly) illustrate the relevance of the document unless a complete trail is followed, specifically if relevance is based on binary judgment. A user may have subsequently trashed it, stored it in a folder titled ‘not important’, or stored it as ‘important’. So would you rate user actions / interactions as good measures of relevance (pg. 9)? And until what point in any interaction should a user be monitored for it to be considered as an indicator of relevance?

    2. Do you agree with the Skip Above and Skip Next strategy of generating preferences that an unclicked result immediately following a clicked result is less relevant than a clicked result? I agree that unclicked results above a clicked result may be less relevant, in that the user has chosen to ignore it / not click on it. But how can relevance judgments be made about results that a user may not have even navigated to?

    3. After reading this chapter I’m wondering what ranking system Google’s “I’m Feeling Lucky” used. How was relevance defined in that context, and how was the ranking of that system different from a regular Google search? I bet many Google users who used that feature would go back and look at the results of a regular search as validation. What does that say about the relevance of the “I’m Feeling Lucky” result?

  22. An effective retrieval method is expected to present diverse results apart from being highly relevant; by diverse I mean context-insensitive: news articles, videos, images, informational and navigational pages. The metrics introduced in this chapter do not motivate this; are there other metrics that do?

    The NDCG metric measures the ordering of relevance levels in a retrieved list, which is great if relevance judgments are independent. An extension to graded relevance is to record pairwise relevance (inducing dependence), which can enable discerning highly relevant documents and even induce a local ordering of documents. Are there metrics that can incentivize such behavior (even partial ordering)?
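
    For reference, a minimal sketch of the standard DCG/NDCG computation with graded, independent judgments, which is exactly the assumption being questioned here (the grade list is made up):

```python
# Minimal DCG/NDCG sketch: graded relevance, log2 position discounts.
import math

def dcg(grades):
    # first result is undiscounted; position i >= 2 is discounted by log2(i)
    return grades[0] + sum(g / math.log2(i)
                           for i, g in enumerate(grades[1:], start=2))

def ndcg(grades):
    ideal = sorted(grades, reverse=True)   # perfect ordering of same grades
    return dcg(grades) / dcg(ideal) if dcg(ideal) > 0 else 0.0

print(round(ndcg([3, 2, 3, 0, 1]), 3))  # ≈ 0.944
```

    Each grade contributes on its own here; nothing in the formula can express “d1 is relevant only if d2 was not already shown,” which is the kind of dependence pairwise judgments would capture.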

    With the availability of past evaluation sets, I don’t see the need for cross-validation to better estimate parameters. Training on previous sets and evaluating on the current test set seems like a viable option; is this practiced? Wouldn’t this help the methods generalize better and help alleviate query bias?

  23. In this chapter they discuss how the use of strategies that generate user preferences can help address the bias in clickthrough data. Is there a way that these or other similar strategies could be used to address other biases like the learning curve bias that was found in the TREC HARD track and the TREC ciQA Task?

    This chapter discusses how the private information contained in query logs should be protected. The authors also state that anonymizing this data may reduce its usefulness. However, given the sheer amount of data collected through these logs, does the trade-off between privacy and usefulness matter, since it would be very difficult for a researcher or a programmer to use this data in a harmful way?

    In this chapter the format of a topic is discussed thoroughly. The authors point out that each topic for a TREC evaluation consists of a short title, a longer description, and a lengthy narrative that describes what is relevant to the topic. They also state that most TREC studies focus on the title field for the query. Should these studies focus on the title field for their queries, or should they use the longer description field, which may produce better and more focused results?

  24. 1) The article mentions pooling as a method of gathering some large subset of documents to be assessed. An important part of the Cranfield model is that it deals with a set of documents that are all assessed and ranked. If some documents are missed and thus go unjudged, how will that affect the systems being tested, and what are some strategies to account for these unjudged documents?

    2) The sign test evaluates the significance of the difference between two algorithms, to help determine whether the new or modified algorithm is better than the original by a large enough degree that it is worth adopting. The sign test specifically ignores the magnitude of the difference in accuracy or usefulness between the two systems, concerning itself only with the number of times one system performed better than the other. This is helpful in preventing the adoption of systems that are better only by a small margin that would be imperceptible to users. Question: since this method discards the magnitude of improvement, wouldn’t it actually hinder choosing among multiple new algorithms that are each better than the original by wide margins?
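
    For concreteness, a small sketch of the sign test as described: only the direction of each per-query difference is used, never the size (the per-query scores below are made up):

```python
# Sign test over per-query comparisons: count how often system B beats
# system A (ties dropped) and compute a two-sided p-value under the null
# hypothesis that wins are fair coin flips.
from math import comb

def sign_test(a_scores, b_scores):
    wins = sum(b > a for a, b in zip(a_scores, b_scores))
    n = sum(b != a for a, b in zip(a_scores, b_scores))  # ties ignored
    # two-sided binomial tail: P(result at least this extreme | p = 0.5)
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# B beats A on 9 of 10 queries; how much it wins by is invisible here.
a = [0.2, 0.3, 0.4, 0.1, 0.5, 0.2, 0.3, 0.6, 0.4, 0.2]
b = [0.3, 0.4, 0.5, 0.2, 0.6, 0.3, 0.4, 0.5, 0.5, 0.3]
print(round(sign_test(a, b), 4))  # ≈ 0.0215
```

    B winning by 0.01 per query and B winning by 0.30 per query yield the same p-value, which illustrates the concern in the question.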

    3) The article alluded to making a query index and caching the responses to all queries up to a certain length. This would improve both latency and query throughput, and theoretically improve user experience. The article rejected the idea because it would require too much space to hold such a cache. However, the article was written in 2009, and advances have since been made in storage capacity. Is it fruitful to have this discussion now? Is this option significantly more viable? Could an IR system track its most frequent queries and cache just those responses, as opposed to every query? Is this even relevant or helpful, given how quickly information can change and become available? Can this be scaled down and applied to more limited, smaller, or even closed data systems that are searched?
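
    A sketch of the “cache only the frequent queries” idea from the last question, using a simple least-recently-used eviction policy (the search function here is a stand-in, not a real engine):

```python
# Bounded result cache: keep only recently used queries instead of all
# of them, evicting the least recently used entry when full.
from collections import OrderedDict

class QueryCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # query -> results, in LRU order

    def get(self, query, search_fn):
        if query in self.cache:
            self.cache.move_to_end(query)        # mark as recently used
            return self.cache[query]
        results = search_fn(query)               # cache miss: run the search
        self.cache[query] = results
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict least recently used
        return results

cache = QueryCache(capacity=1000)
results = cache.get("sports", lambda q: ["doc1", "doc2"])
```

    The staleness concern raised in the question could be handled by also storing a timestamp per entry and re-running the search when an entry is older than some time-to-live.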

  25. 1) Chapter 8 discusses efficiency, effectiveness, and cost as the three principal tradeoffs in search engine design. In terms of searching text databases, I’m having trouble imagining scenarios where efficiency is a major issue, particularly since the cost of “processors, memory, disk and networking” has made it so that relatively “cheap” systems can sift through immense amounts of data. In the table on page 5, the largest data set is only 426 gigabytes, which is relatively insignificant today (even at a personal computing level). Have modern technical advances trivialized research into optimizing efficiency (specifically with regard to specialized, text-based IR systems)?

    2) Chapter 8 goes on to discuss the tradeoffs between using relevance judgments and query log data as techniques for determining the relevance of queries. Relevance judgments appear to be inherently more subjective, while the author states that query logs are not as precise. Perhaps it is due to my computing background, but the immense number of data points offered by query logs seems to offer much more objective data. Is it not feasible that, through appropriate analysis and tweaking of algorithms, query log data (which includes everything from dwell time to clickthrough patterns) could be just as precise as relevance judgments without sacrificing objectivity?

    3) Chapter 8 discusses how specific types of people are chosen and trained to perform relevance judgments. This makes sense for the very specialized search engines used as examples (the ACM uses computer scientists, sites in the .gov domain use government analysts, etc.). However, it seems quite surprising that general-purpose search engines such as Google also have success using raters, who would need to perfectly represent the average user, and even more so that they pursue this avenue given all of the computing resources they have for data mining. How can raters be as useful in a general-purpose IR system as in the more specialized cases discussed in the chapter?

  27. 1. The author has not discussed that, for the same query, different search engines might be designed to return different types of data, or perhaps entirely different data sets. Can tests for two such algorithms be run on the same data set, or would they need to be run on different data sets? How do we grade the relevance of the returned documents if they are run on the same data set? How do we compare the performance of one algorithm to another?
    2. The author has discussed that testing of the search engine should be done with a query set that is representative of what the application needs to accomplish. But to do this effectively would imply thorough knowledge of the data present in the data set, which doesn’t seem possible. And if the testers do have knowledge of the data, they are likely to fall into the trap of testing on training data.
    3. How are the weights decided when evaluating the F measure for a given algorithm or a given query?
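
    For reference, the weighted F measure is usually parameterized by a single weight β, which the application designer chooses to reflect how much more recall should matter than precision (β = 1 is the balanced harmonic mean); a minimal sketch:

```python
# Weighted F measure: F_beta weights recall beta times as much as
# precision; beta = 1 gives the usual balanced F1.

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))                      # 0.5 (balanced)
print(round(f_measure(0.4, 0.8, beta=2.0), 3))  # 0.667 (recall-heavy)
```

    There is no formula for β itself: it encodes a judgment about the task, e.g. a recall-critical application such as legal or patent search would set β > 1.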

  28. 1) The "skip above and skip next" strategy is based on clicks, and hence should also take into account the user's behavior after the click, which signifies the relevance of the result. But even if that information is available in the logs, the context of the search needs to be taken into account while aggregating the actions of multiple users.

    2) What are some methods to encode user identity while using real time data for evaluation?

    3) Speaking of performance (throughput vs. latency), what factor makes search engines (or Google) choose the number of results on a page? (I did a search and found 17 results on page 1.) This is significant because, on the one hand, there is a lot to learn about context if the user does not find his information in, say, the top 5 results; this gives more opportunity to render better results. On the other hand, fetching 5 results for a query should be computationally less intensive than fetching, say, 17 results. Imagine the amount of data traffic that would drop!!! Oh my!

  29. 1. The reading states that pooling may be biased against new algorithms that find relevant documents that existing algorithms do not, but that studies have shown this doesn't really have a big effect on its capacity to judge an algorithm. But it seems like pooling would have a severe effect if it were used for a new document *type* (e.g. blog entries vs. online shopping results). Is this the case?
    2. Is it possible to go from preferences on a body of documents to absolute relevance judgments?
    3. The CACM collection seems vastly smaller than the GOV2 collection, so what's the draw of using it vs. the GOV2 collection?