Thursday, September 5, 2013

12-Sep Harman. Information Retrieval Evaluation Ch. 2


  1. 1. The chapter discusses the spam track, which constructed a public corpus of spam. Spam is rather vague by definition. One general idea is that advertisements are treated as spam. However, on today's social networks we also talk a lot about social marketing, and sometimes we would like to use social networks to find good products recommended by others. From this point of view, certain kinds of advertisement are beneficial to us and should not be treated as spam. How do we define “real” spam, and how can we recognize and classify it?

    2. My second question is about the Legal Track section. It seems to me that this track is somewhat like vertical search, in which case recall is the main issue that needs extra care. The chapter also says that all these documents are in the form of XML records. So when we search this kind of information, do we return just a certain node in the XML tree, a sub-tree of the XML file, or the whole file? Which is best, and why?

    3. My third question is about the Blog Track section. One main task of the blog track is the blog distillation task, which finds interesting blogs for users to follow or read in their RSS reader. The primary goal was to find blogs devoted to a certain topic. However, this lacks diversity, and users would not really be interested in being limited purely to one topic. Also, some popular meta blogs would be excluded or ranked lower under this approach. From this point of view, how do we mix personalized interest with diversity in blog recommendation?

  2. 1. It is mentioned that, in the process of document collection for the TIPSTER task, some documents were selected less for their content than for their length. Computers are powerful enough to scan documents in seconds, so variability in document length should not affect the performance of a search engine. So why did length become the determining factor? On the other hand, if length does affect the performance of a search engine, what are the possible effects and how can they be corrected?

    2. It is said in “building the ad hoc collections” that the top X ranked documents from each individual system were selected as input to the pool. However, if relevance in TREC is only a binary judgment, how can these documents be ranked? It is also mentioned that duplicates were removed for the collection. Since duplicated information is everywhere in the real world, is it also a task for the search engine to filter duplicated information? If so, why not include the duplicated results and evaluate the performance of the search engine on these duplicated documents?

    3. In “analysis of the ad hoc collections”, one interesting measure that is not discussed in any previous reading is the hardness of topics. What are the possible ways of quantifying the hardness of topics? How does the hardness of topics affect the evaluation of search engines? How does the other characteristic, variability of topics, affect the evaluation of search engines? Is there bias in the performance of different search engines on topics with different hardness and variability? If so, should we choose topics with the same or similar hardness and variability when evaluating search engines?

  3. The author thoroughly examines the methodologies used in TREC, NTCIR, INEX, CLEF, etc., and discusses the evolution of these systems to meet changing requirements. The points of discussion that I put forward for this reading are:

    1. In 2.4, the author discusses other TREC retrieval tasks in addition to the TREC ad hoc tests discussed in 2.3. Does TREC deal only with keyed input (words/letters from any language, as discussed in CLIR, CLEF (2.4.2)) and speech data (2.4.1)? Recent information retrieval challenges include image input (Google Image Search), which is very intuitive (and possibly has a higher impact). Does TREC (or NTCIR, INEX, etc.) deal with image (or other forms of) input?

    2. The author describes the QA track of TREC. A related work by Stoyanchev et al. mentions that the performance of the IR system is an upper bound on the overall QA system performance. It is also described that performance is evaluated based on recall of "nuggets (some of them being vital)". How does this methodology work for a very large corpus (the whole internet, for example)? Is it possible to follow this methodology to build efficient factoid-producing systems? Or do state-of-the-art systems use a different methodology?

    3. The author explains the routing task, where the user is looking for more than one document (an ongoing information need), i.e., the document set keeps growing for the same topic and the user continues to look for more relevant documents. In such a scenario, does it not make more intuitive sense to modify the order (rank) of the upcoming documents by observing the user's behavior? What are the costs involved in dynamically changing the rank of the documents (after they have been returned to the user)?

  4. 1. Harman mentions both that precision/recall scores varied greatly based on topic and that certain topics have very high levels of assessor disagreement as to document relevance. Voorhees (2000) showed that while variations in relevance judgments have an effect on overall MAP scores, systems tend to remain in the same relative positions. Is this a problem with TREC experiment design or simply an inevitability in IR work? How might systems pre-judge the ‘hardness’ of a query?

    2. Document “pooling” is described as a process where a sample of the top X unique documents is taken from every participating IR system, combined, sorted, and judged. Harman seems to describe pooling as a necessary evil and an inevitable consequence of time and resource constraints. Nevertheless, might there be other ‘pooling’-like document sampling methodologies that are less prone to bias?

    3. The question-answering track required a system to return an answer to a simple factoid, list, or definition type question (or some combination of these). While the question-answering task is presented as separate from the general IR task of most TRECs, could high-performance question-answering techniques be partnered with a general IR approach to improve a system’s understanding of user queries?

  5. 1) In section 2.4.3 “...Web retrieval...searching”, Harman describes how assessors developed additional topics to reflect actual user models. She also notes that the titles of the new topics (derived from query logs) maintained the original misspellings. Is there a reason to keep such data?

    2) In the section “Terabyte... and the ‘new’ Web Tracks”, Harman mentions that pools were biased towards title words from the topics, making exotic runs score lower. However, wouldn’t this be acceptable, since we typically search using keywords and it would thus give priority to the “common case”?

    3) When discussing multi-language collections, Harman states that topics should be selected such that there is no bias towards a language. However, doesn’t this bias reflect real-world usage, making it an acceptable property?

  6. In discussing the overlap measure (intersection over union), isn't it a bit subjective to say that these topics have high agreement simply by providing the numbers 70% and 80%? Is there a more objective, fixed criterion for agreement on topics?

    The author introduces routing and filtering tasks in this chapter and mentions that several methods were tried but failed to solve the problem of pooling for relevance judging. I have two questions about that. What methods did these scholars try? And why did they fail?

    In this chapter, the author mentions the user model many times. I think this concept must be very important in IR. I have some questions: how does one design a user model in information retrieval? And could you introduce some commonly used user models in this field?

  7. On page 32, the author states that “the number of new relevant documents found was shown to be more strongly correlated with the original number of relevant documents, i.e., topics with many relevant documents are more likely to have additional ones, than with the number of documents judged.”
    Considering the fact that diversity in the test data pool needs to be encouraged (Section 2.7.4), wouldn't having a large number of documents pertaining to the same topic indicate that some documents for this topic may need to be removed from the pool, because they bias the data set towards that topic?

    On page 33, the author states: “If test collections do not reflect this noisy situation, then the systems that are built using these collections to test their algorithms will not work well in operational settings.” This might be true for many search engines, but I don’t think this statement can be generalized. Not all search engines need to be designed to take into account that noise might be present in their data set. For example, the search engines used at Amazon shouldn’t need to focus at all on the possibility that the database has incorrect information.

    How did the TREC testing take copyright constraints into account? If the document collection had to be shared with others, how were issues related to copyright resolved?

  8. 1. Among the different kinds of searches mentioned, the one that I don't see is diversity search with dimensions along 'format': for instance, if I am in a pharmaceutical organization, I may want my query to access multiple sources (emails, the web, documents) and show me highly relevant results from all these sources. This kind of hybrid search would have many applications, especially in organizational settings (it would essentially be a broader version of enterprise search). Is there a TREC track devoted to this, or more generally, does this kind of hybrid diversity search have a specific term?
    2. For the domain-specific legal retrieval task, it seems like a hint can be taken from what was done for question answering (with regular expressions being used to check how good answers were, in a separate part of the reading), and that domain experts could specify 'answer templates' that would help to retrieve a more complete set of judgments. From the reading, graduate students and human experts had to spend considerable time forming relevance judgments. Wouldn't this expense be better redirected towards larger datasets, more queries, and more complete judgments by exploiting the domain expertise in an active learning framework? Is there a flaw in this methodology that is not evident?
    3. Is the reason that the spam track ran for three years the lack of data, as mentioned? It would seem that with online data increasing, spam should become more important and the track should not only have continued, but have picked up pace.

  9. What are the advantages of measuring the difficulty of a given topic? The article mentions a measure called “hardness”, which is oriented towards high recall performance. Is this an effective measurement of difficulty? What exactly are the important characteristics that define the difficulty of a given topic?

    What is the required number of topics for a test collection? The author mentions it was “folklore” that a minimum of 25 topics was needed. What is the basis for this assumption? Do different kinds of topics call for different numbers of topics? What are the rules for making this decision, if there are any?

    The question-answering track requires a change to the definition of a test collection, since the answer set is no longer the set of relevant documents but “pieces” of documents. The question now is what determines the length of the “pieces”. Do different questions have answers of fixed or varied length? If the length varies, what are the limitations? Since the unit judged is only the answer string, does this pose a consistency issue, in that the test collection is no longer reusable? Are there any solutions to mitigate this issue?

  10. On page 31, paragraph 2, the author notes that the manual keyword field (the concept field) that was used until TREC-2 was removed in TREC-3, as real user questions would not involve such a field. This can be considered an improvement, as retaining the field would definitely have increased the search effort for the user. But we still have e-commerce sites (Ebay, Bestbuy) which use a manual keyword field for better retrieval of results. Given that websites like these are much smaller in scale than the whole web, wouldn't they be better off handling ‘real user queries’ without the manual keyword field? What makes the trade-off of concept field vs. user comfort possible in general web search and not on e-commerce websites?

    At the end of paragraph 2 on the next page, it is stated that, as opposed to the SMART project’s “folklore” that 25 topics give stable performance, TREC’s 50 topics produced stable average performance measures. Is stable average performance only a function of the number of topics selected? What about topic hardness, the relevant information present on the web, and topic diversity? How can these parameters be included in the calculation?

    The TREC collections emphasize the completeness of collections, stating that the more complete the list of relevant documents, the more useful the collection. Over the years, TREC collections have evolved in accordance with user search trends. However, the sources of data for TREC have remained more or less the same: newspapers, journals, and government sites. Though the diversity of topics and variable-length articles are covered, many other domains, such as entertainment (which happens to be one of the ever-changing fields), are not covered by the sources from which TREC data is collected. Can TREC be considered a true sample representing ‘all’ the information on the internet? Or is it more organised than the information on the web? Are absolute measures of search methods on TREC collections not proportionate to their actual performance?

  11. 1. The question-answering track required systems to return the answer to a question. The answer set consists of strings rather than documents. (p.42) The question is: in question-answering tasks, should answers containing the keywords in the question be retrieved? Or should every question have a predefined set of answer strings, with answers containing those strings being retrieved? If the latter, can this be achieved on the real web, where there are innumerable questions?

    2. In routing and filtering tasks, the user has an on-going information need. Therefore, the topic is fixed, but the document set keeps changing. (p.44) To deal with this problem, should IR designers predict how a certain topic will expand in the search process or design based on analyzing previous search data?

    3. The terabyte track worked on a large web collection. It is pointed out that the pools were biased, as there was a high occurrence of title words from the topics in the relevant documents. (p.45) To what extent can we sacrifice accuracy for efficiency?

  12. 1. The author argues that topic and user variation is realistic and must be accepted as part of any testing. Test collections should reflect this noisy situation. (p.33) The question is: how can we evaluate the effect of this variation on experimental results?
    2. Speech, video and image retrieval are mentioned in the section “Retrieval from ‘Noisy’ Text”. (p.36) Must these kinds of information be transcribed or described in words in order to be retrieved? What are the similarities and differences between text retrieval and retrieval in other formats?
    3. The author discusses several domain-specific retrieval tasks in section 2.4.4. What’s the difference between general information retrieval and domain specific retrieval? For example, should we take the features of legal terminology and other facts into consideration when studying legal information retrieval?

  13. Harman notes in section 2.4 that result comparisons involving different languages are not considered valid. How then do we, in essence, reconcile document collections across differing languages and, by extension, native relevance judgments? This matters especially for subjects such as the humanities, which oftentimes involve other languages and document sets for study.

    Harman makes note of how legal searches involve a high level of noise, given the nature of the legal profession and the documents involved. Does returning “good” sets, as Harman mentions, instead of ranked lists amount to a traditional recall-oriented search, or would the legal profession be best served by having relevance judgments made by multiple assessors to provide “good” sets for searches, particularly when there are regulations with legal consequences hanging overhead?

    Harman discusses the question answering track, which focuses on pieces of documents to generate an answer to the question posed. She points out that in TREC 2002 only a single answer was allowed to test the system for correctness. With the introduction of “nuggets” and denoting some of these nuggets as vital to a specific question, does this process automatically force binary decisions upon more abstract questions that have more than one relevant and vital answer?

  14. Building a new test collection every so often seems to be a natural requirement considering the progress in efficiency and effectiveness of present-day retrieval systems at scale. The author favors using or modifying existing collections over building a new collection. The Cranfield paradigm calls for applicability of test collections over time. With a temporally evolving definition of relevance, evolving metrics, and the addition of new types of media, I can see how test collections are easily outdated.

    A point often made is that of disagreement between relevance judgments across judges. The author here highlights the high degree of disagreement on documents marked relevant as compared to non-relevant. The claim made here is that of the topic statement being ambiguous, where judges may be biased to make strict or weak relevance judgments. However, doesn’t the narrative define the scope? Is the problem really with topic definition?

    Do domain-specific test collections require domain-specific metrics? Or is the present set of standard metrics universally applicable (and while they may be applicable, are they effective)?

  15. 1. The author has elaborated on the Video Retrieval Track conducted by TREC, which facilitated content-based information retrieval. I am assuming that the granularity in this retrieval mechanism is a 'screen' or a collection of frames. In such a case, how are the semantics of the video captured? Also, what is the retrieval unit on which the effectiveness of the system is based? Further, as we cannot use time as a parameter and the user in all probability continues to view the video until he/she finds something relevant, what system parameters are proposed to address this assessment? And finally, how does the assessor categorize a video as relevant, especially given that the information arrives as a continuous data feed and there is a significant amount of fuzziness when information is expressed in multimedia form?

    2. In the section that deals with IR systems for blogs, I'm curious how the system broadens its knowledge domain and interprets informal text while understanding semantics, and how we are able to achieve a qualitative as well as a quantitative assessment of blogs through our current evaluation methodology. Also, how does the IR system approach opinion extraction and sentiment classification once the domain of the unclassified blog has been determined? The method currently proposed involves a simple count of positive and negative words. But wouldn't this be an insufficient analysis, as phrases like 'awfully good', 'civil war' and even 'easy problem' may be wrongly categorized? Semantics plays too crucial a role here, so any restriction, be it in terms of vocabulary, grammar, or literature as a whole, could affect this domain-specific retrieval task quite negatively. Are there alternative hypotheses for improving the performance of IR systems on blogs?

    3. I am unclear on how different the results of a routing/filtering task would be from a normal search characterized by trivial Q&A. The author states that in the former the 'topic continues to be fixed', while the user intends to collect more documents. But how are we going to define a 'fixed topic', and wouldn't the relevance judgements in any case still be based on an assessment made in a previous experiment and therefore be incomplete? Moreover, how does the user's learning curve affect the documents returned during routing/filtering, and how is it used constructively to predict minute topic shifts, given that this kind of comprehensive indexing would require inferences to be drawn from semantics? Also, routing does not seem to account for backtracking of documents. Doesn't this in any way suggest an oversimplification?

  16. 1. TIPSTER assessors were asked to judge a document as relevant as long as it contained any information that might be included when writing a report on the subject of the topic (pg. 28). Doesn’t this explanation push you in favor of binary relevance judgments? After all, they limit bias and ensure that all (possibly) relevant documents are recognized.

    2. Do you agree that test collections should be noisy, in that they should reflect topic and user variation? In that context, should non-native speakers be included in topic generation and relevance judgment as representatives of non-native users?

    3. The way domain-specific retrieval systems are structured seems to push away any user outside the field, or a novice investigating and trying to learn more about the field. The domain-specific model does not seem to cater to multidisciplinary learning. Do you think that is a concern, given how education and learning are organized today? I wonder if the term “domain-specific” is even relevant today. Shouldn’t retrieval systems be interoperable?


  17. Because TIPSTER was designed around a 'realistic user model' in which users were 'presumed to be intelligence analysts, but could also be other types of users that work with information intensively', was it doomed to extinction, or did it, in an alternate, shorter form, become the basis of what we now consider advanced search/filtering?

    As more content appears on the internet, and there is more to search through, will all users need to become the TIPSTER 'realistic user model', or at least be familiar with this type of multi-field search?

    It's interesting to note that the author talks about the history of video search, and how it started out with text translation and was based around text, with errors being measured in word error rate (WER), and then goes on to say that by 2003 video retrieval was based on the image content of the video. I think that's interesting, because I feel like when I use YouTube, Vimeo and other video-based search engines (granted, this is 10 years later), my keywords are based on what happens in the video. Searching for the event that takes place in the video, such as 'news reporter falls off grape stomping stand' or 'wedding proposals choreographed', seems to be the norm. I would be interested to know more about how this currently works. Have we swung back towards text? Perhaps not in a direct translation type of application, but in a more theme/story-based search? Or is this considered search based more on imagery?

  18. 1. When retrieving documents from blogs (domain-specific retrieval) the task was to find all documents containing an opinion on a given topic and then to identify if that opinion was positive, negative or mixed. This classification of the opinion requires understanding of the style of language (emotion of the text written). In order to decipher the emotions from text, what methods were devised and how accurate can the system be in classifying this while retrieving?

    2. In using TREC for the “Routing and Filtering” task, the user would be more focused on collecting more information for an already started search. So when the topic is fixed but the document collection is varying, would building a test collection dynamically help in this case? How difficult can it be? What would be the optimal choice: an existing collection or a new collection?

    3. In the section “Building web data collections”, if the study conducted by Yahoo emphasized collecting a large number of topics with shallow relevance judging, then how likely is it that the collection was built specifically for the experiment under study? Would this not produce misleading results for a generic case study? How can one relate it to search on the entire web?

  19. 1-What are some strategies historically used to make sure that topics/tasks/queries being used in test collections reflected actual user conditions and interest? Are these adequate? What else can be done to improve the relation between test collection and reality?

    2-As we move into searching huge collections (the internet) the Cranfield model seems to break down. It relies heavily on access to documents pre-judged for relevance, which is not possible on a collection the size of the internet. Should a new standard measurement be developed? Can we get similar results if a system does not have pre-judged documents, but instead documents are judged for relevance after they are returned? Should multiple judges per topic be introduced to help distribute the work and offer more diverse opinions? Is it useful to crowdsource relevance judgments? How can implied judgements (not clicking on a result) be interpreted?

    3- I was interested in the development of the legal track because it was developed directly due to legislation at the federal level regarding admissible evidence. How did the development of this track differ from the others, which seemed to be generated as someone developed an interest in the topic? When a track is designed specifically for a law, does that impact the usefulness of systems tested and applied outside that narrow field?

  20. 1) The author comments that, “users come to retrieval systems with different expectations, and most of these expectations are unstated.” Assuming this statement is true, and that computing resources and user statistics (e.g., click-through data) are far more readily available today than in the 90s, is there value in researching better techniques for determining “relevance” (and therefore “expectations”) through user interactions (potentially subjective), when we could instead be investigating probabilistic algorithms for interpreting all the user data (arguably more objective)?

    2) The chapter briefly discusses video retrieval, and mentions that some correlation comes from matching speech in the video to user queries. Putting aside the challenges of speech-to-text, what additional challenges exist in determining relevance in the context of video versus just text documents?

    3) The chapter groups together multiple TREC tracks that “push the limits” of the Cranfield Model and one of these is spam. I can imagine that a sort of “inverse relevance” technique could be used, which would be loosely based on Cranfield. Since the implications of incorrect relevance are much more severe in the case of a spam filter, is the Cranfield model actually useful in the context of spam?

  21. 1. To construct the pools for TREC, the top X ranked documents for each topic are selected as input to the pool. I am very curious about the value of X. Is there a tradeoff in choosing X for different methods? Perhaps some methods give good results only when X is very small, so that enlarging X introduces many noisy results, while other methods do well only for a large X.

    2. For the video retrieval track, the author doesn't mention how a video's semantics are extracted and connected with related topics. How is this done?

    3. Will the document sources affect retrieval effectiveness? For example, for a certain topic in a domain, relevant documents may come from just a few sources, like the Wall Street Journal or the Associated Press (P. 30). Different journals or presses must have different styles, which will impact the results. How can we avoid this kind of bias?

  22. This article describes the cross language retrieval tasks of the TREC CLIR, the NTCIR, and the CLEF. In each of these tasks the topics are created by native speakers of the languages included. The article also states that this is important because these individuals would know how someone would phrase the search in their language. What about people who have only a limited knowledge of this language or speak a different dialect of this language? Is it important to account for such a person?

    The author of this article describes several different domain-specific retrieval tasks that were done in many of the TRECs. These tasks included the genomics task, the legal task, a track that focused on blogs, as well as others. Why would a domain-specific retrieval be any different from the regular TREC collection? What were the goals that these tracks sought to achieve that were different from the goals of the regular TREC?

    The author finishes this article by giving suggestions about using test collections. One of the things she describes is the idea of subsetting or modifying an existing collection to meet the specific needs of your search. Is subsetting or modifying an existing collection a good idea? What type of problem could you run into by doing this to a test collection that is already established? Is it better than creating your own test collection using documents that have already been collected?

  23. 1. Just as the research has changed and updated over time, so too has society and its understanding of the world. Do older collections suffer from changes in relevance judgments over time? How do researchers compare current results from research findings from 20 years ago? Do they need to take the passage of time into consideration when examining particular topics?
    2. The disagreement on relevance in topics mentioned on page 33 is disconcerting. However, the point that judgment variance is natural makes perfect sense in light of the discussion and the examples given on that page. However, is it not possible to further standardize topic relevance in order to improve agreement between judges (and search results)? Although part of this may be due to ambiguous interpretations of “relevance,” part of it may be due to the judges and their training. How have researchers and policymakers evaluated ways to test and improve the training and instructions given to judges?
    3. Page 37’s summary of efforts in the area of “noisy” text is insightful and interesting. On the one hand, it sounds like algorithms whose results have higher recall may help prevent the exclusion of relevant documents that contain noisy data; however, this comes with its own limitations (e.g., too many results to find the sought-after information quickly). How else do researchers seek to properly rank relevant but noisy documents?


    4. Page 37 states: “it is critical that the topics be created by native speakers in each language to insure that they reflect how a person would actually express their information need in that language.” Is there ever a situation in which it is actually beneficial for a non-native speaker to create the topics instead? It may be the case that non-native searchers are sometimes looking for search results in a non-native order, as it were.
    5. (Chapter 3) The discussion of Bhavnani’s research on pg. 69 raises an excellent point that was also briefly discussed in our first class, with respect to filtering search results: if individuals search within their specialty area, they appear to find the correct results fairly quickly; however, when they do not, they tend to find poorer quality results than specialists do, whether they know these results to be of lower quality or not. How have search engines honed their systems in order to provide high quality results to both specialists and non-specialists of a particular topic? How can this be accomplished without low precision, high recall results?
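
Several of the questions above turn on pooling (merging the top X documents from each run and removing duplicates) and on the overlap measure (intersection over union) used to report assessor agreement. As a rough illustration only, and not Harman's or TREC's actual procedure, both ideas can be sketched in a few lines of Python. The run names, document ids, and the function names `build_pool` and `overlap` are hypothetical, made up for this example.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> List[str]:
    """Merge the top-`depth` documents of every run into one pool.

    Each run is a ranked list of document ids produced by one system.
    Duplicates are dropped and the pool is sorted by document id, so the
    assessor cannot tell which system (or rank) a document came from.
    """
    pooled: Set[str] = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return sorted(pooled)


def overlap(relevant_a: Set[str], relevant_b: Set[str]) -> float:
    """Intersection over union of two assessors' relevant sets."""
    union = relevant_a | relevant_b
    if not union:
        # Assumption for this sketch: two empty sets count as full agreement.
        return 1.0
    return len(relevant_a & relevant_b) / len(union)


# Hypothetical ranked output of two systems for a single topic.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d9", "d3", "d5"],
}

pool = build_pool(runs, depth=2)
print(pool)  # ['d1', 'd3', 'd9']

# Hypothetical binary judgments by two assessors over the pooled documents.
assessor_1 = {"d1", "d3"}
assessor_2 = {"d3", "d9"}
print(overlap(assessor_1, assessor_2))  # one shared document out of three
```

Note that in this sketch the ranking comes from each system's own scoring, while the binary relevance judgment is applied afterwards to the pooled documents. Varying `depth` is one simple way to explore how the choice of X changes which documents get judged.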