Friday, October 4, 2013

10-10 A Chouldechova, D Mease. Differences in search engine evaluations between query owners and non-owners. WSDM’13.


  1. 1. My first question is about Section 3.2, Task Description Experiments. In these experiments the higher-quality and lower-quality search result sets were constructed from Google’s search results. My question is: if a document appears (exactly or nearly) in both the higher-quality set and the lower-quality set, how should such duplicates be handled?

    2. My second question is about the experimental results. Figure 2 compares owner and non-owner mean scores for Experiments 1-6. In general, the mean score for owners is slightly better than for non-owners in each of the six experiments. The mean scores are very close for Experiments 3 and 4, and even for the remaining three (excluding Experiment 6) the difference is not large. Figure 4 shows the corresponding 95% confidence intervals for the difference between owner and non-owner mean scores, and the p-value is somewhere around 0.05. So I am not quite convinced of the superiority of owners over non-owners, especially given the extra effort required to gather the owner data.

    3. My third question is about the non-owners. There is no explicit information about them (beyond the fact that they are hired assessors). Are they experts in the topics? Were they picked at random, or are they specialists in certain topics? As noted in Section 2.2, there is related research examining the impact of assessor expertise, such as the distinction among gold, silver, and bronze assessors. What would the experimental results look like if we took these factors into consideration?

  2. 1. This paper studies the differences in relevance judgments between query owners and non-owners. My first question is why Google’s ranked results are used as the gold set. This makes the comparison of judgments between query owners and non-owners look more like a comparison of closeness to Google’s performance. Also, why were these five experiments performed, and is there a justification for why results at certain ranks were swapped or replaced?

    2. Since Google’s results are used as the gold data set, and they change dynamically all the time through user activity and the appearance and disappearance of web pages, at which point were the Google results captured for use as the gold set? Have the authors compared Google’s results for the same query at different times to see whether there is any difference? Another possible problem is that the first document returned by Google is left unchanged across all the result sets. For navigational queries only the first result is relevant, in which case the paired data sets are effectively equal and contribute nothing to the comparison. Have the authors tried to eliminate this redundancy?

    3. At the end of the paper the authors raise a potentially useful application of this study, which is to reduce the number of assessors needed for search engine evaluation. They suggest that the number of assessors needed is inversely proportional to the square of the mean scores. However, those mean scores are obtained only after collecting a large number of assessors’ judgments. How can we determine the mean score before we even know how many assessors to choose?
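The inverse-square relationship the question refers to is the standard sample-size formula from a power analysis: to detect a mean score delta, the required n grows as 1/delta^2. A minimal sketch, assuming a one-sample z-test and a hypothetical score standard deviation of 1 (none of these numbers come from the paper):

```python
from math import ceil
from statistics import NormalDist

def assessors_needed(mean_score, sd=1.0, alpha=0.05, power=0.8):
    """Rough number of assessors needed to detect a mean score
    different from zero (one-sample z-test).  The result scales
    as 1 / mean_score**2, matching the inverse-square claim."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_power = NormalDist().inv_cdf(power)          # ~0.84
    return ceil(((z_alpha + z_power) * sd / mean_score) ** 2)

# Doubling the mean score cuts the required panel to about a quarter:
print(assessors_needed(0.10))  # 785
print(assessors_needed(0.20))  # 197
```

The sketch also makes the chicken-and-egg problem in the question concrete: `mean_score` has to be estimated somehow (e.g., from a small pilot) before n can be chosen.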

  3. In this paper, Alexandra introduces a unique and interesting approach to assigning queries to assessors. About the way assessors judge search results, I have a question: how does she control the quality of the assessors’ judgments? It is genuinely difficult to tell which side is better. Is it possible that some assessors judged search results on the basis of intuition or guesswork? Was there any guidance for assessors on how to judge the results in these experiments?

    In designing the experiments, and setting Experiment 6 aside, why did Alexandra design the other five experiments in these particular ways? What does she try to show through the five different experiments?

    In collecting queries, assessors were encouraged not to include any highly personal information. In that case, is it possible that some queries are very general and known to most people? More specifically, for some of the queries collected in Alexandra’s research, is it possible that both owners and non-owners had searched them before, but only the owners contributed them to Alexandra? If such situations exist in this study, would they introduce bias?

  4. 1. In class discussions, we have mentioned that a query can have multiple meanings. Even Google’s directions to assessors acknowledge that a query can take on several different meanings. By experimental design, the owner knows exactly what he meant when he first typed the query. The non-owner will judge relevance based on his own interpretation. Odds are, the non-owner will think of the most common interpretation of the query, while the owner may have had a less common one. When the owner sees his query again in the experiment, he might naturally think about what it means to him instead of accounting for the common meaning. How does this experiment account for the natural ambiguity in queries? Does the design of the lower-quality results prevent this?

    2. When it came to a deeper analysis of the results, the authors mention that they reduced their data to just the owner of the query and a single non-owner. They go on to use this simplified view for the majority of their analysis and as the backbone for most of the conclusions they draw. However, the authors give little insight into how they selected that single non-owner. Did they use random selection, take the first non-owner response, or choose based on the response itself, which would invalidate their evaluations?

    3. In their conclusions, the authors reference the trend of web search companies using as much personal information as they can about the query owner to shape the displayed results. Accordingly, the authors feel that owner-based query evaluation should in general be explored further. We have also mentioned in class that a query owner knows their exact intent, and it is on the search engine to interpret the query’s meaning regardless of any ambiguity in its terms. However, in this experimental design, owners are not told whether they were the originator of the query.

  5. 1. Since the authors can't really say anything about Experiment 6, and from what little they do reveal it seems to use a different methodology, should it really be included in this study, or should it have been published as a separate, related study? They say that they decided to convert the data they had into +1 and -1 evaluations; couldn't this throw off the results of the experiment?

    1.5 Also regarding the +1/0/-1 system: shouldn't the authors have let the assessors indicate these scores directly instead of translating feedback into this system? That would seem to eliminate a source of bias.

    2. One of my big questions while reading this article was: are they suggesting the use of query owners as assessors, and would that even be feasible? They sort of answer both questions in the conclusion: yes (they are suggesting the use of query owners as assessors) and no (it's not really feasible). It seems like they go to a lot of trouble to prove something that is common sense, and then explain that it can't be used for anything.

    3. They note at the outset of the paper that one of the issues they faced when combining the literature review with the experiment was that the literature is concerned with "absolute assessment of single documents" while the experiment is concerned with "relative measures of sets of documents" (p. 104). How can disparities like this be avoided when combining a literature review and experimental data in a paper, and how serious are the implications of such a disparity?

  6. 1. The authors depend on Google’s natural rankings for determining which list of test results is “better”. Do you think this is a reliable baseline? Why or why not?

    2. The authors write that when using assessor queries it is difficult to obtain representativeness (p. 110). How many queries does Google sample when it is testing a system? How do you determine if a query is “representative”?

    3. In Experiment 6, the authors tested a variation of the Google algorithm that had not yet been released, and discovered that the data from testing it with query owners contradicted some of the data from the other tests (p. 106). Since the results of this part of the experiment contradicted other tests of Google's search quality, can they be sure that the result sets in the other tests really were of higher or lower quality?

  7. Experiment 6 is unusual in that, perhaps for confidentiality reasons, the authors are not allowed to release much information about it. But a problem then arises: if there is no ground truth for which set is of higher quality on average, why is this experiment needed at all? In Figure 2, both sets have negative values. What does that imply? Does it imply that both are wrong, but with different levels of mistakes?

    To find out why “the owner's scores are on average more positive than the non-owner's scores for Experiment 1 through 5”, the authors analyze histograms of the assessors' -1, 0 and +1 scores. I cannot see the difference between Experiments 1 and 5 that the authors claim.

    To compute the 95% confidence interval for each of the five experiments, the authors state a formula. I am not sure where the weight of 1.96 comes from; am I missing something?
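The 1.96 is not specific to this paper: it is the standard normal quantile that leaves 2.5% in each tail, so ±1.96 standard errors covers the central 95% of a normal sampling distribution. A minimal sketch (the mean, standard deviation, and n below are made-up illustration numbers, not values from the paper):

```python
from statistics import NormalDist

# 1.96 is the z-value with 2.5% of the standard normal
# distribution in each tail, i.e. the 97.5th percentile:
z = NormalDist().inv_cdf(0.975)
print(round(z, 2))  # 1.96

def ci95(mean, sd, n):
    """95% confidence interval for a mean: mean +/- 1.96 * sd / sqrt(n)."""
    half = z * sd / n ** 0.5
    return (mean - half, mean + half)

# Hypothetical example: mean score 0.05, sd 0.8, 400 assessors.
print(ci95(0.05, 0.8, 400))
```

If the interval excludes zero, the owner vs. non-owner difference is significant at the 5% level, which is exactly the check the paper's Figure 4 performs.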

  8. 1. I do not think that sidestepping the factor of memory is completely justifiable: even though owners do not need to remember the exact information they got through the query, following a familiar sequence and implicitly working toward that pattern would still be a trait. I do not see how the authors can, at the outset, want to account for the fact that ownership of a query affects the assessment of relevance, yet throughout their work discount the assessor's ability to even vaguely remember the results the query returned. Doesn't this introduce bias? Is it fair to make such a hypothesis for the experiment?

    2. In Experiment 4, the higher-quality set is given a different query than the lower-quality ranking, which is populated with Google's rankings for the same query with a word removed. How do the authors justify including this experiment in their study? The purpose of the investigation is to analyse whether query owners have an advantage over assessors with no personal association with the query. So changing the query by removing the third word as a general rule is going to skew the results, as the modified query is not even representative of the content the user originally wished to find. The assessors are judging the relevance of two different ranked sets on the basis of a single query, when in fact there were two different queries. I don't understand the credibility of this test.

    3. The definition of the relevance of a document remains incomplete. And, as stated in the paper, the authors were unable to provide complete relevance judgements for their ranked documents. I feel the experiments play it safe in the sense that they never toyed with the first- and second-ranked, highest-relevance documents returned by Google. So how can we justify the results when incomplete relevance is the only metric the paper uses to argue for the advantage of query owners? How would the results vary if we used the median as a metric? And how different would the results be if we compared the time spent reading the documents with the time spent analysing their relevance?

  9. Chouldechova points out that “the only way an assessor could identify his or her own query would be from memory” (p. 105). Since the queries were gathered multiple times over a 19-week period, would it be of any benefit to extend the experiment by several more weeks? I imagine that if asked to submit a query containing no personal information, assessors would either pick a general query or a query tailored specifically to their interests, and would thus identify their query when completing the experiment.

    The paper outlines three problems that make using owner assessments difficult to pursue with regularity, and notes that a “good method” for finding users who wish to participate does not yet exist. This raises the question: why not just ask for individuals to volunteer? Since expertise is not required, any user should be able to participate in this experiment if they so choose.

    The second problem concerns participants not sticking with the experiment. One possible solution is finding replacement users for abandoned queries. For more generalized queries, would replacement users really be that much of a hindrance to the overall experiment? Or does replacing a user lead to possible muddling of query intentions? For example, if I were to search for “football” I’d be looking for NFL information, but a possible replacement might focus on soccer, since football can include that as well.

  10. 1. Users/Non-Users aside, the results of their experiments seem to indicate that there is a rapid drop-off in relevance after the first ~20 ranked search results. Experiment 3's relatively strong performance compared to Experiment 2 suggests either that ordered ranking is an extremely influential factor on relevance judgment, even in the top 10; or that top 10 results are moderately interchangeable; or both. In fact, the results of Experiment 2 appear to support the latter. However, if this is the case, then it is also probably unlikely that there is a great difference between results 9 and 10, and results 10 and 11. Therefore, what are the reasons for the experimental results in experiment 3? Is it because ranked search results 11 and 12 are so clearly lower than ranked search results 1 and 4 (the documents on either side of 11 and 12) that the users KNOW something is the matter with these results? How would the experimental results differ if documents 9 and 10 were replaced by documents 11 and 12? What if documents 9 and 10 were replaced by documents 19 and 20? What is the threshold at which the juxtaposition of differently ranked documents is reduced to nil?

    2. How does user intent factor in here? We have discussed in class how different intents yield different behaviors - for instance, searching for an address may only require one relevant finding, and the rest could be relevant or irrelevant. It appears that intent could impact both the user/non-user relevance judgment performance of each method mentioned in this paper, and the implications of the findings of this paper.

    3. What are the theoretical implications of this paper, and what might be other evidence of these implications? For example, do the findings suggest that previous user familiarity improves relevance judgment behavior? Do the findings suggest that user familiarity is a proxy for user interest in a topic, which improves relevance judgment behavior? While left unclear in the paper, identifying the theoretical implications would be helpful for further experimental confirmation as well as generating alternative techniques to improve relevance judgments (such as prompting users to read about a topic before searching and performing judgments, or allowing users to select topics in which they are most interested).

  11. 1. I greatly enjoyed this paper. One way to interpret the study's findings relates to the question “what is the intent behind ambiguous queries?” When assessors who did not originate a query judge a document's relevance, they must often guess at the query’s intent. Query intent can be very ambiguous, especially in the short, keyword queries that make up most of what is studied. Were the short queries used at all ambiguous, and could this explain the discrepancy between the groups?

    2. What explains the discrepancy between this study’s findings and those of the Voorhees (2000) study? Is it possible that the negligible effects found in the Voorhees paper are explained by the greater detail that query generators give in the TREC experiments (the query description and query narrative)? Might this call into question whether it is the particularities of the user that generate the query's ambiguity, or the particularities of the query language?

    3. The authors do not say whether participants were required to read the results in each document group. Additionally, might there be some psychological mechanism by which a person simply prefers what is more familiar that would explain the results? How can we be sure that a greater awareness of relevance is the motivation behind the choice?

  12. 1. The ‘non-owners’ were clearly at a disadvantage, assuming they were not given defined judging criteria. It can be argued that the ‘query owners’ were in the same position, but their query may have been worded such that intent is implicitly reflected in the retrieved list.

    2. The motivation for the experiments was not evident, though some are attributed to previous designs. Moreover, the results do not discuss individual cases except for clear outliers (i.e., Experiment 6). I would argue that the discussion of sample-size reduction did not warrant so much analysis while the rest was not adequately addressed.

    3. For Experiment 5, I would definitely have liked to see more depth in the analysis. What is not clear is the impact on the search results and on intent when the third word is removed. It would also have been useful to record query recollection.

  13. The two sets chosen for the five experiments (we don't really know what was done for the sixth) seem rather ad hoc. They make intuitive sense, but lack scientific rigor. How can we trust that, had some other scheme been chosen for generating these sets, the results would not have been different? After all, not all the results showed statistically significant differences.

    In this paper, more than most, it seems as if the type of query would have made a difference. As cited in the related-work section, studies in computer programming and biomedicine have shown that domain expertise does matter for information retrieval in those fields. The queries in this paper were simply self-selected by the users. They could have ranged from the latest press article on Justin Bieber to the Syrian crisis. To make the experiment more conclusive, wouldn't it have been good experimental design to ask the users to give preference to 'serious' query submissions?

    It was very interesting to me that the authors presented results for Experiment 6, which they could not describe and which they knew did not yield any real insight. The paper would have been substantive enough without that experiment. Was there any reasonable motive to include it? The authors did draw a sketchy conclusion based on its results, but it did not add anything to the bulk of the paper. So I'm left wondering why they did it.

  14. 1. Given the user bias toward the higher ranked documents in web search results pages, why did the authors choose to go so far down the Google results? Documents 42-50 on any given query are not likely to be anywhere near as relevant as a top 10 or even top 20 document.

    2. In experiment 5, the authors removed the third word of longer queries for the lower quality search set. In some cases wouldn't that change the query itself? At what point does it stop being a fair comparison between the two results sets?

    3. Given how little information is provided on Experiment 6 in this paper, what was the authors' reason for including it so briefly if they were not going to go into detail about it?

  15. Experiment 2 takes the same list of documents and swaps two of documents 2-5 with two of documents 6-10. What do the authors hope to achieve with Experiment 2? The Delta value, the difference between the mean owner and mean non-owner scores, appears to be the same for Experiment 2 as for Experiment 1. Can it be concluded that owners are better judges of the ranking order than non-owners?

    It is not clear why Experiment 6 was mentioned in the paper. The authors say this experiment is interesting because it produces data that is counterintuitive and inconsistent with the other experiments. Is this due to the random +1 and -1 values used for the rankings?

    Query specificity is overlooked in the various experiments. Some queries have a higher number of relevant documents than others. In Experiment 1, where results retrieved at positions 42 to 50 are used, specific queries may be a bad choice (as they may not contain that many relevant documents). This might partially account for the gap between the mean owner and non-owner scores.

    The authors talk about avoiding the memory effect, yet end up displaying the unmodified Google results in Experiment 4. Is this a bias? Also, it appears that each user gave one query of his or her choice to the researchers per request, so over the course of this study a user would have given fewer than 10 queries (the users were approached only 7 times in 19 weeks!). How hard can it be to remember so few queries, especially when the user makes an effort to explicitly locate each query in his or her browsing history before submitting it? Avoiding the memory effect thus seems harder than the authors suggest.

  16. 1. Voorhees concluded that for TREC data there was no substantial impact on performance when the authors of the topics were asked to assess the documents. But this paper experimentally shows that query owners do have a significant edge over non-owners in assessments. Does this contradictory evidence mean, "to get the right answer from a system, ask the right question"? That is, should the user be intelligent enough to retrieve what he wants, rather than infusing the intelligence into the system?

    2. “It is not possible to sample queries from the search engine logs and subsequently locate the corresponding users who issued the queries”. Having stated this as a critical constraint, why not use the power of GWAP to do a real-time assessment when a user keys in his own query? The assessment could be evaluated and scored (satisfying criteria 4 & 5 from Christopher and Padmini's paper) based on the query results and their relevance.

    3. “The use of query owners as assessors will become necessary as search engines continue to personalize results based on a user’s location, query history, social graph and other data”. The authors have made a vision statement about future search engines in which it would be necessary to have query owners assess the documents. How realistic is this vision? Having query owners assess implies every real-time user assessing his or her own query. Although it would improve accuracy, how feasible this idea is remains a question to ponder.

  17. 1. Do you think the experiment is biased towards the owners, in that they have an advantage of having viewed the results for the query they provided prior to the relevance judgment experiment(s)?

    2. Here, Google ranking is the standard for relevance judgments – whoever matches Google wins. Do you think this experiment would have similar results using other search engines, or scholarly databases, where queries need to be more formal?

    3. Don’t you think ownership is a very high-level (superficial) way of evaluating relevance judgments? The researchers should have described their users a bit more: apart from ownership, I think aspects like expertise, background, location, and compensation should be taken into account, or at least clarified for the readers.

  18. On page 110, the authors state that, comparing owners with non-owners, you could use fewer owner evaluators than non-owner evaluators to get a more positive assessment. Then the authors say, 'how much fewer depends on the difference between owner and non-owner assessments for the given experiment'. How can we find this data in a non-experimental situation?

    The authors mention that Experiment 6 is different from the other experiments and that it doesn't affect any of the numbers, but then they use it later when discussing statistical inference. I almost feel that Experiment 6 should be removed from the paper; what do other people think?
    The authors say there was no way an assessor would know whether he or she was evaluating a query he or she owned, but if assessors were asked for only 7 queries over the span of 19 weeks, what are the odds that they would remember their queries?

  19. 1- Only some assessors chose to participate, and even fewer elected to remain in the experiment for all six of the variations. Does that skew the results toward assessors who are interested in participating? Does the behavior of eager assessors differ from others in any way that would change the results? What does it say about the experiment design that even professional assessors could not be thoroughly engaged?

    2- The authors make a big leap by relying on Google’s ranking results. As information retrieval experts, wouldn’t they have done better to establish some sort of objective better and worse rankings? Relying on Google’s relative rankings seems lazy at best and scientifically unjustifiable at worst.

    3- My understanding of the authors’ findings is that using query owners can reduce the number of assessors needed to detect an effect or difference between one run/system/ranking and another, if that effect or difference is already quantifiable. Isn’t this somewhat useless, since assessors are often used to determine what the difference between systems is in the first place? If we already know what the difference is and how subtle it is, what would we use assessors to establish? What am I missing?

  20. 1) Given that the experiment uses the personal queries of professional assessors, by how much are the depicted results influenced by the fact that the participants are assessors and know what good queries look like?

    2) While using query owners may improve the accuracy of assessments, wouldn't this approach indirectly reduce the diversity of the search results? After all, the owner already has an idea of what he is looking for.

    3) Is there a conceptual difference between Experiments 2, 3, 4? Or is the purpose just to have more stable results?

  21. 1. In the Introduction, it is said that “the person … have more background knowledge ...”. This does not seem to hold in real life: we may ask a search engine about something we just heard, without any background on it. So can we treat this as an assumption here?
    2. In Section 1, 2nd paragraph, the authors say they began with assessors and asked them to recall their queries. The authors later mention that they tried to remove the queries from the assessors’ memory, but we find no evidence that these measures were effective.
    3. On page 105, when selecting queries, the instructions asked for queries issued for “personal reasons”. Is there any duplication between these personal queries and the queries being rated? If so, what is the impact on the final result?

  22. 1. In this article the authors point out that many of their results differed from previous studies because those studies often dealt with absolute assessments of single documents, while this study deals with relative assessments of groups of documents. Do you believe it is possible to accurately compare the results of this study with previous ones given this discrepancy? If not, how would you go about constructing a study that tested this method but could be compared to previous work?
    2. In this article the authors state that there are two possible reasons owners’ scores were consistently higher than non-owners’ scores: owners either gave fewer -1’s and the same number of 0’s, or fewer 0’s and the same number of -1’s. They observed both of these behaviors during the study. Why would owners be more likely to give fewer -1’s in some cases and fewer 0’s in others?
    3. The authors ran 6 experiments with 6 different methods of constructing the two sets of documents the user would have to choose between. Most of these were just Google search results reordered in various ways. Experiment 6, however, tested an experimental change to Google’s algorithm. Since the change to the algorithm is protected company data, they could not publish its details. Should they still have included the results of Experiment 6 in this paper even though they could not really discuss what it meant? What purpose does including it serve?

  23. 1. What is the contribution of this paper compared with the related work on owner assessment?
    2. The paper uses Google’s result as a standard. Is it a biased standard? Or, can we say that this paper is just for Google?
    3. It seems that Experiment 6 is not complete, so the conclusion from this experiment is also questionable, especially given its inconsistency with Experiments 1-5. Is it proper to include such incomplete information in a research paper?

  24. 1. The authors provide a novel approach with which they claim to reduce the number of assessors required to evaluate a search engine, by enlisting query owners. Beyond doubts about how far this experiment can be scaled, does it not introduce a lot of bias in judgement? How can an assessor provide an objective assessment given that he is the author of the query and there might be an inherent meaning to ‘his’ query?

    1.1 For example, a query on “Harry” can mean Harry Potter to a user A, and Prince Harry to user B, if we use the services of user A, are we not biased towards the former?

    2. In the insertion Experiment 3, the authors insert G11 and G12 into the positions of G2 and G3. When the assessor compares this with the high-quality result (G1-G10), it follows that there is little qualitative difference between the two, even though quantitative measures such as MAP might differ (most of the relevant documents are still present). Is this qualitative similarity the reason why there is no reduction in the number of assessors needed to establish statistical significance for Experiment 3? Does introducing query owners improve the system qualitatively while compromising the quantitative measures of the search engine?

    3. What is your comment on the cost of obtaining query owners for assessment compared with the cost of randomly choosing assessors, given the significant tasks involved? Is ~20% a significant reduction in the number of assessors, keeping in mind the effort and cost associated with procuring query owners? And how confident can we be that the reduction in the number of assessors does not correspond to the introduction of bias in the search results?

  25. 1) This is probably my favorite paper of those we have read so far, both in terms of the ingenuity of the idea and the experiments chosen for evaluation. Although I understand why the authors did not analyze why certain experiments performed better than others (the point was to show that owner judgments can be more useful than non-owner judgments), is it possible to draw some conclusions about why Experiments 3 and 4 did not perform as well as the others? It might be because both are insertions, so non-owners can more easily identify the difference between the two result sets and infer which should be better.

    2) I was impressed that they took into account the memory effect associated with reusing exactly the same results (done in Experiment 1). This particularly struck a chord with me because it removes the added subjectivity of having owners do the analysis with a “cheat sheet” (their memory). Even with this stipulation, the owners still performed better than the non-owners, which strengthens their argument. However, in Experiment 3, where the actual Google top-10 results were used, both sides performed almost equally. Why might this be the case? Is it a side effect of using insertions, as referenced in my first question, or is there some other reason?

    3) One other thought that persisted with me throughout the duration of the paper: is it possible to design an effective experiment that involves live users? In the conclusion the authors mention that it is already difficult to utilize owners, but it shows promising results. Using live users seems like a natural next step and might have an additional benefit. Specifically, such users are inherently most motivated to actually find a relevant result, since finding it benefits them directly. Using owners partially provides the benefit of live users, but in the end what they determine to be relevant can still only be trusted in good faith (i.e., as long as they come up with something reasonable they will still be paid).