Concepts of Information Retrieval -- UT Austin INF384H (Fall 2013): 19-Sep Google’s Search Quality Rating Guidelines. Version 1.0.

Wednesday, September 11, 2013

32 comments:

Sam Blazek, September 15, 2013 at 12:11 PM
1. This goes back to a question on Saracevic (EDIT: fixed name typo). Google recommends in section 2.0: "Important: If you use a search engine to research the query, please do not rely only on the ranking of results that you see displayed on the search results page." While this is a noble intention, it is unrealistic in two ways: one, it attempts to reduce the amount of "learning" taking place during the actual relevance judgment process to 0, which may be impossible for many queries (interaction of treatment and treatment); and two, by requiring users to develop a prior knowledge of the query topic, it renders the relevance judgments distinct from those of users who are, in fact, trying to learn about a topic. This also impacts any time-based studies that may be performed, à la Smucker and Clarke, and also impacts queries particularly in the "Know" category of user intents.

2. Are there quick, automated means of discovering user intent when multiple interpretations for a query exist? Knowing intent may assist in identifying which interpretation of a query is sought by the user, so that alternative interpretations can be more highly ranked when users are seeking them.

3. How does Google handle the ranking and relevance assessment of search outputs from other search engines? At one point, it mentions that other search results may be very relevant for users; however, you never see "Bing/Yahoo/DuckDuckGo Results for: ..." in Google search results. Are there specifically blocked search engine links? Are they treated as "webspam"?
Jessica_May, September 15, 2013 at 2:21 PM
1. The mention of having anti-virus and anti-spyware on the computer (p. 6) was the first point at which I realized that the raters are using their own private resources to complete this task for Google [First, is this a correct assumption?]. This brings to mind a recent article (http://www.newyorker.com/talk/financial/2013/09/16/130916ta_talk_surowiecki) about the rise of the “sharing economy,” where different companies, mostly small start-ups, use consumers' own dormant resources (for example, a car that may be driven only an hour or two a day becomes a “cab” in a ride-sharing company). How else does the sharing economy figure into IR and IR evaluation?

2. Section 3.0 discusses flagging language in documents, leading me to a few questions on the topic. First, the article states that “acceptable language” should be flagged with “other languages that are commonly used by a significant percentage of the population in the task location” (p. 12). Does this include Spanish in certain regions of the US? Second, when documents are flagged as “foreign language,” are they no longer considered to be relevant? Ellen Voorhees voiced concern in “The Philosophy of Information Retrieval Evaluation” (in the section on cross-language test collections) about how the effectiveness of pooling is compromised as certain languages are favored over others. Are Google Translate and other search engine translation features working to include more results from foreign sources? How are these features and applications being used to address these issues?

3. An enormous portion of this article is devoted to defining, recognizing, and flagging spam. Spammers employ countless techniques, including copied content from legitimate sites pumped with paid ads, pages that just contain freely available feeds (like RSS) with PPC ads, doorway pages (p. 37) that lead nowhere, templates, and copied message boards. My question involves the level of profitability of these types of sites. What kind of traffic do they get? Is it actually worthwhile to the advertisers and hosts? They seem so obscure and ineffectual. Is that because Google is doing such a good job of suppressing them?
Anonymous, September 16, 2013 at 7:55 AM
In introducing the purpose of search quality rating, why does Google emphasize that participants’ ratings don’t directly impact Google’s search rankings or ranking algorithms?

The guidelines for this search quality rating are very complicated and minutely detailed, and the rating tasks are completed through an online system. So, I’m wondering how Google controls the quality of these assessors’ rating tasks here.

In this paper, it’s obvious that topicality is not the only factor considered by Google when its assessors judge relevance. In this sense, does it mean that the approach to judging relevance adopted by TREC is actually inaccurate?
Unknown, September 16, 2013 at 3:49 PM
1. These guidelines draw attention to the interplay between relevance judgments and 'user intent'. Google is adamant that its raters consider dominant and common interpretations very carefully, and weight minor interpretations much less (the example is often that of 'windows' or 'kayak'; in both cases the companies are dominant interpretations). In practice, we would expect interpretations (dominant ones especially) to have lots of overlap across the population. However, can we use these interpretations as reliable relevance judgments or is there a catch to that?
2. An interesting type of 'navigational' search is presented, via the 'Vital' category. This category serves the purpose of a navigational search wherein only the vital document is necessary, but also another kind of search where a vital document should be returned along with supporting pages (for example, if we type Microsoft, the Microsoft homepage should be a vital document but not the only thing that's potentially useful). I don't yet see a ranking metric that can capture this second category. Assuming that we want the vital document to be the very first item on the list, the mean reciprocal rank or the DCG metrics try to capture relevance declining with rank, but in this case, we would have something like an exponential decay (or something like a smoothed step function that would consider batches of vital or highly relevant documents but assign extremely low weights to the ones beyond). Would modifying the denominator in the DCG be enough or has something else been proposed to capture this kind of situation?
3. The do-know-go profiling of user intent seems intriguing as an objective measure. Since the publication of these guidelines, has anyone tried to measure relevance of pages along those dimensions and evaluate algorithms based on their ability to distinguish between queries that fall in one category or more? I would suspect algorithms that have that discriminative ability would also rate high on usability metrics.
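To make the question about the DCG denominator concrete, here is a minimal sketch that treats the discount as a swappable function. The numeric gains (4 for Vital down to 0 for Off-Topic or Useless), the `exp_discount` function, and its `half_life` parameter are all assumptions made for illustration; they do not come from the guidelines or from any published metric.

```python
import math
from typing import Callable, Sequence


def log_discount(rank: int) -> float:
    """Standard DCG discount: 1 / log2(rank + 1), with ranks starting at 1."""
    return 1.0 / math.log2(rank + 1)


def exp_discount(rank: int, half_life: float = 1.0) -> float:
    """Hypothetical steeper discount: the weight halves every `half_life` ranks."""
    return 0.5 ** ((rank - 1) / half_life)


def dcg(gains: Sequence[float],
        discount: Callable[[int], float] = log_discount) -> float:
    """Discounted cumulative gain of a ranked list under the given discount."""
    return sum(g * discount(r) for r, g in enumerate(gains, start=1))


def reciprocal_rank(gains: Sequence[float], vital_gain: float) -> float:
    """1 / rank of the first result whose gain reaches `vital_gain`, else 0."""
    for r, g in enumerate(gains, start=1):
        if g >= vital_gain:
            return 1.0 / r
    return 0.0


# Assumed gain values: 4 = Vital, 3 = Useful, 0 = Off-Topic or Useless.
vital_first = [4, 3, 3, 0]
vital_third = [3, 3, 4, 0]

for name, disc in [("log", log_discount), ("exp", exp_discount)]:
    ratio = dcg(vital_third, disc) / dcg(vital_first, disc)
    print(f"{name} discount: vital-at-rank-3 keeps {ratio:.2f} of the score")
```

Under these assumed gains, pushing the Vital page from rank 1 to rank 3 costs more under the exponential discount than under the logarithmic one, which is the behaviour the comment asks for. Whether changing only the discount is enough is still an open question.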
Unknown, September 16, 2013 at 4:30 PM
1. According to the instructions, user intent (go, know, do) is inferred in order to give guidance on relevance judgment. However, user intent is often mixed. A user who wants to know some information may want to download the documents as well. And a user who wants to navigate the web pages may want to gain some information from them at the same time. If so, is it necessary to divide the user intent into these three types?

2. What is the background of these raters? Since raters need to make judgments based on their own knowledge, there might be a strong bias in their relevance judgments. How does Google select a representative population of raters to avoid the bias? Unratable is a rating used by the raters. However, in reality, results which are unratable are no more useful than off-topic or useless results, since none of them provides any useful information to the users. So why do we have a separate category for the unratable results? How does Google handle these unratable results?

3. It is mentioned that a lot of spam pages are developed to make money. Those spam pages try to confuse the search engines with hidden text, keyword stuffing, multiple frames, etc. Since most of these tricks are based on an understanding of how search engines work, are there any studies which try to modify the behavior of search engines to reduce the probability of returning those spam web pages as results?
Unknown, September 16, 2013 at 5:54 PM
1. When a teacher wants to emphasize instructions on an exam, I have found they always bold the information. The change in text naturally draws your attention and I place more weight on the emphasized sentence than anything else on the page. Outside of key words, only two sentences are in bold font in this document: the importance of utility and protecting your computer from malicious attacks. The second is an obvious concern from browsing the web in general. The first is directly related to the whole point of the rating system. The emphasis placed on utility is understandable since part one of the document is entirely dedicated to laying out the scale to quantify utility. Google made a point of saying in multiple places that the scores raters give are not directly used to change the ranking of Google search results. However, the author felt the need for the reader to know that utility is really the point of concern. Throughout the document, raters are also instructed not to struggle over some of the other flags they are required to report. It would be interesting to know how minor the other feedback areas are in comparison to the utility score a document receives, but we do not have access to this information. In addition, location was mentioned all throughout the document, both in the context of the users’ physical location and any location the query may specify. Therefore, can location be considered the biggest factor impacting utility?

2. In class, we discussed the eventual move from binary relevance judgments to graded relevance judgments for IR evaluation studies. It was mentioned last Thursday that raters felt a lot better about their decisions when a third degree of relevance was added. In addition, we discussed how too many degrees of relevance can be confusing for the user. In a few studies involving graded relevance as well as Google’s scale, five degrees of relevance are used. Is there a particular study or origin to a five-point scale? Is it based on the fact that five star ratings are used on a lot of evaluation sites? When a rater has to mark a website as spam or not, the rater has three different options: spam, maybe spam and not spam. Is this motivated by the same psychological reason people prefer a three point scale to a two point scale for relevance judgments?

3. As I was reading the document, one sentence really caught my eye. When discussing Google’s graded relevance schema, the author instructs raters who are torn between two ratings to go with the lower one. Does Google propose a conservative default behavior to “fail gracefully” as we have mentioned in class? It has been shown that negative experiences are what motivates a user to switch search engines despite any number of successful web searches. Therefore, whenever a rater is on the fence about how relevant a document is, is it really less negatively impactful to rank the document lower? Or is it as simple as when a rater can’t decide, odds are the document is not overwhelmingly relevant and thus a lower rank is typically warranted?
Kari Beets, September 16, 2013 at 8:29 PM
1. The rater guidelines continually mention the phrase “user intent.” How is the rater supposed to determine “user intent”? How would a rater know if they are understanding the query correctly if a user inputs misspellings or doesn’t know all the search terms? What do you think about Google’s reliance on this idea?

2. In talking about Vital pages, the rater guidelines state, “Most Vital pages are very helpful. However, please note that this is not a requirement for a rating of Vital” (p. 12). Why would Google promote an official page if it was not “very helpful”? If designing a system, would you promote official pages by putting them higher in the results?

3. How do you decide what are major and minor interpretations? If there are enough raters, does this lessen the weight of any individual rater's judgment? How should the ratings be averaged or combined? How do you think Google should combine rater judgments with other evidence such as clickthroughs?
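As a starting point for the question of how ratings might be averaged or combined, the sketch below compares three simple aggregates over one page's ratings. The numeric mapping of the scale labels (0 to 4) and the choice to skip Unratable labels are assumptions made for illustration; the guidelines do not say how Google combines ratings.

```python
from collections import Counter
from statistics import mean, median

# Assumed ordinal mapping: the index in this list is the numeric score.
SCALE = ["Off-Topic or Useless", "Slightly Relevant", "Relevant", "Useful", "Vital"]


def combine(labels):
    """Aggregate several raters' labels for one (query, page) pair.

    Labels outside SCALE (for example "Unratable") are skipped.
    Returns None when no usable rating remains.
    """
    scores = [SCALE.index(label) for label in labels if label in SCALE]
    if not scores:
        return None
    most_common_score = Counter(scores).most_common(1)[0][0]
    return {
        "mean": mean(scores),        # treats the scale as evenly spaced
        "median": median(scores),    # only assumes the scale is ordered
        "majority": SCALE[most_common_score],
    }


print(combine(["Useful", "Useful", "Relevant", "Unratable"]))
```

The mean assumes the distance between Relevant and Useful equals the distance between Useful and Vital, which the scale does not guarantee; the median and majority vote avoid that assumption but discard information about how far apart the raters were.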
Unknown, September 16, 2013 at 10:26 PM
1. In this search quality evaluation guideline, location is an important feature and is referred to many times in both Parts 1 and 2. Locations are taken into consideration both explicitly (location in query) and implicitly (task location, which can be gathered by checking the IP address). Instead of restating the importance of locations, here I care more about how we can record location information in our index. For example, Part 2, Section 3.1 talks about the difference between the query [facebook] and the query [ice rink]. The reason why the two queries are different is that [facebook] is not associated with a specific location, while [ice rink] is location-specific. From this statement, it seems to me that each keyword in the index is associated with an attribute which shows whether it is a location-based query or not. Is it true? Is there a better way for us to represent this kind of information in the back-end?

2. My second question is more of a general question about how we should treat this search quality evaluation guide. The guideline is written by Google, and Google uses it to test its own search engine. In modern search engines, many optimizations are rule-based (different features are added to adjust the search results). Is it possible that, to some degree, this guideline overfits the current search results? For example, suppose that when Google was writing this guideline, they found some features missing in their search engine. If they then used the guideline to test their own search engine after the adjustment, they could get a higher performance, which does not seem quite objective to me. Taking the question one step further, how do we avoid overfitting for this kind of rule-based testing in our own experiments?

3. My third question concerns Sections 3.1.1 (Copied Text and PPC Ads) and 3.1.2 (Feeds and PPC Ads). These two parts discuss suspicious content as well as PPC ads, and the guidance sounds vague and somewhat unrealistic to me. As we all know, PPC is widely used by Google for advertising. Many people add Google advertisement widgets to their own websites, and Google pays the site owner when someone clicks an advertisement. Meanwhile, many people like to copy good articles to their own blogs in order to save or archive them. Now suppose that in my personal blog I often post articles from other sources, and I also have some Google advertisement widgets in my blog posts. Does it mean that my blog is spam? Why or why not?
Unknown, September 17, 2013 at 9:58 AM
The article classifies user intent into three possible categories, namely Action intent, Information intent and Navigation intent. Though classifying queries in this way can help raters figure out how to rate a web page, I think the classification will be quite ambiguous in some cases. For instance, take the user query “Sydney Opera House”. Does it mean “purchase a ticket to the Sydney Opera House” (in this case it is an Action intent), or “navigate to the home page of the Sydney Opera House” (in this case it is a Navigation intent), or does the user simply want to know a bit more about the Sydney Opera House itself (Information intent)? If the classification is sometimes unclear, how can raters decide that one intent is more relevant than another and thus give different ratings to the search results?

In the rating scale, the article has a rating option of Vital. The Vital rating requires that the query have a dominant interpretation. On top of that, the dominant interpretation of the query has to be either navigation or entity. I think this rating option should be equivalent to very useful or highly relevant. But in reality, does this rating match the expectation? Is it possible that some of the Vital pages might not be useful or even relevant?

Queries might carry some explicit temporal information, in which case it might be easier to determine timeliness and return the most relevant pages. However, what if queries have implicit temporal information? I am wondering how, in this case, to determine what the query demands (for instance future events, most recent events or past events).
Natalie Miller, September 17, 2013 at 10:21 AM
The way Google addresses porn is very interesting. I agree that stumbling across unintentional porn is very disturbing, and that oftentimes that is not what the user is looking for, but reading through this, Google repeatedly emphasizes that the non-porn interpretation of the query is dominant, even if the evaluator thinks the user is looking for porn (pg 25). These queries seem to be always rated 'Off-Topic' or 'Useless'. I think this is very interesting considering how the internet is supposed to be this democratic entity, and here Google is basically forcing its own morals/values onto its users. Granted, I think these values are the norm for many cultures; it just makes me wonder what other values are being placed upon users without them even knowing. Should Google do this? What other value judgements and policies is Google directing in regards to the internet/society as a whole?

What is the radius where a local search becomes off-topic? Google uses the example of US/London, but what about city to city? Area of a city? County?

In regards to mobile web sites, it looks, according to this Google document, as though they are automatically put in the category of 'Slightly Relevant' because they offer less content and the functionality of the page is different from the desktop webpage. I can think of quite a few mobile sites that are really relevant and designed specifically to provide the minimum amount of information needed on the go. Just because the website offers less information doesn't make it less relevant (precision vs. recall). Perhaps these are more precise pieces of information that a user would need while being mobile. And now, as more and more websites have mobile versions, I feel like this rating system needs to change to accommodate the changes in technology and reflect a more accurate use of the mobile web. What does everyone else think? Should a website be reduced in relevancy because it has less information or more precise information?
Ashwini Kamath, September 17, 2013 at 1:04 PM
1. Do you think there is a difference between relevance and usefulness of a search result? In that context, what do you think of using the terms interchangeably the way Google has in its rating scale?

2. Don’t you think a lot seems to be left up to the user to “interpret” while making relevance judgments, especially user intent? Asking the users to interpret queries by “doing a little web search” (pg. 8) seems to defeat the purpose of the whole exercise. Also, the definition of dominant interpretation seems flawed. Depending on the user’s context (location) [windows] may not necessarily mean Microsoft Windows, but an opening in the wall. Also when a user types [window], should that be interpreted as [windows] and thus Microsoft Windows? Is there a way to define queries with more than 1 common meaning, for better evaluations?

3. A whole chunk of these guidelines is devoted to marking potentially harmful content – malicious and spam. Am I correct in saying that these are relative terms? These terms depend on a lot of personal concepts like religion, beliefs, location, education, background knowledge, etc. Does the recruiting process look at balancing these biases and preconceived notions that (might) influence judgment? And then does Google review their assessments to minimize personal biases?
Unknown, September 17, 2013 at 2:27 PM
The Search Quality Evaluation documents (Version 3.27 and the 43 page version 1.0) make for an interesting read and comprehensively explain the importance of rating the search in order to improve the Search Engine performance. Some points that might lead to interesting discussions are:

1. In section 2.0, there are guidelines for understanding the query by understanding user intent, knowledge of user location and the knowledge of the possibility of a query with multiple meanings. There is no mention of queries with no meaning, i.e., queries which have symbols, encrypted words, etc., which might have no inherent meaning but might actually lead to some relevant results. It is understandable that these form a minority of all queries, but how are these kinds of queries handled? And how are the search results for these queries rated?

2. In section 5.1, the document advises the rater to assign the lower rating when the rater is torn between two ratings. What is the justification behind this instruction other than that it might reduce the time for rating a document? In the subsequent line, the document instructs the assessor to use his 'best judgment' when choosing among 3 ratings is difficult. Is this not an apparent ambiguity in the guidelines?

3. What are some other rating scales used by other search engines for evaluation? Is there a significance to a 5 point rating scale (excluding unratable documents)? Bing also uses a 5 point rating scale (http://searchengineland.com/bing-search-quality-rating-guidelines-130592) in its rating matrix.

Haofeng Zhou (Rabby), September 17, 2013 at 2:38 PM
1. “Useful” and “relevant” are two scales in the rating schema. But the difference between the two scales is not obvious and the evaluation results could be too subjective. Why not design a schema without overlapping scales? What is the benefit of using the “useful” and “relevant” scales?

2. Understanding user intent is a requirement for the raters. However, users are diverse, and so are users’ intents. Raters should also base the query interpretation on their understanding of the user’s intent. Is it possible for the raters both to represent the diverse users and to interpret the queries in a commonsense way?

3. The user intent for a query is classified in three major categories, that is, action intent, information intent and navigation intent. However, it’s usually hard to distinguish between action intent and information intent as action intent always follows information intent; and navigation intent is always followed by information and/or action intent. What are the benefits of such a classification?
Lucy, September 17, 2013 at 2:49 PM
1. It’s said in the preface that Google relies on raters to help them measure the quality of search results, ranking and search experience. The question is, shouldn’t users’ search logs be the best and most reliable source for evaluation? What problems can raters solve that users can’t?
2. When explaining the “useful” scale, it is said that useful pages often have some or all of the following characteristics: highly satisfying, authoritative, entertaining, and/or recent (4.2). Why should we take “entertaining” into consideration when evaluating the usefulness of a web page?
3. A rating of slightly relevant should be assigned to mobile landing pages (4.4). Does it mean that no mobile landing page can be useful/relevant even if the content fits the query very well and is helpful to most users?
Unknown, September 17, 2013 at 3:33 PM
1. Google encourages raters to research their query before rating pages, but not to take top results from their research into ratings just because they appeared near the top of their research results. If raters are supposed to act as users, wouldn't the top results be useful to someone who is typing in that query in a search engine? Wouldn't the user want to learn about the topic just as the rater is?

2. It seems like Google has gone to great lengths to detail every step of the rating process (even in the shorter rater guidelines). Given the success Google has had with their search engine and rating process, why don't more researchers adopt their techniques when it comes to evaluating their IR systems?

3. When talking about Webspam, Google seems to want to minimize the spam that they have in their results and root out tricky techniques that webmasters use. Google mentions a scenario where a Vital page can also be given a Spam flag. If a website is Vital to the query, but seen using Spam-like techniques, how does Google deal with that in their algorithm?
Unknown, September 17, 2013 at 5:08 PM
1. When Google relevancy assessors judge the importance of a page, they are asked to consider what most of the people typing the query would like to receive. How correct are our intuitions about other people's query intentions?

2. Some of the earlier readings have suggested that there is a significant degree of annotator disagreement about document relevancy. Is this also the case with Google’s relevancy judgements, given that they’ve provided much more precise criteria for judges? How many times must a document be assessed by different relevancy judges before intuitions about population behavior can confidently be abstracted?

3. One of the tasks that assessors are made to do involves indicating whether the location of the user’s query is important for results. With certain queries we would anticipate this to be the case (e.g., queries involving stores, events, restaurants, and organizations), but what should happen when there is no relevant local document? Obviously an assessor is not going to mark a non-local store’s page as relevant, but what information should be returned to the user? If we are confident that they are searching for a local store and an IR process returns nothing, should a system tell the user that there is no local store? Should the nearest non-local store be returned? How far away does something need to be before it is not local?
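On the question of annotator disagreement raised in point 2 above, one common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration; the rating labels are made-up example data, not actual Google rater output.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each rater's own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up example: two raters, four pages.
rater_1 = ["Relevant", "Relevant", "Off-Topic", "Off-Topic"]
rater_2 = ["Relevant", "Off-Topic", "Off-Topic", "Off-Topic"]
print(cohens_kappa(rater_1, rater_2))
```

Here the raters agree on three of four pages, but half of that agreement would be expected by chance given how often each uses each label, so kappa is well below the raw 75% agreement. Kappa treats the labels as unordered categories; a weighted variant would be needed to give partial credit for near-misses on a graded scale.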
Unknown, September 17, 2013 at 5:22 PM
1. User intent is a fathomable factor, since it comes across as more obvious when page utility is used as a metric; what I am curious about is the intangibles. For instance, how does Google forge ahead towards building the taste profile of the individual while keeping in mind the aesthetics? How does Google justify categorizing a user's intent based on their previous searches and whether those were driven by dominant, minor or common interpretations? A dynamic test collection would be needed in this scenario, so does the fact that Google bases so much of its analysis on interpretations as relevance judgments cause an oversight, especially when we are dealing with building a user taste profile?

2. We've seen how Google makes use of a Sandbox and an escape hatch when dealing with misspelled queries. From this paper, it does seem that much of Google's search ranking is based on the creation of an inverted index framework that builds indices and retrieves information. This being the case, how does Google go about re-writing the queries when they don't match the 'standard' format? And to what extent? Also, what is the algorithm involved that helps map nested queries, especially when there is also an NLP parser which is functional and providing suggestions and that the ranking function is optimized by tanking the cost of interest?

3. The paper does present a rather thorough overview of the various pages that could be categorized as webspam. I'm curious to know how Google deals with user-generated spam. Since there is an enormous amount of syncing that Google supports as a forum, aren't the potential cases of profile spam much higher? For instance, could a simple case arise from not setting the 'nofollow' attribute? And, with the internet being all about social media now, to what magnitude does this affect guest books and blogs, given that it would be trivial for spammers to build links and thereby produce a gigantic amount of comment spam?
Unknown, September 17, 2013 at 5:50 PM
1. Section 1.4 states that raters must represent the user and thus the task locations and languages mimic those of users living in the same area or country. How does this take into account regional differences in countries or users who are using their mother language in a different country?

1A: Section 2.4 also mentions informing an employer if your knowledge of a location or user a rater is supposed to represent is lacking. How does the system take into account people who don’t follow this and simply engage in rating without knowledge of the language or location?

Section 5.2, Location is Important, touches on my previous question about locations and regional differences on a country-by-country and even city-by-city scale. But even in the US, people in different regions use different names for the same products or things. Off the top of my head, I know various regions call carbonated drinks pop, soda, or coke. Part 2, section 1.0 (query locations) goes further into this line of thinking, but does the system take into account those differences (soda, pop, coke) based upon location as well, and how would a rater handle such a judgment within an area that has overlapping dialects or definitions?

The conclusion states that spam recognition is a skill that can only be gained through “practice and exposure.” This leads me to believe that a rater remains a part of the program for an extended period of time. How often are new raters brought into the program to provide different relevance judgments within the guidelines issued by Google?
Dheeraj Borra, September 17, 2013 at 5:57 PM
I read the 43 page version of the Google rating guidelines.

Here are some questions on guidelines to raters that might be interesting to know how they are implemented.

Pg 22: For misspelled and mistyped queries, the guidelines suggest taking a correct stance to user intent. Is this not the same as the aggressive search criteria used by Google (As shown in the first class!)? For queries that are not “obviously” mistyped or misspelled, the raters are asked to not make assumptions regarding user intent. How does the search engine differentiate between the obvious and not obvious misspelled queries before deciding on user intent?

Pg 24: How do search engines take care of old and new pages? Do they swap the priority of old and new pages at a particular point in time? Or is it based on the user queries and the corresponding chosen relevant event page (feedback)?

Pg 32: In some specific cases, a page can be rated vital and marked for spam. How do search engines trade off page utility with spam? How is the user expected to react for such urls?

In the case of queries with multiple intents (Do-Know-Go), the raters are asked to make their best judgement of query meaning based on the perceived user intent. Does this not make the judgements across different raters inconsistent? For a ‘Know’ query an official blog might be the best source of detailed information, whereas for a ‘Do’ query the download page is a better landing page. Do the dissimilar perceived user intents lead to improper evaluation of search engines? Should user judgements in validation and rating guidelines be curtailed to very specific situations?

It is debatable whether the raters chosen would be a representative set of the general population.
Does a search engine company (here Google) select raters from different backgrounds, different academic levels and different demographic groups? How often do they run these rating experiments (keeping in mind the evolving trends)?
I believe the question of the ‘proper’ representative set is in itself a tantalizing problem!
Unknown, September 17, 2013 at 6:35 PM
1) When advising about user intent, the guidelines state that raters should not struggle, that they should “give their best rating and move on.” This statement seems to introduce a lot of noise into the ratings, since every rater will be in that situation at some point. Wouldn't it be better for the rating data if raters skipped queries they have difficulty rating, since in those cases they are not representative of the user population?

2) Is it fair to assume that “Unratable: Foreign Language” pages in a locale are negative instances in the learning algorithm for that particular locale? What would be the benefits and drawbacks of doing so?

3) The guidelines show that user experience is important and if raters feel uneasy about a page, that page will likely be omitted from the search results. However, when discussing flags, the guidelines make a binary judgment for porn and malicious flags but a graded judgment for spam. Is there a reason for this?
Unknown, September 17, 2013 at 8:21 PM
1. The guidelines mention that “These evaluations will only be useful in evaluating the search engine and do not affect the rankings of the page/URL in the result set being returned.” Why shouldn't this component be used to some extent? After all, these evaluations provide a relevance measure from the user’s perspective.

2. The guidelines explicitly instruct testers to think from the user’s perspective and to consider the user's intent when querying the search engine; this suggests that the people involved in the testing would be biased.
Additionally, as mentioned by Mark D. Smucker, average users are very different from the ones we do testing on. An example is the observation in their testing that some testers were very fast and were able to read and judge the search results in a very short period of time. This indicates that the user base as a whole is not being represented, even in the testing efforts coordinated by institutions like Google. How, then, can one make sure that the tester base being set up is unbiased and consists of a diverse set of people representing the diverse crowd the service will cater to?

3. Section 3.1 of the guidelines describes when location matters in a query. But what if, for a given topic, the search engine returns nothing when the query is interpreted by location? Carrying forward the example on page 31 about ice rinks: if someone living in the Sahara desert asks about ice rinks, he is probably not asking about the ice rinks closest to him but about what they are, or for basic information about them. The search engine needs to handle the fact that the same query may or may not be looking for location-based results. How will the answers be rated by testers in this scenario? Might the completeness of this document restrict the graders' scope of thinking rather than help them?
Aashish Sheshadri, September 17, 2013 at 9:02 PM
The guidelines place heavy emphasis on, and provide detailed direction for, identifying malicious/spam content on the web. On page 6, there is mention of the specific ratings not necessarily impacting the ranking algorithm; page 7 says “Do not struggle with each rating, give your best rating and move on”. It is surprising to me that the same level of detail is not evident in the relevance model used, where there is clearly a lot of ambiguity.

I would have expected the rater’s familiarity with the query to be recorded, along with difficulty, incompleteness, etc.; is there a reason for not doing this?

The basis for a lot of content on the Internet is revenue generation. Copied content can have value too: there can be useful compilations of the diverse content already present on the web. Is it justified to flag such content as spam because of financial intent?
Tomasz Kalbarczyk, September 17, 2013 at 9:04 PM
1) As per the guidelines, each rater must be familiar with the “user intent” associated with each query. Without being provided any guidance for the user intent, how is Google able to determine that the raters are suited to the specific queries they are given? Taking one of their examples, if someone searches “patriots,” my knowledge of the NFL makes it easy for me to determine what the user was looking for, but that might not be the case for another rater. Does Google look for raters that are “good searchers,” or does it also try to find raters that are “bad searchers” (spending lots of time searching, reentering their queries, etc.)?

2) For the five levels of its rating scale, Google uses various adjectives. To guide the rater, Google uses other levels of relevance (such as tiers of interpretation) and expands upon what is meant by each of these adjectives. Why did Google choose these specific adjectives for its scale? They might be more descriptive than the generic Extremely Relevant, Very Relevant, Relevant, Slightly Relevant, Not Relevant or 1 to 5, but I can imagine they trigger different impulses depending on the user, and seem to add an additional variable.

3) According to the guidelines, Google expects users to vet results for malicious sites, and even has an entire section devoted to identifying spam sites. Raters are told how to disable CSS and JavaScript in addition to identifying redirects and cloaking. All of this seems technical at least for the average web browser user. As such, might this have the effect of making raters more representative of the “tech savvy” user base rather than the average Google user, causing other factors of their relevance judgment to come into question? Would vetting sites not benefit from a more automated approach, and then perhaps a targeted user evaluation of the spam filtering?
Unknown, September 17, 2013 at 9:13 PM
1. “Information or “know” queries may be about recent or past events. The landing page should be rated based on its fit to the informational need of the query. So most of the times one may need to consider the content of the page rather than the date on the page”. Rating a page by its content is of primary importance, but for an information query one should consider both new and old pages. So would it not be more appropriate to classify pages by date first and then rate them? Why is date not considered an important rating criterion?

2. In the section about commercial intent, “affiliates” are mostly considered to be websites that exist primarily to make money. They display content from other merchant sites such as Amazon or eBay and redirect to the real merchant page. They may not be intended just to make money through “PPC – pay per click” but may actually be a marketing strategy deployed by the real merchant. How can one distinguish between the two while rating (without clicking on these, as they are suggested to be marked as spam)?

3. Rating skills are developed through practice and exposure. Judging and rating a search result depends heavily on the rater. Even though Google has shared useful information in its search quality guidelines, the most important tool that can guide one to rate a page is “intuition”. This is evident from the various guidelines mentioned in the article, but then how relevant can this be for a beginner who is just venturing into the field of information retrieval?
Unknown, September 17, 2013 at 9:34 PM
1. In this document Google states that scores given by raters are not used to evaluate a site's ranking but are instead used to calculate the effectiveness of the search engine. However, this raises the question: why are they not ranking the effectiveness of specific websites? What would be the benefits and what would be the difficulties of ranking specific websites in relation to their relevance to a query or topic?

2. In this document Google states that the two main methods used to determine if a rater represents a user is the country the rater is in and the language that the rater speaks. Are these two values the only two values that are necessary to describe a typical Google user? What other things would you consider using to determine a typical Google user? Age? Ethnicity?

3. We have seen previously that Google also uses query logs to help rate and improve their search engine. What are the benefits and drawbacks to query logs and the ratings produced by these raters? Are there specific purposes that you would use one method of evaluation for that you would not use the other for?
Unknown, September 17, 2013 at 9:52 PM
1- Some obvious changes were made from the “official” guidelines to the “cliffs notes” guidelines. My question is why bother? The official guidelines were already out, and releasing a public-friendly or politically correct version did nothing but make people look for differences between the two and examine the sections that Google presumably wanted to keep under the radar. A general web user isn’t going to seek out these ratings because they usually won’t care or know they exist. A person in the field of IR, however, would seek out the guidelines, would be able to find the official ones, and would not be fooled by the cliffs notes version. Google does not appear to be run by fools. I assume they thought about this themselves. So what did they think they would gain from this release?

2- The instructions mention that a rater will be using their own computer (because they should have personal anti-virus protection). That means these ratings are not conducted in a controlled environment. What happens if the rater is distracted by their personal environment somehow? Or if they discuss the rating with another person? How can Google ensure the integrity of its ratings? Or does it employ so many raters that it doesn’t matter? Is it part of Google’s plan that raters don’t have direct access to Google people to ask them questions about topics or ratings? Or is it just that there is no possible way Google could house or create access for the number of raters it needs to employ?

3- An implied requirement of a rater is that they are very up to date on common culture and trends and can interpret queries such as ‘Kayak’ as having a dominant meaning of travel booking and not recreational boating. How are raters picked? How is this quality identified and evaluated? Is there a pop culture immersion program or quiz? Or a continuing education requirement so that raters have to keep their hand in?
Aaron Stacy, September 19, 2013 at 5:16 AM
1. The document asks the rater to research the query in order to understand it.
This seems like it introduces a massive amount of subjectivity that doesn't
seem to be tracked in the rater's response. If I was a rater, and I wanted to
understand a query better, the first thing I would do is google it -- doesn't
that completely defeat the purpose? The other options listed are Wikipedia
and online dictionaries, which, while they don't create a feedback loop,
certainly seem biased towards non-current information. It seems like the
research process could largely invalidate the relevance rating, so it should be
reported more thoroughly in the rater's response.

2. The rating system seems relatively complex, and I'm curious how Google
ensures that the raters are following the guidelines. It seems like a large
number of the specifics could be overlooked, and it would be difficult for
Google to even notice. Are the raters audited or monitored? And wouldn't a
quality control process on the raters favor existing search results and harm
diversity?

3. The differences between this and the leaked document seem to fall under the
following categories:

- Removing references to certain sites so as not to endorse them: It seems
like the leaked document had a lot of specific examples, but many of those
were removed so as not to endorse one website over another.

- Removing specifics that could be used for SEO: The public document added a
few statements that seemed aimed at discouraging tricks to increase SEO, but
large sections of specific examples were removed, and it appears the
motivation was deterring people that were trying to game the system.

Is that correct? Are there other possible motivations that I'm missing?