Saturday, November 9, 2013

11-14 Falk Scholer, et al. The Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment.


  1. 1. I had some issues with the "need for cognition" or NFC: How reliable is the self-reporting of these characteristics? Wouldn't some participants be insulted by this classification system? Would this study be complete without this measurement, just examining the threshold priming? In the conclusion, the authors describe the significance of NFC and how the judgments of high-NFC participants correlate with those of the expert judges, but this seems out of place and problematic to me.

    2. The motivation behind only presenting one document at a time to the participants is still unclear to me. Can anyone find a defense for this?

    3. When the documents were initially ranked as highly relevant, relevant, and marginally relevant, how can we be sure that those making this initial judgment were not themselves subject to threshold priming, or that they were high NFC?

  2. 1. It was found that the mean relevance ratings did not differ significantly when the relevance study was performed on two different sections of the epilogue, divided on the basis of time. This indicates that the relevance scores do not vary much over time for the test collection used in this study. But previous experiments by many researchers had shown that time does play an important role in assessment. So the effect of threshold priming depends more on the test collection than on the assessor. Also, this experiment was conducted only on epilogues; would the ranks also play a significant role in the scores? Would the result be different when tested with the prologue set?

    2. The difference in relevance scores varied in the prologue sections when assessed by the high, medium, and low treatment groups. However, this difference disappeared in the epilogue. How can these results be attributed to the cognition effect of the assessors when they could simply imply that the lower-ranked documents (the epilogue) were not evaluated or assessed properly, just as with any normal user, due to a bias effect?

    3. In an experiment where the different treatment groups were studied against the experts, it was found that the participants in the low group differed from the experts by a larger margin than the other two groups. In my opinion these results are incomplete, as the authors have not explicitly stated how the experts' judgments were studied or measured. It is possible that there were disagreements among the experts as well, along with the effect of threshold priming. How were the effects of threshold priming and cognition studied on the experts?

  3. 1. My first question is about the experiments. One conclusion drawn from the experiment is that people’s internal relevance models are impacted by the relevance of the documents they initially view and that they can re-calibrate these models as they encounter documents with more diverse relevance scores. Eighty-two participants completed the experiment, but the participants’ prior knowledge of the topics is not discussed. My assumption is that if some people have more knowledge of the topics than others, their internal relevance models are less likely to be affected by the relevance of the documents they initially view.

    2. My second question is about the usage of the conclusion from this paper. From this paper we know that the documents and information exposed to users can affect the users’ “internal model”. So I am wondering whether we can make use of this in search engines. For example, when we search for some keywords, there are recommended related keywords for the users to choose from. If we can understand how and why people click some related keywords, we can better know what kind of information users want and relate it together using their “internal model”.

    3. My third question is about Section 3.1, Treatment. This section gives the definitions of various levels of relevance: highly relevant, relevant, marginally relevant, and not relevant. This seems vague to me, especially when we relate it to people’s cognition. For example, although the definition of marginally relevant is given, I am still not sure how we can classify one document as relevant versus marginally relevant. I am wondering whether this would affect the accuracy of the experimental results, and how.

  4. 1. With regard to variability in relevance judgments, it seems clear that some variability is to be expected from our readings. This paper describes many factors leading to this variability as inescapable (p. 1), and questions others. However, I'm curious as to what these inescapable factors are. I don't remember any specific findings on this in previous studies. What inevitable conditions in the document and in the assessors can lead to variations? This could be useful to know if one wished to build a predictive model for relevance variation, where one could have an "expected" level of variation and then may seek to examine deviations from it.

    2. Including and beyond questions of user intent, how do our 'internal models' differ for different topics and queries? How do they function and perhaps update differently for different types of search?

    3. So the study examines how need for cognition impacts relevance behavior, and has some very interesting findings in this regard. What exactly do the findings mean, from a theoretical standpoint? I'm intrigued by why the phenomenon takes place, now that they have shown that it does. Is there any related psychological or cognitive research that might help here?

  5. 1. How are datasets chosen? Why TREC 7 and 8? A lot of papers, including this one, do not justify the choice. The datasets used were created more than a decade ago. While the assumption is that their properties remain the same, it would be more convincing if recent qrels were also investigated. Or is the choice of TREC 7 and 8 suitable because the research community itself has been focusing on these datasets, enabling comparison?

    2. An interesting experiment would be to make k passes and investigate convergence and variability. While I understand it does not suit the research question posed, an attempt was made to include duplicates without informing the assessors. One could even study three phases, where the first phase is common to all participants, like a burn-in period.

    3. The notion of bias presented in the paper seems intuitive, but what remains to be answered is the impact on evaluation itself: how does the system ranking change with the small average bias displayed across groups? How important is it for us to consider the effects of variable exposure?
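    The sensitivity of relative evaluation to a small average bias can be checked directly with a rank correlation between the system orderings produced by the original and the biased judgments. A minimal sketch, where the five system scores are invented for illustration rather than taken from the paper:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two systems-score lists:
    +1 means identical ranking, -1 means fully reversed."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) / 2
    return (concordant - discordant) / n_pairs

# MAP-style scores for five hypothetical systems under the original
# qrels, and under qrels where every score shifts down by a small bias
original = [0.31, 0.28, 0.25, 0.22, 0.19]
biased   = [0.27, 0.25, 0.22, 0.20, 0.16]
print(kendall_tau(original, biased))  # 1.0: ranking unchanged
```

    A tau of 1.0 here illustrates the point at issue: a roughly uniform shift in assessor strictness can leave the relative ordering of systems untouched even though every absolute score changes.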

  6. The authors use 82 participants in the user study. However, they are all recruited from the University of North Carolina, Chapel Hill. A study like this requires diversity to rule out bias (e.g., if all participants are from the same user group, they might have similar responses, which would jeopardize the representativeness of the study).

    The authors use 3 topics for the user study. This is a big limitation in terms of generalization. An alternative would be more topics, with fewer participants assigned to each topic.

    To measure the impact of Need for Cognition on the relevance judgments, the participants in the user study should differ significantly in their interest in the selected topics. From the paper (Section 3.2), the difference in interest is unfortunately not significant enough.

  7. 1. This paper presents an interesting study on the effects of the order of assessed documents and need for cognition on relevance judgment. The topics were selected so that participants were not familiar with the subjects. Why were such topics selected? One possible answer is that this makes it easier to study how assessors develop their relevance models. Is there any study using topics that are better known to the assessors? Since results on TREC are usually rated by experts in the corresponding field, does the order of relevant documents really matter?

    2. My second question is about the experts’ judgments, which are selected as the gold data set. What are the backgrounds of these experts, and in what order did they evaluate the documents? Since individual difference (need for cognition) is considered here, is it possible that the higher agreement between HNFC and expert assessors, compared to LNFC and expert assessors, is simply due to the fact that the expert assessors themselves belong to the HNFC group? Also, is there any follow-up study on how the ranking of systems would be affected by these different relevance judgments, which is the most critical result we care about?

    3. Few details are provided on the measurement of NFC (need for cognition), and the judgment-time study between LNFC and HNFC is a little confusing. Many factors can affect judgment time, and it is possible that assessors in the HNFC group know the topics better and are thus more interested in them, which may make judgment time a circular measurement. Also, what is the potential benefit of measuring judgment time? Is there any study on the effect of different topics on judgment time?

  8. 1. The need for cognition test that the authors did seems arbitrary. Did the authors base this on a real psychometric test, and do we have numbers for what a typical score looks like for the general population? It seems like, in a university sampling of experimental subjects, one would expect the need for cognition to be higher than in the general population.

    2. Was assigning graded relevance a good idea? I'm not quite sure we would see any significant difference if a binary relevance scale had been used instead. Is there anything in the paper that explains why we may infer similar results for a binary scale?

    3. It seems like there were a lot of uncontrolled variables. Many users indicated, for example, that it was hard to ignore novelty, or that unappealing displays or document lengths biased results. Wouldn't it have been better experimental design to try and keep these as constant as possible, and even to choose topics that users might have been more familiar with, so that many of these biases could be discounted and users would have found the search more 'enjoyable'? It seems like this experiment is not reflective of real assessor situations since there were so many uncontrolled variables.

  9. 1. The paper asserts that differences in relevance judgements have little impact on relative system effectiveness, but serious impact on absolute evaluation, and that this matters in environments where absolute measures of retrieval completeness and accuracy are important such as e-discovery, patent retrieval, and research literature surveys.

    While I can see how this is important from the "Improvements That Don't Add Up" perspective, I don't see how this difference affects the evaluation's ability to find the best search engine. It seems like at the end of the day, a search engine will be used, it is preferable for it to be better than the rest, and an absolute evaluation is still not needed for this.

    2. The experimental procedure involves showing the user a prologue, a set of documents that are selected based on relevance. In the experiment, this was about 40% of the documents the assessor viewed. Doesn't this practice invalidate the use of pooling, since the documents judged are no longer randomly sampled? And even if the epilogue is not considered as part of the relevance judgments, doesn't it still reinforce the existing search engine's definition of relevance (i.e., hurt completeness), since the assessor is conditioned to a certain definition of relevance?
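    As background for the pooling question: depth-k pooling is deliberately not a random sample to begin with; the judging pool is the union of the top-k documents from each contributing run. A minimal sketch with invented document IDs:

```python
def pool(runs, depth):
    """Depth-k pooling: union of each system's top-k ranked documents."""
    pooled = set()
    for ranked_list in runs:
        pooled.update(ranked_list[:depth])
    return pooled

# three hypothetical system runs, each a ranked list of document IDs
runs = [["d1", "d2", "d3", "d4"],
        ["d2", "d5", "d1", "d6"],
        ["d7", "d1", "d2", "d8"]]
print(sorted(pool(runs, 2)))  # ['d1', 'd2', 'd5', 'd7']
```

    The sample is biased toward what the contributing systems already retrieve, which is essentially the same conditioning concern the question raises about the prologue.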

    3. I wonder if this experiment would yield the same result if the prologue were not so large relative to the epilogue. It seems like there's a good chance that the threshold priming would "wear off" if the epilogue were increased to a size more realistic of what assessors actually have to deal with.

  10. In this paper, several treatments are designed, but these situations are too extreme to occur in real relevance judging. So I have a question: why did the authors not design a control group where the documents are displayed randomly in the prologue?

    The consistent conception of relevance has been mentioned and studied in this research. Here, I am wondering how the consistent conception of relevance for a topic could be measured precisely, since it is very subjective and seems difficult to measure directly.

    It is suggested in this paper that participants could be selected based on their NFC score; however, there is a problem here: what benchmark can be used to determine who is suitable to become an assessor?

  11. 1. Even when making use of threshold priming, how do we account for the variation in relevance assessment that results from the order in which the documents to be assessed are presented to the users? Also, how does making use of threshold priming hope to deal with the disparity in results that arises from varying densities of relevant documents?

    2. I think importance needs to be given to assessor characteristics: how long it takes them to assign relevance to the documents, what their areas of specialization are, what their conceptualization of the relevance model for the topic being searched is, their attention span, and how the act of searching itself seems to impact the relevance judgments they make.

    3. The paper does elaborate on what would make a document 'relevant', and my last question is with regard to this quantification of relevance. What are the questions that a document cited as relevant would be required to answer? And how much information would suffice? Does the 'relevant' document need to deal with the topic holistically, or would it be all right if the document just caters to a specific subsection of the topic?

  12. 1. In their evaluation, the authors focus on whether or not there is a significant difference between the high and low treatments they applied to the ranked document lists. However, the authors also had a medium treatment. Although they formulated a hypothesis related to the high and low treatments, not much focus was paid to the medium treatment. Is the assumption that the medium treatment would have no real impact on the users? If so, what would be the point of spending the resources to evaluate this degree of treatment? Or is the medium treatment supposed to be reflective of the actual first 20 documents from the TREC experiments?

    2. In this experiment, the authors mention using the judgments from the assessors of the TREC experiments from which they pulled documents and queries. However, the whole point of this paper is that relevance judgments are subjective and vary from person to person and for a single person over time. From their description, it sounds like, outside of comparisons between these experts and the study participants, the authors used the expert judges’ relevance assessments as the ground truth. Specifically, the authors determined in advance how many relevant documents would appear to users in the ‘prologue’ section. Were these relevance judgments coming from the TREC evaluations? If not, who was making these distinctions? In addition, how can the authors guarantee the documents selected really reflect relevant and not-relevant documents, since relevance is considered subjective?

    3. As the final part of their experiment, the authors asked the users to provide feedback in the form of an exit questionnaire. The authors had a series of research questions they were hoping to answer through their experimental design, but I feel a lot of interesting points were brought up in the user responses. The users mentioned things like being influenced by the user interface. More importantly, the authors mentioned how they designed the experiment to focus on a specific type of search task and relevance assessment. However, the users revealed that they did not strictly follow this particular restriction when evaluating the documents. Should research on why relevance judgments disagree be driven by psychological considerations if users don’t follow the modeling assumptions researchers initially started out with?

  13. Q. The authors state that “One assessor may make different assessments of the same document at different times and under different conditions.” Given this statement, and the fact that the assessors were not allowed to go back and change their feedback, the authors' wish to calibrate the assessors based on the assessments they have made seems contradictory. The authors clearly state: "Our findings indicate that assessors should be exposed to documents from multiple relevance levels early in the judging process, in order to calibrate their relevance thresholds in a balanced way, and that individual difference measures might be a useful way to screen assessors.”
    Q. If relevance is not known, then how can it be made sure that the documents being shown to assessors are of higher relevance? One application could be back-tracking based on the relevance judgments made by participants and then normalizing the data based on those findings. But how can that best be achieved without any kind of relevance feedback for the documents?
    Q. Would not allowing users to go back and change their assessments help in overcoming the bias mentioned by the authors? The authors mention that in their study "Participants were instructed that once they submitted an assessment they could not return to revise it later.” This is a valid step if we assume the assessors are aware of the topics beforehand. But if a participant is not very aware of the topic, the relevance markings for the first documents will be the basis on which later decisions are made. If the participant were allowed to go back and change judgments, the results might differ, even though the authors say, ”Participants continue to refine their mental relevance models over time. Even if they do not have any reference points to begin with, they are able to re-calibrate once they begin to see documents that are relevant to different degrees.”

  14. The concept of need for cognition is very interesting when dealing with relevance judgments. The way it is defined by Scholer et al. in relation to the assessments makes a lot of sense especially when considering how assessors might approach the tasks they are assigned. Does the need for cognition change depending upon the topic of interest or being the query owner?

    Related to my first question, the researchers noted that they “had hoped participant’s interest in the topics might be slightly higher…”(3). Would having a broader range of topics have met their “hope” for a higher interest and in turn better identify the NFC and threshold priming?

    In section 5, Scholer et al., mention possible future work involving focus on self-agreement. Would this change the manner in which relevance judgments are dealt with going forward toward a more Saracevic view of how individuals define relevance? It’s interesting that there has not been some research into self-agreement on relevance assessments, particularly for those creating “gold standard” sets.

  15. 1) While I think the Need for Cognition factor provides additional useful information, is simply asking users about themselves to determine NFC the best way to go? People have inherently biased views about themselves, and might be apprehensive about answering completely honestly. Are there ways to determine this factor more implicitly, or through patterns of their behavior?

    2) Since the study was part of a university, it makes sense that they were compelled to use people affiliated with the university, but why further constrain the sample by using a vast majority of women? Moreover, I thought it was a bit disingenuous to say that the ages ranged from 18-55, when the median was 23.7.

    3) In the conclusion (and in their results regarding NFC), the authors state that high NFC participants’ level of agreement with the expert assessors was significantly higher than that of low NFC participants. Do they provide any evidence of this? All of the data and figures correspond to cases where there were no statistically significant results. It’s odd that in the one case where they claim to have found a correlation (between expert judging and NFC), they present little data.
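    One way the significance claim could be checked, if the per-group agreement rates and judgment counts were reported, is a simple two-proportion z-test. A sketch under invented numbers (the rates and counts below are hypothetical, not taken from the paper):

```python
from math import sqrt, erf

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two agreement rates."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    # two-sided p-value via the normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# hypothetical agreement-with-experts rates over 400 judgments per group
z, p = two_proportion_z(0.62, 400, 0.55, 400)
print(f"z = {z:.2f}, p = {p:.3f}")
```

    A test like this would let readers judge whether the reported difference in agreement could plausibly arise by chance, which is exactly the evidence the question finds missing.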

  16. 1. The participants were presented with one document at a time in the experiment. Why was a clickable list not used in this case? Is a list more effective at improving assessment quality?
    2. On page 6, second-to-last paragraph, it is said that “they tended to boost the relevance of low-relevant document”. From what evidence did the authors draw this conclusion? It seems to lack solid support.
    3. In Section 4.4, some participants’ comments were mentioned as their pain points in the experiment. How could these pain points be solved? Would a clearer, more instructive narrative help?

  17. 1. The authors spend a great deal of time discussing the NFC, but are satisfied with information provided by the participants as a means to determine their NFC. Was there not a better and less subjective way for the authors to get this kind of information than a self-report from the participants?

    2. The authors displayed their documents one at a time to the participants and state that they decided not to use a ranked list of documents. How would a ranked list have affected their study and why did they opt for a single document display?

    3. The authors constructed the prologue and epilogue sections of the results to see the effect of priming on the participants' relevance judgments, but was such a large prologue really necessary to see these effects? How did they arrive at their construction of the results?

  18. 1. The authors discuss threshold priming and NFC separately in this paper. What is the relationship between these two aspects? Does NFC impact the threshold priming?
    2. The participants know little about the selected topics in section 3. Compared with the experts who are familiar with the topic, does it make sense to set this bar in the experiment?
    3. The NFC section is not clear. Without knowing what the exact 18 items are, we have no idea whether the method described here is reasonable.

  19. 1. In this article the authors discuss the three topics that they chose to use for their experiment. They gave very specific features that they were looking for in a topic. They listed three topics that they felt contained these features. Do you feel that these topics were the best topics they could use for this experiment? What type of problems could these topics cause? Were the features that they put forth really the best features that a topic could have to help with this experiment?
    2. In this article the authors used three sets of duplicate documents to test the intra-assessor agreement of different assessors along the different treatments that they applied. We saw the other Scholer et al. article dealt with the idea of duplicate documents and relevance. How do you think the results of this article and of the other Scholer et al. article relate to each other? Do they agree or disagree?
    3. Also the other Scholer et al. article discussed several factors that could lead to problems of intra-assessor disagreement. These factors include distance between duplicate documents and the number of relevant documents between duplicate documents. With these factors in mind how would you go about creating a prologue section of documents like the ones shown here to help maximize the effects of threshold priming?

  20. Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment

    1- I have an issue with the fundamental setup of the experiment. The setup required the experimenters to know ahead of time what the relevance of the documents to be judged was before any judging took place. For this and many other experimental setups that would be quite possible, but when new judgments actually need to be made in practice it would not be. I understand that the purpose of the experiment is to see if the order in which relevant documents are seen has any effect. But that knowledge is hard to put to practical use.

    2- A particularly significant finding to me was the fact that the judges only agreed with themselves about 50% of the time. I wish this had been discussed in more depth. Does this call into question the validity of the experiment? Can the internal disagreement be blamed on fatigue, since the repeated documents were the very last ones? What does this say about relevance judgments in general?

    3- About the need for cognition: I wonder if the pool was biased in this case. Most of the participants were college students, and the type of college students who are willing to spend their time participating in research studies. Isn't that group FAR more likely than the average human to have a high need for cognition? I think that the high/low classification is a division of people already at the high end of the spectrum. It is probably a good idea to have assessors with a high NFC, but I do not think that this paper convincingly makes that case without conducting an NFC test of the general public to see if the participants' NFC distribution matches. Also something to think about: if the average population has a different NFC than judges do, will the judges make assessments that match the general user's?

  21. The authors mention that the ‘Relevant’ category implies that the document has more information than the topic description but the presentation is not exhaustive. This is a vague instruction, as ‘exhaustive’ is not properly defined. How do the participants decide the complete relevance of the documents (regarding the exhaustiveness of the information)?

    In the epilogue, the authors repeat the first three documents in the last three positions to measure the temporal consistency of participants’ judgments. The differences are not statistically significant, and I feel that the short span of 22 documents, between documents 23 and 46, could be the reason why the results failed significance testing.

    It is not surprising that the authors observed that many participants valued document appeal so much. Document and Interface appeal could very well be one of the prominent reasons why many of the search engine companies spend so much on designing better and user friendly interfaces. On the other hand, an uncluttered and simple user interface will keep the users focused on their task and also help them accomplish more. Thus it makes sense, intuitively, that users value the presentation highly owing to their psychological behavior.

  22. 1. In Section 4.4, the authors discuss the challenges in assessment. As one of the challenges, they describe that the participants struggled with documents that contained relevant terms but no real discussion of the issues. This raises a very important question regarding IR systems. Does an IR system need to completely understand the query and the subjective intent of the user? From our readings, it seems that this is the only way in which high relevance can be achieved. But it is never possible to fully understand the subjective intent behind a query. Where does the ‘intelligence’ part of IR stop?

    2. The article reports that the self-agreement across the participants was 51.62%, i.e., repeated evaluations of the same document might conflict about half the time. Can this observation be translated into a statement that relevance assessments are mostly random? Or is it the result of an observed factor affecting the relevance assessment of every document?
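    Whether ~50% self-agreement is "mostly random" depends on the chance baseline of the four-level scale, which a chance-corrected statistic such as Cohen's kappa makes explicit. A minimal sketch (the two passes of judgments below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(first, second):
    """Chance-corrected agreement between two passes of judgments."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    ca, cb = Counter(first), Counter(second)
    # expected agreement if each pass labeled independently at its
    # own marginal rates
    expected = sum(ca[k] * cb[k] for k in ca) / n ** 2
    return (observed - expected) / (1 - expected)

# hypothetical 4-level judgments (0=not, 1=marginal, 2=relevant, 3=high)
first_pass  = [3, 2, 2, 1, 0, 3, 1, 2, 0, 1]
second_pass = [3, 2, 1, 1, 0, 2, 1, 2, 1, 1]
print(cohens_kappa(first_pass, second_pass))
```

    Under uniform guessing on four levels, expected raw agreement is only 25%, so 51.62% raw agreement is well above chance; kappa would quantify by how much, given the actual label distribution.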

    3. Does this article provide a convincing way to screen assessors based on their agreement levels? How are assessors normally screened? Do we need to screen assessors based on their consistency or their expertise? How can we use factors such as NFC (high/low) as a reliable parameter to screen assessors? The authors haven’t discussed the reliability of the measures with which they propose to screen assessors in the future (discussed in the conclusion).

  23. 1) Given that effectiveness is related to the productivity of the user, what is absolute evaluation of system effectiveness? How are relevance assessments a limiting factor in measuring it?

    2) In their experimental setup, the prologue's size is greater than a third of the entire experiment. This seems a bit excessive, since in other literature most participants get the “hang of things” in a small number of trials. Also, over time participants might start to degrade in performance due to loss of interest, fatigue, etc. Are there benefits of such a big prologue that can outweigh its negative consequences?

    3) High NFC is shown to not be a significant factor to increase agreement with expert judges but those with high NFC did achieve better agreement than the low NFC participants. However, it was also noted that High NFC participants did take longer to analyze documents. Wouldn't it be interesting to see data about how the agreement with experts changed over time?

  24. 1. What can be done to handle the cognitive biases towards relevance that come from NFC? Would it be possible to custom-design a document pool based on a person’s predispositions to get the best annotation quality?

    2. The study succeeds in showing that particular document ‘treatments’ in the prologue affect relevance-judging behavior. However, it is unclear to me whether the change in judging is the result of changed judging criteria or rather of decreasing cognitive effort. Is the convergence in judging times of HNFC and LNFC evidence of changing judging criteria, or of cognitive strain on the assessor?

    3. When the participants described their challenges in the task one of the things they brought up was a feeling of not understanding the topic intent or topic background. Do people who indicate some level of discomfort or uncertainty with the topic intent show unique assessment behaviors? How large of a problem is this for assessment work?