I am trying to put together a framework for search quality evaluation for a specialist information provider.
At the moment quality is measured by counting the number of hits for certain key docs across various queries, and monitoring changes on a regular schedule. I’d like to broaden this out into something more scalable and robust, from which a more extensive range of metrics can be calculated. (As an aside, I know there are many ways of evaluating the overall search experience, but I’m focusing solely on ranked retrieval and relevance here).
We are in the fortunate position of being able to acquire binary relevance judgements from SMEs, so can aspire to something like the TREC approach:
http://trec.nist.gov/data/reljudge_eng.html
But of course we are running just a single site search engine here, so can’t pool results across runs to produce a consolidated ‘gold standard’ result set as you would in the TREC framework.
I am sure this scenario repeats the world over. One solution I can think of is to run your existing search engine with various alternative configurations, e.g. precision oriented, recall oriented, freshness oriented, etc. and aggregate the top N results from each to emulate the pooling approach. Can anyone suggest any others? Or perhaps an alternative method entirely?
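To make that concrete, here’s a minimal Python sketch of what I mean by emulating pooling across configurations – the search function is just a placeholder for however your engine is actually invoked, and the configuration names are purely illustrative:

# Emulating TREC-style pooling with a single engine by varying its configuration.
# `search(query, config)` is a placeholder for however you call your engine;
# it should return a ranked list of document ids.

def build_pool(search, query, configs, depth=20):
    """Union of the top-`depth` results across several engine configurations.
    The pool (not the individual runs) is what goes to the SMEs for judging."""
    pool = set()
    for config in configs:
        results = search(query, config)
        pool.update(results[:depth])
    return sorted(pool)

# Tiny stubbed example:
fake_runs = {
    "precision": ["d1", "d2", "d3"],
    "recall": ["d2", "d4", "d5"],
    "freshness": ["d7", "d1"],
}
def search(query, config):
    return fake_runs[config]

print(build_pool(search, "capital gains tax", ["precision", "recall", "freshness"], depth=3))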
Here’s how we do it (published 2/20/2012): http://www.opensourceconnections.com/2012/02/20/an-introduction-to-results-utility-positioning/
Thx Jason – are you sure that URL is right? It’s giving me a 404.
Tony–Yeah, it’s scheduled to be published Monday afternoon GMT -5.
Doh! those US date formats get me every time 🙂
OK. Let’s try this link again. Here’s the proper one: http://www.opensourceconnections.com/2012/02/19/an-introduction-to-results-utility-positioning/
I can’t blame the Brits for the date kerfuffle!
I blame Pope Gregory. Or was it Julius Caesar. One of those guys …
I think you are going to need to make some more decisions about what exactly you want to learn from the evaluation. The (not-always-successfully-obtained) goal in TREC is to create reusable collections that support a wide variety of different measures. By reusable, I mean that retrieval system variants that did not participate in the construction of the test collection can be fairly evaluated using the collection. The desire to support a wide variety of measures is motivated by the fact that the measures are the surrogates for different retrieval tasks. TREC and similar community evaluation efforts try to leverage the cost of building the collections by making them as broadly usable as possible. But the more specific you can be about what exactly you need to measure, the easier it is to build a collection that allows you to measure that. (The flip side, of course, is that the collection may not be good for measuring anything other than that.)

The “minimal test collection” technique introduced by UMass was designed precisely for the case you describe if you have a (mostly) stable set of alternative retrieval algorithms that you are trying to compare. The basic component of this technique is an algorithm that, given a judgment budget and a measure, selects which documents to judge across a set of runs (output of different retrieval algorithms) such that you get the maximal amount of power to distinguish among the runs.
@inproceedings{carterette2006minimal,
  author    = {Ben Carterette and James Allan and Ramesh Sitaraman},
  title     = {Minimal Test Collections for Retrieval Evaluation},
  booktitle = {Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)},
  year      = {2006},
  pages     = {268--275}
}
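The selection component in the paper is measure-aware: each judgment is chosen to maximally reduce uncertainty about the difference in the target measure between runs. The sketch below is deliberately much cruder – it just prioritizes unjudged documents that many runs rank highly – and is only meant to show the shape of a budget-constrained selection loop, not the published algorithm:

# NOT the minimal-test-collection algorithm itself -- just a crude illustration
# of choosing which documents to judge under a fixed budget, favoring documents
# that many runs rank highly. See the Carterette et al. paper for the principled,
# measure-aware version.

def select_for_judging(runs, budget):
    """runs: list of ranked lists of doc ids; budget: number of judgments available."""
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            # reciprocal-rank credit, summed across runs
            scores[doc] = scores.get(doc, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:budget]

print(select_for_judging([["d1", "d2", "d3"], ["d2", "d5", "d1"]], budget=3))  # ['d2', 'd1', 'd5']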
Ben Carterette and colleagues have also looked at the reusability of test collections:

@inproceedings{carterette2010reusable,
  author    = {Ben Carterette and Evangelos Kanoulas and Virgil Pavlu and Hui Fang},
  title     = {Reusable Test Collections Through Experimental Design},
  booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010)},
  year      = {2010},
  pages     = {547--554}
}
and

@inproceedings{carterette2010measuring,
  author    = {Ben Carterette and Evgeniy Gabrilovich and Vanja Josifovski and Donald Metzler},
  title     = {Measuring the Reusability of Test Collections},
  booktitle = {Proceedings of WSDM’10},
  year      = {2010}
}
Restricting the measures of interest is another way of reining in the problem. If the task you are interested in is adequately modeled by precision in the top X documents, then you only need to judge the top X documents from each run. Also, you have an easy test for whether a new run is treated fairly: are all X documents judged? However, note that in my opinion very few real retrieval tasks are in fact adequately modeled by, say, Precision(10) scores using binary judgments. Even in web search, usually considered to be the prototypical high-precision search task, you generally care whether the top ten documents returned cover different aspects of the search topic.
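For reference, precision at X from binary judgments is just the fraction of the top X results judged relevant; in Python (the document ids and judgments here are made up):

# Precision at X: fraction of the top X results judged relevant.
# `qrels` maps doc id -> True/False; here every top-X document is assumed judged.

def precision_at(results, qrels, x=10):
    top = results[:x]
    return sum(1 for doc in top if qrels.get(doc, False)) / float(x)

print(precision_at(["d1", "d2", "d3", "d4"],
                   {"d1": True, "d2": False, "d3": True, "d4": False}, x=4))  # 0.5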
Assuming you want to measure something more akin to MAP (which would be my suggestion), there has been work on approximating the value of MAP using many fewer judgments than straight TREC-type pooling would require. For example,
@inproceedings{yilmaz2008sampling,
  author    = {Emine Yilmaz and Evangelos Kanoulas and Javed A. Aslam},
  title     = {A Simple and Efficient Sampling Method for Estimating {AP} and {NDCG}},
  booktitle = {Proceedings of the 31st Annual International {ACM SIGIR} Conference on Research and Development in Information Retrieval},
  year      = {2008},
  pages     = {603--610}
}
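For reference, the quantity these methods estimate is average precision; with complete binary judgments it is computed as below (the paper above is about estimating this from a sample of judgments, so this sketch just shows the target quantity, with made-up data):

# Exact average precision for one query, assuming complete binary judgments.

def average_precision(results, relevant):
    """results: ranked list of doc ids; relevant: set of all relevant doc ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            hits += 1
            total += hits / float(rank)
    return total / len(relevant)

print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d4"}))  # (1/1 + 2/4) / 2 = 0.75

MAP is then just the mean of this value over the query set.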
There have also been other measures proposed to compensate for the fact that the relevance judgments are very incomplete. While these approaches all attack the few(er) judgments side of the problem, they still assume that those judgments represent an unbiased sample of the relevant documents. Thus, they are likely to be less appropriate for your case where you have a single system than in a community effort such as TREC.
Ellen Voorhees
(TREC project manager)
Hi Ellen – thanks for stopping by and for the detailed response. I like the way you have outlined the tradeoff between re-usability and customisation of a collection to specific needs – that’s really useful.
On a wider note, I am very sympathetic to the idea that search experience should be measured in a more holistic (user-centred) manner – but for this particular project we are focusing specifically on ranked retrieval and how best to measure that in a commercial context.
Let me follow up on a few points:
* “make some more decisions about what exactly you want to learn from the evaluation”
At the very least in the first instance we’d like to be able to calculate some form of precision and recall for each run. We can’t do this at present as all we have are a few rels for some hand-picked documents and topics. So the “minimal test collection” approach could be suitable for our needs – I’ll have a look.
“If the task you are interested in is adequately modeled by precision in the top X documents, then you only need to judge the top X documents from each run. ”
As it happens there would be some support for this approach (even though I think it frames the search experience even more narrowly – see my comment above) – for example, precision at 10 or 20 would be understood and gain traction within the business. But to operationalise this, would we need to execute those judgements for every run? If so, that might not work for us – our engagement with the SMEs is such that we can ask for rels on a one-off (i.e. aperiodic) basis as part of a specific project or exercise, but not so easily on an ad hoc / operational basis. Or have I misread you?
“very few real retrieval tasks are in fact adequately modeled by, say, Precision(10) scores using binary judgments.”
Actually, we might be the exception that proves the rule – for *some* of our use cases, there is a very clear goal-directed aspect to the task; not quite “known item” as such, but certainly a case where the end users should know they’ve found it when they see it (assuming the SERP is designed appropriately, but that’s a whole separate issue – and a forthcoming blog post, as it happens).
“Even in web search, usually considered to be the prototypical high-precision search task, you generally care whether the top ten documents returned cover different aspects of the search topic”
Yep agree, but following on from previous comment, diversity hasn’t up to now been an explicit goal for us. In fact for the particular use case above, it may well be an anti-pattern. (Interesting point though, and one I think we should reconsider when reviewing our other search scenarios.)
Thanks also for the ref to the work on approximating MAP – will also look into that.
Ellen: if you’re still listening (or anyone else for that matter 🙂) – do you know of any published case studies or examples where people have applied the above framework in a business / operational context (with all the constraints on time, resources and decision-making autonomy that that implies)?
I’m assuming the framework you are referring to is the minimal test collection framework. I don’t know of any such case studies, but it’s not clear that I’d be the best person to ask. I can ping Ben to see if he knows of some.

With regard to the use of precision@X in the earlier comments, yes, you’d want all of the top X documents for each of the runs you are comparing to be judged. But would you be doing the comparison in the operational context? I had envisioned something along the following lines:

1. You have the current version of the operational system plus some variant(s) that you want to compare to this version. You also have a set of queries that you are going to do the comparison over.

2. You run the queries using each system variant and judge the top X documents from each variant. If you have S systems, this will be at most S*X documents per query to judge, but likely a small fraction of that many since systems will retrieve some documents in common.

3. Some time later you have another set of variants to evaluate. You run the queries again using these variants. Now you have another set of documents to judge, but this set is likely to be even smaller because many of the top X retrieved will have been judged in earlier rounds (see the sketch below). [Of course, this last assumption won’t necessarily be true if your document set is changing significantly between tests. Perhaps that is how the operational context fits in. Even in this case, you are still doing evaluation aperiodically, which matches your SME schedule, though perhaps causing you to ask too much of your SMEs each time.]
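A rough sketch of the bookkeeping behind step 3 – persist the judgments across rounds, and only the never-judged documents in the new top-X pools go to the SMEs (the names are illustrative):

# Incremental judging: in each round, only the top-X documents that have never
# been judged before need to go to the SMEs. `judged` maps (query, doc id) to a
# relevance value and is kept across rounds.

def new_judging_work(query, runs, judged, x=10):
    """runs: dict of run name -> ranked doc ids for this query."""
    needed = set()
    for results in runs.values():
        needed.update(results[:x])
    return {doc for doc in needed if (query, doc) not in judged}

judged = {("q1", "d1"): True, ("q1", "d2"): False}   # from an earlier round
runs = {"current": ["d1", "d3"], "variant": ["d2", "d4"]}
print(new_judging_work("q1", runs, judged, x=2))     # only d3 and d4 still need judging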
Even if you want to evaluate the current system at set intervals, you are going to need to use a common set of queries for which you have (some) judgments. Precision@X has the advantage of being easily bounded if judgments for some documents are missing. Assume all documents with missing judgments are relevant and compute precision, then repeat assuming all unjudged documents are not relevant. If too many documents are unjudged you will have bounds that are so broad as to be meaningless, but you are guaranteed that the true value lies within those simple bounds. The next time you can get judgments, have those documents judged and you can get tighter bounds.
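In code, that bounding trick looks something like this (judged documents carry True/False, unjudged documents are simply absent from the judgments):

# Bracketing precision@X when some judgments are missing: compute it once treating
# unjudged documents as non-relevant (lower bound) and once treating them as
# relevant (upper bound).

def precision_bounds(results, qrels, x=10):
    top = results[:x]
    judged_rel = sum(1 for doc in top if qrels.get(doc) is True)
    unjudged = sum(1 for doc in top if doc not in qrels)
    lower = judged_rel / float(x)
    upper = (judged_rel + unjudged) / float(x)
    return lower, upper

print(precision_bounds(["d1", "d2", "d3", "d4"], {"d1": True, "d2": False}, x=4))  # (0.25, 0.75)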
Ellen
Hi Ellen and Tony, I don’t know of any published case studies either. But the method was used for the TREC Million Query track, which published three overview papers:
* 1MQ.OVERVIEW16.pdf
* MQ.OVERVIEW.pdf
* MQ09OVERVIEW.pdf
Evangelos Kanoulas, Emine Yilmaz, and I also gave a tutorial at SIGIR 2010 on this topic; you can find the slides here:
http://ir.cis.udel.edu/SIGIR10tutorial/
BTW, I agree with everything Ellen said above… just providing a few more links in case they prove useful 🙂
Hi Ellen – thanks again for stopping by. This is really useful. If you do hear of any case studies, please let me know. In the meantime, let me pick up a few points:
“But would you be doing the comparison in the operational context?”
I’m a little unsure what you mean by operational context here – in principle what we’d envisaged doing is creating a baseline set of rels that the SMEs judge offline. But even then there are a lot of practical details to be worked out, e.g.
* the choice of presentation medium (Excel?)
* the way the judgements are framed (e.g. “which of the following would you expect to be returned for <query>?”)
* the degree of context we provide around each topic / info need (nothing? Or at the other extreme, a subjective mini-scenario?)
* whether to randomise the order? (probably)
* the nature of the rels (binary vs. 3-way, etc.?)
and so on.
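Purely for illustration, one possible shape for the judging sheet – generated as a CSV that opens fine in Excel, with the order randomised per query. The column names and the binary y/n scale are assumptions on my part, not a recommendation:

# Illustrative only: write a randomised judging sheet for one query as CSV.
import csv
import random

def write_judging_sheet(path, query, pool, titles):
    rows = [(query, doc, titles.get(doc, ""), "") for doc in pool]
    random.shuffle(rows)  # randomise presentation order
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "doc_id", "title", "relevant (y/n)"])
        writer.writerows(rows)

write_judging_sheet("judgments_q1.csv", "capital gains tax",
                    ["d1", "d2", "d3"],
                    {"d1": "CGT overview", "d2": "Reliefs guide", "d3": "Annual exempt amount"})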
"You have current version of operational system plus some variant(s) that you want to compare to this version. "
Actually in our case we have just one variant that we want to baseline (at the outset). Other variants can and will of course come along … but we don't have a 'backlog' of those as such. Instead, we would hope that the insight we get from the initial benchmarking exercise (plus other, more user-centred inputs) would give us the insight we need to actually conceive and develop those other variants, which we can then evaluate, with the ultimate goal of course of continually improving the service.
"Now, you have another set of documents to judge, but this set is likely to be even smaller because many of the top X retrieved will have been judged in earlier rounds. "
Ah, of course! Sounds obvious now but that's a great insight that I hadn't considered. Thanks 🙂 It might be a bit fiddly tracking the individual docs in our case, but it's the right thing to do.
"Of course, this last assumption won't necessarily be true if your document set is changing significantly between tests. "
This is certainly true in our case, but I don't have a feel for the magnitude of the problem yet. The delta between the initial benchmarking exercise and a subsequent run (say a month or so later) should give us some quantitative insight into this.
"though perhaps causing you to ask too much of your SMEs each time"
I worry about this … while on the one hand we want to do what's right and gather robust data, OTOH if we are too ambitious we'll just lose their good will & interest. It's a fine line to tread, and from where we are now, my instinct is to be conservative (at least in the initial engagement). As an aside, this is one of many places where principles developed within the research community don't really work when transferred to a commercial environment… there is a significant mapping / translation exercise to be done, and I'm not sure whose responsibility that is (or should be). But that's another story / blog post… 🙂
"you are going to need to use a common set of queries for which you have (some) judgments"
Yes, we're hoping to use our top 50 or so queries (initially), conflated to merge common classes of information need (at a fairly superficial / lexical level).
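For what it's worth, the sort of superficial / lexical conflation I have in mind is no more than this (a crude normalisation of the query log – anything smarter is out of scope for now):

# Crude lexical conflation of a query log: normalise case, punctuation and
# whitespace so obvious variants of the same information need are counted together.
import re
from collections import Counter

def conflate(queries):
    counts = Counter()
    for q in queries:
        key = re.sub(r"[^\w\s]", " ", q.lower())
        key = re.sub(r"\s+", " ", key).strip()
        counts[key] += 1
    return counts

log = ["Capital Gains Tax", "capital gains tax?", "capital-gains tax", "VAT rates"]
print(conflate(log).most_common(2))  # [('capital gains tax', 3), ('vat rates', 1)]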
"Precision@X has the advantage of being easily bounded if judgments for some documents are missing. …you are guaranteed that the value is within those simple bounds"
Brilliant. That is a great way to approach the problem – we can use that right now.
Thanks once again for your input, Ellen. It's been really useful.
Tony – I’m late to the game on this comment, but I thought I would (briefly) share what I’ve been working on lately myself. I spoke about this at Enterprise Search Summit last fall in Washington D.C.
I am planning an article for my blog which will provide more details.
The question that’s driven me to this approach: when one is trying to improve the quality of the results users are getting, what is a scalable way to do that?
Commonly, one approach is to make some changes (changes to content, changes to search engine settings, changes to the search UI, etc.) and then have a group of users test the results. You get feedback that helps you make a decision about another “tweak”, and then you do it all over again: “wash, rinse, repeat”.
Doing it that way is very manual and time-intensive.
But, I would say, much of a person’s “impression” of the quality of a search results page is not really formed by looking in detail at every result and performing a deep assessment of its quality relative to the search term used.
Instead, there are some simple heuristics users apply – keyword density in what they see on the search results page, known targets for that particular search term, or the age of the results.
What I’ve done (in a proof of concept utility) is to build a means to automate an assessment of search results for a set of search terms.
For each search term, it executes the search and retrieves the set of results. It then scores each result between 1 and 100 on a number of factors (like the above). Then it averages this score across the results for that term, to give the search term an overall score. (You could weight this average by position on the search results page.)
Then you can average that score across all of the terms to get an average “search results page quality” score for a set of terms. (You can also weight this by how much usage each term gets.)
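To make the aggregation concrete, here’s a rough sketch. The per-result factor scoring is stubbed out (that’s where the real work is); the point is the two levels of optionally weighted averaging – across results per term, then across terms:

# Sketch of the two-level aggregation: average per-result scores (1-100) for each
# term (optionally weighted by position), then average the term scores (optionally
# weighted by term usage). `score_result` stands in for the factor-based scoring.

def term_score(results, score_result, position_weights=None):
    if position_weights is None:
        position_weights = [1.0] * len(results)
    total = sum(w * score_result(r) for w, r in zip(position_weights, results))
    return total / sum(position_weights[:len(results)])

def overall_score(term_results, score_result, usage=None):
    if usage is None:
        usage = {term: 1.0 for term in term_results}
    weighted = sum(usage[t] * term_score(rs, score_result) for t, rs in term_results.items())
    return weighted / sum(usage.values())

def score_result(result):           # stub: pretend each result already carries a score
    return result["score"]

term_results = {"vat rates": [{"score": 80}, {"score": 60}], "cgt": [{"score": 90}]}
print(overall_score(term_results, score_result, usage={"vat rates": 3, "cgt": 1}))  # 75.0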
Using this, you can run several iterations of the “wash rinse repeat” cycle in very short order and then, once you get to a good overall score, do a loop with human evaluators (I don’t think you could remove the human evaluation entirely).
There are many other interesting analyses one can do with this data as well (looking at the average score for each of the dimensions, for example, or looking at the results grouped by their underlying taxonomy, etc.). And if you do this data collection over time, another one I have been looking at recently is the idea of measuring the stability of results for specific terms over a stretch of time (see my post yesterday on the Enterprise Search group on LinkedIn for some more on that – hopefully this link works – http://www.linkedin.com/groupItem?view=&gid=161594&type=member&item=121955544&qid=a6e868d2-442a-40de-8107-8c60ec951b67&trk=group_most_popular-0-b-ttl&goback=%2Egde_161594_member_122307112%2Egmp_161594 )
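As a starting point for the stability idea, plain overlap between the top-N results for the same term across snapshots already tells you a lot (rank-aware measures exist, but set overlap keeps the idea obvious):

# Simple "churn" measure for a term: fraction of the top-n shared between the
# ranked result lists from two snapshots in time.

def overlap_at(list_a, list_b, n=10):
    return len(set(list_a[:n]) & set(list_b[:n])) / float(n)

snapshots = {
    "2012-01": ["d1", "d2", "d3", "d4"],
    "2012-02": ["d1", "d3", "d5", "d6"],
}
print(overlap_at(snapshots["2012-01"], snapshots["2012-02"], n=4))  # 0.5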
Hey Lee, thanks for stopping by. This is all good stuff. I think the crux of it lies in how we define the users’ goals when interacting with the search results. I like the idea of measuring “impressions” of search result quality … but I’m struggling to relate that to a direct use case or common end-user behaviour. In my experience most end users’ goals are intimately connected with the actual information that is provided through those results and the extent to which it satisfies their information needs… do you have a sense of how reliable these measures are as a proxy for those end-user behaviours?
One bit I didn’t quite follow was “Then, it averages this score across the results for each term to be able to score the search term overall.” So are you saying that you iterate over each of N queries, then within that iterate over each of M terms within the query to provide the overall average?
Also what kind of human evaluation did you envisage? As you no doubt have seen from the dialogue above, there is a considerable disconnect between the academic protocol and what typically happens in an operational environment. I aspire to doing things formally … but it often feels like swimming against the tide 😦
I think what you’re doing in measuring the ‘churn’ of results for a specific query is really interesting too. Let me know when the blog post comes out!