A client of mine wants to measure the difference between manual tagging and auto-classification on unstructured documents, focusing in particular on its impact on retrieval (i.e. relevance ranking). At the moment they are considering two contrasting approaches:
- Create a list of all the insertions and deletions (i.e. instances where the auto and manual tags differ for a given document), and sort by frequency. Take those that appear more than a given number of times (say 20), and count how often they appear as search terms in the top 1000 queries for the past 6 months. Include exact matches (where a tag and a query term are identical), and partial matches (where a tag is wholly included in a query), but exclude everything else. For tags that don’t appear in the top 1000, assume a notional frequency of say 70. Then divide the figure you get by the total number of queries over the past 6 months. This gives you a measure of how important those insertions and deletions are, and thus the impact of manual tagging on retrieval.
- Run a controlled experiment in which the tagging condition is the independent variable and the relevance ranking is the dependent variable. Use a benchmark set of queries and relevance judgements, and calculate precision and recall.
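To make the first approach concrete, here is a minimal sketch of the calculation as I read it. The function names, the data shapes (dicts of document id to tag set, and of query string to frequency) and the interpretation of "wholly included" as a contiguous run of whole words are all assumptions for illustration, not anything the client has specified.

```python
from collections import Counter


def tag_differences(manual, auto):
    """Count insertions and deletions: tags in one set but not the other, per document."""
    diffs = Counter()
    for doc_id in manual.keys() | auto.keys():
        diffs.update(manual.get(doc_id, set()) ^ auto.get(doc_id, set()))
    return diffs


def contains_phrase(query, tag):
    """True if the tag's words appear as a contiguous run in the query.

    Assumption: this covers both exact matches (tag == query) and partial
    matches (tag wholly included in a longer query), matching on whole words.
    """
    q, t = query.lower().split(), tag.lower().split()
    return any(q[i:i + len(t)] == t for i in range(len(q) - len(t) + 1))


def approach_one_score(manual, auto, top_queries, total_queries,
                       min_count=20, notional=70):
    """Share of query volume attributable to frequently differing tags.

    top_queries: {query string: frequency} for the top queries (hypothetical shape).
    total_queries: total number of queries over the same period.
    """
    diffs = tag_differences(manual, auto)
    frequent = [tag for tag, n in diffs.items() if n > min_count]
    total = 0
    for tag in frequent:
        hits = sum(freq for query, freq in top_queries.items()
                   if contains_phrase(query, tag))
        # Tags absent from the top queries get the notional frequency.
        total += hits if hits > 0 else notional
    return total / total_queries
```

Note that nothing in this calculation looks at which documents are returned for a query, or whether they are relevant.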
Surprisingly (to me, at least) there seems to be some debate as to which is the best approach.
Which one would you choose, and why?


Can’t you just show them examples of auto-classification and that they are ****? As existing evidence, you could refer them to Marti Hearst’s CACM piece from 2006, which said carefully constructed metadata is always better than automatically generated metadata, because you can match the manually constructed content to the user’s mental model as much as possible.
However, if you wanted to run a study, I’d do the latter with real user tasks. But that’s more my background.
Thanks Max – actually just eyeballing the tags isn’t as one sided as you might think. I could probably tell which was which by eye, but it isn’t a case of one set being plausible and the other not (hence the need to measure this formally). And yes, the client manages both systems, so they have seen examples of both sets of tags
Thx for the Hearst ref – that’s a good angle to take. I guess in principle there’s a pretty wide spectrum between ‘carefully constructed’ on one side and auto generated on the other, but in practice the two instances we have may actually not be that far apart.
Any observations on the first of the 2 approaches above?
As I understand approach one, it measures only how often assigned tags occur in queries. That itself doesn’t say anything about how effective those tags are for satisfying searches, unless you assume that a document assigned a tag is relevant to all queries that contain the tag (and only relevant to queries that contain the tag?). That’s an assumption I find dubious at best. Human indexing has been around a long time and has not solved the information retrieval problem.
Ellen Voorhees
Well I would naturally vote for user-study
Being practical, the first approach seems interesting and probably less expensive. However, what about the scale used to interpret the final “importance”? If the difference is 20% in favor of the automatic classifier, is that sufficient? What if it’s 10%? 5%?
Moreover, as said, it does not really measure the added value of the insertions/deletions, only their statistical appearance. Nothing tells you whether the tags are precise (i.e. relevant to the content tagged). What’s the use of a classifier that produces tags that appear frequently but applies them to irrelevant content?
So final vote: user study!
On another thought, if you start with option 2 and you’re ready to spend manual annotation time (here on assessing the relevance of results), why not simply create a “ground truth” for the tagging task? You would be able to measure the actual precision/recall of both classifiers and, if necessary, you could weight the classes (i.e. the different tags) by their frequency of appearance in the queries, as proposed in option 1. Thus you have a measure of the classifier’s impact with regard to tag use.
Simply put, if this is a classification problem, do a classification benchmark (even if user studies are great…), and the bonus is that you include the bias imposed by actual use of the system.
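A rough sketch of that weighted classification benchmark, assuming a ground-truth tag set per document is available. The names `per_tag_precision_recall`, `weighted_average` and `tag_query_freq` are hypothetical, and weighting each tag by how often it appears in queries is just one way to fold in the idea from option 1.

```python
def per_tag_precision_recall(gold, predicted):
    """Return {tag: (precision, recall)}, computed over documents.

    gold, predicted: {doc id: set of tags} (hypothetical shape).
    """
    tags = set().union(*gold.values(), *predicted.values())
    scores = {}
    for tag in tags:
        gold_docs = {d for d, ts in gold.items() if tag in ts}
        pred_docs = {d for d, ts in predicted.items() if tag in ts}
        tp = len(gold_docs & pred_docs)
        precision = tp / len(pred_docs) if pred_docs else 0.0
        recall = tp / len(gold_docs) if gold_docs else 0.0
        scores[tag] = (precision, recall)
    return scores


def weighted_average(scores, tag_query_freq):
    """Average per-tag precision and recall, weighted by query frequency of each tag."""
    total_weight = sum(tag_query_freq.get(tag, 0) for tag in scores)
    if total_weight == 0:
        return 0.0, 0.0
    precision = sum(tag_query_freq.get(t, 0) * p for t, (p, _) in scores.items())
    recall = sum(tag_query_freq.get(t, 0) * r for t, (_, r) in scores.items())
    return precision / total_weight, recall / total_weight
```

Running this once with the manual tags as `predicted` and once with the auto tags gives two comparable pairs of figures.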
Thanks gdupont: there’s a qualitative difference tho between creating relevance judgements and creating gold standard data – it’s not just a question of effort. The former is something the technical team can and do perform by proxy with reasonable accuracy, but the latter would require access to the specialist editors (which they don’t have). It might be possible to create some manual annotations by proxy, but I for one would hesitate to call them ‘gold standard’. Also I suspect doing so for a critical mass of documents would be a much bigger undertaking than creating qrels for a sample of results for each test query (although I’ll double check the figures on this in case I’m overestimating the former).
Being involved in ground truth creation right now (not for IR), I can only agree on the comparative amount of work. Moreover, in the controlled experiment you could select the important queries to test, to maximize the value of the results (and complement the basic benchmark with user feedback, for sure).
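For reference, scoring the controlled experiment could look something like the sketch below: the same benchmark queries are run under each tagging condition, and precision and recall are computed against the qrels. The cutoff `k`, the function names and the data shapes are assumptions for illustration only.

```python
def precision_recall_at_k(ranked, relevant, k=10):
    """Precision and recall of the top-k results against a set of relevant doc ids."""
    top = ranked[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def compare_conditions(runs, qrels, k=10):
    """Mean precision and recall per tagging condition.

    runs: {condition: {query: ranked list of doc ids}} (hypothetical shape).
    qrels: {query: set of relevant doc ids}.
    """
    summary = {}
    for condition, rankings in runs.items():
        pairs = [precision_recall_at_k(rankings.get(query, []), relevant, k)
                 for query, relevant in qrels.items()]
        n = len(pairs)
        if n == 0:
            summary[condition] = (0.0, 0.0)
            continue
        summary[condition] = (sum(p for p, _ in pairs) / n,
                              sum(r for _, r in pairs) / n)
    return summary
```

Restricting `qrels` to the most important queries is then just a matter of which queries go into the benchmark set.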