I need to compare two text classifiers – one human, one machine. They are assigning multiple tags from an ontology. We have an initial corpus of ~700 records tagged by both classifiers. The goal is to measure the ‘value added’ by the human. However, we don’t yet have any ground truth data (i.e. agreed annotations).
Any ideas on how best to approach this problem in a commercial environment (i.e. quickly, simply, with minimum fuss), or indeed what’s possible?
I thought of measuring the absolute delta between the two profiles (regardless of polarity) to give a ceiling on the value added, and/or comparing the profile of tags added by each human coder against the centroid to give a crude measure of inter-coder agreement (and hence the difficulty of the task). But neither really measures the ‘value added’ that I’m looking for, so I’m sure there must be better solutions.
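To make the first idea concrete, here’s a rough sketch of the per-record delta I have in mind (assuming each record’s tags are held as Python sets; the toy data is purely illustrative):

```python
# Rough sketch: per-record delta between the two tag profiles.
# Assumes each record's tags are available as sets of tag IDs.

def absolute_delta(human_tags, machine_tags):
    """Size of the symmetric difference, regardless of polarity
    (added and removed tags count equally)."""
    return len(human_tags ^ machine_tags)

def mean_delta(records):
    """records: list of (human_set, machine_set) pairs."""
    return sum(absolute_delta(h, m) for h, m in records) / len(records)

# Toy data, purely illustrative:
records = [({"finance", "risk"}, {"finance"}),
           ({"health"}, {"health", "policy"})]
print(mean_delta(records))  # 1.0
```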
Suggestions, anyone? Or is this as far as we can go without ground truth data?
I don’t know if the evolving tags of the human coders are their value added. I think the true value of human coding is in the handling of ambiguous language, not in the multiplicity of tags. A growing tagset can be helpful, or it can add information that is minimally useful. And the computer tagset is probably also capable of growing or evolving with some help.
If you take some time to look at the mismatches, you can probably get a quick feel for what the differences were and how best to proceed with describing them.
Thanks – I agree with your comments about the provenance of the value added… but short of observing the task performance (which we can’t do right now) we only have their output as a source of data. BTW I’m not sure if I understand your comments about ‘growing tagset’, but in our case the ontology is fixed (for both taggers).
And yes, I can see differences by eye, but I’m looking for some principled way of quantifying the differences I am seeing.
How about using ESA (http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html) to measure the semantic relatedness of the human-assigned tags to the records?
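For example (a very rough sketch, assuming you’ve already run the ESA tool to obtain sparse concept vectors, represented here as dicts mapping concept IDs to weights):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse ESA vectors
    (dicts mapping concept ID -> weight)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tag_record_relatedness(tag_vector, record_vector):
    # How semantically close is a human-assigned tag to the record itself?
    return cosine(tag_vector, record_vector)
```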
Tony,
The concept of ‘value added’ by the human taggers is unclear to me. In both cases the tagging is the result of humans adding semantics: once through the rules for the machine tagger, and once via the “human” taggers.
Can you say a bit more about what you see as a separate ‘value added’ by the human taggers?
Patrick
Patrick – Thanks for stopping by. I agree I’ve been a bit ambiguous about that – partially because I was rather hoping that some clarity on that might emerge from the discussion itself 🙂
I guess what I originally anticipated was that there might be something that could act as a proxy for retrieval effectiveness, such as the granularity of tags applied (e.g. a small number of highly specific, fine-grained tags might be better than a large number of generic ones). This can be independently measured without ground truth data, as can inter-coder agreement and various other metrics. But I hadn’t really thought it through any more than that – like I said, I was open to suggestions as to what might serve as a robust but simple measure of value added.
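To illustrate what I mean by granularity, something like this sketch would do (assuming the ontology is available as a child-to-parent mapping, with deeper tags treated as more specific):

```python
# Sketch: tag granularity as depth in the (fixed) ontology.
# Assumes a child -> parent dict, with root concepts mapping to None.

def depth(tag, parent_of):
    d = 0
    while parent_of.get(tag) is not None:
        tag = parent_of[tag]
        d += 1
    return d

def mean_granularity(tag_set, parent_of):
    """Average depth of the assigned tags; higher = more specific."""
    if not tag_set:
        return 0.0
    return sum(depth(t, parent_of) for t in tag_set) / len(tag_set)
```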
I appreciate that the conventional way to evaluate such things is against gold standard (i.e. manually annotated) test data, but that’s not an option available to us right now. So if you can suggest anything useful in its stead, I’d be delighted to hear it.
Why not simply use inter-annotator agreement?
(something like: http://www.comp.nus.edu.sg/~nght/pubs/siglex99.ps)
I know that one is a machine, but I think it could be interesting indeed.
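e.g., treating each tag as a binary present/absent decision per record and averaging Cohen’s kappa over tags, something like this sketch (not the exact method from the paper):

```python
def cohen_kappa_binary(a, b):
    """a, b: parallel lists of 0/1 decisions for a single tag."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def mean_kappa(records, tagset):
    """records: list of (tags_a, tags_b) set pairs.
    Averages per-tag kappa over the whole (fixed) tagset."""
    kappas = []
    for tag in tagset:
        a = [int(tag in ta) for ta, _ in records]
        b = [int(tag in tb) for _, tb in records]
        kappas.append(cohen_kappa_binary(a, b))
    return sum(kappas) / len(kappas)
```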
seralf: > Why not simply use inter-annotator agreement?
Because we have no gold-standard test data (see above).
If you compare the tags from the two sources (human and machine) and discard from the human-annotated set all the tags common to both, then, on the assumption that the human will not add wrong tags, what you are left with are the tags that add value over what the machine tagged. (Unless I missed the point completely?)
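In sketch form, assuming per-record tag sets, that would be something like:

```python
from collections import Counter

def human_added(human_tags, machine_tags):
    """Tags the human assigned that the machine did not: the
    candidate 'value added', under the no-wrong-tags assumption."""
    return human_tags - machine_tags

def value_added_profile(records):
    """Aggregate the human-only tags over (human_set, machine_set) pairs."""
    counts = Counter()
    for h, m in records:
        counts.update(human_added(h, m))
    return counts.most_common()
```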
Thanks Ravish… interesting. When you say ‘common’ do you mean high frequency, or common to both sets? I am assuming the former. Unfortunately, though, I’m not sure we can say that humans don’t add tags that are ‘wrong’ – in practice some tags are added that probably wouldn’t make the cut if we did a collective rationalisation session (e.g. to produce a gold standard data set).
As it happens, though, we did do something very similar to what you describe, i.e. comparing the two sorted frequency lists of tags and then identifying those that were ‘disproportionately added’ by humans and those ‘disproportionately removed’. As I recall, the auto-tags commonly removed by humans tended to be very abstract, generic concepts, while the human-added tags tended to be more specific concepts. With a bigger data sample it would probably have been worth analysing this further to find other patterns and characteristics of the two processes.
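In sketch form, the comparison was roughly equivalent to something like this (smoothed log-ratios are one way to do it; the smoothing constant is arbitrary, and this is a reconstruction rather than the exact calculation we ran):

```python
import math
from collections import Counter

def log_ratio_profile(records, smoothing=0.5):
    """For each tag, log((human freq + s) / (machine freq + s)).
    Large positive values = disproportionately added by humans;
    large negative values = disproportionately removed."""
    human, machine = Counter(), Counter()
    for h, m in records:
        human.update(h)
        machine.update(m)
    tags = set(human) | set(machine)
    scores = {t: math.log((human[t] + smoothing) / (machine[t] + smoothing))
              for t in tags}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```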