
Search at the Guardian Newspaper
On Monday night I had the privilege of attending the Search at the Guardian event, organised by Tyler Tate of the Enterprise Search London MeetUp group and Martin Belam of the Guardian newspaper. I say “privilege”, as it seems all 60 places were snapped up within a matter of days, so I consider myself fortunate to have grabbed a place. Seems like Search has gone viral this last week … anyway, it was well worth the trip as the Guardian put on a great show, consisting of talks from their technology development team about the challenges in providing a searchable archive of all their content.
To add an extra note of personal interest, they are actually in the process of migrating from Endeca (for whom I currently work) over to Apache Solr, and with it embracing the wider opportunities of providing open access to their data and search services. One of their key goals in doing this is to enable the development community at large to create value-adding apps and services on top of their data and API, thus transforming the Guardian’s role from publisher to content platform.
By their own admission, they haven’t got the best out of their Endeca investment, and have allowed their installation to get wildly out of date and unsupported. So what they have on their live site is hardly representative of a typical Endeca deployment. That said, I think there are some basic user experience issues they could improve, regardless of platform. In particular, I think there are significant issues around their implementation of faceted search and the overall design of their results pages. In addition, I think there are some missed opportunities regarding the extent to which the current site supports a serendipitous discovery experience (something which a site like this, if designed appropriately, should really excel at). If I get the chance I’ll provide a fuller review, but for now it is probably instructive to refer to the Endeca UI Design Pattern Library, in particular the entries for Faceted Navigation: Vertical Stack, Search Box, and Search Results: Related Content. These patterns provide much of the background necessary for addressing the immediate issues. (NB although these patterns are published by my colleagues at Endeca, the guidance is essentially platform-agnostic and applies to search and discovery experiences in general.)
But let’s get back to the event itself. All of the half dozen or so presentations were valuable and instructive, but as a UX specialist I particularly enjoyed Martin Belam’s talk, in which he discussed “Why news search fails… and what you can do about it”. I have a lot of sympathy with Martin’s observations about the Guardian’s site users and their expectation that the search engine should be able to “read minds”. In particular, he cited the classic problems caused by underspecified or incomplete queries (e.g. should a search for a single word such as “Chile” return stories of mining accidents or football reports?). Interestingly, this is exactly the sort of phenomenon that should be reduced by features such as Google Instant – if you can see the mix of results your query will return before you hit enter, you are more likely to provide the context needed for adequate disambiguation.
Martin also talked about the “long tail” of search queries, i.e. the hapax legomena that occur in any search log. Search logs, like most natural (language) phenomena, display a Zipfian distribution, i.e. term rank and frequency are inversely related by a power law. In the Guardian’s case this means a typical day can produce some 17,000 unique queries, most consisting of idiosyncratic edge cases. However, a few common patterns do recur, including:
- People’s names (which are often incomplete, as alluded to above)
- Dates (which Martin argued were highly generative and therefore not easily matched by regular expressions, but based on my experiences with named entity recognition at Reuters, I’d be more optimistic about the prospects for this)
- Misspellings and typographic errors (which in many cases I’d argue are addressable through Auto-correct and Did You Mean techniques, i.e. string-edit distance against a cumulative list of known terms)
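On that last point, here is a minimal “Did you mean?” sketch of the kind of technique alluded to above: compare an incoming query against terms already seen in the index or the query logs, and suggest the nearest one within a small edit distance. The term list and distance cut-off below are invented for illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein (string-edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def did_you_mean(query, known_terms, max_distance=2):
    """Return the closest known term within max_distance edits, or None."""
    distances = {term: edit_distance(query.lower(), term) for term in known_terms}
    best = min(distances, key=distances.get)
    return best if distances[best] <= max_distance else None


# Hypothetical cumulative list of known terms, e.g. harvested from query logs.
print(did_you_mean("guardain", ["guardian", "chile", "chess", "boxing"]))
# -> "guardian"
```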
Also intriguing was the observation that only 1% of their current page views are search-driven – I wonder how this will change as consumption of their content increasingly occurs in a mobile context, with users engaging in highly goal-driven, spontaneous or impulsive tasks, for which search is the obvious entry point? He also outlined some of the ways in which their site search exploits context and metadata to deliver a richer experience (than web search), and uses manually assigned tags to dynamically generate topical landing pages for arbitrary query combinations (e.g. “chess” and “boxing”). Martin also alluded to a vision of using “multiple search boxes” to infer the user’s intent based on local context (but I’d prefer to think of this as multiple instances of a single search box).
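As an aside, the tag-combination landing pages Martin described can be pictured as a simple intersection over per-article tag sets. The sketch below is purely illustrative (the articles and tags are invented) and is not the Guardian’s actual implementation, which presumably runs as a query against their search index.

```python
# Invented sample data: each article carries a set of manually assigned tags.
articles = [
    {"headline": "A chess tournament report", "tags": {"chess", "sport"}},
    {"headline": "A chess boxing event preview", "tags": {"chess", "boxing", "sport"}},
    {"headline": "A heavyweight title fight report", "tags": {"boxing", "sport"}},
]


def landing_page(requested_tags):
    """Return the articles carrying every requested tag."""
    wanted = set(requested_tags)
    return [a for a in articles if wanted <= a["tags"]]  # subset test


print([a["headline"] for a in landing_page({"chess", "boxing"})])
# -> ['A chess boxing event preview']
```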
One final point – surely all that manual tagging is insanely time-consuming and non-scalable? I understand of course the need to apply human editorial quality control, but at Reuters even back in 2002 we were using semi-automated text categorization solutions to successfully tag over 11,000 stories a day (and had been doing so for many years previously). I’m a bit surprised the Guardian appears to be so reliant on manual methods, and am curious to know how they view the trade-off between efficiency, accuracy & throughput.
So all in all, a very productive and enjoyable evening – thanks again to Tyler and Martin for making this happen.
Hi Tony, I’m glad you enjoyed the evening. We were really taken aback by how much demand there was to attend, and I’ve been thrilled that the feedback on the event has been overwhelmingly positive.
I hope we made it clear enough on the night, and I’m happy to reiterate here, that for some time we’ve been using an antiquated and unsupported version of the Endeca platform. That means, of course, that we’ve been unable to take advantage of several years’ worth of new features and improvements to the engine. The hack day based on our content that Endeca did last year was truly impressive, and I think it is a testament to how good the concept of guided navigation via facets is that we were inspired by it when building our in-house CMS. After all, one of our tag pages, for example http://www.guardian.co.uk/world/algeria, is essentially just a search results page in a more content-orientated template.
One final thing, about auto-tagging. I think that if we were to add a significant chunk of archived material into the system we’d definitely need some machine help. At the moment, though, our keywords, and there are about 7,000 of them, are aimed more at being a tool of editorial curation than at comprehensive entity extraction. Of course, with our API, there is nothing to stop someone chucking loads of our content into something like OpenCalais and demonstrating what could be achieved – in fact the Open Platform positively encourages other people to annotate our content – for example see http://www.guardian.co.uk/open-platform/apps-zemanta and http://www.guardian.co.uk/open-platform/apps-content-tagger
Hi Martin,
Yes, I agree about the antiquated use of Endeca – I hope I made that observation clear in the above post too! BTW, I think you and your colleagues were very magnanimous in your commentary about your experiences with Endeca and the subsequent decision to migrate to Solr – you may not have been aware that I (as an Endeca representative) was present, but I thought the observations you made were very fair and even-handed. Just like the Guardian itself 🙂
Thanks also for the links re auto-tagging. This is an initiative I’d love to get involved with! Will definitely follow those up.
Tony
I’m interested in the success you had at Reuters with semi-automated approaches. When we went down that route at the BBC we saw some interesting effects where the journalists got, I guess, led astray by the suggestions. They tended to accept the automated suggestions regardless of whether they would have previously considered the article to be about that. I imagined the thought process being something like “Cats? Oh well, I suppose there is a cat mentioned in paragraph 10”. I read some research at the time about semi-automatic approaches similarly lowering the quality of classification by trained medical cataloguers.
Hi Karen
Apologies for the tardy response (been away for half term). I think it’s an intriguing question, and unfortunately one I can’t give a definitive answer to, as Reuters closed down its Technology Innovation Group (and the work my team was doing) before we had a chance to assess the adoption patterns out in the field. That said, we did document in a fairly rigorous manner everything we did up to that point; the coding scheme and process, and the analysis of inter-coder consistency, are all documented in:
T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 – from Yesterday’s News to Tomorrow’s Language Resources. In Proceedings of the Third International Conference on Language Resources and Evaluation, 2002.
It’s a long time ago now, but looking briefly at the inter-coder consistency section, there doesn’t appear to be any evidence of the “cats” phenomenon you describe (unless all editors were equally affected!?). But I think to answer the question definitively we’d need to dig deeper into the dynamics of the coding process, something which may no longer be possible from the raw data alone.
Moving on from that, the evolution of the coding process and policies, and how they relate to coding quality, are discussed further in:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
On reflection it sounds to me as if the issue you saw at the BBC may have been a combination of the coding policy in place at the time (which may have been interpreted too liberally), the accuracy of the coding system itself, and the adoption issues that inevitably arise when workflow changes as a result of (semi-)automation.
BTW, my ex-Reuters colleague Jonathan Engel may know more – he is an ex-journalist and worked very closely on the roll out of all the above. I’ll drop him a line – but in the meantime you can see more at:
http://www.infoark.co.uk/reuters.htm
Jonathan also produced a white paper on the subject but I can’t seem to find it on his website. It probably got mis-coded 🙂
Tony
Thanks, Tony. To answer Karen’s question, I’ll summarise and update the paper you referred to with some comments below:
Accurate content classification – at Reuters and elsewhere – begins with a well-structured taxonomy of terms, preferably one expanded into a thesaurus and ontology with synonyms, related taxonomy topics and what I call “contextual keywords” – the words or phrases you’d expect to find in relevant documents or suggested by subject experts.
The initial structuring should also separate pure “Subject” topics like Social care or Taxation from Entities, those proper nouns designating Geographic locations, Organisations or People.
Most content classifiers use separate entity-extracting software to identify these terms and add related topics from a repository or authority file. Thus an Entity topic like House of Commons could automatically be tagged with attributes such as Legislature, Lower House, UK, and the subject topic of Laws and law-making.
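As a rough illustration of that enrichment step (a hypothetical sketch, not any particular vendor’s software), it amounts to little more than a lookup against an authority file keyed by entity name:

```python
# Hypothetical authority file: each Entity topic maps to the attributes and
# related Subject topics that get added automatically when the entity is found.
AUTHORITY_FILE = {
    "House of Commons": {
        "attributes": ["Legislature", "Lower House", "UK"],
        "subjects": ["Laws and law-making"],
    },
}


def enrich(entities):
    """Expand extracted entity names with attributes from the authority file."""
    tags = []
    for name in entities:
        record = AUTHORITY_FILE.get(name, {})
        tags += [name] + record.get("attributes", []) + record.get("subjects", [])
    return tags


print(enrich(["House of Commons"]))
# -> ['House of Commons', 'Legislature', 'Lower House', 'UK', 'Laws and law-making']
```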
Subject topics also need to be differentiated from “focused filters” that define Content types (eg Interview, Survey), Events (Election, News announcement) or Projects and Initiatives (Millennium Development Goals, End Child Poverty).
The resulting, now smaller, list of Subject topics still requires the most work during automated classification, but it is possible to achieve accuracy scores of 85 to 95 percent, with results balanced between recall and precision.
As Karen noted, the biggest problem is with over-classification, but it can be avoided by devising rules with high accuracy thresholds to identify a primary Subject topic and limited numbers of secondary topics. So yes, the article might mention “cats,” but not prominently or often enough to suggest the topic tag to editors.
Most classification software operates through a combination of rules and training documents; at the very least, the rules help you build a well-balanced training set.
The key to a good tagging rule lies in understanding the relationships among the components of the controlled vocabulary – the preferred term, the broader and narrower terms, the synonyms and the associated terms – and reflecting these relationships in the correct rule syntax, including proximity comparisons and accuracy weightings.
For instance, the preferred term and synonyms should carry higher weightings than associated terms – the related topics or contextual keywords. However, a rule can be improved by also testing for close proximity of the preferred term and synonyms to these associated terms.
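To make the shape of such a rule concrete, here is a minimal sketch; it is purely hypothetical and not the syntax of any particular classification product. The preferred term and synonyms carry a higher weight than associated terms, an associated term earns a small bonus when it occurs close to a core term, and the tag is only suggested when the total score clears a threshold. All terms, weights, window sizes and thresholds are invented for illustration.

```python
import re

# A hypothetical tagging rule: preferred term and synonyms ("core" terms) are
# weighted more heavily than associated, contextual keywords, and associated
# terms earn a bonus when they occur near a core term.
RULE = {
    "topic": "Weather and natural disasters",
    "core": ["hurricane", "earthquake", "flood"],             # preferred term + synonyms
    "associated": ["evacuation", "storm surge", "rainfall"],  # contextual keywords
    "core_weight": 3.0,
    "associated_weight": 1.0,
    "proximity_bonus": 1.0,  # per associated term found near a core term
    "window": 50,            # proximity window, in characters
    "threshold": 5.0,        # minimum score for the tag to be suggested
}


def find_positions(text, terms):
    """Character offsets of every occurrence of every term (crude substring match)."""
    return [m.start() for term in terms for m in re.finditer(re.escape(term), text)]


def apply_rule(document, rule=RULE):
    text = document.lower()
    core_hits = find_positions(text, rule["core"])
    assoc_hits = find_positions(text, rule["associated"])

    score = (len(core_hits) * rule["core_weight"]
             + len(assoc_hits) * rule["associated_weight"])

    # Proximity test: reward associated terms that appear close to a core term.
    for a in assoc_hits:
        if any(abs(a - c) <= rule["window"] for c in core_hits):
            score += rule["proximity_bonus"]

    return rule["topic"] if score >= rule["threshold"] else None


article = ("The hurricane made landfall overnight, forcing the evacuation of "
           "coastal towns as the storm surge flooded low-lying areas.")
print(apply_rule(article))  # -> "Weather and natural disasters"
```

The threshold is what guards against the over-classification Karen described: a single incidental mention of an associated term scores too low to trigger the tag.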
I’ve recently demonstrated the effectiveness of such well-structured rules for other media companies employing automated content tagging.
In one Proof of Concept, I created a pool of documents that had been manually reviewed and tagged with an appropriate topic, eg “Weather and natural disasters.” With the accuracy threshold set fairly high at 80 percent, my basic “out of the box” metadata-generated rules succeeded in finding 84 percent of the relevant documents. In eight out of 10 cases, the accuracy level for these documents exceeded 90 percent.
A lower accuracy threshold (eg one mention of “cats”) would have increased the document retrieval or recall rate closer to 100 percent. However, my preferred method of improvement is to boost the precision score; I customise each “out of the box” rule by asking domain specialists to provide more synonyms and contextual keywords. These terms (eg tabby, domestic shorthair, cat food, kitty litter, cat flap, veterinarian) would be sourced from related documents or via tags from collaborative, social networking sites.
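To make the trade-off explicit, here is a tiny worked example; the document scores and relevance judgements are invented, and only the direction of the effect matters. Lowering the threshold recovers more of the relevant documents (higher recall) but also sweeps in false positives (lower precision).

```python
def precision_recall(predicted, relevant):
    """Standard set-based precision and recall."""
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical rule scores per document, and the documents a human tagged
# as genuinely relevant to the topic.
scores = {"d1": 0.95, "d2": 0.90, "d3": 0.85, "d4": 0.60, "d5": 0.40, "d6": 0.30}
relevant = {"d1", "d2", "d3", "d4"}

for threshold in (0.8, 0.3):
    predicted = {doc for doc, s in scores.items() if s >= threshold}
    p, r = precision_recall(predicted, relevant)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# threshold=0.8: precision=1.00, recall=0.75   (misses d4)
# threshold=0.3: precision=0.67, recall=1.00   (sweeps in d5 and d6)
```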
That way, each rule remains consistently structured and weighted to deliver a baseline level of accuracy for the entire corporate vocabulary. If needed, I can then modify the structure or weighting of ALL rules at once to boost the retrieval rates – for instance by increasing the weight of associated, contextual keywords – while retaining a high degree of precision in the results.
This method allows an organisation to produce accurate rules, training sets and content filters quickly, at double or triple the rate compared with the manual method, estimated by both Reuters and Dow Jones as only two topics per day.
For details of recent work, see my consultancy website, http://www.infoark.co.uk.
Thanks for an interesting post. I have added you to my blogroll on http://blog.jamesbayley.com