
Posts Tagged ‘NLP’

A bit of a diversion here… but couldn’t help laughing when I read the entry for a certain hotel in the printed version of the Pride of Britain Hotels directory earlier today. I’ll spare them the embarrassment of naming them, but if you look closely you’ll find one that serves an English speciality known as “local duck spit roasted on an open fire”. Yum! Tastes much better without the hyphen, I think you’ll agree.

And while we’re on a general Eats, Shoots & Leaves theme, since when did “likely” become an adverb? Continued use of this word in such a manner will ‘likely’ annoy me. Sadly, proliferation of this grammatical monstrosity makes truly depressing viewing (according to Google ngrams):

(more…)

Read Full Post »

Here’s a quick shout out for Friday’s meeting of the London Text Analytics group, which will be held at Fizzback‘s offices on the Strand at 18:30. As usual, we’ll aim to start with a couple of informal talks then adjourn to a local pub for a drink or two afterwards. As it happens, this meetup is now full, but you can always join the waiting list or (if you’re not yet a member) sign up for early notification of the next event. Full details below – hope to see you there.

Automating the formalization of clinical guidelines using information extraction: an overview of recent lexical approaches

Phil Gooch (City University)

Formalizing guideline text into a computable model, and linking the clinical terms and recommendations in clinical guidelines to concepts in the electronic health record (EHR), is difficult because, typically, both the guideline text and the EHR content may be ambiguous, inconsistent and reliant on implicit background medical knowledge. How can lexically-based IE approaches help to automate this task? In this presentation, various design patterns are discussed and some tools presented.

Question-Answering over Linked Data

Danica Damljanovic (Sheffield University)

The availability and growth of the Linked Open Data cloud have made rich semantics widely accessible, but exploiting them remains challenging, mainly due to the scale of the data. In this talk I will discuss the challenges of building a Question-Answering system that uses these data as the main source for finding answers. I will introduce the FREyA system, which combines syntactic parsing with semantic annotation in order to correctly interpret the question, and involves the user in a dialogue if necessary. Through the dialogue, FREyA allows the user to validate or change the interpretation of each word in the question – the user’s input is used to train the system and improve its performance over time.

Related Posts:

  1. Prostitutes Appeal to Pope: Text Analytics applied to Search
  2. The role of Natural Language Processing in Information Retrieval
  3. IR book is out!
  4. Applying text analytics to product innovation and legal cases
  5. Text Analytics: Yesterday, Today and Tomorrow

 

Read Full Post »

Here’s something I’ve been meaning to share for a while. The content was originally written as notes to complement my tutorial on NLP + IR, but I thought it might be of interest to a wider audience. At some point I’ll update it to include a section on Sentiment Analysis (who doesn’t these days?!) In the meantime I hope you find it useful – feel free to suggest additions or improvements either here or offline.

1. Introduction

The field of text analytics is laden with jargon and acronyms, so before we go any further, let’s clarify a few terms.

Text analytics describes a set of linguistic, analytical and predictive techniques by which structure and meaning can be automatically extracted from unstructured documents. Text analytics (TA) provides the ability to cluster, categorize and classify unstructured data, supporting both interactive exploration and automated document processing. TA processes can thus be used to improve the accuracy of search engines and to transform the search user experience by enhancing results presentation and facilitating interactive navigation and exploration.

The term text mining has a history of at least a decade and is well understood by practitioner and research communities. The term text analytics has a somewhat shorter history, but is growing rapidly in popularity as a label for the application of NLP techniques in practical, commercially-oriented contexts.

Some authors treat TM and TA as synonyms, but this misses a subtle but important distinction: TM has its origins in automated, offline processing, characterised by ‘data in’ and ‘structure out’. It is primarily concerned with identifying and inferring aggregate patterns and insights across collections of documents. By contrast, TA is a more generic term, often associated with interactive, analytic applications and a focus on the analysis of individual documents.

1.1 The Challenge of Text Analytics

On the surface, the problem that TA attempts to solve appears trivial. After all, as humans we effortlessly create and interpret both spoken and written language, and rarely have to apply conscious thought to the process. But that, in many ways, is part of the problem: if the history of artificial intelligence teaches us anything, it is that many of the faculties we possess as humans are not open to introspection, and are therefore notoriously difficult to analyse and encode algorithmically. The ease with which we use language does not make the task any less challenging for machines: far from it. Natural language is littered with ambiguity, and it is only because we have evolved highly parallelised linguistic processes operating across multiple levels that we can make any sense of it at all.

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.

1.2 Text Analytics Opportunities

It has been estimated that as much as 80% of all potentially usable business information originates in unstructured form, e.g. as:

  • Email, news and blog articles; forum postings; and other social media;
  • Contact-centre notes and transcripts;
  • Surveys, feedback forms and warranty claims;
  • Various types of corporate document, etc.

Moreover, the rate at which the world’s digital data repositories are growing creates a further challenge: according to International Data Corporation, the world generated 161 billion gigabytes (roughly 161 exabytes) of digital information in 2006, a figure expected to grow to approximately 988 exabytes by 2010 (Ingersoll & Morton, 2009). And even if much of this data is in the form of audio, video, images, and so on, our current approaches to making it searchable still revolve around the provision of manual summaries, captions and keyword tags. In other words, we are producing yet more text that has to be analysed and managed.

In addition to the growth of social media, we are seeing progressively more eCommerce sites employing user-generated content in the form of product reviews and opinions to support the purchasing process and drive sales growth. To maximise the value of this content, these organisations also need to employ analytic tools to make it searchable and navigable.

2. Fundamentals

In the section above we alluded to the highly parallelised nature of human language processing, and the notion that language understanding requires the application of linguistic and pragmatic knowledge across a number of levels. Text analytics is no different – all but the most trivial of systems will apply linguistic analysis across a number of levels, often in cascaded fashion, where the output of the simple, lower level processes feeds into more complex, higher level ones. In this section we will review the main techniques involved.

2.1 Basic Processing: Lexical & Syntactic Analysis

At the most basic level is the treatment of character encodings, such as ASCII, Unicode (e.g. UTF-8) and so on. Although not traditionally considered an active area for NLP research, these processes nonetheless have a direct influence on the representation of case (upper & lower), punctuation, accents, and so on, which all in turn carry meaning at the lexical level (see below).

At the level above characters we deal with terms and their variants, i.e. words, abbreviations, acronyms, compounds and so on. At this level, known as lexical analysis, we deal with tokenisation, i.e. the recognition of word boundaries. This process is relatively straightforward for languages that use whitespace as a delimiter, such as English, but is harder for languages such as Chinese and Japanese. At the lexical level we also deal with morphology, i.e. the process by which words are formed. Morphological analysis allows us to identify inflectional variants, such as “run” and “ran”, which are the present- and past-tense forms of a common stem. We can also apply derivational morphology to recognise that “runner” is a further variant of “run” and “ran”, albeit one belonging to a different syntactic category (a noun in this case).

Note that it is possible to write a reasonably accurate automatic stemmer for English, which will conflate sets of related terms to a common root. A popular example of one such tool is the Porter stemmer.
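
To make this concrete, here is a minimal sketch using the open-source NLTK toolkit (my choice purely for illustration; the notes do not prescribe any particular tool), which tokenises a sentence and then conflates related word forms with the Porter stemmer:

# Sketch of lexical analysis with NLTK: tokenisation followed by Porter stemming.
# A one-off nltk.download('punkt') may be needed to fetch the tokeniser models.
import nltk
from nltk.stem import PorterStemmer

sentence = "They connected the connections by connecting the connectors."
tokens = nltk.word_tokenize(sentence)

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# 'connected', 'connections', 'connecting' and 'connectors' all conflate to 'connect'.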

Finally we use the outputs of these processes to perform syntactic analysis, i.e. the interpretation of the grammatical structure of a piece of text. A popular illustration of the role of grammar in English is the contrast between the phrases ‘Venetian blind’ and ‘blind Venetian’ – they contain the same words but have very different meanings. The most elementary form of syntactic analysis is the application of a Part-of-Speech (PoS) tagger, which attaches a tag to each word to indicate whether it is being used as a noun, verb, adjective, determiner, and so on. The tagset used by a particular PoS tagger may be chosen to emphasise certain grammatical distinctions. For example, the Penn Treebank tagset (Marcus et al. 1993), which is commonly used by PoS taggers, has four separate tags for nouns:

  • NN (singular common noun or mass noun)
  • NNP (singular proper noun)
  • NNS (plural common noun)
  • NNPS (plural proper noun).

Part of speech tagging has been researched since the 1950s (Voutilainen 2003) and a number of tagging systems are now readily available. These systems use a variety of strategies, including handwritten rules (Karlsson 1990), hidden Markov models (Cutting et al. 1992) and n-grams learned from text annotated with part of speech information (Church 1988).
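
As a quick illustration, here is a minimal sketch using the open-source NLTK toolkit (again, an illustrative choice rather than a recommendation), whose default tagger uses the Penn Treebank tagset:

# Sketch of part of speech tagging with NLTK's default tagger (Penn Treebank tagset).
# The 'punkt' and 'averaged_perceptron_tagger' resources may need a one-off
# nltk.download(...) before this will run.
import nltk

tokens = nltk.word_tokenize("The blind Venetian lowered the Venetian blind")
print(nltk.pos_tag(tokens))
# Expect something like [('The', 'DT'), ('blind', 'JJ'), ('Venetian', 'NNP'), ...]
# though the exact tags depend on the tagger model.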

Operating around the syntactic level are various other analytic processes, such as:

  • Sentence boundary detection, which sounds trivial but is deceptively complex. A publicly available sentence splitter is available from CPAN.
  • Phrase detection, i.e. the recognition of noun phrases and other syntactic compounds.
  • Summarisation, i.e. the ability to create an abstract or topical summary from a free text document. This sounds complex, but becomes much more tractable with semi-structured text such as HTML content. A publicly available summariser for HTML text is available from CPAN.
  • Parsing, i.e. the analysis of a sequence of tokens to determine their syntactic structure with respect to a particular grammar. In practice this often means analysing PoS-tagged text to identify Subject–Verb–Object and other relationships, producing higher-level linguistic analyses of the text (a simple sketch of sentence splitting and phrase detection follows this list).
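
By way of a rough illustration of the first two items (again assuming NLTK rather than the CPAN modules mentioned above), sentence boundary detection and simple noun-phrase detection might be sketched as follows; the chunking grammar here is purely illustrative:

# Sketch: sentence splitting plus a simple regular-expression noun-phrase chunker.
import nltk

text = ("Mr James is stepping down as chief executive officer on July 1. "
        "He will be succeeded by Mr Dooner.")

sentences = nltk.sent_tokenize(text)        # sentence boundary detection
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"         # optional determiner, adjectives, then nouns
chunker = nltk.RegexpParser(grammar)

for sent in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    tree = chunker.parse(tagged)            # shallow parse marking NP subtrees
    print([" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"])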

2.2 Named Entity Recognition

Named entity recognition (NER) is the process by which key concepts such as the names of people, places and organizations are identified within a document. Some commercially available systems also include the ability to identify further entity types, such as postcodes, telephone numbers and so on. In addition, some systems have been built to identify more esoteric items such as the names of proteins. NER is best illustrated using an example, such as the following extract from a news article:

“‘Mexico has been trying to stage a recovery since the beginning of this year and it’s always been getting ahead of itself in terms of fundamentals,’ said Matthew Hickman of Lehman Brothers in New York.”

The named entities we might expect to extract from this sentence would be:

Persons:          Matthew Hickman
Organisations:    Lehman Brothers
Cities:           New York
Countries:        Mexico
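
As a rough sketch of how this looks in practice (using the open-source spaCy library and its small English model purely as an example; the notes above do not name a specific tool):

# Sketch of named entity recognition with spaCy.
# Setup (assumed): pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mexico has been trying to stage a recovery, said Matthew Hickman "
          "of Lehman Brothers in New York.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expect entities along the lines of: Mexico (GPE), Matthew Hickman (PERSON),
# Lehman Brothers (ORG), New York (GPE) -- exact output depends on the model.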

The accurate identification of such entities confers a number of benefits. Firstly, by identifying such concepts in a document, it is possible to index documents at a more fine-grained level, which in turn allows more precise searching. The word ‘Bush’, for example, can refer to either a plant or an ex-President of the USA. By differentiating between words such as these (known as homonyms) we can maximize retrieval precision, i.e. the likelihood that information seekers will receive only those documents that are relevant to their query.

Secondly, by identifying names within a document we can provide links to other related resources. Several content providers already use this technique to highlight the names of the key people mentioned in a news story, and provide links to their biographies. Similarly, it is possible to create more precise filters based on those named entities, allowing personalised delivery of content to individual users. For example, a news service might supply details of upcoming events in ‘New York’ as opposed to ‘new’ events in the city of ‘York’.

Thirdly, the output of NER adds valuable structure to a document, and thus facilitates subsequent post-processing transformations, such as machine translation (e.g. translating documents from one natural language to another) or text-to-speech output (e.g. providing spoken versions of stock quotes for access by phone).

2.3 Information Extraction

Information Extraction is the process by which structured information is automatically extracted from unstructured documents. Examples of the types of information which might be extracted include the movements of company executives, victims of terrorist attacks, information about mergers and acquisitions and interactions between genes and proteins in scientific articles. When the relevant information has been identified it is then stored in a highly structured format known as a template.

The Message Understanding Conferences (MUC) (Grishman and Sundheim 1996) were an international evaluation exercise in which information extraction systems were tested against one another using common documents and scoring systems. In the sixth MUC conference, participant systems were tasked with the challenge of identifying facts about the movements of executives between companies. The following paragraph shows a typical example, which describes an event in which ‘John J. Dooner Jr.’ becomes chairman of the company ‘McCann-Erickson’:

“Now, Mr James is preparing to sail into the sunset, and Mr Dooner is poised to rev up the engines to guide Interpublic Group’s McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr Dooner, 45.”

This fact is encoded in the following template structure:

<SUCCESSION EVENT-2> :=
    SUCCESSION ORG:   <ORGANIZATION-1>
    POST:             "chairman"
    IN AND OUT:       <IN AND OUT-4>
    VACANCY REASON:   DEPART WORKFORCE

<IN AND OUT-4> :=
    IO PERSON:        <PERSON-1>
    NEW STATUS:       IN
    ON THE JOB:       NO
    OTHER ORG:        <ORGANIZATION-1>
    REL OTHER ORG:    SAME ORG

<ORGANIZATION-1> :=
    ORG NAME:         "McCann-Erickson"
    ORG ALIAS:        "McCann"
    ORG TYPE:         COMPANY

<PERSON-1> :=
    PER NAME:         "John J. Dooner Jr."
    PER ALIAS:        "John Dooner" "Dooner"

These template structures contain a large amount of information that enables complex queries to be carried out against the extracted data. For example, a user could ask for a list of all the events where a finance officer left a company to take up the position of CEO in another company. Once the templates have been extracted for a set of documents they can be used to populate a knowledge base derived from the corpus, or used as document metadata to improve retrieval precision.

However, information extraction from all but the most simple of documents is an imperfect process. One of the main reasons is that the description of an event is often spread across several sentences or even across separate paragraphs. For example, consider the following pair of sentences, which contains information about management succession events:

“Pace American Group Inc. said it notified two top executives it intends to dismiss because an internal investigation found evidence of ‘self-dealing’ and ‘undisclosed financial relationships’. The executives are Don H. Pace, cofounder, president and chief executive officer; and Greg S. Kaplan, senior vice president and chief financial officer.”

The name of the organisation and the fact that two executives are leaving are contained in the first sentence. The names of the two executives and their posts are listed in the second sentence, but it does not mention the fact that the executives are leaving these posts. Therefore, the succession event can only be fully understood through a combination of the information contained in both sentences. Combining such information across sentences is non-trivial, since it is necessary to identify the phrases which refer to the common entities, i.e. ‘two top executives’ and ‘the executives’ in the above example. The process by which references between one expression and another are identified is known as anaphora resolution.

A further challenge is recognising instances when a given entity may be referred to by different surface forms. For example, ‘International Business Machines Ltd’ may be referred to by an abbreviation (IBM), a nickname (‘Big Blue’) or a contextual anaphoric expression such as ‘it’ or ‘the company’. Challenges such as this are the main reasons it has proved difficult to create reliable information extraction systems. Scores for the template filling task in the MUC evaluations were generally in the 50–60% F-measure range, which at the time was insufficient for most commercial applications.

An intermediate process, somewhere between named entity extraction and full template filling, is relation extraction. This task is simpler than full template filling since: (a) it aims to identify only binary relations between pairs of named entities; and (b) only relations which occur within a single sentence are considered. These simplifications make the task considerably more achievable than full template filling.
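
As a toy example of sentence-level, binary relation extraction (again sketched with spaCy; the ‘PERSON of ORG’ pattern is purely illustrative and far cruder than the approaches used in real systems):

# Toy relation extraction: propose a (PERSON, works-for, ORG) relation whenever a
# PERSON entity is followed immediately by "of" and an ORG entity in the same sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("'It has been getting ahead of itself,' said Matthew Hickman of Lehman Brothers.")

for sent in doc.sents:
    entities = list(sent.ents)
    for left, right in zip(entities, entities[1:]):
        between = doc[left.end:right.start].text.strip()   # text separating the two entities
        if left.label_ == "PERSON" and right.label_ == "ORG" and between == "of":
            print((left.text, "works-for", right.text))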

2.4 Word Sense Disambiguation

In this section, we review some of the challenges presented by polysemy: the phenomenon whereby a given term may have several different meanings or senses. For example, the word ‘bat’ can mean ‘sports equipment’ (as in ‘cricket bat’) or ‘nocturnal mammal’ (as in ‘fruit bat’). Consequently, if we are provided with no additional information, it is impossible to tell whether a query which contains the word ‘bat’ refers to the domain of sports or animals. This phenomenon is common to many natural languages and represents a significant challenge to retrieval precision. The process by which the intended meaning of a given word is identified is known as word sense disambiguation (WSD).

One of the main determinants of word sense is syntactic role, i.e. words can have different meanings according to their use as a noun, verb, adjective etc. For example, two of the possible meanings of ‘light’ are ‘not heavy’ (as in ‘Magnesium is a light metal’) and ‘illumination’ (as in ‘The light in the kitchen is quite dim’). The first sense applies only when ‘light’ is used as an adjective, and the second only when it is used as a noun. Consequently, accurate identification of a word’s part of speech can be a valuable determinant of its meaning.

Wilks and Stevenson (1998) showed that word sense disambiguation could be carried out effectively using a part of speech tagger when only broad differences between word meanings were being considered. For example, this technique could be used to differentiate between the verb sense of “bank” (as in ‘the plane banked sharply’) and the noun sense (as in ‘financial institution’), but could not distinguish between more closely related meanings, such as the various noun senses used in ‘a car crashed into the bank’ and ‘the bank gave me a good rate on my mortgage’.

There has been a long history of research into WSD in NLP, during which a wide variety of techniques has been investigated. Early approaches such as Wilks (1975) and Hirst (1987) applied techniques from artificial intelligence that relied on hand-crafted knowledge sources containing detailed semantic information. However, these approaches were limited by the fact that the information they relied on had to be manually created. Large-scale machine-readable lexical resources, such as WordNet, started to become available in the 1980s and were quickly used for WSD. They have the advantage of providing both a list of possible senses for each word and information which can be used to identify the correct sense, such as the hypernym hierarchy in WordNet.
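
For illustration, NLTK includes an implementation of the (simplified) Lesk algorithm, a classic dictionary-based approach that chooses the WordNet sense whose definition overlaps most with the surrounding context. A minimal sketch follows (simplified Lesk is a weak baseline, so the chosen sense will not always be the intuitive one):

# Sketch of dictionary-based word sense disambiguation using NLTK's simplified
# Lesk implementation over WordNet. Requires the 'wordnet' corpus to be downloaded.
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("The bank gave me a good rate on my mortgage")
sense = lesk(context, "bank", pos="n")    # restrict the search to noun senses
if sense is not None:
    print(sense.name(), "-", sense.definition())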

There has been some disagreement about the usefulness of WSD for IR. Some have argued that the benefits which might be gained from disambiguation are limited. Krovetz and Croft (1992; see also Krovetz 1997) manually disambiguated a standard test corpus and found that a perfect WSD engine would improve retrieval performance by only 2%. Sanderson (1994) performed similar experiments in which ambiguity was artificially introduced to a test collection by automatically creating ‘pseudowords’. He found that a 20–30% error rate in disambiguation led to retrieval that was at the same level as, or possibly worse than, retrieval with the ambiguity left unresolved. Sanderson also found that queries containing fewer than five terms were less sensitive to mistakes in the disambiguation. He concluded that WSD was only useful if it was very accurate or the queries were short.

Others have demonstrated that WSD can be used to improve IR performance. Schutze and Pedersen (1995) showed that disambiguation can substantially improve text retrieval performance; demonstrating an improvement between 7 and 14% on average. Jing and Tzoukermann (1999) have also reported improvements of 8.6% in retrieval performance. Their disambiguation algorithm computes the word sense similarity in the local context of the query, the similarity of the lexical-occurrence information in the corpus, and the morphological relations between words.

2.5 Question Answering

In the standard information retrieval paradigm, the user is provided with a ranked list of documents through which they must search in order to locate the information they are seeking. An alternative approach, which aims to meet the user’s information need in a more focused way, is to provide specific answers to specific questions.

Question answering research has a long history (Green et al. 1961; Woods 1973) and was introduced as a task in the Text Retrieval Conference (TREC) IR evaluations in 1999 (Voorhees 1999). Within the TREC framework question answering can be thought of as a specialisation of the standard document retrieval problem in which the queries are expressed as natural language questions and the system is expected to identify the portion of the document in which the answer can be found. Like document retrieval, question answering is often carried out against a well-defined collection of documents which may, or may not, meet the user’s information need.

Questions can be posed in a number of different ways which automatic question answering systems should aim to process. Examples of possible question formats include:

  • Yes/no questions ‘Is George W. Bush the current president of the USA?’ ‘Is the Sea of Tranquillity deep?’
  • ‘Wh-’ questions (who, when, where, etc.) ‘Who was the British Prime Minister before Margaret Thatcher?’ ‘When was the Battle of Hastings?’
  • List questions ‘Which football teams have won the Champions League this decade?’ ‘Which roads lead to Rome?’
  • Instruction-based questions ‘How do I cook lasagne?’ ‘What is the best way to build a bridge?’
  • Explanation questions ‘Why did World War I start?’ ‘How does a computer process floating point numbers?’
  • Commands ‘Tell me the height of the Eiffel Tower.’ ‘Name all the Kings of England.’

The standard method for tackling question answering is to approach the problem in three separate stages:

  1. question analysis
  2. document retrieval
  3. answer extraction

The aim of the first stage is to predict the type of answer expected (for example, the expected answer for ‘When was Mozart born?’ is a date) and create a query which can then be passed to an IR system. Since the questions are expressed in natural language it is not surprising that text analysis techniques have been found helpful for this stage. For example, Greenwood et al. (2002) parse the question to produce a logical representation from which the expected answer type can be extracted. The problem of generating a query from the question has not exploited TA techniques to the same extent although some approaches, for example synonym expansion (Harabagiu et al. 2000; Hovy et al. 2000), have been explored.

The second stage of processing passes the query to an IR engine with the aim of retrieving documents which contain the answer to the question. The final stage of processing, answer extraction, aims to extract the answer(s) from these documents, using the expected answer type information determined in the first stage. Various approaches have been applied to this problem and make use of a variety of TA techniques. Straightforward methods include simple regular expressions which match the text (Ravichandran and Hovy 2002; Soubbotin and Soubbotin 2001). More complex methods make use of deep linguistic processing. For example, Harabagiu et al. (2001) and Scott and Gaizauskas (2000) carry out detailed syntactic and semantic analysis of the retrieved documents which is then queried to identify specific answers to the question. Named entity recognition is often used to identify the items in the text which are of the same type as the expected answer and are therefore possible answers to the question.
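
To make the three stages slightly more concrete, here is a deliberately naive sketch (it does not describe any of the systems cited above): the expected answer type is guessed from the question word, and named entities of that type in an already-retrieved passage are proposed as candidate answers.

# Toy question answering sketch: (1) guess the expected answer type from the question,
# (2) assume relevant documents have already been retrieved, (3) propose named entities
# of the expected type as candidate answers. Real systems are far more sophisticated.
import spacy

nlp = spacy.load("en_core_web_sm")

def expected_answer_type(question: str) -> str:
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("when"):
        return "DATE"
    if q.startswith("where"):
        return "GPE"          # spaCy's label for geo-political entities (places)
    return "OTHER"            # a real system would handle many more answer types

question = "Who will succeed Mr James as chief executive officer?"
retrieved_passage = "He will be succeeded by Mr Dooner, 45, the company said yesterday."

answer_type = expected_answer_type(question)
candidates = [ent.text for ent in nlp(retrieved_passage).ents if ent.label_ == answer_type]
print(answer_type, candidates)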

The nature of the question answering task, which includes the interpretation of natural language questions and identification of specific concepts and structures within documents, makes the standard document retrieval approach inadequate: TA techniques are required to provide the extra linguistic analysis.

3. Evaluation

The most common approach to evaluation within text analytics is to compare machine output with human judgement. For example, part of speech taggers are normally evaluated by manually annotating a test corpus and then applying the tagger to the same data. Performance is then calculated by comparing the two sets of annotations. However, this process is often complicated by two factors: first, difficulties in obtaining consensus on the ‘correct’ manual annotation, and second, issues in determining exactly how annotations should be compared and scored.

The first challenge is caused by the fact that people often disagree on the correct annotation for a particular fragment of text – for example, they may interpret the same sentence differently. Various steps are taken to maximise consensus between the annotators, but agreement is rarely unanimous and as a result the maximal level of human agreement represents something of a ceiling on the performance of any automated process measured using this technique.

The second challenge is due to the complexity of the analysis that is generated, since it can be difficult to determine how partial matches should be scored. For example, consider the sentence ‘Bill Gates is CEO of Microsoft’. If an NER system suggests ‘Gates’ is the name of a person (rather than ‘Bill Gates’), what credit, if any, should this partially correct answer be assigned? Proposing ‘Gates’ as the name of a person, while not ideal, is certainly better than suggesting ‘Microsoft’ as a person.

Once the systems have been scored, performance is typically measured in terms of precision and recall, where precision is defined as the number of entities correctly tagged divided by the total number of entities tagged, and recall is defined as the number of entities correctly tagged divided by the total number of entities which should have been tagged.

Attempts to maximize one measure will typically compromise the other (all other factors being equal), so in practice a suitable trade-off must be found. For this reason, these measures are rarely considered in isolation, and are instead usually combined into a single value, e.g. the F-measure (which is defined as the weighted harmonic mean of precision and recall).
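
For completeness, the calculation can be expressed in a few lines (the counts in the example below are made up purely for illustration):

# Precision, recall and the balanced F-measure (F1) from simple counts.
def precision_recall_f1(correct: int, proposed: int, gold: int):
    precision = correct / proposed if proposed else 0.0   # fraction of proposed tags that are right
    recall = correct / gold if gold else 0.0               # fraction of gold-standard tags found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. a NER system proposes 8 entities, 6 of them correctly, against 10 in the gold standard:
print(precision_recall_f1(correct=6, proposed=8, gold=10))   # (0.75, 0.6, 0.666...)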

There are significant differences between the typical performances of TA techniques. Part of speech tagging and named entity recognition can typically be performed with >90% accuracy on business news, while accuracy for information extraction from biomedical documents is typically below 70%. TA applications often combine several component technologies, for example part of speech tagging and named entity recognition. There are often dependencies between these components; for example, the named entity recogniser may rely on output from the part of speech tagger, so inaccurate output from one component can propagate through to the processes that rely on it.

However, it should be noted that measures such as precision and recall are fairly blunt instruments, and take no account of the user experience as a determinant of performance or success. For example, Elworthy et al. (2001) demonstrate the application of TA technology to the retrieval of captioned images, and in so doing also provide a novel method of results presentation and unique user experience that could not have been achieved without the use of text analytics. The value of such a capability cannot be expressed by measuring retrieval accuracy alone.

4. Conclusions

In this post we have outlined some of the fundamental techniques in text analytics and explored the ways in which they can be applied to search and information retrieval. In an upcoming post, we’ll complement this with a look at text analytics applied to social media, in particular sentiment analysis and opinion mining.

5. References

  1. Brill, E. (1993) Transformation-Based Error-Driven Parsing. Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.
  2. Church, K. (1988) A Stochastic Parts Program and Noun Phrase Tagger for Unrestricted Text. Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, TX, pp 136–143.
  3. Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P. (1992) A Practical Part-of-Speech Tagger. Proceedings of the 3rd Conference on Applied Natural Language Processing.
  4. Elworthy, D., Rose, T. G., Clare, A. and Kotcheff, A. (2001) A Natural Language System for the Retrieval of Captioned Images. Journal of Natural Language Engineering, Cambridge University Press.
  5. Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961) BASEBALL: An Automatic Question Answerer. Proceedings of the Western Joint Computer Conference, 19, pp 219–224.
  6. Greenwood, M., Roberts, I. and Gaizauskas, R. (2002) The University of Sheffield TREC 2002 Q&A System. Proceedings of the 11th Text Retrieval Conference.
  7. Grishman, R. and Sundheim, B. (1996) Message Understanding Conference – 6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, pp 466–470.
  8. Harabagiu, S., Moldovan, D., Pasca, M., Surdeanu, M., Bunescu, R., Girju, R., Rus, V. and Morarescu, P. (2000) FALCON: Boosting Knowledge for Answer Engines. Proceedings of the 9th Text Retrieval Conference.
  9. Hirst, G. (1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press, Cambridge, England.
  10. Hovy, E., Gerber, L., Hermjakob, U., Junk, M. and Lin, C. (2000) Question Answering in Webclopedia. Proceedings of the 9th Text Retrieval Conference.
  11. Ingersoll, G. S. and Morton, T. S. (2009) Taming Text. Manning Publications.
  12. Jing, H. and Tzoukermann, E. (1999) Information Retrieval Based on Context Distance and Morphology. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99).
  13. Karlsson, F. (1990) Constraint Grammar as a Framework for Parsing English Running Text. Proceedings of the 13th International Conference on Computational Linguistics.
  14. Krovetz, R. (1997) Homonymy and Polysemy in Information Retrieval. Proceedings of the 35th Meeting of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the Association for Computational Linguistics (ACL/EACL-97).
  15. Krovetz, R. and Croft, B. (1992) Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems 10(2), pp 115–141.
  16. Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), pp 313–330.
  17. Ravichandran, D. and Hovy, E. (2002) Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 41–47.
  18. Sanderson, M. (1994) Word Sense Disambiguation and Information Retrieval. Proceedings of the 17th ACM SIGIR Conference.
  19. Schutze, H. and Pedersen, J. (1995) Information Retrieval Based on Word Senses. Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, pp 161–175.
  20. Scott, S. and Gaizauskas, R. (2000) University of Sheffield TREC-9 Q&A System. Proceedings of the 9th Text Retrieval Conference.
  21. Soubbotin, M. and Soubbotin, S. (2001) Patterns of Potential Answer Expressions as Clues to the Right Answers. Proceedings of the 10th Text Retrieval Conference.
  22. Voutilainen, A. (2003) Part of Speech Tagging. In Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, pp 219–232. Oxford University Press.
  23. Wilks, Y. (1975) A Preferential, Pattern-Seeking, Semantics for Natural Language Inference. Artificial Intelligence 6.
  24. Wilks, Y. and Stevenson, M. (1998) The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation. Natural Language Engineering 4(3).
  25. Woods, W. (1973) Progress in Natural Language Understanding – An Application to Lunar Geology. AFIPS Conference Proceedings, vol. 42, pp 441–450.

6. Glossary

Anaphora resolution: the process by which references between separate expressions are identified and resolved.

Artificial intelligence: the intelligence of machines and the branch of computer science which aims to create it.

Categorization: the process by which ideas and objects are recognized, differentiated and understood.

Classification: the process by which an electronic document is assigned to one or more predetermined categories, based on its contents. This contrasts with document categorization, in which the categories must typically be discovered as part of the process.

Clustering: the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

Derivational morphology: the process by which word meaning is changed through the application of derivations, e.g. the suffix “ee” changes the verb employ into the noun employee.

F-measure: a measure of a test’s accuracy. It considers both precision and recall.

Homonym: one of a group of words that share the same spelling and the same pronunciation but have different meanings.

Hypernym: a word or phrase whose semantic range subsumes that of a set of other words.

Inflection: the process by which language expresses grammatical relations and relational categories such as tense, mood, voice, aspect, person, number, gender and case.

Information Extraction: the process by which structured information is extracted from unstructured textual data.

Lexical analysis: the process of converting a sequence of characters into a sequence of tokens.

Morphology: the study of the structure and content of word forms.

Named entity recognition: a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc.

Natural language processing: a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.

Parsing: see syntactic analysis

Part of speech: a linguistic category of words which is generally defined by the syntactic or morphological behaviour of the word in question

Part of Speech Tagging: the process of marking up the words in a text as corresponding to a particular part of speech

Phrase detection: the recognition of groups of words functioning as a single unit in the syntax of a sentence.

Polysemy: the capacity for a word or phrase to have multiple meanings.

Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

Question answering: the task of automatically answering a question posed in natural language.

Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents.

Relation extraction: the process by which relations between named entities within a single sentence are identified.

Relevance: the extent to which a topic of a search result matches the topic of the query or information need.

Sentence boundary detection: the problem in natural language processing of determining where sentences begin and end.

Sentiment analysis: the process by which the attitude of a speaker or a writer with respect to a given topic is identified. It is usually classified as positive, negative or neutral.

Stemming: the process by which inflected (or derived) words are reduced to their root form.

Summarisation: the ability to create an abstract or topical summary from a free text document.

Syntactic analysis: the process of analyzing a text to determine its structure with respect to a grammar

Text analytics: a set of linguistic, analytical and predictive techniques by which structure and meaning can be extracted from textual data. Similar to text mining, but with a broader focus on interactive applications and a greater emphasis on user experience.

Text mining: the process by which meaning, structure and patterns can be derived from textual data.

Tokenisation: see lexical analysis

WordNet: a lexical database for the English language. It groups English words into sets of synonyms, provides short, general definitions, and records the various semantic relations between these synonym sets.

Word sense disambiguation: the process of identifying which sense of a word is used in any given sentence, when the word has a number of distinct senses.

Related Posts:

  1. The role of Natural Language Processing in Information Retrieval
  2. IR book is out!
  3. Applying text analytics to product innovation and legal cases
  4. Text Analytics: Yesterday, Today and Tomorrow
  5. GATE Summer School: highlights & reflections

Read Full Post »

Searching for Meaning in Text

Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name, which I co-authored with Mark Stevenson of Sheffield University and which appears in the book “Information Retrieval: Searching in the 21st Century“. The session was scheduled for 3 hours, and to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of Powerpoint – the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the Slideshare version, but I am happy to make them available to folks who want to enjoy the full experience.

Related Posts:

  1. IR book is out!
  2. Applying text analytics to product innovation and legal cases
  3. Enterprise Search 8, Text Analytics 2 (after extra time)
  4. Text Analytics: Yesterday, Today and Tomorrow
  5. GATE Summer School: highlights & reflections

Read Full Post »

A few folks have pointed out that my recent post on “Changing terms for changing times Usability, HCI, UCD & more” (which examined how the usage of these terms has changed across several decades of publications in Google Books) omitted a consideration of the “demand side” and the frequency of these terms in the Google index. With that in mind, I just updated the original article. If you just want to see the additional material alone, then read on.

Now let’s try a different analysis. If Google Books represents the “supply” side, i.e. content creation, what is the corresponding picture from the “demand” side, i.e. the content consumption? For this we can use Google Trends, which allows us to see how often certain topics have been searched for on Google over time (and also how frequently these topics have appeared in Google News stories). Applying the same query to Google Trends produces the following result, with the y-axis showing the average worldwide traffic for all five terms normalised by the dominant term (usability), and the x-axis showing the time period for which data is available. Note that Google Trends flattens its queries to lower case, thus eliminating one of the noise sources mentioned above. But it clearly cannot differentiate between the different senses for each of the acronyms, and for that reason in this analysis I chose to express the queries in their full British English form:

Clearly, usability remains the dominant term throughout, followed by information architecture and human computer interaction. I don’t think we can conclude much more than that without being able to magnify the scale and properly address the noise issues outlined above.

However, there are two other interesting patterns in these results. First, notice that all queries appear to be declining monotonically throughout the time period. Why is this? I think the answer may lie in the interpretation of the y axis: note that this value is actually the proportion of queries, not their absolute value. So this may reflect a gradual sophistication of user search behaviour on the web, in that queries are becoming ever more diverse as new terms are introduced and average query length increases. This is no doubt facilitated by novel UI features such as Google Instant, which facilitate the expression of longer queries. In other words, although usage of each of the queries above may be growing in absolute terms, they are nonetheless decreasing as a proportion of overall web traffic. Unless Google Trends discloses absolute values for search terms, we can never know for sure.

But the more interesting pattern is the spikes in the distribution of the term usability that seem to appear midway through the final quarter of each year. What is causing these? My own hypothesis was that this could be related to World Usability Day, which since 2005 has occurred annually on the second Thursday in November. Indeed, a closer examination of the trends for each individual year does indicate a strong correlation, which appears to be borne out by the topics covered in the news stories associated with each spike. If so, it is reassuring to know that initiatives such as this do indeed appear to be having a global effect(!)

As a final datum, we should of course consider how frequently the various terms appear in the Google index itself (i.e. across the Internet). Would the patterns from Google Books be reflected across the web in general? Querying Google with each term in turn returns the following results (in descending order):

  • “usability”                                   About 21,100,000 results
  • “human computer interaction”      About 2,180,000 results
  • “information architecture”             About 1,850,000 results
  • “user centred design”                      About 131,000 results
  • “human centred design”                    About 61,800 results

So what can we conclude from the above? My own feeling is that it was a worthwhile exercise considering these additional terms, for the sake of completeness if nothing else. And the observations about sophistication of user search behaviour and the spikes due to promotional events such as World Usability Day are also interesting.

For further conclusions and the complete analysis, see the original article.

Related Posts:

  1. Changing terms for changing times Usability, HCI, UCD & more
  2. Ergonomics, Human Factors, User Experience: changing terms for changing times
  3. Tutorial on Designing the Search Experience
  4. The Dimensions of Search User Experience
  5. User Interface Design Patterns for Search & Information Discovery

Read Full Post »

A few days ago I posted an article called “Ergonomics, Human Factors, User Experience: changing terms for changing times”, which examined how the usage of these terms has changed across several decades of publications in Google Books. At the time of writing, my concern was focused on a specific issue (namely the decision of the Ergonomics Society to rename itself the Institute of Ergonomics and Human Factors) and with that in mind, along with the interests of brevity, I limited the analysis to just those three terms.

However, it has been pointed out since (quite rightly) that there are many other terms we could consider. For the sake of completeness then, what follows is a brief examination of some of the alternatives.

Of course, deciding which terms to include is in itself a subjective exercise – whichever set I choose, they will to a degree reflect my own world view and understanding of the field. But we have to start somewhere, and the following are perhaps some of the more obvious candidates:

  • Usability
  • Human Computer Interaction
  • User centred design
  • Human centred design
  • Information architecture

Note that these terms are somewhat more problematic than the original three, due to their lexical variability and the lack of an obvious canonical form. For example, should we consider:

  1. human computer interaction (in full) vs. HCI (abbreviated)
  2. user centred design (British English) vs.  user centered design (American English)
  3. user-centred design (hyphenated) vs. user centred design (non-hyphenated)
  4. User Centred Design (title case) vs. user centred design (lower case)

And so on. All these issues conspire to make the patterns more prone to noise and the results less reliable. But with that caveat in mind, let’s look at each in turn, using the original three as our datum. First, let’s look at usability and HCI. Note that I have chosen to use HCI rather than human computer interaction, as it sidesteps the hyphenation issue (and, arguably, the capitalisation issue):

We can see that HCI follows a pattern similar to that of Human Factors, with continuous growth until the early 1990’s followed by a gradual decline over the last two decades or so. By contrast, usability displays a sharp rise from around 1990, becoming the dominant term from the mid 1990s onwards. However, the decline in usage over the last half decade or so is also interesting (the peak of 2003 becomes even more apparent if you turn off the smoothing). This is somewhat puzzling… my own subjective perception, for what it’s worth, is that the term usability has developed slightly dated, “ease of use” connotations, which I personally don’t identify with quite as much as I used to. I wonder if the pattern over the last half decade reflects a wider appreciation of this point of view.

Now let’s turn to user centred design and human centred design. I personally don’t tend to use the latter term in everyday language, but recognise that it is preferred by the relevant ISO standards and – surprisingly to me – is also the more common of the two in the Google index. However, this analysis is the most problematic of them all, as it is subject to all the noise issues described above. So, again sidestepping at least one of those issues, we see the following for UCD and HCD:

But I am very sceptical of the value of this analysis. Without some way of conflating the various lexical forms and ignoring capitalisation & so on, we are not going to see a meaningful picture. Moreover, a cursory analysis of the results returned for these terms on Google Books shows several different meanings or senses for UCD, and many more for HCD. To make this analysis meaningful we need some way to perform accurate word sense disambiguation and/or cluster the results accordingly.

I am also somewhat sceptical about the value of including information architecture in this analysis. For sure, it is a term currently used within the digital community to describe the application of the principles of user centred design to the development of information-rich websites and applications. But the term was in use long before the web was invented (notably by the software industry), and I think the pattern we see in the following chart could well reflect its provenance in other disciplines:

Now let’s try a different analysis. If Google Books represents the “supply” side, i.e. content creation, what is the corresponding picture from the “demand” side, i.e. the content consumption? For this we can use Google Trends, which allows us to see how often certain topics have been searched for on Google over time (and also how frequently these topics have appeared in Google News stories). Applying the same query to Google Trends produces the following result, with the y-axis showing the average worldwide traffic for all five terms normalised by the dominant term (usability), and the x-axis showing the time period for which data is available. Note that Google Trends flattens its queries to lower case, thus eliminating one of the noise sources mentioned above. But it clearly cannot differentiate between the different senses for each of the acronyms, and for that reason in this analysis I chose to express the queries in their full British English form:

Clearly, usability remains the dominant term throughout, followed by information architecture and human computer interaction. I don’t think we can conclude much more than that without being able to magnify the scale and properly address the noise issues outlined above.

However, there are two other interesting patterns in these results. First, notice that all queries appear to be declining monotonically throughout the time period. Why is this? I think the answer may lie in the interpretation of the y axis: note that this value is actually the proportion of queries, not their absolute value. So this may reflect a gradual sophistication of user search behaviour on the web, in that queries are becoming ever more diverse as new terms are introduced and average query length increases. This is no doubt facilitated by novel UI features such as Google Instant, which facilitate the expression of longer queries. In other words, although usage of each of the queries above may be growing in absolute terms, they are nonetheless decreasing as a proportion of overall web traffic. Unless Google Trends discloses absolute values for search terms, we can never know for sure.

But the more interesting pattern is the spikes in the distribution of the term usability that seem to appear midway through the final quarter of each year. What is causing these? My own hypothesis was that this could be related to World Usability Day, which since 2005 has occurred annually on the second Thursday in November. Indeed, a closer examination of the trends for each individual year does indicate a strong correlation, which appears to be borne out by the topics covered in the news stories associated with each spike. If so, it is reassuring to know that initiatives such as this do indeed appear to be having a global effect(!)

As a final datum, we should of course consider how frequently the various terms appear in the Google index itself (i.e. across the Internet). Would the patterns from Google Books be reflected across the web in general? Querying Google with each term in turn returns the following results (in descending order):

  • “usability”                                   About 21,100,000 results
  • “human computer interaction”      About 2,180,000 results
  • “information architecture”             About 1,850,000 results
  • “user centred design”                      About 131,000 results
  • “human centred design”                    About 61,800 results

So what can we conclude from the above? My own feeling is that it was a worthwhile exercise considering these additional terms, for the sake of completeness if nothing else. And the observations about sophistication of user search behaviour and the spikes due to promotional events such as World Usability Day are also interesting. In addition, this analysis has revealed two other, unanticipated insights, in the form of suggested extensions to the Ngram Viewer:

  • The ability to control case sensitivity and conflate common lexical variations, e.g. due to hyphenation and so on
  • The ability to cluster results around the different senses for each term

Of course, word sense disambiguation is an issue that has challenged some of the best minds in NLP for several decades, so I’m not expecting overnight results here. But robust methods for document clustering are eminently available, and I can think of all manner of ways such techniques could be applied to improve the Ngram Viewer experience and in so doing provide some intriguing visualisation possibilities.

I also have some initial thoughts about how to develop further insight into the context in which terms are used, which may allow us to dig a little deeper below the surface. I hope to post something about this in the next few days. In the meantime, if you have any suggestions or extensions you’d like to see, by all means let me know either here or by email.

Related Posts:

  1. Ergonomics, Human Factors, User Experience: changing terms for changing times
  2. The Ergonomics Society and the Digital Blind Spot
  3. User Research Plans for the Ergonomics Society Website
  4. The Dimensions of Search User Experience
  5. User Interface Design Patterns for Search & Information Discovery

Read Full Post »

Just over a year ago the Ergonomics Society changed its name to the Institute of Ergonomics and Human Factors, in recognition (among other things) of the discrepancy between the growing range of interests within the organisation and the somewhat narrower perception of the term “ergonomics” to the world outside. There are many people who feel passionately about such terminology, and will argue quite strongly that it is perceptions that we should be changing, rather than our own identity. On the other hand, there are those who think that it is quite right that organisations should evolve to reflect the changing world, and that names and labels should not be thought of as sacrosanct.

But the purpose of this post is not to enter that debate. Instead, it is to test some of the assumptions about the usage of these terms, and see if they really hold true. This task is made somewhat easier by the recent launch of the Google Books Ngram Viewer, which gives us a unique insight into the usage of terminology across a range of publications, time periods and corpora (i.e. geographically-based collections). Google Books claims to have scanned over 10% of all books ever published, providing data on the occurrence of phrases up to five words in length from 1400 through to the present day.

For example, I had assumed (wrongly, as it turns out) that the term ergonomics was more prevalent this side of the Atlantic, and conversely that human factors was the dominant term on the other. This is a crucial point, as it underpinned the notion (in my mind, at least) that by adopting the term human factors we would better articulate our identity and mission to a global audience. (Indeed, by changing the name that outcome may still hold true, but evidently not for the reasons I’d assumed.)

So let’s look at the data. First, let’s see how often the terms “ergonomics”, “human factors” and “user experience” appear in all the English books indexed by Google, from 1950 to date. I have included the third term partly because it reflects my own personal interests (being by far the dominant term in the development of digital interactive products, e.g. web, software, mobile, and so on), and partly because of its growing acceptance in wider UCD circles (such as official standards documentation).
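(As an aside, for anyone who would rather pull the same numbers programmatically than eyeball the chart: the Viewer appears to expose an undocumented JSON endpoint that mirrors the chart data. The sketch below assumes that endpoint, its parameter names and its response fields, all of which are unofficial and liable to change without notice.)

```python
# Rough sketch only: fetch the Ngram Viewer's (undocumented) JSON data for
# the three terms. Endpoint, parameter names and the corpus identifier are
# assumptions; copy the current values from the Viewer's own URL if in doubt.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "content": "ergonomics,human factors,user experience",
    "year_start": 1950,
    "year_end": 2008,
    "corpus": "en-2009",   # assumed corpus label
    "smoothing": 3,
})
url = "https://books.google.com/ngrams/json?" + params
with urllib.request.urlopen(url) as resp:
    series = json.loads(resp.read().decode("utf-8"))

for entry in series:
    # Each entry should carry the phrase and its yearly relative frequencies
    # (field names are an assumption based on the chart the site renders).
    print(entry["ngram"], entry["timeseries"][:5])
```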

As I expected, user experience took off in the late 1990s and continues on an upward trend. But what I didn’t anticipate is that ergonomics continues on an upward trend too, whilst usage of the term human factors actually appears to be declining from about 1990 onwards. This was certainly unexpected. So let’s check whether a geographic effect is at play: is this true of books published on both sides of the Atlantic? If we look at just those published in GB, we see that user experience continues to rise, as expected, but both ergonomics and human factors start to fall from around 2000 onwards (with ergonomics entering a particularly steep decline):

Conversely, if we look at just books published in the US, all three show a strong upward trend, and contrary to my own expectations, ergonomics is actually the dominant term over the last decade and a half:

The contrast between the US & GB trends is puzzling – why are ergonomics and human factors in such steep decline in GB when they are rising in the US? There must be further factors at play. Unfortunately, Google Books does not allow us to formulate queries such as “show me occurrences of these terms in all corpora except this one” (i.e. to see what the global trend is outside of either US or GB). Note also that the y-axis shows only percentages, so it is not possible to establish a baseline by which we can compare the absolute figures for different corpora.

So let’s try a different analysis. If Google Books represents the “supply” side, i.e. content creation, what is the corresponding picture on the “demand” side, i.e. content consumption? For this we can use Google Trends, which allows us to see how often certain topics have been searched for on Google over time (and also how frequently these topics have appeared in Google News stories). Applying the same query to Google Trends produces the following result, with the y-axis showing the average worldwide traffic for all three terms normalised by the dominant term (ergonomics), and the x-axis showing the time period for which data is available:

Evidently, ergonomics is still the dominant term, but both this and human factors are in decline throughout the time period (note also the seasonal spikes at the end of each calendar year). User experience appears stable and dominates the Google News content, but it is hard to conclude more without being able to magnify the scale or probe further.
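For anyone who wants to reproduce that normalisation step offline, here is a minimal sketch using pandas. The file name and column names are hypothetical stand-ins for however you export or reshape the Trends data; the point is simply that each series is scaled relative to the mean volume of the dominant term.

```python
# Minimal sketch: scale each term's search volume relative to the mean volume
# of the dominant term. File name and column names below are hypothetical.
import pandas as pd

trends = pd.read_csv("trends_export.csv")   # columns: week, ergonomics, human_factors, user_experience
baseline = trends["ergonomics"].mean()      # the dominant term sets the scale

terms = ["ergonomics", "human_factors", "user_experience"]
normalised = trends[terms].div(baseline)

print(normalised.mean())                    # average normalised traffic per term
```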

As a final datum, we should of course consider how frequently the various terms appear in the Google index itself (i.e. across the Internet). Would the patterns from Google Books be reflected across the web in general? Querying Google with each term in turn returns the following results:

  • Ergonomics: 6,110,000 results
  • “Human factors”: 2,350,000 results
  • “User experience”: 14,900,000 results

So, what can we conclude from all this? My personal feeling is that these investigations present many more questions than answers at this stage, so any conclusions we can draw thus far should be tentative at best. That said, the data would appear to suggest the following:

  1. The assumption that human factors is the dominant term in the US no longer appears to be true (at least, not for the last decade and a half)
  2. Usage of human factors appears to be declining globally, while both ergonomics and user experience continue to rise
  3. Both ergonomics and human factors have risen in the US over the last decade, but fallen in GB
  4. From a “demand” perspective, both ergonomics and human factors appear to have steadily declined over the time period for which data is available
  5. User experience is by far the dominant term across the Internet in general

Over the next few days I’ll be thinking about further ways to explore or validate the above observations. In the meantime, if you want to suggest any extensions or observations of your own, just drop me a line either here or by email.

Related Posts:

  1. The Ergonomics Society and the Digital Blind Spot
  2. User Scenarios for the Ergonomics Society Website
  3. User Research Plans for the Ergonomics Society Website
  4. The Dimensions of Search User Experience
  5. User Interface Design Patterns for Search & Information Discovery

Read Full Post »

Here’s a New Year’s resolution: to kick-start the London Text Analytics group and get the programme of MeetUps going again. The Enterprise Search London group certainly sets the example: 250+ members and 8 meetings over the course of 2010. By contrast, Text Analytics is flatlining at around 40 members, with just a couple of Meetups thus far. Of course, this disparity may simply be nature’s way of telling me as organiser to be more proactive… but it does make me wonder: what is a realistic size for a group of this nature anyway?

I’ve worked in NLP and IR for almost two decades, in both research and commercial settings, and have seen many changes in that period. I’ve seen many ideas and approaches transfer from one discipline to the other, and consider the two communities to be closely related. But beneath the superficial associations there are some significant differences that fundamentally influence the dynamics of the two communities.

Search, or Information Retrieval in its broadest sense, represents an activity that has clear touchpoints with the lives of many diverse professions across a variety of organisational contexts. If search engines didn’t exist, then human resources would still be deployed to perform the functions they provide (albeit somewhat less efficiently). In this respect, search is embedded in the fabric of many organisations, creating a common vocabulary and set of professional practices for the individuals within them.

Text analytics, by contrast, is quite different. The primary challenges within the NLP community have traditionally been tasks such as word sense disambiguation, named entity recognition, anaphora resolution, and so on. How many of these have tangible counterparts outside of the research lab? How often are human resources routinely deployed to undertake such tasks as part of normal business operations? With the possible exceptions of speech recognition (e.g. captioning services for broadcast media) and machine translation (e.g. of EU parliamentary proceedings), the answer would seem to be very few. NLP just does not touch on the lives of diverse professionals in the same way as IR. Even a cursory glance at the frequencies with which these terms appear in the Google Books Ngram Viewer would seem to indicate a differential of around 5-6x in favour of IR over the last decade:

"Natural Language Processing" vs. Information Retrieval"

So maybe the difference in the sizes between the two groups is a reflection of the natural order of things. That said, I am convinced that there are many more individuals in London & the South East working in NLP / text analytics who would benefit from a closer association with their peers. With that in mind, I’ll be twisting a few arms in the near future to get the best we can out of the MeetUp group. Watch this space for updates 🙂

Related Posts:

  1. Text Analytics: Yesterday, Today and Tomorrow
  2. London Text Analytics: Inaugural Meeting
  3. GATE Summer School: highlights & reflections
  4. Search at the Guardian Newspaper
  5. Enterprise Search Meetup: Searching for Leisure, Travelmatch and Real-time search

Read Full Post »

Following the panel session on “What will Search look like in 2015” at the recent Search Solutions conference, fellow IRSG members Udo Kruschwitz, Andy MacFarlane and I were invited to draft an article on the subject for publication in the BCS’s ITNow magazine. It’s due to come out in January, but if you can’t wait till then, I’ve included a version of it below.

When we talk about search, we talk about Google. Google has transformed our lives and even our everyday language (if we need to look up information on the latest BCS events, we ‘Google’ it). Yet only a decade ago few people had come across the term. The question is: where are we heading? What will search look like in 5, 10 or 50 years’ time?

One way of trying to predict what the future holds is to solicit the views of leading figures and companies in the search business. The BCS Information Retrieval Specialist Group (IRSG) has been organising such events for a number of years, and Search Solutions 2010, held in October at BCS London, is the latest such example. What follows is a summary of that discussion, punctuated by our own observations and reflections.

One of the dominant themes at Search Solutions was the growing importance of “freshness” as a concept in search. In the past, most measures of search effectiveness have been primarily concerned with relevance, but a document that is considered relevant for a given query at this precise moment may not be so tomorrow or even 30 minutes hence. But freshness is not simply about identifying the most recent documents, as these are not necessarily the most authoritative results. In addition, not all queries require the same level of freshness.
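None of the vendors publish the formulas they use, but one common way to think about the trade-off is to decay a document’s relevance score by its age, with the decay rate tuned to the query type. The following is a minimal sketch of that idea, not any particular engine’s implementation:

```python
import math

def freshness_adjusted_score(relevance, age_days, half_life_days=30.0):
    """Decay a relevance score by document age.

    half_life_days controls how quickly freshness matters: a breaking-news
    query might use a half-life of hours, an encyclopaedic one of years.
    """
    decay = math.log(2) / half_life_days
    return relevance * math.exp(-decay * age_days)

# A fresh but slightly less relevant document can outrank a stale one:
print(freshness_adjusted_score(0.80, age_days=2))    # ~0.76
print(freshness_adjusted_score(0.95, age_days=120))  # ~0.06
```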

Another issue that is becoming increasingly prominent is the use of context in search. Few people are aware that major search engines already make use of a variety of contextual information when returning search results, such as the user’s previous searches as well as the physical location from which the query is submitted. This is all closely related to another area where significant progress can be expected over the next few years: personalisation. Personalisation does not mean that users will be required to explicitly declare their interests (this is exactly what most users do not want to do!); instead, the search engine tries to infer users’ interests from implicit cues, e.g. time spent viewing a document, or the fact that a document has been selected in preference to another ranked higher in the results list, and so on. Personalised search results can be tailored to individual searchers and also to groups of similar users (“social networks”).
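As a concrete illustration of what an “implicit cue” looks like, here is a minimal sketch of the classic skip-above heuristic: a clicked result is treated as preferred over any unclicked result ranked above it. This is a textbook simplification for illustration, not any particular engine’s algorithm.

```python
def skip_above_preferences(ranked_results, clicked):
    """Infer preference pairs from a single results page.

    ranked_results: document ids in the order they were shown.
    clicked: set of document ids the user clicked.
    Returns (preferred, over) pairs: a clicked doc beats any unclicked doc
    that was shown above it.
    """
    prefs = []
    for i, doc in enumerate(ranked_results):
        if doc in clicked:
            for skipped in ranked_results[:i]:
                if skipped not in clicked:
                    prefs.append((doc, skipped))
    return prefs

# The user skipped the top result and clicked the second and fourth:
print(skip_above_preferences(["d1", "d2", "d3", "d4"], clicked={"d2", "d4"}))
# [('d2', 'd1'), ('d4', 'd1'), ('d4', 'd3')]
```

Aggregated over many users and sessions, pairs like these become training data for re-ranking, without anyone ever having to declare their interests explicitly.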

One of the most promising technologies that could radically change search as we know it is linguistic analysis and natural language processing (NLP). It has been argued for many years that NLP will eventually find its way from the research lab into the mainstream, and we are now seeing signs of it scaling to address web-scale search in the form of information extraction and question-answering services. The growth of social media and user-generated content (in the form of blogs, wikis, product reviews, etc.) is a further driver for the use of NLP.
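For a feel of what information extraction means in practice, here is a minimal named-entity recognition sketch using NLTK, one of several open-source toolkits that could equally serve as the example. A few one-off model downloads are needed, as noted in the comments; the example sentence is my own.

```python
# Minimal NER sketch with NLTK. One-off nltk.download() calls are needed for
# the tokeniser, POS tagger and NE chunker models (exact resource names vary
# between NLTK releases, e.g. "punkt" vs. "punkt_tab").
import nltk

sentence = ("Tim Berners-Lee founded the World Wide Web Consortium "
            "at MIT in Cambridge, Massachusetts in 1994.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)   # a parse tree whose subtrees mark entities

# Print each entity mention with the type the chunker assigned to it.
for subtree in tree.subtrees(lambda t: t.label() in {"PERSON", "ORGANIZATION", "GPE"}):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```

Once mentions like these are extracted at scale, they can feed question-answering, faceted navigation and all the other applications discussed above.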

The landscape of the search industry will change as well. At the moment, web search is dominated by three major players: Google, Microsoft (Bing) and Yahoo. Few would doubt that these three will continue to dominate in some form over the coming years, but we shouldn’t underestimate the disruptive effect of start-ups such as Blekko. In web search, the competition is no more than a click away.

By contrast, the world of enterprise search (i.e. the application of search technology to information within an organization) is much more volatile. As with web search, the landscape is currently dominated by three major players (Microsoft (FAST), Autonomy and Endeca), but in this case there are a great many smaller players, each trying to carve out their own identity in a complex market. No doubt by 2015 much consolidation will have taken place, with mergers, acquisitions and new entrants adding to a changing picture.

A major catalyst for this change is a growing awareness of open source alternatives, in the form of search platforms such as Solr. This platform in particular has reached something of a tipping point in maturity and stability, and also has the necessary support ecosystem to be considered a serious alternative to commercial offerings.
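For readers who haven’t tried it, part of Solr’s appeal is that getting results out of a running instance is just an HTTP request away. The sketch below assumes a local Solr server with a populated index whose documents carry a “title” field; the host, port and field name are assumptions to adjust for your own setup.

```python
# Minimal sketch: query a local Solr index over plain HTTP and read the JSON
# response. Assumes Solr on localhost:8983; on multi-core setups the core or
# collection name goes in the path (e.g. /solr/<core>/select).
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "text analytics",
    "rows": 10,
    "wt": "json",        # ask for JSON rather than the default XML response
})
url = "http://localhost:8983/solr/select?" + params
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

print("hits:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    print(doc.get("title"))
```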

A further driver for change is a growing focus on the user experience, and the expectation that search engines should do more than simply deliver ten blue links. For sure, relevance is important, and most of the search vendors already do a decent job of utilising the cues available to present highly relevant results. But there is a growing realisation that the real value in search (particularly in its enterprise setting) lies in embedding it in a wider discovery context, so that in addition to supporting basic lookup tasks (such as known-item search and fact retrieval), the system provides support for much more complex, exploratory search tasks, such as comparison, aggregation, analysis, synthesis, evaluation, and so on. Clearly, for these sorts of activity a much richer kind of interaction or dialogue between system and end user is required.

A much-ignored issue in search is that of accessibility and the needs of disabled users. There has been a little work for blind users, but very little in other directions, such as users with cognitive impairments (such as dyslexia) or physical impairments (such as limited use of limbs).

The Disability Rights Commission conducted a major research project into interaction with various types of websites, including search. The study examined the effect of navigation and search on users with various types of disabilities, including blind, partially sighted and dyslexic people, people with hearing impairments and physically disabled people. Of these groups, blind people encountered the most problems. More recently, work at City University has shown that dyslexia has a clear effect on both directed and undirected search for information; it is estimated that around 10% of the population have some form of dyslexia.

Work on assistive technologies is useful for improving the experience of searching. For blind and partially sighted users, more intelligent ways of picking content from websites are needed; otherwise screen readers can be overwhelmed with information and cannot report search results to users efficiently. For dyslexic users there is a need to understand their searching behaviour, and to build personalised interfaces that adapt to the type of dyslexia and learn from interaction with the user interface. Severely physically disabled people would benefit from better voice recognition technology, allowing them to enter queries by speaking, or via face gestures for those who cannot speak.

A further problem is the digital divide: providing access to information in third world countries, often under oppressive governments. The use of mobile devices can help here, and there is growing evidence that in many poor countries mobile phones are becoming a very useful way of exchanging information where other communication channels are restricted. The challenge for search designers on mobile devices is dealing with small screens, limited network bandwidth and so on. For some parts of the world these are significant barriers.

Finally, when most people talk about search, they typically envisage a web page with a search box and a results list. But search is increasingly becoming a ubiquitous part of our daily lives, helping us make sense of the world around us. Search is the means by which we are able to cope with our overflowing email inboxes, to generate insights from masses of corporate data, and to discover new restaurants in an unfamiliar city armed only with a smartphone and an Internet connection. Search will be everywhere, but invisible, contextualised, and personalised.

Related Posts:

  1. Zipf’s Law: how can something so simple explain so much
  2. Search Solutions 2010: Highlights & Reflections
  3. Search at the Guardian Newspaper
  4. The Dimensions of Search User Experience
  5. User Interface Design Patterns for Search & Information Discovery

Read Full Post »

Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.

Read Full Post »
