
Posts Tagged ‘text mining’

Here’s something I’ve been meaning to share for a while. The content was originally written as notes to complement my tutorial on NLP + IR, but I thought it might be of interest to a wider audience. At some point I’ll update it to include a section on Sentiment Analysis (who doesn’t these days?!) In the meantime I hope you find it useful – feel free to suggest additions or improvements either here or offline.

1. Introduction

The field of text analytics is laden with jargon and acronyms, so before we go any further, let’s clarify a few terms.

Text analytics describes a set of linguistic, analytical and predictive techniques by which structure and meaning can be automatically extracted from unstructured documents. Text analytics (TA) provides the ability to cluster, categorize and classify unstructured data, supporting both interactive exploration and automated document processing. TA processes can thus be used to improve the accuracy of search engines and to transform the search user experience by enhancing results presentation and facilitating interactive navigation and exploration.

The term text mining has a history of at least a decade and is well understood by both practitioner and research communities. The term text analytics has a somewhat shorter history than text mining, but is growing rapidly in popularity as a label for the application of NLP techniques in practical, commercially oriented contexts.

Some authors treat TM and TA as synonyms, but this misses a subtle but important distinction: TM has its origins in automated, offline processing, characterised by ‘data in’ and ‘structure out’. It is primarily concerned with identifying and inferring aggregate patterns and insights across collections of documents. By contrast, TA is a more generic term, often associated with interactive, analytic applications and a focus on the analysis of individual documents.

1.1 The Challenge of Text Analytics

On the surface, the problem that TA attempts to solve appears trivial. After all, as humans we effortlessly create and interpret both spoken and written language, and rarely have to apply conscious thought to the process. But that, in many ways, is part of the problem: if the history of artificial intelligence teaches us anything, it is that so many of the faculties we possess as humans are not open to introspection, and are therefore notoriously difficult to analyse and encode algorithmically. And that apparent effortlessness shouldn’t be taken to suggest the task is anything less than challenging: far from it. Natural language is littered with ambiguity, and it is only because we have evolved highly parallelised linguistic processes that work across multiple levels that we can make any sense of it at all.

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate the ambiguity inherent in natural language, and just how much pragmatic and linguistic knowledge NLP tools must apply in order to interpret it accurately.

1.2 Text Analytics Opportunities

It has been estimated that as much as 80% of all potentially usable business information originates in unstructured form, e.g. as:

  • Email, news and blog articles; forum postings; and other social media;
  • Contact-centre notes and transcripts;
  • Surveys, feedback forms and warranty claims;
  • Various types of corporate document, etc.

Moreover, the rate at which the world’s digital data repositories are growing creates a further challenge: according to International Data Corporation, the world generated 161 billion gigabytes (161 exabytes) of digital information in 2006, a figure expected to grow to approximately 988 exabytes by 2010 (Ingersoll & Morton, 2009). And even if much of this data is in the form of audio, video, images, and so on, our current approaches to making it searchable still revolve around the provision of manual summaries, captions and keyword tags. In other words, we are producing yet more text that has to be analysed and managed.

In addition to the growth of social media, we are seeing progressively more eCommerce sites employing user-generated content in the form of product reviews and opinions to support the purchasing process and drive sales growth. To maximise the value of this content, these organisations also need to employ analytic tools to make it searchable and navigable.

2. Fundamentals

In the section above we alluded to the highly parallelised nature of human language processing, and the notion that language understanding requires the application of linguistic and pragmatic knowledge across a number of levels. Text analytics is no different – all but the most trivial of systems will apply linguistic analysis across a number of levels, often in cascaded fashion, where the output of the simple, lower level processes feeds into more complex, higher level ones. In this section we will review the main techniques involved.

2.1 Basic Processing: Lexical & Syntactic Analysis

At the most basic level is the treatment of character encodings, such as ASCII, Unicode, UTF-8 and so on. Although not traditionally considered an active area for NLP research, these processes nonetheless have a direct influence on the representation of case (upper & lower), punctuation, accents, and so on, which all in turn carry meaning at the lexical level (see below).

At the level above characters we deal with terms and their variants, i.e. words, abbreviations, acronyms, compounds and so on. At this level, known as lexical analysis, we deal with tokenisation, i.e. the recognition of word boundaries. This process is relatively straightforward for languages that use whitespace as a delimiter, such as English, but is harder for languages such as Chinese and Japanese. At the lexical level we also deal with morphology, i.e. the process by which words are formed. Morphological analysis allows us to identify inflectional variants, such as “run” and “ran”, which are the present and past tenses of a common stem. In addition, we can also apply derivational morphology to recognise that “runner” is a further variant of “run” and “ran”, albeit in a different syntactic category (i.e. a noun in this case).

Note that it is possible to write a reasonably accurate automatic stemmer for English, which will conflate sets of related terms to a common root. A popular example of one such tool is the Porter stemmer.
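
By way of illustration, here is a minimal sketch of tokenisation and stemming, assuming Python and the open source NLTK toolkit (an assumption on my part – the notes above don’t depend on any particular toolkit):

    # A minimal sketch: tokenise a sentence and conflate inflectional variants
    # with the Porter stemmer. Assumes NLTK is installed and the 'punkt'
    # tokeniser data has been downloaded via nltk.download('punkt').
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    sentence = "The runners were running while the dog ran alongside."
    tokens = word_tokenize(sentence)          # lexical analysis: word boundaries
    stemmer = PorterStemmer()
    print([(t, stemmer.stem(t)) for t in tokens])
    # e.g. 'running' -> 'run', 'runners' -> 'runner'; note that the irregular
    # form 'ran' is NOT conflated - stemming is cruder than full morphology.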

Finally, we use the outputs of these processes to perform syntactic analysis, i.e. the interpretation of the grammatical structure of a piece of text. A popular illustration of the role of grammar in English is the contrast between the phrases ‘Venetian blind’ and ‘blind Venetian’ – they contain the same words but have very different meanings. The most elementary form of syntactic analysis is the application of a Part-of-Speech (PoS) tagger, which attaches a tag to each word to indicate whether it is being used as a noun, verb, adjective, determiner, and so on. The tagset used by a particular PoS tagger may capture finer-grained grammatical distinctions. For example, the Penn Treebank tagset (Marcus et al. 1993), which is commonly used by PoS taggers, has four separate tags for nouns:

  • NN (singular common noun or mass noun)
  • NNP (singular proper noun)
  • NNS (plural common noun)
  • NNPS (plural proper noun)

Part-of-speech tagging has been researched since the 1950s (Voutilainen 2003) and a number of tagging systems are now readily available. These systems use a number of different strategies, including handwritten rules (Karlsson 1990), hidden Markov models (Cutting et al. 1992) and n-grams learned from text annotated with part-of-speech information (Church 1988).
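
For the curious, here is a minimal sketch of PoS tagging using NLTK’s default tagger, which emits Penn Treebank tags (again an assumption that NLTK and its tagger data are available):

    # A minimal sketch of part-of-speech tagging with NLTK, which emits Penn
    # Treebank tags. Assumes the 'punkt' and 'averaged_perceptron_tagger'
    # resources have been downloaded via nltk.download().
    import nltk

    tokens = nltk.word_tokenize("The blind Venetian bought a Venetian blind.")
    print(nltk.pos_tag(tokens))
    # Roughly: [('The', 'DT'), ('blind', 'JJ'), ('Venetian', 'NNP'), ('bought', 'VBD'),
    #           ('a', 'DT'), ('Venetian', 'JJ'), ('blind', 'NN'), ('.', '.')]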

Operating around the syntactic level are various other analytic processes, such as:

  • Sentence boundary detection, which sounds trivial but is deceptively complex (a simple illustration follows this list). A publicly available sentence splitter is available from CPAN.
  • Phrase detection, i.e. the recognition of noun phrases and other syntactic compounds.
  • Summarisation, i.e. the ability to create an abstract or topical summary from a free-text document. This sounds complex, but becomes much more tractable with semi-structured text such as HTML content. A publicly available summariser for HTML text is available from CPAN.
  • Parsing, i.e. the analysis of a sequence of tokens to determine their syntactic structure with respect to a particular grammar. In practice this often means analysing PoS-tagged text to identify Subject-Verb-Object and other relationships, to produce higher-level linguistic analyses of the text.
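
As a simple illustration of the first two items, here is a sketch using NLTK’s sentence splitter and a crude regular-expression noun phrase chunker (the NP grammar shown is purely illustrative):

    # A minimal sketch of sentence boundary detection plus crude noun phrase
    # detection, by chunking PoS-tagged text with a hand-written pattern.
    import nltk

    text = "Dr. Smith arrived at 9 a.m. She chaired the annual shareholder meeting."
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # illustrative NP pattern
    for sentence in nltk.sent_tokenize(text):                # sentence splitting
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)                         # phrase detection
        for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
            print(" ".join(word for word, tag in np.leaves()))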

2.2 Named Entity Recognition

Named entity recognition (NER) is the process by which key concepts such as the names of people, places and organizations are identified within a document. Some commercially available systems also include the ability to identify further entity types, such as postcodes, telephone numbers and so on. In addition, some systems have been built to identify more esoteric items such as the names of proteins. NER is best illustrated using an example, such as the following extract from a news article:

“Mexico has been trying to stage a recovery since the beginning of this year and it’s always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.

The named entities we might expect to extract from this sentence would be:

Persons:          Matthew Hickman
Organisations:    Lehman Brothers
Cities:           New York
Countries:        Mexico
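
To make this concrete, here is a minimal sketch using the spaCy library and its small English model (my choice of toolkit here is purely illustrative – any comparable NER engine would do):

    # A minimal sketch of named entity recognition using spaCy. Assumes
    # `pip install spacy` and `python -m spacy download en_core_web_sm`.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Matthew Hickman of Lehman Brothers in New York said Mexico "
              "has been trying to stage a recovery since the beginning of this year.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Expected (roughly): Matthew Hickman PERSON, Lehman Brothers ORG,
    #                     New York GPE, Mexico GPE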

The accurate identification of such entities confers a number of benefits. Firstly, by identifying such concepts in a document, it is possible to index documents at a more fine-grained level, which in turn allows more precise searching. The word ‘Bush’, for example, can refer to either a plant or an ex-President of the USA. By differentiating between words such as these (known as homonyms) we can maximize retrieval precision, i.e. the likelihood that information seekers will receive only those documents that are relevant to their query.

Secondly, by identifying names within a document we can provide links to other related resources. Several content providers already use this technique to highlight the names of the key people mentioned in a news story, and provide links to their biographies. Similarly, it is possible to create more precise filters based on those named entities, allowing personalised delivery of content to individual users. For example, a news service might supply details of upcoming events in ‘New York’ as opposed to ‘new’ events in the city of ‘York’.

Thirdly, the output of NER adds valuable structure to a document, and thus facilitates subsequent post-processing transformations, such as machine translation (e.g. translating documents from one natural language to another) or text-to-speech output (e.g. providing spoken versions of stock quotes for access by phone).

2.3 Information Extraction

Information Extraction is the process by which structured information is automatically extracted from unstructured documents. Examples of the types of information which might be extracted include the movements of company executives, victims of terrorist attacks, information about mergers and acquisitions and interactions between genes and proteins in scientific articles. When the relevant information has been identified it is then stored in a highly structured format known as a template.

The Message Understanding Conferences (MUC) (Grishman and Sundheim 1996) were an international evaluation exercise in which information extraction systems were tested against one another using common documents and scoring systems. In the sixth MUC conference, participant systems were tasked with the challenge of identifying facts about the movements of executives between companies. The following paragraph shows a typical example, which describes an event in which ‘John J. Dooner Jr.’ becomes chairman of the company ‘McCann-Erickson’:

“Now, Mr James is preparing to sail into the sunset, and Mr Dooner is poised to rev up the engines to guide Interpublic Group’s McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr Dooner, 45.”

This fact is encoded in the following template structure:

<SUCCESSION EVENT-2> :=
    SUCCESSION ORG:  <ORGANIZATION-1>
    POST:            "chairman"
    IN AND OUT:      <IN AND OUT-4>
    VACANCY REASON:  DEPART WORKFORCE
<IN AND OUT-4> :=
    IO PERSON:       <PERSON-1>
    NEW STATUS:      IN
    ON THE JOB:      NO
    OTHER ORG:       <ORGANIZATION-1>
    REL OTHER ORG:   SAME ORG
<ORGANIZATION-1> :=
    ORG NAME:        "McCann-Erickson"
    ORG ALIAS:       "McCann"
    ORG TYPE:        COMPANY
<PERSON-1> :=
    PER NAME:        "John J. Dooner Jr."
    PER ALIAS:       "John Dooner" "Dooner"

These template structures contain a large amount of information that enables complex queries to be carried out against the extracted data. For example, a user could ask for a list of all the events where a finance officer left a company to take up the position of CEO in another company. Once the templates have been extracted for a set of documents they can be used to populate a knowledge base derived from the corpus, or used as document metadata to improve retrieval precision.

However, information extraction from all but the most simple of documents is an imperfect process. One of the main reasons is that the description of an event is often spread across several sentences or even across separate paragraphs. For example, consider the following pair of sentences, which contains information about management succession events:

“Pace American Group Inc. said it notified two top executives it intends to dismiss because an internal investigation found evidence of ‘self-dealing’ and ‘undisclosed financial relationships’. The executives are Don H. Pace, cofounder, president and chief executive officer; and Greg S. Kaplan, senior vice president and chief financial officer.”

The name of the organisation and the fact that two executives are leaving are contained in the first sentence. The names of the two executives and their posts are listed in the second sentence, but it does not mention the fact that the executives are leaving these posts. Therefore, the succession event can only be fully understood through a combination of the information contained in both sentences. Combining such information across sentences is non-trivial, since it is necessary to identify the phrases which refer to the common entities, i.e. ‘two top executives’ and ‘the executives’ in the above example. The process by which references between instances of one expression and another are identified is known as anaphora resolution.

A further challenge is recognising instances when a given entity may be referred to by different surface forms. For example, ‘International Business Machines Ltd’ may be referred to by an abbreviation (IBM), a nickname (‘Big Blue’) or a contextual anaphoric expression such as ‘it’ or ‘the company’. Challenges such as this are the main reasons it has proved difficult to create reliable information extraction systems. Scores for the template filling task in the MUC evaluations were generally in the 50–60% F-measure range, which at the time was insufficient for most commercial applications.

An intermediate process which is somewhere between named entity extraction and full template filling is relation extraction. This task is simpler than full template filling since: (a) it aims to identify only relations between named entities (which are considered to be binary); and (b) only relations which occur within a single sentence are considered. These simplifications have meant that the task is more achievable than full template filling.
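
To illustrate just how much simpler the sentence-level, binary setting is, here is a deliberately naive sketch that pairs adjacent entities within a sentence using a single surface pattern (again assuming spaCy, as in the NER example above; a real relation extractor would use far richer syntactic features):

    # A deliberately naive sketch of sentence-level, binary relation extraction:
    # if a PERSON entity is linked to an ORG entity by 'of' or 'at' within one
    # sentence, emit an AFFILIATED_WITH relation.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Matthew Hickman of Lehman Brothers said Mexico is recovering.")

    for sent in doc.sents:
        ents = list(sent.ents)
        for left, right in zip(ents, ents[1:]):          # adjacent entity pairs
            connector = doc[left.end:right.start].text.strip().lower()
            if left.label_ == "PERSON" and right.label_ == "ORG" and connector in ("of", "at"):
                print((left.text, "AFFILIATED_WITH", right.text))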

2.4 Word Sense Disambiguation

In this section we review some of the challenges presented by polysemy: the phenomenon whereby a given term may have different meanings or senses. For example, the word ‘bat’ can mean ‘sports equipment’ (as in ‘cricket bat’) or ‘nocturnal mammal’ (as in ‘fruit bat’), etc. Consequently, if we are provided with no additional information, it is impossible to tell whether a query which contains the word ‘bat’ refers to the domain of sports or animals. This phenomenon is common to many natural languages and represents a significant challenge to retrieval precision. The process by which the intended meaning of a given word is identified is known as word sense disambiguation (WSD).

One of the main determinants of word sense is syntactic role, i.e. words can have different meanings according to their use as a noun, verb, adjective etc. For example, two of the possible meanings of ‘light’ are ‘not heavy’ (as in ‘Magnesium is a light metal’) and ‘illumination’ (as in ‘The light in the kitchen is quite dim’). The first sense applies only when ‘light’ is used as an adjective, and the second only when it is used as a noun. Consequently, accurate identification of a word’s part of speech can be a valuable determinant of its meaning.

Wilks and Stevenson (1998) showed that word sense disambiguation could be carried out effectively using a part-of-speech tagger when only broad differences between word meanings were being considered. For example, this technique could be used to differentiate between the verb sense of ‘bank’ (as in ‘the plane banked sharply’) and the noun sense (as in ‘financial institution’), but couldn’t distinguish between more closely related meanings such as the various noun senses used in ‘a car crashed into the bank’ or ‘the bank gave me a good rate on my mortgage’.

There has been a long history of research into WSD in NLP, during which a wide variety of techniques has been investigated. Early approaches such as Wilks (1975) and Hirst (1987) applied techniques from artificial intelligence that relied on hand-crafted knowledge sources containing detailed semantic information. However, these approaches were limited by the fact that the information they relied on had to be manually created. Large-scale machine-readable lexical resources, such as WordNet, started to become available in the 1980s and were quickly adopted for WSD. They have the advantage of providing both a list of possible senses for each word and information which can be used to identify the correct sense, such as the hypernym hierarchy in WordNet.
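
As a small illustration of the dictionary-based approach, here is a sketch using WordNet and the simplified Lesk algorithm as packaged in NLTK (assuming the WordNet data has been downloaded; Lesk is used here purely as a representative knowledge-based method, not as the state of the art):

    # A minimal sketch of knowledge-based WSD: list the WordNet senses of an
    # ambiguous word, then let the simplified Lesk algorithm pick the sense
    # whose dictionary gloss best overlaps the surrounding context.
    from nltk.corpus import wordnet as wn
    from nltk.wsd import lesk

    for synset in wn.synsets("bank", pos=wn.NOUN)[:3]:
        print(synset.name(), "-", synset.definition())

    context = "I deposited my salary at the bank on the high street".split()
    print(lesk(context, "bank", pos="n"))    # hopefully a financial-institution sense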

There has been some disagreement about the usefulness of WSD for IR. Some have argued that the benefits which might be gained from disambiguation are limited. Krovetz and Croft (1992; see also Krovetz 1997) manually disambiguated a standard test corpus and found that a perfect WSD engine would improve retrieval performance by only 2%. Sanderson (1994) performed similar experiments in which ambiguity was artificially introduced to a test collection by automatically creating ‘pseudowords’. He found that a 20–30% error rate in disambiguation led to text retrieval which was at the same level, or possibly worse, than if the ambiguity was left unresolved. Sanderson also found that queries containing fewer than five terms were less sensitive to mistakes in the disambiguation. He concluded that WSD was only useful if it was very accurate or the queries were short.

Others have demonstrated that WSD can be used to improve IR performance. Schutze and Pedersen (1995) showed that disambiguation can substantially improve text retrieval performance; demonstrating an improvement between 7 and 14% on average. Jing and Tzoukermann (1999) have also reported improvements of 8.6% in retrieval performance. Their disambiguation algorithm computes the word sense similarity in the local context of the query, the similarity of the lexical-occurrence information in the corpus, and the morphological relations between words.

2.5 Question Answering

In the standard information retrieval paradigm, the user is provided with a ranked list of documents through which they must search in order to locate the information they are seeking. An alternative approach to meeting the user’s information need in a more focused way is to provide specific answers to specific questions.

Question answering research has a long history (Green et al. 1961; Woods 1973) and was introduced as a task in the Text Retrieval Conference (TREC) IR evaluations in 1999 (Voorhees 1999). Within the TREC framework question answering can be thought of as a specialisation of the standard document retrieval problem in which the queries are expressed as natural language questions and the system is expected to identify the portion of the document in which the answer can be found. Like document retrieval, question answering is often carried out against a well-defined collection of documents which may, or may not, meet the user’s information need.

Questions can be posed in a number of different ways which automatic question answering systems should aim to process. Examples of possible question formats include:

  • Yes/no questions ‘Is George W. Bush the current president of the USA?’ ‘Is the Sea of Tranquillity deep?’
  • ‘Wh-’ questions ‘Who was the British Prime Minister before Margaret Thatcher?’ ‘When was the Battle of Hastings?’
  • List questions ‘Which football teams have won the Champions League this decade?’ ‘Which roads lead to Rome?’
  • Instruction-based questions ‘How do I cook lasagne?’ ‘What is the best way to build a bridge?’
  • Explanation questions ‘Why did World War I start?’ ‘How does a computer process floating point numbers?’
  • Commands ‘Tell me the height of the Eiffel Tower.’ ‘Name all the Kings of England.’

The standard method for tackling question answering is to approach the problem in three separate stages:

  1. question analysis
  2. document retrieval
  3. answer extraction

The aim of the first stage is to predict the type of answer expected (for example, the expected answer for ‘When was Mozart born?’ is a date) and create a query which can then be passed to an IR system. Since the questions are expressed in natural language it is not surprising that text analysis techniques have been found helpful for this stage. For example, Greenwood et al. (2002) parse the question to produce a logical representation from which the expected answer type can be extracted. The problem of generating a query from the question has not exploited TA techniques to the same extent although some approaches, for example synonym expansion (Harabagiu et al. 2000; Hovy et al. 2000), have been explored.
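
As a rough illustration of the question analysis stage, here is a toy rule-based sketch that predicts the expected answer type from the question word and produces a bag-of-words query (the answer-type table and stopword list are invented for illustration; real systems, as noted above, use parsing and much richer answer-type taxonomies):

    # A toy, rule-based sketch of question analysis: predict the expected answer
    # type from the question word and build a simple bag-of-words query.
    import re

    ANSWER_TYPES = {"when": "DATE", "where": "LOCATION", "who": "PERSON",
                    "how many": "NUMBER", "how much": "QUANTITY"}
    STOPWORDS = {"the", "a", "an", "of", "was", "is", "did", "does",
                 "when", "where", "who", "how", "many", "much"}

    def analyse(question):
        q = question.lower().rstrip("?")
        answer_type = next((atype for trigger, atype in ANSWER_TYPES.items()
                            if q.startswith(trigger)), "OTHER")
        query = [t for t in re.findall(r"\w+", q) if t not in STOPWORDS]
        return answer_type, query

    print(analyse("When was Mozart born?"))   # ('DATE', ['mozart', 'born'])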

The second stage of processing passes the query to an IR engine with the aim of retrieving documents which contain the answer to the question. The final stage of processing, answer extraction, aims to extract the answer(s) from these documents, using the expected answer type information determined in the first stage. Various approaches have been applied to this problem and make use of a variety of TA techniques. Straightforward methods include simple regular expressions which match the text (Ravichandran and Hovy 2002; Soubbotin and Soubbotin 2001). More complex methods make use of deep linguistic processing. For example, Harabagiu et al. (2001) and Scott and Gaizauskas (2000) carry out detailed syntactic and semantic analysis of the retrieved documents which is then queried to identify specific answers to the question. Named entity recognition is often used to identify the items in the text which are of the same type as the expected answer and are therefore possible answers to the question.
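
And here is an equally minimal sketch of the surface-pattern style of answer extraction, applied to the Mozart example (the passage and pattern are invented for illustration):

    # A minimal sketch of surface-pattern answer extraction for the question
    # "When was Mozart born?", whose expected answer type is a date.
    import re

    passage = "Wolfgang Amadeus Mozart was born on 27 January 1756 in Salzburg."
    pattern = re.compile(r"Mozart\s+was\s+born\s+(?:on|in)\s+([\w ]*\b\d{4})")
    match = pattern.search(passage)
    if match:
        print("Answer:", match.group(1))      # Answer: 27 January 1756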

The nature of the question answering task, which includes the interpretation of natural language questions and identification of specific concepts and structures within documents, makes the standard document retrieval approach inadequate: TA techniques are required to provide the extra linguistic analysis.

3. Evaluation

The most common approach to evaluation within text analytics is to compare machine output with human judgement. For example, part of speech taggers are normally evaluated by manually annotating a test corpus and then applying the tagger to the same data. Performance is then calculated by comparing the two sets of annotations. However, this process is often complicated by two factors: first, difficulties in obtaining consensus on the ‘correct’ manual annotation, and second, issues in determining exactly how annotations should be compared and scored.

The first challenge is caused by the fact that people often disagree on the correct annotation for a particular fragment of text – for example, they may interpret the same sentence differently. Various steps are taken to maximise consensus between the annotators, but agreement is rarely unanimous and as a result the maximal level of human agreement represents something of a ceiling on the performance of any automated process measured using this technique.

The second challenge is due to the complexity of the analysis that is generated, since it can be difficult to determine how partial matches should be scored. For example, consider the sentence ‘Bill Gates is CEO of Microsoft’. If a NER system suggests ‘Gates’ is the name of a person (rather than ‘Bill Gates’), what credit, if any, should this partially correct answer be assigned? Proposing ‘Gates’ as the name of a person, while not ideal, is certainly better than suggesting ‘Microsoft’ as a person.

Once the systems have been scored, performance is typically measured in terms of precision and recall, where precision is defined as the number of entities correctly tagged divided by the total number of entities tagged, and recall is defined as the number of entities correctly tagged divided by the total number of entities which should have been tagged.

Attempts to maximize one measure will typically compromise the other (all other factors being equal), so in practice a suitable trade-off must be found. For this reason, these measures are rarely considered in isolation, and are instead usually combined into a single value, e.g. the F-measure (which is defined as the weighted harmonic mean of precision and recall).
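
For completeness, here is a minimal sketch of how precision, recall and the balanced F-measure might be computed for an entity tagging task, scoring only exact matches – which is precisely why the ‘Gates’ vs. ‘Bill Gates’ case above is problematic:

    # A minimal sketch of precision, recall and the balanced F-measure (F1) for
    # an entity tagging task, counting only exact matches against a manually
    # annotated gold standard.
    def precision_recall_f1(system, gold):
        system, gold = set(system), set(gold)
        correct = len(system & gold)
        precision = correct / len(system) if system else 0.0
        recall = correct / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return precision, recall, f1

    gold = {("Bill Gates", "PERSON"), ("Microsoft", "ORG")}
    system = {("Gates", "PERSON"), ("Microsoft", "ORG")}   # 'Gates' gets no credit here
    print(precision_recall_f1(system, gold))               # (0.5, 0.5, 0.5)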

There are significant differences between the typical performances of TA techniques. Part-of-speech tagging and named entity recognition can typically be performed with >90% accuracy on business news, while accuracy for information extraction from biomedical documents is typically below 70%. TA applications often combine several component technologies, for example part-of-speech tagging and named entity recognition. There are often dependencies between these components; for example, the named entity recogniser may rely on output from the part-of-speech tagger. Inaccurate output from one component can therefore propagate through to the subsequent processes that rely on it.

However, it should be noted that measures such as precision and recall are fairly blunt instruments, and take no account of the user experience as a determinant of performance or success. For example, Elworthy et al. (2001) demonstrate the application of TA technology to the retrieval of captioned images, and in so doing also provide a novel method of results presentation and unique user experience that could not have been achieved without the use of text analytics. The value of such a capability cannot be expressed by measuring retrieval accuracy alone.

4. Conclusions

In this post we have outlined some of the fundamental techniques in text analytics and explored the ways in which they can be applied to search and information retrieval. In an upcoming post, we’ll complement this with a look at text analytics applied to social media, in particular sentiment analysis and opinion mining.

5. References

  1. Brill, E. (1993) Transformation-Based Error-Driven Parsing. Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.
  2. Church, K. (1988) A Stochastic Parts Program and Noun Phrase Tagger for Unrestricted Text. Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin TX, pp 136–143.
  3. Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P. (1992) A Practical Part-of-Speech Tagger. Proceedings of the 3rd Conference on Applied Natural Language Processing.
  4. Elworthy, D., Rose, T. G., Clare, A. and Kotcheff, A. (2001) A Natural Language System for the Retrieval of Captioned Images. Journal of Natural Language Engineering, Cambridge University Press.
  5. Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961) BASEBALL: An Automatic Question Answerer. Proceedings of the Western Joint Computer Conference, 19, pp 219–224.
  6. Greenwood, M., Roberts, I. and Gaizauskas, R. (2002) The University of Sheffield TREC 2002 Q&A System. Proceedings of the 11th Text Retrieval Conference.
  7. Grishman, R. and Sundheim, B. (1996) Message Understanding Conference – 6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, pp 466–470.
  8. Harabagiu, S., Moldovan, D., Pasca, M., Surdeanu, M., Bunescu, R., Girju, R., Rus, V. and Morarescu, P. (2000) FALCON: Boosting Knowledge for Answer Engines. Proceedings of the 9th Text Retrieval Conference.
  9. Hirst, G. (1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press, Cambridge, England.
  10. Hovy, E., Gerber, L., Hermjakob, U., Junk, M. and Lin, C. (2000) Question Answering in Webclopedia. Proceedings of the 9th Text Retrieval Conference.
  11. Ingersoll, G. S. and Morton, T. S. (2009) Taming Text. Manning Publications.
  12. Jing, H. and Tzoukermann, E. (1999) Information Retrieval Based on Context Distance and Morphology. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99).
  13. Karlsson, F. (1990) Constraint Grammar as a Framework for Parsing English Running Text. Proceedings of the 13th International Conference on Computational Linguistics.
  14. Krovetz, R. (1997) Homonymy and Polysemy in Information Retrieval. Proceedings of the 35th Meeting of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the Association for Computational Linguistics (ACL/EACL-97).
  15. Krovetz, R. and Croft, B. (1992) Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems 10(2), pp 115–141.
  16. Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), pp 313–330.
  17. Ravichandran, D. and Hovy, E. (2002) Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 41–47.
  18. Sanderson, M. (1994) Word Sense Disambiguation and Information Retrieval. Proceedings of the 17th ACM SIGIR Conference.
  19. Schutze, H. and Pedersen, J. (1995) Information Retrieval Based on Word Senses. Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, pp 161–175.
  20. Scott, S. and Gaizauskas, R. (2000) University of Sheffield TREC-9 Q&A System. Proceedings of the 9th Text Retrieval Conference.
  21. Soubbotin, M. and Soubbotin, S. (2001) Patterns of Potential Answer Expressions as Clues to the Right Answers. Proceedings of the 10th Text Retrieval Conference.
  22. Voutilainen, A. (2003) Part-of-Speech Tagging. In Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, pp 219–232. Oxford University Press.
  23. Wilks, Y. (1975) A Preferential, Pattern-Seeking, Semantics for Natural Language Inference. Artificial Intelligence 6.
  24. Wilks, Y. and Stevenson, M. (1998) The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation. Natural Language Engineering 4(3).
  25. Woods, W. (1973) Progress in Natural Language Understanding – An Application to Lunar Geology. AFIPS Conference Proceedings, vol. 42, pp 441–450.

6. Glossary

Anaphora resolution: the process by which references between separate expressions are identified and resolved.

Artificial intelligence: the intelligence of machines and the branch of computer science which aims to create it.

Categorization: the process by which ideas and objects are recognized, differentiated and understood.

Classification: the process by which an electronic document is assigned to one or more predetermined categories, based on its contents. This contrasts with document categorization, in which the categories must typically be discovered as part of the process.

Clustering: the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

Derivational morphology: the process by which word meaning is changed through the application of derivations, e.g. the suffix “ee” changes the verb employ into the noun employee.

F-measure: a measure of a test’s accuracy. It considers both precision and recall.

Homonym: one of a group of words that share the same spelling and the same pronunciation but have different meanings.

Hypernym: a word or phrase whose semantic range subsumes that of a set of other words.

Inflection: the process by which language expresses grammatical relations and relational categories such as tense, mood, voice, aspect, person, number, gender and case.

Information Extraction: the process by which structured information is extracted from unstructured textual data.

Lexical analysis: the process of converting a sequence of characters into a sequence of tokens.

Morphology: the study of the structure and content of word forms.

Named entity recognition: a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc.

Natural language processing: a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.

Parsing: see syntactic analysis

Part of speech: a linguistic category of words which is generally defined by the syntactic or morphological behaviour of the word in question

Part of Speech Tagging: the process of marking up the words in a text as corresponding to a particular part of speech

Phrase detection: the recognition of groups of words functioning as a single unit in the syntax of a sentence.

Polysemy: the capacity for a word or phrase to have multiple meanings.

Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

Question answering: the task of automatically answering a question posed in natural language.

Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents.

Relation extraction: the process by which relations between named entities within a single sentence are identified.

Relevance: the extent to which a topic of a search result matches the topic of the query or information need.

Sentence boundary detection: the problem in natural language processing of determining where sentences begin and end.

Sentiment analysis: the process by which the attitude of a speaker or a writer with respect to a given topic is identified. It is usually classified as positive, negative or neutral.

Stemming: the process by which inflected (or derived) words are reduced to their root form.

Summarisation: the ability to create an abstract or topical summary from a free text document.

Syntactic analysis: the process of analyzing a text to determine its structure with respect to a grammar

Text analytics: a set of linguistic, analytical and predictive techniques by which structure and meaning can be extracted from textual data. Similar to text mining, but with a broader focus on interactive applications and a greater emphasis on user experience.

Text mining: the process by which meaning, structure and patterns can be derived from textual data.

Tokenisation: see lexical analysis

Wordnet: a lexical database for the English language. It groups English words into sets of synonyms, provides short, general definitions, and records the various semantic relations between these synonym sets.

Word sense disambiguation: the process of identifying which sense of a word is used in any given sentence, when the word has a number of distinct senses.


As I posted a few months ago, I’m organizing the Industry Event at CIKM 2011 with Daniel Tunkelang. We have a great set of keynotes lined up:

We’re also looking for submissions from industry researchers and practitioners. The submission deadline is June 21.

Here is a copy of the call for papers:

This year’s CIKM conference will include an Industry Event, which will be held during the regular conference program in parallel with the technical tracks.

The Industry Event’s objectives are twofold. The first objective is to present the state-of-the-art in information retrieval, knowledge management, databases, and data mining, delivered as keynote talks by influential technical leaders who work in industry. The second objective is to present interesting, novel and innovative industry developments in these areas.

Industry authors are invited to prepare proposals for presenting interesting, novel and innovative ideas, and submit these to industry@cikm2011.org by June 21st 2011. The proposals should contain (with respective lengths):

  • Short company portrait (125 words)
  • Short CV of the presenter (125 words)
  • Title and abstract of the presentation (250 words)
  • Reasons why the presentation should be interesting to the CIKM audience

When submitting a proposal, please bear in mind the following:

  • Ensure the presentation is relevant to the CIKM audience (the Call for Papers gives a good idea of the conference scope).
  • Try to highlight interesting R&D challenges in the work you present. Please do not present a sales pitch.
  • All slides will be made public (no confidential information on the slides; you will be expected to ensure your slides are approved by your company before being presented).
  • Presenters may opt to have their presentation videoed and made public, and if so, the presenter will be asked to sign a release form.

We look forward to receiving your submissions, and welcoming you to the CIKM 2011 Conference and Industry Event.

Important dates:
21 June 2011: Industry Event paper proposals due
19 July 2011: Notifications sent
27 October 2011: Industry Event
24-28 October 2011: CIKM conference


Searching for Meaning in Text

Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name which I co-authored with Mark Stevenson of Sheffield University, and which appears in the book “Information Retrieval: Searching in the 21st Century”. The session was scheduled for 3 hours, and to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of PowerPoint – the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the Slideshare version, but I am happy to make them available to folks who want to enjoy the full experience.


Adam Wyner presents his work on Text Analytics for Legal Cases

Better late than never … but here’s a quick report on the 3rd London Text Analytics meetup, which took place in early March at Fizzback’s offices on the Strand. I was particularly pleased to make this meeting a success as kick-starting the group was one of my resolutions for 2011… I’m convinced that there is significant potential for such a group in bringing together the nascent text analytics community in London.

First up was Adam Wyner of Liverpool University, who described his work in applying text analytics to legal cases. Lawyers, it seems, are very text-oriented in their work, and much of their activity is based on the interpretation of legal precedent established through analysis of previous cases. This analysis takes the form of comparison based on a variety of legal factors, which are used in the argumentation and reasoning underpinning a given case. Clearly, if tools could be provided to reliably extract or identify such factors in natural language text, then the potential for facilitating the legal process (and providing tools for trainee lawyers) is considerable. Adam’s approach is based on a bottom-up process that identifies patterns of co-occurrence of specific legal concepts and terms, which are then mapped onto a taxonomy of 27 base-level factors. He’ll be talking more about this at the upcoming ICAIL workshop on Applying Human Language Technologies to the Law.

Next up was Adi Andrei, who presented an intriguing account of how to use text analytics as a catalyst for new product generation. His core idea was essentially that product features or “themes” that sell well within a certain category (of product) can be productively applied to a new category to generate novel product ideas. The core process seems to be something like this:

  • Identify themes T that sell well in Category C
  • Introduce these themes in a new Category
  • Repeat for all T & C

He presented some quite plausible data showing how the themes of “sleep” and “sustainability” (as represented by their natural language descriptions in the product specifications and marketing material) had been extended to diverse products such as Tea and Laundry products. I know it sounds too good to be true – and no doubt my account has skipped over some of the key details – but I was encouraged to see that Adi had previously applied similar techniques at NASA and was able to present further data and a plausible rationale to back up his account. This is one application of text analytics I shall definitely watch with interest.

So all in all, a productive evening, and good to see a full house. In fact, we’ll probably re-use the same formula at future events: a research-oriented talk, a practitioner-oriented talk, then drinks in a local pub. I’m already on the lookout for speakers for our next event, so if you’re amenable to having your arm twisted in such a way, drop me a line.


Here’s a New Year’s resolution: to kick-start the London Text Analytics group and get the programme of MeetUps going again. The Enterprise Search London group certainly sets the example: 250+ members and 8 meetings over the course of 2010. By contrast, Text Analytics is flatlining at around 40 members, with just a couple of Meetups thus far. Of course, this disparity may simply be nature’s way of telling me as organiser to be more proactive… but it does make me wonder:  what is a realistic size for a group of this nature anyway?

I’ve worked in NLP and IR for almost two decades, in both research and commercial settings, and have seen many changes in that period. I’ve seen many ideas and approaches transfer from one discipline to the other, and consider the two communities to be closely related. But beneath the superficial associations there are some significant differences that fundamentally influence the dynamics of the two communities.

Search, or Information Retrieval in its broadest sense, represents an activity that has clear touchpoints on the lives of many diverse professions across a variety of organisational contexts. If search engines didn’t exist, then human resources would still be deployed to perform the functions they provide (albeit somewhat less efficiently). In this respect, search is embedded in the fabric of many organisations, creating a common vocabulary and set of professional practices for the individuals within them.

Text analytics, by contrast, is quite different. The primary challenges within the NLP community have traditionally been tasks such as word sense disambiguation, named entity recognition, anaphora resolution, and so on. How many of these have tangible counterparts outside of the research lab? How often are human resources routinely deployed to undertake such tasks as part of normal business operations? With the possible exceptions of speech recognition (e.g. captioning services for broadcast media) and machine translation (e.g. of EU parliamentary proceedings), the answer would seem to be very few. NLP just does not touch on the lives of diverse professionals in the same way as IR. Even a cursory glance at the frequencies with which these terms appear in Google Books Ngram Viewer would seem to indicate a differential of around 5-6 over the last decade (in favour of IR):

"Natural Language Processing" vs. Information Retrieval"

So maybe the difference in the sizes between the two groups is a reflection of the natural order of things. That said, I am convinced that there are many more individuals in London & the South East working in NLP / text analytics who would benefit from a closer association with their peers. With that in mind, I’ll be twisting a few arms in the near future to get the best we can out of the MeetUp group. Watch this space for updates 🙂


Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.


After 3 months or so in incubation, last Weds finally saw the inaugural meeting of the London Text Analytics group. A dozen or so of us convened at Fizzback’s offices in Leicester Square, for a talk entitled “Making Natural Language Processing Work in Practical Applications” by Udo Kruschwitz of Essex University. Udo was most generous with his time (he packed about 3 hours of material into less than a third of that), and the meeting continued in a local pub afterwards.

London Text Analytics inaugural meetup

It was encouraging to see so many familiar faces there (they know who they are!) plus a few whom I hadn’t met before. More please! As I posted last week, the potential audience for a group such as this could be significant, and it would be great to have a forum for text analytics / NLP folks to get together. Virtual groups have their place of course, but they’re no substitute for physical gatherings. This is especially so for a community like NLP, where the meeting calendar is so dominated by events aimed at researchers (as opposed to practitioners). Hopefully this is changing (albeit gradually) – over the last few years there’s been a significant shift in the composition of the NLP community, with increasingly greater representation from the practitioner community. But we still have a long way to go. And in that respect, the London TA group is just one tiny step.

If you’re interested in joining us for the next meet, just sign up at MeetUp.com. Better still, submit an idea for an event via the site. See you at the next MeetUp!


OK, so after a period in stealth mode we’re now finally launching the London Text Analytics group. Well, I say “launching”, but it’s really just half a dozen of us meeting at Fizzback’s offices for a talk and maybe a beer afterwards. It’s been a while in the making (mainly due to summer vacations & so on) but I’m pleased to say that my BCS colleague Udo Kruschwitz from Essex University has agreed to give a presentation on “Making Natural Language Processing Work in Practical Applications”. And as I said, for this first meeting I’m not expecting huge numbers. But you never know – I can think of several text analytics organisations based in South East England with whom we share a common interest, such as:

…and many more organisations who are based overseas but have offices in the UK. And then of course there are the startups in the field of brand reputation management / customer experience, who make use of lightweight NLP techniques to measure online sentiment. So in principle that’s quite a large population we could be addressing.

We’d be delighted to hear from folks working for any of the above organisations. If you have a passion for text analytics & NLP, there’s a lot we can learn from each other. Further details at: http://www.meetup.com/textanalytics/.


IR book cover

I’ve just received my complimentary copy of “Information Retrieval: Searching in the 21st Century”. This book is co-written with my colleagues on the BCS IRSG committee, and edited by Ayse Goker and John Davies. It has been several years in the making, so my thanks and congratulations go to Ayse and John for doing an excellent job in seeing this through from initial idea to finished volume.

My own chapter, co-written with my ex-Reuters colleague Mark Stevenson, is “The Role of Natural Language Processing in Information Retrieval: Search for Meaning and Structure”. The chapter is perhaps a little more academic in style than the material I would normally write, but hopefully in keeping with the scholarly traditions of the BCS and also consistent with other contributions in this volume.

The book is available from Amazon, and I hope to be able to offer a copy for review in Informer. In the meantime, I’ve appended Stephen Robertson’s foreword below. If you’d like to volunteer to review the book for Informer, let me know.

In the forty years since I started working in the field, and indeed for some years before that (almost since Calvin Mooers coined the term information storage and retrieval in the 1950s), there have been a significant number of books on information retrieval. Even if we ignore the more specialist research monographs and the ‘readers’ of previously published papers, I can find on my shelves or in my mental library many books that attempt (probably with the IR student in mind) to construct a coherent and systematic way of defining and presenting information retrieval as a field of study and of application.

Often such a book is the work of a single author, or perhaps a pair working together. Such works can clearly have an advantage in respect of coherence; the field is necessarily presented from a single viewpoint. On the other hand, they can also suffer for the same reason. The IR field is rich (more so now than it has ever been), and it is difficult within a single viewpoint to do justice to this richness. Readers, on the other hand, have to be constructed out of the materials to hand: the published papers, each of which has taken its own view, probably with a much narrower field of vision, and different from that of the other chosen papers.

The present book attempts the tricky task of combining the breadth of vision of multiple authors with the coherence of a single integrated work. The richness of the field is apparent in the range of chapters: from formal mathematical modelling to user context, from parallel computation to semantic search.

The topics covered also vary greatly in their historical association with the field. Categorisation, for example, has been around as an IR technique for quite a long time – though Stuart Watt brings a new perspective. Mobile search (David Mountain, Hans Myrhaug and Ayşe Göker), however, is a relatively recent development. The use of formal models (information retrieval models, Djoerd Hiemstra) goes back almost to the beginning, as does experimental evaluation (user-centred evaluation of information retrieval systems, Pia Borlund), though in both cases there have been huge changes in the past decade.

This same decade has witnessed the huge growth of the World Wide Web, and the developing dominance of web search engines (web information retrieval, Nick Craswell and David Hawking) as the glue which holds the web together. For many people today, IR is web search. It is true that there has been a huge amount of influence in both directions: search engines are largely based on techniques from both the IR research community and from previous operational systems, while IR research and practice in other environments has learnt a great deal from the forcing-house that is the web search space. This dominance of the web as the domain of interest is well reflected in many of the chapters in the present volume.

It is important, however, to remember that IR is not all about web search, and that the web space presents both problems and opportunities which differ from those in other domains. The desktop, the enterprise, and specialist collections such as scientific papers are all examples of different domains for which search functionality is a fundamental requirement. There are references to several of these throughout the book, but specific domains with their own chapters are multimedia resource discovery (Stefan Rüger) and image users’ needs and searching behaviour (Stina Westman). The user theme is taken further in the chapter on context and information retrieval (Ayşe Göker, Hans Myrhaug and Ralf Bierig). More generic problem areas are addressed in cross-language information retrieval (Daqing He and Jianqiang Wang), in semantic search (John Davies, Alistair Duke and Atanas Kiryakov) and in the chapter on natural language processing (Tony Russell-Rose and Mark Stevenson). Finally, a chapter on performance issues and parallelism (Andrew MacFarlane) addresses more technical computing concerns.

Information retrieval, from being the rather arcane subject in which I did my masters degree forty years ago, has become one of the defining technologies of the twenty-first century. I believe the present book does justice to this status.

Stephen Robertson, 2008


Last week I attended the First GATE Summer School (“FIG’09”) at Sheffield University, from July 27–31. For those unfamiliar with GATE, it is generally considered to be the premier open source platform for text analytics, and described by some as the “VW Beetle of NLP” (i.e. “it will overtake everyone – eventually”). If you subscribe to this view, then you’d probably also conclude that those left behind include OpenNLP, NLTK, UIMA and various other text analytics / NLP toolkits. All worthy initiatives, but with ~35,000 downloads per year and thousands of users, few can claim the same degree of success or popularity as GATE.

The GATE Summer School also comes at an interesting time for the text analytics community more generally. When the Reuters NLP Group (which I led) closed in 2003, there was very little activity happening outside of academia. Half a decade later, it’s a different picture – people can and do make a reasonable living as text analytics consultants, with the rise of blogs, user-generated content and social media creating ever more opportunities for interesting text analytics applications and services. And, at the risk of tempting fate, there are growing signs of liquidity in the jobs market. It is still, however, a long way from being a mainstream profession, with recognised qualifications, roles, career progression, accreditation schemes, and so on (although for developments on the latter, read on).

The view from the back

Small room, big crowd

Anyway, back to GATE. When I reflect on initiatives to create generic platforms or infrastructure within a particular community, it brings to mind the old adage that “in a gold rush the best way to get rich is to make shovels”. And GATE is a pretty useful shovel – the fact that the summer school filled a whole week (and could have gone on longer) is a reflection of the breadth and flexibility of the platform. In fact, the Summer School did have a second week, but this was aimed specifically at GATE developers and committers rather than end users. And in GATE’s case, much of the richness is shared by the community as a whole, thanks to its commitment to open source philosophy and practices.

A further strength of GATE is that it is built around an extensible, component-based model, consisting of linguistic resources (the documents or corpora to be processed) and processing resources (the tools and algorithms that do the processing). The latter are typically aggregated into applications, i.e. pipelines that represent a sequence of operations over a corpus. All of these concepts can be managed through a graphical development environment, and bespoke applications can be built directly around the Java API. GATE also integrates with various 3rd-party tools, such as LingPipe, OpenNLP, OpenCalais, etc.

Networking coffee break

The week itself consisted of a series of NLP topics and applications, divided into 1-2 hour blocks, most with an element of theory (delivered via powerpoint) and an associated practical session (hands on at the computer). Overall, I have to say that pretty much every session was of interest and value, which is so rarely the case with a week-long course. There were, however, significant opportunities to improve the delivery (more on that below). And GATE itself would undoubtedly benefit from greater attention to the user experience – as far as I am aware, no-one has yet “owned” this aspect of its development – and it shows. In fact, there are a number of specific UX issues that deserve a broader analysis, so watch this space for a separate post on the subject.

Over the course of the week, I’d say the highlights were:

  1. Understanding JAPE: This is the built-in language for linguistic pattern matching, allowing a user to create regular expressions to identify structures in text (which can range from simple patterns such as postcodes to much more complex phenomena such as stock price movements). However, the support that GATE provides for writing JAPE seems relatively primitive. In fact, it seems that many users write JAPE code in an external text editor, then import it via the GATE IDE. This seems bizarre, and probably goes some way to explaining why it is apparently very easy to write inefficient JAPE (which for a production environment must be a major limitation). Surely we can do better than this, given that it is likely to underpin pretty much any knowledge-based text analytics application. At the very least, someone should provide a TextPad syntax definition template. Better still, a PhD project to investigate and build some decent JAPE support tools that promote learning, re-use, efficiency, etc.
  2. Machine Learning: this is the counterpart to the JAPE approach, allowing a developer to induce linguistic patterns directly from training data. GATE provides extensive support for building machine learning NLP applications, but the usual caveats apply, i.e. you need large quantities of gold-standard, annotated training data. Consequently, support for manual annotation is a major issue, and a significant part of week was devoted to the associated workflow and quality control tools, such as GATE Teamware.
  3. Opinion mining: As a UX practitioner I am interested in the subjective user experience, expressed through both explicit behaviour (i.e. task completion etc.) and implicit attitudes and preferences. In particular, I am intrigued by the possibilities in triangulating between the qualitative, behavioural, micro-level feedback obtained from typical usability studies and the quantitative, attitudinal, macro-level data acquired through opinion mining, to create a ‘360 degree view’ of the user experience. This is an area I’d love to explore more, and will be one of my first areas of experimentation with GATE.

In summary, I’d say the first GATE summer school was a well-organised event with an excellent programme.  But there were some significant areas where delivery could be improved:

  • If you’re running a course with ~55 attendees, you need to make sure everyone can see and hear from the back. Use 2 projectors, and have presenters wear lapel microphones.
  • Go at the pace of the slowest machine – which will invariably be those of the attendees, not the presenters.
  • Tell the attendees clearly what slide deck you are using before starting your talk, and give them a chance to open it in advance.
  • Work out the choreography of when each lecture finishes and when the associated practical begins, and make sure everyone knows the difference. Tell people how long they have for each exercise, and give them clear timechecks when they are about to timeout.
  • Implement a “quick reset” function for GATE that clears down the old exercise in readiness for the new (or at the very least, tell the class how to do this in the first practical)
  • Standardise the format and approach of the practicals. Greater consistency will allow participants to adapt better to the pedagogical approach and focus more on the material and less on the presentation.
  • Make the practicals goal-oriented. Give people a clear problem statement at the outset, so that low-level tasks can be understood within the context of a wider goal.
  • Explain key concepts using diagrams, not the GATE UI. If you do the latter, participants will inevitably try to keep up, and will typically not succeed (for the reasons above). Avoid the problem by using suitable visuals and an associated narrative.

It’s interesting to note that the summer school closed with the award of an “intermediate training course certificate” to each of the attendees, no doubt a first step on the road to formal accreditation of competence within the GATE community. FIG ’09 was oversubscribed, so there is already talk of another presentation later in the year (it’ll probably have to be renamed a “Winter School” though). On the basis of FIG ’09, would I recommend it? Absolutely. In the meantime, watch this space for an update on the GATE user experience.

