Prostitutes Appeal to Pope: Text Analytics applied to Search

July 28, 2011 by Tony Russell-Rose

Here’s something I’ve been meaning to share for a while. The content was originally written as notes to complement my tutorial on NLP + IR, but I thought it might be useful to share it with a wider audience. At some point I’ll update it to include a section on Sentiment Analysis (who doesn’t these days?!). In the meantime I hope you find it useful – feel free to suggest additions or improvements either here or offline.

1. Introduction

The field of text analytics is laden with jargon and acronyms, so before we go any further, let’s clarify a few terms.

Text analytics describes a set of linguistic, analytical and predictive techniques by which structure and meaning can be automatically extracted from unstructured documents. Text analytics (TA) provides the ability to cluster, categorize and classify unstructured data, supporting both interactive exploration and automated document processing. TA processes can thus be used to improve the accuracy of search engines and to transform the search user experience by enhancing results presentation and facilitating interactive navigation and exploration.

The term text mining (TM) has a history of at least a decade and is well understood by practitioner and research communities. The term text analytics has perhaps a slightly shorter history than text mining, but is growing rapidly in popularity as a label for the application of NLP techniques in practical, commercially-oriented contexts.

Some authors treat TM and TA as synonyms, but this misses a subtle but important distinction: TM has its origins in automated, offline processing, characterised by ‘data in’ and ‘structure out’. It is primarily concerned with identifying and inferring aggregate patterns and insights across collections of documents. By contrast, TA is a more generic term, often associated with interactive, analytic applications and a focus on the analysis of individual documents.

1.1 The Challenge of Text Analytics

On the surface, the problem that TA attempts to solve appears trivial. After all, as humans we effortlessly create and interpret both spoken and written language, and rarely have to apply conscious thought to the process. But that, in many ways, is part of the problem: if the history of artificial intelligence teaches us anything, it is that many of the faculties we possess as humans are not open to introspection and are therefore notoriously difficult to analyse and encode algorithmically. The task is thus far from trivial. Natural language is littered with ambiguity, and it is only because we have evolved highly parallelised linguistic processes that work across multiple levels that we can make any sense of it at all.

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

DRUNK GETS NINE YEARS IN VIOLIN CASE
PROSTITUTES APPEAL TO POPE
STOLEN PAINTING FOUND BY TREE
RED TAPE HOLDS UP NEW BRIDGE
DEER KILL 300,000
RESIDENTS CAN DROP OFF TREES
INCLUDE CHILDREN WHEN BAKING COOKIES
MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.

1.2 Text Analytics Opportunities

It has been estimated that as much as 80% of all potentially usable business information originates in unstructured form, e.g. as:

Email, news and blog articles; forum postings; and other social media;
Contact-centre notes and transcripts;
Surveys, feedback forms and warranty claims;
Various types of corporate document, etc.

Moreover, the rate at which the world’s digital data repositories are growing creates a further challenge: according to International Data Corporation, the world generated 161 billion gigabytes (161 exabytes) of digital information in 2006, a figure expected to grow to approximately 988 exabytes by 2010 (Ingersoll & Morton, 2009). And even if much of this data is in the form of audio, video, images, and so on, our current approaches to making it searchable still revolve around the provision of manual summaries, captions and keyword tags. In other words, we are producing yet more text that has to be analysed and managed.

In addition to the growth of social media, we are seeing progressively more eCommerce sites employing user-generated content in the form of product reviews and opinions to support the purchasing process and drive sales growth. To maximise the value of this content, these organisations also need to employ analytic tools to make it searchable and navigable.

2. Fundamentals

In the section above we alluded to the highly parallelised nature of human language processing, and the notion that language understanding requires the application of linguistic and pragmatic knowledge across a number of levels. Text analytics is no different – all but the most trivial of systems will apply linguistic analysis across a number of levels, often in cascaded fashion, where the output of the simple, lower level processes feeds into more complex, higher level ones. In this section we will review the main techniques involved.

2.1 Basic Processing: Lexical & Syntactic Analysis

At the most basic level is the treatment of character encodings, such as ASCII, Unicode, UTF-8 and so on. Although not traditionally considered an active area for NLP research, these processes nonetheless have a direct influence on the representation of case (upper & lower), punctuation, accents, and so on, which all in turn carry meaning at the lexical level (see below).

At the level above characters we deal with terms and their variants, i.e. words, abbreviations, acronyms, compounds and so on. At this level, known as lexical analysis, we deal with tokenisation, i.e. the recognition of word boundaries. This process is relatively straightforward for languages that use whitespace as a delimiter, such as English, but is harder for languages such as Chinese and Japanese. At the lexical level we also deal with morphology, i.e. the process by which words are formed. Morphological analysis allows us to identify inflectional variants, such as “run” and “ran”, which are the present and past tenses of a common stem. In addition, we can also apply derivational morphology to recognise that “runner” is a further variant of “run” and “ran”, albeit in a different syntactic category (i.e. a noun in this case).

Note that it is possible to write a reasonably accurate automatic stemmer for English, which will conflate sets of related terms to a common root. A popular example of one such tool is the Porter stemmer.
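To make tokenisation and stemming concrete, here is a minimal sketch in Python. It is not the Porter stemmer: the suffix list and the `min_stem` threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Illustrative suffix list only; the real Porter stemmer applies several
# ordered rule phases with conditions on the remaining stem.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenise(text):
    """Split text into lowercase word tokens, using non-letters as boundaries."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

tokens = tokenise("The miners refused to work; mining stopped.")
stems = [naive_stem(t) for t in tokens]
print(list(zip(tokens, stems)))
```

Note that plain suffix stripping conflates "miners" with "miner", but leaves irregular forms such as "ran" untouched, so it cannot relate "ran" to "run" without additional morphological knowledge.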

Finally we use the outputs of these processes to perform syntactic analysis, i.e. the interpretation of the grammatical structure of a piece of text. A popular example of the role of grammar in English is to consider the phrases ‘Venetian blind’ and ‘blind Venetian’ – they contain common words but possess a very different meaning. The most elementary form of syntactic analysis is the application of a Part-of-Speech (PoS) tagger, which attaches a tag to each word to indicate whether it is being used as a noun, verb, adjective, determiner, and so on. The set of tags used by a particular PoS tagger may be used to emphasise certain grammatical distinctions. For example, the Penn Treebank tagset (Marcus et al. 1993), which is commonly used by PoS taggers, has four separate tags for noun:

NN (singular common noun or mass noun)
NNP (singular proper noun)
NNS (plural common noun)
NNPS (plural proper noun).

Part of speech tagging has been researched since the 1950s (Voutilainen 2003) and a number of tagging systems are now easily available. These systems use a number of different strategies, including handwritten rules (Karlsson 1990), hidden Markov models (Cutting et al. 1992) and n-grams learned from text annotated with part of speech information (Church 1998).
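The simplest statistical strategy of this kind can be sketched as a unigram tagger, which assigns each word the tag it received most often in annotated text. The training sample below is a tiny hand-made assumption, not real corpus data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-made training sample (an assumption for illustration),
# annotated with Penn Treebank tags.
TAGGED = [
    ("the", "DT"), ("deer", "NNS"), ("kill", "VBP"), ("residents", "NNS"),
    ("the", "DT"), ("kill", "NN"), ("kill", "VBP"), ("pope", "NNP"),
    ("trees", "NNS"), ("drop", "VB"),
]

def train_unigram_tagger(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NN"):
    """Tag each token; unseen words fall back to a default tag."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

model = train_unigram_tagger(TAGGED)
tagged = tag(["Deer", "kill", "trees"], model)
print(tagged)
```

A unigram model ignores context entirely, which is why practical taggers add information about neighbouring words or tags.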

Operating around the syntactic level are various other analytic processes, such as:

Sentence boundary detection, which sounds trivial but is deceptively complex. A sentence splitter is publicly available from CPAN.
Phrase detection, i.e. recognition of noun phrases and other syntactic compounds.
Summarisation, i.e. the ability to create an abstract or topical summary from a free text document. This sounds complex, but becomes much more tractable with semi-structured text such as HTML content. A summariser for HTML text is publicly available from CPAN.
Parsing, i.e. the analysis of a sequence of tokens to determine their syntactic structure with respect to a particular grammar. In practice this often means analysing PoS-tagged text to identify Subject-Verb-Object and other relationships to produce higher level linguistic analyses of the text.
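To illustrate why sentence boundary detection is harder than it sounds, here is a rough rule-based sketch in Python. It is not the CPAN module mentioned above; the abbreviation list and the rule for single-letter initials are assumptions that a real splitter would need to extend considerably.

```python
import re

# Hypothetical abbreviation list; a real splitter would need a far larger one.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "jr", "e.g", "i.e"}

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace and a capital letter,
    unless the preceding token is a known abbreviation or an initial."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+(?=[A-Z])", text):
        preceding = text[start:match.start()].split()
        last = preceding[-1].lower() if preceding else ""
        if last in ABBREVIATIONS or len(last) == 1:
            continue  # e.g. "Mr. Dooner" or "Don H. Pace": not a boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("Pace American Group Inc. said it notified two top executives. "
        "The executives are Don H. Pace and Greg S. Kaplan.")
for sentence in split_sentences(text):
    print(sentence)
```

Even this small example needs three separate rules to avoid splitting after "Inc.", "H." and "S.", which gives a sense of how quickly the exceptions accumulate.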

2.2 Named Entity Recognition

Named entity recognition (NER) is the process by which key concepts such as the names of people, places and organizations are identified within a document. Some commercially available systems also include the ability to identify further entity types, such as postcodes, telephone numbers and so on. In addition, some systems have been built to identify more esoteric items such as the names of proteins. NER is best illustrated using an example, such as the following extract from a news article:

“‘Mexico has been trying to stage a recovery since the beginning of this year and it’s always been getting ahead of itself in terms of fundamentals,’ said Matthew Hickman of Lehman Brothers in New York.”

The named entities we might expect to extract from this sentence would be:

Persons:          Matthew Hickman

Organisations:    Lehman Brothers

Cities:           New York

Countries:        Mexico
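A very simple way to approximate this output is gazetteer lookup combined with a contextual cue for person names. The sketch below is an illustration only: the gazetteer entries, the "said X of" pattern and the function name are all assumptions, and the example sentence is shortened from the extract above.

```python
import re

# Hypothetical gazetteers; real systems combine much larger lists with
# contextual rules or statistical models.
GAZETTEER = {
    "Countries": ["Mexico"],
    "Cities": ["New York", "York"],
    "Organisations": ["Lehman Brothers"],
}
# Assumed cue: a capitalised name sequence between "said" and "of".
PERSON_CUE = re.compile(r"said ([A-Z][a-z]+(?: [A-Z][a-z]+)+) of")

def find_entities(text):
    """Return (label, name) pairs in order of appearance in the text."""
    found = []
    taken = [False] * len(text)
    entries = [(name, label) for label, names in GAZETTEER.items() for name in names]
    # Longest names first, so that 'New York' is preferred over 'York'.
    for name, label in sorted(entries, key=lambda e: -len(e[0])):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            if not any(taken[m.start():m.end()]):
                found.append((m.start(), label, name))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    for m in PERSON_CUE.finditer(text):
        found.append((m.start(1), "Persons", m.group(1)))
    return [(label, name) for _, label, name in sorted(found)]

text = ("Mexico has been trying to stage a recovery since the beginning of "
        "this year, said Matthew Hickman of Lehman Brothers in New York.")
for label, name in find_entities(text):
    print(f"{label}: {name}")
```

Matching longer entries first is what keeps 'New York' from being reported as the city of 'York', the same distinction that matters for filtering later in this section.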

The accurate identification of such entities confers a number of benefits. Firstly, by identifying such concepts in a document, it is possible to index documents at a more fine-grained level, which in turn allows more precise searching. The word ‘Bush’, for example, can refer to either a plant or an ex-President of the USA. By differentiating between words such as these (known as homonyms) we can maximize retrieval precision, i.e. the likelihood that information seekers will receive only those documents that are relevant to their query.

Secondly, by identifying names within a document we can provide links to other related resources. Several content providers already use this technique to highlight the names of the key people mentioned in a news story, and provide links to their biographies. Similarly, it is possible to create more precise filters based on those named entities, allowing personalised delivery of content to individual users. For example, a news service might supply details of upcoming events in ‘New York’ as opposed to ‘new’ events in the city of ‘York’.

Thirdly, the output of NER adds valuable structure to a document, and thus facilitates subsequent post-processing transformations, such as machine translation (e.g. translating documents from one natural language to another) or text-to-speech output (e.g. providing spoken versions of stock quotes for access by phone).

2.3 Information Extraction

Information Extraction is the process by which structured information is automatically extracted from unstructured documents. Examples of the types of information which might be extracted include the movements of company executives, victims of terrorist attacks, information about mergers and acquisitions and interactions between genes and proteins in scientific articles. When the relevant information has been identified it is then stored in a highly structured format known as a template.

The Message Understanding Conferences (MUC) (Grishman and Sundheim 1996) were an international evaluation exercise in which information extraction systems were tested against one another using common documents and scoring systems. In the sixth MUC conference, participant systems were tasked with the challenge of identifying facts about the movements of executives between companies. The following paragraph shows a typical example, which describes an event in which ‘John J. Dooner Jr.’ becomes chairman of the company ‘McCann-Erickson’:

“Now, Mr James is preparing to sail into the sunset, and Mr Dooner is poised to rev up the engines to guide Interpublic Group’s McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr Dooner, 45.”

This fact is encoded in the following template structure:

<SUCCESSION EVENT-2> :=
    SUCCESSION ORG: <ORGANIZATION-1>
    POST: "chairman"
    IN AND OUT: <IN AND OUT-4>
    VACANCY REASON: DEPART WORKFORCE

<IN AND OUT-4> :=
    IO PERSON: <PERSON-1>
    NEW STATUS: IN
    ON THE JOB: NO
    OTHER ORG: <ORGANIZATION-1>
    REL OTHER ORG: SAME ORG

<ORGANIZATION-1> :=
    ORG NAME: "McCann-Erickson"
    ORG ALIAS: "McCann"
    ORG TYPE: COMPANY

<PERSON-1> :=
    PER NAME: "John J. Dooner Jr."
    PER ALIAS: "John Dooner" "Dooner"

These template structures contain a large amount of information that enables complex queries to be carried out against the extracted data. For example, a user could ask for a list of all the events where a finance officer left a company to take up the position of CEO in another company. Once the templates have been extracted for a set of documents they can be used to populate a knowledge base derived from the corpus, or used as document metadata to improve retrieval precision.
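As a rough illustration of how filled templates support structured queries, the sketch below models the succession template as Python dataclasses. The class and field names are my own adaptation of the slot names, not part of the MUC specification, and the `incoming` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Organization:
    name: str
    aliases: List[str]
    org_type: str

@dataclass
class Person:
    name: str
    aliases: List[str]

@dataclass
class InAndOut:
    person: Person
    new_status: str          # "IN" or "OUT"
    on_the_job: bool
    other_org: Organization
    rel_other_org: str

@dataclass
class SuccessionEvent:
    org: Organization
    post: str
    in_and_out: List[InAndOut]
    vacancy_reason: str

mccann = Organization("McCann-Erickson", ["McCann"], "COMPANY")
dooner = Person("John J. Dooner Jr.", ["John Dooner", "Dooner"])
event = SuccessionEvent(
    org=mccann,
    post="chairman",
    in_and_out=[InAndOut(dooner, "IN", False, mccann, "SAME ORG")],
    vacancy_reason="DEPART WORKFORCE",
)

def incoming(events, post):
    """List (person, organisation) pairs for people moving into a given post."""
    return [
        (io.person.name, e.org.name)
        for e in events if e.post == post
        for io in e.in_and_out if io.new_status == "IN"
    ]

print(incoming([event], "chairman"))
```

Once events are held in this form, a query such as "who became chairman, and where?" is a simple filter over slots rather than a search over free text.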

However, information extraction from all but the most simple of documents is an imperfect process. One of the main reasons is that the description of an event is often spread across several sentences or even across separate paragraphs. For example, consider the following pair of sentences, which contains information about management succession events:

“Pace American Group Inc. said it notified two top executives it intends to dismiss because an internal investigation found evidence of ‘self-dealing’ and ‘undisclosed financial relationships’. The executives are Don H. Pace, cofounder, president and chief executive officer; and Greg S. Kaplan, senior vice president and chief financial officer.”

The name of the organisation and the fact that two executives are leaving are contained in the first sentence. The names of the two executives and their posts are listed in the second sentence, but it does not mention the fact that the executives are leaving these posts. Therefore, the succession event can only be fully understood by combining the information contained in both sentences. Combining such information across sentences is non-trivial since it is necessary to identify the phrases which refer to the common entities, i.e. ‘two top executives’ and ‘the executives’ in the above example. The process by which references between instances of one expression and another are identified and resolved is known as anaphora resolution.

A further challenge is recognising instances when a given entity may be referred to by different surface forms. For example, ‘International Business Machines Ltd’ may be referred to by an abbreviation (IBM), a nickname (‘Big Blue’) or a contextual anaphoric expression such as ‘it’ or ‘the company’. Challenges such as this are the main reasons it has proved difficult to create reliable information extraction systems. Scores for the template filling task in the MUC evaluations were generally in the 50–60% F-measure range, which at the time was insufficient for most commercial applications.

An intermediate process which is somewhere between named entity extraction and full template filling is relation extraction. This task is simpler than full template filling since: (a) it aims to identify only relations between named entities (which are considered to be binary); and (b) only relations which occur within a single sentence are considered. These simplifications have meant that the task is more achievable than full template filling.

2.4 Word Sense Disambiguation

In this section, we review some of the challenges presented by polysemy: the phenomenon whereby a given term may have different meanings or senses. For example, the word ‘bat’ can mean ‘sports equipment’ (as in ‘cricket bat’) or ‘nocturnal mammal’ (as in ‘fruit bat’), etc. Consequently, if we are provided with no additional information, it is impossible to tell whether a query which contains the word ‘bat’ refers to the domain of sports or animals. This phenomenon is common to many natural languages and represents a significant challenge to retrieval precision. The process by which the intended meaning of a given word is identified is known as word sense disambiguation (WSD).

One of the main determinants of word sense is syntactic role, i.e. words can have different meanings according to their use as a noun, verb, adjective etc. For example, two of the possible meanings of ‘light’ are ‘not heavy’ (as in ‘Magnesium is a light metal’) and ‘illumination’ (as in ‘The light in the kitchen is quite dim’). The first sense only applies when ‘light’ is used as an adjective, and the second only when it is used as a noun. Consequently, accurate identification of a word’s part of speech can be a valuable determinant of its meaning.

Wilks and Stevenson (1998) showed that word sense disambiguation could be carried out effectively using a part of speech tagger when only broad differences between word meanings were being considered. For example, this technique could be used to differentiate between the verb sense of ‘bank’ (as in ‘the plane banked sharply’) and the noun sense (as in ‘financial institution’), but couldn’t distinguish between more closely related meanings such as the various noun senses used in ‘a car crashed into the bank’ or ‘the bank gave me a good rate on my mortgage’.
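The idea of using part of speech as a coarse sense filter can be sketched as a lookup keyed on the word and its tag. The sense inventory below is a hand-made assumption for illustration, not the resource used by Wilks and Stevenson, and the function names are hypothetical.

```python
# Hypothetical mini sense inventory keyed by (word, coarse part of speech).
SENSES = {
    ("light", "ADJ"): ["not heavy"],
    ("light", "NOUN"): ["illumination"],
    ("bank", "VERB"): ["tilt an aircraft"],
    ("bank", "NOUN"): ["bank building", "financial institution"],
}

def senses_for(word, pos):
    """Return the senses still possible once the part of speech is known."""
    return SENSES.get((word.lower(), pos), [])

def is_resolved(word, pos):
    """True when the part of speech alone leaves exactly one sense."""
    return len(senses_for(word, pos)) == 1

for word, pos in [("light", "ADJ"), ("bank", "VERB"), ("bank", "NOUN")]:
    print(word, pos, senses_for(word, pos), is_resolved(word, pos))
```

The noun entry for 'bank' still carries two senses after filtering, which mirrors the limitation described above: the tag separates broad senses but not closely related ones.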

There has been a long history of research into WSD in NLP, during which a wide variety of techniques have been investigated. Early approaches such as Wilks (1975) and Hirst (1987) applied techniques from artificial intelligence that relied on hand-crafted knowledge sources containing detailed semantic information. However, these approaches were limited by the fact that the information they relied on had to be manually created. Large-scale machine-readable lexical resources, such as WordNet, started to become available in the 1980s and were quickly used for WSD. They have the advantage of providing both a list of possible senses for each word and information which can be used to identify the correct sense, such as the hypernym hierarchy in WordNet.

There has been some disagreement about the usefulness of WSD for IR. Some have argued that the benefits which might be gained from disambiguation are limited. Krovetz and Croft (1992; see also Krovetz 1997) manually disambiguated a standard test corpus and found that a perfect WSD engine would improve retrieval performance by only 2%. Sanderson (1994) performed similar experiments in which ambiguity was artificially introduced to a test collection by automatically creating ‘pseudowords’. He found that a 20–30% error rate in disambiguation led to text retrieval which was no better, and possibly worse, than if the ambiguity had been left unresolved. Sanderson also found that queries containing fewer than five terms were less sensitive to mistakes in the disambiguation. He concluded that WSD was only useful if it was very accurate or the queries were short.
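The pseudoword technique can be sketched as follows: two unrelated words are merged into one artificial ambiguous token, so the original word serves as a known correct 'sense'. This is a simplified illustration rather than Sanderson's actual procedure; the word pair, the function names and the noisy disambiguator are assumptions.

```python
import random

def make_pseudoword(tokens, word_a, word_b):
    """Replace both words with one ambiguous token; keep the truth as gold."""
    pseudo = f"{word_a}/{word_b}"
    ambiguous, gold = [], []
    for t in tokens:
        if t in (word_a, word_b):
            ambiguous.append(pseudo)
            gold.append(t)
        else:
            ambiguous.append(t)
            gold.append(None)
    return ambiguous, gold

def noisy_disambiguate(ambiguous, gold, word_a, word_b, error_rate, rng):
    """Simulate a disambiguator that picks the wrong word at a given rate."""
    resolved = []
    for token, truth in zip(ambiguous, gold):
        if truth is None:
            resolved.append(token)
        elif rng.random() < error_rate:
            resolved.append(word_b if truth == word_a else word_a)
        else:
            resolved.append(truth)
    return resolved

tokens = "the banana was next to the kalashnikov".split()
ambiguous, gold = make_pseudoword(tokens, "banana", "kalashnikov")
rng = random.Random(0)
resolved = noisy_disambiguate(ambiguous, gold, "banana", "kalashnikov", 0.25, rng)
print(ambiguous)
print(resolved)
```

Because the error rate is a parameter, the same collection can be indexed at different levels of simulated disambiguation accuracy and the effect on retrieval compared.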

Others have demonstrated that WSD can be used to improve IR performance. Schutze and Pedersen (1995) showed that disambiguation can substantially improve text retrieval performance, demonstrating an improvement of between 7 and 14% on average. Jing and Tzoukermann (1999) have also reported improvements of 8.6% in retrieval performance. Their disambiguation algorithm computes the word sense similarity in the local context of the query, the similarity of the lexical-occurrence information in the corpus, and the morphological relations between words.

2.5 Question Answering

In the standard information retrieval paradigm, the user is provided with a ranked list of documents through which they must search in order to locate the information they are seeking. An alternative approach to meeting the user’s information need in a more focused way is to provide specific answers to specific questions.

Question answering research has a long history (Green et al. 1961; Woods 1973) and was introduced as a task in the Text Retrieval Conference (TREC) IR evaluations in 1999 (Voorhees 1999). Within the TREC framework question answering can be thought of as a specialisation of the standard document retrieval problem in which the queries are expressed as natural language questions and the system is expected to identify the portion of the document in which the answer can be found. Like document retrieval, question answering is often carried out against a well-defined collection of documents which may, or may not, meet the user’s information need.

Questions can be posed in a number of different ways which automatic question answering systems should aim to process. Examples of possible question formats include:

Yes/no questions: ‘Is George W. Bush the current president of the USA?’ ‘Is the Sea of Tranquillity deep?’
‘Wh’ questions: ‘Who was the British Prime Minister before Margaret Thatcher?’ ‘When was the Battle of Hastings?’
List questions: ‘Which football teams have won the Champions League this decade?’ ‘Which roads lead to Rome?’
Instruction-based questions: ‘How do I cook lasagne?’ ‘What is the best way to build a bridge?’
Explanation questions: ‘Why did World War I start?’ ‘How does a computer process floating point numbers?’
Commands: ‘Tell me the height of the Eiffel Tower.’ ‘Name all the Kings of England.’

The standard method for tackling question answering is to approach the problem in three separate stages:

question analysis
document retrieval
answer extraction

The aim of the first stage is to predict the type of answer expected (for example, the expected answer for ‘When was Mozart born?’ is a date) and create a query which can then be passed to an IR system. Since the questions are expressed in natural language it is not surprising that text analysis techniques have been found helpful for this stage. For example, Greenwood et al. (2002) parse the question to produce a logical representation from which the expected answer type can be extracted. The problem of generating a query from the question has not exploited TA techniques to the same extent although some approaches, for example synonym expansion (Harabagiu et al. 2000; Hovy et al. 2000), have been explored.

The second stage of processing passes the query to an IR engine with the aim of retrieving documents which contain the answer to the question. The final stage of processing, answer extraction, aims to extract the answer(s) from these documents, using the expected answer type information determined in the first stage. Various approaches have been applied to this problem and make use of a variety of TA techniques. Straightforward methods include simple regular expressions which match the text (Ravichandran and Hovy 2002; Soubbotin and Soubbotin 2001). More complex methods make use of deep linguistic processing. For example, Harabagiu et al. (2001) and Scott and Gaizauskas (2000) carry out detailed syntactic and semantic analysis of the retrieved documents which is then queried to identify specific answers to the question. Named entity recognition is often used to identify the items in the text which are of the same type as the expected answer and are therefore possible answers to the question.
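The three stages described above can be sketched end to end in a few lines of Python. This is a toy illustration under strong assumptions: the two-document collection, the stopword list, the term-overlap ranking and the year pattern are all invented for the example and stand in for a real IR engine and answer extractor.

```python
import re

# Toy document collection (an assumption for illustration).
DOCS = [
    "Wolfgang Amadeus Mozart was born in Salzburg in 1756.",
    "The Eiffel Tower was completed in 1889 in Paris.",
]
STOPWORDS = {"when", "was", "who", "the", "is", "in", "of", "a"}

def analyse_question(question):
    """Stage 1: predict the expected answer type and build a keyword query."""
    words = re.findall(r"[a-z]+", question.lower())
    answer_type = "DATE" if words and words[0] == "when" else "OTHER"
    query = [w for w in words if w not in STOPWORDS]
    return answer_type, query

def retrieve(query, docs):
    """Stage 2: return the document sharing the most terms with the query."""
    def overlap(doc):
        doc_words = set(re.findall(r"[a-z]+", doc.lower()))
        return sum(1 for w in query if w in doc_words)
    return max(docs, key=overlap)

def extract_answer(answer_type, doc):
    """Stage 3: pull out a string matching the expected answer type."""
    if answer_type == "DATE":
        m = re.search(r"\b(1\d{3}|20\d{2})\b", doc)
        return m.group(1) if m else None
    return None

answer_type, query = analyse_question("When was Mozart born?")
best = retrieve(query, DOCS)
answer = extract_answer(answer_type, best)
print(answer_type, query, answer)
```

Here a regular expression plays the role that named entity recognition plays in fuller systems: it finds candidate strings of the same type as the expected answer.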

The nature of the question answering task, which includes the interpretation of natural language questions and identification of specific concepts and structures within documents, makes the standard document retrieval approach inadequate: TA techniques are required to provide the extra linguistic analysis.

3. Evaluation

The most common approach to evaluation within text analytics is to compare machine output with human judgement. For example, part of speech taggers are normally evaluated by manually annotating a test corpus and then applying the tagger to the same data. Performance is then calculated by comparing the two sets of annotations. However, this process is often complicated by two factors: first, difficulties in obtaining consensus on the ‘correct’ manual annotation, and second, issues in determining exactly how annotations should be compared and scored.

The first challenge is caused by the fact that people often disagree on the correct annotation for a particular fragment of text – for example, they may interpret the same sentence differently. Various steps are taken to maximise consensus between the annotators, but agreement is rarely unanimous and as a result the maximal level of human agreement represents something of a ceiling on the performance of any automated process measured using this technique.

The second challenge is due to the complexity of the analysis that is generated, since it can be difficult to determine how partial matches should be scored. For example, consider the sentence ‘Bill Gates is CEO of Microsoft’. If a NER system suggests ‘Gates’ is the name of a person (rather than ‘Bill Gates’), what credit, if any, should this partially correct answer be assigned? Proposing ‘Gates’ as the name of a person, while not ideal, is certainly better than suggesting ‘Microsoft’ as a person.

Once the systems have been scored, the performance is typically measured in terms of precision and recall, where precision is defined as the number of entities correctly tagged divided by the total number of entities tagged, and recall is defined as the number of entities correctly tagged divided by the total number of entities which should have been tagged.

Attempts to maximize one measure will typically compromise the other (all other factors being equal), so in practice a suitable trade-off must be found. For this reason, these measures are rarely considered in isolation, and are instead usually combined into a single value, e.g. the F-measure (which is defined as the weighted harmonic mean of precision and recall).
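These definitions translate directly into code. The sketch below assumes exact-match scoring over sets of (text, type) pairs, which is only one of several possible scoring schemes; the function name and the example annotations are hypothetical.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Exact-match precision, recall and F-measure over two sets of entities.

    beta weights recall relative to precision; beta=1 gives the balanced F1.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

gold = {("Bill Gates", "PERSON"), ("Microsoft", "ORG")}
predicted = {("Gates", "PERSON"), ("Microsoft", "ORG")}
p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)
```

Under exact matching the partial answer 'Gates' earns no credit at all, so it lowers both precision and recall; a scheme that rewards partial matches would need a different notion of 'correct'.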

There are significant differences between the typical performances of TA techniques. Part of speech tagging and named entity recognition can typically be performed with >90% accuracy on business news, while accuracy for information extraction from biomedical documents is typically below 70%. TA applications often combine several component technologies, for example part of speech tagging and named entity recognition. There are often dependencies between these components; for example, the named entity recogniser may rely on output from the part of speech tagger. Inaccurate output from one component can propagate through to subsequent processes that rely on it.

However, it should be noted that measures such as precision and recall are fairly blunt instruments, and take no account of the user experience as a determinant of performance or success. For example, Elworthy et al. (2001) demonstrate the application of TA technology to the retrieval of captioned images, and in so doing also provide a novel method of results presentation and unique user experience that could not have been achieved without the use of text analytics. The value of such a capability cannot be expressed by measuring retrieval accuracy alone.

4. Conclusions

In this post we have outlined some of the fundamental techniques in text analytics and explored the ways in which they can be applied to search and information retrieval. In an upcoming post, we’ll complement this with a look at text analytics applied to social media, in particular sentiment analysis and opinion mining.

5. References

Brill, E. (1993) Transformation based error driven parsing. Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.
Church, K. (1998) A Stochastic Parts Program and Noun Phrase Tagger for Unrestricted Text. Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin TX, pp 136–143.
Cutting, D., Kupiec, J., Pedersen, J. and Sibum, P. (1992) A Practical Part-of-Speech Tagger. Proceedings of the 3rd Conference on Applied Natural Language Processing.
Elworthy, D., Rose, T. G., Clare, A. and Kotcheff, A. (2001) A Natural Language System for the Retrieval of Captioned Images. Journal of Natural Language Engineering, Cambridge University Press, 2001.
Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961) BASEBALL: An Automatic Question Answerer. Proceedings of the Western Joint Computer Conference, 19 219–224.
Greenwood, M., Roberts, I. and Gaizauskas, R. (2002) The University of Sheffield TREC 2002 Q&A System Proceedings of the 11th Text Retrieval Conference.
Grishman, R. and Sundheim, B. (1996) Message Understanding Conference – 6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING-96) Copenhagen, Denmark, pp 466–470.
Harabagiu, S., Moldovan, D., Pasca, M. Surdeanu, M., Bunescu, R., Girju, R., Rus, V. and Morarescu, P. (2000) FALCON: Boosting Knowledge for Answer Engines. Proceedings of the 9th Text Retrieval Conference.
Hirst, G. (1987) Semantic Interpretation and the Resolution of Ambiguity Cambridge University Press, Cambridge, England.
Hovy, E. Gerber, L., Hermjakob, U., Junk, M. and Lin, C. (2000) Question Answering in Webclopedia. Proceedings of the 9th Text Retrieval Conference.
Ingersoll, G.S. and Morton, T.S. (2009) Taming Text. Manning Publications
Jing, H. and Tzoukermann, E. (1999) Information retrieval based on context distance and morphology. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99).
Karlsson, F. (1990) Constraint Grammar as a Framework for Parsing English Running Text. Proceedings of the 13th International Conference on Computational Linguistics.
Krovetz, R. (1997) Homonymy and Polysemy in Information Retrieval. Proceedings of the 35th Meeting of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the Association for Computational Linguistics (ACL/EACL-97).
Krovetz, R. and Croft, B. (1992) Lexical ambiguity and information retrieval. ACM Transactions on Information Systems 10(2) 115–141.
Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993) Building a Large Annotated Corpus of English: The Penn Tree Bank. Computational Linguistics 19(2) 313–330.
Ravichandran, D. and Hovy, E. (2002) Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 41–47.
Sanderson, M. (1994) Word sense disambiguation and information retrieval. Proceedings of the 17th ACM SIGIR Conference.
Schutze, H. and Pedersen, J. (1995) Information Retrieval Based on Word Senses. In Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, pp 161–175.
Scott, S. and Gaizauskas, R. (2000) University of Sheffield TREC-9 Q & A System. Proceedings of the 9th Text Retrieval Conference.
Soubbotin, M. and Soubbotin, S. (2001) Patterns of Potential Answer Expressions as Clues to the Right Answers, Proceedings of the 10th Text Retrieval Conference.
Voutilainen, A. (2003) Part of speech Tagging. In The Oxford Handbook of Computational Linguistics Mitkov, R. (ed.) pp 219–232 Oxford University Press.
Wilks, Y. (1975) A Preferential, Pattern Seeking, Semantics for Natural Language Inference. Artificial Intelligence 6.
Wilks, Y. and Stevenson, M. (1998) The Grammar of Sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering 4(3).
Woods, W. (1973) Progress in Natural Language Understanding – An Application to Lunar Geology. AFIPS Conference Proceedings, vol. 42, pp 441–450.

6. Glossary

Anaphora resolution: the process by which references between separate expressions are identified and resolved.

Artificial intelligence: the intelligence of machines and the branch of computer science which aims to create it.

Categorization: the process by which ideas and objects are recognized, differentiated and understood.

Classification: the process by which an electronic document is assigned to one or more predetermined categories, based on its contents. This contrasts with document categorization, in which the categories must typically be discovered as part of the process.

Clustering: the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

Derivational morphology: the process by which word meaning is changed through the application of derivations, e.g. the suffix “ee” changes the verb employ into the noun employee.

F-measure: a measure of a test’s accuracy. It considers both precision and recall.

Homonym: one of a group of words that share the same spelling and the same pronunciation but have different meanings.

Hypernym: a word or phrase whose semantic range subsumes that of a set of other words.

Inflection: the process by which language expresses grammatical relations and relational categories such as tense, mood, voice, aspect, person, number, gender and case.

Information Extraction: the process by which structured information is extracted from unstructured textual data.

Lexical analysis: the process of converting a sequence of characters into a sequence of tokens.

Morphology: the study of the structure and content of word forms.

Named entity recognition: a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc.

Natural language processing: a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.

Parsing: see syntactic analysis.

Part of speech: a linguistic category of words which is generally defined by the syntactic or morphological behaviour of the word in question.

Part of speech tagging: the process of marking up the words in a text as corresponding to a particular part of speech.

Phrase detection: the recognition of groups of words functioning as a single unit in the syntax of a sentence.

Polysemy: the capacity for a word or phrase to have multiple meanings.

Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

Question answering: the task of automatically answering a question posed in natural language.

Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents.

Relation extraction: the process by which relations between named entities within a single sentence are identified.

Relevance: the extent to which a topic of a search result matches the topic of the query or information need.

Sentence boundary detection: the problem in natural language processing of determining where sentences begin and end.

Sentiment analysis: the process by which the attitude of a speaker or a writer with respect to a given topic is identified. It is usually classified as positive, negative or neutral.

Stemming: the process by which inflected (or derived) words are reduced to their root form.

Summarisation: the ability to create an abstract or topical summary from a free text document.

Syntactic analysis: the process of analyzing a text to determine its structure with respect to a grammar.

Text analytics: a set of linguistic, analytical and predictive techniques by which structure and meaning can be extracted from textual data. Similar to text mining, but with a broader focus on interactive applications and a greater emphasis on user experience.

Text mining: the process by which meaning, structure and patterns can be derived from textual data.

Tokenisation: see lexical analysis.

WordNet: a lexical database for the English language. It groups English words into sets of synonyms, provides short, general definitions, and records the various semantic relations between these synonym sets.

Word sense disambiguation: the process of identifying which sense of a word is used in any given sentence, when the word has a number of distinct senses.
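
The definitions of precision, recall and F-measure above can be made concrete with a short sketch. This is an illustration rather than part of the original notes: the function name, the document ids and the beta parameter are all assumptions, and the F-measure is computed as the usual weighted harmonic mean of precision and recall (beta = 1 gives the familiar F1 score).

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for a single query.

    retrieved: document ids returned by the search
    relevant:  document ids judged relevant in the whole collection
    beta:      weight of recall relative to precision (1.0 gives F1)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved

    # Guard against empty sets so we never divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
p, r, f1 = precision_recall_f(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d5", "d6", "d7"],
)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this made-up example the search is fairly precise (three of the four results are relevant) but misses half of the relevant documents, and the F-measure falls between the two values, closer to the lower one.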

Posted in Search, Text analytics | Tagged information extraction, natural language processing, NLP, Text analytics, text mining | 3 Comments

3 Responses

on July 28, 2011 at 6:59 pm | Jaimie Sirovich

Nice summary over here. I just learned a few things 🙂

In terms of faceted navigation, have you ever seen a web site do ‘facet extraction’ and reliably, automatically map 2-4 word queries to the object filtered by the adjectives? For example:

stainless steel nose ring ==> Category:Nose Rings; Metal:Stainless Steel.

The advantages of doing so are substantial in that it allows for more exploration via expanding to sibling values of Metal. This is simple NLP, and I have a few algorithms sketched out, but I see very few web sites doing it.

If they do, it’s only for brand. I’ve seen firms like EasyAsk claim they can do this, but not only do they not take this mapping-to-facets approach—I don’t really see it ever really work to improve relevance, either.

Frequently BM25 + some sort of proximity ranker works pretty well. Unfortunately, most rankers don’t even do that correctly, and there is an abundance of research, some of it quite recent, that seeks the best approach to blending statistical and proximity ranking in a way that works and is efficiently computable. I’m still amazed at how well BM25 works in the absence of proximity regardless.

on August 2, 2011 at 11:33 am | Tony Russell-Rose

Hi Jaimie

Yep, the use case you cite is indeed very real – as I recall from my time at Endeca there were a number of niche vendors who specialised in exactly this transformation. I say ‘niche’ as they weren’t what I would recognise as mainstream players in the text analytics/NLP business; instead they rather seemed to focus on data cleansing as their main business with some elementary (but highly optimised) NLP as part of their offering.

BTW my Endeca colleague Vladimir Zelevinsky wrote a neat paper on how to dynamically generate facets using NLP resources such as WordNet, which was published at HCIR 2010:

http://research.microsoft.com/en-us/um/people/ryenw/hcir2010/presentations.html

on January 9, 2013 at 10:40 am | Walid Saba (@sabawalid)

If you are interested in this article take a look at the 2-page article “Henry Kissinger vs. Sentiment Analysis” (here http://www.scribd.com/doc/103683453/Henry-Kissinger-vs-Sentiment-Analysis) and this 2-page follow-up article “Henry Kissinger’s sentiments are not an exception!” (here http://www.scribd.com/doc/105789929/Henry-Kissinger%E2%80%99s-sentiments-are-not-an-exception)
