Searching for Meaning in Text
Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name, which I co-authored with Mark Stevenson of Sheffield University and which appears in the book “Information Retrieval: Searching in the 21st Century”. The session was scheduled for 3 hours and, to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of PowerPoint: the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the SlideShare version, but I am happy to make them available to folks who want to enjoy the full experience.
Tony, some great stuff here. Coincidentally, I’ve just started mucking about with the NLTK in Python (trying to learn both this and Python at the same time, so slow going!)
Maybe you’d like to come up to Bristol and do a talk for my MSc class next year!
Really enjoyed reading the slides, they provide a great summary that surveys the whole NLP arena. I learnt a lot, cheers for this.
Thanks Guys!
Paul: more than happy to come & give a talk, & maybe explore opportunities for collaborative research. We can compare notes on NLTK as well 🙂
Tony, I’d really like to see your exercises and demos, I’m always looking for more interactive approaches.
Avi
Your definitions of stemming and lemmatization are rather different from those used by the rest of the world. What you call stemming is really lemmatization; stemming is just a crude approximation to that. What you call lemmatization is usually called morphological parsing.
And about POS tagging: I’ve found that it can make a difference for IR in terms of traditional evaluation metrics, if used not to index the POS, but to filter based on POS (i.e. instead of stopword filtering). A Lucene extension which does exactly that as well as lemmatization is available at https://github.com/larsmans/lucene-stanford-lemmatizer.
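As a rough illustration of the POS-filtering idea described above, here is a minimal sketch in Python. It is not the code from the Lucene extension linked in the comment. The tagged sentence is hand-written, the tags follow the Penn Treebank convention, and the choice of which tag prefixes count as "content" is an assumption made for the example.

```python
# Hypothetical, hand-tagged input: (token, Penn Treebank tag) pairs.
# In practice these would come from a POS tagger.
TAGGED = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]

# Assumed set of "content" tag prefixes: nouns, verbs, adjectives, adverbs.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")


def filter_by_pos(tagged_tokens, keep_prefixes=CONTENT_TAG_PREFIXES):
    """Keep only tokens whose tag starts with one of the given prefixes.

    This plays the role a stopword list usually plays: determiners,
    prepositions and the like are dropped because of their tag, not
    because they appear on a fixed word list.
    """
    return [
        token.lower()
        for token, tag in tagged_tokens
        if tag.startswith(keep_prefixes)
    ]


index_terms = filter_by_pos(TAGGED)
print(index_terms)  # ['cat', 'sat', 'mat']
```

Note that the POS tags are used only to decide what gets indexed; they are not stored in the index themselves.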
Lars: Actually those particular definitions were taken from an NLP course currently running in the UK, so I’m not sure your assertion about “the rest of the world” is entirely accurate. That said, I agree that the definition of lemmatisation probably implies more than it should (and the example reinforces this). Have you seen the Wikipedia definition of stemming? I think this distinction sums it up most concisely: “Lemmatization produces grammatically correct word forms, but stemming just removes a series of suffixes/prefixes and/or endings.”
The Wikipedia definition of stemming is very outdated. Since at least 1980, when the Porter stemmer was published, stemmers have done more than just suffix stripping. Type the word “liberty” into Martin Porter’s own demo at http://snowball.tartarus.org/demo.php and observe that it produces “liberti”: it doesn’t just strip the suffix, it replaces it.
This example illustrates the crucial difference between stemming and lemmatizing: a lemmatizer produces actual words (lemmas), while a stemmer maps words to the same form, which may or may not be a word. In addition, lemmatizers are often much smarter; I’m using a machine-learned one at the moment.
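The distinction can be sketched with a toy example in Python. This is deliberately not the Porter algorithm: `toy_stem` applies only two Porter-style rewrite rules (a simplification chosen for illustration), and `toy_lemmatize` is a hypothetical dictionary lookup standing in for a real lemmatizer.

```python
VOWELS = set("aeiou")


def toy_stem(word):
    """Apply two Porter-style rules only (NOT the full algorithm).

    1. A final "ies" is rewritten to "i".
    2. A final "y" is rewritten to "i" if the rest of the word
       contains a vowel.
    """
    if word.endswith("ies"):
        return word[:-3] + "i"
    stem = word[:-1]
    if word.endswith("y") and any(ch in VOWELS for ch in stem):
        return stem + "i"
    return word


# Hypothetical lookup table standing in for a real lemmatizer's lexicon.
LEMMAS = {"liberties": "liberty", "liberty": "liberty"}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for w in ("liberty", "liberties"):
    print(w, toy_stem(w), toy_lemmatize(w))
# liberty liberti liberty
# liberties liberti liberty
```

Both inflections collapse to the same stem, "liberti", which is not a word, while the lemmatizer maps both to the actual word "liberty".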
I’d like to know who’s teaching that NLP course. Maybe they published something on this topic that might explain their use of terminology.
Avi: As I recall I documented the exercises in the notes field for each slide, so I believe these should show up if you download the ppt direct from slideshare. Let me know if not or if anything is unclear – happy to clarify.
Hi Tony! I am a student of NLP. I found your presentation really interesting and informative. It would help me more if I could have the details and definitions used in the slides. Would you mind sending them to me?
yokivo kade: Apologies for the delay in replying (away on vacation) – would you like to drop me an email? I can then send you the full version.
Hi, Tony. I am a fresher in the field of Natural Language Processing. Could you suggest some papers to read on “NLP techniques used in web search”? Also, could I get this presentation to discuss with my mentor regarding NLP? Thanks, really helpful! My email is ab.anuj04@gmail.com.
Anuj: sure – did you check out the citations at the end of the full paper? They’re a couple of years old now, but they’re still very relevant:
I did check them, but I want to find out how one could use NLP techniques to rank pages. Are there any specific techniques used on web documents and queries to rank the documents or pages? I would be interested in papers pertaining to this.
Anuj: if I understand you correctly there’s actually a huge literature on the subject: at ECIR and SIGIR (and related confs) each year you’ll find whole sessions related to NLP in IR, and many of these will deal with things like ranking. Not sure if the papers are downloadable directly from the conference websites, but I’m sure you could locate a fair number of them via other sources.