Posts Tagged ‘opinion mining’

After a brief hiatus, I’m pleased to say that we will shortly be relaunching the London Text Analytics meetup. As many of you know, in the recent past we have organised some relatively large and ambitious events at a variety of locations. But we have struggled to find a regular venue, and as a result have had difficulty in maintaining a scheduled programme of events.

What we really need is a venue we can use on a more regular schedule, ideally on an ex-gratia basis. It doesn’t have to be huge – in fact, a programme of smaller (but more frequent) meetups is in many ways preferable to a handful of big gatherings.





I received a pleasant surprise in the post today: my personal copy of Text Mining and Visualization: Case Studies Using Open-Source Tools, edited by Markus Hofmann and Andrew Chisholm. Now I don’t normally blog about books, since as editor of Informer there was a time when I would be sent all manner of titles for inspection and review. But I’ll make an exception here. This is partly because Chapter 7 is my own contribution (on mining search logs), as discussed in my earlier blog posts. It is complemented by 11 other chapters, covering a variety of topics organised into four sections:



Diana Maynard entertains the masses


Last week I had the privilege of organising the 13th meeting of the London Text Analytics group, which featured two excellent speakers: Despo Georgiou of Atos SE and Diana Maynard of Sheffield University. Despo’s talk described her internship at UXLabs, where she compared a number of tools for analysing free-text survey responses (namely TheySay, Semantria, Google Prediction API and Weka). Diana’s talk focused on sentiment analysis applied to social media, and entertained the 70-strong audience with all manner of insights drawn from her experience of having worked on the topic for longer than just about anyone I know. Well done to both speakers!



Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.


OK, so after a period in stealth mode we’re now finally launching the London Text Analytics group. Well, I say “launching”, but it’s really just half a dozen of us meeting at Fizzback’s offices for a talk and maybe a beer afterwards. It’s been a while in the making (mainly due to summer vacations & so on) but I’m pleased to say that my BCS colleague Udo Kruschwitz from Essex University has agreed to give a presentation on “Making Natural Language Processing Work in Practical Applications”. And as I said, for this first meeting I’m not expecting huge numbers. But you never know – I can think of several text analytics organisations based in South East England with whom we share a common interest, such as:

…and many more organisations who are based overseas but have offices in the UK. And then of course there are the startups in the field of brand reputation management / customer experience, who make use of lightweight NLP techniques to measure online sentiment. So in principle that’s quite a large population we could be addressing.

We’d be delighted to hear from folks working for any of the above organisations. If you have a passion for text analytics & NLP, there’s a lot we can learn from each other. Further details at: http://www.meetup.com/textanalytics/.


Last week I attended the First GATE Summer School (“FIG’09”), at Sheffield University, from July 27 – 31. For those unfamiliar with GATE, it is generally considered to be the premier open source platform for text analytics, and has been described by some as the “VW Beetle of NLP” (i.e. “it will overtake everyone – eventually”). If you subscribe to this view, then you’d probably also conclude that those left behind include OpenNLP, NLTK, UIMA and various other text analytics / NLP toolkits. All worthy initiatives, but with ~35,000 downloads per year and thousands of users, GATE enjoys a degree of success and popularity that few of them can match.

The GATE Summer School also comes at an interesting time for the Text Analytics community more generally. When the Reuters NLP Group (which I led) closed in 2003, there was very little activity happening outside of academia. Half a decade later, it’s a different picture – people can and do make a reasonable living as text analytics consultants, with the rise of blogs, user-generated content and social media creating ever more opportunities for interesting text analytics applications and services. And, at the risk of tempting fate, there are growing signs of liquidity in the jobs market. It is still, however, a long way from being a mainstream profession, with recognised qualifications, roles, career progression, accreditation schemes, and so on (although for developments on the latter, read on).

The view from the back


Anyway, back to GATE. When I reflect on initiatives to create generic platforms or infrastructure within a particular community, it brings to mind the old adage that “in a gold rush the best way to get rich is to make shovels”. And GATE is a pretty useful shovel – the fact that the summer school filled a whole week (and could have gone on longer) is a reflection of the breadth and flexibility of the platform. In fact, the Summer School did have a second week, but this was aimed specifically at GATE developers and committers rather than end users. And in GATE’s case, much of the richness is shared by the community as a whole, thanks to its commitment to open source philosophy and practices.

A further strength of GATE is that it is built around an extensible, component-based model, consisting of linguistic resources (the documents or corpora to be processed) and processing resources (the tools and algorithms that do the processing). The latter are typically aggregated into applications, i.e. pipelines that represent a sequence of operations over a corpus. All of these concepts can be managed through a graphical development environment, and bespoke applications can be built directly around the Java API. GATE also integrates with various 3rd party tools, such as LingPipe, OpenNLP, OpenCalais, etc.
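This component-based model is easy to picture in code. Here is a toy sketch in plain Java – deliberately written without any GATE dependency, so all the names below are illustrative rather than GATE’s actual API: a document plays the role of a linguistic resource, each pipeline stage is a processing resource, and an application is simply an ordered pipeline of stages.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of GATE's component model (no GATE dependency;
// names here are illustrative, not GATE's real classes).
public class PipelineSketch {

    // A "processing resource": transforms an annotated document in place.
    interface ProcessingResource {
        void execute(Document doc);
    }

    // A "linguistic resource": some text plus the annotations added so far.
    static class Document {
        final String text;
        final List<String> annotations = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    // An "application": an ordered pipeline of processing resources.
    static class Pipeline implements ProcessingResource {
        private final List<ProcessingResource> stages = new ArrayList<>();
        Pipeline add(ProcessingResource pr) { stages.add(pr); return this; }
        public void execute(Document doc) {
            for (ProcessingResource pr : stages) pr.execute(doc);
        }
    }

    public static void main(String[] args) {
        Pipeline app = new Pipeline()
            .add(d -> { // crude whitespace tokeniser
                for (String t : d.text.split("\\s+")) d.annotations.add("Token:" + t);
            })
            .add(d -> { // crude "sentence splitter": count full stops
                long n = d.text.chars().filter(c -> c == '.').count();
                d.annotations.add("Sentences:" + n);
            });

        Document doc = new Document("GATE is a platform. JAPE matches patterns.");
        app.execute(doc);
        System.out.println(doc.annotations);
    }
}
```

In GATE itself, of course, the equivalent roles are played by its own document, corpus and pipeline controller classes, managed either through the GUI or the Java API.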

Networking coffee break


The week itself consisted of a series of NLP topics and applications, divided into 1-2 hour blocks, most with an element of theory (delivered via PowerPoint) and an associated practical session (hands-on at the computer). Overall, I have to say that pretty much every session was of interest and value, which is so rarely the case with a week-long course. There were, however, significant opportunities to improve the delivery (more on that below). And GATE itself would undoubtedly benefit from greater attention to the user experience – as far as I am aware, no-one has yet “owned” this aspect of its development – and it shows. In fact, there are a number of specific UX issues that deserve a broader analysis, so watch this space for a separate post on the subject.

Over the course of the week, I’d say the highlights were:

  1. Understanding JAPE: This is the built-in language for linguistic pattern matching, allowing a user to create regular expressions to identify structures in text (which can range from simple patterns such as postcodes to much more complex phenomena such as stock price movements). However, the support that GATE provides for writing JAPE seems relatively primitive. In fact, it seems that many users write JAPE code in an external text editor, then import it via the GATE IDE. This seems bizarre, and probably goes some way to explaining why it is apparently very easy to write inefficient JAPE (which for a production environment must be a major limitation). Surely we can do better than this, given that it is likely to underpin pretty much any knowledge-based text analytics application. At the very least, someone should provide a TextPad syntax definition template. Better still, a PhD project to investigate and build some decent JAPE support tools that promote learning, re-use, efficiency, etc.
  2. Machine Learning: this is the counterpart to the JAPE approach, allowing a developer to induce linguistic patterns directly from training data. GATE provides extensive support for building machine learning NLP applications, but the usual caveats apply, i.e. you need large quantities of gold-standard, annotated training data. Consequently, support for manual annotation is a major issue, and a significant part of the week was devoted to the associated workflow and quality control tools, such as GATE Teamware.
  3. Opinion mining: As a UX practitioner I am interested in the subjective user experience, expressed through both explicit behaviour (i.e. task completion etc.) and implicit attitudes and preferences. In particular, I am intrigued by the possibilities in triangulating between the qualitative, behavioural, micro-level feedback obtained from typical usability studies and the quantitative, attitudinal, macro-level data acquired through opinion mining, to create a ‘360 degree view’ of the user experience. This is an area I’d love to explore more, and will be one of my first areas of experimentation with GATE.
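To make item 1 above a little more concrete, here is a toy JAPE rule – purely illustrative rather than production-quality, and assuming the standard Token annotations (with their string and orth features) produced by the ANNIE tokeniser:

```
Phase: Titles
Input: Token
Options: control = appelt

// Match a title followed by a capitalised surname, e.g. "Mr Smith".
Rule: TitledPerson
(
  {Token.string == "Mr"}
  {Token.orth == upperInitial}
):person
-->
:person.Person = {rule = "TitledPerson"}
```

The left-hand side (before the `-->`) is the pattern to match; the right-hand side creates a Person annotation over the matched span. A grammar like this is loaded as a processing resource and run in a pipeline after the tokeniser – which is precisely why better authoring support would pay such dividends.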

In summary, I’d say the first GATE summer school was a well-organised event with an excellent programme.  But there were some significant areas where delivery could be improved:

  • If you’re running a course with ~55 attendees, you need to make sure everyone at the back can see and hear. Use two projectors, and have presenters wear lapel microphones.
  • Go at the pace of the slowest machine – which will invariably be those of the attendees, not the presenters.
  • Tell the attendees clearly what slide deck you are using before starting your talk, and give them a chance to open it in advance.
  • Work out the choreography of when each lecture finishes and when the associated practical begins, and make sure everyone knows the difference. Tell people how long they have for each exercise, and give them clear time checks when they are about to time out.
  • Implement a “quick reset” function for GATE that clears down the old exercise in readiness for the new (or at the very least, tell the class how to do this in the first practical).
  • Standardise the format and approach of the practicals. Greater consistency will allow participants to adapt better to the pedagogical approach and focus more on the material and less on the presentation.
  • Make the practicals goal-oriented. Give people a clear problem statement at the outset, so that low-level tasks can be understood within the context of a wider goal.
  • Explain key concepts using diagrams, not the GATE UI. If you do the latter, participants will inevitably try to keep up, and will typically not succeed (for the reasons above). Avoid the problem by using suitable visuals and an associated narrative.

It’s interesting to note that the summer school closed with the award of an “intermediate training course certificate” to each of the attendees, no doubt a first step on the road to formal accreditation of competence within the GATE community. FIG ’09 was oversubscribed, so there is already talk of another presentation later in the year (it’ll probably have to be renamed a “Winter School” though). On the basis of FIG ’09, would I recommend it? Absolutely. In the meantime, watch this space for an update on the GATE user experience.
