Last week I attended the First GATE Summer School (“FIG ’09”), at Sheffield University, from July 27 – 31. For those unfamiliar with GATE, it is generally considered to be the premier open source platform for text analytics, and described by some as the “VW Beetle of NLP” (i.e. “it will overtake everyone – eventually”). If you subscribe to this view, then you’d probably also conclude that those left behind include OpenNLP, NLTK, UIMA and various other text analytics / NLP toolkits. All worthy initiatives, but with ~35,000 downloads per year and thousands of users, few can claim the same degree of success or popularity as GATE.
The GATE Summer School also comes at an interesting time for the Text Analytics community more generally. When the Reuters NLP Group (which I led) closed in 2003, there was very little activity happening outside of academia. Half a decade later, it’s a different picture – people can and do make a reasonable living as text analytics consultants, with the rise of blogs, user-generated content and social media creating ever more opportunities for interesting text analytics applications and services. And, at the risk of tempting fate, there are growing signs of liquidity in the jobs market. It is still, however, a long way from being a mainstream profession, with recognised qualifications, roles, career progression, accreditation schemes, and so on (although for developments on the latter, read on).

Small room, big crowd
Anyway, back to GATE. When I reflect on initiatives to create generic platforms or infrastructure within a particular community, it brings to mind the old adage that “in a gold rush the best way to get rich is to make shovels”. And GATE is a pretty useful shovel – the fact that the summer school filled a whole week (and could have gone on longer) is a reflection of the breadth and flexibility of the platform. In fact, the Summer School did have a second week, but this was aimed specifically at GATE developers and committers rather than end users. And in GATE’s case, much of the richness is shared by the community as a whole, thanks to its commitment to open source philosophy and practices.
A further strength of GATE is that it is built around an extensible, component-based model, consisting of language resources (the documents or corpora to be processed) and processing resources (the tools and algorithms that do the processing). The latter are typically aggregated into applications, i.e. pipelines that represent a sequence of operations over a corpus. All of these concepts can be managed through a graphical development environment, and bespoke applications can be built directly around the Java API. GATE also integrates with various third-party tools, such as LingPipe, OpenNLP, OpenCalais, etc.
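For readers who think better in code, here is a minimal sketch of that component model in plain Python. To be clear, this is not GATE's Java API: every name in it (Document, Annotation, tokenise, tag_postcodes, run_pipeline) is my own invention for illustration, and the postcode pattern is a deliberately simplified assumption rather than a full UK postcode grammar. It shows a document accumulating annotations as it passes through a sequence of processing steps, with the second step matching over the Token annotations produced by the first rather than over raw characters.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Postcode"
    start: int  # character offset where the span begins
    end: int    # character offset just past the end of the span


@dataclass
class Document:
    """Stands in for a language resource: text plus annotations added so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# A processing resource is modelled as any function that updates a Document.
ProcessingResource = Callable[[Document], None]

# Simplified, assumed shapes for the two halves of a UK postcode.
OUTWARD = re.compile(r"[A-Z]{1,2}\d[A-Z\d]?")
INWARD = re.compile(r"\d[A-Z]{2}")


def tokenise(doc: Document) -> None:
    """Add one Token annotation per run of word characters."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def tag_postcodes(doc: Document) -> None:
    """Add a Postcode annotation over adjacent Token pairs shaped like 'S1' '4DP'."""
    tokens = [a for a in doc.annotations if a.kind == "Token"]
    for first, second in zip(tokens, tokens[1:]):
        outward = doc.text[first.start:first.end]
        inward = doc.text[second.start:second.end]
        if OUTWARD.fullmatch(outward) and INWARD.fullmatch(inward):
            doc.annotations.append(Annotation("Postcode", first.start, second.end))


def run_pipeline(corpus: Iterable[Document],
                 pipeline: List[ProcessingResource]) -> None:
    """Apply each processing resource, in order, to every document in the corpus."""
    for doc in corpus:
        for resource in pipeline:
            resource(doc)


doc = Document("Post it to Sheffield S1 4DP today.")
run_pipeline([doc], [tokenise, tag_postcodes])
postcodes = [doc.text[a.start:a.end] for a in doc.annotations if a.kind == "Postcode"]
print(postcodes)  # prints ['S1 4DP']
```

The order of the pipeline matters: tag_postcodes finds nothing unless tokenise has already run, which is the sense in which an application is a sequence of operations rather than a bag of tools.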

Networking coffee break
The week itself consisted of a series of NLP topics and applications, divided into 1-2 hour blocks, most with an element of theory (delivered via PowerPoint) and an associated practical session (hands on at the computer). Overall, I have to say that pretty much every session was of interest and value, which is so rarely the case with a week-long course. There were, however, significant opportunities to improve the delivery (more on that below). And GATE itself would undoubtedly benefit from greater attention to the user experience – as far as I am aware, no-one has yet “owned” this aspect of its development – and it shows. In fact, there are a number of specific UX issues that deserve a broader analysis, so watch this space for a separate post on the subject.
Over the course of the week, I’d say the highlights were:
- Understanding JAPE: This is the built-in language for linguistic pattern matching, allowing a user to create regular expressions to identify structures in text (which can range from simple patterns such as postcodes to much more complex phenomena such as stock price movements). However, the support that GATE provides for writing JAPE seems relatively primitive. In fact, it seems that many users write JAPE code in an external text editor, then import it via the GATE IDE. This seems bizarre, and probably goes some way to explaining why it is apparently very easy to write inefficient JAPE (which for a production environment must be a major limitation). Surely we can do better than this, given that it is likely to underpin pretty much any knowledge-based text analytics application. At the very least, someone should provide a TextPad syntax definition template. Better still, a PhD project to investigate and build some decent JAPE support tools that promote learning, re-use, efficiency, etc.
- Machine Learning: This is the counterpart to the JAPE approach, allowing a developer to induce linguistic patterns directly from training data. GATE provides extensive support for building machine learning NLP applications, but the usual caveats apply, i.e. you need large quantities of gold-standard, annotated training data. Consequently, support for manual annotation is a major issue, and a significant part of the week was devoted to the associated workflow and quality control tools, such as GATE Teamware.
- Opinion mining: As a UX practitioner I am interested in the subjective user experience, expressed through both explicit behaviour (e.g. task completion) and implicit attitudes and preferences. In particular, I am intrigued by the possibilities in triangulating between the qualitative, behavioural, micro-level feedback obtained from typical usability studies and the quantitative, attitudinal, macro-level data acquired through opinion mining, to create a ‘360 degree view’ of the user experience. This is an area I’d love to explore more, and will be one of my first areas of experimentation with GATE.
In summary, I’d say the first GATE summer school was a well-organised event with an excellent programme. But there were some significant areas where delivery could be improved:
- If you’re running a course with ~55 attendees, you need to make sure everyone can see and hear from the back. Use 2 projectors, and have presenters wear lapel microphones.
- Go at the pace of the slowest machine – which will invariably be those of the attendees, not the presenters.
- Tell the attendees clearly what slide deck you are using before starting your talk, and give them a chance to open it in advance.
- Work out the choreography of when each lecture finishes and when the associated practical begins, and make sure everyone knows the difference. Tell people how long they have for each exercise, and give them clear timechecks when they are about to run out of time.
- Implement a “quick reset” function for GATE that clears down the old exercise in readiness for the new (or at the very least, tell the class how to do this in the first practical).
- Standardise the format and approach of the practicals. Greater consistency will allow participants to adapt better to the pedagogical approach and focus more on the material and less on the presentation.
- Make the practicals goal-oriented. Give people a clear problem statement at the outset, so that low-level tasks can be understood within the context of a wider goal.
- Explain key concepts using diagrams, not the GATE UI. If you do the latter, participants will inevitably try to keep up, and will typically not succeed (for the reasons above). Avoid the problem by using suitable visuals and an associated narrative.
It’s interesting to note that the summer school closed with the award of an “intermediate training course certificate” to each of the attendees, no doubt a first step on the road to formal accreditation of competence within the GATE community. FIG ’09 was oversubscribed, so there is already talk of another presentation later in the year (it’ll probably have to be renamed a “Winter School” though). On the basis of FIG ’09, would I recommend it? Absolutely. In the meantime, watch this space for an update on the GATE user experience.


If you want to annotate some corpora for any reason (e.g. to train machine learning based components) you can also check Ellogon:
http://www.ellogon.org
George
Thanks George. Just checked out the link – I love your credits page. I presume you must have a good sense of irony!
Hi Tony,
Thanks for this. I’d like to have this article on Chatbots.org as well, and I would also love to create a list of the resources you mentioned.
Would it be possible that you publish the article yourself? Let’s elaborate on this!
Erwin
Hi Erwin,
Sure, I’d be more than happy to. Yes on all counts!
Love what you’re doing with chatbots.org – in fact I’d come across the site before so it’s great to renew the acquaintance. What’s the process for publishing the article?
Tony
OK, I don’t want to leave my e-mail address here (to avoid e-mail harvesting by spammers), but drop me an email on:
http://www.chatbots.org/expert/erwin_van_lun/1/
and we’ll continue from there
I don’t view GATE as a competitor to OpenNLP, NLTK, LingPipe, Mallet, Stanford’s JavaNLP, MinorThird, etc. I see it as more of an Eclipse-like integration/GUI/debugging platform. UIMA’s another different kind of beast, which is more like J2EE than like any of these other packages. So my analogy is:
Eclipse:GATE :: Java:OpenNLP :: J2EE:UIMA
That’s not quite exact, because GATE supplies its own (mostly) heuristic pattern matching (mostly through JAPE), and its own simple linguistic pipeline in ANNIE (most of which is built on top of JAPE, the exception being Hepple’s Brill-like POS tagger).
The machine learning page you link to in GATE discusses the integration of Joachims’ SVM-light, which has a restrictive research-use-only license. It also integrates Weka, which is GPL-ed. And LingPipe, our software, which has its own royalty-free license that’s even more restrictive than the GPL. That’s why I say it’s Eclipse-like.
Thanks for the clarification, Bob – I think that’s a great way of looking at it. (And actually, all the more reason to work on the UX of GATE, as I think Eclipse sets the bar pretty high in this respect.)
Agree also about the ML page – I am aware that one wasn’t ideal. If you know of a more generic one I’d love to hear it.
Cheers,
Tony
I am a little bit puzzled by all these comments. For me it’s difficult to compare GATE with Eclipse, as they have different targets. The first is an NLP platform, the latter is a development environment. If you need an IDE for NLP, you should look at VisualText.
Also, I have a comment about JAPE: it is not a pattern matching engine. It is a regular expression engine over *annotations*, which is quite different from “pattern matching” which implies expressions over series of characters.
@George Although it’s just a semantic issue, I don’t think “pattern matching” implies expressions over series of characters. It’s a general term for machine learning of classifications of various kinds. For instance, Chris Bishop’s textbook is called Pattern Recognition and Machine Learning, and most of the pattern recognition is for vision (as with Duda and Hart’s classic text).
And although GATE isn’t an IDE in the sense of letting you compile programs from scratch, it does let you develop NLP pipelines. But in any case, it was just a suggestive analogy.
@Tony I’ve never used GATE, so I don’t actually know how much of their own machine learning is involved. The project’s so active that it’s hard to keep up!
Tony, thanks for the detailed and positive write-up and suggestions. They’ll definitely help us to improve the subsequent training events (be it summer or winter schools). Also, please send me any subsequent articles on usability and GATE so we can follow them up.
Hi Kalina,
No problem – happy to oblige. I will try to strike a balance in any UX review between the more fundamental, longer term usability issues (as I see them) and any ‘quick wins’ that hopefully can be addressed in the shorter term. Obviously GATE has a very wide range of functionality, so one of the biggest challenges will be deciding exactly where to prioritise, and I’d certainly welcome your thoughts on that.
Cheers,
Tony
Hi Tony,
This sounds very good – I’d be happy to discuss usability issues and priorities with you.
Cheers,
Kalina