Last week I attended the First GATE Summer School (“FIG’09”), at Sheffield University, from July 27 – 31. For those unfamilar with GATE, it is generally considered to be the premier open source platform for text analytics, and described by some as the “VW Beetle of NLP” (i.e. “it will overtake everyone – eventually”). If you subscribe to this view, then you’d probably also conclude that those left behind include OpenNLP, NLTK, UIMA and various other text analytics / NLP tookits. All worthy initiatives, but with ~35,000 downloads per year and thousands of users, few can claim the same degree of success or popularity as GATE.
The GATE Summer School also comes at an interesting time for the Text Analytics community more generally. When the Reuters NLP Group (which I led) closed in 2003, there was very little activity happening outside of academia. Half a decade later, it’s a different picture – people can and do make a resonable living as text analytics consultants, with the rise of blogs, user-generated content and social media creating ever more opportunities for interesting text analytics applications and services. And, at the risk of tempting fate, there are growing signs of liquidity in the jobs market. It is still, however, a long way from being a mainstream profession, with recognised qualifications, roles, career progression, accreditation schemes, and so on (although for developments on the latter, read on).
Anyway, back to GATE. When I reflect on initiatives to create generic platforms or infrastructure within a particular community, it brings to mind the old adage that “in a gold rush the best way to get rich is to make shovels”. And GATE is a pretty useful shovel – the fact that the summer school filled a whole week (and could have gone on longer) is a reflection of the breadth and flexibility of the platform. In fact, the Summer School did have a second week, but this was aimed specifically at GATE developers and committers rather than end users. And in GATE’s case, much of the richness is shared by the community as a whole, thanks to its commitment to open source philosophy and practices.
A further strength of GATE is that it is built around a extensible, component-based model, consisting of linguistic resources (the documents or corpora to be processed) and processing resources (the tools and algorithms that do the processing). The latter are typically aggregated into applications, i.e. pipelines that represent a sequence of operations over a corpus. All of these concepts can be managed through a graphical development environment, and bespoke applications can be built directly around the Java API. GATE also integrates with various 3rd party tools, such as LingPipe, OpenNLP, OpenCalais, etc.
The week itself consisted of a series of NLP topics and applications, divided into 1-2 hour blocks, most with an element of theory (delivered via powerpoint) and an associated practical session (hands on at the computer). Overall, I have to say that pretty much every session was of interest and value, which is so rarely the case with a week-long course. There were, however, significant opportunities to improve the delivery (more on that below). And GATE itself would undoubtedly benefit from greater attention to the user experience – as far as I am aware, no-one has yet “owned” this aspect of its development – and it shows. In fact, there are a number of specific UX issues that deserve a broader analysis, so watch this space for a separate post on the subject.
Over the course of the week, I’d say the highlights were:
- Understanding JAPE: This is the built-in language for linguistic pattern matching, allowing a user to create regular expressions to identify structures in text (which can range from simple patterns such as postcodes to much more complex phenomena such as stock price movements). However, the support that GATE provides for writing JAPE seems relatively primitive. In fact, it seems that many users write JAPE code in an external text editor, then import it via the GATE IDE. This seems bizarre, and probably goes some way to explaining why it is apparently very easy to write inefficient JAPE (which for a production environment must be a major limitation). Surely we can do better than this, given that it is likely to underpin pretty much any knowledge-based text analytics application. At the very least, someone should provide a TextPad syntax definition template. Better still, a PhD project to investigate and build some decent JAPE support tools that promote learning, re-use, efficiency, etc.
- Machine Learning: this is the counterpart to the JAPE approach, allowing a developer to induce linguistic patterns directly from training data. GATE provides extensive support for building machine learning NLP applications, but the usual caveats apply, i.e. you need large quantities of gold-standard, annotated training data. Consequently, support for manual annotation is a major issue, and a significant part of week was devoted to the associated workflow and quality control tools, such as GATE Teamware.
- Opinion mining: As a UX practitioner I am interested in the subjective user experience, expressed through both explicit behaviour (i.e. task completion etc.) and implicit attitudes and preferences. In particular, I am intrigued by the possibilities in triangulating between the qualitiative, behavioural, micro-level feedback obtained from typical usability studies and the quantitiative, attitudinal, macro-level data acquired through opinion mining, to create a ‘360 degree view’ of the user experience. This is an area I’d love to explore more, and will be one of my first areas of experimentation with GATE.
In summary, I’d say the first GATE summer school was a well-organised event with an excellent programme. But there were some significant areas where delivery could be improved:
- If you’re running a course with ~55 attendees, you need to make sure everyone can see and hear from the back. Use 2 projectors, and have presenters wear lapel microphones.
- Go at the pace of the slowest machine – which will invariably be those of the attendees, not the presenters.
- Tell the attendees clearly what slide deck you are using before starting your talk, and give them a chance to open it in advance.
- Work out the choreography of when each lecture finishes and when the associated practical begins, and make sure everyone knows the difference. Tell people how long they have for each exercise, and give them clear timechecks when they are about to timeout.
- Implement a “quick reset” function for GATE that clears down the old exercise in readiness for the new (or at the very least, tell the class how to do this in the first practical)
- Standardise the format and approach of the practicals. Greater consistency will allow participants to adapt better to the pedagogical approach and focus more on the material and less on the presentation.
- Make the practicals goal-oriented. Give people a clear problem statement at the outset, so that low-level tasks can be understood within the context of a wider goal.
- Explain key concepts using diagrams, not the GATE UI. If you do the latter, participants will inevitably try to keep up, and will typically not succeed (for the reasons above). Avoid the problem by using suitable visuals and an associated narrative.
It’s interesting to note that the summer school closed with the award of an “intermediate training course certificate” to each of the attendees, no doubt a first step on the road to formal accreditation of competence within the GATE community. FIG ’09 was oversubscribed, so there is already talk of another presentation later in the year (it’ll probably have to be renamed a “Winter School” though). On the basis of FIG ’09, would I recommend it? Absolutely. In the meantime, watch this space for an update on the GATE user experience.