Posts Tagged ‘machine learning’

A short while ago I posted the slides to Despo Georgiou’s talk at the London Text Analytics meetup on Sentiment analysis: a comparison of four tools. Despo completed an internship at UXLabs in 2013–14, and I’m pleased to say that the paper we wrote documenting that work is due to be presented and published at the Science and Information Conference 2015 in London. The paper is co-authored with my IRSG colleague Andy MacFarlane and is available as a PDF; the abstract is appended below.

As always, comments and feedback welcome 🙂


Sentiment analysis is an emerging discipline with many analytical tools available. This project aimed to examine a number of tools regarding their suitability for healthcare data. A comparison between commercial and non-commercial tools was made using responses from an online survey which evaluated design changes made to a clinical information service. The commercial tools were Semantria and TheySay and the non-commercial tools were WEKA and Google Prediction API. Different approaches were followed for each tool to determine the polarity of each response (i.e. positive, negative or neutral). Overall, the non-commercial tools outperformed their commercial counterparts. However, due to the different features offered by the tools, specific recommendations are made for each. In addition, single-sentence responses were tested in isolation to determine the extent to which they more clearly express a single polarity. Further work can be done to establish the relationship between single-sentence responses and the sentiment they express.
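The paper compares tools rather than describing their internals, but to make the task concrete, here is a minimal lexicon-based sketch of three-way polarity classification. Everything in it (the word lists, the scoring rule) is a hypothetical illustration, not how Semantria, TheySay, WEKA or the Google Prediction API actually work.

```python
# Hypothetical lexicon-based polarity classifier: score a response by
# counting positive and negative words, then map the score to a label.
POSITIVE = {"good", "great", "clear", "helpful", "easy", "useful"}
NEGATIVE = {"bad", "confusing", "slow", "unhelpful", "difficult", "broken"}

def polarity(response: str) -> str:
    """Classify a free-text survey response as positive, negative or neutral."""
    words = response.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The new layout is clear and helpful"))  # positive
print(polarity("Navigation is slow and confusing"))     # negative
```

A lexicon approach like this is also the simplest way to see why single-sentence responses are interesting: a multi-sentence response can mix polarities and average out to a misleading neutral.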



Read Full Post »

As I posted a few months ago, I’m organizing the Industry Event at CIKM 2011 with Daniel Tunkelang. We have a great set of keynotes lined up:

We’re also looking for submissions from industry researchers and practitioners. The submission deadline is June 21.

Here is a copy of the call for papers:

This year’s CIKM conference will include an Industry Event, which will be held during the regular conference program in parallel with the technical tracks.

The Industry Event’s objectives are twofold. The first objective is to present the state-of-the-art in information retrieval, knowledge management, databases, and data mining, delivered as keynote talks by influential technical leaders who work in industry. The second objective is to present interesting, novel and innovative industry developments in these areas.

Industry authors are invited to prepare proposals for presenting interesting, novel and innovative ideas, and submit these to industry@cikm2011.org by June 21st 2011. The proposals should contain (with respective lengths):

  • Short company portrait (125 words)
  • Short CV of the presenter (125 words)
  • Title and abstract of the presentation (250 words)
  • Reasons why the presentation should be interesting to the CIKM audience

When submitting a proposal, please bear in mind the following:

  • Ensure the presentation is relevant to the CIKM audience (the Call for Papers gives a good idea of the conference scope).
  • Try to highlight interesting R&D challenges in the work you present. Please do not present a sales pitch.
  • All slides will be made public (no confidential information on the slides; you will be expected to ensure your slides are approved by your company before being presented).
  • Presenters may opt to have their presentation videoed and made public, and if so, the presenter will be asked to sign a release form.

We look forward to receiving your submissions, and welcoming you to the CIKM 2011 Conference and Industry Event.

Important dates:
21 June 2011: Industry Event paper proposals due
19 July 2011: Notifications sent
27 October 2011: Industry Event
24-28 October 2011: CIKM conference


Read Full Post »

Last week I attended the First GATE Summer School (“FIG’09”), at Sheffield University, from July 27 – 31. For those unfamiliar with GATE, it is generally considered to be the premier open source platform for text analytics, and described by some as the “VW Beetle of NLP” (i.e. “it will overtake everyone – eventually”). If you subscribe to this view, then you’d probably also conclude that those left behind include OpenNLP, NLTK, UIMA and various other text analytics / NLP toolkits. All worthy initiatives, but with ~35,000 downloads per year and thousands of users, GATE enjoys a degree of success and popularity that few of them can match.

The GATE Summer School also comes at an interesting time for the Text Analytics community more generally. When the Reuters NLP Group (which I led) closed in 2003, there was very little activity happening outside of academia. Half a decade later, it’s a different picture – people can and do make a reasonable living as text analytics consultants, with the rise of blogs, user-generated content and social media creating ever more opportunities for interesting text analytics applications and services. And, at the risk of tempting fate, there are growing signs of liquidity in the jobs market. It is still, however, a long way from being a mainstream profession, with recognised qualifications, roles, career progression, accreditation schemes, and so on (although for developments on the latter, read on).

The view from the back

Small room, big crowd

Anyway, back to GATE. When I reflect on initiatives to create generic platforms or infrastructure within a particular community, it brings to mind the old adage that “in a gold rush the best way to get rich is to make shovels”. And GATE is a pretty useful shovel – the fact that the summer school filled a whole week (and could have gone on longer) is a reflection of the breadth and flexibility of the platform. In fact, the Summer School did have a second week, but this was aimed specifically at GATE developers and committers rather than end users. And in GATE’s case, much of the richness is shared by the community as a whole, thanks to its commitment to open source philosophy and practices.

A further strength of GATE is that it is built around an extensible, component-based model, consisting of linguistic resources (the documents or corpora to be processed) and processing resources (the tools and algorithms that do the processing). The latter are typically aggregated into applications, i.e. pipelines that represent a sequence of operations over a corpus. All of these concepts can be managed through a graphical development environment, and bespoke applications can be built directly around the Java API. GATE also integrates with various third-party tools, such as LingPipe, OpenNLP, OpenCalais, etc.
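The resource model above can be sketched in a few lines. This is a toy Python stand-in, not GATE’s actual Java API: all the names here are hypothetical, and serve only to show how language resources, processing resources and applications fit together.

```python
# Toy illustration of GATE's resource model (hypothetical names; GATE itself
# exposes these concepts through its Java API, not Python).
from dataclasses import dataclass, field

@dataclass
class Document:
    """A 'linguistic resource': text plus the annotations added so far."""
    text: str
    annotations: list = field(default_factory=list)

def tokeniser(doc: Document) -> Document:
    """A 'processing resource' that adds Token annotations."""
    doc.annotations.extend(("Token", w) for w in doc.text.split())
    return doc

def gazetteer(doc: Document) -> Document:
    """Another processing resource: looks tokens up in a word list."""
    cities = {"London", "Sheffield"}
    doc.annotations.extend(("Location", w) for w in doc.text.split() if w in cities)
    return doc

def run_pipeline(doc: Document, prs) -> Document:
    """An 'application': processing resources applied in sequence."""
    for pr in prs:
        doc = pr(doc)
    return doc

doc = run_pipeline(Document("GATE was built in Sheffield"), [tokeniser, gazetteer])
print(doc.annotations)
```

The point of the design is the same in the toy and in GATE proper: because each processing resource reads and writes annotations on a shared document, components can be swapped, reordered and reused across applications.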

Networking coffee break


The week itself consisted of a series of NLP topics and applications, divided into 1-2 hour blocks, most with an element of theory (delivered via PowerPoint) and an associated practical session (hands on at the computer). Overall, I have to say that pretty much every session was of interest and value, which is so rarely the case with a week-long course. There were, however, significant opportunities to improve the delivery (more on that below). And GATE itself would undoubtedly benefit from greater attention to the user experience – as far as I am aware, no-one has yet “owned” this aspect of its development – and it shows. In fact, there are a number of specific UX issues that deserve a broader analysis, so watch this space for a separate post on the subject.

Over the course of the week, I’d say the highlights were:

  1. Understanding JAPE: This is the built-in language for linguistic pattern matching, allowing a user to create regular expressions to identify structures in text (which can range from simple patterns such as postcodes to much more complex phenomena such as stock price movements). However, the support that GATE provides for writing JAPE seems relatively primitive. In fact, it seems that many users write JAPE code in an external text editor, then import it via the GATE IDE. This seems bizarre, and probably goes some way to explaining why it is apparently very easy to write inefficient JAPE (which for a production environment must be a major limitation). Surely we can do better than this, given that it is likely to underpin pretty much any knowledge-based text analytics application. At the very least, someone should provide a TextPad syntax definition template. Better still, a PhD project to investigate and build some decent JAPE support tools that promote learning, re-use, efficiency, etc.
  2. Machine Learning: this is the counterpart to the JAPE approach, allowing a developer to induce linguistic patterns directly from training data. GATE provides extensive support for building machine learning NLP applications, but the usual caveats apply, i.e. you need large quantities of gold-standard, annotated training data. Consequently, support for manual annotation is a major issue, and a significant part of week was devoted to the associated workflow and quality control tools, such as GATE Teamware.
  3. Opinion mining: As a UX practitioner I am interested in the subjective user experience, expressed through both explicit behaviour (i.e. task completion etc.) and implicit attitudes and preferences. In particular, I am intrigued by the possibilities in triangulating between the qualitative, behavioural, micro-level feedback obtained from typical usability studies and the quantitative, attitudinal, macro-level data acquired through opinion mining, to create a ‘360-degree view’ of the user experience. This is an area I’d love to explore more, and will be one of my first areas of experimentation with GATE.

In summary, I’d say the first GATE summer school was a well-organised event with an excellent programme. But there were some significant areas where delivery could be improved:

  • If you’re running a course with ~55 attendees, you need to make sure everyone can see and hear from the back. Use 2 projectors, and have presenters wear lapel microphones.
  • Go at the pace of the slowest machine – which will invariably be those of the attendees, not the presenters.
  • Tell the attendees clearly what slide deck you are using before starting your talk, and give them a chance to open it in advance.
  • Work out the choreography of when each lecture finishes and when the associated practical begins, and make sure everyone knows the difference. Tell people how long they have for each exercise, and give them clear timechecks when time is about to run out.
  • Implement a “quick reset” function for GATE that clears down the old exercise in readiness for the new (or at the very least, tell the class how to do this in the first practical).
  • Standardise the format and approach of the practicals. Greater consistency will allow participants to adapt better to the pedagogical approach and focus more on the material and less on the presentation.
  • Make the practicals goal-oriented. Give people a clear problem statement at the outset, so that low-level tasks can be understood within the context of a wider goal.
  • Explain key concepts using diagrams, not the GATE UI. If you do the latter, participants will inevitably try to keep up, and will typically not succeed (for the reasons above). Avoid the problem by using suitable visuals and an associated narrative.

It’s interesting to note that the summer school closed with the award of an “intermediate training course certificate” to each of the attendees, no doubt a first step on the road to formal accreditation of competence within the GATE community. FIG ’09 was oversubscribed, so there is already talk of another presentation later in the year (it’ll probably have to be renamed a “Winter School” though). On the basis of FIG ’09, would I recommend it? Absolutely. In the meantime, watch this space for an update on the GATE user experience.

Read Full Post »