Over the last few months I have been working with Paul Clough and Elaine Toms of Sheffield University on a Google-funded project called ‘A Taxonomy of Search Sessions’. A session, in case you’re wondering, is defined as a period of continued usage between a user and a search application. So if you spend a while Googling for holiday destinations, that’s a session. Sessions are interesting because they form a convenient unit of interaction with which to study usage patterns, and these can provide insights that drive improved design and functionality.
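To make the definition concrete, here is a minimal sketch of one common way to carve a search log into sessions: start a new session whenever a user has been inactive for longer than some timeout. The 30-minute threshold, the `(user_id, timestamp, query)` record layout, the function name `split_into_sessions` and the sample log entries are all assumptions made for illustration, not choices the project has made.

```python
from datetime import datetime, timedelta

# Assumption: a 30-minute inactivity gap ends a session. This is a common
# heuristic, not a threshold taken from the project itself.
SESSION_TIMEOUT = timedelta(minutes=30)

def split_into_sessions(events, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, query) tuples into per-user sessions."""
    sessions = []
    last_seen = {}  # user_id -> timestamp of that user's previous event
    current = {}    # user_id -> the session list currently being built
    # Sort by user, then time, so each user's events are processed in order.
    for user_id, timestamp, query in sorted(events, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user_id)
        if previous is None or timestamp - previous > timeout:
            # First event for this user, or the gap exceeded the timeout:
            # open a new session.
            current[user_id] = []
            sessions.append(current[user_id])
        current[user_id].append((user_id, timestamp, query))
        last_seen[user_id] = timestamp
    return sessions

# Hypothetical log entries, invented for illustration.
log = [
    ("u1", datetime(2013, 5, 1, 9, 0), "holiday destinations"),
    ("u1", datetime(2013, 5, 1, 9, 4), "cheap flights lisbon"),
    ("u1", datetime(2013, 5, 1, 14, 30), "python csv module"),
    ("u2", datetime(2013, 5, 1, 9, 2), "sheffield weather"),
]
sessions = split_into_sessions(log)
for session in sessions:
    print([query for _, _, query in session])
```

A fixed timeout is only the simplest option; Jones and Klinkner (2008), listed below, discuss segmenting logs by topic rather than by time alone.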
There are, of course, all sorts of ways we could categorise sessions. We could classify them as ‘known-item’ or ‘exploratory’, for example. But the aim of this project is to go further than simple distinctions and develop a novel taxonomy to characterise different types of session at varying levels of granularity. So at the basic level, for example, we might look at low-level query/clickstream data, at the next level look at query progression within episodes, and at the highest level consider the overall search task or goal.
It’s an ambitious vision, of course, so we need to be mindful not to try to boil the ocean in the limited time we have available. Moreover, we need to stand on the shoulders of giants: many have trodden this path before, and their insights should provide the foundation for our own thinking. Andrei Broder’s (2002) taxonomy of web search, for example, is a well-known instance of this kind of approach applied at the query level. And there are many more recent examples that focus on patterns at the session level across a variety of applications and contexts (see the list below for a few examples).
The comparison with the work we’ve been doing on search modes is also interesting: this has been based largely on top-down, qualitative approaches, using data sources that are expensive to acquire and limited in number. This project, by contrast, employs more quantitative, ‘bottom-up’ approaches, using data sources that are higher in number and more easily acquired: search logs. It will be interesting to see the extent to which we can triangulate between these contrasting views and approaches to develop a more integrated perspective.
So, where should we start? Well, first we need to break the problem down by identifying specific research questions or subtopics within it. Then we need to understand how these subtopics relate to each other, and what the dependencies are between them (from both a conceptual and a logistical point of view). And then we need to operationalise these questions by turning them into hypotheses that we can test empirically. So occupying our minds now are issues such as:
- Feature selection:
  - What features can be reliably extracted from search sessions within different search logs?
  - Is it possible to identify a generic set that are maximally transferable across different search contexts and applications?
  - Which features best characterise different types of session activity at varying levels of abstraction?
  - To what extent can manual analysis of sessions be used to guide or inform the feature selection process?
- Identifying usage patterns:
  - What are the most effective techniques for revealing usage patterns (e.g. unsupervised learning, clustering, etc.)?
  - What types of visualisation or projection are most informative?
  - Which session-based patterns of activity are common across different search logs?
  - To draw meaningful inferences the analysis process has to be methodologically sound and the patterns ‘stable’. But what does it mean for a cluster analysis to be stable? What should it be stable to – sample size? Clustering algorithm?
  - How ‘similar’ or comparable are two particular search logs anyway? Of all the contextual dimensions we could identify a priori as directly influencing usage patterns, which are the most significant?
  - What is the best way to combine data-driven and model-based approaches?
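One way to make the question about cluster stability concrete is to re-cluster random subsamples of the sessions and measure how far each result agrees with a reference clustering of the full data. The sketch below does this with a plain k-means and the Rand index, using only the Python standard library. Everything in it is an assumption made for illustration: the two session features (queries and clicks per session), the synthetic data, the 80% subsample size and the choice of k-means itself. It shows one possible reading of "stable to sample size", not a method the project has adopted.

```python
import random
from itertools import combinations

def kmeans(points, k, seed, iterations=20):
    """Plain k-means on tuples of floats; returns a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both labelings treat alike (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)

def subsample_stability(points, k, runs=10, fraction=0.8, seed=0):
    """Mean Rand index between a reference clustering and clusterings of subsamples."""
    rng = random.Random(seed)
    reference = kmeans(points, k, seed=seed)
    scores = []
    for run in range(runs):
        idx = sorted(rng.sample(range(len(points)), int(fraction * len(points))))
        sub_labels = kmeans([points[i] for i in idx], k, seed=seed + run + 1)
        # Compare only on the points that appear in this subsample.
        scores.append(rand_index([reference[i] for i in idx], sub_labels))
    return sum(scores) / len(scores)

# Synthetic, hypothetical session features: (queries per session, clicks per session).
data_rng = random.Random(42)
short_sessions = [(data_rng.gauss(2, 0.5), data_rng.gauss(1, 0.5)) for _ in range(30)]
long_sessions = [(data_rng.gauss(10, 1.0), data_rng.gauss(8, 1.0)) for _ in range(30)]
points = short_sessions + long_sessions

stability = subsample_stability(points, k=2)
print(f"mean Rand index over subsamples: {stability:.2f}")
```

A score near 1 means the subsamples reproduce the reference grouping; lower scores suggest the clusters depend on which sessions happened to be sampled. The same loop could vary the clustering algorithm or the feature set instead of the sample, which corresponds to the other kinds of stability raised in the list above.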
And there are lots more where that came from. We won’t have time to address them all, but I hope to start answering some of them over the next few weeks and months. In the meantime, if you have any insights of your own to share, we’d love to hear them 🙂
- Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2): 3–10.
- Chen, H-M., and Cooper, M.D. (2001). Using clustering techniques to detect usage patterns in a web-based information system. Journal of the American Society for Information Science and Technology, 52(11): 888–904.
- Jansen, B.J. (2006). Search log analysis: What is it; what’s been done; how to do it. Library and Information Science Research, 28(3): 407–432.
- Jones, R., and Klinkner, K.L. (2008). Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM conference on Information and knowledge management (CIKM ’08). ACM, New York, NY, USA, 699–708.
- Kathuria, A., Jansen, B.J., Hafernik, C., and Spink, A. (2010). Classifying the user intent of web queries using k-means clustering. Internet Research, 20(5): 563–581.
- Rose, D.E., and Levinson, D. (2004). Understanding user goals in Web search. In Proceedings of World Wide Web conference (WWW 2004), 13–19.
- Silvestri, F. (2010). Mining Query Logs: Turning Search Usage Data into Knowledge. Foundations and Trends in Information Retrieval, 4(1–2): 1–174.
- Stenmark, D. (2008). Identifying clusters of user behavior in intranet search engine log files. Journal of the American Society for Information Science and Technology, 59(14): 2232–2243.
- Wolfram, D., Wang, P., and Zhang, J. (2008). Modeling Web session behavior using cluster analysis: A comparison of three search settings. In Proceedings of the American Society for Information Science and Technology, 44(1): 1550-8390.
- Weber, I., and Jaimes, A. (2011). Who uses web search for what: and how? In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM ’11). ACM, New York, NY, USA, 15–24.