Regular readers of this blog will know that over the past few months I’ve been researching professional search strategies in the workplace as part of an InnovateUK-funded research project. A fortuitous side effect of that is the following paper, which is the output of a collaboration with Andy MacFarlane of City University London. Andy has been teaching IR and search strategies for many years, and this paper represents a synthesis of his framework and my research insights. The paper itself is due to be presented at the 4th Spanish Conference in Information Retrieval in Granada next week. I hope Andy won’t mind too much if I describe the work as slightly more academic than my usual blog posts, but it does offer a structured way to think about search strategy development, along with some valuable advice on how best to teach such skills. For now, here is just the first half. You’ll have to wait till next week for the second instalment!
Healthcare information professionals perform systematic literature reviews to gather the evidence needed to answer specific research questions and formulate policy. However, performing a systematic review is a resource-intensive and time-consuming undertaking, often taking years to complete. Moreover, the output relies heavily on the quality of the initial search strategy in ensuring that the scope is sufficiently exhaustive and not biased by easily accessible studies. In this paper we introduce a structured methodology and a framework for learning which together aim to embody best practices from the community and provide support for many of the common issues in search strategy development.
- Information systems➝Information Retrieval➝Information Retrieval Query Processing.
Information retrieval; systematic reviews; education; training.
Medical knowledge is growing so rapidly that it is difficult for healthcare professionals to keep up. As the volume of published studies increases year by year, the gap between research knowledge and professional practice grows ever wider. Systematic literature reviews can play a key role in closing this gap, by synthesizing the complex, incomplete and at times conflicting findings of biomedical research into a form that can readily inform health decision making. A key principle of systematic reviews is that the protocol by which the literature was collected and analyzed should be made transparent and repeatable.
However, undertaking a systematic review is a resource-intensive and time-consuming process, sometimes taking years to complete. Even rapid evidence assessments, designed to provide quick summaries of what is known about a topic or intervention, can take as long as two to six months. Moreover, new research findings may be published in the interim, leading to a lack of currency and potential for inaccuracy. It is therefore vital that the search strategies used to identify relevant studies should be published so that the process is seen to be auditable and repeatable.
In this paper we introduce a structured methodology for search strategy development and provide a framework for learning based on this methodology. In section 2 we provide the background to the problem in the context of the domain, and then discuss related pedagogical work in section 3. Section 4 details the structured searching methodology which we then apply in Section 5 to develop a framework for learning. We provide ideas for the future direction of the work in section 6.
At its heart, the process of systematic review relies on painstaking and meticulous searching of multiple literature sources. These include published literature sources such as MEDLINE and other specialist databases and ‘grey literature’ (i.e. technical reports and other non-peer reviewed sources). The principal way in which such sources are interrogated is through the use of Boolean queries, which utilize a variety of keywords, operators and ontology terms (also referred to as ‘subject headings’) – see Fig. 1.
(“etiology”[Subheading] OR “etiology”[All Fields] OR “causes”[All Fields] OR “causality”[MeSH Terms] OR “causality”[All Fields]) AND (“somnambulism”[MeSH Terms] OR “somnambulism”[All Fields] OR (“sleep”[All Fields] AND “walking”[All Fields]) OR “sleep walking”[All Fields])
Fig. 1 – Example of subject headings use
Reviewers incrementally build complex queries line by line, sometimes involving hundreds of terms, which are combined to form an overall search strategy – see Fig. 2.
The choice of search strategy is critical in ensuring that the process is sufficiently exhaustive and that the review is not biased by easily accessible studies. In addition, the strategy needs to be transparent and repeatable, so that others may replicate the methodology. However, there are often mistakes in search strategies reported in the literature that prevent them from being executed in their published form. In one sample of 63 MEDLINE strategies, at least one error was detected in 90% of these, including spelling errors, truncation errors, logical operator errors, incorrect query line references, redundancy without rationale, and more.
Evidently, despite the dedication and painstaking attention to detail of many individuals (many of whom are trained librarians), creating effective search strategies may be prone to error, often relying on manual processes with limited editorial support. Moreover, once published, strategies are typically stored as free text, and are thus rarely directly executable in their native form. This compromises their ability to be used by others and may unintentionally detract from the rigour of the review.
- randomized controlled trial.pt.
- controlled clinical trial.pt.
- clinical trials as topic.sh.
- 1 or 2 or 3 or 4 or 5 or 6 or 7
- (animals not (humans and animals)).sh.
- 8 not 9
- exp Child/
- exp infant/
- child hospitalized/
- adolescent hospitalized/
- (child$ or infant$ or toddler$ or adolescen$ or teenage$).tw.
- Child Nutrition Sciences/
- exp Dietary Proteins/
- Dietary Supplements/
- exp Infant, Newborn/
- exp Overweight/
- exp Eating Disorders/
- exp Sports/
- exp Pregnancy/
- exp Viruses/
- (newborn$ or obes$ or “eating disorder$” or pregnan$ or childbirth or virus$ or influenza).tw.
- 10 and 17 and 22
- 32 not 31
Fig. 2 – Complex Boolean query example extract
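One of the error types mentioned above, incorrect query line references, lends itself to a simple automated check. The following is a minimal sketch only, assuming a strategy is held as a list of strings in which line n is the nth item and a combination line contains nothing but line numbers, Boolean operators and parentheses. The function name and the sample strategy are hypothetical and are not taken from any published tool.

```python
import re

# Tokens allowed on a "combination" line besides line numbers (an assumption).
OPERATORS = {"and", "or", "not", "(", ")"}


def find_bad_line_references(lines):
    """Return (line_number, referenced_number) pairs where a combination
    line refers to itself, to a later line, or to line 0."""
    problems = []
    for number, text in enumerate(lines, start=1):
        tokens = re.findall(r"[^\s()]+|[()]", text.lower())
        refs = [int(t) for t in tokens if t.isdigit()]
        is_combination = bool(refs) and all(
            t.isdigit() or t in OPERATORS for t in tokens
        )
        if not is_combination:
            continue  # an ordinary search line, nothing to check
        for ref in refs:
            if ref < 1 or ref >= number:
                problems.append((number, ref))
    return problems


# Hypothetical six-line strategy with two deliberate mistakes.
strategy = [
    "randomized controlled trial.pt.",
    "controlled clinical trial.pt.",
    "1 or 2",
    "exp Child/",
    "3 and 5",  # refers to itself
    "4 not 7",  # refers to a line that does not exist yet
]
print(find_bad_line_references(strategy))  # [(5, 5), (6, 7)]
```

A check like this only catches structural slips. It cannot tell whether a valid reference points at the line the searcher actually intended.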
3. RELATED WORK
Students of library and information science may have taken a module on search strategy development in library school, but further education and training are required to deal effectively with the highly complex queries typical of systematic reviews. We provide a broad overview of teaching methods, curricula development, online materials and assessment and feedback below. A much larger review of the area can be found in Fernandez-Luna et al. – here we focus on Level 2 of the taxonomy given in that paper, together with a focus on [B] Educational Goals; [b1] Library and information science. The technical level [A] focuses on the operational aspect of undertaking a query from a given information need.
The literature on teaching methods shows that generic ideas in pedagogy can be used to build frameworks to tackle problems in teaching and learning in information retrieval. There are a couple of tensions which need to be considered. The first of these is that the method can be either process or outcome based. Kuhlthau provides a five stage strategy to assist the process of learning based on prior work. These kinds of ideas are very useful for systematic reviews, where the search process is very complex. McGregor, however, shows that students tend to be focused on the outcome rather than the process, so any scheme for education must deal with that tension and ensure the student understands the importance of process. The second tension to deal with is whether to focus on theory or practice in teaching: the argument for the former is that understanding the theoretical concepts provides the student with transferable skills (i.e. they can use any system to search). However, systems do vary in practice and it is important to give the student practical knowledge of specific systems. In this work, we recommend a balance of the two.
The Cochrane Organization provides guidelines for search strategy development and has developed curricula specifically for search in systematic review. Professional bodies such as CILIP in the U.K. and the Medical Library Association in the U.S.A. also provide guidelines for curricula design; of a more general nature in the former case and more specific in the latter case. Other organizations such as the UK Quality Assurance Agency provide subject benchmarks in Library and Information Science which can be used to inform curricula.
After a curriculum has been defined, the instruction method can be derived and there is a clear case for producing materials online through E-learning systems such as Moodle. Much work has been done in the area of developing online resources for IR instruction; for example, the Cochrane Organization materials are online and available to all. E-learning environments may include interactive elements to help the student, e.g. Java Applets, or a self-training package which can include pre and post self-assessments for students to measure their own progress. Materials and assessments are designed in conjunction with each other and are integrated to ensure a good learning experience for the student. Face to face methods of delivery can also be used.
Assessment and feedback can use a wide variety of methods including those mentioned above in E-learning such as multiple choice questions (MCQs), assignments, projects and tests, or even an IR game where the student can assess their performance on a task using an IR system with a standard test collection, allowing them to assess their progress using a graphical tool. Feedback can be automatic via MCQs or via summative or formative textual feedback for more abstract ideas where there is no right and wrong answer (see section 5 below).
4. A STRUCTURED SEARCH METHODOLOGY
When undertaking a search there are a number of stages in the process starting from the realization of an information need (cognition) to the creation of a search which is submitted to an information retrieval system (syntactic). At City University London a search methodology has been used for many years on various modules which provides a structured approach to the process from beginning to end. The model resembles the framework derived by Taylor in 1968, but is much more elaborate. We outline each of the levels of the structured searching framework in this section.
4.1 Cognitive stage
This stage initiates a search. The user realizes that they do not have sufficient knowledge to undertake a particular task, or in the case of systematic review is faced with a complex information need in the form of a specific research question. The search process is often performed by an intermediary (e.g. a librarian) who may have limited knowledge of the subject area. Therefore an ability to understand the needs of the original requestor is key at this stage. Needs can be visceral (an unexpressed need) or conscious (a within brain description). For the most part, systematic review needs are conscious, as domain knowledge may be required. Clarity of self-reflection is essential here as initial thoughts about the research question will govern the direction of the search process, and further reflection during iterations of searching will assist in understanding the topic and further guiding the direction of the search.
4.2 Linguistic stage
Once the searcher has identified an area or sub-domain in which to undertake a search, some kind of linguistic description is required to identify the underlying concepts. This could take the form of a document which describes the overall need, or it could be a description of facets using schemes such as PICO – Patient, Intervention, Comparison and Outcome. Often it may involve both, with the facets defined in the document describing the information need. Other more generic facet analysis schemes include ad-hoc and PMEST – Personality, Matter, Energy, Space and Time. Whichever scheme is used, for each facet a list of terms and synonyms is identified.
4.3 Strategic stage
Assuming Boolean logic is used, the general scheme for taking data from the facet analysis is to apply an OR operator to the terms and synonyms within a facet, and then to apply the AND operator between the facets. There are three well-known search strategies for combining terms and facets: building blocks, successive fractions and citation pearl growing. We outline these schemes here, and assume access to intermediate search sets as in the examples given in Figs. 1 and 2.
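The general scheme of OR within a facet and AND between facets can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the facet names, the `build_query` helper and the quoting rule for multi-word terms are hypothetical, and the output uses a generic syntax rather than that of any particular database.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are read as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(facets):
    """OR the terms inside each facet, then AND the facets together."""
    blocks = []
    for terms in facets.values():
        blocks.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(blocks)


# Hypothetical facet analysis for the sleep walking example.
facets = {
    "cause": ["etiology", "caus*"],
    "condition": ["somnambulism", "sleep walking"],
}
print(build_query(facets))
# (etiology OR caus*) AND (somnambulism OR "sleep walking")
```

Holding the facets as data rather than as a hand-typed string makes it easier to add a synonym to one facet without disturbing the others.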
4.3.1 Building Blocks
In this method the sets for the facets are formed separately, and once this is done the final set is formed using the AND operator on the facet sets (see Fig. 3).
Set1 = etiology OR caus*
Set2 = somnambulism OR sleep
Set3 = walking OR “sleep walking”
Set4 = Set1 AND Set2 AND Set3
Fig. 3 – Building Blocks Search Strategy example
The advantage of this approach is that each set (sets 1 to 3 in Fig. 3) can be reused, and the search can continue to develop themes within each facet with no impact on other facets. However the drawback of this approach is that the search may lose focus in terms of its overall direction which may undermine its effectiveness.
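To make the set logic concrete, the building blocks strategy of Fig. 3 can be mimicked with Python sets over a toy collection. This is a sketch only: the four documents and the `matches` and `search` helpers are invented for illustration, and the matching rules (whole words, trailing `*` for truncation, spaces for phrases) are simplifying assumptions.

```python
# Toy collection: document id -> text (hypothetical data).
DOCS = {
    1: "etiology of somnambulism in children",
    2: "sleep walking and its causes",
    3: "walking as exercise for better sleep",
    4: "causes of insomnia",
}


def matches(term, text):
    """True if the term matches the text under the toy rules."""
    if term.endswith("*"):  # truncation: match any word with this prefix
        return any(w.startswith(term[:-1]) for w in text.split())
    if " " in term:  # phrase: the words must be adjacent and in order
        return f" {term} " in f" {text} "
    return term in text.split()  # single whole word


def search(*terms):
    """OR the terms together and return the set of matching document ids."""
    return {d for d, text in DOCS.items() if any(matches(t, text) for t in terms)}


# Each facet is built independently, then the facet sets are ANDed.
set1 = search("etiology", "caus*")
set2 = search("somnambulism", "sleep")
set3 = search("walking", "sleep walking")
set4 = set1 & set2 & set3

print(set1, set2, set3, set4)  # {1, 2, 4} {1, 2, 3} {2, 3} {2}
```

Because set1 to set3 do not depend on each other, any one of them can be revised and the final intersection recomputed without touching the rest.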
4.3.2 Successive Fractions
An alternative to building blocks is successive fractions, where one facet is formed first and subsequent facets are incrementally added to the set to form the final answer (see Fig. 4).
Set1 = etiology OR caus*
Set2 = Set1 AND (somnambulism OR sleep)
Set3 = Set2 AND (walking OR “sleep walking”)
Fig. 4 – Successive Fractions Search Strategy example
The searcher can start with the most general facet and refine the query from there. There are fewer steps in this method and it is more holistic, with the searcher having a clearer idea in each iteration of where the search is going. The drawback of the method is that any mistakes in earlier steps (e.g. Set3 in Fig. 4) may require the search process to be restarted (e.g. from Set1 in Fig. 4).
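The same toy approach illustrates how successive fractions differs: each step filters the result of the previous one, so the sets are chained rather than independent. As before, this is a hedged sketch. The documents and the `matches` and `narrow` helpers are hypothetical, and the matching rules are simplified.

```python
# Toy collection: document id -> text (hypothetical data).
DOCS = {
    1: "etiology of somnambulism in children",
    2: "sleep walking and its causes",
    3: "walking as exercise for better sleep",
    4: "causes of insomnia",
}


def matches(term, text):
    """Toy matching: trailing * truncates, a space means a phrase."""
    if term.endswith("*"):
        return any(w.startswith(term[:-1]) for w in text.split())
    if " " in term:
        return f" {term} " in f" {text} "
    return term in text.split()


def narrow(current, *terms):
    """Keep only documents in `current` that match at least one term."""
    return {d for d in current if any(matches(t, DOCS[d]) for t in terms)}


# Each set is a fraction of the one before it, as in Fig. 4.
set1 = narrow(set(DOCS), "etiology", "caus*")
set2 = narrow(set1, "somnambulism", "sleep")
set3 = narrow(set2, "walking", "sleep walking")

print([len(s) for s in (set1, set2, set3)])  # [3, 2, 1]
print(set3)  # {2}
```

The chaining is what makes errors costly: changing the terms used for set1 invalidates set2 and set3, so everything downstream has to be rerun.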
4.3.3 Citation Pearl Growing
In this method a known useful item is pre-identified and index terms and/or subject headings are extracted from it. The user goes through several iterations of extracting terms from records and testing them out on queries until they are happy with both the terms and their combinations. A final phase is to reuse the building blocks strategy to create the final set of results.
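The iterative loop behind citation pearl growing can be approximated in code, with the caveat that in practice the searcher decides which terms to keep. In the sketch below a simple frequency threshold stands in for that human judgement. The records, their subject headings, the `grow` function and its parameters are all hypothetical.

```python
from collections import Counter

# Hypothetical records: id -> set of subject headings.
RECORDS = {
    "pearl": {"Somnambulism", "Causality", "Child"},
    "r1": {"Somnambulism", "Sleep Stages", "Child"},
    "r2": {"Somnambulism", "Sleep Stages", "Adult"},
    "r3": {"Insomnia", "Causality"},
    "r4": {"Sports", "Child"},
}


def grow(seed_terms, rounds=2, min_count=2):
    """Repeatedly retrieve records sharing a heading with the current terms
    and adopt any heading that occurs in at least `min_count` of them."""
    terms = set(seed_terms)
    for _ in range(rounds):
        hits = [h for h in RECORDS.values() if h & terms]
        counts = Counter(heading for headings in hits for heading in headings)
        new_terms = {h for h, n in counts.items() if n >= min_count} - terms
        if not new_terms:
            break  # nothing new was found, so stop iterating
        terms |= new_terms
    return terms


# Start from one heading taken from the known useful item (the "pearl").
print(sorted(grow({"Somnambulism"})))
# ['Child', 'Sleep Stages', 'Somnambulism']
```

Note that the broad heading "Child" is adopted automatically and pulls in an unrelated record about sports, which is exactly the kind of drift that the searcher's own review of candidate terms is meant to prevent.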
4.4 Tactical stage
Within the strategies outlined above a number of tactics are available to the user. These fall into two broad groups: choice of terms and choice of operators. Choice of terms relates to the searcher’s domain knowledge, whilst choice of operators relates to their knowledge of search techniques. The choice of terms can be augmented by the use of field operators, which depends on the meta-data available (e.g. restricting a search to the AUTHOR field). Operators can be either Boolean (AND, OR and AND-NOT are the only operators available on most systems) or word based. The latter can be either proximity operators (e.g. for phrases: “sleep walking” in Figs. 3/4) or truncation/wildcard operators (e.g. caus* in Figs. 3/4). While word based operators are not strictly Boolean operators, they behave in a similar manner, e.g. by narrowing a query (proximity operators are a special case of AND) or broadening a query (truncation/wildcard operators are a special case of OR).
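The claim that word based operators behave like special cases of OR and AND can be checked on a small example. The following sketch assumes a tiny hypothetical vocabulary and two invented documents. `expand_truncation` is an illustrative helper, not a function of any real search system.

```python
# Hypothetical index vocabulary.
VOCAB = ["cause", "causes", "causal", "causality", "caution", "sleep", "walking"]


def expand_truncation(pattern, vocabulary):
    """Expand a right-truncated term into the vocabulary words it covers."""
    stem = pattern.rstrip("*")
    return [w for w in vocabulary if w.startswith(stem)]


# Truncation broadens: caus* is equivalent to an OR over the matching words.
expanded = expand_truncation("caus*", VOCAB)
print(" OR ".join(expanded))  # cause OR causes OR causal OR causality

# Proximity narrows: a phrase match is a subset of the plain AND match.
DOCS = {1: "sleep walking in adults", 2: "walking improves sleep"}
and_hits = {d for d, t in DOCS.items() if {"sleep", "walking"} <= set(t.split())}
phrase_hits = {d for d, t in DOCS.items() if "sleep walking" in t}
print(and_hits, phrase_hits)  # {1, 2} {1}
```

Document 2 contains both words but not as an adjacent phrase, so it satisfies the AND query and is excluded by the phrase query.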
4.5 Logical stage
A search strategy and its tactics are formed within some kind of logical framework – in the examples used so far we have concentrated on Boolean logic and word based extensions to that logic. This type of search is exact match logic using set retrieval, which is the dominant paradigm in systematic reviews. Another form of logic is best match in which ranked retrieval is addressed, but this is rarely relied upon exclusively in systematic reviews. We therefore address only exact match logic in this paper. Searching requires the users to utilize their knowledge of strategy, tactics and logic all together, which form the formalized need. However, logic can be considered separately in terms of learning, as we will outline later in section 5.
4.6 Syntactic stage
This is the stage where the user takes their search strategy and executes it on an operational information retrieval system – the compromised need (question as presented to the information system). Each IR system will have its own syntax, and the types of operators available and their range will vary from system to system. In most cases AND and OR are used as is, but most systems use NOT to mean AND-NOT (see set 10 in Fig. 2) which can be confusing since the Boolean operator NOT is actually a unary rather than binary operator.
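The difference between the binary NOT found in search systems and the unary NOT of Boolean logic is easy to see with sets. The sketch below uses invented document ids. The names `set8` and `set9` simply echo the "8 not 9" line in Fig. 2 and do not reproduce its actual results.

```python
# Hypothetical universe of document ids and two intermediate result sets.
ALL_DOCS = {1, 2, 3, 4, 5}
set8 = {1, 2, 3}  # e.g. trial-related records
set9 = {3, 4}     # e.g. animal-only records

# "8 not 9" as search systems interpret it: binary AND-NOT (set difference).
and_not = set8 - set9

# Unary NOT needs the whole collection: everything that is not in set9.
unary_not_9 = ALL_DOCS - set9

print(and_not)      # {1, 2}
print(unary_not_9)  # {1, 2, 5}

# AND-NOT is the intersection of the left operand with the unary NOT.
assert and_not == set8 & unary_not_9
```

Document 5 shows the distinction: it is in the complement of set9 but was never retrieved by set8, so it does not appear in "8 not 9".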
Many systems use ‘*’ and/or ‘?’ as wildcard characters, standing for a single character, a given set of characters or any number of characters up to an upper limit. Proximity operators often use quotes “” (as per Google), but many offer the ability to choose the number of characters between the terms e.g. WITHIN, NEAR. Proximity operators on blocks of text such as sentences or paragraphs are rare. A further syntactic method used often is to restrict the search to a given field (e.g. “etiology”[Subheading]), the syntax for which varies between systems (e.g. AU(name) in ProQuest Dialog). The meta-data scheme available on the IR system or the source will determine the fields available.
Finally the type of interface will often determine the type of strategy used. The search strategies described in section 4.3 require access to search sets – a typical example of this is ProQuest Dialog. Alternatively, in some cases a form-based user interface is available with columns representing terms, synonyms and phrases and rows representing the facets. In this paper we concentrate on the former, known as command line interfaces.
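Since wildcard conventions differ between systems, it can help to pin down one convention explicitly. The sketch below assumes that `*` stands for any number of word characters and `?` for exactly one, which is only one of the possible interpretations described above. `wildcard_to_regex` is a hypothetical helper built on Python's standard `re` module.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard term into a compiled regular expression.

    Assumed convention: '*' = zero or more word characters,
    '?' = exactly one word character. Real systems may differ.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))  # treat everything else literally
    return re.compile("".join(parts), re.IGNORECASE)


single = wildcard_to_regex("wom?n")
print([w for w in ["woman", "women", "womaan"] if single.fullmatch(w)])
# ['woman', 'women']

many = wildcard_to_regex("adolescen*")
print([w for w in ["adolescent", "adolescence", "adult"] if many.fullmatch(w)])
# ['adolescent', 'adolescence']
```

Writing the convention down as code makes it easier to spot where a strategy would behave differently if moved to a system with another interpretation of the same symbols.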