Searching
Definition
The search process consists of the strategy and the procedures that will be used to search for candidate primary studies (i.e., the empirical studies that will be included in the review), including search methods, search terms and resources to be searched. The resources include digital libraries, specific journals, and conference proceedings.
Methods
The two main search methods used in SE systematic reviews are:
- Keyword searches. This means searching digital libraries and indexing systems using keyword-based search strings that define the topic, usually restricting the search to the IT, SE, and CS literature. It may also be possible to define a start date for searches based on the publication date of the first paper addressing the topic, or the dates covered by a previous systematic review. If you have a set of known primary studies, the keywords can be derived from the titles and keywords used in those studies.
- Snowballing. This means searching online digital libraries and indexing systems using citation-based searching. Identifying the studies cited by a particular study as potential candidate primary studies is called backwards snowballing. Identifying the studies that have cited a particular study is called forward snowballing. Snowballing as a general search process is based on identifying a known set of primary studies and using multiple iterations of both backwards and forward snowballing to identify new candidate primary studies, until no new candidate primary studies are found. For each iteration, the new candidate primary studies are assessed using the selection process.
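The iterative snowballing process described above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the citation graph CITES, the paper labels and the is_relevant callback (standing in for the selection process) are all hypothetical, and the sketch assumes that only studies passing selection are used to drive the next iteration.

```python
from typing import Callable, Dict, Set

# Hypothetical toy citation graph: paper -> papers it cites (its reference list).
CITES: Dict[str, Set[str]] = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": set(),
    "D": set(),
    "E": {"A"},
    "F": {"X"},
    "X": set(),
}


def backward(paper: str) -> Set[str]:
    """Backwards snowballing: the studies cited by `paper`."""
    return set(CITES.get(paper, set()))


def forward(paper: str) -> Set[str]:
    """Forward snowballing: the studies that cite `paper`."""
    return {p for p, refs in CITES.items() if paper in refs}


def snowball(seeds: Set[str], is_relevant: Callable[[str], bool]) -> Set[str]:
    """Repeat backwards and forward snowballing until no new studies appear."""
    included = {s for s in seeds if is_relevant(s)}
    seen = set(seeds)          # every candidate already assessed
    frontier = set(included)   # studies whose citations are followed next
    while frontier:
        candidates: Set[str] = set()
        for paper in frontier:
            candidates |= backward(paper) | forward(paper)
        new_candidates = candidates - seen
        seen |= new_candidates
        # Selection step (assumption: only selected studies feed the next round).
        frontier = {c for c in new_candidates if is_relevant(c)}
        included |= frontier
    return included
```

In practice the backward and forward functions would be replaced by citation searches in an indexing system, and is_relevant by the review team's selection process.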
Both methods require access to digital libraries and indexing systems. For software engineering, the most commonly used digital libraries are the ACM and IEEE digital libraries and the digital libraries maintained by publishers such as Kluwer, Elsevier and Springer. Indexing systems include Scopus and Web of Science.
Other methods of searching are:
- Manual search of specific sources such as workshop and conference proceedings.
- Asking experts external to the review team whether they have some relevant as-yet-unpublished technical reports or draft studies.
- Using Google searches to look for grey literature (i.e., industry white papers, academic technical reports and theses).
- Using DBLP to search for publications of a specific author.
The review team needs to decide upon its basic search strategy (either keyword search or snowballing), and whether the requirements for completeness make it necessary to use additional search methods as well.
Iteration between Search and Selection Processes
If the main search process is based on snowballing, the entire process is iterative, with each iteration of the search process followed by an iteration of the selection process. If the main search process is based on keyword searches and the completeness requirements are stringent, the search strategy may also require one round of forward and backwards snowballing of all the primary studies identified by the keyword search process, in order to reduce the risk of missing relevant studies.
Specific Processes
The search process involves two related processes:
- Defining the Search Process
- Conducting the Search Process
Defining the Search Process
Goal
To specify the search methods and sources that will be used to identify candidate primary studies.
Input
The research questions.
The eligibility criteria.
The set of any known primary studies.
Any known secondary studies addressing similar issues.
Process specification
The review team must specify the main search processes plus any additional methods. The combination of methods should be sufficient to support the completeness goals inherent in the type of SR being planned. Quantitative and qualitative SRs have the most stringent requirements for completeness. Mapping studies and Rapid Reviews have less stringent requirements.
The review team must also specify the sources that must be searched. Scopus is a good choice for all types of SR. It indexes all the major SE journals and conferences, and usually finds frequently-cited grey literature. For keyword searches, it correctly handles strings constructed using AND, OR and NOT. For snowballing, it provides mechanisms to support forward and backwards citation searches.
Most indexing systems and digital libraries exhibit delays in including papers published in conference and workshop proceedings. So for both quantitative and qualitative SRs, it is worth performing manual searches of recent conference and workshop proceedings that address the topic of interest.
If the main search method is keyword search then the keywords need to be constructed based on:
- the research questions,
- conditions related to the research question that can be built into the search process (e.g., language restrictions, date restrictions or source restrictions),
- the titles and keywords used in known primary studies,
- the titles and keywords used in known secondary studies addressing similar issues.
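As an illustration of how keywords drawn from these inputs might be combined into a search string, the sketch below joins the synonyms for each concept with OR and the concepts with AND. The function names and the example terms are hypothetical, and the exact Boolean syntax (including how exclusions are written) differs between digital libraries and indexing systems, so the output would need adapting to each source.

```python
from typing import Iterable, List


def quote(term: str) -> str:
    """Quote multi-word terms so that they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def or_group(terms: Iterable[str]) -> str:
    """Join the synonyms for one concept with OR, inside parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_search_string(concepts: List[List[str]],
                        excluded: Iterable[str] = ()) -> str:
    """AND together one OR-group per concept; optionally exclude terms."""
    query = " AND ".join(or_group(c) for c in concepts)
    excluded = list(excluded)
    if excluded:
        # Assumption: the target source writes exclusions as AND NOT.
        query += " AND NOT " + or_group(excluded)
    return query


# Hypothetical example: one group for the topic, one for the study type.
print(build_search_string(
    [["test-driven development", "TDD"], ["experiment", "case study"]]
))
```

Keeping each concept as a separate list makes it easy to add a synonym found in a known primary study and re-run the trial searches.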
If the main search method is snowballing, then the set of seed studies is critical. Any set of known primary studies can act as a starting point, but it is important to ensure that the known studies:
- are not all from the same research group,
- include studies published in different sources,
- include studies that have authors from different geographical regions.
If there appears to be any systematic bias, other studies can be sought using Google Scholar to identify highly cited papers addressing the topic of interest.
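A simple check of a seed set against the three conditions above could look like the following sketch. The field names group, venue and region are assumptions about how the review team records its seed studies, and the check only flags the extreme case where every seed shares the same value.

```python
from typing import Dict, List


def seed_set_warnings(seeds: List[Dict[str, str]]) -> List[str]:
    """Flag seed sets whose studies all share a research group, source or region."""
    checks = (
        ("group", "research group"),
        ("venue", "publication source"),
        ("region", "geographical region"),
    )
    warnings = []
    for field, label in checks:
        distinct = {s[field] for s in seeds}
        if len(distinct) < 2:
            warnings.append(f"all seed studies share the same {label}")
    return warnings


# Hypothetical seed set: two studies from the same group.
seeds = [
    {"group": "G1", "venue": "Journal X", "region": "Europe"},
    {"group": "G1", "venue": "Conference Y", "region": "Asia"},
]
print(seed_set_warnings(seeds))
```

A non-empty list of warnings suggests that further seed studies should be sought before snowballing starts.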
Whatever the main search method that is chosen:
- Reviewers need to specify how to identify and handle different articles that report the same study. This can happen if an initial conference paper is followed by a more detailed journal article. We do not recommend only keeping the most recent version of the paper, because the more recent paper may not have reported some details already reported in the earlier paper. It is safer to keep citation information about both papers but link them to a single study identifier.
- The tool(s) that will be used to manage the citation lists must be defined and trialled.
Verification
The main search process and support tools need to be trialled:
- For keyword searches, the different combinations of keywords can be tested on different indexing systems or libraries, and the outcomes of searches compared with the set of known studies. For each trial, the reason for any missing studies should be investigated and the search strings refined. Searches can be trialled on different digital libraries and indexing systems, although Scopus is a good starting point. Keyword search trials aim to identify both the most appropriate search strings and the most appropriate sources. In addition, the procedures for identifying and removing duplicated citations of the same article should be trialled.
- For snowballing, it is the mechanics of the search process that are trialled rather than its effectiveness. In particular, the process of organizing the outcomes of each backwards and forward snowballing round should be trialled, as well as the methods to identify and eliminate duplicate citations (recognizing that not all articles format their references in exactly the same way).
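One way to trial duplicate removal is to compare citations on a normalised title plus the publication year, as in the sketch below. The dictionary fields (title, year, source) and the matching rule are assumptions for illustration; real citation data may need looser matching, or a DOI comparison where DOIs are available.

```python
import re
import unicodedata
from typing import Any, Dict, List, Tuple


def normalise_title(title: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def deduplicate(citations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one entry per article and record every source that returned it."""
    merged: Dict[Tuple[str, Any], Dict[str, Any]] = {}
    for c in citations:
        key = (normalise_title(c["title"]), c["year"])
        if key not in merged:
            merged[key] = {"title": c["title"], "year": c["year"], "sources": set()}
        merged[key]["sources"].add(c["source"])
    return list(merged.values())
```

Recording the set of sources for each merged entry also preserves the information about which source(s) detected each study.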
Output
The description of the search process including the methods and the sources reported in the protocol.
An integrated citation list of candidate primary studies obtained from trials of the search process that can be used to trial the selection process.
Conducting the Search Process
Goal
The goal is to ensure that all relevant primary studies are included in the set of candidate primary studies found by the search process.
Input
The protocol, in particular the research questions, the agreed search strategy (i.e., combination of search methods and sources to be used) and search process.
The list of known primary studies.
Process
The searches should be run according to the search strategy defined in the protocol.
If duplicate reports of the same study are found by different searches, only one entry for the study must be included in the set of candidate primary studies.
Multiple (but different) reports of the same study should be linked to each other. The study only counts as one primary study, but if the study is identified as a relevant primary study, it is possible that the required data may appear in any one of the related reports.
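Linking several reports to one study can be as simple as giving each report a shared study identifier, as sketched below. The identifiers and field names are hypothetical; the point is that every report is kept, while studies are counted once.

```python
from collections import defaultdict
from typing import Dict, List


def group_reports_by_study(reports: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each study identifier to the list of reports that describe it."""
    studies: Dict[str, List[str]] = defaultdict(list)
    for report in reports:
        studies[report["study_id"]].append(report["report_id"])
    return dict(studies)


# Hypothetical example: a conference paper and its journal extension.
reports = [
    {"report_id": "R1", "study_id": "S1", "venue": "conference"},
    {"report_id": "R2", "study_id": "S1", "venue": "journal"},
    {"report_id": "R3", "study_id": "S2", "venue": "journal"},
]
studies = group_reports_by_study(reports)
print(len(studies), "studies from", len(reports), "reports")
```

During data extraction, all the reports listed under a study identifier can then be consulted together.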
The citation information about any related secondary studies found by the searches should be kept.
Verification
The accuracy of the search process can be evaluated by assessing the percentage of known studies identified by the search process. If you have already used the set of known studies to refine the keywords, or as the starting point for snowballing, you need a separate and independent set of known studies. Such a set might be obtained from the search results of other related systematic reviews or mapping studies.
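The percentage of known studies found (often called recall or sensitivity) is straightforward to compute once studies have stable identifiers. The sketch below is illustrative; the function name and the identifiers are hypothetical.

```python
from typing import Iterable, Set, Tuple


def search_recall(found: Iterable[str], known: Iterable[str]) -> Tuple[float, Set[str]]:
    """Return the percentage of known studies found, plus the missed studies."""
    known_set = set(known)
    if not known_set:
        raise ValueError("the set of known studies must not be empty")
    hits = known_set & set(found)
    missed = known_set - hits
    return 100.0 * len(hits) / len(known_set), missed


# Hypothetical identifiers: the search found A, B, C and Z; A, B, D and E were known.
percentage, missed = search_recall({"A", "B", "C", "Z"}, {"A", "B", "D", "E"})
print(f"{percentage:.0f}% of known studies found; missed: {sorted(missed)}")
```

The list of missed studies is the starting point for investigating why each one was not found.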
Risks
The main risk of bias is that the search misses relevant studies. This is not a serious problem if a few studies are missed at random, but it can become a major problem if the missed studies are systematically different, for example if they are primarily those that report negative results. Failure to find negative or unfavourable results is a form of publication bias. It arises because journals are often less interested in negative results, so such results may not be formally published, may be published in harder-to-find sources such as national journals and conferences, or may not be reported in English.
Risk Mitigation
This risk of publication bias is addressed by four methods:
- Ensuring that a broad range of relevant digital libraries and indexing systems are searched.
- Using multiple search methods. It is common to use one method as the main search method combined with other different methods. For example, a keyword search might be combined with a single iteration of forward and backwards snowballing and direct approaches to topic experts.
- Using a set of known studies to assess the search process and revising the process until all the known studies have been found, or the reason for any missing studies is understood and does not require any revision to the search process. For example, a study might be missed because it was misclassified by the digital indexing system, so a change to the search process is not necessary (although it is good manners to report the misclassification to the digital source).
- Avoiding reliance on exclusion criteria that might unfairly restrict the number of negative empirical studies. Examples of such criteria are language restrictions, including only studies published in international journals, and ignoring grey literature.
Outputs
The final outputs from the process after all iterations are:
- Citation information for each identified candidate primary study including the title, abstract and keywords and information specifying the source(s) that detected the study.
- Citation information of all related secondary studies.
- The percentage of known studies that were found by the search process.
- An explanation of why any known studies were not found by the search process. The explanation should confirm whether failing to find the study was due to a systematic flaw in the search process.
- An explanation of any deviations from the study protocol and a discussion of any implication of the deviations for subsequent processes.