Conduct - EBSE

Conducting the Review

Goal

The goal of the conduct stage of the systematic review process is to perform the required systematic review processes as defined in the review protocol.

Scope

A common theme throughout descriptions of systematic review processes is that the individual tasks comprising each process should be conducted by several different members of the review team. Results obtained by different team members undertaking the same task can then be compared and any disagreements resolved. This has been adopted as the gold standard process for conducting systematic reviews in order to:

Minimize human errors resulting from fatigue or stress such as misunderstanding or misinterpreting a primary study report, or wrongly transcribing information or data reported in a primary study.
Reduce the risk of personal bias affecting review conclusions (people often tend to overlook or ignore information that does not agree with their own views on a topic).

If you are not intending to use the standard approach, your own approach needs to have been explained and justified in your review protocol. For example, it may be non-standard because:

The review is a Rapid Review which, in order to minimize elapsed time, is being done by a single researcher.
The review team has specialized tool support to provide independent checking for some tasks.
The review is a postgraduate study where a supervisor checks only a subset of the tasks to ensure their student performs individual tasks correctly and leaves the student to complete the tasks as a single researcher.
The review is being done by a single researcher using a test-retest approach to check various tasks.

Another challenge for conducting a review is that, except for extremely simple systematic reviews (e.g., only a few relevant papers and a small review team), there is a substantial management overhead required to organize a secondary study and to keep track of the status of all the process tasks.

For example, the search process may generate large numbers of candidate primary studies including multiple citations of the same paper, and papers that report multiple studies of the same research question (see for example, papers reporting the analysis of families of experiments). The final decision whether to include or exclude each paper or individual empirical study found by the search process requires at least two members of the review team to assess each paper and for any disagreements among them to be resolved. It is important to ensure that no candidate primary studies are lost or forgotten. It is also important to ensure that different researchers are using the eligibility criteria consistently. It is useful to calculate agreement statistics during the selection process rather than wait until all the papers have been assessed, as this may identify where the eligibility criteria are not being used consistently by different team members. In addition, review reporting standards recommend specifying how many papers entered each stage of the search and selection processes and how many were excluded in each stage. For papers that were read in full, guidelines recommend reporting the reason for rejecting each paper excluded at that stage.

Similar issues affect data extraction and the assessment of primary study quality/risk of bias because we usually require at least two team members to extract information from each primary study, and the results need to be compared and any disagreements resolved. Although the number of agreed primary studies will be considerably fewer than the number of candidate primary studies found by the search process, we still need to ensure that all primary studies are properly processed. Again, agreement statistics are best calculated while the process is ongoing in order to check that different reviewers are using the data extraction forms consistently.

These issues mean that, unless your systematic review is small in terms both of candidate primary studies and team members, one team member must act as the team leader. Furthermore, they will need some tools to support the management process ranging from spreadsheets and citation management systems to comprehensive database systems tailored to SRs. The larger the scope of the SR the more important such tools become.

The processes included in the conduct stage of the review are:

Search
Selection
Data Extraction
Critical Appraisal Process (also termed Quality Assessment or Risk of Bias Assessment)
Data Analysis and Synthesis
Assessing the Strength of Evidence (also referred to as Certainty Assessment)

Outcomes

The outcomes of the conduct stage are the outcomes of each process including information about the validation processes used within each process and any deviations from the original review protocol.

This page was updated in July 2023.

Searching

Definition

The search process consists of the strategy and the procedures that will be used to search for candidate primary studies (i.e., the empirical studies that will be included in the review), including search methods, search terms and resources to be searched. The resources include digital libraries, specific journals, and conference proceedings.

Methods

The two main search methods used in SE systematic reviews are:

Keyword searches. This means searching digital libraries and indexing systems using keyword-based search strings that define the topic, and usually restricting the search to the IT, SE, and CS literature. It may also be possible to define a start date for searches based on the publication date of the first paper addressing the topic, or the dates covered by a previous systematic review. If you have a set of known primary studies, the keywords can be derived from the titles and keywords used in those studies.
Snowballing. This means searching online digital libraries and indexing systems using citation-based searching. Identifying the studies cited by a particular study as potential candidate primary studies is called backwards snowballing. Identifying the studies that have cited a particular study is called forward snowballing. Snowballing as a general search process is based on identifying a known set of primary studies and using multiple iterations of both backwards and forward snowballing to identify new candidate primary studies, until no new candidate primary studies are found. For each iteration, the new candidate primary studies are assessed using the selection process.

Both methods require access to digital libraries and indexing systems. For software engineering, the most commonly used digital libraries are the ACM and IEEE digital libraries and the digital libraries maintained by publishers such as Kluwer, Elsevier and Springer. Indexing systems include Scopus and Web of Science.

Other methods of searching are:

Manual search of specific sources such as workshop and conference proceedings.
Asking experts external to the review team whether they have some relevant as-yet-unpublished technical reports or draft studies.
Using Google searches to look for grey literature (i.e., industry white papers and academic technical reports and theses).
Using DBLP to search for publications of a specific author.

The research team needs to decide upon their basic search strategy (either keyword search or snowballing), and whether the requirements for completeness make it necessary to use additional search methods as well.

Iteration between Search and Selection Processes

If the main search process is based on snowballing, the entire process is iterative, with each iteration of the search process followed by an iteration of the selection process. If the main search process is based on keyword searches, then to meet stringent completeness requirements the search strategy may also require one round of forward and backwards snowballing of all the primary studies identified by the keyword search process, in order to reduce the risk of missing relevant studies.
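The iterative snowballing loop described above can be sketched in a few lines. This is a minimal illustration only: the functions references_of, citations_of and is_eligible are hypothetical stand-ins for a backwards citation lookup, a forward citation lookup and the team's selection process, none of which are specified by these guidelines.

```python
from typing import Callable, Iterable, Set


def snowball(
    seeds: Iterable[str],
    references_of: Callable[[str], Set[str]],  # backwards snowballing (assumed lookup)
    citations_of: Callable[[str], Set[str]],   # forward snowballing (assumed lookup)
    is_eligible: Callable[[str], bool],        # stand-in for the selection process
) -> Set[str]:
    """Iterate search and selection until no new primary studies are found."""
    included = set(seeds)
    seen = set(seeds)       # every candidate assessed so far, included or not
    frontier = set(seeds)   # primary studies to snowball from in this iteration
    while frontier:
        candidates: Set[str] = set()
        for study in frontier:
            candidates |= references_of(study) | citations_of(study)
        candidates -= seen              # only assess candidates not seen before
        seen |= candidates
        frontier = {c for c in candidates if is_eligible(c)}
        included |= frontier
    return included
```

Keeping a separate set of every candidate already assessed means that a rejected study is not re-assessed when it reappears in a later round.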

Specific Processes

The search process involves two related processes:

Defining the Search Process
Conducting the Search Process

Defining the Search Process

Goal

To specify the search methods and sources that will be used to identify candidate primary studies.

Input

The research questions.

The eligibility criteria.

The set of any known primary studies.

Any known secondary studies addressing similar issues.

Process specification

The review team must specify the main search processes plus any additional methods. The combination of methods should be sufficient to support the completeness goals inherent in the type of SR being planned. Quantitative and qualitative SRs have the most stringent requirements for completeness. Mapping studies and Rapid Reviews have less stringent requirements.

The review team must also specify the sources that must be searched. Scopus is a good choice for all types of SR. It indexes all the major SE journals and conferences, and usually finds frequently-cited grey literature. For keyword searches, it correctly handles strings constructed using AND, OR and NOT. For snowballing, it provides mechanisms to support forward and backwards citation searches.

Most indexing systems and digital libraries exhibit delays in including papers published in conference and workshop proceedings. So for both quantitative and qualitative SRs, it is worth performing manual searches of recent conference and workshop proceedings that address the topic of interest.

If the main search method is keyword search then the keywords need to be constructed based on:

the research questions,
conditions related to the research question that can be built into the search process (e.g., language restrictions, date restrictions or source restrictions),
the titles and keywords used in known primary studies,
the titles and keywords used in known secondary studies addressing similar issues.
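As an illustration of how such keywords might be assembled into a search string, the sketch below joins synonyms with OR and joins concept groups with AND. The function name, the quoting rule and the example terms are assumptions made for illustration; each digital library has its own query syntax, so the output would normally need adapting per source.

```python
def build_search_string(term_groups, excluded_terms=()):
    """Combine groups of synonyms into a generic Boolean search string.

    term_groups: list of lists; terms inside a group are alternatives (OR),
    and every group must match (AND). excluded_terms are attached with NOT.
    """
    def quote(term):
        # Quote multi-word phrases so they are searched as a unit.
        return f'"{term}"' if " " in term else term

    clauses = ["(" + " OR ".join(quote(t) for t in group) + ")"
               for group in term_groups]
    query = " AND ".join(clauses)
    if excluded_terms:
        query += " AND NOT (" + " OR ".join(quote(t) for t in excluded_terms) + ")"
    return query


# Hypothetical terms for a review of test-driven development studies.
print(build_search_string([["test-driven development", "TDD"],
                           ["experiment", "case study"]]))
```

Keeping the term groups as data rather than as a hand-written string makes it easier to record each trial variant of the search string in the protocol.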

If the main search method is snowballing, then the set of seed studies is critical. Any set of known primary studies can act as a starting point, but it is important to ensure that the known studies:

are not all from the same research group,
include studies published in different sources,
include studies that have authors from different geographical regions.
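A simple check of these three conditions could look like the sketch below. The SeedStudy fields are hypothetical; in practice the research group and region would have to be coded by hand from author affiliations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedStudy:
    title: str
    research_group: str   # assumed to be coded manually by the review team
    source: str           # journal or conference name
    region: str           # geographical region of the authors


def seed_set_warnings(seeds):
    """Return a warning for each attribute on which all seed studies coincide."""
    checks = [("research_group", "research group"),
              ("source", "publication source"),
              ("region", "geographical region")]
    warnings = []
    for attribute, label in checks:
        distinct = {getattr(s, attribute) for s in seeds}
        if len(distinct) < 2:
            warnings.append(f"all seed studies share the same {label}")
    return warnings
```

An empty list of warnings does not prove the seed set is unbiased; it only shows that none of the three obvious problems is present.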

If there appears to be any systematic bias, other studies can be sought using Google Scholar to identify highly cited papers addressing the topic of interest.

Whatever main search method is chosen:

Reviewers need to specify how to identify and handle different articles that report the same study. This can happen if an initial conference paper is followed by a more detailed journal article. We do not recommend only keeping the most recent version of the paper, because the more recent paper may not have reported some details already reported in the earlier paper. It is safer to keep citation information about both papers but link them to a single study identifier.
The tool(s) that will be used to manage the citation lists must be defined and trialled.
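One way to keep every report while counting the study only once is a register keyed by study identifier. The sketch below is illustrative only; the class name, method names and identifiers are invented here and do not correspond to any particular citation management tool.

```python
from collections import defaultdict


class StudyRegister:
    """Links each citation (report) to a single study identifier."""

    def __init__(self):
        self._reports_by_study = defaultdict(list)

    def link(self, study_id, citation):
        """Attach a report to a study, ignoring exact repeats."""
        if citation not in self._reports_by_study[study_id]:
            self._reports_by_study[study_id].append(citation)

    def reports(self, study_id):
        """All reports recorded for a study, in the order they were linked."""
        return list(self._reports_by_study.get(study_id, []))

    def study_count(self):
        """Number of distinct studies, regardless of how many reports each has."""
        return len(self._reports_by_study)


register = StudyRegister()
register.link("S01", "Conference paper, 2019")        # hypothetical citations
register.link("S01", "Extended journal article, 2021")
register.link("S02", "Workshop paper, 2020")
```

With this structure, data extraction can consult both the conference and the journal version of study S01 while the review still reports two primary studies.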

Verification

The main search process and support tools need to be trialled:

For keyword searches, the different combinations of keywords can be tested on different indexing systems or libraries, and the outcomes of searches compared with the set of known studies. For each trial, the reason for any missing studies should be investigated and the search strings refined. Searches can be trialled on different digital libraries and indexing systems, although Scopus is a good starting point. Keyword search trials aim to identify both the most appropriate search strings and the most appropriate sources. In addition, the procedures for identifying and removing duplicated citations of the same article should be trialled.
For snowballing, it is the mechanics of the search process that are trialled rather than its effectiveness. In particular, the process of organizing the outcomes of each backwards and forwards snowballing round should be trialled, as well as the methods to identify and eliminate duplicate citations (recognizing that not all articles identify their references using exactly the same format).
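A duplicate-removal procedure of the kind to be trialled might be sketched as follows. The matching rule (DOI when present, otherwise normalised title plus year) and the dictionary field names are assumptions for illustration; real citation exports vary, and a record with a DOI will not be matched to a copy of the same article that lacks one.

```python
import re
import unicodedata


def normalise_title(title):
    """Lower-case a title, strip accents and collapse punctuation to spaces."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def citation_key(citation):
    """Prefer the DOI as the identity of an article; fall back to title and year."""
    doi = (citation.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    return "title:" + normalise_title(citation["title"]) + "|" + str(citation.get("year", ""))


def deduplicate(citations):
    """Keep one entry per article and merge the sources that found it."""
    unique = {}
    for citation in citations:
        key = citation_key(citation)
        sources = set(citation.get("found_by", ()))
        if key in unique:
            unique[key]["found_by"] |= sources
        else:
            entry = dict(citation)
            entry["found_by"] = sources
            unique[key] = entry
    return list(unique.values())
```

Merging the found_by sets rather than discarding the duplicate preserves the information about which sources detected each study.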

Output

The description of the search process including the methods and the sources reported in the protocol.

An integrated citation list of candidate primary studies obtained from trials of the search process that can be used to trial the selection process.

Conduct of the Search Process

Goal

The goal is to ensure that all relevant primary studies are included in the set of candidate primary studies found by the search process.

Input

The protocol, in particular the research questions, the agreed search strategy (i.e., combination of search methods and sources to be used) and search process.

The list of known primary studies.

Process

The searches should be run according to the search strategy defined in the protocol.

If duplicate reports of the same study are found by different searches, only one entry for the study must be included in the set of candidate primary studies.

Multiple (but different) reports of the same study should be linked to each other. The study only counts as one primary study, but if the study is identified as a relevant primary study, it is possible that the required data may appear in any one of the related reports.

The citation information about any related secondary studies found by the searches should be kept.

Verification

The accuracy of the search process can be evaluated by assessing the percentage of known studies identified by the search process. If you have already used the set of known studies to refine the keywords, or to act as the starting point of a snowballing search, you need a separate and independent set of known studies. Such a set might be obtained from the search results of other related systematic reviews or mapping studies.
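The percentage of known studies found is a simple set calculation. The sketch below assumes each study has a stable identifier (the identifiers shown are hypothetical) and also returns the missing studies, since the reason for each miss has to be explained.

```python
def known_study_recall(known_ids, found_ids):
    """Return (percentage of known studies found, sorted list of missed studies)."""
    known = set(known_ids)
    if not known:
        raise ValueError("the set of known studies is empty")
    missing = known - set(found_ids)
    percentage = 100.0 * (len(known) - len(missing)) / len(known)
    return percentage, sorted(missing)


# Hypothetical identifiers: three of the four known studies were retrieved.
percentage, missing = known_study_recall(
    known_ids={"S1", "S2", "S3", "S4"},
    found_ids={"S1", "S2", "S4", "S9"},
)
print(percentage, missing)  # 75.0 ['S3']
```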

Risks

The main risk of bias is that the search misses relevant studies. This is not such a serious problem if a few random studies are missed, but it can lead to a major problem if the missed studies are not random, for example if they are primarily those that report negative results. Failure to find negative or unfavourable results is a form of publication bias that arises because journals are often less interested in negative results, so negative results may not be formally published, or may be published in harder to find sources such as national journals and conferences, and may not be reported in English.

Risk Mitigation

This risk of publication bias is addressed by four methods:

Ensuring that a broad range of relevant digital libraries and indexing systems are searched.
Using multiple search methods. It is common to use one method as the main search method combined with other different methods. For example, a keyword search might be combined with a single iteration of forward and backwards snowballing and direct approaches to topic experts.
Using a set of known studies to assess the search process and revising the process until all the known studies have been found, or the reason for any missing studies is understood and does not require any revision to the search process. For example, a study might be missed because it was misclassified by the digital indexing system, so a change to the search process is not necessary (although it is good manners to report the misclassification to the digital source).
Avoiding reliance on exclusion criteria that might unfairly restrict the number of negative empirical studies such as language restrictions, including only studies published in international journals, and ignoring grey literature.

Outputs

The final outputs from the process after all iterations are:

Citation information for each identified candidate primary study including the title, abstract and keywords and information specifying the source(s) that detected the study.
Citation information of all related secondary studies.
The percentage of known studies that were found by the search process.
An explanation of why any known studies were not found by the search process. The explanation should confirm whether failing to find the study was due to a systematic flaw in the search process.
An explanation of any deviations from the study protocol and a discussion of any implication of the deviations for subsequent processes.

Selection Process

Definition

The selection process involves assessing each candidate primary study found by the search process for inclusion in, or exclusion from, the set of primary studies that are to be used in a systematic review, mapping study, tertiary study, or rapid review.

Methods

Several different methods of assessing candidate studies have been used in systematic reviews. The process recommended by most SR guidelines is for two or more members of the review team to work independently in order to assess each candidate primary study in each stage of the selection process.

Other less rigorous methods can also be used:

A method that is popular among graduate students who have only a supervisor as a second reviewer is for both reviewers to assess a random sample of the candidate primary studies, apply the eligibility criteria and assess their agreement statistics. Any disagreements are discussed and clarified, and then the student assesses all the remaining candidate studies. In this case additional processes can be used to reduce some of the inherent risks. For example, the supervisor can be asked to confirm the exclusion decision for any studies that were excluded after reviewing the full text.
If only a single reviewer is available, a test-retest method can be used. The reviewer assesses each study. Then, they wait for several days and repeat their assessments (preferably changing the order in which they assess specific studies). Agreement statistics can then be assessed and problems with the clarity of eligibility criteria identified. Any candidate studies where consecutive assessments disagree need to be assessed in more detail until a final decision is made. Note this approach can also be used when a graduate student assesses the majority of the candidate primary studies.

Healthcare SR guidelines also suggest the use of text analysis tools. Such tools need to be trained on a selection of candidate primary studies found by the search process which have had their eligibility status agreed by two members of the review team. The test set must include both included and rejected candidate primary studies. The construction of the test set must be done carefully if the proportions of ineligible and eligible studies are very unbalanced. Tools are useful to verify assessment decisions if there are a large number of candidate primary studies, and to reduce the risks that arise when single researchers assess some or all of the candidate primary studies.

Selection Process Stages

The selection process is usually performed in several stages, for example:

Each candidate primary study is assessed based on its title, abstract and key words. It is classified as excluded or not excluded (the second category is both for studies that the assessor believes should be included and also those studies where the assessor is unsure). The reason for exclusion should be reported.
Each remaining candidate primary study is re-assessed based on the full text of the article. In this step, assessors need to specify whether a candidate primary study should be excluded or included.

Occasionally, a preliminary stage is used to eliminate studies based only on the title and key words.

Usually two assessors are allocated to the task of assessing each candidate primary study in each stage. If the assessors disagree, the disagreement must be resolved. In the first stage, if there is no consensus for eliminating the study it should progress to the second stage. In the second stage, there should be consensus either for including or excluding the study.
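The two decision rules above can be written down explicitly. This sketch is a simplification: the decision labels are invented for illustration, and stage one treats any lack of unanimous exclusion as a reason to progress, without modelling the discussion between assessors.

```python
def stage_one_outcome(decisions):
    """Title/abstract stage: exclude only if every assessor votes to exclude."""
    if all(d == "exclude" for d in decisions):
        return "excluded"
    return "progress_to_full_text"


def stage_two_outcome(decisions):
    """Full-text stage: a final decision needs consensus either way."""
    if all(d == "include" for d in decisions):
        return "included"
    if all(d == "exclude" for d in decisions):
        return "excluded"
    return "resolve_disagreement"


print(stage_one_outcome(["exclude", "not_excluded"]))  # progress_to_full_text
print(stage_two_outcome(["include", "exclude"]))       # resolve_disagreement
```

The asymmetry is deliberate: wrongly excluding a relevant study early is harder to detect than carrying a doubtful study forward to full-text assessment.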

Iteration between Search and Selection Processes

If the main search process was based on snowballing, the selection process is iterative, with each iteration of the selection process that identifies new primary studies being followed by an iteration of the search process (until no new primary studies are found). For example, the studies used in the second iteration are the additional primary studies found after the selection process is applied to the candidate primary studies found by the first round of snowballing.

If keyword searches were used and it is critical to identify all relevant studies, a snowballing process is often used as a secondary method. In this case, snowballing is based on all of the known primary studies plus all the primary studies found by applying the selection process to the candidate primary studies found by the keyword searches. After the selection process has assessed all of the candidate primary studies found by keyword searches, the search process is restarted to perform one round of forward and backwards snowballing on all the identified primary studies. The candidate studies found by the round of snowballing are then assessed by re-starting the selection process.

Tools

In most cases, tools are needed to organize the process effectively. They are used to maintain information about the status of each candidate primary study, to monitor the process progress, and to provide the information that should be included in the SR report (i.e., agreement statistics, and information about the number of candidate primary studies entering each stage of the process and the number eliminated in each stage). The larger the number of candidate primary studies and review team members the more important it is:

To use database tools rather than spreadsheets and bibliographic tools to help manage the process.
To have a team leader responsible for overseeing the progress of the selection process and organizing the process used to resolve disagreements.

Verification

Verification methods include:

Additional assessment of all candidate studies that were excluded after the full text was assessed by team members.
Using text analysis tools to provide an independent method of assessing study eligibility to compare with the assessments made by team members.
Using citation and visualization tools to investigate the relationships among studies and the final classification of the studies. This can be used to identify studies that are assessed differently to other similar studies and that should therefore be re-assessed.

Selection Process Management Issues

The tools used to store candidate primary study citation information and information collected during the selection process must be agreed.

The method for allocating review team members to specific candidate primary studies must be defined. Random allocation is a reasonable method, but:

Avoid asking team members to assess candidate primary studies that they, themselves, authored.
Ensure the workload is fairly shared among members of the review team.
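A random allocation that respects both constraints might be sketched as below. The greedy rule (choose the least-loaded eligible reviewers, breaking ties at random) is an assumption chosen for simplicity, not a method prescribed by these guidelines, and the study and reviewer names are hypothetical.

```python
import random


def allocate_reviewers(studies, reviewers, per_study=2, seed=None):
    """Randomly allocate reviewers to studies.

    studies: dict mapping study id -> set of author names.
    Reviewers never assess a study they authored, and the least-loaded
    eligible reviewers are preferred so that workload stays roughly even.
    """
    rng = random.Random(seed)
    workload = {reviewer: 0 for reviewer in reviewers}
    allocation = {}
    study_ids = list(studies)
    rng.shuffle(study_ids)                      # random processing order
    for study_id in study_ids:
        eligible = [r for r in reviewers if r not in studies[study_id]]
        if len(eligible) < per_study:
            raise ValueError(f"not enough independent reviewers for {study_id}")
        rng.shuffle(eligible)                   # random tie-breaking
        eligible.sort(key=lambda r: workload[r])  # stable sort keeps the random order among ties
        chosen = eligible[:per_study]
        for reviewer in chosen:
            workload[reviewer] += 1
        allocation[study_id] = chosen
    return allocation, workload
```

Passing a fixed seed makes the allocation reproducible, which helps if the allocation needs to be documented in the SR report.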

The agreement statistics must be defined, together with when they will be collected, and how the information will be used.
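These guidelines do not prescribe a particular statistic. As one common choice for two reviewers making categorical decisions, Cohen's kappa corrects the observed agreement for the agreement expected by chance. The sketch below is a plain implementation under that assumption; the decision labels are illustrative.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same studies in the same order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must rate the same non-empty set of studies")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = freq_a.keys() | freq_b.keys()
    # Chance agreement: probability both reviewers pick the same category independently.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical category throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["include", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude"]
print(cohen_kappa(reviewer_1, reviewer_2))  # 0.5
```

Running the calculation on the assessments completed so far, rather than only at the end, is what allows inconsistent use of the eligibility criteria to be caught early. The same function can be applied to the two rounds of a single reviewer's test-retest assessments.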

Methods of resolving disagreements between reviewers must be agreed. They include:

Discussions between the reviewers until an agreement is reached.
Allocating the disputed candidate primary study to another independent reviewer and accepting the consensus decision.
Trialling the data collection process on the paper – if it does not contain the required information it can be excluded from the review.

Usually option 1 is adopted and if the original reviewers cannot come to an agreement, the team leader needs to activate options 2 or 3.

All verification activities should be specified and agreed.

Specific Processes

The Selection Process involves two related processes:

Defining the Selection Process
Conducting the Selection Process

Defining the Selection Process

Goal

To specify the process that will be used to screen the candidate primary studies for inclusion in the SR.

Input

The agreed search process strategy.

The results of any trials of the search process.

Method

The review team members need to agree:

Which team member will act as co-ordinator/leader of the selection process.
How many steps will be used in the selection process.
How many review team members will assess each candidate primary study in each step.
How review team members will be allocated to the process.
Which agreement statistics will be used, when they will be collected and how they will be used. For example, identifying the agreement achieved partway through the selection process is one method of checking whether the selection process is properly understood by the team members.
How to process candidate primary studies that are themselves secondary studies. Good practice is to put such studies aside for purposes of validation (i.e., seeing whether they have identified any relevant primary studies missed by the search process, or have any recommendations for search terms) and reporting (i.e., as related work and to compare and contrast results).
How to process candidate primary studies that report more than one empirical study. Each study needs to be given a separate identifier and independently assessed for inclusion/exclusion.
Which verification methods will be used.

Risks

Risk 1: The selection process may be used inconsistently by different team members.

Risk 2: The agreement process can be time consuming.

Risk Mitigation

To address risk 1: The selection process should be trialled on studies found by trials of the search process.

To address risk 2:

Use tools to monitor the status of each candidate primary study.
In the second stage, team members should identify the reason for rejecting a primary study during their initial assessment.

Output

The selection process definition reported in the SR protocol.

The results of trials of the selection process.

Conducting the Selection Process

Goal

To ensure that each and every candidate primary study is correctly classified as suitable or not for inclusion in the set of primary studies.

Input

A list of candidate primary studies including their citation information and abstract.

The protocol, in particular, the inclusion and exclusion criteria, methods for handling documents that report two or more different empirical studies, methods for handling documents authored by review team members, methods used for assessing candidate studies and the process steps to be adopted.

Process

The selection process should be conducted and organized as specified in the SR protocol.

Any deviations from the specified process should be recorded and documented in the SR report.

Risks

R1: The main risk of bias during selection is misclassifying studies that should be included as excluded. Including a study that should be excluded is also a risk but not quite as serious, because it is likely to be detected during data collection. Misclassification is mainly due to:

Human error which can be caused by fatigue, misunderstandings or mis-transcription, or misleading reporting by the authors of primary studies (e.g., misleading titles, unclear abstracts, invalid keywords, or unjustified claims).
Personal biases on the part of team members.
Ambiguities or errors in the inclusion/exclusion criteria.

R2: The complexity involved in managing a multi-step, multi-person process can be time-consuming and error-prone.

Risk Mitigation

Addressing R1:

Risk associated with human errors can be addressed by requiring two or more team members to independently assess each candidate primary study and employing a well-defined procedure for handling disagreements.
Risk associated with personal biases is addressed by ensuring that team members do not assess studies that they themselves authored.

Problems can be identified by monitoring agreement statistics at various times during the process. If agreement statistics are poor, the reasons should be investigated and then addressed either by refining the eligibility criteria or by giving team members additional help with interpreting them. Team members should be encouraged to report any problems with the eligibility criteria, since it is possible that some aspects of eligibility were not anticipated when the protocol was developed.

Addressing R2: Use appropriate tools. For small-scale reviews a spreadsheet or a bibliographic tool may be sufficient, but for large-scale reviews with many reviewers, a special purpose systematic review tool may be needed.

Outputs

The final inclusion/exclusion decision about each candidate primary study. The reason for excluding a candidate primary study and the stage in the process at which it was excluded should be reported.

Agreement statistics should be reported for the first time that assessments from different review members are compared for each different stage in the process.

The selection process should be represented graphically as a flow chart showing the number of candidate studies entering each stage of the process and the numbers excluded by that stage.
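The numbers needed for such a flow chart can be derived directly from the status recorded for each candidate study. The sketch below assumes each record stores the stage at which the study was excluded (or None if it was included); the field name and stage names are illustrative.

```python
def selection_flow(records, stages):
    """Count studies entering and excluded at each stage, in stage order.

    records: list of dicts with an "excluded_at" entry holding a stage name,
    or None for studies that were finally included.
    Returns (rows, included) where each row is (stage, entering, excluded).
    """
    remaining = len(records)
    rows = []
    for stage in stages:
        excluded = sum(1 for r in records if r["excluded_at"] == stage)
        rows.append((stage, remaining, excluded))
        remaining -= excluded
    return rows, remaining


records = ([{"excluded_at": "title/abstract"}] * 6
           + [{"excluded_at": "full text"}]
           + [{"excluded_at": None}] * 3)
rows, included = selection_flow(records, ["title/abstract", "full text"])
for stage, entering, excluded in rows:
    print(f"{stage}: {entering} entered, {excluded} excluded")
print(f"included: {included}")
```

Deriving the counts from the stored records, rather than tallying them by hand, also provides a check that no candidate study has been lost between stages.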

An explanation of any deviations from the study protocol and a discussion of any implication of the deviations for subsequent processes.

Data Definition and Data Extraction

Definition

The data definition and extraction process identifies and defines which data items are required to address the SR research questions, and specifies how the required data will be extracted.

Scope

SR data falls into two broad categories:

Data required for all forms of SR and all research questions.
Data required to answer the specific research questions.

Data required for all types of SR include:

Citation information (e.g., author names, title, publication source, publication date)
Goal of study
Study methodology (as reported by the study authors, and as assessed by the reviewers)
Study context (e.g., industry/academia/mixed)
The object(s) of study (e.g., participant types, products/component types, software organization)
Main conclusions

The data required to answer the research questions depend on the type of SR.

For mapping studies, the research questions are concerned with characterising the study, so are usually based on a classification scheme. Two useful classification schemes are:

A set of criteria including Problem investigation, Solution Design, Solution validation, Solution selection, Solution implementation, and Implementation evaluation, as proposed by Wieringa et al. (Wieringa, R., Maiden, N., Mead, N. and Rolland, C., 2006, Requirements engineering paper classification and evaluation criteria: A proposal and a discussion, Requirements Engineering, 11(1), pp. 102-107). These criteria were developed in the context of requirements engineering but have been adopted by many SR mapping studies, after being recommended by Petersen et al. (Kai Petersen, Robert Feldt, Shahid Mujtaba et al., 2008, Systematic Mapping Studies in Software Engineering, DOI: 10.14236/ewic/EASE2008.8).
A set of criteria including Laboratory experiments, Experimental simulations, Field experiments, Field studies, Computer simulations, Formal theory, Sample studies, and Judgement studies proposed by Stol and Fitzgerald (Klaas-Jan Stol and Brian Fitzgerald, 2018, The ABC of Software Engineering Research, ACM Transactions on Software Engineering and Methodology, 27(3), https://doi.org/10.1145/3241743).

A critical issue is being able to correctly identify empirical field studies because practitioners usually regard such studies as providing more credible evidence than empirical studies conducted in an academic setting.

For quantitative SRs, research questions are usually about comparing the effects of alternative SE methods, procedures, or tools on software development, management or maintenance in terms of effort, time, quality, or productivity. SRs need to define and extract the data items used to represent those properties.

For qualitative SRs, research questions are usually about the benefits and risks associated with a specific SE method, procedure, or tool. Qualitative SRs often extract lists of the benefits and risks reported in the primary studies together with their definitions. If the primary study is an opinion survey, it should also report the number of respondents that reported specific benefits and risks, and this information also needs to be extracted. A problem with extracting information from qualitative studies (whether extraction is manual or supported by text analysis tools) is that authors of primary studies are not always clear whether they are reporting their own results or results found by other related studies. However, for aggregation purposes (i.e., to avoid double-counting), it is critical that the data associated with a primary study relate only to the outcomes of that specific study. You should ensure that your protocol includes mechanisms to address this issue.

Some SRs may include primary studies of different types. In such cases, a different data extraction form may be required for each type of study.

The data definition and collection process involves three processes:

Specifying the Required Data
Defining the Data Extraction Process
Extracting the Data

Specifying the Data

Goal

To identify and define the data to be extracted from each primary study in order to answer the SR research question(s).

Input

The SR research questions.

The set of known primary studies.

The type of SR being planned.

Method

The research team member responsible for producing the protocol should review the set of known primary studies:

To help specify the basic primary study data that will be collected.
To assess whether several different data collection forms are likely to be required.

Then:

Use the research questions to identify the data that must be extracted from each primary study.
Construct one or more data collection forms which should include all necessary data definitions and data extraction guidelines.

Note that citation information is usually exported from search engines rather than extracted manually.

The format of the data collection form needs to be defined. Options include:

Paper forms
Spreadsheets
Online forms
Databases

The format and storage of extracted data should be designed to support the requirements of the planned data synthesis process. In particular, manual transcription from the data form to another medium for analysis should be avoided. Storage of the data forms should allow for multiple versions of the data form for each primary study (i.e., independent extractions from two or more members of the review team and the final agreed version).
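As an illustration of these storage requirements, the following minimal sketch keeps every independent extraction alongside the final agreed version and exports the agreed data directly to CSV, so nothing is retyped by hand. The class names, field names, and the export_agreed helper are hypothetical and are not prescribed by these guidelines.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExtractionForm:
    extractor: str              # reviewer who filled in this version
    values: Dict[str, str]      # data item name -> extracted value


@dataclass
class StudyRecord:
    study_id: str
    independent: List[ExtractionForm] = field(default_factory=list)
    agreed: Optional[ExtractionForm] = None   # set once disagreements are resolved


def export_agreed(records, items):
    """Write the agreed version of each study as CSV text for the synthesis step."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["study_id"] + list(items))
    for record in records:
        if record.agreed is None:
            continue  # extraction for this study has not been agreed yet
        row = [record.agreed.values.get(item, "") for item in items]
        writer.writerow([record.study_id] + row)
    return buffer.getvalue()
```

Keeping the independent versions also means agreement statistics can be recalculated later from the stored forms.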

Verification

All team members need to trial the data extraction form(s). At least two members of the review team should manually extract data specified on the form(s) for a specific known study, and all different types of primary study should be trialled. Problems reported by the reviewers, and disagreements among the data extracted by reviewers should be discussed (preferably in a meeting attended by all team members). Then, the data forms and data definitions should be amended if necessary.

Risks

The major risk is that the known set of primary studies is not representative of the variety of primary studies that will be found by the search process. Newly found primary studies may include results reported in a format that was not anticipated when the data extraction process was defined.

Risk Mitigation

Ensure the known primary studies come from different research groups and were published in different venues, including recent topic-specific conferences. If the set of primary studies appears insufficient, look for other relevant studies (by manual inspection of recent conferences, forward snowballing of highly cited known primary studies, or by asking a topic expert to recommend some studies).

Outcome

Data extraction forms with data definitions reported in the study protocol.

Defining the Data Extraction Process

Goal

To define how the data collection process will be organized.

Input

The data forms and associated guidelines.

Process

The research team member responsible for producing the protocol should specify the data extraction process. Normal best practice for data extraction involves the following.

At least two review team members are assigned to extract the data from each primary study.
The data forms for a specific primary study must be reviewed and any disagreements resolved.
The process for resolving differences should be defined. Usually this is based on discussion among the data extractors with the option to involve another team member in the event that the differences cannot be resolved.

If the review team intends to use a text analysis tool to assist data extraction:

The process for training the tool must be defined.
The team members must understand how to use the tool.

If other less rigorous methods of organizing the data collection process are planned, they must be defined and justified.

The protocol should define how many review team members will be assigned to each primary study and specify the method used to organize the assignment. Simple random assignment is seldom the best option. The assignment process should ensure that:

The workload is fairly distributed among the review team members.
The reviewers assigned to a specific primary study are not all novices with respect to performing an SR.
While considering the first two issues, reviewers should be assigned to data extraction for primary studies they assessed during the selection process.
Unless there is no other option, reviewers should not be asked to extract data from their own primary studies.
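The assignment constraints above can be sketched as a simple greedy procedure. This is only one possible heuristic, not a method defined by these guidelines: it balances workload first, uses prior involvement in selection as a tie-breaker, excludes authors of the study, and swaps in an experienced reviewer if the chosen group would otherwise consist only of novices. The function name and the dictionary keys (authors, selected_by, experienced) are assumptions made for illustration.

```python
def assign_reviewers(studies, reviewers, per_study=2):
    """Greedy assignment of reviewers to primary studies.

    studies:   list of dicts with keys "id", "authors" (set of reviewer names)
               and "selected_by" (set of reviewers who assessed the study
               during selection).
    reviewers: dict mapping reviewer name -> {"experienced": bool}.
    """
    load = {name: 0 for name in reviewers}
    assignment = {}
    for study in studies:
        # Reviewers should not extract data from their own primary studies.
        eligible = [r for r in reviewers if r not in study["authors"]]
        # Lowest workload first; prior selection involvement breaks ties.
        eligible.sort(key=lambda r: (load[r], r not in study["selected_by"], r))
        chosen = eligible[:per_study]
        # Avoid an all-novice group when an experienced reviewer is available.
        if chosen and not any(reviewers[r]["experienced"] for r in chosen):
            experienced = [r for r in eligible if reviewers[r]["experienced"]]
            if experienced:
                chosen[-1] = experienced[0]
        for r in chosen:
            load[r] += 1
        assignment[study["id"]] = chosen
    return assignment
```

A real protocol might weight these constraints differently, for example by allowing a small workload imbalance in order to keep reviewers on studies they already know.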

The agreement statistics that will be collected should be specified, together with guidelines for when they will be calculated, and an explanation of how the values will be used.

Risks & Risk Mitigation

See Specifying the Data.

Verification

The data extraction forms should be used to try out the data extraction process.

Outcome

The specification of the data extraction process reported in the protocol.

Extracting the Data

Goal

To extract and record the data required to answer the SR research questions from each primary study.

Input

A list of all primary studies.

The data collection form(s).

The protocol defining the data items and describing the data extraction process.

Organizational Process

The leader of the review team needs to allocate review team members to each primary study as specified in the protocol, and should then monitor the progress of the process by:

Checking that data extraction assignments are completed in a timely manner.
Checking that disagreements about the data extracted from a specific primary study are properly resolved, assigning another reviewer to the specific primary study if necessary.
Calculating and monitoring the agreement rates, and taking action if agreement rates are low (e.g. coaching individual team members, or revising data extraction guidelines).
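These guidelines do not prescribe a particular agreement statistic. As one possibility, the sketch below computes Cohen's kappa for two reviewers who have each recorded a categorical value for the same data item across a set of primary studies. The function name is hypothetical, and kappa is an assumption chosen for illustration; other statistics may suit other kinds of data better.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same items categorically."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must rate the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical category throughout
    return (observed - expected) / (1.0 - expected)
```

Calculating the value after each batch of completed studies, rather than once at the end, lets the team leader spot data items whose definitions are being interpreted inconsistently.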

Data Extraction Process

The data extraction process should implement the process defined in the protocol:

Team members should extract data from the primary studies to which they are assigned.
Once they have completed extracting data from a specific study, they should notify any other reviewers assigned to that study.
Once all initial extractions are complete for a specific study, reviewers should compare their extraction forms and identify any disagreements.
The original data extraction forms should be copied to the team leader to calculate agreement statistics. Any disagreements should be resolved using the procedure defined in the protocol.
The final agreed data collection form should be made available in the format required by the data synthesis process.
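The comparison step can be sketched as follows, assuming each reviewer's form is held as a mapping from data item name to a pair of extracted value and its location in the paper. This representation and the function name are hypothetical. Recording the location makes it quicker to trace the source of a disagreement.

```python
def find_disagreements(form_a, form_b):
    """Compare two extraction forms for the same primary study.

    Each form maps a data item name to a (value, location) pair, where
    location records where in the paper the value was found.
    Returns {item: (entry_a, entry_b)} for every item needing discussion.
    """
    disagreements = {}
    for item in sorted(set(form_a) | set(form_b)):
        entry_a = form_a.get(item)  # None if this reviewer left the item blank
        entry_b = form_b.get(item)
        # Only the values are compared; locations may legitimately differ.
        if entry_a is None or entry_b is None or entry_a[0] != entry_b[0]:
            disagreements[item] = (entry_a, entry_b)
    return disagreements
```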

Risks

Risk 1: Investigating excessive numbers of disagreements may extend the time needed to complete data extraction.

Risk 2: Reviewers may misinterpret the data definitions or misunderstand the data collection process when dealing with primary studies that are not among the original set of known primary studies.

Risk Mitigation

To address risk 1: Reviewers should specify the location of the extracted data in the primary study. This should make it easier to identify the source of any disagreements.

To address risk 2:

Agreement statistics can be reviewed partway through the conduct of the data collection process to check the level of agreement is acceptable.
All team members should be aware of the need to alert the team leader if they have problems with data extraction for specific data items or primary studies.

Outcome

The data required from each primary study in the format defined in the protocol.

Data Analysis and Synthesis

Definition

The data analysis and synthesis process identifies and defines how the extracted data items will be analysed in order to address the SR research questions.

Scope

Data analysis involves aggregating the extracted data to answer the research questions. It also includes using contextual information to identify any limitations or constraints associated with the answers.

Contextual information relating to the primary studies is used in three ways:

To identify subsets of primary studies that should be analysed together. This usually means groups of primary studies that address the same issues and use similar empirical methods.
In the context of quantitative synthesis, to undertake sensitivity analysis (which assesses whether results are reliable – i.e. not due to a single atypical result) and heterogeneity analysis (which investigates whether there are likely to be missing studies, and whether differences among results can be linked to different primary study contexts).
To assist the strength of evidence assessment of the analysis/synthesis findings.

Mapping study data usually classifies the primary study in different ways. Such data is usually presented in tables or graphs that look for associations among different categories used to classify aspects of the primary studies (e.g., graphs that plot characteristics of the studies over time such as the number or types of primary study published each year). Mapping study data are seldom formally analysed.

Quantitative analysis is used when primary studies report the results of quantitative experiments or data mining studies. They usually involve some effectiveness measures such as document or code readability or fault rates, or process effort, duration, or productivity. Primary studies may report investigations of the relationships:

Among different effectiveness measures.
Between effectiveness measures and software engineering practices (e.g., methods, processes, procedures, tools).
Between effectiveness measures and product characteristics (such as size, complexity, maintenance history).

Aggregating such primary study outcomes usually involves formal meta-analysis. It is wise to have a statistical expert as part of the review team if such analysis is needed.
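Formal meta-analysis can take many forms, and these guidelines do not prescribe one. As a minimal sketch, the code below shows inverse-variance fixed-effect pooling together with Cochran's Q and the I² heterogeneity statistic. It assumes each primary study supplies an effect size and its variance; the function name and the returned keys are hypothetical. A real analysis would normally use an established statistics package and may need a random-effects model.

```python
import math


def fixed_effect_meta(effects, variances):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2 (percent)."""
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect size")
    weights = [1.0 / v for v in variances]          # precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of variation attributed to heterogeneity rather than chance.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "Q": q,
        "I2": i2,
    }
```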

Qualitative analysis is usually used to answer questions relating to personal opinions towards some software engineering practice in terms of its benefits/value and limitations/risks and any constraints associated with adoption. It is also sometimes used to assess whether sociological models (such as models of motivation, management styles or personality types) are relevant to the software industry. In such cases, data synthesis involves:

1. Identifying conceptual equivalence between factors mentioned in the primary studies by:

Identifying each factor mentioned in a primary study and grouping factors that address different aspects of a more abstract characteristic (often referred to as a second level factor).
Doing the same analysis for another primary study.
Comparing identified factors to arrive at definitions of first level and second level factors and synonyms for factor names.

This process of comparing and refining terminology and definitions continues until data from all relevant primary studies are synthesized.

2. Assessing the frequency with which different first and second level factors were mentioned across the set of primary studies and identifying the most frequently reported issues. We assume that issues reported in many different primary studies are likely to be the most important issues faced by other software organizations. It is sometimes necessary to synthesize results from opinion surveys with those from case studies and ethnographical studies. In order to do this fairly, synthesis should ignore the frequencies mentioned in the surveys, and report frequencies based on the number of times a particular factor was mentioned across the set of all relevant primary studies.
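The frequency assessment can be sketched as a count of the number of primary studies that mention each factor, after synonyms have been mapped to an agreed factor name. Each study contributes at most one count per factor, regardless of how many survey respondents reported it. The function name and the synonym mapping shown in the usage below are hypothetical.

```python
from collections import Counter


def factor_frequencies(study_factors, synonyms=None):
    """Count how many primary studies mention each factor.

    study_factors: dict mapping study id -> list of factor names as reported.
    synonyms:      optional dict mapping a reported name -> agreed factor name.
    Returns (factor, study_count) pairs, most frequently mentioned first.
    """
    synonyms = synonyms or {}
    counts = Counter()
    for factors in study_factors.values():
        # A set ensures each factor is counted once per study.
        counts.update({synonyms.get(name, name) for name in factors})
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

For example, with three studies where "quicker delivery" has been agreed to be a synonym of "faster delivery":

```python
reported = {
    "S1": ["faster delivery", "quicker delivery", "training cost"],
    "S2": ["quicker delivery"],
    "S3": ["training cost", "tool immaturity"],
}
ranking = factor_frequencies(reported, {"quicker delivery": "faster delivery"})
```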

A major issue for data analysis and synthesis for quantitative and qualitative studies is deciding how to incorporate information obtained from critical appraisal of individual primary studies:

Recommendations from systematic reviews should be based on reliable evidence. So, reviewers may decide to omit primary studies with critical weaknesses from any aggregation process. In the case of studies using meta-analysis, the meta-analysis synthesis should be done for all studies, and then, if necessary, a second time omitting all studies with critical weaknesses. This allows reviewers to assess whether methodological weakness introduces systematic bias into results.
Recommendations should include an assessment of strength of evidence that identifies the extent to which the answer to each research question is based on reliable evidence in terms of methodological rigour and other criteria (see Assessing Strength of Evidence).
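The two-pass synthesis described in the first point above can be sketched as follows: pool all studies, pool again without those appraised as having critical weaknesses, and compare the two estimates. The pooling here is a simple inverse-variance weighted mean used only for illustration, and the dictionary keys (effect, variance, risk_of_bias) and function names are hypothetical.

```python
def pooled_effect(studies):
    """Inverse-variance weighted mean of the study effect sizes."""
    weights = [1.0 / s["variance"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)


def sensitivity_to_weak_studies(studies, excluded_levels=("Critical",)):
    """Pool all studies, then pool again omitting the weakest ones."""
    robust = [s for s in studies if s["risk_of_bias"] not in excluded_levels]
    return {
        "all": pooled_effect(studies),
        "robust": pooled_effect(robust) if robust else None,
        "n_all": len(studies),
        "n_robust": len(robust),
    }
```

A large gap between the two estimates suggests that the methodologically weak studies are pulling the result in one direction.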

The data analysis and synthesis process involves two processes:

Specifying the Data Analysis and Synthesis Process
Conducting the Data Analysis and Synthesis Process

Specifying the Data Analysis and Synthesis Process

Goal

To define the methods that will be used to analyse the data extracted from each primary study.

Input

The SR research questions.

The set of known primary studies.

The data definition and extraction process.

Scope

The method of analysis and synthesis is defined by the research questions, the primary study types and the data extracted to address the research questions.

In most cases, a single person must take responsibility for specifying the data analysis and synthesis process, taking into account the specific type of SR.

Prior to analysing data it is sometimes necessary to group subsets of primary studies together to address different research questions. During protocol development, the known primary studies and the existing data collection forms should give the review team an indication of whether this will be required and how it should be done. For example, if sub-setting of primary studies is necessary, once all the primary studies are agreed, the review team leader should assess whether the primary studies need to be split into subsets and provide an initial breakdown if subsets are required. Other members of the review team should assess the subsets given their experience of the primary study selection process and the review questions.

For mapping studies, the link from extracted data to answering research questions should be well defined and allocation of review team members to the analysis task can be done on a basis of availability. It is advisable, however, to agree some presentation/reporting guidelines, for example:

Avoid using tabular analysis that requires mutually exclusive categories, unless that property can be guaranteed.
Keep to the simplest adequate representations of the results – over-complex representations can be misleading.
Avoid too much use of colour – some readers may be colour blind, so some combinations of colours can decrease readability.
Avoid extremely small text fonts – some readers may have restricted eyesight.

For qualitative SRs, it is sensible to ensure that the construction of first level and second level factors is not assigned to a single reviewer. It often involves substantial effort and is based on subjective assessments, which means whenever possible it should be done by several review team members working together. The protocol should define how this should be organized. For example, for each subset of primary studies addressing a specific research question (or several related research questions), two or more team members work together to develop codes for data from two primary studies, and then revise/extend the codes by assessing the data from the other related projects one project at a time. Other team members should review the codes once all the data for a specific subset of primary studies have been assessed.

For quantitative reviews, the analysis process usually requires some statistical expertise, which may mean relying on a single member of the research team to develop the statistical analysis protocol. The statistical analysis protocol is usually not included in the systematic review protocol, but is prepared as a separate self-standing document, and then summarized in, and cited from, the SR protocol. The statistical analysis protocol should specify the statistical methods to be used and how results will be reported. It should also discuss whether or not blind analysis is required and, if required, how the process will be managed. Other issues that need to be specified are procedures to support reproducibility and any requirements the analysis process places on the data extraction and data storage processes. All members of the team should review the statistical analysis protocol and the summary of the analysis process in the SR protocol for consistency and completeness, and to ensure they understand any requirements the statistical analysis protocol places on data extraction and data storage.

Verification

The data analysis and synthesis process should be trialled both on data from the data extraction trials, and also with artificially generated data.

Validation

For quantitative analysis, the analysis process should be represented as an analysis script (to support reproducibility). The R programming language is particularly useful for constructing analysis scripts.

Risks

R1: The search will uncover primary studies of a type that was not anticipated during the development of the protocol.

R2: The format and/or the storage of the extracted data is inappropriate for the required analyses.

Risk Mitigation

To address risk 1: Use the risk process adopted during data definition.

To address risk 2: The trials of the data analysis and synthesis process should check that the format of the extracted data is suitable for the required analysis.

Outcome

A description of the planned method for data analysis/synthesis reported in the study protocol.

Conducting the Data Analysis and Synthesis Process

Goal

To identify one or more findings that answer all the SR research questions.

Input

A list of all primary studies relevant to a specific research question (or a set of closely related research questions).

The completed data collection form(s) for the specific subset of primary studies.

The protocol defining the data analysis and synthesis process.

Organizational Process

The leader of the review team needs to allocate review team members to data synthesis as defined in the study protocol.

Data Analysis and Synthesis Process

The data analysis and synthesis process should be conducted as specified in the protocol.

Risks & Risk Mitigation

Addressed by the risk process adopted when the data analysis and synthesis process was specified.

Outcome

One or more findings answering each SR research question, linked to the specific set of primary studies that contributed to the finding.

Critical Appraisal of Primary Studies

Definition

Critical appraisal of primary studies is the method used to assess the risk of bias arising from methodological weakness of an SR primary study. It is also referred to as assessing the quality of primary studies. Critical appraisal is important for quantitative and qualitative systematic reviews, but is not usually relevant to mapping studies.

Scope

SRs aim to aggregate answers to research questions from the outcomes of synthesis (referred to as findings) of a set of primary studies. Assessing which primary studies have adopted a reliable research methodology, and have applied it correctly, helps an SR reader decide whether they can rely on its findings. It is also an important issue for review authors who need to assess the strength of evidence of primary studies that contributed to a specific finding.

To support the needs of SR readers and review authors, it is only methodological issues that are important in critical appraisal, not issues relating to the clarity of primary study reporting, nor the importance or novelty of the primary study results.

This emphasis on methodology rather than other quality aspects is why current medical standards refer to assessing risk of bias of methodological weakness rather than assessing quality.

Currently the critical appraisal method used in SE systematic reviews is often based on a checklist that includes a number of items (usually formulated as questions) and guidelines for answering those questions. However, SR primary studies use many different methodologies, which makes construction of a single general-purpose checklist difficult. Any checklist general enough to apply to a set of primary studies using many different empirical methods may not be detailed enough to ask questions about the methodological weaknesses of all the different study types.

Another problem with SRs in software engineering is that we often assign numerical values to checklist answers and sum those numbers to provide an overall assessment of the “quality” of a primary study. This is generally deprecated in other disciplines because it gives the impression that good values for one criterion can overcome poor values for another criterion. In fact, the assessment of methodological weakness of a study should be no better than the level of its most serious weakness.

These issues are discussed in more detail in two related SR process descriptions:

Identifying an appropriate appraisal method.
Appraising each primary study using the agreed method.

Identifying an Appraisal Method

Goal

The goal of this process is to identify the appraisal method that will be used to assess the primary study risk of bias due to methodological weakness.

Input

The systematic review context and constraints, the research questions, and any known primary studies.

Method

The aim of critical appraisal is to provide an assessment of risk of bias due to methodological weakness for each primary study, while allowing for the fact that different primary studies that address the same research question(s) may use different methodologies. This is done by defining the outcome of appraising an individual study to be measured on a comparable subjective ordinal scale of the form:

Critical (C): the primary study has one or more major flaws, which means that its results are highly unlikely to be trustworthy.
Serious (S): the primary study has no major flaws but several serious weaknesses, which means that its results are unlikely to be trustworthy.
Moderate (M): The primary study has several moderate weaknesses and its results should be treated with caution.
Low (L): The primary study has some minor weakness, but overall, its results are likely to be trustworthy.
Very Low (VL): The primary study has either no weaknesses, or negligible weaknesses, so its results are highly likely to be trustworthy.

Then, if we have criteria for assessing the methodological weakness for different types of study, the reviewers can select the most appropriate criteria to apply to any specific primary study.

Methodological problems that should be considered major flaws for experiments include:

Participants were not randomly assigned to experimental conditions.
The researchers have not used any form of blinding or other procedures to minimize the impact of experimenter, analyst, or subject bias.
The design has not catered for plausible confounding effects.

Methodological problems that should be considered major flaws for qualitative studies include:

Relying on a single researcher to perform all the coding and interpretation of textual data.
Lack of any form of triangulation to validate data and conclusions.
A mismatch between the study participants and the study research question (e.g., asking novices about an issue that requires extensive personal experience).
For case studies, pilot projects or action research, a major problem is failure to report and to use methods to reduce experimenter bias.

Methodological problems that should be considered major flaws for surveys include:

Lack of a defined sampling frame
Lack of random sampling
A mismatch between research goals and survey questions

Methodological problems that should be considered major flaws for data mining/machine intelligence studies include:

Lack of justification for the data sets used in the study
Poor choice of goodness of fit criteria
Lack of statistical testing.

In all cases, the researchers should be alert for any other weaknesses in the study method.

When identifying areas of possible methodological weakness, it is important not to duplicate issues that are covered by the strength of evidence assessment criteria, in particular, the Adequacy of data and Relevance of data criteria, which address problems due to:

Small scale experiments (small samples, simple SE tasks)
Unrepresentative participants
Proof of concept examples rather than serious evaluations
Old or invalid datasets in data mining and machine learning studies
Lack of an industry setting, or practitioner participants.

Risks

Determining the appropriate criteria for different types of study is difficult and requires expertise both in methodology and the SR topic. Thus, there is a risk that the assessment of risk of bias for some types of primary study may overlook some important issues.
Some types of study found by the search may not have been anticipated during the SR protocol preparation.

Risk Mitigation

To address risk 1: The methods likely to be found in the primary studies should be identified by reading the known primary studies. Guidelines for using the identified methods appropriately should be obtained from the literature, or by consulting methodology experts and experts on the SR topic.

To address risk 2: Some contingency time should be built into the SR schedule to allow time to identify assessment criteria for unexpected types of primary study.

Outcome

The method for assessing risk of bias due to methodological weakness should be reported in the SR protocol. The protocol should also include any guidelines needed to apply the assessment method to the types of primary study expected to be found in the SR.

Appraising the Primary Studies

Goal

The goal of this process is to assess and report the risk of bias of methodological weakness for each identified primary study.

Input

The appraisal method as specified in the study protocol including any data collection forms.

The list of primary studies.

Organizational Process

The leader of the review team needs to allocate at least two review team members to each primary study and monitor the progress of the process in order to:

Check that assessment assignments are completed.
Check that disagreements about the assessment of a specific primary study are properly resolved, which may involve assigning another reviewer to the specific study if necessary.
Calculate and monitor the agreement rates.

Appraisal Process

At least two reviewers assess the risk of bias for each primary study. Prior to making the assessment, the reviewers need to agree the type of the primary study and then select the appropriate criteria to use in their assessment.

The reviewers should independently assess each relevant criterion using a subjective ordinal scale of the form Critical, Serious, Moderate, Low, Very Low. In each case, the reviewer should provide an explanation for their assessment.

Disagreements need to be discussed and resolved, if necessary involving another reviewer. The agreed subjective assessment for each criterion and the reason for that assessment should be recorded in a risk of bias form.

Once all the criteria have been assessed for a specific primary study, the researchers involved in the assessment need to make an overall assessment for the study and provide an explanation of the overall assessment.

The overall risk of bias for the primary study should be based on the following guidelines:

The overall assessment should not be better than the worst individual assessment.
If the worst assessment is Serious, but three or more criteria all have that value, the overall assessment should be set to Critical.
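The two guidelines above can be expressed as a small function that returns the best overall assessment they permit. Reviewers may still justify a worse overall assessment than the one returned. The function and constant names are hypothetical.

```python
# Ordinal scale from the appraisal method, ordered from lowest to highest risk.
SCALE = ["Very Low", "Low", "Moderate", "Serious", "Critical"]


def overall_risk_of_bias(criterion_ratings):
    """Return the best overall assessment allowed by the two guidelines."""
    if not criterion_ratings:
        raise ValueError("at least one criterion must be assessed")
    ranks = [SCALE.index(rating) for rating in criterion_ratings]
    worst = max(ranks)
    # Three or more Serious ratings, with nothing worse, escalate to Critical.
    if SCALE[worst] == "Serious" and ranks.count(worst) >= 3:
        return "Critical"
    # Otherwise the overall rating is the worst individual rating.
    return SCALE[worst]
```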

Risks

Assessing the risk of bias due to methodological weakness requires a series of subjective assessments and such assessments are naturally error-prone.
There is a risk that factors used to assess the methodological weakness of individual projects could overlap with criteria used to assess the strength of evidence.

Risk Mitigation

To address Risk 1: To reduce assessment errors and increase assessment consistency, researchers should trial the assessment process during protocol development using some of the known primary studies (including all the different types of study found among the known projects). Disagreements between pairs of researchers should be discussed, and the criteria for assessment refined if necessary.

To address Risk 2: Reviewers should check the risk of bias forms for the project and make sure that they are not identifying problems that are addressed by other strength of evidence criteria.

Outcome

The results of the assessment of risk of bias due to methodological weakness for each primary study, including an assessment of the risk of bias for each individual criterion and an overall assessment.

Assessing the Strength of Evidence

Definition

Assessing the strength of evidence is also referred to as assessing the credibility or certainty of evidence. It means assessing the trustworthiness of each systematic review finding, in terms of the number and the reliability of the primary empirical studies that support that finding. This process is important for quantitative and qualitative systematic reviews but is not appropriate for mapping studies.

Scope

SRs aim to aggregate answers to research questions or outcomes of synthesis (referred to as findings) from a set of primary studies. Assessing the strength of evidence for each finding and recommendation provides information to help an SR reader decide whether they can rely on the recommendation or finding.

In order to assess whether or not a specific finding is trustworthy, it is not only necessary to assess whether the individual primary studies contributing to a finding are of good methodological rigour, but also whether the set of primary studies related to a particular finding have any systematic weaknesses or biases. For example, the primary studies may all have only minor risk of methodological bias, but may have been produced by the same research group, which could introduce experimenter bias. Alternatively, the studies may report results only from relatively small studies which could suggest that there are insufficient large scale evaluations. Other issues that affect strength of evidence are whether the set of related primary studies suggest evidence of publication bias, or exhibit strong disagreements among empirical studies about which technique is preferable, or lack clear relevance to users of the SR recommendations.

Overall assessments of strength of evidence are based on a subjective ordinal scale of the type: High Strength, Moderate Strength, Low Strength, Very Low Strength. Because they are subjective, the assessment for each finding or recommendation must be justified (i.e., any major problem with the evidence should be explained), and should not be based on the opinions of a single researcher.

For more information about assessing the strength of evidence, see G. Guyatt et al., “GRADE guidelines: 1. Introduction-GRADE evidence profiles and summary of findings tables”, Journal of Clinical Epidemiology 64 (2011) and S. Lewin et al. (2018), “Applying GRADE-CERQual to qualitative evidence synthesis findings: introduction to the series”, Implementation Science, 13(Suppl 1):2.

Assessing strength of evidence involves two related SR processes:

Identifying a Strength of Evidence Framework
Assessing the Strength of Evidence for each Finding

Warning: This is the hardest part of any systematic review and is usually omitted from SE systematic reviews. However, to keep our SR standards in line with other disciplines, and to help readers understand the extent they can rely on SR findings, we strongly recommend that reviewers adopt this form of assessment.

Identifying a Strength of Evidence Framework

Goal

The goal of this process is to identify the criteria that will be used to assess the strength of evidence of the systematic review findings.

Input

The systematic review context and constraints, the research questions, any known primary studies, and any results of trials of the primary study quality assessment/risk of bias assessment. This information should be available from the study protocol.

Method

There are three options for determining the assessment framework to be used:

Select an existing framework
Refine an existing framework
Create a new framework

For software engineering SRs, there are too few examples of strength of evidence assessment in the literature to be sure of the best method. Given the prevalence of qualitative reviews in SE and the complete lack of randomized controlled field trials (which are the basis of the GRADE framework), a version including elements primarily suggested by GRADE-CERQual but allowing for quantitative findings would seem to be the best starting point. These are:

Methodological limitations of the primary studies which contributed to the specific outcome or finding. Assessments should emphasize methodological strengths and weaknesses not the quality of reporting nor the potential importance or novelty of results. Only primary studies without any critical or serious methodological weaknesses are included in the full assessment process.
Coherence of the primary studies, which assesses how clearly and consistently the data from the primary studies support the specific review finding. In the case of quantitative SRs, consider whether the effect sizes of primary studies are consistent.
Adequacy of data, which in the case of qualitative studies, relates to whether the number of participants and the richness of the data obtained from the participants are sufficient to understand and justify the finding. In the case of quantitative studies, data adequacy relates to sample size and the number of experiments or data sets.
Relevance of data is related to whether the set of primary studies are applicable to the context specified in the research question. For example, issues to consider are a lack of studies undertaken either in an industry setting, or with practitioner participants.
Likelihood of missing data or studies. This is mainly an issue for quantitative studies, which can be affected by publication bias. If formal meta-analysis is used, some statistical techniques can be used to assess the probability of missing studies. Less formally, a large number of small-sample positive studies and a very low number of negative studies are indications of missing data. In qualitative studies, a complete lack of dissenting opinions or a predominance of studies from a single research group may be a sign of potential bias.

The Methodological Weakness criterion is used to screen primary studies before strength of evidence is assessed. Each primary study is assessed on a five-point scale:

Critical (C): the primary study has one or more major flaws, which means that its results are highly unlikely to be trustworthy.
Serious (S): the primary study has no major flaws but several weaknesses, which means that its results are unlikely to be trustworthy.
Moderate (M): the primary study has several weaknesses and its results should be treated with caution.
Low (L): the primary study has some minor weaknesses, but overall its results are likely to be trustworthy.
Very Low (VL): the primary study has no or negligible weaknesses, so its results are highly likely to be trustworthy.
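
The screening rule (only primary studies without critical or serious methodological weaknesses go forward to the full assessment) can be sketched in Python as follows. This is only an illustrative sketch: the names `Weakness` and `screen_studies`, and the study identifiers, are hypothetical and are not part of any published framework or tool.

```python
from enum import IntEnum


class Weakness(IntEnum):
    """Five-point methodological weakness scale; larger means weaker."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    SERIOUS = 4
    CRITICAL = 5


def screen_studies(ratings: dict[str, Weakness]) -> tuple[list[str], list[str]]:
    """Split studies into (included, excluded) lists.

    Studies rated Serious or Critical are excluded from the full
    strength of evidence assessment; all others are included.
    Input order is preserved in both lists.
    """
    included = [s for s, w in ratings.items() if w < Weakness.SERIOUS]
    excluded = [s for s, w in ratings.items() if w >= Weakness.SERIOUS]
    return included, excluded


# Hypothetical ratings agreed by the review team.
ratings = {
    "S1": Weakness.CRITICAL,
    "S2": Weakness.MODERATE,
    "S3": Weakness.VERY_LOW,
    "S4": Weakness.SERIOUS,
}
included, excluded = screen_studies(ratings)
print(included)  # ['S2', 'S3']
print(excluded)  # ['S1', 'S4']
```

Keeping the excluded list, rather than discarding it, makes it easier to report how many studies were screened out and why.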

Only primary studies with no critical or serious methodological weaknesses are included in the full assessment process. Other criteria and the final overall assessment are graded on a four-point subjective ordinal scale:

High (H): high strength of evidence, no problems or negligible problems associated with the criterion
Moderate (M): moderate strength of evidence, only relatively minor problems associated with the criterion
Low (L): low strength of evidence, no critical problems and no more than two serious problems with the criterion.
Very Low (VL): very low strength of evidence, three or more serious problems or a critical problem with the criterion.

The overall strength of evidence needs to be agreed (by at least two members of the review team) and recorded in the strength of evidence table. The overall strength of evidence should be based on the following guidelines:

The overall assessment cannot be better than the assessment of Methodological Weakness: Moderate methodological weakness equates to Low strength of evidence, Low methodological weakness equates to Moderate strength of evidence, and Very Low methodological weakness equates to High strength of evidence.
The overall assessment should not be better than the worst individual assessment. High strength of evidence in one criterion does not cancel out a Low or Very Low strength of evidence in another criterion.
If the worst assessment is Low, but three or more criteria all have that value, the overall assessment should be set to Very Low.
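
The three guidelines above can be combined into a small function that returns the best overall grade they permit. This is a minimal sketch under two assumptions that the text does not settle: the finding-level methodological weakness is supplied as a single agreed value, and the strength grade derived from it counts as one of the criteria when applying the second and third guidelines. The names `Strength`, `WEAKNESS_CEILING` and `overall_strength` are hypothetical. The result is an upper bound only; reviewers still need to agree and justify the final grade.

```python
from enum import IntEnum


class Strength(IntEnum):
    """Four-point strength of evidence scale; larger means stronger."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4


# Guideline 1: methodological weakness caps the overall strength.
WEAKNESS_CEILING = {
    "Moderate": Strength.LOW,
    "Low": Strength.MODERATE,
    "Very Low": Strength.HIGH,
}


def overall_strength(weakness: str, criteria: dict[str, Strength]) -> Strength:
    """Return the best overall grade permitted by the three guidelines.

    `weakness` is the agreed finding-level methodological weakness
    (assumption: one value per finding). `criteria` maps each of the
    other criteria to its agreed grade.
    """
    if weakness not in WEAKNESS_CEILING:
        raise ValueError("expected 'Moderate', 'Low' or 'Very Low'")
    # Assumption: the capped grade is treated as one more criterion.
    grades = [WEAKNESS_CEILING[weakness], *criteria.values()]
    worst = min(grades)  # Guideline 2: no better than the worst grade.
    # Guideline 3: three or more Low grades pull the result to Very Low.
    if worst == Strength.LOW and grades.count(Strength.LOW) >= 3:
        return Strength.VERY_LOW
    return worst


example = {
    "coherence": Strength.HIGH,
    "adequacy": Strength.MODERATE,
    "relevance": Strength.HIGH,
    "missing_data": Strength.HIGH,
}
print(overall_strength("Very Low", example).name)  # MODERATE
```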

Risks

A major risk is that the individual criteria are not completely independent. Lack of independence can mean that a specific concern could lower the assessment of more than one criterion, which would mean that the strength of evidence might be unfairly lowered.

Risk Mitigation

During protocol development, review team members should discuss how the criteria are to be interpreted after reading some of the known primary studies. These discussions should be used to develop assessment guidelines.
When constructing the entry for a specific finding, the reviewers need to check the related primary study risk of bias assessments and ensure that their assessments of other strength of evidence criteria have not been based on factors already considered in the Methodological Weakness assessment.

Outcome

A strength of evidence assessment table identifying the criteria that will be used for assessing strength of evidence, together with any guidelines for completing the table developed during protocol development. Note that strength of evidence assessment may require some additional trials during the conduct of the study to finalise the assessment guidelines.

Assessing the Strength of Evidence for each Finding

Goal

The overall aim of the assessment is to be able to report for each finding, how many of the primary studies that used good or moderately good empirical methods supported the finding, whether those primary studies were appropriate given the context of the research question, and whether they had any other major limitations.

Input

The strength of evidence assessment table.

The results of the quality/risk of bias assessment of each primary study, where each result rates Risk of Bias on a subjective ordinal scale of the form Very Low, Low, Moderate, Serious, Critical.

The list of SR findings identifying all primary studies that contributed to the finding.

Organizational Process

The leader of the review team needs to allocate at least two review team members to each SR finding and also to monitor the progress of the process:

Check that assessment assignments are completed.
Check that disagreements about the assessment of a specific finding are properly resolved, assigning another reviewer to the specific finding if necessary.
Calculate and monitor the agreement rates.
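
The text does not prescribe a particular agreement statistic. As one possible sketch, the following computes simple percent agreement and unweighted Cohen's kappa for two reviewers who have graded the same items on the same ordinal scale. The function names and the example grades are hypothetical; a weighted kappa may be preferable for ordinal scales, but is not shown here.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Proportion of items on which the two reviewers gave the same grade."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Unweighted Cohen's kappa for two reviewers."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each reviewer's marginal totals.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used a single identical grade throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two reviewers for four findings.
reviewer_1 = ["H", "M", "L", "L"]
reviewer_2 = ["H", "M", "M", "L"]
print(percent_agreement(reviewer_1, reviewer_2))  # 0.75
print(round(cohen_kappa(reviewer_1, reviewer_2), 3))  # 0.636
```

Recalculating these values as assessments come in, rather than only at the end, helps the team leader spot criteria that reviewers are interpreting differently.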

Process for Assessing Strength of Evidence

At least two reviewers assess the strength of evidence for each outcome or finding. The reviewers identify the set of primary studies that contribute to a finding and then assess the strength of evidence for each criterion using a subjective ordinal scale of the form High, Moderate, Low, Very Low. In each case the reviewers should provide an explanation for their assessment.

Once all the criteria have been assessed for a specific finding, the researchers involved in the assessment should make an overall assessment for the finding together with an explanation of the overall assessment. Researchers should make their initial overall assessments independently and then discuss and resolve any disagreements (including another researcher in the process if necessary).

Risks

Assessing the strength of evidence requires a series of subjective assessments and such assessments are naturally error-prone.

Risk Mitigation

To reduce assessment errors and increase assessment consistency:

If there are a large number of findings, all reviewers should assess a specific finding, and then have a meeting to discuss their assessments in terms of agreements and disagreements among the assessments. The aim of such a meeting would be to ensure all review team members have a shared understanding of the assessment criteria and the assessment process.
If there are only a few findings, all reviewers should assess all findings and discuss their results. The aim of this would be to reach a consensus on the strength of evidence based on the experience of the whole review team.

Outcome

A completed strength of evidence table, including an entry for each outcome or finding, identifying the agreed strength of evidence for each criterion and the basis for the assessment, and an overall assessment.
