The terminology used for empirical software engineering has been drawn from a number of other disciplines, and so may not always be used consistently between different authors and documents. This glossary is intended to provide a useful and accessible set of definitions for the terms that we use in empirical software engineering, and hence evidence-based software engineering. Wherever possible it seeks to cite an authority for those definitions where this might be appropriate. The short list of references below provides fuller discussions of many of these terms.
Dybå, T., Kampenes, V. B. & Sjøberg, D. (2006), ‘A systematic review of statistical power in software engineering experiments’, Information & Software Technology 48, 745-755.
Dybå, T., Kitchenham, B. A. & Jørgensen, M. (2005), ‘Evidence-Based Software Engineering for Practitioners’, IEEE Software 22, 58-65.
Fenton, N. E. & Bieman, J. (2014), Software Metrics: A Rigorous & Practical Approach, 3rd Edition, Chapman & Hall/CRC Press.
Kitchenham, B., Budgen, D. & Brereton, P. (2016), Evidence-Based Software Engineering and Systematic Reviews, Chapman & Hall/CRC Press.
Oates, B. J. (2006), Researching Information Systems and Computing, SAGE.
Petticrew, M. & Roberts, H. (2006), Systematic Reviews in the Social Sciences: A Practical Guide, Blackwell Publishing.
Shadish, W., Cook, T. & Campbell, D. (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton Mifflin Company.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B. & Wesslén, A. (2012), Experimentation in Software Engineering, 2nd Edition, Springer.
Yin, R. K. (2018), Case Study Research: Design and Methods, 6th Edition, SAGE.
This is the most restrictive of the measurement scales and simply uses counts of the elements in a set of entities. The only operation that can be performed is a test for equality. (See measurement scales.) [N.B. This term is sometimes used rather differently to refer to a ratio scale which incorporates a natural and unequivocal definition of the measurement unit!]
The accuracy of a measurement is an assessment of the degree of conformity of a measured or calculated value to its actual or specified value.
The accuracy range tells us how close a sample is to the true population of interest, and is usually expressed as a plus/minus margin. See also confidence level.
The process of gathering together knowledge of a particular type and form (for example, in a table).
An attribute is a measurable (or at least, identifiable) characteristic of an entity, and as such provides a mapping between the abstract idea of a property of the entity and something that we can actually measure in some way.
(Aka parallel experiment) Refers to one of the possible designs of a laboratory experiment. In this form, participants are assigned to different treatment (intervention) groups on the basis of one or more criteria and each participant only receives one treatment. (See within-subject.)
Unfairly favouring one treatment over another in an experiment. See also publication bias.
A process of concealing some aspect of an experiment from researchers and participants. In single-blind experiments, participants do not know which treatment they have been assigned to. In double-blind experiments, neither participants nor experimenters know which treatment the participants have been assigned to. In triple-blind experiments, as well as the participants and researchers, the statisticians analysing the results from an experiment are also blinded by being given treatment identifiers and not being told which identifier refers to which treatment. In software engineering we sometimes use blind-marking, where the marker does not know which treatment the participants adopted to arrive at their answers or responses.
A box plot is a means of showing how data values are distributed and, in particular, of revealing any skew in the distribution in a visual manner. The core ‘box’ has a line at the median value, and the upper and lower bounds of the box are the upper quartile and lower quartile values respectively (the values below which 75% and 25% of the data fall). The ‘whiskers’ on the ‘tails’ extend from the quartiles by up to 1.5 times the box length (the interquartile range), and are then truncated to the nearest actual value in the dataset. Any values lying outside the whiskers are plotted as separate ‘outliers’.
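The quartile, whisker and outlier calculations described above can be sketched in Python using only the standard library (the data values here are purely illustrative):

```python
import statistics

# A small illustrative dataset with one extreme value.
data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# statistics.quantiles with n=4 yields the lower quartile, median, upper quartile.
q1, median, q3 = statistics.quantiles(data, n=4)

# Whiskers extend at most 1.5 box-lengths (1.5 * IQR) beyond each quartile,
# then are truncated to the nearest actual value in the dataset.
iqr = q3 - q1
lower_whisker = min(x for x in data if x >= q1 - 1.5 * iqr)
upper_whisker = max(x for x in data if x <= q3 + 1.5 * iqr)

# Anything beyond the whiskers would be plotted as an individual outlier.
outliers = [x for x in data if x < lower_whisker or x > upper_whisker]

print(median, (q1, q3), (lower_whisker, upper_whisker), outliers)
```

Note that different tools use slightly different quantile conventions, so quartile values may vary marginally between packages.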
A form of primary study, which is an investigation of some phenomenon in a real-life setting. Case studies are typically used for explanatory, exploratory and descriptive purposes. The two main forms are single-case studies, which may be appropriate when studying a representative case or a special case, but will be less trustworthy than multiple-case forms, where replication is employed to see how far different cases predict the same outcomes. (Note that the term case study is sometimes used in other disciplines to mean a narrative describing an example of interest.) Case study research is covered in detail in the books by (Yin 2018) and (Runeson et al. 2012), and more concisely in (Oates 2006).
The link between a stimulus and a response, in that one causes the other to occur. The notion of some form of causality usually underpins hypotheses.
(As used in a questionnaire.) Such a question restricts the respondent by giving them a list of all permissible answers. Such a list may optionally include ‘other’ or ‘don’t know’ options. See also open question.
To generalise from an experimental sample to the wider population of interest we need to know the size of that population and also determine how sure we are that the values for our sample represent the population (the confidence level). See also accuracy range and sample size.
This is an undesirable element in an empirical study that produces an effect that is indistinguishable from that of one of the treatments. (A common example for software engineering is any prior relevant experience that participants may have.)
(As used in a questionnaire.) Concerned with whether the questions are a well-balanced sample for the domain we are addressing.
For laboratory experiments we can divide the participants into two groups, with the treatment group receiving the treatment and the control group involving no manipulation of the independent variable(s). It is then possible to attribute any differences between the outcomes for the two groups as arising from the treatment.
(See also laboratory experiment and quasi-experiment.)
A form of non-probabilistic sampling in which participants are selected from those who are convenient, perhaps because it is easy to get access to them or they are willing to help. (See sampling technique.)
This is a form of survey used with an expert group, with the aim of identifying those issues where the group can reach a consensus, and those where individuals can express disagreement with the group. The survey is administered iteratively to the group, and for each iteration after the first, each participant receives a summary of their own responses and the mean response of the group on the previous iteration, allowing them to change their response to move closer to the norm or to choose to differ.
(Also termed response variable or outcome variable.) This changes as a result of changes to the independent variable(s) and is associated with effect. The outcomes of a study are based upon measurement of the dependent variable.
descriptive (case study)
(See case study.)
Assignment of values to an attribute of an entity by some form of counting.
A divergence occurs when a study is not performed as specified in the experimental protocol, and all divergences should be both recorded during the study and reported at the end.
This involves applying the experimental treatment to (usually) a single recipient, in order to test the experimental procedures (which may include training, study tasks, data collection and analysis).
The effect size provides a measure of the strength of a phenomenon. Many measures of effect size are in use, catering to different types of treatment outcome measure, including the standardised mean difference, the log odds ratio, and the Pearson correlation coefficient.
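As an illustration, the standardised mean difference (often reported as Cohen's d) divides the difference between two group means by their pooled standard deviation. The task-time data below are hypothetical:

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference: the gap between the group means
    divided by the pooled (sample) standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical task-completion times (minutes) under two treatments.
treatment = [12, 14, 11, 13, 12, 14]
control = [15, 17, 16, 18, 15, 17]
print(round(cohens_d(treatment, control), 2))
```

A negative value here simply indicates that the treatment group's mean is lower than the control group's; the magnitude is what conveys the strength of the effect.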
Relying on observation and experiment rather than theory. (Collins English Dictionary)
The study of standards of conduct and moral judgement. (Collins English Dictionary) Codes of ethics for software engineering are published by the British Computer Society and the ACM/IEEE. Any empirical study that involves human participants should be vetted by the department’s ethics committee to ensure that it does not disadvantage any participants in any way.
A form of observational study that involves observation only, and hence without any form of intervention or participation by the observer.
To judge the quality of; appraise. (Collins English Dictionary)
An approach to empirical studies by which the researcher seeks to identify and integrate the best available research evidence with domain expertise in order to inform practice and policy-making. Originating in clinical medicine (EBM) it has been adopted for a number of other domains, including software engineering (Kitchenham et al. 2004), (Dybå, Kitchenham & Jørgensen 2005). The normal mechanism for identifying and aggregating research evidence is the systematic review.
An approach to decision-making that is based upon evidence, but which may also take into account other factors.
(In systematic reviews.) After performing a search for papers (primary studies) when performing a review, the exclusion criteria are used to help determine which ones will not be used in the study. (See inclusion criteria.)
A study in which an intervention (i.e. a treatment) is deliberately controlled to observe its effects (Shadish, Cook & Campbell 2002).
explanatory (case study)
(See case study.)
exploratory (case study)
(See case study.)
In the context of software products, an external attribute is a property that can only be observed when the software product executes.
A generic term for an empirical study undertaken in real-life conditions. In software engineering this is likely to be in the form of a case study.
(Aka normal distribution.) A bell-shaped distribution which is the basis for many statistical tests.
In the context of a systematic review, grey literature refers to unpublished primary studies that report the outcomes of completed research projects, where these are usually in the form of PhD theses, academic technical reports, industry and government white papers and versions of papers that are ‘in press’ or published as some form of pre-print. Searching for grey literature is considered to be important in order to avoid publication bias, which may arise when articles describe primary studies that did not find novel results and so are likely to be of less interest to journals. For a discussion of grey literature in systematic reviews, see (Kitchenham, Madeyski & Budgen, 2023) “How should software engineering secondary studies include grey material?”
Forms a testable prediction of a cause-effect link. Associated with a hypothesis is a null hypothesis which states that there are no underlying trends or dependencies and that any differences observed are coincidental. A statistical test can be used to determine the probability that the null hypothesis can or cannot be rejected.
(As used in systematic literature reviews.) After performing a search for papers (primary studies) when performing a review, the inclusion criteria are used to help determine which ones contain relevant data and hence will be used in the study. (See exclusion criteria.)
In a controlled experiment, an independent variable (also known as a stimulus variable or an input variable) is manipulated by the investigator in order to assess its relationship with the dependent variable. In a correlation study, which is based on available observations, an independent variable is a variable that is expected to affect a dependent variable.
Assigning values to an attribute of an entity by measuring other attributes and using these with some form of ‘measurement model’ to obtain a value for the attribute of interest.
(See independent variable.)
The ‘vehicle’ or mechanism used in an empirical study as the means of data collection (for the example of a survey, the instrument might be a questionnaire).
A term used in software metrics to refer to a measurable attribute that can be extracted directly from a software document or program without reference to other software process or project attributes.
In IS and computing in general, interpretive research is ‘concerned with understanding the social context of an information system: the social processes by which it is developed and construed by people and through which it influences, and is influenced by, its social setting’ (Oates 2006). See also positivism.
An interval scale is one whereby we have a well-defined ratio of intervals, but have no absolute zero point on the scale, so that we cannot speak of something being ‘twice as large’. Operations on interval values include testing for equivalence, greater and less than, and for a known ratio. (See measurement scales.)
A mechanism used for collecting data from participants for surveys and other forms of empirical study. The forms usually encountered are structured, semi-structured and unstructured. The data collected are primarily subjective in form.
Intrusive data collection is a method of data collection that relies upon the active participation of the individual responsible for generating the data. This involves activities such as filling out forms, attending interviews, answering questions, performing ‘think-aloud’ etc. This can be compared with non-intrusive data collection, where data generation is fully automated.
Sometimes referred to as a controlled experiment. A laboratory experiment involves the identification of precise relationships between experimental variables by means of a study that takes place in a controlled environment (the ‘laboratory’), involves human participants, and is supported by quantitative techniques for data collection and analysis.
For systematic reviews it is more common to use the term ‘limitations’ than ‘threats to validity’ when considering the evidence included in the review and also the review process used.
Refers to a form of study that involves repeated observations of the same items over long periods of time.
25% of the values in a data set fall below this value.
(Sometimes termed a scoping review.) A form of secondary study intended to identify and classify the set of publications on a topic. May be used to identify ‘evidence gaps’ where more primary studies are needed as well as ‘evidence clusters’ where it may be practical to perform a systematic review.
This is the process whereby the participants change behaviour between tests, perhaps because of growth (as with children) or because the study has given practice in the skills involved.
Often referred to as the average and one of the three measures of the central tendency. Computed by adding the data values and dividing by the number of elements in the dataset. It is only meaningful for data forms that have genuinely numerical values (as opposed to codes).
The process by which numbers or symbols are assigned to attributes of real-world entities using a well-defined set of rules. Measurement may be direct (for example length) or indirect, whereby we measure one or more other attributes in order to obtain the value (an example of this is measuring the length of a column of mercury on a thermometer in order to measure temperature).
(A.k.a. 50 percentile.) One of the three measures of the central tendency. This is the value that separates the upper half of a set of values from the lower half, and we compute it by listing the values and taking the middle one (or the average of two middle ones if there is an even number of elements). Then half (50%) of the elements have values above the median and half have values below.
The process of statistical pooling of similar quantitative studies (Petticrew & Roberts 2006). Commonly used within secondary studies such as systematic reviews to provide a summary of results from a series of primary studies testing the same hypothesis and to explore similarities and differences between the outcomes. For software engineering experiments, meta-analysis is needed to address the small sample sizes found in many individual experiments.
One of the three measures of the central tendency. This is the value that occurs most frequently in a data-set.
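The three measures of the central tendency described above (mean, median and mode) can all be computed with Python's statistics module; the defect counts below are purely illustrative:

```python
import statistics

# Hypothetical defect counts reported for ten modules.
defects = [3, 1, 4, 1, 5, 9, 2, 6, 5, 5]

mean = statistics.mean(defects)      # sum of the values / number of values
median = statistics.median(defects)  # middle value of the sorted list
                                     # (average of the two middle values here)
mode = statistics.mode(defects)      # the most frequently occurring value

print(mean, median, mode)
```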
A nominal scale is essentially one that consists of a number of categories, with no sense of ordering. So the only operation that is meaningful is a test for equality (or inequality). An example of a nominal scale might be programming languages. (See measurement scales.) Nominal values are often used in regression studies by converting each element in the scale to a dummy binary value. If the nominal scale has n elements, it will be represented by n – 1 dummy variables; the nth condition occurs when each of the n – 1 dummy variables takes the value zero.
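The dummy-coding of a nominal scale described above can be sketched in plain Python. The programming-language values are illustrative; with n = 3 categories we obtain n − 1 = 2 binary variables, and the dropped first category acts as the baseline (both dummies zero):

```python
# Nominal 'language' values for five hypothetical projects.
languages = ["Java", "C++", "Python", "Java", "Python"]

categories = sorted(set(languages))  # ['C++', 'Java', 'Python']
dummies = categories[1:]             # drop the first category as the baseline

# One row of n - 1 binary indicators per observation.
coded = [[1 if lang == d else 0 for d in dummies] for lang in languages]
for lang, row in zip(languages, coded):
    print(lang, row)
```

Here a C++ project is represented by [0, 0], which is the "nth condition" mentioned above.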
(data collection) Non-intrusive forms are those that do not require any actions on the part of the participant(s). For example, the recording of keystrokes in an experimental interface. (N.B. use of these forms requires that participants are aware that they are being employed, in order to avoid ethical problems.)
Objective measures are those that are independent of the observer’s own views or opinions, and hence are repeatable by others. Hence they tend to be quantitative in form.
Non-intrusive forms of data collection (watching, listening…). This can be ‘overt’, where those being observed are aware of the presence of the observer, or ‘covert’ where they are unaware. There is a good discussion of observation in (Oates 2006).
An observational study seeks simply to record the actions and outcomes of the study, with no attempt to use these to confirm or refute any form of hypothesis. Observational studies are often used to explore an issue and to determine whether more rigorous forms might then be employed.
(As used in a questionnaire.) An open question is one that leaves the respondent free to provide whatever answer they wish, without any constraint on the number of possible answers. See also closed question.
An ordinal scale is one that ranks the elements, but without there being any sense of a well-defined interval between different elements. An example of such a scale might be cohesion where we have the idea that particular forms are better than others, but no measure of how much. Operations are equality (inequality) and greater than/less than. (See measurement scales.)
(See dependent variable.)
A data point that lies beyond the range of values that might be expected for a variable. The expectation may arise from a theoretical consideration (a predicted curve) or from a clustering of empirical values. Note that an outlier may still be a valid data point. Outliers should not be removed from the data set without a good reason. (See box plot.)
Someone who takes part (participates) in a study, sometimes termed a subject. Participant is the better term in a software engineering context because involvement nearly always has an active element, whereas subject implies a passive recipient.
A value that splits a data set into a stated percentage of values below it. I.e. the 25 percentile or lower quartile is the value such that 25% of data values are less than that value. (See median, lower quartile and upper quartile.)
A group of individuals or items that share one or more characteristics from which data can be extracted and analysed. (See sampling frame.)
An empiricist philosophical theory that holds that all genuine knowledge is either true by definition or derived by reason and logic from sensory experience. For a fuller discussion, see (Oates 2006).
A statistical concept, that determines the probability that a statistical test will correctly reject the null hypothesis, and hence the likelihood that a study can find significant effects. For a discussion of this in the software engineering context, see (Dybå, Kampenes & Sjøberg 2006).
(See also recall.) In the context of information retrieval, the precision of the outcomes of a search is a measure of the proportion of studies found that are relevant. (This makes no assumptions about whether or not all possible relevant documents were found.) So a value of 1.0 for precision would indicate that all of the documents found were relevant to the search, but says nothing about whether every relevant document was found.
This is an empirical study in which we directly make measurements about the objects of interest, whether by surveys, experiments, case studies etc.
(In the context of a case study.) This is a more detailed element derived from a research question and broadly similar to a hypothesis (and like a hypothesis can be derived from theory). The proposition forms the basis of the case study.
This word is used in two (similar but different) ways.
- For empirical studies in general, the experimental protocol is a document that describes the way that a study is to be performed. It should be written before the study begins and evaluated and tested through a ‘dry run’. During the actual study, any divergences from the protocol should be recorded.
- In the context of protocol analysis as a qualitative data analysis technique based upon the use of think-aloud, the protocol is a categorisation of possible utterances that is used to analyse the particular sequence produced by a participant while performing a task as well as to strip out irrelevant material.
Widely used in experimental psychology for analysing data related to expert knowledge and for probing into such issues as patterns of information use and strategies employed while performing a task. Requires participants to use think-aloud, verbalising their thoughts and ideas while performing a task. For details, see (Ericsson & Simon, 1993).
Publication bias is any systematic bias that would lead to certain types of study having a greater probability of being found or being missed in the systematic review search process than other types of study. There are two main types:
- Under-representation of negative results because journals and reviewers lack interest in negative results, and so some studies may never have their findings published.
- Over-representation of small-scale studies with positive results in new topic areas, because journals and reviewers favour novelty over rigour.
A measurement form that (typically) involves some form of human judgement or assessment in assigning values to an attribute, and hence which may use an ordinal scale or a nominal scale. Qualitative data is often subjective data, but the terms are not synonymous, because subjective data such as personal opinions can be quantified in terms of an ordinal agreement scale.
A measurement form that involves assigning values to an attribute using an interval scale or (more typically) a ratio scale. Quantitative data is also referred to as objective data; however, this is incorrect, since it is possible to have quantitative subjective data.
An experiment in which units are not assigned at random to the interventions. Field studies in software engineering are often quasi-experiments because interventions cannot usually be assigned at random and contextual factors cannot usually be controlled. Quasi-experiments can take a wide variety of forms, for more details see (Shadish, Cook & Campbell, 2002).
A data collection mechanism commonly used for surveys (but also in other forms of empirical study). Involves participants in answering a series of questions (which may be ‘open’ or ‘closed’).
A form of population sampling in which the experimenter tries for a balance between different groups within the overall population of interest, for example, male and female participants, Java and C++ programmers, etc.
randomised controlled trial
(RCT) A form of large-scale controlled field experiment using double blinding and a random sample from the population of interest. In clinical medicine this is regarded as the ‘gold standard’ in terms of experimental forms, but there is little scope to perform RCTs in disciplines such as software engineering where individual participant skill-levels are involved in the treatment.
An experiment in which units are assigned to receive the treatment or alternative condition by a random process such as a coin toss or a table of random numbers (Shadish et al. 2002).
A form of knowledge synthesis that accelerates the process of conducting a traditional systematic review through streamlining or omitting various procedures to produce evidence for stakeholders in a resource-efficient manner.
This is a scale where we have well-defined intervals and also an absolute zero to the scale. Operations are equality, greater than / less than, and ratio (such as ‘twice the size’). (See measurement scales.)
This refers to change in participant behaviour arising from being tested as part of the study, or from trying to ‘help’ the experimenter (hypothesis guessing). It may also arise because of the influence of the experimenter (such as any bias).
(See also precision.) In the context of information retrieval, the recall of the outcomes of a search (also termed sensitivity) is a measure of the proportion of all relevant studies found in the search. However, while a value of 1.0 for recall indicates that all relevant documents were found, it does not indicate how many irrelevant ones were also found.
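The relationship between recall and precision can be illustrated with a small hypothetical search outcome, where both measures are computed from the overlap between what was retrieved and what is actually relevant:

```python
# Hypothetical search outcome: 'retrieved' is what the search returned,
# 'relevant' is the full set of studies that should have been found.
retrieved = {"S1", "S2", "S3", "S4", "S5"}
relevant = {"S2", "S4", "S5", "S9"}

true_positives = retrieved & relevant

# Precision: the proportion of what was found that is relevant.
precision = len(true_positives) / len(retrieved)
# Recall: the proportion of all relevant studies that were found.
recall = len(true_positives) / len(relevant)

print(precision, recall)
```

Here recall is below 1.0 because study S9 was missed, and precision is below 1.0 because S1 and S3 were retrieved unnecessarily.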
The process of fitting data points to a model such as a curve.
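For the simplest model, a straight line y = a + b·x, the least-squares fit can be computed directly; the effort-against-size data below are illustrative only:

```python
# Illustrative data: effort (person-days) against module size (KLOC).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope b = covariance(x, y) / variance(x);
# intercept a = mean_y - b * mean_x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

print(round(a, 3), round(b, 3))
```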
The research question provides the rationale behind any empirical study, and states in broad terms the issue that the study is intended to investigate (for example, ‘do structured abstracts make it easier to obtain information about a study simply by reading the abstract?’). For experiments this will be the basis of the hypothesis used, but the idea is equally valid when applied to a more observational form of study.
For a survey, the response rate is the proportion of surveys completed and returned, compared to those issued.
An alternative term for the dependent variable.
risk of bias
For secondary studies, the concept of Quality Assessment has been largely replaced by the term Risk of bias (RoB). RoB can be associated with both individual studies and also with syntheses. The focus is on identifying methodological flaws that can bias the outcomes of studies. RoB for primary studies is used for three purposes: sensitivity analysis, concerned with the influence individual studies may have on specific findings; investigation of causes of heterogeneity (between positive and negative findings); and as part of the certainty assessment process. For synthesis it is concerned with the risk of publication bias arising from non-reporting of negative results.
This is the set (usually) of people who act as participants in a study (for example, a survey or a controlled laboratory experiment). Equally, it can be a sample set of documents or other entities as appropriate. An important aspect of a sample is the extent to which this is representative of the larger population of interest.
This is the number of independent units in an empirical study. Small sample sizes form a particular risk of bias for software engineering studies with human participants.
This is the set of entities that could be included in a survey, for example, people who have been on a particular training course, or who live in a particular place.
This is the means by which we select a sample from a sampling frame, and takes two main forms: probabilistic sampling and non-probabilistic sampling. Probabilistic sampling is an approach whereby we aim to obtain a sample that is a representative cross-section of the sampling frame. Major forms include random, systematic, stratified and cluster sampling. Where it is impractical or unnecessary to have a representative sample, non-probabilistic forms that may be employed include purposive, snowball, self-selection and convenience sampling. These are discussed in more detail in (Oates 2006).
(A.k.a. Scatter Plot.) These are often used to demonstrate the relationship between two variables, and the more closely the points cluster around a regression line with a non-zero gradient, then the closer the relationship between the variables. Conversely, if the best fitting regression line has a very small gradient then it is likely that there is no relationship.
(See mapping study.)
A secondary study does not generate any data from direct measurements, instead, it analyses a set of primary studies and usually seeks to aggregate the results from these in order to provide stronger forms of evidence about a particular phenomenon.
The square root of the variance of a set of values. (See variance.)
The ability of a statistical test to reveal a true pattern in the data (Wohlin, Runeson, Host, Ohlsson, Regnell & Wesslen 2012). If the power is low, then there is a high risk of drawing an erroneous conclusion. For a detailed discussion of statistical power in software engineering studies, see (Dybå et al. 2006).
(See independent variable.)
Subjective measures are those that depend upon a value judgement made by the observer, such as a ranking (‘this is more significant than that’). May be expressed as a qualitative value (‘better’) or in a quantitative form by using an ordinal scale.
A comprehensive research method for collecting information to describe, compare or explain knowledge, attitudes and behaviour. The purpose of a survey is to collect information from a large group of people in a standard and systematic manner and then to seek patterns in the resulting data that can be generalised to the wider population. Surveys can be: experimental, when used to assess the impact of some intervention; or descriptive, where the survey is used to enable assertions to be made about some phenomenon of interest and the distribution of particular attributes, with the concern being what form the distribution has rather than why it exists.
The process of systematically combining different sources of data (evidence) to answer a research question.
This is a particular form of secondary study and aims to provide an objective and unbiased approach to finding relevant primary studies, and to extracting and aggregating the data from these. The use of systematic reviews in software engineering is discussed in (Kitchenham, Budgen & Brereton 2016), and a more general approach (based upon the social sciences) is provided in (Petticrew & Roberts 2006).
systematic literature review
In software engineering, this term was originally used in preference to the more conventional systematic review to avoid confusion with code reviews. See the entry for systematic review.
This form of study effectively performs a secondary study that uses the outputs of secondary studies as its inputs, perhaps by examining the secondary studies performed in a complete discipline or a part of it.
Conventionally, this provides a measure of the reliability and stability of a survey instrument. Respondents are ‘tested’ at two well-separated points in time and their responses are compared for consistency by means of a correlation test.
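A minimal sketch of such a consistency check uses the Pearson correlation coefficient on the two sets of responses; the Likert-scale values below are hypothetical:

```python
# Hypothetical 1-5 Likert responses from the same six respondents,
# collected at two well-separated points in time.
test = [4, 3, 5, 2, 4, 1]
retest = [4, 3, 4, 2, 5, 1]

n = len(test)
mean_t = sum(test) / n
mean_r = sum(retest) / n

# Pearson r: covariance divided by the product of the standard deviations.
cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(test, retest))
sd_t = sum((t - mean_t) ** 2 for t in test) ** 0.5
sd_r = sum((r - mean_r) ** 2 for r in retest) ** 0.5

r = cov / (sd_t * sd_r)
print(round(r, 3))
```

A value of r close to 1 suggests that respondents answered consistently on the two occasions.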
Used with protocol analysis. Participants in a study are trained to ‘speak out’ their thoughts while performing a task. These verbalisations (often termed utterances) are recorded and then analysed to identify particular patterns of behaviour.
This is the ‘intervention’ element of an experiment (the term is really more appropriate to randomised controlled trials where the participants are recipients). In software engineering it may describe a task (or tasks) that participants are asked to perform such as writing code, testing code, reading documents.
Triangulation is a method used to increase the credibility and validity of research findings. It may involve using different methods to investigate, or applying different methods of collecting data about, the same phenomena.
See definition of bias.
75% of data values lie below this value (and hence there are 25% of values above it).
This is concerned with the degree to which we can ‘trust’ the outcomes of an empirical study, and is usually assessed in terms of four commonly-encountered forms of threat to validity. The following definitions are based upon those used in (Shadish et al. 2002): internal, relating to inferences that the observed relationship between treatment and outcome reflects a cause-effect relationship; external, relating to whether a cause-effect relationship holds over other conditions, including persons, settings, treatment variables, and measurement variables; construct, relating to the way in which concepts are operationalised as experimental measures; statistical conclusion, relating to inferences about the relationship between treatment and outcome variables. For systematic reviews the concept of Risk of Bias (RoB) is now considered more appropriate.
A measure of the spread of a set of values. Formally, the mean squared deviation from the mean. (See standard deviation.)
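The variance and standard deviation definitions above can be computed directly; the values below are illustrative:

```python
import statistics

values = [4, 8, 6, 5, 3, 7]

# Variance: the mean squared deviation from the mean (population form).
mean = statistics.mean(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Standard deviation: the square root of the variance.
std_dev = variance ** 0.5

# statistics.pvariance and statistics.pstdev compute the same quantities;
# statistics.variance / stdev instead divide by n - 1 (the sample forms).
print(variance, std_dev)
```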
(A.k.a. sequential experiment.) Refers to one of the possible designs of a laboratory experiment. In this form, participants receive a number of different treatments, with the order in which these are received being randomised. The basic design (two treatments) is an ‘A/B-B/A crossover’ form whereby some participants receive treatment A and then treatment B, while others receive them in reverse order.
This page last updated in May 2023.