Judgement-based statistical analysis
Stephen Gorard
Department of Educational Studies, University of York, email:
sg25@york.ac.uk
Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
Abstract
There is a misconception among social scientists that statistical analysis is somehow a technical, essentially objective, process of decision-making, whereas other forms of data analysis are judgement-based, subjective and far from technical. This paper focuses on the former part of the misconception, showing, rather, that statistical analysis relies on judgement of the most personal and non-technical kind. Therefore, the key to reporting such analyses, and persuading others of one's findings, is the clarification and discussion of those judgements and their (attempted) justifications. In this way, statistical analysis is no different from the analysis of other forms of data, especially those forms often referred to as ‘qualitative’. By creating an artificial schism based on the kinds of data we use, the misconception leads to neglect of the similar logic underlying all approaches to research, encourages mono-method research identities, and so inhibits the use of mixed methods. The paper starts from the premise that all statistical analyses involve the format: data = model + error, but where the error is not merely random variation. The error also stems from more systematic sources such as non-response, estimation, transcription, and propagation. This total error component is an unknown and there is, therefore, no technical way of deciding whether the error dwarfs the other components. Our current techniques are largely concerned with the sampling variation alone. However complex the analysis, at heart it involves a judgement about the likely size of the error in comparison to the size of the alleged findings (whether pattern, trend, difference, or effect).
Introduction
‘Statistics are no substitute for judgement’
Henry Clay (1777-1852)
The paper reminds readers of the role of judgement in statistical decision-making via an imaginary example. It introduces the three common kinds of explanations for observed results: error or bias, chance or luck, and any plausible substantive explanations. The paper then re-considers standard practice when dealing with each of these types in turn. Our standard approach to these three types needs adjusting in two crucial ways. We need to formally consider, and explain why we reject, a greater range of plausible substantive explanations for the same results. More pressingly, we need to take more notice of the estimated size of any error or bias relative to the size of the ‘effects’ that we uncover. The paper concludes with a summary of the advantages of using judgement more fully and more explicitly in statistical analysis.
An example of judgement
Imagine trying to test the claim that someone is able mentally to influence the toss of a perfectly fair coin, so that it will land showing heads more than tails (or vice versa) by a very small amount. We might set up the test using our own set of standard coins selected from a larger set at random by observers, and ask the claimant to specify in advance whether it is heads (or tails) that will be most frequent. We would then need to conduct a very large number of coin tosses, because a small number would be subject to considerable ‘random’ variation. If, for example, there were 51 heads after 100 tosses the claimant might try to claim success even though the a priori probability of such a result is quite high anyway. If, on the other hand, there were 51 tails after 100 tosses the claimant might claim that this is due to the standard variation, and that their influence towards heads could only be seen over a larger number of trials. We could not say that 100 tosses would provide a definitive test of the claim. Imagine instead, then, one million trials yielding 51% heads. We have at least three competing explanations for this imbalance in heads and tails. First, this could still be an example of normal ‘random’ variation, although considerably less probable than in the first example. Second, this might be evidence of a slight bias in the experimental setup such as a bias in one or more coins, the tossing procedure, the readout or the recording of results. Third, this might be evidence that the claimant is correct; they can influence the result.
In outline, this situation is one faced by all researchers using whatever methods, once their data collection and analysis is complete. The finding could have no substantive significance at all (being due to chance). It could be due to ‘faults’ in the research (due to some selection effect in picking the coins perhaps). It could be a major discovery affecting our theoretical understanding of the world (a person can influence events at a distance). Or it could be a combination of any of these. I consider each solution in turn.
The explanation of pure chance becomes less likely as the number of trials increases. In some research situations, such as coin tossing, we can calculate this decrease in likelihood precisely. In most research situations, however, the likelihood can only be an estimate. In all situations we can be certain of two things – that the chance explanation can never be discounted entirely (Gorard 2002a), and that its likelihood is mostly a function of the scale of the research. Where research is large in scale, repeatable, and conducted in different locations, and so on, then it can be said to have minimised the chance element. In the example of one million coin tosses this chance element is small (less than the 1/20 threshold used in traditional statistical analysis), but it could still account for some of the observed difference (either by attenuating or disguising any ‘true’ effect).
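The scale of this chance element can be estimated directly. The following sketch (in Python; not part of the original paper, and using the normal approximation to the binomial rather than an exact calculation) contrasts the two scenarios above:

```python
import math

def p_at_least(heads, tosses, p=0.5):
    """One-tailed probability of observing at least `heads` in `tosses`
    fair flips, using the normal approximation to the binomial."""
    mean = tosses * p
    sd = math.sqrt(tosses * p * (1 - p))
    z = (heads - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# 51 heads in 100 tosses: z = 0.2, probability roughly 0.42 -- unremarkable
print(p_at_least(51, 100))
# 510,000 heads in 1,000,000 tosses: z = 20 -- astronomically improbable by chance
print(p_at_least(510_000, 1_000_000))
```

Even a vanishingly small probability of this kind speaks only to the chance explanation; it says nothing about bias in the coins, the tossing procedure, or the recording of results.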
If we have constructed our experiment well then the issue of bias is also minimised. There is a considerable literature on strategies to overcome bias and confounds as far as possible (e.g. Adair 1978, Cook and Campbell 1979). In our coin tossing example we could automate the tossing process, mint our own coins, not tell the researchers which of heads or tails was predicted to be higher, and so on. However, like the chance element, errors in conducting research can never be completely eliminated. There will be coins lost, coins bent, machines that malfunction, and so on. There can even be bias in recording (misreading heads for tails, or reading correctly but ticking the wrong column) and in calculating the results. Again, as with the chance element, it is usually not possible to calculate the impact of these errors precisely (even on the rare occasion that the identity of any error is known). We can only estimate the scale of these errors, and their potential direction of influence on the research. We are always left with the error component as a plausible explanation for any result or part of the result.
Therefore, to be convinced that the finding is a ‘true’ effect, and that a person can mentally influence a coin toss, we would need to decide that the difference (pattern or trend) that we have found is big enough for us to reasonably conclude that the chance and error components represent an insufficient explanation. Note that the chance and error components not only have to be insufficient in themselves, they also have to be insufficient in combination. In the coin tossing experiment, is 51% heads in one million trials enough? The answer will be a matter of judgement. It should be an informed judgement, based on the best estimates of both chance and error, but it remains a judgement. The chance element has traditionally been considered in terms of null-hypothesis significance-testing and its derivatives, but this approach is seen as increasingly problematic, and anyway involves judgement (see below). But perhaps because it appears to provide a technical solution, researchers have tended to concentrate on the chance element in practice and to ignore the far more important components of error, and the judgement these entail.
If the difference is judged a ‘true’ effect, so that a person can mentally influence a coin toss, we should also consider the importance of this finding. This importance has at least two elements. The practical outcome is probably negligible. Apart from ‘artificial’ gambling games, this level of influence on coin tossing would not make much difference. For example, it is unlikely to affect the choice of who bats first in a five-match cricket test series. If someone could guarantee a heads on each toss (or even odds of 3:1 in favour of heads) then that would be different, and the difference over one million trials would be so great that there could be little doubt it was a true effect. On the other hand, even if the immediate practical importance is minor, a true effect would involve many changes in our understanding of important areas of physics and biology. This would be important knowledge for its own sake, and might also lead to more usable examples of mental influence at a distance. In fact, this revolution in thinking would be so great that many observers would conclude that 51% was not sufficient, even over one million trials. The finding makes so little immediate practical difference, but requires so much of an overhaul of existing ‘knowledge’, that it makes perfect sense to conclude that 51% is consistent with merely chance and error. However, this kind of judgement is ignored in many social science research situations, where our over-willing acceptance of what Park (2000) calls ‘pathological science’ leads to the creation of weak theories based on practically useless findings (Cole 1994, Davis 1994, Platt 1996a, Hacking 1999). There is an alternative, described in the rest of this paper.
The role of chance
To what extent can traditional statistical analysis help us in making the kind of decision illustrated above? The classical form of statistical testing in common use today was derived from experimental studies in agriculture (Porter 1986). The tests were developed for one-off use, in situations where the measurement error was negligible, in order to allow researchers to estimate the probability that two random samples drawn from the same population would have divergent measurements. In a roundabout way, this probability is then used to help decide whether the two samples actually come from two different populations. Vegetative reproduction can be used to create two colonies of what is effectively the same plant. One colony could be given an agricultural treatment, and the results (in terms of survival rates perhaps) compared between the two colonies. Statistical analysis helps us to estimate the probability that a sample of the results from each colony would diverge by the amount we actually observe, under the artificial assumption that the agricultural treatment had been ineffective and, therefore, that all variation comes from the sampling. If this probability is very small, we might conclude that the treatment appeared to have an effect. That is what significance tests are, and what they can do for us.
In light of current practice, it is also important to emphasise what significance tests are not, and cannot do for us. Most simply, they cannot make a decision for us. The probabilities they generate are only estimates, and they are, after all, only probabilities. Standard limits for retaining or rejecting our null hypothesis of no difference between the two colonies, such as 5%, have no mathematical or empirical relevance. They are arbitrary thresholds for decision-making. A host of factors might affect our confidence in the probability estimate, or the dangers of deciding wrongly in one way or another, including whether the study is likely to be replicated (Wainer and Robinson 2003). Therefore there can, and should, be no universal standard. Each case must be judged on its merits. However, it is also often the case that we do not need a significance test to help us decide this. In the agricultural example, if all of the treated plants died and all of the others survived (or vice versa) then we do not need a significance test to tell us that there is a very low probability that the treatment had no effect. If there were 1,000 plants in the sample for each colony, and only one survived in the treated group, and one died in the other group, then again a significance test would be superfluous (and so on). All that the test is doing is formalising the estimates of relative probability that we make perfectly adequately anyway in everyday situations. Formal tests are really only needed when the decision is not clear-cut (for example where 600/1000 survived in the treated group but only 550/1000 survived in the control), and since they do not make the decision for us, they are of limited practical use even then. Above all, significance tests only estimate a specific kind of sampling variation (also confusingly termed ‘error’ by statisticians), but give no idea about the real practical importance of the difference we observe. A large enough sample can be used to reject almost any null hypothesis on the basis of a very small difference, or even a totally spurious one (Matthews 1998).
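For the non-clear-cut survival figures above (600/1000 against 550/1000), the formal check involved is something like a two-proportion z-test. A minimal sketch (my illustration, not code from the paper):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

z, p = two_proportion_z(600, 1000, 550, 1000)
print(round(z, 2), round(p, 3))  # z ~ 2.26, p ~ 0.024: 'significant' at 5%
```

The test yields a probability just either side of the conventional threshold; whether a five-percentage-point survival difference matters is still a judgement about agriculture, not arithmetic.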
It is also important to re-emphasise that the probabilities generated by significance tests are based on probability samples (Skinner et al. 1989), or the random allocation of cases to experimental groups. They tell us the probability of a difference as large as we found, assuming that the only source of the difference between the two groups was the random nature of the sample. Fisher (who pioneered many of today’s tests) was adamant that a random sample was required for such tests (Wainer and Robinson 2003). ‘In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample… making it impossible either to estimate sampling variability or to identify possible bias’ (Statistics Canada 2003, p.1). If the researcher does not use a random sample then traditional statistics are of no use since the probabilities then become meaningless. Even the calculation of a reliability figure is predicated on a random sample. Researchers using significance tests with convenience, quota or snowball samples, for example, are making a key category mistake. Similarly, researchers using significance tests on populations (from official statistics perhaps) are generating meaningless probabilities (Camilli 1996, p.11). All of these researchers are relying on the false rhetoric of apparently precise probabilities, while abdicating their responsibility for making judgements about the value of their results. As Gene Glass put it ‘In spite of the fact that I have written stats texts and made money off of this stuff for some 25 years, I can’t see any salvation for 90% of what we do in inferential stats. If there is no ACTUAL probabilistic sampling (or randomization) of units from a defined population, then I can’t see that standard errors (or t-test or F-tests or any of the rest) make any sense’ (in Camilli 1996). He subsequently stated that, in his view, data analysis is about exploration, rather than statistical modelling or the traditional techniques for inference (Robinson 2004).
Added to this is the problem that social scientists are not generally dealing with variables, such as plant survival rates, with minimal measurement error. In fact, many studies are based on latent variables of whose existence we cannot even be certain, let alone how to measure them (e.g. the underlying attitudes of respondents). In agronomy there is often little difference between the substantive theory of interest and the statistical hypothesis (Meehl 1998), but in wider science, including social science, a statistical result is many steps away from a substantive result. Added to this are the problems of non-response and participant dropout in social investigations, that also do not occur in the same way in agricultural applications. All of this means that the variation in observed measurements due to the chance factor of sampling (which is all that significance tests estimate) is generally far less than the potential variation due to other factors, such as measurement error. The probability from a test contains the unwritten proviso - assuming that the sample is random with full response, no dropout, and no measurement error. The number of social science studies meeting this proviso is very small indeed. To this must be added the caution that probabilities interact, and that most analyses in the ICT-age are no longer one-off. Analysts have been observed to conduct hundreds of tests, or try hundreds of models, with the same dataset. Most analysts also start each probability calculation as though nothing prior is known, whereas it may be more realistic and cumulative (and more efficient use of research funding) to build the results of previous work into new calculations. Statistics is not, and should not be, reduced to a set of mechanical dichotomous decisions around a 'sacred' value such as 5%.
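The caution that probabilities interact can be made concrete: if each test carries a 5% chance of a spurious ‘significant’ result, the chance of at least one such result across many independent tests grows rapidly. A minimal sketch (my illustration; repeated tests on the same dataset are rarely independent, so this is only indicative):

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one spurious 'significant' result
    across n independent tests, each using threshold alpha."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 20, 100):
    print(n, round(chance_of_false_positive(n), 3))  # 0.05, then 0.642, then 0.994
```

An analyst running a hundred tests at the 5% level is thus almost guaranteed at least one ‘significant’ finding even where nothing is there.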
As shown at the start of this section, the computational basis of significance testing is that we are interested in estimating the probability of observing what we actually observed, assuming that the artificial null hypothesis is correct[1]. However, when explaining our findings there is a very strong temptation to imply that the resultant probability is actually an estimate of the likelihood of the null hypothesis being true given the data we observed (Wright 1999). Of course, the two values are very different, although it is possible to convert the former into the latter using Bayes’ Theorem (Wainer and Robinson 2003). Unfortunately this conversion, of the ‘probability of the data given the null hypothesis’ into the more useful ‘probability of the null hypothesis given the data’, requires us to use an estimate of the probability of the null hypothesis being true irrespective of (or prior to) the data. In other words, Bayes’ Theorem provides a way of adjusting our prior belief in the null hypothesis on the basis of new evidence (Gorard 2003). But doing so entails a recognition that our posterior belief in the null hypothesis, however well-informed, will now contain a substantial subjective component.
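The distinction can be made concrete with Bayes’ Theorem. In this sketch (all numbers invented for illustration), two analysts see the same data but hold different prior beliefs in the null hypothesis:

```python
def posterior_null(p_data_given_null, p_data_given_alt, prior_null):
    """P(H0 | data) via Bayes' Theorem, from P(data | H0),
    P(data | H1), and a subjective prior P(H0)."""
    numerator = p_data_given_null * prior_null
    denominator = numerator + p_data_given_alt * (1 - prior_null)
    return numerator / denominator

# The same 'significant' result, P(data | H0) = 0.04, read by two analysts:
print(posterior_null(0.04, 0.30, prior_null=0.50))  # ~0.12: H0 now looks unlikely
print(posterior_null(0.04, 0.30, prior_null=0.99))  # ~0.93: H0 still looks likely
```

The posterior depends heavily on the prior, which is precisely the ‘substantial subjective component’ the text describes.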
In summary, therefore, significance tests are based on unrealistic assumptions, giving them limited applicability in practice. They relate only to the assessment of the role of chance (explanation one in the introduction), tell us nothing about the impact of errors, and do not help decide whether any plausible substantive explanation is true. Even so, they require considerable judgement to use, and involve decisions that need to be explained and justified to any audience.
Effect or error?
Here is a real example of the role of judgement, selected because I was reading it recently and not because it is extreme or particularly problematic (in fact it comes from a very well-respected and influential source). From their analysis of the 1958 National Child Development Study and the 1970 British Cohort Study, Machin and Gregg (2003) found that ‘the extent of intergenerational mobility in economic status has reduced substantially over time’ (p.194), and claimed that ‘these findings are sizeable and important’ (p.196). This is their conclusion. Their actual finding is that the regression coefficient for individual earnings related to family income was 0.17 for both men and women in the 1958 study, but was 0.26 for men and 0.23 for women in the 1970 study. Given that we will assume both a chance element and an error component in both studies, do the figures justify the conclusion? In line with current practice, the authors do not argue the point or set out clearly the logic of their move from the figures to the conclusion (their ‘warrant’, see below), so it is up to the reader to make the judgement alone. This judgement over social mobility is important because the paper was presented via an influential think-tank to help form the policy-setting agenda for the current government.
Both of these cohort studies originally involved over 16,000 cases selected to be representative of the population of Britain. But this new analysis used only around 2,000 cases per cohort from the two studies. Originally, not all individuals were included in the sampling frame anyway (perhaps the families of very severely disabled or terminally-ill infants were not approached to participate) so there was some selection bias. Not all of those included in the sampling frame agreed to take part, so there was some volunteer bias among the 16,000. Not all who agreed initially have taken part in successive sweeps, so there is some bias from dropout (only around 70% of the original participants could be traced and were still available to the researchers in 1999, see Bynner et al. 2000). There will also be typical levels of error in reporting, recording, coding, transcription and calculation. Finally, there are systematic differences in the sampling procedures, question formats, and question contents between the 1958 and 1970 surveys. We can have no real idea of the overall level and impact of these combined biases, but that is not any reason to ignore them, or their propagation through the complex computations of multiple regression.
Machin and Gregg (2003) report conducting null-hypothesis significance-tests, and state that the differences are ‘mostly significant’ using the standard 5% threshold. But, as we have seen, this approach only addresses the chance element, and really only caters for a chance element introduced as a result of random sampling. It does not decide in any way whether the differences between the two surveys are worthy of further attention. In light of the errors in each survey and the other differences between them, it is perfectly reasonable to conclude that statistical significance of the kind reported by the authors is totally consistent with a position that claims there is no real difference. A difference between 0.17 and 0.23, in these circumstances, may not be substantial enough for us to base social policy on.
Even where the difference is judged substantial enough, the argument so far has only considered what may be termed the ‘methodological’ alternatives such as error and chance. There will also be a large number of substantive alternatives to the conclusions given, and the warrant should also show how the most plausible of these other explanations were considered, and the reasons why they were rejected in favour of the published conclusion. For example, is it possible that individual earnings would become easier to predict from family income if earnings become less variable over time, so the different coefficients may represent not so much less social mobility as less overall social variation in 1970? If this is not possible, because the second survey shows at least as much income variation as the first for example, then the authors could explain this, thereby eliminating a rival conclusion derived from the same finding.
To recapitulate: all research faces the problem outlined in the introduction, of having at least three types of explanation for the same observed data. The first is the explanation of chance. The second is an explanation based on error such as bias, confounds, and ‘contamination’. The third is a substantive explanation, from a range of plausible explanations. Null hypothesis significance tests cater for only the first of these, and only under very unrealistic conditions. What are the alternatives?
What are the alternatives?
The American Psychological Society and the American Psychological Association, among other concerned bodies, have suggested the use of effect sizes, confidence intervals, standard errors, meta-analyses, parameter estimation, and a greater use of graphical approaches for examining data. These could be complements to significance testing, but there has also been the suggestion that reporting significance tests should be banned from journals to encourage the growth of useful alternatives (Thompson 2002).
All of these ideas are welcome, but none is a panacea for the problems outlined so far – chiefly the problem of estimating the relative size of the error component. Most actually address the somewhat simpler but less realistic issue of estimating the variation due to random sampling. Confidence intervals and standard errors are based on the same artificial foundation as significance tests in assuming a probability-based sample with full response and no measurement error, and an ideal distribution of the data (de Vaus 2002). They are still inappropriate for use both with populations and non-random samples. Even for random samples, minor deviations from the ideal distribution of the data affect the confidence intervals derived from them in ways that have nothing to do with random error (Wright 2003). In addition, the cut-off points for confidence intervals are just as arbitrary as a 5% threshold used in significance tests (Wainer and Robinson 2003). In no way do these overcome the need for judgement or replication.
Whereas a significance test is used to reject a null hypothesis, an ‘effect size’ is an estimate of the scale of divergence from the null hypothesis of no difference. The larger the effect size, the more important the result (Fitz-Gibbon 1985). For example, a standard effect size from a simple experiment might be calculated as the difference between the mean scores of the treatment and control groups, proportional to the standard deviation for that score among the population. This sounds fine in principle, but in practice we will not know the population standard deviation. If we had the population figures then we would probably not be doing this kind of calculation anyway. We could estimate the population standard deviation by using the standard deviation for one or both of the two groups, but this introduces a new source of potential error.
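The calculation just described can be sketched as follows. This is a hypothetical two-group example (my numbers), using the pooled sample standard deviation in place of the unknown population value, which is exactly the extra source of error the text notes:

```python
import math
import statistics

def cohens_d(treatment, control):
    """Standardised mean difference, with the pooled sample SD
    standing in for the unknown population SD."""
    m1, m2 = statistics.mean(treatment), statistics.mean(control)
    s1, s2 = statistics.stdev(treatment), statistics.stdev(control)
    n1, n2 = len(treatment), len(control)
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

treated = [14, 15, 13, 16, 15]   # invented outcome scores
untreated = [12, 13, 11, 14, 12]
print(round(cohens_d(treated, untreated), 2))  # 1.93
```

Nothing in the arithmetic says whether 1.93, or any other value, is ‘large’; that remains the analyst's judgement.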
Table 1 – ‘Effect’ of aspirin on heart attacks in two groups

Condition | No heart attack | Heart attack | Total
----------|-----------------|--------------|------
Aspirin   |           10933 |          104 | 11037
Placebo   |           10845 |          189 | 11034
Total     |           21778 |          293 | 22071
Above all, the use of effect sizes requires considerable caution. Several commentators have suggested that, in standardising them, they become comparable across different studies, and so we see papers setting out scales describing the range of effect sizes that are substantial and those that are not. They therefore return us to the same position of dealing with arbitrary cut-off points as do confidence intervals and significance tests. Wainer and Robinson (2003) present an example of a problem for such scales. Table 1 summarises the results of a large trial of the impact of regular doses of aspirin on the incidence of heart attacks. A significance test such as chi-squared would suggest a significant difference between these two groups. But the effect size (in this case R-squared) is of the order of magnitude 0.001, which is far too small to be of any practical value, according to scales describing the meaning of effect sizes. On the other hand, there were 85 fewer heart attacks in the treatment group, which is impressive because of what they represent. The traditional odds ratio of the diagonals is over 1.8, reinforcing the idea that the effect size could be misleading in this case.
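These figures can be checked directly from Table 1. In this sketch (mine, not the authors’), the squared phi coefficient serves as the R-squared-style effect size for a 2x2 table:

```python
import math

def two_by_two_summary(a, b, c, d):
    """Summaries for a 2x2 table laid out as rows (aspirin; placebo)
    and columns (no heart attack; heart attack)."""
    odds_ratio = (a * d) / (b * c)  # ratio of the diagonals
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, phi ** 2, d - b  # d - b = fewer attacks with aspirin

odds, r_squared, fewer = two_by_two_summary(10933, 104, 10845, 189)
print(round(odds, 2), round(r_squared, 4), fewer)  # 1.83 0.0011 85
```

The same table thus yields an ‘impressive’ odds ratio and a ‘negligible’ effect size at once, which is exactly why mechanical scales for effect sizes mislead.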
In fact, of course ‘there is no wisdom whatsoever in attempting to associate regions of the effect-size metric with descriptive adjectives such as "small", "moderate", "large", and the like’ (Glass et al. 1981, p.104). Whether an effect is large enough to be worth bothering with depends on a variety of interlocking factors, such as context, cost-benefit, scale, and variability. It also depends on the relative size of the error component because, like all of the attempted technical solutions above, effect sizes do nothing to overcome errors. An effect size of 0.1 might be very large if the variability, the costs, and the errors in producing it, are low, while the benefits are high. Again, we are left only with our judgement and our ability to convey the reasons for our judgements as best we can.
Therefore, while these moves to extend the statistical repertoire are welcome, the lack of agreement about the alternatives, the absence of textbooks dealing with them (Curtis and Araki 2002), and their need for even greater skill and judgement means that they may not represent very solid progress (Howard et al. 2000). In fact, the alternatives to null hypothesis significance tests are doing little to assist scientific progress (Harlow et al. 1997). They do nothing to help us overcome the major error component in our findings which, as we saw above, is not due to pure chance. Unfortunately the vagaries of pure chance are the only things that classical statistical analyses allow us to estimate.
Discussion
If the above points are accepted it can be seen that merely producing a result, such as 51% heads, is not sufficient to convince a sceptical reader that the results are of any importance. In addition to explaining the methods of sampling, data collection and analysis (as relevant) authors need also to lay out a clear, logical warrant (Gorard 2002b). A key issue here is clarity of expression in the overt argument that leads from the results to the conclusions. At present, too much social science research seems to make a virtue of being obscure but impressive-sounding – whether it is the undemocratic way in which complex statistical models are presented in journals, or the use of neologisms that are more complex than the concepts they have been, ostensibly, created to describe (in Steuer 2002). Jargon-laden reports go into considerable mathematical detail without providing basic scientific information (Wright and Williams 2003). Clarity, on the other hand, exposes our judgements to criticism, and our warrant stems from that exposure of the judgement. Transparency does not, in itself, make a conclusion true or even believable, but it forces the analyst to admit the subjectivity of their analysis and allows others to follow their logic as far as it leads them.
Phillips (1999) reminds us that, despite their superficial similarity, falsification is very different to the null hypothesis testing of traditional statisticians. The first approach involves putting our cherished ideas ‘on-the-line’, deliberately exposing them to the likelihood of failure. It involves considerable creativity in the production of predictions and ways of testing them. The second involves a formulaic set of rules (mis)used to try and eliminate the null hypothesis, and so embrace the cherished alternative hypothesis. As such, it is only a very weak test of the alternative hypothesis. Perhaps, the apparent and soothing ‘rigour’ of traditional statistics has satisfied both researchers and research users, and so inhibited the search for more truly rigorous ways of testing our ideas. One obvious example of this is the preference among UK social science funders for increasingly complex methods of statistical analysis (the post hoc dredging of sullen datasets), over a greater use of quasi-experimental designs. For ‘despite the trapping of modeling, the analysts are not modeling or estimating anything, they are merely making glorified significance tests’ (Davis 1994, p.190).
Complex statistical methods cannot be used post hoc to overcome design problems or deficiencies in datasets. If all of the treated plants in the agricultural example, at the start of the paper, were placed on the lighter side of the greenhouse, with the control group on the other side, then the most sophisticated statistical analysis in the world could not do anything to overcome that bias. It is worth stating this precisely because of the ‘capture’ of funders by those pushing for more complex methods of probability-based traditional analysis, whereas of course, ‘in general, the best designs require the simplest statistics’ (Wright 2003, p.130). Or as Ernest Rutherford bluntly pointed out ‘If your experiment needs statistics, you ought to have done a better experiment’ (in Bailey 1967, see also the comments of Kerlinger in Daniel 1996). Therefore, a more fruitful avenue for long-term progress would be the generation of better data, open to inspection through simpler and more transparent methods of accounting. Without adequate empirical information 'to attempt to calculate chances is to convert mere ignorance into dangerous error by clothing it in the garb of knowledge' (Mills 1843, in Porter 1986, p.82-83).
Because simpler techniques so often produce the same results as complex analyses, Wright (2003) advises that ‘the simpler techniques should be reported and if appropriate the authors may report that the more advanced techniques led to similar conclusions… If you use advanced statistics, beyond what the typical psychology undergraduate would know, make sure that these are clearly described’ (p.130). Above all, it is essential that reports fully acknowledge the underlying pattern of analysis, which is that: data = model + error, where the ‘error’ is not just sampling variation but also genuine errors of the kind described so far. ‘It should be remembered that consideration of confounding and bias are at least as important’ (Sterne and Smith 2001, p.230). The pattern can be made to look much more complicated than this by the use of complex techniques, but this should never be allowed to mislead readers into thinking that any technique eliminates, or even addresses, the error component. Perhaps a better pattern for data modelling would be: data = model + random variation + error.
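The distinction between random variation and the other error component can be illustrated with a small simulation (a sketch with invented numbers, not a real dataset): random variation shrinks as cases are averaged, but a systematic error, such as a miscalibrated instrument or a biased sample, does not, and nothing in the data itself distinguishes the two.

```python
import random

random.seed(1)

# Sketch of: data = model + random variation + error (all figures invented).
# A fitted model absorbs what it can; the residual mixes harmless sampling
# noise with systematic error, and averaging removes only the former.
true_value = 10.0          # the 'model' component we hope to recover
systematic_bias = 0.5      # e.g. a miscalibrated instrument -- not random
observations = [true_value + systematic_bias + random.gauss(0, 1.0)
                for _ in range(1000)]

estimate = sum(observations) / len(observations)
print(round(estimate, 2))  # close to 10.5, not 10.0: the bias survives averaging
```

Increasing the sample size here tightens any confidence interval around 10.5, while the estimate remains half a unit from the truth: precision improves, accuracy does not.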
Perhaps one reason why research is not typically taught as an exercise in judgement is that judgement seems ‘subjective’ whereas computation is ostensibly ‘objective’. This distinction is often used by commentators to try to reinforce the distinction between a ‘qualitative’ and a ‘quantitative’ mode of reasoning and researching. But, in fact, we all combine subjective estimates and objective calculations routinely and unproblematically. Imagine preparing for a catered party, such as a wedding reception. We may know how many invitations we sent, and this is an objective number. We may know how much the catering will cost per plate, and this is another objective number. To calculate the cost of the party, we have to use the number invited to help us estimate the number who will attend, and this is a subjective judgement even when it is based on past experience of similar situations. We then multiply our estimate by the cost per plate to form an overall cost. The fact that one of the numbers is based on a judgement with which other analysts might disagree does not make the arithmetic any different, and the fact that we arrive at a precise answer does not make the final estimate any firmer. This last point is well known, yet when they conduct research many people behave as though it were not true. ‘Quantitative’ researchers commonly eschew the kind of judgement at the heart of their decisions, seeking instead pseudo-technical ways of having the decisions taken out of their hands.
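The party example can be set out as a short calculation (the figures are invented for illustration): two objective numbers combined with one subjective estimate yield a precise-looking total that is no firmer than the judgement inside it.

```python
# Two objective numbers and one subjective estimate (all figures invented).
invitations = 120          # objective: we know how many we sent
cost_per_plate = 25.00     # objective: the caterer's quote
expected_turnout = 0.80    # subjective: a judgement from past experience

# The arithmetic is identical whatever the provenance of the inputs.
estimated_cost = invitations * expected_turnout * cost_per_plate
print(estimated_cost)      # 2400.0 -- precise, but no firmer than the 0.80
```

If another analyst judged turnout at 0.70, the same objective inputs would yield 2100.0; the apparent precision of either answer tells us nothing about which judgement was better.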
‘Qualitative’ researchers, on the other hand, are exhorted to welcome subjectivity, and this leads many of them to ignore totally the computations that are at the heart of their patterns, trends, and narratives. Whenever one talks of things being rare or typical or great or related, this is a statistical claim, and it can only be substantiated statistically, whether it is expressed verbally or in figures (Meehl 1998). In fact, a consideration of how social science research is actually done, rather than how methodologists often claim it should be done, suggests that nearly all studies proceed in the same way – contextualised, value-attuned but logical (Eisenhart and Towne 2003). Practical research is defined by a set of guiding principles which are the same whatever method is used. The need to test our ideas by seeking their falsification obviously applies not only to the statistical analysis of passive data but also to ‘qualitative’ methods, such as observation, and their over-reliance on grand theories (Phillips 1999). It is easy to find apparent ‘confirmations’ for any theory if one looks for them. What is harder to find is a theory that can stand the test of seeking refutations instead. ‘We can, for example, apparently verify the theory that the world is flat by citing some confirmations (it looks flat, balls placed on the flat ground do not roll away, and so forth), but nevertheless the theory is wrong’ (pp.175-176). This example also illustrates that consensual validation is not really validation at all. There are many examples of widely held beliefs, such as that the world is flat, that have turned out to be wrong, so inter-researcher agreement on a theory tells us nothing about its value. Immediate peer approval is not a test. Longer-term acceptance of findings tends to be based on their practical success.
Similarly, the process of falsification shows us that mere coherence, plausibility, checking the theory with the participants in the study, and even triangulation are all weak or ineffectual as criteria for establishing the quality of qualitative enquiry. Qualitative work, no less than quantitative work, requires judgement that is laid open to inspection by critics.
At present, much of science is bedevilled by ‘vanishing breakthroughs’, in which apparently significant results cannot be engineered into a usable policy, practice or artefact. Epidemiology, in particular, and perhaps dietary advice, cancer treatment, genetics and drug development have become infamous for these vanishing breakthroughs. The traditional guidelines for significance tests, and the apparently standardised scales for effect sizes, are producing too many results that simply disappear when scaled up. When Fisher indirectly suggested the 5% threshold he may have done so because he felt it was more important not to miss possible results than it was to save time and effort in fruitless work on spurious results (Sterne and Smith 2001). He also assumed that this was relatively safe, because he envisaged a far higher level of direct replication in agricultural studies than we see today in social science. However, rather than simply lowering the 5% threshold, this paper argues for a recognition that any such threshold is concerned only with the chance explanation. Of much more concern is the relative size of the propagated error component.
We could always set out to estimate a band of error, and judge its potential impact. This would give us a far more realistic idea of the ‘confidence’ we can place in our results than any confidence interval. For example, imagine a survey that sets out to include 2,000 people. It receives 1,000 responses, for a 50% response rate, of which 50% are from men and 50% from women. On one key question, 55% of men respond in a particular way, but only 45% of women do. This is clearly a statistically ‘significant’ result, and it leads to a medium-sized ‘effect size’. Therefore, traditional approaches lead us to the conclusion that men and women differ in their response to this question. But neither of these measures, nor any of the other alternatives discussed, takes into account the response rate. Since 275 of the men and 225 of the women respond in this particular way, we would need only 50 more of the 500 female non-respondents than of the 500 male non-respondents to have responded in that way (had they been made to respond) for there to be no difference at all. Put another way, if most non-respondents had responded in the same proportions as the actual respondents, then we need only assume that 5% of all non-respondents would have responded differently for the difference between men and women to disappear. In this case, the difference we observe seems very small in comparison to the non-response. As we add in the potential errors caused at each stage of our survey (measurement and transcription error, for example), we may conclude that the difference is not worth investing further effort in, because studies of the implementation of research findings show that the signal-to-noise ratio gets even weaker as the results are rolled out into policy and practice.
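The arithmetic of this sensitivity check can be sketched as follows, using the figures of the survey example above:

```python
# Survey of 2,000 people; 1,000 respond (500 men, 500 women).
men_resp, women_resp = 500, 500
men_yes = round(0.55 * men_resp)      # 275 men answer the key question this way
women_yes = round(0.45 * women_resp)  # 225 women do

men_missing, women_missing = 500, 500  # the non-respondents

# How many extra 'yes' answers among the female non-respondents (relative to
# the male non-respondents) would erase the observed difference entirely?
swing_needed = men_yes - women_yes
total_missing = men_missing + women_missing

print(swing_needed)                        # 50 people
print(100 * swing_needed / total_missing)  # 5.0 -- % of all non-respondents
```

A swing of just 5% of the unobserved half of the sample would eliminate the ‘significant’ difference, which is why no test computed on the respondents alone can settle the matter.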
As a rule of thumb, we could say that we need to be sure that the effect sizes we continue to work with are substantial enough to be worth pursuing. As Cox (2001) describes it, from a medical standpoint, the key issue is whether the direction of the effect is firmly established and of such magnitude as to be of clinical importance. Clearly, this judgement depends on the variability of the phenomenon, its scale, and its relative costs and benefits (although, unfortunately, program evaluation currently tends to focus on efficacy alone, rather than in tandem with efficiency, Schmidt 1999). It also depends on the acknowledged ratio of effect to potential error. Therefore, an effect size that is worth working with will usually be clear and obvious from a fairly simple inspection of the data. If we have to dredge deeply for any effect, then it is probably ‘pathological’ to believe that anything useful will come out of it. We cannot specify a minimum size needed for an effect, but we can say with some conviction that, in our present state of knowledge in social science, the harder it is to find the effect the harder it will be to find a use for the knowledge so generated. It is probably unethical to continue to use public money pursuing some of the more ‘pathological’ findings of social science.
Probably the best ‘alternative’ to many of the problems outlined so far in contemporary statistical work is a renewed emphasis on judgements of the worth of results (Altman et al. 2000). The use of open, plain but ultimately subjective judgement is probably also the best solution to many of the problems in other forms of current research, such as how to judge the quality of in-depth data analysis or how to link theory and empirical work more closely (Spencer et al. 2003). If this course were adopted it would also have the effect of making it easier for new researchers to adopt mixed methods approaches as routine (Gorard 2004), without having to worry about which forms of data require a judgement-based analysis rather than a technical one. They all do.
References
Adair, J. (1973) The Human Subject, Boston: Little, Brown and Co.
Altman, D., Machin, D., Bryant, T. and Gardiner, M. (2000) Statistics with confidence, London: BMJ Books
Bailey, N. (1967) The mathematical approach to biology and medicine, New York: Wiley
Bynner, J., Butler, N., Ferri, E., Shepherd, P. and Smith, K. (2000) The design and conduct of the 1999-2000 surveys of the National Child Development Study and the 1970 British Cohort Study, Working Paper 1, London: Centre for Longitudinal Studies
Camilli, G. (1996) Standard errors in educational assessment: a policy analysis perspective, Education Policy Analysis Archives, 4, 4
Cole, S. (1994) Why doesn’t sociology make progress like the natural sciences?, Sociological Forum, 9, 2, 133-154
Cook, T. and Campbell, D. (1979) Quasi-experimentation: design and analysis issues for field settings, Chicago: Rand McNally
Cox, D. (2001) Another comment on the role of statistical methods, British Medical Journal, 322, 231
Curtis, D. and Araki, C. (2002) Effect size statistics: an analysis of statistics textbooks, presentation at AERA, New Orleans April 2002
Daniel, L. (1996) Kerlinger’s research myths, ERIC Digests ED410232
Davis, J. (1994) What’s wrong with sociology?, Sociological Forum, 9, 2, 179-197
de Vaus, D. (2002) Analyzing social science data: 50 key problems in data analysis, London: Sage
Eisenhart, M. and Towne, L. (2003) Contestation and change in national policy on "scientifically based" education research, Educational Researcher, 32, 7, 31-38
Fitz-Gibbon, C. (1985) The implications of meta-analysis for educational research, British Educational Research Journal, 11, 1, 45-49
Glass, G., McGaw, B. and Smith, M. (1981) Meta-analysis in social research, Beverley Hills, CA: Sage
Gorard, S. (2002a) The role of causal models in education as a social science, Evaluation and Research in Education, 16, 1, 51-65
Gorard, S. (2002b) Fostering scepticism: the importance of warranting claims, Evaluation and Research in Education, 16, 3
Gorard, S. (2003) Understanding probabilities and re-considering traditional research methods training, Sociological Research Online, 8,1, 12 pages
Gorard, S., with Taylor, C. (2004) Combining methods in educational and social research, London: Open University Press
Hacking, I. (1999) The social construction of what?, London: Harvard University Press
Harlow, L., Mulaik, S. and Steiger, J. (1997) What if there were no significance tests?, Marwah, NJ: Lawrence Erlbaum
Howard, G., Maxwell, S. and Fleming, K. (2000) The proof of the pudding: an illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis, Psychological Methods, 5, 3, 315-332
Machin, S. and Gregg, P. (2003) A lesson for education, New Economy, 10, 4, 194-198
Matthews, R. (1998) Statistical snake-oil: the use and abuse of significance tests in science, Cambridge: European Science and Environment Forum, Working Paper 2/98
Meehl, P. (1998) The power of quantitative thinking, speech delivered upon receipt of the James McKeen Cattell Fellow award at American Psychological Society, Washington DC, May 23rd
Park, R. (2000) Voodoo science: the road from foolishness to fraud, Oxford: Oxford University Press
Phillips, D. (1999) How to play the game: a Popperian approach to the conduct of research, in Zecha, G. (Ed.) Critical rationalism and educational discourse, Amsterdam: Rodopi
Platt, J. (1996) A history of US sociological research methods 1920-1960, Cambridge: Cambridge University Press
Porter, T. (1986) The rise of statistical thinking, Princeton: Princeton University Press
Robinson, D. (2004) An interview with Gene Glass, Educational Researcher, 33, 3, 26-30
Schmidt, C. (1999) Knowing what works: the case for rigorous program evaluation, IZA DP 77, Bonn: Institute for the Study of Labor
Skinner, C., Holt, D. and Smith, T. (1989) Analysis of complex surveys, Chichester: John Wiley and Sons
Spencer, L., Ritchie, J., Lewis, J. and Dillon, L. (2003) Quality in qualitative evaluation: a framework for assessing research evidence, London: Cabinet Office Strategy Unit
Statistics Canada (2003) Non-probability sampling, www.statcan.ca/english/power/ch13/ (accessed 5/1/04)
Sterne, J. and Smith, G. (2001) Sifting the evidence – what's wrong with significance tests?, British Medical Journal, 322, 226-231
Steuer, M. (2002) The scientific study of society, Boston: Kluwer
Thompson, B. (2002) What future quantitative social science could look like: confidence intervals for effect sizes, Educational Researcher, 31, 3, 25-32
Wainer, H. and Robinson, D. (2003) Shaping up the practice of null hypothesis significance testing, Educational Researcher, 32, 7, 22-30
Wright, D. (1999) Science, statistics and the three ‘psychologies’, in Dorling, D. and Simpson, L. (Eds) Statistics in society, London: Arnold
Wright, D. (2003) Making friends with your data: improving how statistics are conducted and reported, British Journal of Educational Psychology, 73, 123-136
Wright, D. and Williams, S. (2003) How to produce a bad results section, The Psychologist, 16, 12, 646-648
Note:
It should be noted that the Neyman/Pearson formulation of significance testing also assumes that the analyst has specified a precise alternative hypothesis in advance. In practice, of course, many analysts are concerned only with rejecting the null hypothesis at the 5% level.
This document was added to the Education-line database on 12 November 2004