The European Conference On Educational Research,

25-28 September, 1996.

Seville, Spain

INSPECTING HER MAJESTY'S INSPECTORS

SHOULD SOCIAL SCIENCE AND SOCIAL POLICY COHERE?

C. T. Fitz-Gibbon & N. J. Stephenson

Curriculum, Evaluation and Management Centre,
University of Durham,
DH1 1TA,
England

Abstract

Schools in England are now scheduled to be inspected by a national team from OFSTED (the Office for Standards in Education) at least once every four years in addition to being monitored by the local education authority (LEA). Inspection teams, hired on a contract basis for each inspection, make pre-announced, one week site visits. This extremely expensive system of inspection has replaced the long established procedures of Her Majesty's Inspectors. Is it working? What issues are raised for educational researchers?

A major feature of the inspection process is the observation of lessons, which are rated on a 5-point scale (soon to become a 7-point scale). Judgements are thus the main method of inspection. However, the Office for Standards in Education has not published any studies which apply the most elementary standards generally observed by social scientists using judgements: there are no studies, for example, of the reliability or validity of these judgements.

This paper draws on two methods of evaluation: blueprint analysis (are the purported methods reasonable and sound according to our current understanding of methodology?) and a survey designed to collect evidence on how the system is working in practice. Additional evidence was drawn from Ofsted publications and inspection reports.

The paper illustrates the application of social science methodology to the evaluation of a policy which requires major public expenditure. In addition to contributing to deliberations on school accountability the paper could be useful for teaching research methods.

Keywords: Inspection, Accountability, Research methods, Value for money

Introduction

Her Majesty's Inspectors (HMI), created in 1839 (sic), consisted of putatively independent inspectors who visited schools occasionally. HMI included among their number the poet Matthew Arnold, and although they were widely hated in the early parts of the century (e.g. Hogg, 1990) they later came to be seen as generally benign, supportive and highly skilled in their subject specialisms. This change in attitude might have been related to their adoption of the stance that their responsibility was primarily to report to the Secretary of State for Education on the state of the nation's schools; they were not to supervise individual schools. Each HMI served a year's apprenticeship before becoming a fully fledged inspector.

A new policy was introduced by the 1992 Education (Schools) Act with the creation of the Office for Standards in Education (OFSTED). HMI were severely reduced in numbers and are now largely working for OFSTED. OFSTED contracts inspections out to teams of inspectors on the basis of competitive tendering. Training for work on such a team, as this new kind of OFSTED inspector, was accomplished in one week for the first year or two although recently distance learning materials have extended the study period.

It has been estimated that about 70 percent of an inspector's time is spent on classroom observation, so it is particularly important to look at the value of this aspect of inspection. The classroom visits are also a source of stress for teachers, providing another reason to consider their value carefully. Furthermore, as researchers into school effectiveness we must be particularly interested to see whether inspectors' judgements concur with other methods of evaluating schools.

Conclusions are only as sound as the methods on which they are based, and it was clear from the initiation of OFSTED that the normal canons of research were being ignored. Indeed, reservations were expressed in a letter to the Chief Inspector before a single inspection had occurred. The letter suggested a way of checking one kind of validity: schools should keep their careful measures of student progress ("Value Added" indicators) away from inspectors until after the latter had made their judgements about the quality of teaching and learning in the various school departments. This would provide a chance to cross-validate inspectors' judgements about appropriate progress against a measure of appropriate progress. Such a check would clearly be worthless if the inspectors saw the data before making their judgements.

This perfectly proper, methodologically sound, proposal was, amazingly, met with a letter drawing attention to the threat of a "level 2" fine against any school which failed to provide inspectors with data. This threat, which is written into the legislation by which Ofsted was created, naturally aroused suspicions that all was not well.

Although inspection has an important role to play, since nothing can substitute for the direct observation of the way that a school is functioning, the strengths and weaknesses of inspection as currently introduced in England must be frankly confronted. (Scottish and Welsh systems operate rather differently, as does the inspectorate in England for non-compulsory education, run by the Further Education Funding Council.) The current proposals to have inspectors rate teachers on a seven point scale also make a consideration of the validity of inspection exceedingly urgent: the confidence of the public and the careers of teachers are at stake.

The Survey

Most systems, such as inspection, need systematic and independent monitoring and feedback. The survey reported here represents a move towards creating such monitoring. The survey is indicative rather than conclusive and serves mainly to illustrate one of the ways in which the inspection process should be monitored. (The contents are presented in Appendix A.)

Questionnaires were sent to 322 Head teachers whose schools were chosen on 7 criteria described shortly. The opinions of Head teachers were important as an indication of the effect inspection was having on key persons, but it must be noted that we do not suggest that the judgements of the OFSTED process can be validated or challenged simply on the basis of other opinions. As researchers, we know that methods have been developed to ensure basic standards of evidence and to increase the likelihood of observation and judgement yielding valid findings. It is of interest to ask whether these proper methods are being used, i.e. to conduct an evaluation of the OFSTED design, a "blueprint" evaluation. It would, however, be quite possible for Heads and others to believe every finding that emerged from an inspection, and for researchers still to need to criticise the methods.

The sample for the survey.

As a pilot survey, the intention was to compare and contrast deliberate sampling with two random samples.

One random sample was of one in four schools randomly selected from those participating in an unofficial self-monitoring system in collaboration with the University: the Year 11 Information System (YELLIS). (Briefly described in Note 1) A second random sample was drawn from a list of all schools. Together these two sources accounted for 68 percent of the mailing and at least && percent of the obtained responses. It is not possible to say exactly how many of the obtained responses were from each source since anonymous responses were permitted. Table 1 summarises the sample.

In addition to these random samples, 17 schools had at the time of the study been deemed to be "failing" secondary schools and each was sent a questionnaire, but only five responded. We were well aware that failing schools were coming under intense media scrutiny and had in general tried to cut down on their exposure, concentrating instead on coping with the demands consequent upon a rating of "failing". We wished to compare the failing schools with similar schools in the same area, and to that end we phoned each to ask for the names of schools which they considered similar to themselves; these nominations formed a matched sample for the failing schools.

We did not have the resources to obtain all inspection reports in order to match the inspectors' judgements of each department against data which we had already provided to schools through the YELLIS project. Instead we rank ordered YELLIS schools on the basis of the percentage of pupils with positive Value Added measures and selected the top 20 and the bottom 20 schools out of over 300. Thus we had schools whose pupils were, in 1994, generally making better than usual or worse than usual progress. (However, based on the experience of monitoring Value Added in schools since 1983, we would never recommend identifying a school by a single number from one year's results. Schools are highly complex and contain considerable variation within the school, from department to department. Single numbers cannot represent this variation and any such numbers are likely to vary across the years.)
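As an illustration of this deliberate sampling step, the following sketch (not the authors' code; the file and column names are hypothetical) ranks schools by the percentage of pupils with positive Value Added and takes the 20 schools at each extreme.

    # Illustrative only: rank schools by % of pupils with positive Value Added
    # and return the top and bottom n. File and column names are assumptions.
    import csv

    def extreme_value_added_schools(path, n=20):
        with open(path, newline="") as f:
            schools = list(csv.DictReader(f))  # expects 'school' and 'pct_positive_va' columns
        schools.sort(key=lambda row: float(row["pct_positive_va"]), reverse=True)
        return schools[:n], schools[-n:]

    # top20, bottom20 = extreme_value_added_schools("yellis_1994.csv")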

By random sampling combined with deliberate sampling we had six categories in the sample. A seventh was created from the 8 percent of schools that chose to remain anonymous. The achieved sample is described in Table 1. The largest groups were the two random samples of 106 from the DfEE list and 113 from the 1-in-4 YELLIS sample.

It turned out that the YELLIS group did not differ from the DfEE random sample on Free School Meals, suggesting that whether or not schools become involved in detailed quantitative monitoring is not a reflection of the socio-economic status of the school's intake.

One hundred and fifty nine Head teachers responded of whom 88 reported being inspected by Ofsted. This was a response rate of 51 percent which is considered high for a response to a one-off survey. All but 10 of these reported the "rating" of the school arrived at by Ofsted. The rating is a widely reported datum which can strongly affect the image of the school. The ratings reported were 'fail': 5 percent, 'cause for concern': 6 percent, 'satisfactory': 65 percent, 'good' : 10 percent and 'outstanding': 13 percent.

Inspection of the inspectors.

There is nothing inherently unscientific in using human judgements as measurements. In fact such procedures may often be essential. However, there are certain fundamental standards to be observed in the collection and analysis of such judgements before there can be confidence in the evidence presented.

Minimally, there should be information on: the representativeness and size of the sample of lessons and persons observed; the reliability of inspectors' judgements; and the validity of those judgements.

These issues are considered below from first principles (the 'blueprint' evaluation) and as illustrated by the survey.

Sampling

The two major methodological issues relating to the sample used in a school inspection are its representativeness and its size.

Representativeness of the sample.

Does what the inspectors see constitute a representative sample of the lessons, pupils, and parents? It seems unlikely that the announced visits of Ofsted inspectors provide them with a view of the school as it is normally functioning. Indeed, none of the questionnaire respondents agreed strongly with the statement "An Ofsted inspector sees a school as it is normally"; only 7 percent agreed and 81 percent disagreed. The week of the Ofsted visit is a highly unusual week and the presence of inspectors in classrooms will have a major impact on the lessons presented. It may be thought that these lessons will be excellent, better than usual and therefore misleadingly good. However, such a hypothesis needs testing; if it were true, there would be reason for making inspections unannounced. There are numerous accounts of subterfuges to impress inspectors (Fitz-Gibbon, 1995; Hogg, 1990). There is also a contrary and widespread view that the lessons taught in front of inspectors should be very safe. In sum, the representativeness of the lessons observed is in question and no studies have been provided to show that this is not a matter for serious concern. Researchers in classrooms go to considerable lengths to be present so often that they eventually exert little influence on events and can ideally be 'a fly on the wall'. Observations should ideally be random and unannounced so that they are more likely to represent reality.

The size of the sample.

Even if the lessons observed were representative, we would still need to ask whether the number of lessons constituted a sufficiently large sample on which to base important judgements. There have long been methods available for estimating how many lessons need to be observed, and by how many persons, in order to obtain measures with acceptable levels of reliability (see, for example, Winer, 1971; Fitz-Gibbon and Clark, 1982; Medley & Mitzel, 1963). Ofsted has presented no justification for the number of lessons observed, no studies as to whether different subjects require different lengths of observation, and no studies of the proportion of a lesson that needs to be observed for the kinds of judgements that are being made. Nor has it shown any cognisance of the methodological fact that to judge individuals rather than, say, departments or schools will require unprecedentedly high levels of reliability, probably requiring long periods of observation. Nor have any justifications been presented for the numbers of pupils interviewed in the course of an inspection.
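A minimal sketch of the kind of reasoning the cited sources make available is given below: the classical Spearman-Brown calculation of how many observed lessons are needed before the mean rating reaches a target reliability. The single-lesson reliability of 0.4 used here is purely an assumed value for illustration.

    # Spearman-Brown: reliability of the mean of k observations, and the k
    # required to reach a target reliability, given an assumed single-lesson
    # reliability r. All figures below are illustrative assumptions.
    def reliability_of_mean(r: float, k: float) -> float:
        return k * r / (1 + (k - 1) * r)

    def lessons_needed(r: float, target: float) -> float:
        return target * (1 - r) / (r * (1 - target))

    print(reliability_of_mean(0.4, 5))   # mean of 5 lessons -> about 0.77
    print(lessons_needed(0.4, 0.9))      # about 13.5 lessons for reliability 0.9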

The entire design seems to be based on received wisdom rather than checked by proper methods. The problem here is that the received "wisdom" may not be adequate. HMI were highly respected but they were not experts in checking the adequacy of their methods, methods which had developed before there were good foundations for the statistical procedures needed for careful analyses of judgements. Using methods of the last century, uninformed by this century's advances, might well be considered as indefensible in social science as it would be in medicine.

Reliability

Even if we overlook the inadequacies of the sampling methods, we still have to deal with issues of reliability and validity. To establish that Ofsted judgements are 'reliable' requires that Ofsted demonstrate that it does not matter which inspection team inspects a school or when they visit.

Equally, does it matter which inspector is observing a classroom? Would all observers agree to a sufficient extent i.e. are the judgements of the classroom reliable?

There does not seem to have been a single study of "inter-rater reliability" from Ofsted, yet the issue as to agreement among inspectors is critical. If there is variation in the judgements two inspectors would make of the same lesson, department or school, which inspector is to be believed? The whole system is called into question if reliability is not demonstrated.
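By way of illustration, an inter-rater reliability check of the kind that appears to be missing could be as simple as the following sketch, which computes Cohen's kappa for two inspectors independently rating the same lessons on the 5 point scale; the ratings shown are invented.

    # Cohen's kappa for two raters of the same lessons (invented data).
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        fa, fb = Counter(rater_a), Counter(rater_b)
        expected = sum(fa[c] * fb[c] for c in set(rater_a) | set(rater_b)) / n ** 2
        return (observed - expected) / (1 - expected)

    inspector_1 = [3, 4, 2, 5, 3, 3, 4, 2, 1, 3]
    inspector_2 = [3, 3, 2, 4, 3, 4, 4, 3, 2, 3]
    print(cohens_kappa(inspector_1, inspector_2))  # about 0.29: far from close agreement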

How did Head teachers view this issue? Two questions related to reliability: Question 13, 'Two OFSTED teams working without contact would come to the same conclusions about a school', and Question 17, 'I believe the OFSTED teams have no difficulty in reconciling the judgements of each team member to provide a corporate view'.

Neither proposition attracted majority support. "Disagree" or "strongly disagree" was chosen by 48 percent for question 13 and 61 percent for question 17. These two items were not as mutually consistent as might have been expected: there was only a weak tendency (r = 0.41) for those Heads who believed that independent teams would agree to be also those who agreed that differences within an inspection team would be easily resolved. On further investigation, this low correlation turned out to be partly due to a difference between Heads whose schools were rated 'failing' or 'cause for concern' (henceforth referred to together as the so-called-problem-schools) and other Heads. Among the Heads of so-called-problem-schools the correlation between the two reliability questions was actually negative (r = -0.30). These Heads tended to have seen the inspection teams as monolithic, easily resolving conflicts, but strongly believed other teams might have arrived at different judgements. (See Figure 1.) It must be remembered that the school is not allowed to hear the inspection team's deliberations. These are conducted behind closed doors, unlike practice in the Further Education Funding Council (FEFC) where a member of the staff of the college is included in inspectors' meetings. Among the Heads whose schools were rated satisfactory or better (n=69) the correlation was 0.5. Perhaps the difference in perceptions represented a difference in the behaviour of the inspection teams, or perhaps it was a reaction to the rating.

Further evidence that inspectors actually disagree a good deal comes from reports of those who have trained to be inspectors, particularly from an exercise in which they all watch a video of a lesson and generally fail to agree on a rating for the lesson. Furthermore, in print, it was stated that

'The majority of RgIs (Registered Inspectors i.e. leaders of OFSTED teams) were able to make appropriate decisions about conflicting evidence' (Coopers & Lybrand, 1994, p. 10)

a statement which clearly implies that the majority of Registered Inspectors were presented with conflicting evidence, as did a subsequent statement on the same page:

' a few hasty compromises or unresolved conflicts were evident'

But we were assured that inspectors could

' ...move towards a corporate view on the range of issues prescribed by the Framework.... Four fifths of such meetings were successful in involving all or almost all inspectors, and sometimes the discussion was outstanding in producing corporate judgements.'

In summary, there does not seem to be any evidence to assure us that Ofsted inspectors' judgements are reliable, and Head teachers predominantly do not believe them to be reliable.

Validity

If the sample observed is inadequate and the judgements do not agree, there is an end to the issue: inspections are not secure judgements. Validity cannot be obtained without an adequate sample and reasonable consistency in the measurements. However, there can be arguments about what counts as an "adequate" sample and "reasonable" consistency, so, in the absence of data on reliability, we will leave that issue unresolved and move to the question of validity: the question as to whether Ofsted's judgements are correct.

Has the public been provided with evidence that the judgements inspectors make are correct? Can governors trust these judgements? Can parents rely on them? Ofsted's approach to validity has been to have a second team, fully aware of the report of the first team, re-examine the school. This is poor methodology since the second team is not a fresh, independent inspection. It is already biased by knowledge of the first report. Even agreement only establishes consistency, which could be a consistency of bias. Agreement does not constitute validity.

The issue of validity is one to which there is rarely a single or simple answer. To establish validity generally requires the accumulation of a variety of kinds of evidence, a collection which adds up to what Cronbach and Meehl (1955) called the "nomological net". Various kinds of validity are described in any basic text on measurement: construct, concurrent, predictive, face, and discriminant validity. These are discussed below.

Face validity.

Does the demand seem reasonable? Do the various judgements rendered appear reasonably feasible ones to make? Face validity is very much a matter of opinion. The notion that one can judge a school in a matter of hours is fairly widely accepted and hence inspection tends to have face validity. This does not mean it is correct. Face validity, an agreement that the procedure seems reasonable, may simply represent the shared misconceptions of persons who have never had the sobering experience of subjecting their judgements to the test of proper validity studies. The history of science is the history of replacing guesses and "the obvious" with careful measures which sometimes disconfirm what we thought we knew. The careful checks are represented by the other kinds of validity.

Construct validity.

Is the very construct of a single rating for a complex school reasonable? Is a 'failing school', for example, adequately defined? Is it methodologically valid to apply a single label to a whole school given that there is almost certainly considerable variation within every school? Has Ofsted published any consideration of the validity of the constructs given to it by the legislation, or must it simply accept these? (Who wrote the legislation and from what empirical base? Is it too much to ask, as we approach the 21st century, that legislation be informed by knowledge?)

Concurrent and predictive validity.

Are there some concurrent measures which would show differences in line with inspectors' judgements, and thus tend to confirm those judgements? Do the judgements predict detectable differences in the future performance of schools? Since Ofsted has defined effective schools as ones in which pupils make average or better progress, Value Added measures are clearly the ideal concurrent measure which should agree with inspectors' judgements. Discrepancies between Value Added measures and the Ofsted rating might certainly occur for good reasons, but the ratings should show considerable match with Value Added measures.
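A concurrent-validity check of this kind could be sketched as follows; the departmental ratings and Value Added figures are invented, and only the procedure (a rank correlation between the two sets of judgements) reflects the argument in the text.

    # Rank correlation between inspectors' departmental ratings and Value Added
    # measures (all values invented for illustration).
    from scipy.stats import spearmanr

    inspector_rating = [4, 2, 3, 5, 2]               # 5-point judgements per department
    value_added      = [0.3, -0.4, 0.1, 0.5, -0.2]   # mean residual gain per pupil

    rho, p = spearmanr(inspector_rating, value_added)
    print(f"rank correlation = {rho:.2f} (p = {p:.2f})")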

The response of Ofsted's first chief to suggestions to check this relationship was mentioned earlier. Despite the widespread availability of Value Added measures Ofsted has failed to publish any studies of this fundamentally important question relating to the validity of their judgements.

Views of Heads regarding the validity of the Ofsted process.

It could well be that Head teachers believe that visiting a school for a week provides an adequate basis for precise ratings. This would not prove that the view was correct. However, if Heads did not have confidence in the validity of the judgements this would have serious implications for the capacity of the system to have any positive effects on schools.

Nine questions in the form of statements to which respondents answered on a Likert scale (in a Likert scale a statement is made and the respondent selects from 'strongly disagree', 'disagree', 'neutral', 'agree' or 'strongly agree') were used to assess validity as perceived by Head teachers. Statements which checked on validity were those in items 3, 4, 5, 6, 9, 10, 11, 12 and 16, namely:

3. An OFSTED inspector can judge the achievement of a school against national expectations.

4. An OFSTED inspector can judge the achievement of a school taking into account pupils' capabilities.

5. The distinctions between moral and spiritual, cultural and social are clear.

6. An OFSTED inspector can assess moral and spiritual, cultural and social aspects of the school

9. OFSTED inspectors rely on what they see, not on what they are told.

10. An OFSTED inspector knows what constitutes 'good practice'

11. OFSTED teams have ways of assessing pupils' capabilities which provide better judgements than those made by staff in the school

12. OFSTED can accurately, reliably recognise 'failing schools'

16. If the judgement of the OFSTED team differed from data available to me on Value Added I would tend to believe the OFSTED judgement.

Correlations were sufficiently high for the responses to the items to be added and averaged, thus producing a scale for the perception of validity by the Head teachers. Scales are preferable to single items because the true variance becomes a larger proportion of the whole variance as more items are added. An individual item may attract some odd responses due to a particular interpretation, but if all items are answered in a particular direction this provides a strong indication of the respondent's view. The extent of agreement among items is called the internal consistency and is often measured by Cronbach's alpha (Cronbach & Gleser, 1963; Fitz-Gibbon & Morris, 1987; McKennell, 1970). For the validity scale in this study Cronbach's alpha was 0.79.
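For readers unfamiliar with the coefficient, the following sketch shows how a Cronbach's alpha such as the 0.79 quoted above is typically computed from a respondents-by-items matrix of Likert codes; the small data set is hypothetical.

    # Cronbach's alpha from a (respondents x items) matrix of Likert codes 1-5.
    import numpy as np

    def cronbach_alpha(responses: np.ndarray) -> float:
        k = responses.shape[1]
        item_var = responses.var(axis=0, ddof=1)        # variance of each item
        total_var = responses.sum(axis=1).var(ddof=1)   # variance of the summed scale
        return (k / (k - 1)) * (1 - item_var.sum() / total_var)

    data = np.array([[4, 5, 4, 4],   # hypothetical responses:
                     [2, 2, 3, 2],   # 5 respondents x 4 items
                     [3, 3, 3, 4],
                     [5, 4, 5, 5],
                     [1, 2, 1, 2]])
    print(round(cronbach_alpha(data), 2))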

On only one of the items assessing validity did more than half the Head teachers register a positive belief in the validity of the Ofsted judgement. This was Question 3, 'An Ofsted inspector can judge the achievement of a school against national expectations', with 51 percent of Head teachers opting for "Agree" or "Strongly Agree". On two other items (9 and 10) the positive responses were, however, as high as 41 per cent and 48 per cent. A study to check empirically the accuracy of perceptions regarding Question 3, for example, would need to test inspectors on their evaluation of the statistical data relating to national "expectations". So far as the authors are aware, there is no test of statistical reasoning required before inspectors are sent to interpret statistical data.

As for the knowledge of "good practice" this can be a shared prejudice rather than accurate knowledge. Indeed teachers are all too aware that views on what constitutes "good practice" have changed over the years, like a fashion. The evaluation of processes, in an endeavour as complex as teaching, is hazardous since the link between processes and outcomes may be tenuous indeed.

Item 4 attracted the next most positive response. Even so, only about a third of Heads agreed that Ofsted inspectors can correctly judge the achievement of a school taking into account pupils' capabilities. In its recently revised procedures Ofsted has itself dropped this claim to an ability to intuit the capabilities of pupils and no longer demands that its inspectors do so. Schools failed partly on such judgements will not, however, receive apologies.

The next least negative item was question 12. Fewer than one in three Heads (29 per cent) agreed that Ofsted inspectors can "correctly identify failing schools".

Twelve per cent of Head teachers agreed that inspectors "can assess pupils' abilities better than can the staff of the school", with 69 per cent disagreeing and 19 per cent taking a neutral position. This was, in our opinion, such an extraordinary view that it seemed interesting to ask what characteristics were shared by the agreeing Heads. They appeared to come from every sample source and from schools with a range of Free School Meals, though predominantly low on this indicator of poverty. Of those inspected, the ratings were significantly higher than those of the whole group, with none in the "cause for concern" or "failing" categories.

For questions 5 and 6, the belief in inspectors' capacities to assess the spiritual, moral, social and cultural aspects of schools, accurately or separately, sank to 7 per cent and 6 per cent agreeing. (See Vignette)

Vignette:

A primary school was declared to be failing in the year that the Head teacher, who had worked there for the last nineteen years, was retiring. One of the few negative comments to explain why the school failed was the following:

"The school promotes satisfactorily the social and moral development of the pupils, but not their spiritual and cultural development."
This seems extraordinarily unconvincing. There was no evidence presented to illustrate which observations convinced the inspectors of such finely differentiated effects.

Evidence of Discriminant Validity

If the claim is that inspection measures the effectiveness of a school, we need to ask if it actually measures this or if it is accidentally confounded by other factors. This is the issue of discriminant validity. For example, a decade ago Gray and Hannon (1986) showed that HMI almost never praised schools in inner cities. HMI seemed unable to recognise good teaching in a context which was not predominantly middle class. The problem identified by Gray and Hannon may be still with us. Being a school with a high percentage of free school meals was associated with a significantly higher likelihood of a poor rating; the simple correlation between the rating and free school meals was found to be 0.44 (p<0.001, N = 75). Is it possible that the strong relationship between a poor inspection rating and being an inner city school reflects the incapacity of the inspectors to make adjustments for the difficulties of working in these schools and for the handicaps that pupils continue to experience in the urban environment, as suggested years ago by Gray and Hannon? However, the association could be an indication of genuine, remediable problems in some inner-city schools, in which case there needs to be adequate research into what remediation is effective.

Whatever the reason for a strong association between receiving a poor rating and having a high poverty indicator, we can take the relationship into account statistically, and then ask whether there are yet other factors which relate to the ratings received from inspectors. Amount spent on obtaining pre-inspection help turned out to be such a factor.

Even after taking account of free school meals there remained a relationship between the amount spent buying in pre-inspection help and the rating received by a school. The correlation between the residuals and the amount spent on pre-inspection help was 0.23 (p = 0.07) for the 58 Heads who answered the question on expenditure on pre-inspection help. For schools which had spent over £1000 the residuals were positive for 11 schools and negative for 3. The 5 schools which had failed had all spent nothing on pre-inspection advice, which is usually provided by inspectors. If inspectors judge what they see and not what they are told, and if they are evaluating a school as it actually is rather than on the basis of a self-presentation exercise, why should the purchase of pre-inspection help apparently have such a noticeable impact? One interpretation is a frankly quid pro quo arrangement. Indeed, an explicit complaint has been received of LEA inspectors pointing out to schools that they would be undertaking the inspection and that a pre-inspection visit could be helpful to the school. Do Heads feel at all obliged to buy in advice from inspectors? If Heads are confident enough not to buy in pre-inspection help they apparently risk a lowered rating. Of course these findings are just correlations. The actual mechanisms cannot be established from correlations, but they do raise hypotheses, and this link between money expended and rating received is a situation which needs monitoring and further investigation.
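The residual analysis referred to above can be sketched as follows: regress the rating on the Free School Meals percentage, keep the residuals, and correlate them with pre-inspection expenditure. The data values are invented; only the two-step procedure reflects the analysis described.

    # Two-step residual analysis (invented data; procedure only).
    import numpy as np

    rating = np.array([3, 2, 4, 3, 5, 2, 4, 3, 1, 4], dtype=float)                   # 1 = fail ... 5 = outstanding
    fsm    = np.array([40, 55, 10, 30, 5, 60, 15, 25, 70, 12], dtype=float)          # % free school meals
    spend  = np.array([0, 500, 1500, 200, 2000, 0, 800, 300, 0, 1200], dtype=float)  # £ on pre-inspection help

    slope, intercept = np.polyfit(fsm, rating, 1)     # step 1: regress rating on FSM
    residuals = rating - (intercept + slope * fsm)    # the part of the rating FSM does not explain

    r = np.corrcoef(residuals, spend)[0, 1]           # step 2: correlate residuals with spend
    print(f"correlation of residual rating with spend: {r:.2f}")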

Validity of different kinds of information.

Despite the limitations of numerical indicators, they appear to carry more weight than inspectors' judgements in measurable areas. Question 16 invited agreement or disagreement with the following statement: 'If the judgement of the Ofsted team differed from data available to me on Value Added I would tend to believe the Ofsted judgement'. Only 8 percent agreed with this statement (only 1 percent "strongly") and 72 percent rejected it. If these Heads were representative of Heads in general, the inspection system of the future could be far more economical and have higher perceived validity if Value Added indicators replaced the guesses of inspectors.

Impact

The actual impact of inspections will have to be monitored in a variety of ways, but here we consider the perceived impact in terms of the value put on the inspection process by Head teachers. We then look at the difference in perceptions between those who have and have not been inspected, and finally consider the impact on the school of a pending inspection with regard to cost in time, money and staff illness.

Perceived outcomes as reported by Head teachers

For those who had had an inspection, there was a further question:

Q. & How much information of use to you in improving schooling did you gain from the inspection?

Since this last question in a sense went to the heart of the slogan for Ofsted, 'Improvement through inspection', it is worth looking at the distribution of responses to that item alone. Only 4 Heads reported having learnt nothing; 14 reported "not much"; 34 reported "some" (the middle of the scale); 28 reported "quite a lot"; and 5 reported "a large amount". The modal response, chosen by 40 per cent of the sample, was thus the middle of the scale, suggesting that most had learnt something between "not much" and "quite a lot": a result neither overwhelmingly positive nor overwhelmingly negative.

Three items that did not require the experience of inspection were used to create the "usefulness" scale. (The measure of internal consistency for this scale, Cronbach's alpha, was 0.73.)

The most positive item on the outcome scale related to the net impact of the "whole Ofsted process", with 45 per cent agreeing that it had been positive and only 22 per cent disagreeing. This resulted in an average of 3.25 on the five point scale, just above the mid-point. Since none of the other indicators of impact was positive this needed some explanation. It could be that the "whole Ofsted process" was seen to include the very existence of inspections and the use of the popular "Framework" (OFSTED, 1995). Seventy per cent agreed with the statement 'the framework had a useful impact on school management and organisation'. In the view of Head teachers the original Framework was a welcome document. Whether it in fact constituted "advice", and put across a philosophy of management that was not necessarily helpful, are questions which may now be academic since the Framework has been revised from a £40 tome down to a slim, 26 page glossy (OFSTED, 1996).

Responses to Q15 which asked if the school would prefer to have the money that an inspection cost, rather than an inspection, leaned heavily towards the money. Only 16 per cent would prefer the inspection. The question about value for money yielded 13 per cent agreeing that "Ofsted provides good value for money", but more than twice as many, 54 per cent, disagreed.

Summary of findings regarding reliability, validity and impact (outcomes).

We have now considered the reliability, validity and outcomes as perceived by various groups of Head teachers. We found that, with the exception of one group, all attitudes were below neutral, i.e. negative (Figure 3). Reliability was seen as a problem; validity was generally seen as not quite as much of a problem; and for one group the Outcome scale was above neutral. The unusually positive group consisted of schools nominated by the Failing-According-to-Ofsted (FAO) schools as being nearby and very similar. Could it be that one outcome of a nearby school being declared "failing" was seen by these schools as improving their own position in the market? Mere speculation, but interesting.

Among all this evidence of the rather negative view of inspection by Head teachers it must be said that there was some strong support for inspection. This seemed to come particularly from inexperienced and newly appointed Heads, a finding reported also by Ouston, Fidler, & Earley (1996) "There was a trend for acting Heads to be more positive than permanent Heads. " (p117).

COSTS

Value for Money (VFM) studies are at the heart of management decisions and strategic planning for the nation as a whole. Now that the Ofsted trial has had time to bed down, it is a suitable time for VFM studies to be conducted by the appropriate bodies. Meanwhile, an important feature on the cost side of the balance sheet is not just the £70 to £100 million spent by Ofsted but the consequences of inspection for school budgets, money taken directly from schools' services to children. Several sorts of costs were investigated: costs due to stress-related absence as well as photocopying and other costs of preparation for the pre-announced inspection.

The costs of stress. An inspection is undoubtedly a source of stress for staff. Were the stress minor it might well be dismissed as a necessary evil in protecting the well-being of children. As the stress becomes more serious, the necessity of proving that inspection does in fact look after the well-being of children becomes the more urgent. At the very least, in any Value for Money study the cost of stress in the profession must be counted in, both in the short term and in the long-term possible effects on morale and recruitment.

One way to do this, and to put a realistic figure on the impact of the stress, is to look at absences which Heads attributed to the inspection process. On the questionnaire Heads were asked

'In your view, were there any stress-related staff absences before, during and after the inspection? '

This part of the questionnaire was often written over with statements such as 'Too many' or 'Too difficult to estimate'. There was a wide range of response in terms of staff absence. Absences before inspection were substantial and were reported by half the Heads; the overall average was 15.1 staff days. Absences during inspection dropped to a level of 2.3 staff days on average, but then shot up again after inspection to 28 staff days on average. For the 50 percent of Heads reporting some absences the average was 45 staff days.

One assumption sometimes implied rather than stated is that the worst teachers feel the most stress and will be encouraged to leave the profession. This is strongly denied by many Heads and many teachers who feel that the people most stressed are often the most conscientious and excellent teachers. If staff absence were a sign of nervousness because of incompetence it would be expected that the amount of absence would be related to the rating the inspectors attached to the school. No such relationship was found. Despite a considerable search through the variables little could be said about stress other than that it was apparently related to little else, certainly not to the Ofsted rating of the school.

Costs of preparing for inspection. Mention has already been made of the average expenditure of £1192 on pre-inspection advice: the more spent on pre-inspection help, the higher the rating subsequently received. Because there were some extremely high values for other costs, the medians are a better representation than the means. Median values were 40 staff days preparing documents, 10 days of the Head's time on documents, £250 on reprographics and photocopying, and 5 staff days on extra meetings of staff, but zero on extra meetings with parents, although some schools reported high values such as &&. Two staff days were reported as the median for extra meetings with governors and zero with the press, though again some schools reported very large amounts such as 90 staff days.

The costs of (bad?) advice. Ofsted's official position is that inspectors do not give advice. This seems an entirely reasonable position in that if inspectors give help and advice then, when they return, they will be returning to evaluate the effectiveness of their own work. The two functions can, perhaps profitably, be kept entirely separate. However, the rhetoric of keeping inspection and advice separate does not match the practice.

Despite the claimed principle that inspection should be separated from advice, there appears to be a sub-text of advice implied by Ofsted's Framework. Thus it pushes schools towards target-setting and detailed planning, both short term and long term, without there being adequate evidence that this "gradgrind" ethos or five-year plans are effective. There are alternative views as to good management. W. Edwards Deming, for example, widely credited with transforming Japanese industry from the shoddy goods of the 1950s to the world class quality of the 80s and 90s, was explicitly opposed to targets and to appraisal. He advised, rather, the constant collection of hard statistical data, well interpreted, and an on-going striving for improvement combined with an insistence on joy in work (Hinkley & Seddon, 1996; Neave, 1992). Furthermore he urged the search for 'profound knowledge', i.e. soundly based research results: 'Don't work harder, work smarter'. It has been argued elsewhere that Deming's philosophy fits with the conclusions reached by many other eminent scholars such as Popper and Simon (Fitz-Gibbon, 1996). These issues may well be over-simplified and not easily resolved. What is clear is that Ofsted's unthinking adoption of one approach to management and improvement goes beyond its brief and may conceivably be damaging.

The perception that Ofsted advises is clear from parents' comments as reported by Taberer (1995). Quite reasonably, they suppose that a plan identifying the 'key issues for action' (Ofsted, 1995, p. 11) constitutes advice. It directs behaviour, and not necessarily effectively. For example, the report which failed Breeze Hill Comprehensive School in Oldham complained that the school had done nothing about the under-achievement of boys at GCSE. Since this achievement discrepancy is a country-wide phenomenon, mainly in English rather than science or mathematics, it might have been a reasonable management decision that efforts could be better expended in areas more likely to yield results. Is Ofsted usurping schools' rights under Local Management of Schools? Who is then subsequently responsible?

The psycho-dynamics and legal position of inspection. Ofsted has been given un-challengeable power. When Breeze Hill was re-rated as "failing" less than a year after being rated as "satisfactory" it wished to take Ofsted to court. The LEA was prepared to back this challenge. Unfortunately a QC stated that it could not be shown that the judgements were "perverse" or "unreasonable" and a case would not be successful. Yet it would seem both perverse and unreasonable that Ofsted should demonstrate clearly in one school the un-reliability of its procedures and not be subject to review. Is it not fundamentally perverse and unreasonable that Ofsted is not held accountable for being able to demonstrate its levels of reliability and validity?

Lord Acton warned that power corrupts and absolute power corrupts absolutely. Again this is a very difficult area to deal with, not one in which the authors feel in the least bit competent. Numerous instances are coming to our attention of inspection acting as a poison in the system, and concern has been expressed that the spectacle of subterfuge among grown-ups is less than educationally sound for children who observe the inspection process and its impact on their teachers. Inspection has certainly created much bad blood between teachers and their local education authority, and it has undermined the confidence of parents in their schools, without good evidence. Because there is considerable fear in the system we can only urge that an enquiry is needed in which those with some responsibility for Ofsted can hear statements, given in confidence, which many inspectors and others feel unable to provide publicly.

Discussion

The aspect of inspection which is the most expensive in inspectors' time, the most costly to schools in staff stress, and the least validated, is the practice of having inspectors sit in classrooms using classroom observation methods which have not been demonstrated to meet any level of quality standards and drawing un-challengeable conclusions which have yet to be subjected to proper scrutiny for their reliability, validity or sufficiency for the purpose of publicly rating an entire school. It is this aspect of inspection which should be immediately suspended pending the application of proper standards.

It is doubtful that business or industry would permit an inspection regime, centrally imposed, that was based on opinion about how the business or industry should be run rather than on sound research. This is what is being imposed upon schools in the public sector, despite the intentions of the welcome Local Management of Schools legislation.

Inspection should do what inspection can do best and should not pretend to second guess what is better measured such as rates of progress or "Value Added". There should be compliance indicators (Richards, 1988) as, for example, with the delivery of the National Curriculum, maintaining a safe environment, maintaining proper financial records and showing a duty of care to children and staff in the school. Such compliance should be assessed by unannounced visits as is practised in industry. Furthermore the differing responsibilities of Ofsted inspectors and those of such bodies as the Audit Commission, the Health and Safety Executive and the Teacher Training Agency need to be clarified.

We have received complaints from science teachers' organisations that Ofsted inspectors are not well trained in Health and Safety and that some make poor judgements and recommendations. This raises the entire issue of what competences inspectors need. If they are interpreting a body of statistical data, then they should be examined in their understanding of such data. If they are judging the adequacy of account keeping, then they should be examined in their knowledge of accounts. If they are serving as Health and Safety officers, they should be qualified to the highest standards, since nothing is of greater concern to parents than the health and safety of their children. There can be no substitute for inspection, but as it is presently operating it is an embarrassment to anyone who understands social science, its complexities and methods, and it is apparently a source of grave distress to a teaching profession on which we rely for the care of our children and grandchildren.

Summary

A simple blueprint evaluation, starting from the proposition that 'Inspections are as good as their methodological foundations' (Gray and Wilcox, 1995, p.127) and considering Ofsted's procedures, would seem to raise grave concerns, since Ofsted's methods have never been subjected to published checks on the representativeness or size of the samples observed, on the reliability of inspectors' judgements, or on the validity of the ratings produced.

The survey of 159 Head teachers reported above indicated that Ofsted has limited credibility with Head teachers: most did not regard its judgements as reliable or valid, only 13 per cent thought it provided good value for money, and only 16 per cent would prefer an inspection to the money it costs.

Because of all these failings Ofsted may have substantially damaged the quality of education provided by schools by causing them to spend time, money and energy unproductively. The pretence to unlikely levels of wisdom, so inherent in an inspection system which has avoided any routine and proper checks on the adequacy of its methods, is the greatest enemy of empirical investigation, effective problem-solving and real improvement. There should immediately be an expert panel to consider the role and methodology of the inspection process, with representatives from business, industry, medicine and statistics as well as education.

NOTE 1. YELLIS, the Year Eleven Information System, collects data from 100 percent samples of pupils and relates pupils' reports of aspirations, attitudes, and school experiences to a test of developed abilities given in Year 10 and to external examination results, in over 50 subjects, from Year 11 (15 year old pupils). Run by the CEM Centre, University of Durham, YELLIS tracks Value Added and many other indicators.

NOTE 2. OFSTIN, the Office for Standards in Inspection, has been formed by concerned educators and welcomes submissions regarding the proper role of inspectors in the 21st century. Please address these to The OFSTIN Secretary 9 Quatre Bras, Hexham, Northumberland NE46 3JY. Tel: 01434 604747

REFERENCES for Evaluating Ofsted's methodology

Bryant, C. (1995). Inspection. In IMAC Research (Ed.) Education and Training Statistics, Statistics Users' Council.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, May.

Fitz-Gibbon, C. T. (1995). Ofsted, schmofsted. In T. Brighouse & B. Moon (Ed.) School Inspection London: Pitman Publishing.

Fitz-Gibbon, C. T. (1996). Monitoring Education: Indicators, Quality and Effectiveness. London, New York: Cassell.

Fitz-Gibbon, C. T., & Clark, K. S. (1982). Time variables in classroom research: a study of eight urban secondary school mathematics classes. British Journal of Educational Psychology, 52(3), 301-316.

Fitz-Gibbon, C. T., & Vincent, L. S. (1994). Candidates' performance in science and mathematics at A-level. School Curriculum and Assessment Authority.

Gray, J., & Hannon, V. (1986). HMI interpretation of schools' examination results. Journal of Educational Policy, 1(1), 23-33.

Gray, J., & Wilcox, B. (1995). The methodologies of inspection: issues and dilemmas. In T. Brighouse & B. Moon (Ed.) School Inspection. London: Pitman Publishing.

Hinkley, T., & Seddon, J. (1996). The Deming approach to school improvement. In P. Earley, B. Fidler, & J. Ouston (Ed.) Improvement through Inspection? Complementary approaches to school development (pp. 71-93). London: David Fulton Publishers Ltd.

Hogg, G. W. (1990). Great Performance Indicators of the past. In C. T. Fitz-Gibbon (Ed.) Performance Indicators Clevedon, Philadelphia: Multilingual Matters Ltd.

Kelly, A. (1976). A study of the comparability of external examinations in different subjects. Research in Education, 16, 50-63.

Medley, D. M., & Mitzel, H. E. (1963). Measuring classroom behavior by systematic observation. In N.L.Gage (Ed.) Handbook of Research on Teaching. Chicago: Rand McNally.

Neave, H. (1992). The Deming Dimension. Knoxville, Tennessee: SPC Press.

Ofsted (1995). Framework for the inspection of schools. London: HMSO.

Ouston, J., Fidler, B., & Earley, P. (1996). Secondary schools' responses to Ofsted: improvement through inspection? In J. Ouston, P. Earley, & B. Fidler (Ed.) OFSTED inspections: the Early Experience (pp. 110-125). London: David Fulton Publishers Ltd.

Richards, C. E. (1988). A typology of educational monitoring systems. Educational Evaluation and Policy Analysis, 10(2), 106-116.

Rutter, M., Maughan, B., Mortimore, P., & Ouston, J. (1979). Fifteen Thousand Hours: secondary schools and their effects on children. London: Open Books.

Taberer, R. (1995). Parents' perceptions of Ofsted. National Foundation for Educational Research.

Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York; London: McGraw-Hill.

FIGURES for Evaluating Ofsted's methodology

FIGURE 1 RELIABILITY

Question 17. I believe the OFSTED teams have no difficulty in reconciling the judgements of each team member to provide a corporate view.

Question 13 Two OFSTED teams working without contact would come to the same conclusions about a school.

The graph, not reproduced here, showed splines indicating the trends in the data.

Responses to reliability questions among Heads of so-called-problem schools.

Variable   Mean   Std Dev   Correlation   Signif. Prob   Number
Q13        2.43   0.976     -0.30         0.5133         7
Q17        2.57   0.976


Responses to reliability questions among Heads of "satisfactory" schools.

Variable   Mean   Std Dev   Correlation   Signif. Prob   Number
Q13        2.50   0.974     0.40          0.0043         50
Q17        2.22   0.975


Figure 2: Relationship between amount spent on buying in pre-inspection help and the residual rating (i.e. the rating after Free School Meals had already been taken into account).

Linear Fit

Rsquare                   0.144597
Root Mean Square Error    0.595632
Mean of Response         -0.04285
N                        42

Analysis of Variance

Source    DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Model      1        2.398852        2.39885    6.7616   0.0130
Error     40       14.191112        0.35478
C Total   41       16.589964

Parameter Estimates

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -0.207911    0.1117      -1.86     0.0701
COSTPRE      0.0003387   0.00013      2.60     0.0130


FIGURE 3 Head teachers' views of Ofsted Inspections


Notes: The six target groups consisted of two random samples, one from schools participating in the Year 11 Information System (YELLIS) and one from the DfEE list of schools; schools which, in YELLIS, had low Value Added scores ('Low VA') or high Value Added scores; schools which were "failing" according to Ofsted (FAO); and schools which were nominated by the FAO schools as being similar and nearby ("matched to FAO").

Figure 4 The Head's perception of validity related to whether or not the school had been inspected

Source    DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Model      1         0.220          0.220      0.583    0.45
Error    106        39.988          0.377
C Total  107        40.207

Mean Estimates

Level (0 = not inspected, 1 = inspected)   Number   Mean   Std Error
0                                              51    2.53   0.086
1                                              57    2.62   0.081


Calculation: To reach a mean of "agree" (4.00) requires an improvement of (4 - 2.62). The apparent gain from one inspection was (2.62 - 2.53), and an inspection happens, say, once every four years. Thus the number of years needed will be

    years = 4 x (4 - 2.62) / (2.62 - 2.53) = 61 (approximately)

before Heads, on average, perceive inspections as valid, given the current rate of progress.
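The arithmetic can be checked directly (values from Figure 4; the four-year cycle is the assumption stated above):

    # Years needed before mean perceived validity reaches 'agree' (4.00),
    # at the apparent gain from one inspection every four years.
    not_inspected_mean = 2.53
    inspected_mean = 2.62
    gain_per_inspection = inspected_mean - not_inspected_mean   # 0.09
    years = 4 * (4.00 - inspected_mean) / gain_per_inspection
    print(round(years))   # about 61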

Figure 5 ABSENCES PRE, DURING AND POST-INSPECTION


                                      Pre-inspection    During inspection   Post-inspection
Number of respondents (out of 61)           40                 47                 42
All respondents (mean)                14.7 staff-days    2.6 staff-days      32 staff-days
% reporting some staff absences             50%                30%                57%
Those reporting some absence (mean)     30 staff-days    8.5 staff-days      57 staff-days




INTERCORRELATIONS BETWEEN STAFF ABSENCES BEFORE, DURING AND AFTER AN INSPECTION AND THE RATING RECEIVED

            PRE          DURING      POST         RATING      

PRE             1.00         0.80        0.64        -0.15    

DURING          0.80         1.00        0.67        -0.13    

POST            0.64         0.67        1.00        -0.10    

RATING         -0.15        -0.13       -0.10         1.00    


Note: Absences were more common in some schools than others but the extent of staff absence was unrelated to the rating of the school by Ofsted. The rating was coded with a positive rating as 5. None of the correlations reached the .05 level of statistical significance.