This document contains, for the greater part, the “SON-R 2½-7 Manual and Research Report”. Not included are chapter 12 (Directions per subtest), chapter 13 (The record form, norm tables and computer program) and the appendices. The reference for this text is:

Tellegen, P.J., Winkel, M., Wijnberg-Williams, B.J., & Laros, J.A. (1998). Snijders-Oomen Nonverbal Intelligence Test. SON-R 2½-7 Manual and Research Report. Lisse: Swets & Zeitlinger B.V.

This English manual is a translation of the Dutch manual, published in 1998 (SON-R 2½-7 Handleiding en Verantwoording). The German translation was also published in 1998 (SON-R 2½-7 Manual). In 2007 a German manual was published with German norms (SON-R 2½-7 Non-verbaler Intelligenztest. Testmanual mit deutscher Normierung und Validierung).

Translation by Johanna Noordam
ISBN 90 265 1534 0

Since 2003 the SON-tests have been published by Hogrefe Verlag, Göttingen, Germany.

© 1998, 2009
Publisher: Hogrefe, Rohnsweg 25, 37085 Göttingen, Germany
Authors: Peter J. Tellegen & Jacob A. Laros
http://www.hogrefe.de
E-mail: [email protected]
CONTENTS

Foreword

PART I: THE CONSTRUCTION OF THE SON-R 2½-7

1. Introduction
   1.1 Characteristics of the SON-R 2½-7
   1.2 History of the SON-tests
   1.3 Rationale for the revision of the Preschool SON
   1.4 Phases of the research
   1.5 Organization of the manual

2. Preparatory study and construction research
   2.1 The preparatory study
   2.2 The construction research

3. Description of the SON-R 2½-7
   3.1 The subtests
   3.2 Reasoning tests, spatial tests and performance tests
   3.3 Characteristics of the administration

4. Standardization of the test scores
   4.1 Design and realization of the research
   4.2 Composition of the norm group
   4.3 The standardization model
   4.4 The scaled scores

5. Psychometric characteristics
   5.1 Distribution characteristics of the scores
   5.2 Reliability and generalizability
   5.3 Relationships between the subtest scores
   5.4 Principal components analysis
   5.5 Stability of the test scores

PART II: VALIDITY RESEARCH

6. Relationships with other variables
   6.1 Duration of test administration
   6.2 Time of test administration
   6.3 Examiner influence
   6.4 Regional and local differences
   6.5 Differences between boys and girls
   6.6 SES level of the parents
   6.7 Parents’ country of birth
   6.8 Evaluation by the examiner
   6.9 Evaluation by the teacher

7. Research on special groups
   7.1 Composition of the groups
   7.2 The test scores of the groups
   7.3 Relationship with background variables
   7.4 Diagnostic data
   7.5 Evaluation by the examiner
   7.6 Evaluation by institute or school staff
   7.7 Examiner effects
   7.8 Psychometric characteristics

8. Immigrant children
   8.1 The test results of immigrant children
   8.2 Relationship with the SES level
   8.3 Differentiation according to country of birth
   8.4 Comparison with other tests
   8.5 The test performances of children participating in OPSTAP(JE)

9. Relationship with cognitive tests
   9.1 Correlation with cognitive tests in the standardization research
   9.2 Correlation with nonverbal tests in primary education
   9.3 Correlation with cognitive tests at OVB-schools
   9.4 Correlation with cognitive tests in special groups
   9.5 Correlation with the WPPSI-R in Australia
   9.6 Correlation with cognitive tests in West Virginia, USA
   9.7 Correlation with the BAS in Great Britain
   9.8 Overview of the correlations with the criterion tests
   9.9 Difference in correlations between the Performance Scale and the Reasoning Scale
   9.10 Difference in mean scores on the tests
   9.11 Comparisons in relation to external criteria

PART III: THE USE OF THE TEST

10. Implications of the research for clinical situations
    10.1 The objectives of the revision
    10.2 The validity of the test
    10.3 The target groups
    10.4 The interpretation of the scores
    10.5 Conclusions

11. General directions
    11.1 Preparation
    11.2 Directions and feedback
    11.3 Scoring the items
    11.4 The adaptive procedure
    11.5 The subtest score
    11.6 Adapting the directions

12. Directions per subtest
    12.1 Mosaics
    12.2 Categories
    12.3 Puzzles
    12.4 Analogies
    12.5 Situations
    12.6 Patterns

13. The record form, norm tables and computer program
    13.1 The use of the record form
    13.2 The use of the norm tables
    13.3 The use of the computer program
    13.4 Statistical comparisons

References

Appendix A  Norm tables
Appendix B  The record form
Appendix C  The file SONR2.DAT
Appendix D  Contents of the test kit
TABLES AND FIGURES IN THE TEXT

Introduction
  Table 1.1   Overview of the versions of the SON-tests

Pilot study and construction research
  Table 2.1   Relationship between the subtests of the Preschool SON and the SON-R 2½-7
  Table 2.2   Origin of the items

Description of the SON-R 2½-7
  Table 3.1   Tasks in the subtests of the SON-R 2½-7
  Figure 3.1  Items from the subtest Mosaics
  Figure 3.2  Items from the subtest Categories
  Figure 3.3  Items from the subtest Puzzles
  Figure 3.4  Items from the subtest Analogies
  Figure 3.5  Items from the subtest Situations
  Figure 3.6  Items from the subtest Patterns
  Table 3.2   Classification of the subtests

Standardization of the test scores
  Table 4.1   Composition of the norm group according to age, sex and phase of research
  Table 4.2   Demographic characteristics of the norm group in comparison with the Dutch population
  Table 4.3   Education and country of birth of the mother in the weighted and unweighted norm group

Psychometric characteristics
  Table 5.1   P-value of the items
  Figure 5.1  Plot of the discrimination and difficulty parameter of the items
  Table 5.2   Mean and standard deviation of the raw scores
  Table 5.3   Distribution characteristics of the standardized scores in the weighted norm group
  Table 5.4   Floor and ceiling effects at different ages
  Table 5.5   Reliability, standard error of measurement and generalizability of the test scores
  Table 5.6   Reliability and generalizability of the IQ score of the Preschool SON, the SON-R 2½-7 and the SON-R 5½-17
  Table 5.7   Correlations between the subtests
  Table 5.8   Correlations of the subtests with the rest total score and the square of the multiple correlations
  Table 5.9   Results of the Principal Components Analysis in the various age and research groups
  Table 5.10  Test-retest results with the SON-R 2½-7
  Table 5.11  Examples of test scores from repeated test administrations

Relationships with other variables
  Table 6.1   Duration of the test administration
  Table 6.2   Relationship of the IQ scores with the time of administration
  Table 6.3   Examiner effects
  Table 6.4   Regional and local differences
  Table 6.5   Relationship of the test scores with sex
  Table 6.6   Relationship of the IQ score with the occupational and educational level of the parents
  Table 6.7   Relationship of the IQ score with the SES level
  Table 6.8   Relationship between IQ and country of birth of the parents
  Table 6.9   Relationship between evaluation by the examiner and the IQ
  Table 6.10  Correlations of the total scores with the evaluation by the teacher
  Table 6.11  Correlations of the subtest scores with the evaluation by the teacher

Research on special groups
  Table 7.1   Subdivision of the research groups
  Table 7.2   Composition of the research groups
  Table 7.3   Test scores per group
  Figure 7.1  Distribution of the 80% frequency interval of the IQ scores of the various groups
  Table 7.4   Relationship of the IQ scores with background variables
  Table 7.5   Reasons for referral of children at schools for Special Education and Medical Daycare Centers for preschoolers, with mean IQ scores
  Table 7.6   Relationship between IQ and evaluation by the examiner
  Table 7.7   Correlations between test scores and evaluation by institute or school staff member
  Table 7.8   Correlations between the subtests and subtest-rest correlations

Immigrant children
  Table 8.1   Test scores of native Dutch children, immigrant children and children of mixed parentage
  Table 8.2   Relationship between group, SES level and IQ
  Table 8.3   Differentiation of mean IQ scores according to country of birth
  Table 8.4   Mean IQ scores of Surinam, Turkish and Moroccan children who had participated in the OPSTAP(JE) project

Relationship with cognitive tests
  Table 9.1   Overview of the criterion tests used and the number of children to whom each test was administered
  Table 9.2   Characteristics of the children to whom a criterion test was administered in the standardization research
  Table 9.3   Correlations with other tests in the standardization research
  Table 9.4   Correlations with nonverbal cognitive tests in the second year of kindergarten, 5 to 6 years of age
  Table 9.5   Correlations with cognitive tests completed by children at low SES schools given educational priority
  Table 9.6   Characteristics of the children in the special groups to whom a criterion test was administered
  Table 9.7   Correlations with criterion tests in the special groups
  Table 9.8   Correlations with the WPPSI-R in Australia
  Table 9.9   Age and sex distribution of the children in the American validation research
  Table 9.10  Correlations with criterion tests in the American research
  Table 9.11  Correlations with the BAS in Great Britain
  Table 9.12  Overview of the correlations with the criterion tests
  Table 9.13  Difference in scores between SON-IQ and PIQ of the WPPSI-R
  Table 9.14  Correlations of the Performance Scale and the Reasoning Scale with criterion tests, for cases in which the difference between correlations was greater than .10
  Table 9.15  Comparison between the mean test scores of the SON-R 2½-7 and the criterion tests
  Table 9.16  Comparisons between tests of the evaluation of the subject’s testability
  Table 9.17  Comparisons between tests in relation to socioeconomic and ethnic background
  Table 9.18  Comparisons between tests in relation to evaluation of intelligence and language skills

Implications of the research for clinical situations
  Table 10.1   Mean change in IQ score over a period of one month
  Figure 10.1  The components of the variance of the SON-R 2½-7 IQ score
  Table 10.2   Classification of IQ scores and intelligence levels
  Table 10.3   Composition of the variance when several tests are administered
  Table 10.4   Correction of mean IQ score based on administration of two or three tests
  Table 10.5   Obsolescence of the norms of the SON-IQ

Record form, norm tables and computer program
  Table 13.1   Examples of the calculation of the subject’s age
  Figure 13.1  Diagram of the working of the computer program
  Table 13.2   Comparison between the possibilities using the computer program and using the norm tables
  Table 13.3   Examples of probability and reliability intervals for various scores
FOREWORD
Nan Snijders-Oomen (1916-1992)
Jan Snijders (1910-1997)
The publication of the SON-R 2½-7 completes the third revision of the Snijders-Oomen Nonverbal Intelligence Tests. Over a period of fifty years Nan Snijders-Oomen and Jan Snijders were responsible for the publication of the SON tests. We feel honored to be continuing their work. They were interested in this revision and supported us with advice until their deaths.

The present authors played different roles in the production of this test and the manual. Peter Tellegen, as project manager, was responsible for the revision of the test and supervised the research. Marjolijn Winkel made a large contribution to all phases of the project in the context of her PhD research. Her thesis on the revision of the test will be published at the end of 1998. Jaap Laros, at present working at the University of Brasilia, participated in the construction of the subtests, in particular Mosaics and Analogies. Barbara Wijnberg-Williams made a large contribution, based on her experience as a practicing psychologist at the University Hospital of Groningen, to the manner in which the test can be administered nonverbally to children with communicative handicaps.

The research was carried out at the department for Personality and Educational Psychology of the University of Groningen. Wim Hofstee, head of the department, supervised the project. Jannie van den Akker and Christine Boersma made an important contribution to the organization of the research. The research was made financially possible by a subsidy from SVO, the Institute for Educational Research (project 0408), by a subsidy from the Foundation for Behavioral Sciences, a section
of the Netherlands Organization for Scientific Research (NWO-project 575-67-033), and by contributions from the SON research fund.

Wolters-Noordhoff, who previously published the SON-tests, made an important contribution to the development of the testing materials. The drawings for the subtests Categories, Puzzles and Situations were made by Anjo Mutsaars. The figures for the subtest Patterns were executed by Govert Sips of the graphical design agency Sips. Wouter Veeman from Studio van Stralen executed the subtests Mosaics and Analogies.

The construction of a test requires a large number of subjects, for the construction research as well as for the standardization and the validation. In the last few years, more than three thousand children were tested with the SON-R 2½-7 in the framework of the research. We are greatly indebted to them, as well as to their parents and the staff members of the schools and institutes where the research was carried out.

In the Netherlands, as well as in Australia, Great Britain and the United States of America, many students, researchers, practicing psychologists and orthopedagogic specialists contributed to the research. Thanks to their enthusiasm and involvement, the research could be carried out on such a large and international scale. Without claiming to be comprehensive, we would like to mention the following people by name: Margreet Altena, Rachida El Baroudi, Cornalieke van Beek, Wynie van den Berg, M. van den Besselaar, Marleen Betten, Marjan Bleckman, Nico Bollen, Rene Bos, Ellen Bouwer, Monique Braat, C. Braspenning, Marcel Broesterhuizen, Karen Brok, Ankie Bronsveld, Aletha Brouwer, Anne Brouwer, Sonja Brouwer, Lucia Burnett, Mary Chaney, Janet Cooper, Pernette le Coultre-Martin, Richard Cress, J. van Daal, Shirley Dennehy, M. van Deventer, Dorrit Dickhout-Kuiper, Julie Dockrell, Nynke Driesens, Petra van Driesum, Marcia van Eldik, Marielle Elsjan, Yvonne Eshuis, Arnoud van Gaal, Judith Gould, Marian van Grinsven, Nicola Grove, Renate Grovenstein, Marije Harsta, R.G. den Hartog, Leida van der Heide, Roel van der Helm, Marlou Heppenstrijdt, Valerie Hero, Sini Holm, Marjan Hoohenkerk, E.P.A. Hopster, Jacqueline ten Horn, Jeannet Houwing, Hans Höster, Jo Jenkinson, Jacky de Jong, Myra de Jong, Anne Marie de Jonge, José Kamminga, Jennifer Kampsnider, Claudine Kempa, Debby Kleymeer, Jeanet Koekkoek, Marianne van de Kooi, Annette Koopman, Monique Koster, A.M. Kraal, Marijke Kuiper, Koosje Kuperus, Marijke Künzli-van der Kolk, Judith Landman, Nan Le Large, Del Lawhon, J. van Lith-Petry, Jan Litjens, Amy Louden, Henk Lutje Spelberg, Mannie McClelland, Sanne Meeder, Anke van der Meijde, Jacqueline Meijer, Sjoeke van der Meulen, Bieuwe van der Meulen, Jitty Miedema, Margriet Modderman, Cristal Moore, Marsha Morgan, Renate Mulder, Marian Nienhuis-Katz, F. Nietzen, Theo van Noort, Stephen O’Keefe, Jamila Ouladali, Mary Garcia de Paredes, Inge Paro, Immelie Peeters, Jo Pelzer, Simone Peper, Trudy Peters-ten Have, Dorothy Peterson, Mirea Raaijmakers, Lieke Rasker, Inge Rekveld, Lucienne Remmers, E.J. van Rijn van Alkemade, Susan Roberts, Christa de Rover, Peter van de Sande, A.J. van Santen, Liesbeth Schlichting, Marijn Schoemaker, Ietske Siemann, Margreet Sjouw, Emma Smid, L. Smits, Tom Snijders, Marieke Snippe, P. Steeksma, Han Starren, Lilian van Straten, Penny Swan, Dorine Swartberg, Marjolein Thilleman, Lous Thobokholt-van Esch, Jane Turner, Dick Ufkes, Baukje Veenstra, Nettie van der Veen, Marja Veerman, Carla Vegter, Pytsje Veltman, Harriet Vermeer, Mieke van Vleuten, Jeroen Wensink, Betty Wesdorp-Uytenbogaart, Jantien Wiersma, Aranka Wijnands, G.J.M. van Woerden, Emine Yildiz and Anneke Zijp.
With the publication of this “Manual and Research Report” of the SON-R 2½-7, an important phase of the revision of the test comes to an end. This does not mean that the test is ‘finished’. The value of a test is determined, for a large part, by diagnostic experiences and by ongoing research. We are, therefore, interested in the experiences of users, and we would appreciate being informed of their research results when these become available as internal or external publications. We intend to inform users and other interested parties about developments and further research with the SON tests via the Internet. The address of the homepage will be: www.ppsw.rug.nl/hi/tests/sonr.

In recent years the need to carry out diagnostic research on children at a young age has greatly increased. Furthermore, the realization has grown that the more traditional intelligence tests are less suitable for important groups of children because they do not take sufficient account of the limitations of these children, or of their cultural background. In these situations the SON tests are frequently used. We hope that this new version of the test will also contribute to reliable and valid diagnostic research with young children.
Groningen, January 1998
Dr. Peter Tellegen
Heymans Institute
University of Groningen
Grote Kruisstraat 2/1
9712 TS Groningen
The Netherlands
tel. +31 50 363 6353
fax +31 50 363 6304
e-mail: [email protected]
http://www.testresearch.nl
Review of the SON-R 2½-7

The test has been reviewed by COTAN, the test commission of the Netherlands Institute of Psychologists. The categories used are insufficient, sufficient and good. The ratings are as follows:

Basics of the construction of the test: good
Execution of the materials: good
Execution of the manual: good
Norms: good
Reliability: good
Construct validity: good
Criterion validity: good
1
INTRODUCTION
The new version of the Snijders-Oomen Nonverbal Intelligence Test for children from two-and-a-half to seven years, the SON-R 2½-7, is an instrument that can be individually administered to young children for diagnostic purposes. The test makes a broad assessment of mental functioning possible without being dependent upon language skills.
1.1 CHARACTERISTICS OF THE SON-R 2½-7

The SON-R 2½-7, like the previous version of the test, the SON 2½-7 (Snijders & Snijders-Oomen, 1976), provides a standardized assessment of intelligence. The child’s scores on six different subtests are combined to form an intelligence score that represents the child’s ability relative to his or her age group. Separate norm tables allow total scores to be calculated for the performance tasks and for the tasks mainly requiring reasoning ability.

A distinctive feature of the SON-R 2½-7 is that feedback is given during administration of the test. After the child has given an answer, the examiner tells the child whether it is correct or incorrect. If the answer is incorrect, the examiner demonstrates the correct answer. When possible, the correction is made together with the child. The detailed directions provided in the manual also make the test suitable for the assessment of very young children. In general, the examiner demonstrates the first items of each subtest in part or in full. Examples are included in the test directions and items.

The items on the subtests of the SON-R 2½-7 are arranged in order of increasing difficulty. This makes it possible to use a procedure that determines a starting point appropriate to the age and ability of each individual child. By using the starting point and following the rules for discontinuing the test, the administration time is limited to fifty to sixty minutes.

The test can be administered nonverbally or with verbal directions. The spoken text does not give extra information. The manner of administration can thus be adapted to the communication ability of each individual child, allowing the test to proceed as naturally as possible. Because the test can be administered without the use of written or spoken language, it is especially suitable for use with children who are handicapped in the areas of communication and language.
For the same reason it is also suitable for immigrant children who have little or no command of the language of the examiner. The testing materials do not need to be translated, making the test suitable for international and cross-cultural research. The SON-tests are used in various countries. The names of the various subtests are shown on the test booklets in the following languages: English, German, Dutch, French, and Spanish. The manual has been published in English and German as well as in Dutch.

A similarity between the SON-R 2½-7 and other intelligence tests for (young) children, such as the BAS (Elliott, Murray & Pearson, 1979-82), the K-ABC (Kaufman & Kaufman, 1983), the RAKIT (Bleichrodt, Drenth, Zaal & Resing, 1984) and the WPPSI-R (Wechsler, 1989), is that intelligence is assessed on the basis of performance on a number of quite diverse tasks. However, verbal test items are not included in the SON-R 2½-7. Such items often depend to a great extent on knowledge and experience. The SON-R 2½-7 can therefore be expected to be focused more on the measurement of ‘fluid intelligence’ and less on the measurement of ‘crystallized intelligence’ (Cattell, 1971) than are the other tests.
The subtests of the SON-R 2½-7 differ from the nonverbal subtests in other intelligence tests in two important ways. First, the nonverbal part of other tests is generally limited to typical performance tests. The SON-R 2½-7, however, includes reasoning tasks that take a verbal form in the other tests. Second, while the testing material of the performance part of the other tests is admittedly nonverbal, the directions are given verbally (Tellegen, 1993).

An important difference with regard to other nonverbal intelligence tests such as the CPM (Raven, 1962) and the TONI-2 (Brown, Sherbenou & Johnsen, 1990) is that the latter tests consist of only one item-set and are therefore greatly dependent on the specific ability that is measured by that test. Nonverbal intelligence tests such as the CTONI (Hammill, Pearson & Wiederholt, 1996) and the UNIT (Bracken & McCallum, 1998) consist of various subtests, like the SON-R 2½-7. A fundamental difference, however, is that the directions for these tests are given exclusively with gestures, whereas the directions of the SON-R 2½-7 are intended to create as natural a test situation as possible.

An important way in which the SON-R 2½-7 differs from all the above-mentioned tests is that the child receives assistance and feedback if he or she cannot do the task. In this respect the SON-R 2½-7 resembles tests for learning potential that determine to what extent the child profits from the assistance offered (Tellegen & Laros, 1993a). The LEM (Hessels, 1993) is an example of this kind of test.

In sum, the SON-R 2½-7 differs from other tests for young children in its combination of a friendly approach to children (in the manner of administration and the attractiveness of the materials), a large variation in abilities measured, and the possibility of testing intelligence regardless of the level of language skill.
1.2 HISTORY OF THE SON-TESTS

The publication of the SON-R 2½-7 completes the third revision of the test battery that Nan Snijders-Oomen started more than fifty years ago. In table 1.1 the earlier versions are shown schematically.

The first version of the SON-test was intended for the assessment of cognitive functioning in deaf children from four to fourteen years of age (Snijders-Oomen, 1943). Drawing on existing and newly developed tasks, Snijders-Oomen developed a test battery which included an assortment of nonverbal tasks related to spatial ability and abstract and concrete reasoning. The test was intended to provide a clear indication of the child’s learning ability and chances of succeeding at school. One requirement for the test battery was that upbringing and education should influence the test results as little as possible. Further, a variety of intellectual functions had to be examined with the subtests, and the tasks had to interest the child to prevent him or her becoming bored or disinclined to continue. No specific concept of intelligence was assumed as a basis for the test battery. However, ‘form’, ‘concrete coherence’, ‘abstraction’ and ‘short-term memory’ were seen as acceptable representations of intellectual functioning typical of subjects suffering from early deafness (Snijders-Oomen, 1943). The aim of the test battery was to break through the one-sidedness of the nonverbal performance tests in use at the time, and to make functions like abstraction, symbolism, understanding of behavioral situations, and memory more accessible for nonverbal testing.

The first revision of the test was published in 1958, the SON-’58 (Snijders & Snijders-Oomen, 1958). In this revision the test battery was expanded and standardized for hearing as well as deaf children from four to sixteen years of age. Two separate test batteries were developed during the second revision.
The most important reason for this was that, in all the subtests of the original SON, a different type of test item had seemed more appropriate for children above six years of age. The bipartite structure that already existed de facto was implemented systematically in this second revision: the SSON (Starren, 1975) was designed for children from seven to seventeen years of age; for children from three to seven years of age the SON 2½-7, commonly known as Preschool SON, or P-SON, was developed (Snijders & Snijders-Oomen, 1976).
Table 1.1 Overview of the Versions of the SON-Tests

SON (1943)
  Snijders-Oomen
  Deaf Children, 4-14 years

SON-’58 (1958)
  Snijders & Snijders-Oomen
  Deaf and Hearing Children, 4-16 years

SON 2½-7 (Preschool SON) (1975)
  Snijders & Snijders-Oomen
  Hearing and Deaf Children, 3-7 years

SSON (1975)
  Starren
  Hearing and Deaf Children, 7-17 years

SON-R 2½-7 (1998)
  Tellegen, Winkel, Wijnberg-Williams & Laros
  General Norms, 2;6-8;0 years

SON-R 5½-17 (1988)
  Snijders, Tellegen & Laros
  General Norms, 5;6-17;0 years

– under each heading are listed: the year of publication of the Dutch manual, the authors of the manual, and the group and age range for which the test was standardized
The form and contents of the SSON strongly resembled the SON-’58, except that the SSON consisted entirely of multiple choice tests. After the publication of the SSON in 1975, the SON-’58 remained in production because it was still in demand. In comparison to the SSON, the SON-’58 contained more stimulating tasks and provided more opportunity for observation of behavior, because it consisted of tests in which children were asked to manipulate a large variety of test materials. The subtests in the Preschool SON maintained this kind of performance test to provide opportunities for the observation of behavior.

The third revision of the test for older children, the SON-R 5½-17, was published in 1988 (Snijders, Tellegen & Laros, 1989; Laros & Tellegen, 1991; Tellegen & Laros, 1993b). This test replaces both the SON-’58 and the SSON, and is meant for use with hearing and deaf children from five-and-a-half to seventeen years of age. In constructing the SON-R 5½-17 an effort was made to combine the advantages of the SSON and the SON-’58. On the one hand, a range of diverse testing materials was included. On the other hand, a high degree of standardization in the administration and scoring procedures as well as a high degree of reliability of the test was achieved.

The SON-R 5½-17 is composed of abstract and concrete reasoning tests, spatial ability tests and a perceptual test. A few of these tests are newly developed. A memory test was excluded because memory can be examined better by a specific and comprehensive test battery than by a single subtest. In the SON-R 5½-17, the standardization for the deaf is restricted to conversion of the IQ score to a percentile score for the deaf population. The test uses an adaptive procedure in which the items are arranged in parallel series. This way, fewer items that are either too easy or too difficult are administered. Feedback is given in all subtests; this consists of indicating
whether a solution is correct or incorrect. The standardized scores are calculated and printed by a computer program.

The SON-R 5½-17 has been reviewed by COTAN, the commission of the Netherlands Institute for Psychologists responsible for the evaluation of tests. All aspects of the test (Basics of the construction of the test, Execution of the manual and test materials, Norms, Reliability and Validity) were judged to be ‘good’ (Evers, Van Vliet-Mulder & Ter Laak, 1992). This means the SON-R 5½-17 is considered to be among the most highly accredited tests in the Netherlands (Sijtsma, 1993).

After completing the SON-R 5½-17, a revision of the Preschool SON was started, resulting in the publication of the SON-R 2½-7. The test was published in 1996, together with a manual consisting of the directions and the norm tables (Tellegen, Winkel & Wijnberg-Williams, 1997). In the present ‘Manual and Research Report’, the results of research done with the test are also presented: the method of revision, the standardization and the psychometric characteristics, as well as the research concerning the validity of the test. Norm tables allowing the calculation of separate standardized total scores for the performance tests and the reasoning tests have been added. Also, the reference age for the total score can be determined. Norms for experimental usage have been added for the ages of 2;0 to 2;6 years. All standardized scores can easily be calculated and printed using the computer program.
1.3 RATIONALE FOR THE REVISION OF THE PRESCHOOL SON

The most important reasons for revising the Preschool SON were the need to update the norms, to modernize the test materials, to improve the reliability and generalizability of the test, and to provide a good match with the early items of the SON-R 5½-17.
Updating the norms
The Preschool SON was published in 1975. After a period of more than 20 years, revision of an intelligence test is advisable. Test norms tend to grow obsolete in the course of time. Research shows (Lynn & Hampson, 1986; Flynn, 1987) that performance on intelligence tests increases by two or three IQ points over a period of 10 years. Experience in the Netherlands with the revision of the SON-R 5½-17 and the WISC-R is consistent with this (Harinck & Schoorl, 1987). Comparisons in the United States of scores on the WPPSI and WPPSI-R, and scores on the WISC-R and WISC-III showed an average increase in the total IQ scores of more than three points every ten years. The increase in the performance IQ was more than four points every ten years (Wechsler, 1989, 1991).

Changes in the socio-economic environment may explain the increase in the level of performance on intelligence tests (Lynn & Hampson, 1986). Examples of these changes are watching television, increase in leisure time, smaller families, higher general level of education, and changes in upbringing and education. The composition of the general population has also changed; in the Netherlands the population is ageing and the number of immigrants is increasing. The norms of the Preschool SON from 1975 can therefore be expected to provide scores that are too high, and that no longer represent the child’s performance in comparison to his or her present age group.
The testing materials
The rather old-fashioned testing materials were the second reason for revising the test: some of the drawings used were very dated, and the increasing number of immigrant children in the Netherlands over the last twenty years made it desirable to reflect the multi-cultural background of potential subjects in the materials (see Hofstee, 1990). The structure of the materials and the storage method of the test were also in need of improvement.
Improving the reliability and generalizability
A third motive for revision was to improve the reliability and generalizability of the Preschool SON, especially for the lower and upper age ranges. Analysis of the data presented in the manual
of the Preschool SON showed that the subtests differentiated too little at these ages. The range of possible raw scores had a mean of 12 points. In the youngest age group, 20% of the children received the lowest score on the subtests and in the oldest age group, 43% received the highest score (Hofstee & Tellegen, 1991). In other words, the Preschool SON was appropriate for children of four or five years old, but it was often too difficult for younger children and too easy for older children. Further, there was no standardization at the subtest level, only at the level of the total score; this meant that it was not possible to calculate the IQ properly if a subtest had not been administered. Finally, the norms were presented per age group of half a year. This could lead to a deviation of six IQ points if the age did not correspond to the middle of the interval.
Correspondence with the SON-R 5½-17
To be able to compare the results of the SON-R 2½-7 with those of the SON-R 5½-17, the new test for young children should be highly similar to the test for older children. An overlap in the age ranges of the tests was also considered desirable. This way, the choice of a test can be based on the level of the child, or on other specific characteristics that make one test more suitable than the other. Various new characteristics of the SON-R 5½-17, such as the adaptive test procedure, the standardization model and the use of a computer program, were implemented as far as possible in the construction of the SON-R 2½-7.
1.4 PHASES OF THE RESEARCH

On the basis of the above-mentioned arguments it was decided to revise the Preschool SON. The revision was not restricted to the construction of new norms; the items, subtests and directions were also subjected to a thorough revision. The revision proceeded in several phases. This section presents a short review of the research phases.
Preparatory study
The preparatory study, in which the Preschool SON was evaluated, started in 1990. Its aim was to decide how the testing materials of the Preschool SON could best be adapted and expanded. To this end, users of the Preschool SON were interviewed, the literature was reviewed, other intelligence tests were analyzed and a secondary analysis of the data of the standardization research of the Preschool SON was performed.
Construction research phase
The construction research for the SON-R 2½-7 took place in 1991/’92. During this period, three experimental versions of the test were administered to more than 1850 children between two and seven years of age. The final version of the SON-R 2½-7 was compiled on the basis of the data from this research, the experiences and observations of examiners, and the comments and suggestions of psychologists and educators active in the field.
Standardization research phase
The standardization research, in which more than 1100 children in the age range two to seven years participated, took place during the school year 1993/’94. The results of this research formed the basis for the standardization of the SON-R 2½-7 and the evaluation of its psychometric characteristics. During the standardization research, background data relevant for the interpretation of the test scores were collected. For the validation of the test, other language and intelligence tests were administered to a large number of the children who participated in the standardization research. Administration of these tests was also made possible by collaboration with the project group that was responsible for the standardization of the Reynell Test for Language Skills (Van Eldik, Schlichting, Lutje Spelberg, Sj. van der Meulen & B.D. van der Meulen, 1995) and the Schlichting Test for Language Production (Schlichting, Van Eldik, Lutje Spelberg, Sj. van der Meulen & B.F. van der Meulen, 1995).
Validation research phase
Separate validation research was done for the following groups: children in special educational programs, children at medical preschool daycare centers, children with a language, speech and/or hearing disorder, deaf children, autistic children and immigrant children. Validation research was also carried out in Australia, the United States of America and the United Kingdom. The results of these children on the SON-R 2½-7 have been compared with their performance on many other cognitive tests.
1.5 ORGANIZATION OF THE MANUAL

This manual is made up of three parts. In the first part the construction phase of the test is discussed. Chapter 2 deals with the preparatory study and the construction research during which new testing materials and administration procedures were developed. In chapter 3 a description is given of the subtests and the main characteristics of the administration of the test. The standardization research and the standardization model used are described in chapter 4. Information about psychometric characteristics such as reliability, factor structure and stability can be found in chapter 5.

In the second part research concerning the validity of the test is described. Chapter 6 is based on the results in the norm group and discusses the relations between test performance and other variables, such as socio-economic level, sex and evaluations by the examiner and teachers. In chapter 7 the test results in a number of special groups of children, with whom the SON-tests are often used, are discussed. The special groups include children with a developmental delay, autistic children, language, speech and/or hearing disabled children, and deaf children. Chapter 8 deals with the performance of immigrant children. In chapter 9 the correlations between the SON-R 2½-7 and several other tests for intelligence, language skills, memory and perception are discussed. The research on validity involved both children in regular education and handicapped children, and was partly carried out in other countries.

The third part of this book concerns the practical application of the test. Chapter 10 deals with the implications of the research results in practice, and with problems that can arise with the interpretation of the results. The general directions for the administration and scoring of the test are described in chapter 11; the directions for the separate subtests can be found in chapter 12.
Chapter 13 gives guidelines for using the record form, the norm tables and the computer program. In the appendices the norm tables for determining the reference age and the standardized subtest and total scores can be found, as well as an example of the record form and a description of the contents of the test kit.

In general, ages in the text and tables are presented in years and months: 4;6 years equals four years and six months. In a few tables the mean ages are presented with a decimal; 4.5 years is then the same as 4;6 years. In the norm tables the age of 4;6 years indicates an interval from ‘four years, six months, zero days’ to ‘four years, six months, thirty days’ inclusive.

To improve legibility, statistical results have been rounded off. This can lead to seemingly incorrect results. For instance, a distribution of 38.5% and 61.5% becomes, when rounded off, 39% and 62%, which does not add up to 100%. Similar small differences may occur in the presentation of differences between means or between correlations. Pearson product-moment correlations were used in the analyses. Unless stated otherwise, the correlations were tested one-tailed.
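The rounding effect described above can be reproduced in a few lines. A minimal sketch in Python (the helper name is ours; conventional half-up rounding is assumed, which is what produces the 39% + 62% example):

```python
import math

def round_half_up(x):
    # Conventional rounding: halves always round up. Note that Python's
    # built-in round() instead rounds halves to the nearest even number,
    # so round(38.5) would give 38.
    return math.floor(x + 0.5)

parts = [38.5, 61.5]                        # the distribution from the text
rounded = [round_half_up(p) for p in parts]
print(rounded, sum(rounded))                # [39, 62] 101 -- not 100
```

The sum exceeds 100 because both halves were rounded upward, exactly the kind of seemingly incorrect result the manual warns about.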
2 PREPARATORY STUDY AND CONSTRUCTION RESEARCH
In this chapter, the test construction phase is described. In this phase, the research necessary to construct a provisional version of the test was carried out. Successive improvements resulted in the final test battery.
2.1 THE PREPARATORY STUDY

The preparatory study was carried out to discover how best to adapt, and possibly to expand, the materials of the Preschool SON. To this end, ten users of the Preschool SON were asked, via questionnaires, about their experience with the test. Secondary analyses were also carried out on the original material from the standardization research of the Preschool SON. A review of the literature and an analysis of other intelligence tests were undertaken as a preparation for the revision (Tellegen, Wijnberg, Laros & Winkel, 1992).
Composition of the Preschool SON
The Preschool SON was composed of fifty items distributed over five subtests: Sorting, Mosaics, Combination, Memory and Copying. In the subtest Sorting, geometrical forms and pictures were sorted according to the category to which they belong. The subtest Mosaics was an action test in which various mosaic patterns had to be copied using red and yellow squares. Combination consisted of matching halves of pictures and doing puzzles. In the subtest Memory, also called the Cat House, the aim was to find either one or two cats that were hidden several times in the house. Copying consisted of copying figures that were drawn by the examiner or shown in a test booklet.
Evaluation by users
An inventory was made of the comments received from ten users of the Preschool SON. These were psychologists employed by school advisory services, audiological centers, institutes for the deaf, medical preschool daycare centers, and in the care for the mentally deficient. On the whole, the Preschool SON was given a positive assessment as a test to which children respond well and that affords plenty of opportunity to observe the child’s behavior.

The users did, however, have the impression that the IQ score of the Preschool SON overrated the level of the children. Clear information about administering and scoring the various subtests was lacking in the manual. The users followed the directions accurately but not literally. Furthermore, they thought the subtests contained too few examples. They were inclined to provide extra help, especially to young and to mentally deficient children. The discontinuation criterion used in the Preschool SON was three consecutive mistakes per subtest. This discontinuation rule was considered too strict, particularly for the youngest children, and, in practice, this rule was not always applied.

The subtest Memory was administered in different ways. Some users administered it as a game, playing a kind of hide and seek, whereas others tried to avoid doing this. The users had the impression that this subtest was given too much weight in the total score of the Preschool SON. Also, some doubt existed about the relationship between this subtest and the other ones.
Comparative research on the Preschool SON, the Stanford-Binet and parts of the WPPSI was conducted by Harris in the United States of America. In general, her assessment of the test was positive. Her criticism focused on some of the materials and the global norm tables (Harris, 1982).
Secondary analyses of the standardization data
The original data from a sample of hearing children (N=503) involved in the standardization research of the Preschool SON were used for the secondary analyses. A study was made of the distribution of the test scores according to age, the correlation between the test scores and the reliability. The results were as follows:
– The standard deviation of the raw subtest scores was usually highest in children from four to five years of age. For Mosaics and Copying, the range of scores for young children from 2;6 to 4 years was very restricted. For most subtests the range decreased greatly in the oldest groups, from 5;6 to 7 years.
– In the conversion of the scores into IQ scores, the distributions were not sufficiently normalized, so that they were negatively skewed for children from five years onwards. This could result in extremely low IQ scores.
– The reliability for combinations of age groups was recalculated, after which a correction for age was carried out. The mean reliability of the subtests was .57 for children from 2;6 to four years of age, .66 for children from four to five years, and .61 for children from 5;6 to seven years. The reliability of the total score was .78 for children from 2;6 to four years, .86 for children from four to five years, and .82 for children from 5;6 to seven years. Generally, the reliability was low, especially for the youngest and oldest age groups, where strong floor and ceiling effects were present. The reliability of the subtests and the total scores was much lower than the values mentioned in the manual of the Preschool SON. The cause of this discrepancy was that, in the manual, the reliability was calculated for combined age groups with no correction for age.
– The generalizability of the total score is important for the interpretation of the IQ scores. In this case, the subtests are seen as random samples from the domain of possible, relevant subtests.
The generalizability coefficient of the Preschool SON was .61 for the age group from 2;6 to four years, .75 for the age group from four to five years and .65 for the age group from 5;6 to seven years.
– The reliability of the subtest Memory was low, and the score on this subtest showed a low correlation with age and with the scores on the remaining subtests.
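As an illustration of the kind of computation involved (not the actual analysis reported above), a close relative of the generalizability coefficient, coefficient alpha across subtest scores, can be sketched as follows; the function name and data here are our own, and the data are synthetic:

```python
import numpy as np

def coefficient_alpha(scores):
    """Internal-consistency estimate across subtests.

    Rows are children, columns are subtest scores. Computed as
    k/(k-1) * (1 - sum of subtest variances / variance of the total score).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    subtest_var = scores.var(axis=0, ddof=1).sum()   # per-subtest sample variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of the summed score
    return k / (k - 1) * (1 - subtest_var / total_var)

# Four children, three perfectly parallel subtests: alpha is exactly 1.0.
demo = [[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]]
print(coefficient_alpha(demo))
```

When the subtests are instead treated as a random sample from a domain of possible subtests, a coefficient of this type estimates how well the total score generalizes to that domain, which is the interpretation used in the text.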
Review of the literature
In the revision of the Preschool SON we attempted to produce a version that was compatible with the early items of the SON-R 5½-17. As the subtest Analogies in the SON-R 5½-17 is one of its strongest components, the possibility of developing a similar analogy test for young children was examined. Based on recent research results (Alexander et al., 1989; Goswami, 1991) it seemed possible to construct an analogy test for children from about 4 years of age onwards. Since an analogy test would most likely be too difficult for the youngest children, starting this test with sorting seemed advisable; the level of abstraction required for sorting is lower than the level of abstraction required for understanding analogies, and, in a certain sense, precedes it.
Implications for the revision
The results of the preparatory study confirmed the need for a new standardization and a thorough revision of the Preschool SON. An important goal in the revision of the Preschool SON was the improvement of the psychometric characteristics of the test. The reliability and the generalizability of the test scores were lower than was desirable, especially in the youngest and oldest of the age groups for which the test was designed. However, an increase in reliability could not be gained simply by expanding the number of items and subtests, because an increase in the duration of the test could lead to fatigue, loss of motivation and decrease in concentration. Any expansion of the test had therefore to be combined with an effective adaptive procedure.
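A standard way to reason about this trade-off between test length and reliability, offered here as an illustration rather than a formula the authors cite, is the Spearman-Brown prediction:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by length_factor,
    assuming the added items are parallel to the existing ones."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Doubling a subtest with reliability .61 (roughly the mean found for the
# youngest and oldest groups) would be expected to yield about .76: adding
# items helps, but administration time caps how far the gain can be pushed.
print(round(spearman_brown(0.61, 2), 2))
```

The diminishing returns visible in the formula are one reason the revision combined expansion with an adaptive procedure rather than simply making the test longer.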
For the SON-R 5½-17, with an administration time of about one-and-a-half hours, the mean reliability of the total score is .93 and the generalizability is .85. If the administration of the SON-R 2½-7 was to be limited to one hour, a reliability of .90 and a generalizability of .80 seemed to be realistic goals. The improvement of these characteristics could be achieved by adding very easy and very difficult items to each subtest, and by increasing the number of subtests.

An important object during the revision of the Preschool SON was to obtain a good match with the early items of the SON-R 5½-17. As the age ranges of the two tests overlapped, the idea was to take the easy items of the SON-R 5½-17 as a starting point for the new, most difficult items of the SON-R 2½-7.

These considerations led to a plan for the revision of the Preschool SON in which the subtest Memory was dropped. The subtest Memory (the Cat House) had a low level of reliability and, what is more, a low correlation with age and the remaining subtests. The interviews with users of the Preschool SON showed that children enjoyed doing the Cat House subtest, but that the directions for administration were often not followed correctly. Another consideration was that assessment of memory can be carried out more effectively with a specific and comprehensive test battery. The results from a single subtest are insufficient to draw valid conclusions about memory. On the basis of similar considerations, no memory subtest had been included in the SON-R 5½-17.

The four remaining subtests of the Preschool SON were expanded to six subtests by dividing two existing subtests:
– The subtest Sorting was divided into two subtests: the section Sorting Disks was expanded with simple analogy items consisting of geometrical forms similar to the SON-R 5½-17; the section Sorting Pictures was expanded with easy items from the subtest Categories of the SON-R 5½-17.
– The section of the subtest Combination, in which two halves of a picture had to be combined, was expanded with items from the subtest Situations from the SON-R 5½-17; the section Puzzles was expanded and implemented as a separate subtest.
– The subtest Mosaics was expanded with simple items and with items from the SON-R 5½-17.
– The subtest Copying was adapted to increase its similarity to the subtest Patterns of the SON-R 5½-17.

The relationship between the subtests of the Preschool SON and the SON-R 2½-7 is presented schematically in table 2.1.

Table 2.1 Relationship Between the Subtests of the Preschool SON and the SON-R 2½-7

Preschool SON                               SON-R 2½-7
Subtest      Task                           Subtest     Task
Sorting      Sorting disks                  Analogies   Sorting disks; Analogies SON-R 5½-17
             Sorting figures                Categories  Sorting figures; Categories SON-R 5½-17
Mosaics      Mosaics with/without a frame   Mosaics     Mosaics in a frame; Mosaics SON-R 5½-17
Combination  Two halves of a picture        Situations  Two halves of a picture; Situations SON-R 5½-17
             Puzzles                        Puzzles     Puzzles in a frame; ‘separate puzzles’
Copying      Copying drawn figures          Patterns    Copying patterns
Memory       Finding cats                   –           –
2.2 THE CONSTRUCTION RESEARCH

In 1991/’92, extensive research was done with three experimental versions of the test. These were administered to more than 1850 children between two and eight years of age. The research was carried out in preschool play groups, day care centers and primary schools across the Netherlands. The versions were also administered on a small scale to deaf children and children with learning problems. The examiners participating in the construction research were mainly trained psychologists with experience in testing.

Psychologists and educators who normally make diagnostic assessments of young children were contacted in an early phase to obtain information about the usability of the construction versions for children with specific problems. More than twenty people in the field, employed by school advisory services, audiological centers and outpatient departments, administered sections of the three versions to a number of children. They commented on and gave suggestions for the construction of the material, the directions and the administration procedure.
Points of departure for the construction
The most important objectives in the construction and administration of the experimental versions were:
– expanding the number of items and subtests to improve the reliability of the test and to make the test more suitable for the youngest and the oldest age groups,
– limiting the mean administration time to a maximum of one hour by using an effective adaptive procedure,
– making the testing materials both attractive for children and durable,
– developing clear directions for the administration of the test and the manner of giving feedback.
Testing materials
From the first experimental version on, the test consisted of the following subtests: Mosaics, Categories, Puzzles, Analogies, Situations and Patterns. This sequence was maintained throughout the three versions. Tests that are spatially oriented are alternated with tests that require reasoning abilities, and abstract testing materials are alternated with materials using concrete (reasoning) pictures. Mosaics is a suitable test to begin with as it requires little direction, the child works actively at a solution, and the task corresponds to activities that are familiar to the child.

The items of the experimental versions consisted of (adapted) items from the Preschool SON and the SON-R 5½-17 and of newly constructed items. Most of the new items were very simple items that would make the test better suited to young children. Table 2.2 shows the origin of the items in the final version of the test. Of a total of 96 items, five of which are example items, 45% are new, 25% are adaptations of Preschool SON items, and 30% are adaptations from the SON-R 5½-17. In the first experimental version the original items of the Preschool SON and the SON-R 5½-17 were used. In the following versions all items of the subtests were redrawn and reworked to improve the uniformity of the material and to simplify the directions for the tasks.

In the pictures of people the emphasis was on pictures of children, and care was taken to have an even distribution of boys and girls. More children with a non-western appearance were included. An effort was made to make the material colorful and attractive, durable and easy to store. A mat was used to prevent the material from sliding around, to facilitate picking up the pieces and to increase the standardization of the test situation.
Table 2.2 Origin of the Items of the Subtests of the SON-R 2½-7

Origin                                      Mos   Cat   Puz   Ana   Sit   Pat   Total
Adapted from the Preschool SON                3     4     6     3     2     6      24
Adapted from the SON-R 5½-17                  6     9     –     5     9     –      29
New items                                     7     3     9    10     4    10      43
Total number of items, including examples    16    16    15    18    15    16      96

Adaptive procedure and duration of administration
To make the test suitable for the age range from two to seven years, a broad range of task difficulty is required. An adaptive test procedure is desirable to limit the duration of the test, and to prevent children from having to do tasks far above or far below their level. Having to do items that are much too difficult is very frustrating and demotivating for children. When older children are given items that are much too easy, they very quickly consider these childish and may then be inclined not to take the next, more difficult items seriously.
In the Preschool SON a discontinuation rule of three consecutive mistakes was used. Because the mistakes had to be consecutive, children sometimes had to make many mistakes before the test could be stopped. In practice this meant that, especially with young children, examiners often stopped too early. In the SON-R 5½-17 the items are arranged in two or three parallel series and in each series the test is discontinued after a total of two mistakes. In the first series the first item is taken as a starting point; in the following series the starting point depends on the performance in the previous series. This method has great advantages: everyone starts the test at the same point, but tasks that are too easy as well as tasks that are too difficult are skipped. Further, returning to an easier level in the next series is pleasant for the child after he or she has done a few tasks incorrectly.
Research was carried out with the first experimental version to see if the adaptive method of the SON-R 5½-17 could also be applied with the SON-R 2½-7. The problem was, however, that the subtests consist of two different parts. This makes a procedure with parallel series confusing and complicated because switching repeatedly from one part of the test to the other may be necessary. In the subsequent construction research, only one series of items of progressive difficulty was used. However, the discontinuation criterion was varied and research was done on the effect of using an entry procedure in which the item taken as a starting point depended on the age of the child.
Finally, on the basis of the results of this research, a procedure was chosen in which the first, third or fifth item is taken as a starting point and each subtest is discontinued after a total of three mistakes. The performance subtests can also be discontinued when two consecutive mistakes are made in the second section of these tests. The items in these subtests have a high level of discrimination, and the children require a fair amount of time to complete the tasks. They become frustrated if they have to continue when the next item is clearly too difficult for them. As a result of the adaptive procedure, the number of items to be administered is strictly limited, and the mean duration of the test is less than an hour, but very little information is lost by skipping a few items. Further, the children's motivation remains high during this procedure because only very few items above their level are administered.
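As an illustration, the entry and discontinuation logic described above can be sketched in code. The following Python fragment is our own reconstruction for clarity, not the official procedure; the names (`administer_subtest`, `solve`, `performance`) are invented:

```python
def administer_subtest(items, solve, start=0, performance=False):
    """Sketch of the SON-R 2 1/2-7 discontinuation rule (assumed logic):
    stop after three mistakes in total; for the performance subtests,
    also stop after two consecutive mistakes in the second section.

    items       -- list of (item_id, section) pairs, section 1 or 2
    solve       -- callback returning True if the child solves the item
    start       -- index of the entry item (0, 2 or 4 for item 1, 3 or 5)
    performance -- True for Mosaics, Puzzles and Patterns
    """
    total_errors = 0
    consecutive_part2 = 0
    administered = []
    for item_id, section in items[start:]:
        correct = solve(item_id)
        administered.append((item_id, correct))
        if correct:
            consecutive_part2 = 0
        else:
            total_errors += 1
            if section == 2:
                consecutive_part2 += 1
        if total_errors >= 3:
            break  # three mistakes in total: discontinue
        if performance and consecutive_part2 >= 2:
            break  # two consecutive mistakes in part II of a performance test
    return administered
```

For example, a reasoning subtest with errors on items 3, 5 and 9 stops directly after item 9, even though the mistakes are not consecutive.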
Difficulty of items and ability to discriminate
After each phase of research the results were analyzed per subtest with the 2-parameter logistic model from item response theory (IRT; see Lord, 1980; Hambleton & Swaminathan, 1985). The program BILOG (Mislevy & Bock, 1990) was used for this analysis. With this program the parameters for difficulty and discrimination of items can be estimated for incomplete tests. The IRT model was used because the adaptive administration procedure makes it difficult to evaluate these characteristics on the basis of p-values and item-total correlations. The parameter for difficulty indicates the level of ability at which 50% of the children solve the item correctly; the parameter for discrimination indicates how steeply, at this level, the probability that the item will be answered correctly increases as ability increases.
Because of the use of an adaptive procedure, it was important that the items were administered in the correct order of progressive difficulty; the examiner had to be reasonably certain that items skipped at the beginning would have been solved correctly, and that items skipped at the end would have been solved incorrectly. Also important was a balanced distribution in the difficulty of the items, and sufficient numbers of easy items for young children and difficult items for older ones. On the basis of the results of the IRT analysis, new items were constructed, some old items were adapted and others were removed from the test. In some cases the order of administration was changed. A problem arising from this was that items may become more difficult when administered earlier in the test: the help and feedback given after an incorrect solution may benefit the child, so that the next, more difficult item becomes relatively easier.
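For reference, the item characteristic curve of the 2-parameter logistic model can be written down in a few lines. This sketch only illustrates the meaning of the two parameters; it is unrelated to BILOG's actual estimation routines:

```python
import math

def p_correct(theta, a, b):
    """2-parameter logistic (2PL) IRT model: probability that a child
    with ability `theta` solves an item with discrimination `a` and
    difficulty `b`.  At theta == b the probability is exactly 0.5,
    which is how the difficulty parameter is defined; `a` controls how
    steeply the probability rises around that point."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

A highly discriminating item (large `a`) separates children just below `b` from children just above it much more sharply than a weakly discriminating one.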
Directions and feedback
An important feature of the SON-tests is that directions can be given verbally as well as nonverbally. This makes the test situation more natural because the directions can correspond to the communication skills of the child. When verbal directions are given, care must be taken not to provide extra information that is not contained in the nonverbal directions. However, nonverbal directions have their limitations: explaining to the children exactly what is expected of them is difficult, certainly with young children. Examples were therefore built into the first items to give the child the opportunity to repeat what the examiner had done or to solve a similar task. As the test proceeds, tasks are solved more and more independently. To make the items of the SON-R 5½-17 suitable for this approach, they were adapted, for example, by first working with cards that have to be arranged correctly instead of pointing out the correct alternative. Not only does the difficulty of the items increase in the subtests, the manner in which they are administered changes as well. In the construction research this procedure was continuously adapted, and the directions were improved in accordance with the experiences and comments of the examiners and of practicing psychologists.
The greatest problems in developing clear directions arose in the second section of the subtest Analogies. Here the child has to apply to a figure a transformation similar to the one shown in an example. This is difficult to demonstrate nonverbally because of the high level of abstraction, but it can be explained in a few words. The test therefore provides first for extensive, repeated practice on one example, and then provides an example with every following item.
The feedback and help given after an incorrect solution are important in giving the child a clear understanding of the aim of the tasks. The manner in which feedback and help should be given was worked out in greater detail during the research and is described in the directions.
Scoring Patterns
In the subtest Patterns, lines and figures must be copied, with or without the help of preprinted dots. Whether the child can draw neatly or accurately is not important when copying, but whether he or she can see and reproduce the structure of the example is. This makes high demands on the assessment, and a certain measure of subjectivity cannot be excluded. During the construction research, a great deal of attention was paid to elucidating the scoring rules, and inter-assessor discrepancies were used to determine which drawings were difficult to evaluate. On this basis, drawings that help to clarify the scoring rules were selected. These drawings are included in the directions for the administration of Patterns.
3 DESCRIPTION OF THE SON-R 2½-7
The SON-R 2½-7 is a general intelligence test for young children. The test assesses a broad spectrum of cognitive abilities without involving the use of language. This makes it especially suitable for children who have problems or handicaps in language, speech or communication, for instance, children with a language, speech or hearing disorder, deaf children, autistic children, children with problems in social development, and immigrant children with a different native language. A number of features make the test particularly suitable for less gifted children and children who are difficult to test: the materials are attractive, the tasks are diverse, the child is given the chance to be active, extensive examples are provided, help is available on incorrect responses, and the discontinuation rules restrict the administration of items that are too difficult for the child.
The SON-R 2½-7 differs in various aspects from the more traditional intelligence tests, in content as well as in manner of administration. Therefore, this test can well be administered as a second test in cases where important decisions have to be taken on the basis of the outcome of a test, or if the validity of the first test is in doubt. Although the reasoning tests in the SON-R 2½-7 are an important addition to the typical performance tests, the nonverbal character of the SON tests limits the range of cognitive abilities that can be tested. Other tests will be required to gain an insight into verbal development and abilities. However, for those groups of children for whom the SON-R 2½-7 has been specifically designed, a clear distinction must be made between intelligence and verbal development. After describing the composition of the subtests, the most important characteristics of the test administration are presented in this chapter.
3.1 THE SUBTESTS
The SON-R 2½-7 is composed of six subtests: 1. Mosaics, 2. Categories, 3. Puzzles, 4. Analogies, 5. Situations and 6. Patterns. The subtests are administered in this sequence. The tests can be grouped into two types: reasoning tests (Categories, Analogies and Situations) and more spatial, performance tests (Mosaics, Puzzles and Patterns). The six subtests consist, on average, of 15 items of increasing difficulty. Each subtest consists of two parts that differ in materials and/or directions. In the first part the examples are included in the items. The second part of each subtest, except in the case of the Patterns subtest, is preceded by an example, and the subsequent items are completed independently. In table 3.1 a short description is given of the tasks in both parts of the subtests. In figures 3.1 to 3.6 a few examples of the items are presented.
Table 3.1 Tasks in the Subtests of the SON-R 2½-7

Mosaics
  Part I:  Copying different simple mosaic patterns in a frame, using red squares.
  Part II: Copying mosaic patterns in a frame, using red, yellow and red/yellow squares.

Categories
  Part I:  Sorting cards into two groups according to the category to which they belong.
  Part II: Three pictures of objects have something in common. From a series of five pictures, two must be chosen that have the same thing in common.

Puzzles
  Part I:  Puzzle pieces must be laid in a frame to resemble a given example.
  Part II: Putting three to six separate puzzle pieces together to form a whole.

Analogies
  Part I:  Sorting disks into two compartments on the basis of form and/or color and/or size.
  Part II: Solving an analogy problem by applying the same principle of change as in the example analogy.

Situations
  Part I:  Half of each of four pictures is printed. The missing halves must be placed with the correct pictures.
  Part II: One or two pieces are missing in a drawing of a situation. The correct piece(s) must be chosen from a number of alternatives.

Patterns
  Part I:  Copying a simple pattern.
  Part II: Copying a pattern in which five, nine or sixteen dots must be connected by a line.
Mosaics (Mos)
The subtest Mosaics consists of 15 items. In Mosaics I, the child is required to copy several simple mosaic patterns in a frame using three to five red squares. The level of difficulty is determined by the number of squares to be used and whether or not the examiner first demonstrates the item. In Mosaics II, diverse mosaic patterns have to be copied in a frame using red, yellow and red/yellow squares. In the easiest items of part II, only red and yellow squares are used, and the pattern is printed at actual size. In the most difficult items, all of the squares are used and the pattern is scaled down.
Categories (Cat)
Categories consists of 15 items. In Categories I, four or six cards have to be sorted into two groups according to the category to which they belong. In the first few items, the drawings on the cards belonging to the same category strongly resemble each other: for example, a shoe or a flower is shown in different positions. In the last items of part I, the child must identify the concept underlying the category him or herself: for example, vehicles with or without an engine. Categories II is a multiple choice test. In this part, the child is shown three pictures of objects that have something in common. Two more pictures that have the same thing in common must then be chosen from another column of five pictures. The level of difficulty is determined by the level of abstraction of the shared characteristic.
[Figure 3.1 Items from the Subtest Mosaics: Item 3 (Part I), Item 9 (Part II), Item 14 (Part II)]

Puzzles (Puz)
The subtest Puzzles consists of 14 items. In part I, puzzle pieces must be laid in a frame to resemble the given example. Each puzzle has three pieces. The first few puzzles are first demonstrated by the examiner. The most difficult puzzles in part I have to be solved independently. In Puzzles II, a whole must be formed from three to six separate puzzle pieces. No directions are given as to what the puzzles should represent; no example or frame is used. The number of puzzle pieces partially determines the level of difficulty.

[Figure 3.2 Items from the Subtest Categories: Item 4 (Part I), Item 11 (Part II)]
[Figure 3.3 Items from the Subtest Puzzles: Item 3 (Part I), Item 11 (Part II)]
Analogies (Ana)
The subtest Analogies consists of 17 items. In Analogies I, the child is required to sort three, four or five blocks into two compartments on the basis of form, color or size. The child must discover the sorting principle him or herself on the basis of an example. In the first few items, the blocks to be sorted are the same as those pictured in the test booklet. In the last items of part I, the child must discover the underlying principle independently: for example, large versus small blocks. Analogies II is a multiple choice test. Each item consists of an example analogy in which a geometric figure changes in one or more aspects to form another geometric figure. The examiner demonstrates a similar analogy, using the same principle of change. Together with the child, the examiner chooses the correct alternative from several possibilities. Then, the child has to apply the same principle of change to solve another analogy independently. The level of difficulty of the items is related to the number and complexity of the transformations.
Situations (Sit)
The subtest Situations consists of 14 items. Situations I consists of items in which one half of each of four pictures is shown in the test booklet. The child has to place the missing halves beside the correct pictures. The first item is printed in color in order to make the principle clear. The level of difficulty is determined by the degree of similarity between the different halves belonging to an item. Situations II is a multiple choice test. Each item consists of a drawing of a situation with one or two pieces missing. The correct piece (or pieces) must be chosen from a number of alternatives to make the situation logically consistent. The number of missing pieces determines the level of difficulty.

Patterns (Pat)
The subtest Patterns consists of 16 items. In this subtest the child is required to copy an example. The first items are drawn freely; in later items pre-printed dots have to be connected to make the pattern resemble the example. The items of Patterns I are first demonstrated by the examiner and consist of no more than five dots.
[Figure 3.4 Items from the Subtest Analogies: Item 8 (Part I), Item 9 (Part I), Item 16 (Part II)]
[Figure 3.5 Items from the Subtest Situations: Item 5 (Part I), Item 10 (Part II)]
[Figure 3.6 Items from the Subtest Patterns: Item 6 (Part I), Item 13 (Part II), Item 16 (Part II)]
The items in Patterns II consist of five, nine or sixteen dots and have to be copied by the child without help. The level of difficulty is determined by the number of dots and whether or not the dots are pictured in the example pattern.
3.2 REASONING TESTS, SPATIAL TESTS AND PERFORMANCE TESTS

Reasoning tests
Reasoning abilities have traditionally been seen as the basis for intelligent functioning (Carroll, 1993). Reasoning tests form the core of most intelligence tests. They can be divided into abstract and concrete reasoning tests. Abstract reasoning tests, such as Analogies and Categories, are based on relationships between concepts that are abstract, i.e., not bound by time or place. In abstract reasoning tests, a principle of order must be derived from the test materials presented, and applied to new materials. In concrete reasoning tests, like Situations, the object is to bring about a realistic time-space connection between persons or objects (see Snijders, Tellegen & Laros, 1989).
Spatial tests
Spatial tests correspond to concrete reasoning tests in that, in both cases, a relationship within a spatial whole must be constructed. The difference lies in the fact that concrete reasoning tests concern a meaningful relationship between parts of a picture, whereas spatial tests concern a ‘form’ relationship between pieces or parts of a figure (see Snijders, Tellegen & Laros, 1989; Carroll, 1993). Spatial tests have long been integral components of intelligence tests. The spatial subtests included in the SON-R 2½-7 are Mosaics and Patterns. The subtest Puzzles is more difficult to classify, as the relationship between the parts concerns form as well as meaning. We expected the performance on Puzzles and Situations to relate to concrete reasoning ability. However, the correlations and factor analysis show that Puzzles is more closely associated with Mosaics and Patterns (see section 5.3).
Performance tests
An important characteristic that Puzzles, Mosaics and Patterns have in common is that the item is solved while manipulating the test stimuli. That is why these three subtests are called performance tests. In the three reasoning tests (Situations, Categories and Analogies), in contrast, the correct solution has to be chosen from a number of alternatives. For the rest, the six subtests are very similar in that perceptual and spatial aspects as well as reasoning ability play a role in all of them.
The performance subtests of the SON-R 2½-7 can be found in a similar form in other intelligence tests; however, in those tests only verbal directions are given. Reasoning tests can also regularly be found in other intelligence tests, but there they often have a verbal form (such as verbal analogies).
In table 3.2 the classification of the subtests is presented. The empirical classification, in which a distinction is made between performance tests and reasoning tests, is based on the results of principal components analysis of the test scores of several different groups of children (see section 5.4). In table 3.2 the number of each subtest indicates the sequence of administration; the sequence of the subtests in the table is based on similarities of content. This sequence is used in the following chapters when presenting the results.

Table 3.2 Classification of the Subtests

No  Abbr  Subtest     Content             Empirical
6   Pat   Patterns    Spatial insight     Performance test
1   Mos   Mosaics     Spatial insight     Performance test
3   Puz   Puzzles     Concrete reasoning  Performance test
5   Sit   Situations  Concrete reasoning  Reasoning test
2   Cat   Categories  Abstract reasoning  Reasoning test
4   Ana   Analogies   Abstract reasoning  Reasoning test
3.3 CHARACTERISTICS OF THE ADMINISTRATION
In this section the most important characteristics of the SON-R 2½-7 are discussed.

Individual intelligence test
Most intelligence tests for children are administered individually. The SON-R 2½-7 follows this tradition for the following reasons:
– the directions can be given nonverbally,
– feedback can be given in the correct manner,
– testing can be tailored to the level of each individual child,
– the examiner can encourage children who are not very motivated or cannot concentrate; personal contact between the child and the examiner is essential for effective testing, certainly for children up to the age of four to five years.
Nonverbal intelligence test
The SON-R 2½-7 is nonverbal. This means that the test can be administered without the use of spoken or written language. The examiner and the child are not required to speak or write, and the testing materials have no language component. One is, however, allowed to speak during the test administration; otherwise an unnatural situation would arise. The manner of administration of the test depends on the communication abilities of the child. The directions can be given verbally, nonverbally with gestures, or using a combination of both. Care must be taken when giving verbal directions that no extra information is given. No knowledge of a specific language is required to solve the items being presented. However, level of language development, for example, being able to name objects, characteristics and concepts, can influence the ability to solve the problems correctly. Therefore the SON-R 2½-7 should be considered a nonverbal test for intelligence rather than a test for nonverbal intelligence.
Directions
An important part of the directions to the child is the demonstration of (part of) the solution to a problem. An example item is included in the administration of the first item on each subtest, and detailed directions are given for all first items. Once the child understands the nature of the task, the examiner can shorten the directions for the following items. If the child does not understand the directions, they can be repeated. In the second part of each subtest an example is given in advance. Once the child understands this example, he or she can do the following items independently.
Feedback
The examiner gives feedback after each item. In the SON-R 5½-17, feedback is limited to telling the child whether his or her answer is correct or incorrect. In the SON-R 2½-7 the examiner indicates whether the solution is correct or incorrect, and, if the answer is incorrect, he or she also demonstrates the correct solution for the child. The examiner tries to involve the child when correcting the answer, for instance, by letting him or her perform the last action. However, the examiner does not explain why the answer was incorrect. By giving feedback, a more normal interaction between the examiner and the child occurs, and the child gains a clearer understanding of the task. The child is given the opportunity to learn and to correct him or herself. In this respect a similarity exists between the SON-tests and tests for learning potential (Tellegen & Laros, 1993a).
Entry procedure and discontinuation rule
Each subtest begins with an entry procedure. Based on age and, when possible, the estimated cognitive level of the child, a start is made with the first, third or fifth item. This procedure was chosen to prevent children from becoming demotivated by being required to solve too many items that are below their level. The design of the entry procedure ensures that the first items the child skips would have been solved correctly. Should the level chosen later appear to be too difficult, the examiner can return to a lower level. However, because of the manner in which the test has been constructed, this should occur infrequently. Each subtest has rules for discontinuation. A subtest is discontinued when a total of three items has been solved incorrectly; the mistakes do not have to be consecutive. The three performance subtests are also discontinued when two consecutive mistakes are made in the second part. Frequent failure often has a drastically demotivating effect on children and can result in refusal to go on.
Time factor
The speed with which the problems are solved plays a very minor role in the SON-R 2½-7. A time limit for completing the items is used only in the second part of the performance tests. The time limit is generous; its purpose is to allow the examiner to end work on an item. The construction research showed that children who go beyond the time limit are seldom able to find a correct solution when given more time.
Duration of test administration
The administration of the SON-R 2½-7 takes about 50 minutes (excluding any short breaks during administration). During the standardization research the administration took between forty and sixty minutes in 60% of the cases. For children with a specific handicap, the administration takes about five minutes longer. For children two years of age, administration time is shorter; nearly 50% of the two-year-olds complete the test in less than forty minutes.
Standardization
The SON-R 2½-7 is meant primarily for children in the age range from 2;6 to 7;0 years. The norms were constructed using a mathematical model in which performance is described as a continuous function of age. An estimate is made of the development of performance in the population, on the basis of the results of the norm groups (see chapter 4). These norms run from 2;0 to 8;0 years. In the age group from 2;0 to 2;6 years, the test should only be used for experimental purposes. In many cases the test is too difficult for children younger than 2;6 years; often, they are not sufficiently motivated, or cannot concentrate well enough, to do the test. In the age group from 7;0 to 8;0 years, however, the test is eminently suitable for children with a cognitive delay or who are difficult to test. The easy starting level and the help and feedback given can benefit these children. For children of seven years old who are developing normally, the SON-R 5½-17 is generally more appropriate.
The scaled subtest scores are presented as standard scores with a mean of 10 and a standard deviation of 3. The scores range from 1 to 19. The SON-IQ, based on the sum of the scaled subtest scores, has a mean of 100 and a standard deviation of 15. The SON-IQ ranges from 50 to 150. Separate total scores can be calculated for the three performance tests (SON-PS) and the three reasoning tests (SON-RS). These have the same distribution characteristics as the IQ score. When using the computer program, the scaled scores are based on the exact age; in the norm tables, age groups of one month are presented. With the computer program, a scaled total score can be calculated for any combination of subtests.
In addition to the scaled scores, which are based on a comparison with the population of children of the same age, a reference age can be determined for the subtest scores and the total scores. This shows the age at which 50% of the children in the norm population perform better, and 50% perform worse.
The reference age ranges from 2;0 to 8;0 years. It provides a different framework for the interpretation of the test results, and can be useful when reporting to persons who are not familiar with the characteristics of deviation scores. The reference age also makes it possible to interpret the performance of older children or adults with a cognitive delay, for whom administration of a test standardized for their age is practically impossible and not meaningful.
As with the SON-R 5½-17, no separate norms for deaf children were developed for the SON-R 2½-7. Our basic assumption is that separate norms for specific groups are only required when a test discriminates against a special group of children because of its contents or the manner in which it is administered. Research using the SON-R 2½-7 and the SON-R 5½-17 with deaf children (see chapter 7) shows that this is absolutely not the case for deaf children with the SON tests.
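The score scaling described above (subtest scores with mean 10 and SD 3, range 1 to 19; SON-IQ with mean 100 and SD 15, range 50 to 150) can be illustrated with a small sketch. The linear transform and the standard deviation of the summed scaled scores used below are assumptions for illustration only; the published norms are derived from the age-continuous model, not from this formula:

```python
def scale(raw, mean_raw, sd_raw, mean=10, sd=3, lo=1, hi=19):
    """Hypothetical linear rescaling, for illustration only: maps a raw
    score (given its population mean and SD) to a standard score with
    mean 10 and SD 3, clipped to the published range 1-19."""
    z = (raw - mean_raw) / sd_raw
    return min(hi, max(lo, round(mean + sd * z)))

def son_iq(scaled_scores, mean=100, sd=15, lo=50, hi=150):
    """Sketch of deriving a total IQ (mean 100, SD 15, range 50-150)
    from the sum of six scaled subtest scores.  The mean of the sum is
    60; its SD depends on the subtest intercorrelations, so the value
    below is an assumed placeholder, not the manual's actual constant."""
    sum_mean, sum_sd = 60.0, 12.0   # sum_sd is an assumption
    z = (sum(scaled_scores) - sum_mean) / sum_sd
    return min(hi, max(lo, round(mean + sd * z)))
```

A child scoring exactly at the population mean on all six subtests (six scaled scores of 10) thus obtains a SON-IQ of 100.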
4 STANDARDIZATION OF THE TEST SCORES

Properly standardized test norms are necessary for the interpretation of the results of a test. The test norms make it possible to assess how well or how poorly a child performed in comparison to the norm population. The norm population of the SON-R 2½-7 includes all children residing in the Netherlands in the relevant age group, except those with a severe physical and/or mental handicap. The standardization process transforms the raw scores into normal distributions with a fixed mean and standard deviation. This allows comparisons to be made between children, including children of different ages. Intra-individual comparisons between performances on different subtests are also possible. As test performances improve very strongly in the age range from two to seven years, the norms should ideally be related to the exact age of the child and not to an age range, as is the case for most intelligence tests for children.
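The idea of norms as a continuous function of exact age can be sketched as follows. The growth curves below are invented placeholders, not the manual's fitted model:

```python
import math

def norm_z(raw, age, mean_fn, sd_fn):
    """Illustrative age-continuous norming: the population mean and SD
    of a raw score are modeled as smooth functions of exact age, so a
    child is compared with peers of exactly the same age rather than
    with a broad age band."""
    return (raw - mean_fn(age)) / sd_fn(age)

# Hypothetical growth curves for one subtest (placeholders only):
mean_fn = lambda age: 2.0 + 2.5 * math.log(age)   # performance rises with age
sd_fn = lambda age: 1.0 + 0.2 * age               # spread widens with age
```

Under such a model the same raw score yields a lower z-score, and hence a lower standard score, for an older child.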
4.1 DESIGN AND REALIZATION OF THE RESEARCH

Age groups
Eleven age groups, increasing in age by six months from 2;3 years to 7;3 years, formed the point of departure for the standardization research. In each group one hundred children were to be tested: fifty boys and fifty girls. When selecting the children, an effort was made to keep the age within each group as homogeneous as possible. The age in the youngest group, for instance, was supposed to deviate as little as possible from two years, three months and zero days.
Regions of research
To ensure a good regional distribution, the research was carried out in ten regions: five in the West, three in the North/East, and two in the South of the Netherlands. The regions were chosen to reflect specific demographic characteristics of the Netherlands. In nine of the ten regions, one examiner administered all the tests; in one region, two examiners shared the test administration. Approximately the same number of children was tested in each region in five separate two-week periods. In each region and each period, the test was administered to 22 children: one boy and one girl from each age group. The sample to be tested thus consisted of 1100 children, i.e., 10 (regions) × 5 (periods) × 11 (age groups) × 2 (one boy and one girl).
Communities
The second phase of the standardization research concerned the selection of the communities in the ten research regions where the test administrations were to take place. In total, 31 communities were selected. Depending on the size of the community, the research was carried out during one, two or three periods. The selected communities were representative of the Netherlands with regard to number of inhabitants and degree of urbanization.
Schools
Children four years and older were tested at primary schools. Research at schools was carried out in the same communities as the research with younger children. One, two or three schools were selected in each community, depending on the number of periods in which research was to be done in that community. To select the schools, a sample was drawn from the schools in each community. The chance of inclusion was proportional to the number of pupils at the school.
Fifty schools were approached; 25 were prepared to participate. Schools that were not prepared to participate were replaced by other schools in the same community. The socio-economic status of the parents was taken into account in the choice of replacement schools.
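The school selection, with inclusion probability proportional to the number of pupils, can be sketched as a simple probability-proportional-to-size draw. This is an assumed implementation for illustration; the manual does not describe the exact algorithm that was used:

```python
import random

def pps_sample(schools, k, seed=0):
    """Probability-proportional-to-size sampling sketch: draw `k`
    distinct schools, where at each draw a school's probability of
    selection is proportional to its number of pupils."""
    rng = random.Random(seed)
    pool = dict(schools)            # name -> number of pupils
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[n] for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]              # sample without replacement
    return chosen
```

Over many draws, a school with 400 pupils is selected far more often than one with 50, which is the intended property of the design.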
Selection of the children
The manner of selecting the children depended on their age. For children in the age groups up to four years, samples were drawn from the local population register, which contains data on the name, date of birth, sex and address of the parents. The boy or girl whose age corresponded most closely to the required age for each age group was selected. The parents received a letter explaining the aims of the research and asking them to participate. If no reaction to this letter was received, they were approached again by letter or by telephone. In about one quarter of the cases, the test could not be administered to the child that had originally been selected. Some parents refused permission for their child to participate. Sometimes the data from the population register were no longer correct, or practical problems made it impossible for the parents to allow their child to participate in the research program. In these cases, the children were replaced, as far as possible, by children from the same community. For children four years and older, the examiner selected, per school and per age group, one boy and one girl whose age on the planned test date corresponded as closely as possible to the required age. If the deviation from the required age was too large, either two boys or two girls were selected from one age group, or one extra child was tested at another school. Parents were sent a written request for permission, which was nearly always given.
Practical implementation
The department of Orthopedagogics of the University of Groningen, responsible for the standardization in the Netherlands of the Reynell Test for Language Understanding and the Schlichting Test for Language Production (Lutje Spelberg & Sj. van der Meulen, 1990), collaborated in the design and execution of the standardization research. In three of the five research periods, children who were tested with the SON-R 2½-7 had also participated in the standardization research of the language tests six months earlier. To validate both the language tests and the SON-R 2½-7, a third test was administered to some of the children in the intervening period. Eleven examiners, eight women and three men, administered most of the tests. Most were psychology graduates with extensive experience in testing young children, some of which had been gained in the previous research they had carried out with the language tests. Children below four years of age were tested at a local primary health care center, in the presence of one of the parents. In a few cases the child was tested at home. Older children were tested at school in a separate room. An effort was made to administer the whole test in one session; however, a short break between the subtests was allowed. At the schools, breaking off the test for longer periods, or even continuing a test the next day, was sometimes necessary because of school hours and breaks. In a few cases the test could not be administered correctly. If no more than four subtests could be administered, the test was considered invalid and was not used in the analyses. This situation occurred in the case of ten children, eight of whom were two years old.
Completing the norm group
The greater part of the standardization research took place in the period from September to December 1993. As fewer children than had been planned were tested in the youngest age groups, the norm group was supplemented with 31 children in the spring of 1994. Further, immigrant children appeared to be under-represented in the youngest age groups. Eight immigrant children, who had been tested in a different research project, were therefore added to the norm group. Finally, eight pupils, four years or older, from special schools were added. This was a sample from a group of children who had been tested at schools for special education with a preschool department.
STANDARDIZATION OF THE TEST SCORES
4.2 COMPOSITION OF THE NORM GROUP

The norm group consisted of 1124 children. Table 4.1 shows the composition of the group according to age and sex, and the distribution according to age of the children who were added to the norm group for various reasons. The mean age per group is practically identical to the planned age, and the distribution according to age within the age groups is very narrow. In all the groups the number of boys is approximately equal to the number of girls. The extent to which the distribution of the selected demographic characteristics of the norm group conformed to that of the total Dutch population (Central Bureau of Statistics, CBS, 1993) is presented in table 4.2. Children from the large urban communities are slightly under-represented, but these communities are also characterized by a relatively smaller number of youngsters.
Weighting the norm group
As a result of sample fluctuations and the different sampling methods used for children above and below four years of age, the backgrounds of the children differed from age group to age group. For the standardization, the following factors were weighted within each age group: the percentage of children with a mother born abroad, the educational level of the mother, and the child's sex. This allowed a better comparison between the different age groups. Finally, the observations were weighted so that the number of children per age group was the same. After weighting, every age group consisted of 51 boys and 51 girls, making the size of the total sample 1122. An example may elucidate this weighting procedure. The percentage of children with a foreign mother in the entire norm group was 11%. If the percentage in the age group 3;9 years, for example, was 8%, the children with a foreign mother in this age group received a weight of 11/8, and the children with a Dutch mother received a weight of (100-11)/(100-8) = 89/92. When using weights, critical limits of 2/3 and 3/2 were adhered to, in order to prevent some children contributing either too much or too little to the composition of the weighted norm group. After the various steps in the weighting procedure, 80% of the children had a weighting factor between .80 and 1.25.

Table 4.1 Composition of the Norm Group According to Age, Sex and Phase of Research (N=1124)

Age Group         2;3   2;9   3;3   3;9   4;3   4;9   5;3   5;9   6;3   6;9   7;3  Total
Phase
  1993             94    89    86    90    99   101   104   104   101   105   103   1076
Addition:
  1994              3     9    11     7     2     –     –     –     –     –     –     32
  Immigrant         1     1     2     3     1     –     –     –     –     –     –      8
  Spec. Educ.       –     –     –     –     –     –     1     1     1     2     3      8
Total              98    99    99   100   102   101   105   105   102   107   106   1124
Sex
  Boys             47    50    48    52    52    50    52    53    49    53    55    561
  Girls            51    49    51    48    50    51    53    52    53    54    51    563
Age
  Mean (years)   2.24  2.76  3.25  3.75  4.25  4.74  5.24  5.74  6.25  6.74  7.24
  SD (days)        14    16    16    14    15    22    22    24    18    23    21
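The weighting rule described in the text can be sketched in a few lines. This is an illustrative sketch only; the function name and the clipping behaviour are our own, applied to the worked example of the 3;9-year group.

```python
def weight_factor(group_pct, overall_pct, lo=2/3, hi=3/2):
    """Weight for one background category within an age group:
    the overall percentage divided by the group's percentage,
    clipped to the critical limits of 2/3 and 3/2."""
    return max(lo, min(hi, overall_pct / group_pct))

# Worked example from the text: 11% foreign mothers overall,
# 8% in the age group 3;9 years.
w_foreign = weight_factor(8, 11)              # 11/8 = 1.375
w_dutch = weight_factor(100 - 8, 100 - 11)    # 89/92, about 0.97
```

The clipping step implements the critical limits mentioned in the text: a category that is very rare in one age group cannot dominate the weighted norm group.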
38
SON-R 2,-7
Table 4.2 Demographic Characteristics of the Norm Group in Comparison with the Dutch Population (N=1124)

                                    Norm Group   Population
Region
  North/East-Netherlands                31%          31%
  South-Netherlands                     19%          22%
  West-Netherlands                      50%          47%
Size of Community
  Less than 10,000 inhabitants          12%          11%
  10,000 to 20,000 inhabitants          22%          20%
  20,000 to 100,000 inhabitants         44%          42%
  More than 100,000 inhabitants         22%          27%
Degree of Urbanization
  (Urbanized) Rural Communities         37%          34%
  Commuter Communities                  16%          15%
  Urban Communities                     47%          51%
Table 4.3 presents the level of education and country of birth of the mother, before and after weighting, for three age groups. As can be seen, the differences between the age groups were much smaller after weighting. The level of education of the mothers corresponded well to the level of education in the population of women between 25 and 45 years of age (CBS, 1994). The percentages for low, middle and high levels of education in the population are respectively 27%, 54% and 19%. The percentage of children whose mother was born abroad also corresponded to the national percentage of 10% immigrant children in the age range from zero to ten years (Roelandt, Roijen & Veenman, 1992).

Table 4.3 Education and Country of Birth of the Mother in the Weighted and Unweighted Norm Group

Unweighted Norm Group
                     Education Mother        Country of Birth Mother
                     Low   Middle   High     Netherlands   Abroad
2 and 3 years        26%     57%     17%         91%           9%
4 and 5 years        32%     51%     17%         90%          10%
6 and 7 years        40%     45%     15%         86%          14%
Total                32%     51%     17%         89%          11%

Weighted Norm Group
                     Low   Middle   High     Netherlands   Abroad
2 and 3 years        28%     54%     18%         89%          11%
4 and 5 years        32%     52%     16%         89%          11%
6 and 7 years        33%     50%     17%         87%          13%
Total                31%     52%     17%         89%          11%
4.3 THE STANDARDIZATION MODEL

Subtest scores
The first step in standardization is transforming the raw subtest scores to normally distributed scores with a fixed mean and standard deviation. Usually, these transformations are carried out separately for each age group. The disadvantage of this method, however, is that the relatively small number of subjects in each age group allows chance factors to play an important role in the transformations. In the SON-R 2½-7, a different method, developed for the standardization of the SON-R 5½-17, was applied (Snijders, Tellegen & Laros, 1989, pp. 43-45; Laros & Tellegen, 1991, pp. 156-157). In this method, the score distributions for all age groups are fitted simultaneously as a continuous function of age. This is done for each subtest separately. The function gives an estimate, dependent on age, of the distribution of the scores in the population. With the fitting procedure an effort is made to minimize the difference between the observed distribution and the estimated population distribution, while limiting the number of parameters of the function. Within the age range of the model two pre-conditions must be met:
1. For each age, the standardized score must increase if the raw score increases.
2. For each raw score, the standardized score must decrease if the age increases.
A great advantage of this method is that the use of information on all age groups simultaneously makes the standardization much more accurate. Further, the standardized scores can be calculated on the basis of the exact age. The model also allows for extrapolation outside the age range in which the standardization research was carried out. In the SON-R 2½-7, the model had to comply with the pre-conditions for the age range from 2;0 to 8;0 years.
The logistic regression model
The logistic regression model estimates the parameters of a function that describes the probability of a certain occurrence as precisely as possible. The model has the following form:

P(occurrence) = exp(Z) / (1 + exp(Z))

Z can be a composite function of independent variables, in our case age and score. The dependent variable is defined by determining, for each person and for each possible score (in the range from 0 to the maximum score minus 1), whether that score or a lower score was received. If this is the case, the dependent variable is given the value 1; if not, the value 0. Because of the narrow distribution of age in each subgroup, the analysis was based on the mean age in the subgroup. However, our model has the special characteristic that standardization does not need to be based on homogeneous age groups. The regression procedure was carried out in two phases. In the first phase, Z was defined as follows:

Z = b0 + b1X + b2X² + b3X³ + b4X⁴ + b5X⁵ + b6Y + b7Y² + b8Y³

Here b0 through b8 are the estimated parameters, X through X⁵ are powers of the raw score, and Y through Y³ are powers of age. When fitting the model, the procedure for logistic regression in SPSS was used (SPSS Inc., 1990). Using the parameters found for the third-degree function of age, age was transformed to Y' in such a manner that the relation between Y' and the test scores in the above-mentioned model became linear. In the following phase, Y' was used in the regression analysis and the interaction between score and age was added to the model. The definition of Z in this second phase was:

Z = b0 + b1X + b2X² + b3X³ + b4X⁴ + b5X⁵ + b6Y' + b7Y'X + b8Y'X² + b9Y'X³ + b10Y'X⁴ + b11Y'X⁵
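The construction of the dependent variable can be illustrated as follows. This is a minimal sketch; the function names are ours and are not part of the SPSS procedure that was actually used.

```python
import math

def logistic(z):
    """P(occurrence) = exp(Z) / (1 + exp(Z))."""
    return math.exp(z) / (1 + math.exp(z))

def expand_thresholds(raw_score, max_score):
    """For one child, build the binary dependent variable for each
    possible score t from 0 to max_score - 1: 1 if the child
    received that score or a lower one, 0 otherwise."""
    return [(t, 1 if raw_score <= t else 0) for t in range(max_score)]
```

A logistic curve is then fitted through these 0/1 observations as a function of score and age, so that `logistic(Z)` estimates the cumulative proportion of children with that score or lower at a given age.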
After the stepwise fitting procedure, the number of selected parameters in the subtests varied from six to ten. The cumulative proportion in the population, in the age range from two to eight, could then be estimated for every possible combination of age and score. Normally distributed z-values were then determined by calculating the mean z-value for the normal-distribution interval that corresponded to the upper limit and the lower limit of each raw score. The averaging procedure caused a slight loss of dispersion, for which we corrected. The model may seem complicated; however, simple linear transformations per age group would have required estimating twenty-two parameters for each subtest, and nonlinear transformations based on the cumulative proportions more than one hundred.
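The normalization step just described, taking the mean z-value over the normal-curve interval between the cumulative proportions of adjacent raw scores, can be sketched as follows. This is our own illustration using only the standard library; the slight dispersion-loss correction mentioned in the text is omitted.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def norm_ppf(p):
    """Inverse of norm_cdf by bisection (adequate for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def interval_mean_z(p_lo, p_hi):
    """Mean z-value over the normal-distribution interval between
    the cumulative proportions p_lo and p_hi (the lower and upper
    limit of one raw score); this is the truncated-normal mean."""
    d_lo = norm_pdf(norm_ppf(p_lo)) if p_lo > 0 else 0.0
    d_hi = norm_pdf(norm_ppf(p_hi)) if p_hi < 1 else 0.0
    return (d_lo - d_hi) / (p_hi - p_lo)

def wechsler_score(p_lo, p_hi):
    """Map the mean z-value to the scale with mean 10 and SD 3,
    bounded to the range 1-19 used for the subtests."""
    return max(1, min(19, round(10 + 3 * interval_mean_z(p_lo, p_hi))))
```

For a raw score that covers a symmetric interval around the median, for example cumulative proportions .40 to .60, the mean z-value is 0 and the scaled score is 10.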
Reliability
For each subtest and age group the reliability was calculated with the formula for lambda-2 (Guttman, 1945). This is, like lambda-3 (coefficient alpha; Cronbach, 1951), a measure of internal consistency. However, lambda-2 is preferable if the number of items is limited and the covariance between the items is not constant (Ten Berge & Zegers, 1978). The reliability for each subtest was fitted as a third-degree function of the transformed age (Y'), using the method of stepwise multiple regression. In a few cases, when extrapolating to the ages of 2;0 and 8;0, extreme values occurred for the estimate of reliability. In these cases, the lower limit for the estimated value was set at .30 and the upper limit at .85.
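Guttman's lambda-2 can be computed directly from the item covariance matrix. The following self-contained sketch (our own implementation, with item scores given as columns) shows the formula: lambda-1 plus the square root of n/(n-1) times the sum of squared off-diagonal covariances, divided by the total variance.

```python
import math

def covariance(x, y):
    """Sample covariance of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def lambda2(items):
    """Guttman's lambda-2 for a list of item-score columns."""
    n = len(items)
    C = [[covariance(items[j], items[k]) for k in range(n)]
         for j in range(n)]
    total_var = sum(sum(row) for row in C)   # variance of the sum score
    trace = sum(C[j][j] for j in range(n))
    off_sq = sum(C[j][k] ** 2
                 for j in range(n) for k in range(n) if j != k)
    return 1 - trace / total_var + math.sqrt(n / (n - 1) * off_sq) / total_var
```

Lambda-2 is never smaller than coefficient alpha computed from the same covariance matrix, which is one reason it is preferred for short subtests.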
Correlations and total scores
In each age group the correlations between the standardized subtest scores were first corrected for unreliability, and then fitted as a third-degree function of age. Using the estimated values of the correlations in the population, the standard deviation of the total score could be calculated for every age and every combination of subtests, and transformed into the required standardized distribution.
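The calculation this paragraph describes can be sketched as follows. The function names and the numbers in the usage note are illustrative; in the actual standardization the subtest SDs and correlations come from the fitted model.

```python
import math

def sum_score_sd(sds, corr):
    """SD of a total score formed by summing subtest scores, given
    the subtest SDs and the (estimated) correlation matrix, using
    Var(sum) = sum over j,k of r_jk * sd_j * sd_k."""
    n = len(sds)
    var = sum(corr[j][k] * sds[j] * sds[k]
              for j in range(n) for k in range(n))
    return math.sqrt(var)

def to_iq_scale(total, mean_total, sd_total):
    """Rescale a total score to the IQ metric (mean 100, SD 15)."""
    return 100 + 15 * (total - mean_total) / sd_total
```

For example, two subtests with SD 3 that correlate .5 give a sum-score variance of 9 + 9 + 2(.5)(9) = 27, so a child one sum-score SD above the mean maps to 115 on the IQ metric.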
4.4 THE SCALED SCORES

The scaled scores are presented in two different ways: as standard scores and as reference ages. The standard score (also called deviation score) shows how well or how badly the child performs in relation to the population of children of the same age. The reference age (also called mental age or test age) shows at which age 50% of the children in the population perform worse than the subject. Unless stated otherwise, standard scores are meant when scaled scores are mentioned in this manual. In the following sections, a short explanation is given of the scaled scores of the SON-R 2½-7.
Standard scores
Scaled subtest scores are presented on a normally distributed scale with a mean of 10 and a standard deviation of 3. These so-called Wechsler scores have a range of 1 to 19. As a result of 'floor' and 'ceiling' effects, the most extreme scores will not occur in all age groups. The raw scores of the subtests are less differentiated than the standard scores. As a result, only some of the values in the range of 1 to 19 are used in each age group. However, the standard scores indicate the position in the normal distribution with a precision that would not be possible with a less differentiated scale. The sum of the six scaled subtest scores is the basis of the IQ score. This SON-IQ has a mean of 100 and a standard deviation of 15; the range extends from 50 to 150. The sum of the scaled scores of Mosaics, Puzzles and Patterns is transformed to provide the Performance Scale (SON-PS), and the sum of Categories, Situations and Analogies forms the Reasoning Scale (SON-RS). Both scales, like the IQ distribution, have a mean of 100 and a standard deviation of 15, with a range from 50 to 150. In the Appendix, the norm tables for the subtests are shown for each month of age, for the age range 2;0 to 8;0 years. The tables for calculating the standardized total scores are presented per four-month period. When the computer program is used, all the standardized scores are based on the exact age.
Reference age
The reference age is derived from the raw score(s); the actual age of the child is not important. For the age range of 2;0 to 8;0 years, the reference age is presented in years and months. The reference age for the subtests can be found in the norm tables. The reference age for the total score is the age at which a child with this raw score would receive an IQ score of 100. This age is determined iteratively, with the help of the computer program, for the Total Score on the test, the Performance Scale and the Reasoning Scale. An approximation of the reference age for the total score is presented in the norm tables in the appendix. This approximation is based on the sum of the six raw subtest scores. For use of the norm tables and the computer program, we refer to chapter 13 (The record form, norm tables and computer program). Directions on the procedure to be used when the test has not been fully administered can also be found in this chapter.
5
PSYCHOMETRIC CHARACTERISTICS
Important psychometric characteristics of the SON-R 2½-7 will be discussed in this chapter: the distribution characteristics of the scores, the reliability and generalizability of the test, the relationships between the test scores, and the stability of the scores. In general, these results are based on the weighted norm group (N=1122). In several analyses comparisons have been made between the results in three age groups, namely:
– two- and three-year-olds (the norm groups of 2;3, 2;9, 3;3 and 3;9 years),
– four- and five-year-olds (the norm groups of 4;3, 4;9, 5;3 and 5;9 years),
– six- and seven-year-olds (the norm groups of 6;3, 6;9 and 7;3 years).
The results in this chapter are relevant for the internal structure of the test. Research on validity, carried out in the norm group, will be discussed in chapter 6 (Relationships with other variables) and in chapter 9 (Relationship with cognitive tests).
5.1 DISTRIBUTION CHARACTERISTICS OF THE SCORES

Level of difficulty of the test items
As entry and discontinuation rules are used in the SON-R 2½-7, it is important that successive items of the subtests increase in difficulty. Table 5.1 shows the p-values of the items, calculated over the entire norm group. The p-value represents the proportion of children who completed the item correctly. Items skipped at the beginning of the subtest are scored as correct; items that are not administered after discontinuation of the test are scored as incorrect. In general, the level of difficulty of the items increased as expected. Six of the 91 items were more difficult than the following item, but in four cases the difference in p-value was only .02.

Table 5.1 P-value of the Items (N=1122)

           Pat    Mos    Puz    Sit    Cat    Ana
item 1     .90    .95    .97    .95    .91    .96
item 2     .88*   .81    .90    .91    .89    .93
item 3     .90    .77    .89    .87    .89    .84*
item 4     .88    .76    .79    .86    .82    .86
item 5     .86    .73    .76    .80    .75    .73
item 6     .79    .70    .72    .67    .69    .52*
item 7     .77    .64    .64    .56    .64    .58
item 8     .62    .58    .59    .54    .51    .57
item 9     .60    .46    .37*   .46    .49    .45
item 10    .43    .33    .44    .32    .33    .28
item 11    .33    .23    .25    .17    .30    .28
item 12    .30    .14    .19    .12    .17    .23
item 13    .21    .08*   .13    .07    .10    .15
item 14    .20    .10    .05    .06    .09    .13
item 15    .13    .06                  .05    .04*
item 16    .04                                .06
item 17                                       .04

*: the p-value is lower than the p-value of the following item
For two items, item 9 of Puzzles and item 6 of Analogies, the difference was larger. The six deviating items are marked with an asterisk in table 5.1.
IRT model
As in the construction research, the item characteristics for the definitive test were estimated with the 2-parameter model from item response theory. The computer program BIMAIN (Zimowski et al., 1994) was used for these calculations. This program does not require all subjects to have completed all the items. The two item parameters estimated for the items of each subtest are the a-parameter, which shows how well the item discriminates, and the b-parameter, which shows how difficult the item is. To obtain a reliable estimate of the item parameters, the analysis was carried out on the test results of 2498 children, almost all the children who were tested during the standardization and the validation research. The estimate is based on the items that were actually administered. In figure 5.1 the item characteristics are represented in a graph. The distribution of the b-parameters is similar to the results obtained on the basis of the p-values. Except for a few small deviations, the items increase in difficulty. The difficulty of the items is also distributed evenly over the range from -2 to +2. The mean of the discrimination parameter is highest for Patterns (mean=4.8) and Mosaics (mean=3.8). For Puzzles, Situations, Categories and Analogies, the means are 2.8, 2.4, 2.9 and 2.2 respectively. Within the subtests, however, the discrimination values of the items can diverge strongly. Initially, we considered basing the scoring and standardization of the SON-R 2½-7 on the estimated latent abilities as represented in the IRT model. A good method for doing this with incomplete test data was described by Warm (1989). Such a method of scoring has important advantages: items which discriminate clearly have more weight in the evaluation, no assumptions need to be made about scores on items that were not administered, and the precision of statements about the ability of a person can be shown more clearly.
However, the disadvantages are that this scoring method requires a computer, and that important differences can occur between the standardized computer results and the results obtained with norm tables. The main factor in the decision not to apply the IRT model when standardizing the test, however, was the fact that the data did not fit the model. This is not surprising. The IRT model assumes that the item scores are obtained independently. However, the feedback and the help given with the SON-R 2½-7 create an interdependence among the scores. This works out positively for the test and its validity, but it limits the psychometric methods that can be applied successfully. IRT models that take learning effects into account are being developed (see Verhelst & Glas, 1995), but programs with which the item parameters can be estimated in combination with an adaptive test administration are not yet available.
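For reference, the 2-parameter logistic model can be written in one common parameterization as follows. This is an illustration only; the exact scaling of the parameters reported by BIMAIN may differ.

```python
import math

def p_correct(theta, a, b):
    """2-parameter logistic IRT model: the probability that a child
    with latent ability theta solves an item with discrimination a
    and difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))
```

At theta equal to b the probability is .50; a larger a-parameter makes the curve steeper around b, which is what "better discrimination" means here.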
Correlation of test performances with age
Table 5.2 presents, for each age group, the mean and the standard deviation of the raw subtest scores and of the sum of the six subtest scores. The mean score increases with age for all subtests. The sum of the raw subtest scores increases by about nine points per half year in the youngest age groups, and by about five points per half year in the oldest age groups. The strong relation with age is also evident from the high correlations between the subtest scores and age. The multiple correlation of age and the square of age with the subtest scores has a mean of .87 and varies from .80 (Analogies) to .91 (Patterns). For the other subtests, the correlations are .85 (Situations), .88 (Categories), .89 (Puzzles) and .90 (Mosaics). For the sum score, the multiple correlation with age is .93. Because of the large increase in test performance with age, the norm tables were constructed for each month of age.
Distribution of the standardized scores
The subtest scores, standardized and normalized for age, are presented on a scale of 1 to 19 with a mean of 10 and a standard deviation of 3. The sum of the six standardized subtest scores is presented on a scale with a mean of 100 and a standard deviation of 15. This score, the SON-IQ,
Figure 5.1 Plot of the Discrimination (a) and Difficulty (b) Parameter of the Items
[Six panels — Patterns, Mosaics, Puzzles, Situations, Categories and Analogies — each plotting the discrimination parameter (a, vertical axis) against the difficulty parameter (b, horizontal axis, from -3 to +2) for the numbered items of that subtest.]
Table 5.2 Mean and Standard Deviation of the Raw Scores

Age      Pat          Mos          Puz          Sit          Cat          Ana          Sum Subt.
         Mean (SD)    Mean (SD)    Mean (SD)    Mean (SD)    Mean (SD)    Mean (SD)    Mean (SD)
2;3      1.1 (1.5)    1.3 ( .9)    2.0 (1.0)    1.6 (1.7)    1.2 (1.6)    1.8 (1.4)     9.0 ( 5.3)
2;9      3.8 (2.3)    1.8 (1.1)    2.7 (1.3)    3.6 (2.2)    2.8 (2.0)    3.5 (1.9)    18.1 ( 6.8)
3;3      5.7 (1.8)    3.3 (2.1)    4.3 (2.0)    5.1 (1.8)    4.7 (1.9)    5.1 (2.0)    28.3 ( 8.4)
3;9      7.1 (1.3)    5.3 (2.2)    6.3 (1.9)    6.2 (1.6)    6.1 (1.9)    6.2 (2.1)    37.3 ( 7.6)
4;3      8.4 (1.3)    7.3 (1.9)    7.7 (1.7)    7.3 (1.7)    7.6 (1.9)    6.9 (2.1)    45.2 ( 7.2)
4;9      9.5 (1.5)    8.3 (1.8)    8.4 (2.0)    7.8 (1.7)    8.5 (2.0)    8.2 (2.1)    50.7 ( 7.9)
5;3     10.4 (1.6)    9.0 (1.5)    9.4 (1.7)    8.4 (1.7)    8.6 (1.7)    8.4 (2.4)    54.3 ( 6.7)
5;9     11.2 (1.9)    9.8 (1.8)   10.0 (2.0)    9.1 (1.7)    9.9 (1.9)    9.6 (2.6)    59.8 ( 8.6)
6;3     12.9 (2.0)   11.1 (2.0)   11.0 (1.9)   10.2 (1.8)   11.0 (1.7)   10.5 (3.3)    66.6 ( 9.1)
6;9     13.2 (1.8)   11.4 (1.9)   11.2 (1.5)   10.5 (1.7)   11.2 (1.9)   11.3 (3.1)    68.8 ( 8.3)
7;3     14.1 (1.7)   12.2 (2.0)   11.5 (1.5)   11.1 (1.6)   11.9 (1.6)   12.7 (3.0)    73.6 ( 8.3)
Total    8.9 (4.3)    7.4 (4.1)    7.7 (3.7)    7.4 (3.4)    7.6 (3.9)    7.7 (4.0)    46.5 (21.7)
ranges from 50 to 150. A distribution with a mean of 100 and a standard deviation of 15 is also used for the Performance Scale (SON-PS), based on the sum of the scores of Mosaics, Puzzles and Patterns, and for the Reasoning Scale (SON-RS), based on the sum of the scores of Categories, Analogies and Situations. In table 5.3, the mean and the standard deviation of the standardized scores are presented for the entire weighted norm group and for three age groups. Only very small deviations from the planned distribution were found for the entire group. No significant deviations from the normal distribution were found in tests for skewness and kurtosis. Deviations in mean and dispersion sometimes differed slightly across the three separate age groups, but an analysis of variance showed that the differences between the means were not significant. A test for the homogeneity of the variances also failed to show any significant differences. The kurtosis was not significant in the different groups. The distribution was positively skewed for Puzzles and for the Reasoning Scale in the oldest group. However, the values for skewness were small: .4 and .3, respectively. An analysis of variance was also carried out over the eleven original age groups. No significant differences in mean and variance between the groups were established for any of the variables.

Table 5.3 Distribution Characteristics of the Standardized Scores in the Weighted Norm Group

              Total          2-3 years      4-5 years      6-7 years
              Mean (SD)      Mean (SD)      Mean (SD)      Mean (SD)
Patterns       10.0 (2.9)     9.9 (2.8)     10.0 (2.9)     10.1 (3.1)
Mosaics        10.1 (3.0)    10.0 (3.0)     10.0 (3.1)     10.2 (3.0)
Puzzles        10.0 (3.0)    10.0 (2.9)     10.0 (3.0)     10.1 (3.0)
Situations     10.0 (2.9)    10.0 (2.8)      9.9 (3.1)     10.0 (2.8)
Categories     10.0 (2.9)    10.0 (2.9)     10.0 (3.0)     10.1 (2.9)
Analogies      10.0 (2.9)    10.0 (2.7)     10.0 (3.0)      9.8 (3.1)
SON-PS        100.2 (15.1)  100.1 (15.2)    99.9 (15.0)   100.6 (15.2)
SON-RS         99.9 (15.0)  100.1 (14.5)   100.0 (15.6)   100.0 (14.9)
SON-IQ        100.1 (15.0)  100.1 (14.8)    99.9 (15.2)   100.5 (15.0)
Table 5.4 Floor and Ceiling Effects at Different Ages

Floor Effect (lowest possible standardized score)
Age    Pat  Mos  Puz  Sit  Cat  Ana    PS   RS   IQ
2;0      9    6    4    8    9    7    70   86   73
2;3      8    6    3    7    8    6    68   80   68
2;6      6    5    3    5    7    5    62   72   63
2;9      4    4    2    4    5    3    52   61   51
3;0      3    3    2    3    3    2    52   52   50
3;3      1    2    1    2    2    1    50   50   50
3;6      1    1    1    1    1    1    50   50   50

Ceiling Effect (highest possible standardized score)
Age    Pat  Mos  Puz  Sit  Cat  Ana    PS   RS   IQ
5;0     19   19   19   19   19   19   150  150  150
5;6     19   19   19   19   18   19   150  150  150
6;0     18   18   18   18   18   19   149  150  150
6;6     16   17   17   17   17   18   141  149  150
7;0     15   16   16   16   16   17   137  140  143
7;6     14   15   16   16   16   16   132  138  139
8;0     13   14   15   16   15   15   126  134  133
These results indicate that the standardization model is adequate and gives a good estimate of the distribution of the scores in the population; the deviations in the samples can be seen as chance deviations from the population values resulting from sample fluctuations.
'Floor' and 'ceiling' effects
Although the standardization of the subtest scores was based on a distribution with a range from 1 to 19, these scores could not be obtained in all age groups. The youngest children had raw scores of zero so often that the standardized scores were substantially higher than 1. This means that, at this age, the test differentiates less for children with a low performance level. The first part of table 5.4 presents, for a number of ages, the standardized scores in the situation where the child receives no positive scores. In the age range 2;0 to 2;6 years, considerable 'floor' effects can be seen. From 2;9 years onwards these effects are much smaller: the lowest possible standard subtest scores are about two standard deviations below the mean of 10, and the lowest IQ score that can occur is 51. From 3;6 years onwards, no 'floor' effects occur. The second part of table 5.4 presents the standardized scores for the situation in which all the items are done correctly. From the age of about 6;0 onwards, small 'ceiling' effects can be observed. From 7;0 years onwards, these effects become more important and the maximum IQ score of 150 can no longer be reached.
5.2 RELIABILITY AND GENERALIZABILITY

Reliability of the subtests
The reliability of the subtests is based on the internal consistency of the item scores. The reliability was calculated using the formula for lambda-2 (Guttman, 1945). However, an assumption made by this and similar formulas for internal consistency is that the item scores are obtained independently. The sequence in which the items are administered should therefore have no effect on the scores. In the case of the SON-R 2½-7, this condition is not fulfilled for two reasons. First, the entry and discontinuation rules mean that scores on some items determine whether other items are or are not administered. The latter items are, however, scored as 'correct' or 'incorrect'. When item scores become interdependent in this way, reliability is inflated. In the case of the SON-R 5½-17, where this was investigated, the mean overestimation of the reliability of the subtests as a result of the adaptive procedure was .11 (Snijders, Tellegen & Laros, 1989, pp. 46-51). The item scores are not independent for a second reason. After every item that a child cannot solve independently, extensive help and feedback are given. This often leads to the next, more difficult item being solved correctly. These inconsistencies, which have a valid cause, lead to an underestimation of reliability. The net effect of the underestimation of reliability (as a result of valid inconsistencies) on the one hand, and the overestimation of reliability (as a result of artificial consistencies) on the other, cannot be determined. Therefore, the reliability of the subtests of the SON-R 2½-7 was based on the formulas for internal consistency and no correction for under- or overestimation was applied. The uncertainty about the correctness of the estimate of reliability is a reason to be reticent about the individual interpretation of results at the subtest level. It was also the reason why the standardized subtest scores were not presented, as was done with the SON-R 5½-17, in such a way that the reliability was taken into account in the score.

Table 5.5 Reliability, Standard Error of Measurement and Generalizability of the Test Scores
Reliability
Age    Pat   Mos   Puz   Sit   Cat   Ana   Mean     PS    RS    IQ
2;6    .79   .41   .45   .79   .81   .75    .67    .68   .89   .86
3;6    .73   .76   .75   .66   .73   .73    .73    .86   .84   .90
4;6    .72   .77   .75   .62   .70   .74    .72    .88   .81   .90
5;6    .74   .74   .70   .62   .68   .78    .71    .87   .81   .90
6;6    .76   .78   .69   .66   .68   .83    .73    .87   .84   .91
7;6    .79   .84   .69   .69   .69   .85    .76    .88   .86   .92
Mean   .75   .73   .69   .67   .71   .78    .72    .85   .84   .90

Standard Error of Measurement
Age    Pat   Mos   Puz   Sit   Cat   Ana     PS    RS    IQ
2;6    1.4   2.3   2.2   1.4   1.3   1.5    8.5   5.0   5.6
3;6    1.6   1.5   1.5   1.7   1.6   1.6    5.6   6.1   4.7
4;6    1.6   1.5   1.5   1.9   1.7   1.5    5.3   6.6   4.7
5;6    1.5   1.5   1.6   1.8   1.7   1.4    5.4   6.5   4.7
6;6    1.5   1.4   1.7   1.8   1.7   1.2    5.4   6.0   4.5
7;6    1.4   1.2   1.7   1.7   1.7   1.2    5.2   5.5   4.2

Generalizability
Age     PS    RS    IQ
2;6    .45   .74   .71
3;6    .67   .66   .77
4;6    .77   .57   .78
5;6    .78   .56   .78
6;6    .75   .63   .80
7;6    .71   .71   .82
Mean   .69   .64   .78

Standard Error of Estimation
Age     PS    RS    IQ
2;6   11.1   7.7   8.0
3;6    8.7   8.8   7.1
4;6    7.3   9.8   7.1
5;6    7.0   9.9   7.0
6;6    7.5   9.1   6.7
7;6    8.1   8.1   6.4
PSYCHOMETRIC CHARACTERISTICS
49
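The lambda-2 coefficient used above can be sketched directly from an item-score matrix. This is a minimal illustration, assuming independently obtained item scores (which, as noted, the adaptive procedure violates); the function name and the toy data are illustrative, not taken from the manual.

```python
import numpy as np

def guttman_lambda2(items):
    """Guttman's lambda-2 for a (subjects x items) score matrix.

    lambda2 = 1 - (sum of item variances
                   - sqrt(n/(n-1) * sum of squared inter-item covariances))
                  / variance of the total score
    """
    items = np.asarray(items, dtype=float)
    n = items.shape[1]                     # number of items
    cov = np.cov(items, rowvar=False)      # inter-item covariance matrix
    total_var = cov.sum()                  # variance of the sum score
    off = cov.copy()
    np.fill_diagonal(off, 0.0)             # keep only the covariances
    c2 = (off ** 2).sum()
    return 1.0 - (np.trace(cov) - np.sqrt(n / (n - 1) * c2)) / total_var

# toy item scores for four children on three items (illustrative only)
scores = [[1, 1, 1],
          [2, 2, 1],
          [3, 3, 2],
          [4, 4, 2]]
print(round(guttman_lambda2(scores), 2))   # → 0.96
```

Lambda-2 never falls below coefficient alpha computed on the same data, which is why it is the less pessimistic of the two internal-consistency estimates.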
The calculated values of lambda-2 have been fitted in the standardization model as a function of age. The results for a number of ages are presented in table 5.5. The mean reliability of the subtests is .72; it increases, though not regularly, with age. Very low reliabilities were found for Mosaics and Puzzles at the age of 2;6 years. A learning effect may occur with these subtests at a young age when help is offered, and this may result in an underestimation of reliability.
In the second part of table 5.5 the standard errors of measurement are presented. The standard error of measurement is the standard deviation of the standardized scores that would be obtained by an individual child if the subtest could be administered to him or her many times. It indicates how strongly the test results of a child can fluctuate. Section 13.4 describes how to use the standard error of measurement to test the differences between scores statistically.
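The standard error of measurement follows directly from the reliability and the scale standard deviation, and the values in table 5.5 can be reproduced this way (subtests: SD 3; IQ: SD 15). The difference test referenced to section 13.4 presumably uses the standard error of the difference between two scores; the sketch below makes that assumption explicit.

```python
import math

def sem(sd, reliability):
    # standard error of measurement: SD * sqrt(1 - reliability)
    return sd * math.sqrt(1.0 - reliability)

def min_significant_difference(sem_a, sem_b, z=1.96):
    # smallest difference between two scores significant at the 5% level
    # (two-sided), based on the standard error of the difference
    return z * math.sqrt(sem_a ** 2 + sem_b ** 2)

print(round(sem(15, 0.90), 1))   # IQ scale, mean reliability .90 → 4.7
print(round(sem(3, 0.75), 1))    # subtest scale, reliability .75 → 1.5
```

With two subtests of reliability .75, for example, the observed difference has to exceed roughly four standardized-score points before it is significant at the 5% level.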
Reliability of the total scores
The reliability of the Performance Scale, the Reasoning Scale and the SON-IQ was calculated using the formula for stratified alpha, a formula for the reliability of linear combinations (Cronbach, Schönemann & McKie, 1965; Nunnally, 1978, p. 246-250). The reliability of the IQ score had a mean of .90. Reliability increased with age, from .86 at 2;6 years to .92 at 7;6 years. The standard error of measurement of the IQ decreased from 5.6 at 2;6 years to 4.2 at 7;6 years (see table 5.5). The mean reliability of the Performance Scale was .85 and the mean reliability of the Reasoning Scale .84. In general, the reliability of the Performance Scale was higher; the youngest children formed an exception. In this group, the reliability of the Reasoning Scale was clearly higher than that of the Performance Scale. The scores on the Performance Scale and the Reasoning Scale were strongly correlated. In the entire norm group the correlation was .56. In the age groups two and three, four and five, and six and seven years, the correlations were .52, .55 and .61 respectively. The correlation between the two scales decreases the reliability of the difference between the Performance Scale and the Reasoning Scale: the mean reliability of the difference score was .65. The minimum difference between the two scores required for significance at the 1% and 5% levels is shown in the norm tables.
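The two quantities above can be sketched as follows. The stratified-alpha function is a generic implementation of the reliability of a sum of components; the difference-score formula, applied to the manual's mean values (.85, .84, intercorrelation .56), reproduces the reported .65. Both are illustrative sketches, not the exact computations used for the norm tables.

```python
def stratified_alpha(variances, reliabilities, total_variance):
    # reliability of a sum score: 1 - (sum of component error variances)
    #                                 / (variance of the sum score)
    error = sum(v * (1.0 - r) for v, r in zip(variances, reliabilities))
    return 1.0 - error / total_variance

def difference_reliability(rel_a, rel_b, r_ab):
    # reliability of the difference between two equally scaled scores
    return (0.5 * (rel_a + rel_b) - r_ab) / (1.0 - r_ab)

print(round(difference_reliability(0.85, 0.84, 0.56), 2))  # → 0.65
```

The formula makes visible why the difference score is so much less reliable than either scale: the higher the correlation between the two scales, the more of their shared reliable variance cancels out in the difference.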
Generalizability of the total scores
The generalizability of the IQ and the two scale scores was also determined. This shows how well one can generalize, on the basis of the selected subtests, to the total domain of comparable subtests. The generalizability was calculated using the formula for coefficient alpha, with subtest scores instead of item scores as the unit of analysis. For homogeneous (sub)tests, alpha, as a measure of internal consistency, can also be used as an estimate of reliability. However, coefficient alpha has a different meaning for a total score based on subtest scores, each of which has its own specific reliable variance. In this case, it can be interpreted as a measure of generalizability. The six subtests of the SON-R 2½-7 can be considered a sample from the domain of similar nonverbal subtests. Alpha represents the expected correlation of the IQ score with the total score on a different, same-sized combination of subtests from the domain. The square root of alpha is the correlation of the IQ score with the hypothetical test score that would be expected if a large number of similar nonverbal subtests had been administered. The same applies to the Performance Scale and the Reasoning Scale; here, however, the domain of subtests is limited to similar performance or reasoning tests. The mean generalizability coefficient (α) of the SON-IQ was .78. It increased from .71 at 2;6 years to .82 at 7;6 years. The mean generalizability of the Performance Scale was .69 (relatively high for the middle age groups) and of the Reasoning Scale .64 (relatively high for the extreme age groups). In table 5.5 the standard errors of estimation, based on the generalizability coefficient, are also presented. The standard error of estimation for the IQ represents the standard deviation of the distribution of IQ scores of all subjects with the same SON-IQ that would be found if a large number of subtests were administered.
The greater the dispersion, the less accurate are the statements about ‘the’ level of intelligence based on these test results. The standard error of
estimation was used to construct the interval in which the ‘domain score’ will, with a certain probability, be found. This interval is not situated symmetrically around the obtained score. Taking the distribution of the scores in the norm population as the point of departure, the middle of the interval equals 100 + √α(IQ−100), and the standard error of estimation equals 15√(1−α). In the norm tables, this interval is presented for each IQ score with a margin of 1.28 times the standard error of estimation on either side of the middle, which means that the probability that the ‘domain score’ lies in the interval is 80%. When using the computer program, these intervals are also presented for the Performance Scale and the Reasoning Scale. For individual assessments, the interval gives a good indication of the accuracy with which a statement, based on the test results, can be made about the level of intelligence. The interval is broader than the intervals that are based, as is customary, on the reliability of the test. When interpreting the results of an intelligence test, one will, in general, not want to limit oneself to the specific abilities included in the test. The interval based on generalizability takes into account that the number of items per subtest is necessarily limited, and that the choice of the subtests is itself a limitation. Given the problems in correctly determining the reliability of the subtests of the SON-R 2½-7, it is fortunate that the calculation of the generalizability of the total scores depends exclusively on the number of subtests and the strength of the correlations between the subtests, and not on the reliability of the subtests.
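Both points can be sketched in a few lines: generalizability computed from the subtest intercorrelations alone, and the domain-score interval built from the formulas above. The uniform correlation matrix is an illustrative simplification; with the overall mean inter-subtest correlation of .36 it gives α ≈ .77, close to the reported mean of .78 (the actual correlations are not uniform).

```python
import math
import numpy as np

def generalizability_alpha(corr):
    # coefficient alpha with subtests as the unit of analysis:
    # depends only on the number of subtests and their intercorrelations
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    return n / (n - 1) * (1.0 - np.trace(corr) / corr.sum())

def domain_score_interval(iq, alpha, z=1.28):
    # 80% interval for the 'domain score'; the centre is regressed
    # towards the population mean of 100
    centre = 100.0 + math.sqrt(alpha) * (iq - 100.0)
    se_est = 15.0 * math.sqrt(1.0 - alpha)
    return centre - z * se_est, centre + z * se_est

# six standardized subtests with a uniform intercorrelation of .36
R = np.full((6, 6), 0.36)
np.fill_diagonal(R, 1.0)
alpha = generalizability_alpha(R)
print(round(alpha, 2))                     # → 0.77

lo, hi = domain_score_interval(120, 0.78)
print(round(lo), round(hi))                # → 109 127
```

Note how the interval for an IQ of 120 is shifted towards 100 rather than centred on 120, exactly as the asymmetry described above requires.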
Comparison with the Preschool SON and the SON-R 5½-17
The reliability and generalizability of the IQ score of the SON-R 2½-7 were compared with those of the previous version of the test, the Preschool SON, and with the revision of the SON for older children, the SON-R 5½-17. In the manual of the Preschool SON, reliabilities based on calculations over combined age groups were presented. The combination of age groups leads to a strong overestimation of reliability. Therefore, new calculations were carried out on the original standardization material, and the reliability and the generalizability for homogeneous age groups were determined (Tellegen et al., 1992). The reliability and the generalizability of the SON-R 2½-7 were greatly improved with respect to the Preschool SON. This is especially so for the more extreme age groups. However, an improvement can also be seen for the four-year-olds, for whom the reliability and generalizability of the old Preschool SON were highest (table 5.6).

Table 5.6 Reliability and Generalizability of the IQ Score of the Preschool SON, the SON-R 2½-7 and the SON-R 5½-17

             Reliability                            Generalizability
Age          P-SON   SON-R 2½-7   SON-R 5½-17      P-SON   SON-R 2½-7   SON-R 5½-17
2;6 years     .78       .86            –            .54       .71            –
3;6 years               .90            –            .69       .77            –
4;6 years     .86       .90            –            .74       .78            –
5;6 years     .82       .90           .90           .71       .78           .79
6;6 years               .91           .92           .62       .80           .81
7;6 years               .92           .93           .52       .82           .83
In comparison with the SON-R 5½-17, the results of similar age groups for reliability and generalizability are practically the same. However, for the total age range of the SON-R 5½-17, the mean reliability (.93) and the generalizability (.85) are higher than for the SON-R 2½-7.
5.3 RELATIONSHIPS BETWEEN THE SUBTEST SCORES

The relationship between the test scores was examined using the correlations between the subtests and the correlations of each subtest with the sum of the remaining subtests.
Correlations between the subtests
The correlations between the standardized subtest scores for the entire norm group and for three age groups are presented in table 5.7. The mean correlation in the entire group was .36. The strongest correlations were found between Patterns and Mosaics (.50) and between Puzzles and Mosaics (.45); the weakest correlations were those of Categories and Analogies with Puzzles (.30 and .28) and of Analogies with Situations (.31). The mean correlations increased with age. In the youngest group the mean was .33, in the middle group .37, and in the oldest group .40. If we compare the oldest and youngest age groups, nearly all correlations appear to increase. The exception to the rule is Categories: the correlation of Categories with Patterns increased, but the correlations with the other four subtests decreased. The increase in the correlations with age corresponds to the findings with the SON-R 5½-17, where the mean correlation in the age range 6;6 to 14;6 years increased from .38 to .51. The mean correlation with the SON-R 5½-17 for the six- and seven-year-olds was .39, almost equal to the mean correlation of .40 in the same age group with the SON-R 2½-7. With the SON-R 2½-7, as with the SON-R 5½-17, the correlation between the performances on the different subtests increased with age. This also increased the reliability and generalizability of the SON-IQ for the older age groups.

Table 5.7 Correlations Between the Subtests

Age: 2-7 years
       Pat   Mos   Puz   Sit   Cat   Ana
Pat     –
Mos    .50    –
Puz    .39   .45    –
Sit    .35   .36   .34    –
Cat    .35   .36   .30   .39    –
Ana    .34   .37   .28   .31   .39    –

Age: 2-3 years
       Pat   Mos   Puz   Sit   Cat   Ana
Pat     –
Mos    .36    –
Puz    .24   .39    –
Sit    .33   .30   .31    –
Cat    .32   .39   .31   .51    –
Ana    .28   .31   .22   .29   .45    –

Age: 4-5 years
       Pat   Mos   Puz   Sit   Cat   Ana
Pat     –
Mos    .60    –
Puz    .50   .49    –
Sit    .33   .34   .32    –
Cat    .36   .34   .32   .33    –
Ana    .32   .40   .28   .28   .33    –

Age: 6-7 years
       Pat   Mos   Puz   Sit   Cat   Ana
Pat     –
Mos    .56    –
Puz    .44   .47    –
Sit    .41   .48   .38    –
Cat    .37   .33   .26   .33    –
Ana    .43   .39   .35   .36   .41    –
Table 5.8 Correlations of the Subtests with the Rest Total Score and the Square of the Multiple Correlations

Correlation with Rest Total
       2-7 years   2-3   4-5   6-7
Pat       .56      .44   .61   .63
Mos       .59      .52   .63   .63
Puz       .50      .42   .55   .53
Sit       .49      .51   .45   .54
Cat       .51      .59   .47   .46
Ana       .47      .45   .45   .54

Square of the Multiple Correlation
       2-7 years   2-3   4-5   6-7
Pat       .33      .20   .43   .41
Mos       .37      .28   .45   .43
Puz       .27      .20   .33   .30
Sit       .25      .31   .20   .30
Cat       .27      .40   .23   .24
Ana       .24      .24   .22   .30
Correlation with the total score
The correlation of the subtests with the total score was examined by calculating the correlation with the unweighted sum of the five remaining subtests and the square of the multiple correlation of a subtest with the five remaining subtests (table 5.8). The latter indicates the proportion of variance explained by the optimally weighted combination of the other subtests. For the entire norm group, Patterns and Mosaics correlated most strongly with the remaining total score. However, this was not the case in the youngest age group. For the two- to three-year-olds, Categories had the strongest correlation with the remaining total score (.59), but for the six- to seven-year-olds this correlation decreased to .46. In this age range, Categories had the weakest correlation with the remaining subtests. About 70% of the variance of each subtest could not be explained by the scores on the other subtests. This is partially explained by the unreliability of the subtests. However, it also indicates that a substantial part of the reliable variance of each subtest is specific. The importance of the subtest-specific reliable variance decreased as the children grew older.
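The squared multiple correlation of each subtest on the remaining five can be read directly from the inverse of the correlation matrix. The sketch below uses a small uniform toy matrix rather than the actual values of table 5.7, purely to show the mechanics.

```python
import numpy as np

def squared_multiple_correlations(corr):
    # R^2 of each variable regressed on all the others:
    # R^2_j = 1 - 1 / [R^-1]_jj
    inv = np.linalg.inv(np.asarray(corr, dtype=float))
    return 1.0 - 1.0 / np.diag(inv)

# three variables with a uniform correlation of .5:
# analytically R^2 = 2*rho^2 / (1 + rho) = 1/3 for each variable
R = np.full((3, 3), 0.5)
np.fill_diagonal(R, 1.0)
print(squared_multiple_correlations(R).round(3))  # → [0.333 0.333 0.333]
```

Applying the same function to a 6×6 subtest correlation matrix yields the proportions of variance reported in the second half of table 5.8.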
5.4 PRINCIPAL COMPONENTS ANALYSIS

In order to determine how many dimensions can be distinguished meaningfully when interpreting the test results, a Minimum Rank Factor Analysis was first carried out for the entire norm group (Ten Berge & Kiers, 1991). This method was used to determine how many factors were required to explain the common variance of the variables. One factor explained 87% of the common variance, two factors explained 97% and three factors explained 100%. The third factor added little to the explained variance. After rotation, only one subtest had a high loading on the third factor. As a result, further analyses were based on a two-factor solution. In the first part of table 5.9 the results of the Principal Components Analysis for the entire norm group and for the three age groups are presented. In the entire norm group, 60% of the total variance is explained by the first two components. The percentage increases slightly, to 64%, in the age groups. The total variance includes the subtest-specific reliable variance and the error of measurement variance of the subtests. Therefore, the percentages of explained variance are lower than for the minimum rank factor analysis, which determines which part of the common variance is explained. In the entire norm group the loadings on the rotated components showed a clear distinction between the performance subtests (Patterns, Mosaics and Puzzles) and the reasoning subtests (Situations, Categories and Analogies). This distinction was also seen in the middle age group. In the youngest group, however, Patterns had an equally high loading on both components, whereas in the oldest group, Situations, like the performance tests, had its highest loading on the first component.
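The size of the first principal component follows from the subtest intercorrelations: for a uniform correlation matrix its eigenvalue is 1 + (n−1)ρ. With six subtests and the mean correlation of .36 this reproduces the entire-group values of table 5.9 (eigenvalue 2.8, 47% of the total variance). This is a sketch under the uniform-correlation simplification, not the actual analysis.

```python
import numpy as np

# correlation matrix of six subtests with a uniform correlation of .36
R = np.full((6, 6), 0.36)
np.fill_diagonal(R, 1.0)

eig = np.linalg.eigvalsh(R)[::-1]    # eigenvalues, largest first
print(round(eig[0], 1))              # → 2.8
print(round(100 * eig[0] / 6))       # → 47 (% of total variance)
```

The remaining five eigenvalues of a uniform matrix are all equal (1 − ρ = .64 here); the distinct second component of .8 in table 5.9 reflects the fact that the real correlations cluster into a performance and a reasoning group.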
To determine how important the differences in loadings between the three age groups were, a Simultaneous Components Analysis was carried out on these data sets (Millsap & Meredith, 1988; Kiers & Ten Berge, 1989). This was done to examine whether a uniform solution of component weights explained (substantially) less of the variance than the solutions that were optimal for the separate age groups. The analysis with the SCA program (Kiers, 1990) showed that this was not the case: the uniform solution over the three age groups explained 61.1% of the variance and the separate optimal solutions explained 61.4% of the variance. Also important was the fact that the simple weights, being 1 or 0 (depending on the scale to which the subtest belongs), were almost as effective as the optimal uniform solution. Using simple weights, as is done in the construction of the Performance Scale and the Reasoning Scale, the percentage of explained variance was 60.8%.

Table 5.9 Results of the Principal Components Analysis in the Various Age and Research Groups

Eigenvalue and Percentage of Explained Variance of the First Two Principal Components
        2-7 years    2-3 years    4-5 years    6-7 years
F1      2.8   47%    2.7   45%    2.9   48%    3.0   50%
F2       .8   13%     .8   14%     .8   14%     .8   14%

Loadings on the First Two Varimax-Rotated Components
        2-7 years    2-3 years    4-5 years    6-7 years
        F1    F2     F1    F2     F1    F2     F1    F2
Pat    .72   .29    .44   .43    .82   .23    .69   .37
Mos    .75   .29    .72   .30    .78   .31    .79   .23
Puz    .80   .12    .85   .07    .78   .19    .79   .07
Sit    .35   .59    .30   .65    .22   .68    .65   .29
Cat    .17   .80    .25   .78    .20   .74    .13   .88
Ana    .18   .75    .05   .78    .21   .70    .35   .70

        Boys         Girls        low SES      high SES
        F1    F2     F1    F2     F1    F2     F1    F2
Pat    .84   .13    .71   .31    .81   .15    .62   .39
Mos    .79   .28    .72   .33    .76   .24    .70   .31
Puz    .59   .38    .84   .07    .76   .10    .84   .08
Sit    .24   .70    .39   .55    .36   .59    .23   .71
Cat    .16   .79    .16   .81    .00   .88    .12   .84
Ana    .25   .66    .17   .75    .46   .46    .33   .64

        Immigrant    Tested outside    Gen./Perv. Dev.    Speech/Language/
                     the Netherlands   Disorder           Hearing Disorder
        F1    F2     F1    F2          F1    F2           F1    F2
Pat    .90   .03    .85   .28         .80   .37          .78   .28
Mos    .71   .39    .82   .36         .80   .33          .79   .26
Puz    .52   .34    .75   .39         .82   .24          .82   .18
Sit    .10   .87    .30   .78         .42   .66          .57   .32
Cat    .22   .72    .41   .71         .29   .80          .28   .79
Ana    .38   .60    .28   .80         .23   .78          .24   .83
In the second part of table 5.9, the loadings on the first two components are shown for different samples of the norm group. These are the boys (N=561), the girls (N=563), and the children whose parents had either a low (N=233) or a high SES level (N=202). The SES level and its correlation with the test performances are described in section 6.6. In the four groups the loadings of the subtests are consistent with a distinction between performance and reasoning tests, with one exception: the loading of Analogies was the same for both components for the children with a low SES level. The last part of table 5.9 presents the component loadings for a number of groups who were not, or only partially, tested in the context of the standardization research. The first group consisted of immigrant children (N=118). These were children who lived in the Netherlands and whose parents were both born abroad. About two thirds of this group was tested in the context of the standardization research. The remaining one third was tested at primary schools in the context of the validation research (see chapter 8). The second group consisted of children who were tested in other countries (N=440). The research was conducted in Australia, the United States of America and Great Britain, mainly with children without specific problems or handicaps, although some children with impaired hearing, bilingual children and children with a learning handicap were included (see section 9.5, 9.6 and 9.7). The third and fourth groups consisted of children with specific problems and handicaps, who were examined in the Netherlands in the context of the validation of the test (see chapter 7). The third group consisted of children with a general developmental delay and children with a pervasive developmental disorder (N=328). The fourth group consisted of children with a language/speech disorder, impaired hearing and/or deaf children (N=346). 
In these four groups, with one exception, the loadings on the first two rotated components corresponded to the distinction between performance and reasoning tests. In the group of children with language/speech disorders and/or impaired hearing, the subtest Situations had its highest loading on the first, performance component. The distinction made by the SON-R 2½-7 between the Performance Scale and the Reasoning Scale is supported to a large extent by these results in very different groups. Though the reliability of the difference between scores on the two scales is moderate, this distinction is the most relevant one for the intra-individual interpretation of the test results. The empirical validity of the distinction between the Performance Scale and the Reasoning Scale will be discussed in section 9.9.
5.5 STABILITY OF THE TEST SCORES

Correlations and means
The SON-R 2½-7 was administered a second time to a sample of 141 children who had participated in the standardization research. The mean interval between administrations was 3.5 months, with a standard deviation of 21 days. The age of the children varied from 2;3 to 7;4 years. The mean age at the first administration was 4;6 years with a standard deviation of 1;5 years. The number of boys and girls was almost equal. The correlations between the scores at each administration, and the mean and standard deviation of the scores, are presented in table 5.10. If the standard deviation of the scores of the first administration differed from the standard deviation in the norm population, the correlations were corrected (see Guilford & Fruchter, 1978). The test-retest correlation for the IQ score was .79. For the Performance Scale and the Reasoning Scale it was .74 and .69 respectively, and for the subtests .57 on average. The stability was relatively high for Mosaics and Categories (both .64) and relatively low for Situations and Analogies (.48 and .49 respectively). The test-retest correlations for all the test scores are clearly lower than the reliability based on internal consistency. This indicates that changes in performance occur which cannot be attributed to errors of measurement. In chapter 10 the significance of this will be discussed. Performances on all subtests were, on average, better during the second administration. The increase in standardized scores (both times based on the exact age) varied from .5 (Analogies) to
Table 5.10 Test-Retest Results with the SON-R 2½-7 (N=141)

               r     Admin. I       Admin. II      Difference
                     Mean   (SD)    Mean   (SD)
Patterns      .56    10.2   (3.0)   10.8   (2.6)       .6
Mosaics       .64    10.6   (2.8)   11.6   (3.1)      1.0
Puzzles       .60    10.2   (2.9)   11.4   (2.8)      1.1
Situations    .48    10.4   (2.5)   11.6   (3.1)      1.2
Categories    .64    10.5   (2.8)   11.2   (2.9)       .7
Analogies     .49    10.6   (2.9)   11.1   (3.0)       .5
SON-PS        .74   102.5  (14.3)  107.9  (14.3)      5.5
SON-RS        .69   103.5  (13.7)  108.7  (15.2)      5.2
SON-IQ        .79   103.4  (13.7)  109.4  (14.7)      6.0

Note: correlations have been corrected for the variance in the first administration.
1.2 (Situations). The scores on the Performance Scale and the Reasoning Scale increased by more than 5 points. The IQ score increased, on average, by 6 points. All differences in mean scores were significant at the 1% level, except for the subtests Patterns and Analogies. A distinction was made between the children who were younger than 4;6 years (mean age 3;4 years; N=67) at the first administration, and children who were older (mean age 5;7 years; N=74). In the younger group the test-retest correlation for the IQ was .78, in the older group .81. The correlation for the Reasoning Scale decreased slightly with age (from .71 to .69). For the Performance Scale it increased clearly (from .65 to .80). The increase in the mean IQ in both groups was practically equal.
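The note to table 5.10 refers to a correction of the correlations for the variance of the first administration (Guilford & Fruchter, 1978). One standard correction of this kind rescales the correlation by the ratio of the population and sample standard deviations; whether this exact formula was used for table 5.10 is an assumption, so the sketch below is illustrative only.

```python
import math

def correct_for_variance(r, sd_sample, sd_population):
    # classic correction of a correlation for a restricted (or inflated)
    # sample standard deviation, rescaling by k = sd_population / sd_sample
    k = sd_population / sd_sample
    return r * k / math.sqrt(1.0 - r * r + r * r * k * k)

# a sample SD of 13.7 on a scale with population SD 15 slightly raises r
print(round(correct_for_variance(0.77, 13.7, 15.0), 2))  # → 0.8
```

When the sample is less variable than the norm population (k > 1), the corrected correlation is somewhat higher than the observed one; with k = 1 the formula leaves the correlation unchanged.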
Profile analysis
A profile analysis was carried out to determine the meaning of the intra-individual differences between the subtest scores of a single subject. One of the characteristics of the profile is the dispersion of the scores. This was calculated as the standard deviation of the six scores (the square root of the mean square of the deviations of the six subtests from the individual mean). In the entire norm group the mean of the dispersion was 2.0. For 24% of the children, the intra-individual dispersion was 2.5 or higher, and for 9% the dispersion was 3.0 or higher. The mean individual dispersion for the 141 children who were tested twice with the SON-R 2½-7 was 2.0 on both occasions. Remarkably, the correlation between the dispersion on the first and second administration was weak (.17) and not significant. Another important characteristic of the profile is the relative position of the subtest scores. To determine whether this was stable, the six subtest scores from the first administration were correlated, for each child, with the six scores of the second administration. The mean correlation was .32. The strength of the correlation depends very much on the dispersion of the scores on the first administration. Clearly, if the differences are small, they are determined largely by errors of measurement and are therefore unstable. Where the dispersion on the first administration was less than 2.0 (N=69), the mean correlation was .22; where the dispersion was 2.0 to 3.0, the mean correlation was .38; and for the twelve children who had a dispersion of 3.0 or more, the mean correlation was .61. This indicates that the differences between the subtest scores must be substantial before we can conclude that they will remain stable over a period of some months. When using the computer program, the dispersion is calculated and printed.
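The two profile statistics described above can be sketched as follows. Applied to the scores of example A in table 5.11, the sketch reproduces the dispersion of 1.1 on the first administration and the profile correlation of –.18.

```python
import math

def dispersion(scores):
    # profile dispersion: root mean square deviation of the subtest
    # scores from the child's own mean (note: divided by n, not n-1)
    m = sum(scores) / len(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / len(scores))

def profile_correlation(first, second):
    # Pearson correlation between the two six-score profiles of one child
    n = len(first)
    ma, mb = sum(first) / n, sum(second) / n
    num = sum((a - ma) * (b - mb) for a, b in zip(first, second))
    ssa = sum((a - ma) ** 2 for a in first)
    ssb = sum((b - mb) ** 2 for b in second)
    return num / math.sqrt(ssa * ssb)

admin1 = [11, 8, 11, 9, 9, 9]     # example A, first administration
admin2 = [14, 11, 7, 14, 12, 10]  # example A, second administration
print(round(dispersion(admin1), 1))                   # → 1.1
print(round(profile_correlation(admin1, admin2), 2))  # → -0.18
```

The division by n rather than n−1 in the dispersion follows the definition in the text ("the square root of the mean square of the deviations").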
The difference between the scores on the Performance Scale and the Reasoning Scale in the first administration correlated .46 with the difference between the two scores in the second administration. For the children younger than 4;6 years, the correlation was .43 and for the older children it was .50.
Table 5.11 Examples of Test Scores from Repeated Test Administrations (I and II)

               Example A    Example B    Example C    Example D
                I     II     I     II     I     II     I     II
SON-IQ          97   108    109   116    106   110    121   120
SON-PS         100   105    108   110    100    94    122   123
SON-RS          93   113    107   118    113   126    116   113
Patterns        11    14     12    10     11     9     14    12
Mosaics          8    11     13    17      9     9      9    12
Puzzles         11     7      9     8     10     9     18    17
Situations       9    14     13    13      8    13     10    12
Categories       9    12     12    11     14    16     15    13
Analogies        9    10      8    14     14    13     12    11
Dispersion     1.1   2.4    2.0   2.9    2.3   2.7    3.1   2.0
Correlation      –.18         .32          .56          .78
As an example, the scores of a few children on the two administrations are presented in table 5.11. The dispersion and the correlation between the six scores are also shown. The examples illustrate that important changes can take place in the intra-individual order of the subtest scores.
6 RELATIONSHIPS WITH OTHER VARIABLES
In this chapter the relationship is discussed between test performance and a number of variables that are important in order to judge the validity of the test. The analyses are based on the results of the standardization research. Other tests were also administered to a large number of the children in order to validate the SON-R 2½-7. The results are described in chapter 9. A comparison is made in section 9.11 between the SON-R 2½-7 and other tests, with respect to their relationship with a number of variables that are discussed in this chapter, i.e. SES index, parents’ country of birth, evaluation by the examiner, and the school’s evaluation of language skills and intelligence.
6.1 DURATION OF TEST ADMINISTRATION

In general the test was administered in one session, with short breaks if necessary. For 9% of the children a break longer than a quarter of an hour was taken, usually due to school recess or the end of the school day. In these cases the second part of the test was administered later in the day or on another day. The mean IQ score of the children to whom the test was administered in two parts did not deviate from the mean of the children to whom the test was administered in one session. The duration of administration (including short breaks) had a mean of 52 minutes with a standard deviation of 12 minutes. For two-year-olds the duration of administration was shorter: in the age group of 2;3 years the mean duration was 38 minutes and in the age group of 2;9 years it was 46 minutes. From three years onwards the mean was fairly constant at 54 minutes. In table 6.1 the frequency distribution of the duration of administration is presented for the total norm group and separately for the two-year-olds and the older children. There was a significant positive correlation between the duration of administration and the IQ score. This relationship was especially strong for the two-year-olds (r=.52); for the older children the correlation was .34. The relation could be explained by the fact that children within each group who performed well completed more items on average.

Table 6.1 Duration of the Test Administration

Duration of the complete test (N=1124)
                 2-7 years   2 yrs   3-7 yrs
up to 40 min        16%       49%       9%
41-50 min           32%       30%      32%
51-60 min           32%       17%      36%
61-70 min           14%        3%      16%
over 70 min          6%        1%       7%

Mean duration per subtest in minutes (N=1014)
              Mean    (SD)
Patterns       7.0    (3.0)
Mosaics       10.3    (3.9)
Puzzles        8.5    (3.1)
Situations     6.3    (2.3)
Categories     8.4    (3.3)
Analogies      8.8    (2.8)
Total         49.2   (10.7)
The duration of the administration of the separate subtests was known for 1014 children (table 6.1). Situations had the shortest duration of administration with a mean of 6.3 minutes and also the narrowest dispersion in duration. Mosaics had the longest duration of administration with a mean of 10.3 minutes, and the widest dispersion. The duration of administration was also recorded for children who participated in other validation research projects (see chapter 7). The mean duration (including short breaks) for these children, who had varying problems and handicaps in cognitive development and communication, was 57 minutes. This was 5 minutes longer than for the children in the standardization research. The duration of administration was relatively short for children with a general developmental delay (a mean of 53 minutes) and relatively long for deaf children (a mean of 66 minutes).
6.2 TIME OF TEST ADMINISTRATION

The influence of the time of administration on test results was examined in the standardization research. The largest part of the norm group was tested during the first twelve weeks of the school year 1993/94. For these 1065 children, the relationship was examined, using analysis of variance, between the IQ scores and the period of research (four consecutive periods of three weeks), the day of the week on which the test was administered, and the time of day at which the administration was started. In table 6.2 the mean IQ scores for each category of these three variables are presented as deviations from the total mean. Each variable was controlled for the effect of the other two variables. The largest differences in mean IQ scores were found for the variable starting time, but the effect was not significant (F[6,997]=1.26; p=.27).

Table 6.2 Relationship of the IQ Scores with the Time of Administration (N=1065)

Starting Time     N    dev.
8-9 a.m.        108    1.9
9-10 a.m.       231     .3
10-11 a.m.      240    –.4
11-1 p.m.       139   –1.9
1-2 p.m.        162    –.6
2-3 p.m.        115    2.4
after 3 p.m.     70   –1.2

Day of Week       N    dev.
Monday          198    –.8
Tuesday         287    –.1
Wednesday       162    –.5
Thursday        262    1.2
Friday          156    –.3

Period            N    dev.
I               305     .6
II              302    –.2
III             178     .6
IV              280    –.8
6.3 EXAMINER INFLUENCE

Eleven examiners tested most of the children in the standardization research. The scores of the different examiners were compared, while controlling for the sex of the children, the percentage of immigrant children (children whose parents had both been born abroad) and the SES index. In table 6.3 the deviations from the total mean are shown for the IQ score, after controlling for the other variables. The beta coefficient, which indicates how strong the association is after controlling for the other variables, was .18 and clearly significant (F[10,1059]=4.09; p …).

[Figure: distributions of the IQ scores for the groups Deaf, Hearing Impaired, Speech/Language Disorder, Pervasive Developmental Disorder and General Developmental Delay, on an IQ scale from 50 to 120.]
[...] differences between the groups. The children with a general developmental delay and the children with pervasive developmental disorders had low performance levels. Deaf children were very similar to children in primary education. The children with impaired hearing and the children with a speech or language disorder took an intermediate position. Besides these differences, the figure also shows a large overlap in the distributions of the groups. The mean scores of the children with a developmental disorder or delay were low, but in both groups a good 10% of the children had a score higher than 100, the mean of the norm population. In contrast, 10% of the children in these groups had a score of around 50, which means that they performed at such a low level that the test did not differentiate further. In all the groups, children performed relatively poorly on the subtests Categories and (with the exception of the deaf) Patterns. In all the groups, children performed relatively well on Puzzles, Situations and (with the exception of the deaf) Analogies. The results on Mosaics varied (see table 7.3). When evaluating the differences between the groups, the manner in which the groups were selected must be taken into account. Most of the children examined attended special schools and institutes that had strict selection procedures for admittance. Children who had, for example, a pervasive developmental disorder or impaired hearing, but who were in regular education, were strongly under-represented; in their case, a cognitive delay is less likely to occur. On the other hand, autistic children in daycare centers for the mentally disabled were not included in the research. The results are only representative for the children at the kinds of schools and institutes listed above, and then only to a limited extent, due to the small number of schools and institutes involved.
No statement can be made on the basis of this research about ‘the’ intelligence of autistic children, or ‘the’ intelligence of children with impaired hearing. Only in the
case of the deaf children was an effort made to obtain a representative picture of the intelligence level of (native Dutch) deaf children who are not multiply handicapped.
7.3 RELATIONSHIP WITH BACKGROUND VARIABLES

An analysis of variance with the IQ score as dependent variable was carried out for a number of background variables (sex, age, SES level and immigrant status), controlling for research group. No significant interaction effect with the research groups was found for any of the variables. In table 7.4 the mean IQ scores are presented as deviations from the total mean after controlling for research group.

Few differences were found between boys and girls (p=.64), or among the three age groups of two and three years, four and five years, and six and seven years (p=.59). A relationship with the SES level of the parents was found (p=.02), but it was much weaker than in the norm group. The difference between the native Dutch children, the immigrant children and the children with a mixed background was not significant (p=.17). However, background characteristics such as sex and SES level played an indirect role in the referral to the special schools, because of the relative frequency of developmental problems among boys and among children with a low SES level.

Table 7.4 Relationship of the IQ Scores with Background Variables

Sex                   N     Dev
  Boys              470      .2
  Girls             204     –.5

Age
  2-3 yrs           121     1.3
  4-5 yrs           354     –.5
  6-7 yrs           199      .2

SES Level
  Low               172    –2.7
  Below average     233     –.5
  Above average     115     3.3
  High               77     2.6

Country of birth
  Native Dutch      538     –.2
  Mixed              32     5.0
  Immigrant          23    –2.8

(Dev = deviation of the mean IQ score from the total mean, after controlling for research group.)
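The idea behind the deviations in table 7.4 can be illustrated with a small sketch: express each child's IQ as a deviation from the mean of his or her research group, and then average those deviations per background category. The actual analysis was an analysis of variance; this residualization is a simplified stand-in, and the data and field names below are invented for illustration.

```python
# A minimal sketch (invented toy data, hypothetical field names) of mean IQ
# deviations per background category after removing research-group means.
from collections import defaultdict

def adjusted_deviations(records, factor):
    """Mean IQ deviation per category of `factor`, after subtracting each
    child's research-group mean (a simple way of 'controlling for' group)."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r["iq"])
    group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}

    by_cat = defaultdict(list)
    for r in records:
        by_cat[r[factor]].append(r["iq"] - group_mean[r["group"]])
    return {c: sum(v) / len(v) for c, v in by_cat.items()}

# Invented toy data, not taken from the study:
children = [
    {"iq": 82, "group": "deaf", "sex": "boy"},
    {"iq": 78, "group": "deaf", "sex": "girl"},
    {"iq": 65, "group": "delay", "sex": "boy"},
    {"iq": 63, "group": "delay", "sex": "girl"},
]
dev = adjusted_deviations(children, "sex")  # {'boy': 1.5, 'girl': -1.5}
```

Without the group adjustment, a factor that is unevenly distributed over the research groups would pick up the large between-group IQ differences; the residualization removes that confound.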
7.4 DIAGNOSTIC DATA

For a large number of pupils from the schools for special education with a department for children at risk in their development, and from the medical daycare centers, diagnostic data had been gathered during the admittance procedure to the school or daycare center in question. The data refer to the home situation, the existence of emotional problems, behavioral problems and communicative handicaps, and also include an evaluation of motor, language and cognitive development. Complete data sets were available for 238 children: 93 children from a department for children at risk and 145 children from a medical daycare center. Twenty-four of these children had a pervasive developmental disorder. The mean IQ score of the entire group of 238 children was 80.9 with a standard deviation of 17.1.

In table 7.5, the distribution of the diagnostic variables is presented together with the mean IQ scores for each category. Problems and delays appear to be present on all the diagnostic variables. The most favorable evaluations were found for communicative handicaps (60% 'none') and motor development (40% 'normal'). Serious behavioral problems and large delays in language development were mentioned most frequently. With respect to the evaluation of cognitive development, nearly half the children had a small delay and 20% had a large delay.

The correlations between the IQ scores and the evaluation of the home situation, and of emotional and behavioral problems, were weak and not significant. The relationships with the other diagnostic variables were significant at the 1% level. The correlation with communicative
Table 7.5 Reasons for Referral of Children at Schools for Special Education and Medical Daycare Centers for Preschoolers (N=238), with mean IQ scores

                           Normal          Fairly Unfav.    Very Unfav.
                           Pct   Mean      Pct   Mean       Pct   Mean
Home situation             29%   80.1      48%   80.6       23%   82.9

                           None            Light            Severe
                           Pct   Mean      Pct   Mean       Pct   Mean
Emotional problems         17%   79.9      59%   83.3       24%   75.9
Behavioral problems        14%   80.5      51%   81.5       35%   80.3
Communicative handicap     60%   83.7      30%   79.6       10%   67.7

                           Normal          Small Delay      Large Delay
                           Pct   Mean      Pct   Mean       Pct   Mean
Motor development          40%   91.8      48%   73.4       12%   74.1
Language development       24%   93.6      44%   80.2       32%   72.2
Cognitive development      32%   95.6      48%   78.1       20%   64.3
handicaps was -.26. The correlation with both motor and language development was .46. The SON-IQ correlated most strongly, .66, with the evaluation of cognitive development. The mean IQ score of the children whose cognitive development had been evaluated as ‘normal’ was 95.6, whereas the mean IQ score of the children with a large delay was more than 30 points lower, i.e., 64.3. With a stepwise multiple regression the correlation with the IQ increased slightly, from .66 to .67, when motor development was also taken into account. The Performance Scale and the Reasoning Scale both correlated strongly with the evaluation of cognitive development (.59 and .61). The Performance Scale had a stronger correlation with the evaluation of motor development (r=.43) than with the evaluation of language development (r=.40). The Reasoning Scale had a higher correlation with the evaluation of language development (r=.44) than with the evaluation of motor development (r=.39).
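The small gain from the stepwise regression (.66 to .67) can be reproduced with the standard two-predictor formula for the multiple correlation. The correlations .66 (rated cognitive development) and .46 (rated motor development) are reported above; the intercorrelation of the two ratings is not reported, so the value .60 used below is an assumption for illustration only.

```python
import math

def multiple_R(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors, computed from
    the three pairwise correlations (standard two-predictor formula)."""
    r_sq = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_sq)

# .66 and .46 are reported in the text; the intercorrelation .60 of the two
# ratings is NOT reported and is assumed here for illustration.
R = multiple_R(0.66, 0.46, 0.60)  # slightly above .66, as in the text
```

With two highly intercorrelated predictors, the second predictor adds little unique variance, which is why the multiple correlation rises only marginally above the best single correlation.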
7.5 EVALUATION BY THE EXAMINER

As in the standardization research, all the children in the special groups were rated by the examiner on motivation, concentration, cooperation, and comprehension of the directions, following the test administration. In table 7.6 the ratings and the mean IQ scores are presented for each group. In a small number of cases (approximately 2%), motivation, cooperation, or comprehension of the directions was evaluated as 'poor'. An exception was the group with a pervasive developmental disorder, in which cooperation was evaluated as 'poor' for 8% of the children. Concentration was evaluated as 'poor' for 5% of the children; among the deaf children, however, this was 1%.

Concentration and, to a lesser degree, motivation were frequently rated as 'mediocre' or 'varying'. Cooperation and comprehension of the directions were most frequently rated as 'good'. The deaf children were evaluated most positively. On average, the children with a pervasive developmental disorder had the lowest evaluations with respect to motivation, concentration, cooperation and comprehension of the directions.
Table 7.6 Relationship between IQ and Evaluation by the Examiner

                      General      Pervasive    Speech/      Hearing
                      developm.    developm.    language     impaired     Deaf
                      delay        disorder     disorder     (N=73)       (N=94)
                      (N=238)      (N=90)       (N=179)
                      Pct   Mean   Pct   Mean   Pct   Mean   Pct   Mean   Pct   Mean

Motivation
  Poor                 2%   63.7    4%   54.5    2%   60.7    5%   76.3    –
  Mediocre/Varying    33%   73.4   43%   69.7   22%   72.2   39%   78.1   21%   92.6
  Good                65%   84.4   52%   82.8   75%   81.3   56%   90.6   79%   99.3
  Correlation              .33*        .37*         .33*         .41*         .19

Concentration
  Poor                 6%   62.9         66.3    6%   71.5    2%   74.0    1%   68
  Mediocre/Varying    44%   78.1         76.5   38%   84.2   12%   84.7   26%   92.2
  Good                50%   84.1         89.4   56%   96.7   86%   99.1   73%  100.3
  Correlation              .28*        .39*         .44*         .49*         .31*

Cooperation
  Poor                 1%   61.7    8%   61.3    3%   72.3    1%   86      –
  Mediocre/Varying    24%   78.6   14%   72.0   30%   80.7   12%   78.7   13%   86.1
  Good                75%   81.1   78%   85.1   67%   93.4   86%   94.3   87%   99.6
  Correlation              .11         .32*         .32*         .28*         .32*

Comprehension of directions
  Poor                 3%   55.9    2%   50.0    3%   79.2    3%   63.5    1%   83
  Mediocre/Varying    22%   75.0   28%   71.3   15%   76.6   20%   84.9    9%   90.5
  Good                75%   82.9   70%   82.0   82%   89.8   77%   95.2   90%   98.8
  Correlation              .31*        .34*         .28*         .38*         .19

*: p < .01 with one-tailed testing
Sixty-five percent of the entire group of 674 children were rated ‘good’ on all four aspects, or on three aspects, with the fourth rated as ‘mediocre’/’varying’. Eleven percent had a mean rating of ‘mediocre’/’varying’, or lower. In comparison to the standardization research, the evaluations of the children from these special groups were most similar to the evaluations of children two and three years of age. However, children from special groups received much higher ratings for comprehension of the directions than did the two- and three-year-olds in the standardization research. The ratings of motivation, cooperation, and comprehension of the directions correlated significantly with the IQ score in most groups. The correlations were strongest in the group with impaired hearing and for the evaluation of concentration. The correlations were substantially stronger than in the norm group. The main cause for this is that a negative evaluation was more frequently given in the special groups.
7.6 EVALUATION BY INSTITUTE OR SCHOOL STAFF

For a large number of children tested at schools and institutes, a staff member closely concerned with the child evaluated the following four aspects: intelligence, language development, fine motor skills and communicative orientation (the extent to which the child seeks and maintains contact with others in his or her surroundings). Intelligence and language development were rated on a five-point scale running from 'low' via 'intermediate' to 'high'; motor skills and communication were rated from 'low' via 'reasonable' to 'high'.

In general the evaluation was carried out after the schools and institutes had received the provisional results on the test. The possibility that the results on the SON-R 2½-7 influenced the evaluation can therefore not be excluded. However, many other test and research data on the children were available at the schools, so it remains questionable whether the SON-R 2½-7 results contributed much to the evaluation. Nor is it known whether the person making the evaluation was acquainted with the results. Because a certain amount of contamination may have occurred, the results presented in this section must be interpreted with care.

Mean evaluations and their correlations with the test scores are presented in table 7.7 for two broad groups. The first group consisted of children with a general developmental delay (N=222) and children with a pervasive developmental disorder (N=46). The second group consisted of children with a speech or language disorder (N=105), children with impaired hearing (N=42) and deaf children (N=94). In the group of children with a general developmental delay or a pervasive developmental disorder, the subjective evaluation of intelligence and language development was generally low.
In the speech/language/hearing-impaired group, the children were given a relatively low evaluation regarding their language development; on all other aspects, the mean evaluation was higher than in the first group. The mean IQ score in the first group was 80.9 with a standard deviation of 17.1; the correlation with the evaluation of intelligence was .68. In the second group, where the dispersion of

Table 7.7 Correlations Between Test Scores and Evaluation by Institute or School Staff Member

             General developmental delay /        Speech/language disorder /
             Pervasive developm. disorder         Hearing impaired / Deaf
             (N=268)                              (N=241)
             Intell.  Language  Motor   Commun.   Intell.  Language  Motor   Commun.
Mean (SD)    2.4 (.9) 2.3 (.8)  2.9 (.9) 3.0 (.9) 2.9 (.7) 2.1 (1.0) 3.2 (1.1) 3.5 (1.0)

Patterns       .56      .42      .44      .24       .53      .35      .34      .21
Mosaics        .60      .41      .36      .15       .51      .27      .21      .21
Puzzles        .37      .19      .36      .19       .45      .22      .24      .18
Situations     .50      .39      .25      .19       .37      .28      .20      .10
Categories     .58      .45      .31      .28       .45      .12      .22      .20
Analogies      .43      .27      .36      .15       .31      .10      .17      .06

SON-PS         .59      .40      .45      .22       .59      .33      .32      .24
SON-RS         .64      .47      .38      .26       .50      .22      .26      .15
SON-IQ         .68      .48      .46      .27       .61      .31      .32      .23

– correlations > .14 are significant at the 1% level
both IQ scores and the evaluation of intelligence was narrower, the correlation was also weaker, i.e., .61.

In the group of children with a developmental delay, the correlation of the Reasoning Scale with the evaluation of intelligence was higher than that of the Performance Scale. The correlations with the subtests Puzzles and Analogies were relatively weak. In the group of children with speech/language/hearing disorders, the Performance Scale had the highest correlation with the evaluation of intelligence; Situations and Analogies had the lowest correlations. Patterns and Mosaics had strong correlations with the evaluation of intelligence in both groups.

Reasonably strong correlations with the evaluation of language development and fine motor development were also found in both groups. Patterns had the strongest correlation with the evaluation of motor skills, and the Performance Scale correlated more strongly with motor skills than the Reasoning Scale did. The correlations between the test scores and the evaluation of the child's communicative orientation were positive but weak.

Using a stepwise multiple regression analysis, the extent of the influence of the other evaluations on the correlation between the evaluation of intelligence and the SON-IQ was examined. In both groups the correlation increased when the evaluation of motor skills was included: in the first group from .68 to .74, and in the second group from .61 to .65.
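The remark that a narrower dispersion goes together with a weaker correlation can be illustrated with the classical range-restriction formula (Thorndike's Case II, applied in the direction full range to restricted range). The correlation .68 is reported above; the standard-deviation ratio 0.8 used below is an invented value, since the actual ratio of the dispersions in the two groups is not reported.

```python
import math

def restricted_r(r_full, sd_ratio):
    """Correlation after the predictor's spread shrinks to `sd_ratio` times
    its original standard deviation (classical range-restriction formula)."""
    ru = r_full * sd_ratio
    return ru / math.sqrt(1 - r_full ** 2 + ru ** 2)

# .68 is the correlation reported for the first group; the ratio 0.8 is an
# invented value for illustration only.
r_narrow = restricted_r(0.68, 0.8)  # weaker than .68
```

The formula shows the qualitative point only: with an unchanged underlying relationship, restricting the spread of one variable necessarily lowers the observed correlation.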
7.7 EXAMINER EFFECTS

The evaluation of examiner effects was much more difficult in the special groups than in the standardization research, because large differences existed between the groups and because most or all of the children tested by an examiner belonged to one specific group. Furthermore, the number of children tested by each examiner was much smaller than in the standardization research.

Using an analysis of variance, the differences in IQ scores between the examiners were tested, controlling for the school evaluation of intelligence and fine motor skills and for the SES index. The comparison was limited to examiners who had tested at least 20 children; this concerned 11 examiners and 446 children. The main examiner effect, after controlling for the other variables, was significant (F[10,426]=2.81, p