6.2. Evaluation of procedures

6.2.1. Consistency

Consistency and reliability refer to repeatability. An experimental finding is considered reliable if it can be replicated at a later time under similar conditions (Cushman & Rosenberg 1991). Technically, reliability is the degree to which a set of measurements is free from random error (Sander & McCormick 1993).

Test/retest reliability correlates the measure obtained from a subject during the first (test) period with that obtained during a subsequent (retest) period (Drury 1995). Split-half reliability correlates the two halves of the measurements made during a single period. Test/retest reliability was investigated in the task-surface height evaluation (paper II) and the chair criteria weighting (paper III). Split-half reliability was investigated with conjoint measures (papers III and IV). The reliability of the procedures can be considered adequate.

In the step experiment (paper I), reliability was assessed through repeated trials on steps and stairs, which gave consistent results. In the paired comparison of chairs as proposed by Mitchell (papers IV and V), both inter-rater and intra-rater consistency were good. Consistency was also good in the ranking of phone types (paper VI).
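As an illustration of how consistency between raters can be quantified for ranking data, the following sketch computes Kendall's coefficient of concordance (W). This statistic is chosen here only as a familiar example; it is an assumption and not necessarily the method used in the papers. The function name and the input layout are hypothetical, and the formula shown is the version without a correction for tied ranks.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings holds one list per rater; each list gives the ranks 1..n
    assigned to the same n items in the same order.
    Returns W between 0 (no agreement) and 1 (complete agreement).
    """
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two raters")
    n = len(rankings[0])
    if n < 2:
        raise ValueError("need at least two items")
    for ranks in rankings:
        if sorted(ranks) != list(range(1, n + 1)):
            raise ValueError("each rater must use the ranks 1..n exactly once")

    # Sum of the ranks that each item received across all raters.
    rank_sums = [sum(ranks[i] for ranks in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Squared deviations of the rank sums from their mean.
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

For intra-rater consistency, the same function could be applied to repeated rankings given by a single rater on different occasions, treating each occasion as one row.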

Table 7 shows the different statistical methods used in the investigation of inter-rater and intra-rater reliability. One indicator of reliability and a sign of validity is the consistency of the results between conjoint analysis and Mitchell’s method in paper IV.

6.2.2. Validity

To improve the external validity of experiments and tests, it is necessary to create a laboratory environment essentially indistinguishable from the environment in which the product is to be used (Cushman & Rosenberg 1991). The importance of this “ecological validity” (Jordan & Thomas 1994, Bogner 1998, Macleod 1996) may vary, depending on the extent to which the product being tested is affected by factors in the context of its use. In this study, the context of use supported validity. For products such as health care equipment, which are subject to myriad factors in actual use, ecological validity is essential for obtaining findings that apply to the typical use of that equipment. The evaluations of a telemedicine system in this study have ecological validity, since the experiments were conducted in the actual setting of a health care centre (paper VII). The same elements (real users, real work goals and realistic settings) were involved in the home simulator evaluations (papers II, III, IV and V). The step experiment (paper I) used partly a real environment (tractor steps) and partly a test environment (constructed steps).

For the task-surface experiment (paper II), validation was performed by an expert. A physiotherapist evaluated the subjects’ performance of tasks typical of daily living on the basis of video recordings. He did not know what had been done earlier, nor did he know the results of the subjects’ preference assessments. The subjects’ height ratings were supported by the expert’s evaluations. In the telephone experiment (paper VI), the subjects’ preferences were supported by expert evaluations of elderly people’s needs (made by geriatric nurses and gerontechnology researchers).

Galer and Page (1996) reported that subjective measures appear to be highly variable and of doubtful validity in human activities. To overcome at least part of this disadvantage, it is necessary to use large numbers of experimental volunteers and/or to train them carefully in the use of a subjective scale, so that the scale is firmly anchored to the subject’s personal experience of the different levels of work done. Repeated measures of perceived effort also help to control variability. Repeated measures were used in the step experiments described in paper I. The number of volunteers involved in the trials was quite large in most substudies. In the substudies that involved fewer participants, additional expert evaluation was used (as described in paper VI). The subjective scales used were simple, and the experimenters elicited the answers in a way that allowed them to check that the participants understood the scale.

Many validity questions were discussed and resolved. To sum up, validity deserves even more attention than could be allocated to it during the present study (cf. 6.3.1).