Experimental ergonomic evaluation with user trials: EEE product development procedures
Chapter 4. Materials and methods
The methods used in the different studies are a result of the author’s and her team’s own choices and experiments. The methods were described briefly in the literature review, and they have been presented in more detail in the original papers. This chapter focuses on the use of the methods in a study context enabling EEE development.
The measurements included direct objective ones with standardised units. For the subjective assessments, various scales were needed. The scaling methods that are commonly used in product design research with short descriptions are listed in Table 5 together with their connections with the cases described in the papers. Ranking, rating, category scaling, paired comparisons and magnitude estimates are unidimensional scaling methods that may be used to construct ordinal scales, interval scales and ratio scales. Multidimensional scaling, on the other hand, provides scale values for each stimulus on several unidimensional scales. The semantic differential is a specialised form of multidimensional scaling that consists of a series of bipolar unidimensional graphic rating scales. (Cushman & Rosenberg 1991)
The choice of the specific scaling method depends on the research objective, the number of stimuli to be scaled, the number of subjects, and the time and money available. Category scales are the most common scaling tools for the quantification of human experiences (Müller 1996). Rating, category scaling, and magnitude estimation are very simple procedures and especially suitable if either the number of stimuli or the number of subjects is very large. Ranking, paired comparisons, semantic differential and multidimensional scaling are more time-consuming procedures, but they may provide more useful information. (Cushman & Rosenberg 1991)
Table 5. Comparison of the scaling methods commonly used in product design research (Cushman & Rosenberg 1991) and the paper in which the method is used in this study.
| Method | Subject’s task | Result | Paper |
|---|---|---|---|
| Ranking | Subject places stimuli in a rank order with respect to some attribute. | Ordinal scale; interval scale | VI |
| Rating | Subject rates each stimulus on a scale of 1 – n with respect to some attribute. | Ordinal scale; interval scale | II, VII |
| Category scaling | Subject places each stimulus in one of n categories, which are arranged with respect to some attribute. | Ordinal scale; interval scale (including category boundaries) | I |
| Paired comparisons | Stimuli are presented in pairs. Subject chooses the stimulus with the greater amount of some attribute. | Interval scale | III, IV, V |
| Magnitude estimation | Subject assigns a number to each stimulus to indicate the amount of some attribute. | Ratio scale | |
| Semantic differential | Subject rates stimuli on n continuous bipolar scales. | Stimulus profiles | II, VII |
| Multidimensional scaling | Subject indicates similarities between stimuli or preferences. | Positions of stimuli in n-dimensional psychological space; preference vector for each subject | |
Fig. 9 shows the different scaling methods and the numbers used for calculations in this study. Likert scales are scales on which participants register their agreement or disagreement with a statement (Rubin 1994).
Measurement with an instrument was applied to the subjects’ anthropometric dimensions: stature and elbow heights (paper II) and sitting height with an adjustable chair (paper IV). The furniture or the product to be used was also measured, including the step height of the tractors used by the farmers (paper I). The height of the steps was chosen according to the recommendations. The same procedure was also applied to the furniture and fixture heights (paper II) and the microwave oven workstation arrangements (paper III). The match between the end-user and the technological product (Pheasant 1996), i.e. the physical fit to the interface, can be, at least in theory, objectively checked. The smaller the difference between the product’s physical features and the corresponding human measures, the better the fit (EEE1, Fig. 10). How this fit was perceived by each subject could be subjectively rated (EEE2) (Roozenburg & Eekels 1995).
Rating was used to evaluate the heights of different furniture and facilities after experiments by the elderly (paper II). Rating was also used to find out the patients’ satisfaction with the video consultation in a real medical consultation (paper VII). The procedure of evaluation with the rating is presented in Fig. 11.
In the task-surface height experiment (paper II), the elderly subjects performed tasks typical of a home environment. The tasks were performed twice, with different task-surface heights in the second trial. After each task, the subject rated the task-surface height. Afterwards, the expert evaluated the heights from the video recordings. In the videophone experiment (paper VII), the patients were in real video consultation with a physician in a health care center and a specialist in a central hospital. Afterwards, the patients or their escorts rated the statements in the questionnaire according to their experiences of the consultation and sent the questionnaire to the experimenter.
For rating purposes, three scales of different types were used. The rating in the videophone experiment (paper VII) was accompanied by a 3-step Likert scale and a 3-step semantic differential. In the task-surface experiment (paper II), a 7-step semantic differential was used. The ratings were made on different scales, depending on the factor to be rated as follows:
- the heights of chairs and fixtures: much too high (1), too high (2), slightly too high (3), suitable (4), slightly too low (5), too low (6) and much too low (7);
- the video consultation statements: not agree (1), cannot say (2) and agree (3); and
- the system of video consultation: bad (1), cannot say (2) and good (3).
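As a minimal sketch of how verbal ratings of this kind can be turned into numbers for calculation, the 7-step height scale can be coded as follows. The ratings in the example are invented for illustration and are not data from paper II; the function name `summarise_ratings` is hypothetical.

```python
from statistics import mean, stdev

# Verbal categories of the 7-step height scale, coded 1..7 as in the text.
HEIGHT_SCALE = {
    "much too high": 1,
    "too high": 2,
    "slightly too high": 3,
    "suitable": 4,
    "slightly too low": 5,
    "too low": 6,
    "much too low": 7,
}

def summarise_ratings(labels):
    """Return (mean, sample standard deviation) of the coded ratings."""
    scores = [HEIGHT_SCALE[label] for label in labels]
    return mean(scores), stdev(scores)

# Hypothetical ratings of one task-surface height by five subjects.
example = ["suitable", "slightly too low", "suitable", "too low", "suitable"]
avg, sd = summarise_ratings(example)
print(f"mean = {avg:.2f}, sd = {sd:.2f}")
```

With this coding, a mean above 4 indicates that the height tended to be judged too low, and a mean below 4 that it tended to be judged too high.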
Ranking was applied to find a suitable telephone for the elderly among four different telephones (paper VI). The ranking was done by the elderly themselves and also by two expert groups: (a) geriatric nurses and (b) gerontechnology researchers (Fig. 12). The elderly subjects were allowed to examine the telephone concepts (three telephones of each type) for as long as they wanted before the judgement. The experts made their ranking according to pictures with descriptions of the telephone types.
In the chair experiment (paper IV), Mitchell’s paired comparison method (1992) was used to weight eight chair criteria (Fig. 13, the results were used later in paper V). In the weighting process, altogether 28 pairs of criteria were compared. The comparison was made easier for the participants by utilising supporting figures for each criterion pair (see paper V Fig. 6). In paper IV, the Mitchell method was continued with an evaluation of three different chair prototypes. The subjects compared two chairs at a time for every criterion and chose the chair that was the best on each criterion.
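The figure of 28 pairs follows from comparing each of the eight criteria with every other one: 8 × 7 / 2 = 28. The sketch below enumerates the pairs and derives simple weights from the share of comparisons each criterion wins. This is a generic paired-comparison calculation, not necessarily the exact computation of Mitchell's method; the criterion labels `C1` to `C8` and the example choices are hypothetical.

```python
from itertools import combinations

def all_pairs(items):
    """Every unordered pair of items, each pair listed once."""
    return list(combinations(items, 2))

def weights_from_choices(items, choices):
    """Weight of an item = its share of all comparisons won.

    `choices` maps each pair (a, b) to the item the subject preferred.
    """
    wins = {item: 0 for item in items}
    for pair, winner in choices.items():
        if winner not in pair:
            raise ValueError(f"{winner!r} is not part of pair {pair!r}")
        wins[winner] += 1
    total = sum(wins.values())
    return {item: wins[item] / total for item in items}

criteria = [f"C{i}" for i in range(1, 9)]  # eight placeholder criteria
pairs = all_pairs(criteria)
print(len(pairs), "pairs")

# Hypothetical subject who always prefers the lower-numbered criterion.
choices = {pair: pair[0] for pair in pairs}
weights = weights_from_choices(criteria, choices)
for name, weight in weights.items():
    print(name, round(weight, 3))
```

In this simple scheme a criterion that is never preferred receives a weight of zero; a practical procedure may treat that case differently.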
Conjoint analysis was used to help in the evaluation of microwave oven workstations by the elderly (paper III). On the basis of the defined features and feature levels, the React Customer software (React!… 1995) created ten conjoint cards, where one profile represented one possible product concept. The cards were a simplification of all the possible combinations of the attribute features and their levels. The simplification was based on a 90% level of confidence calculated by orthogonal arrays and utilised for statistical design of the experiments (SDE). The ten combinations of conjoint cards were simulated with the help of four microwave ovens placed at three different heights. The workstations were evaluated by a representative group of elderly subjects. The subjects were allowed about 20 minutes to get acquainted with all the oven arrangements with the task assignment.
After the user trials, the software calculated the relative importance of each feature and the utility of each feature level. According to these results, the best product combination was attained. The trade-offs of different feature levels and the scores for the possible combination of chairs can also be derived with the software.
In paper IV, the conjoint method was used for chair development by reducing the alternatives from 81 possible chairs to ten, comprising nine actual alternative choices and one reference chair. All ten chairs were made of wood. The elderly tested the chairs, compared each of the nine alternatives with the reference chair and, for each pair compared, divided 100 points between the two chairs. The conjoint analysis was done with the Market Maker software (Market Maker 1996), which is a revised version of the React! customer software used in paper III. Fig. 14 illustrates the procedure used in the microwave oven experiment (paper III).
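The reduction from 81 possible chairs to nine test alternatives can be illustrated with a standard orthogonal array. The sketch assumes that the 81 chairs arise from four features with three levels each (3⁴ = 81), which the text does not state explicitly, and it uses the textbook L9 array; the design actually generated by the software may have differed.

```python
from collections import Counter
from itertools import combinations, product

LEVELS = 3    # assumed number of levels per feature
FEATURES = 4  # assumed number of features (3**4 = 81)

def full_factorial():
    """All possible feature-level combinations."""
    return list(product(range(LEVELS), repeat=FEATURES))

def l9_array():
    """Standard L9 orthogonal array, built with arithmetic modulo 3."""
    return [(a, b, (a + b) % LEVELS, (a + 2 * b) % LEVELS)
            for a, b in product(range(LEVELS), repeat=2)]

def is_orthogonal(rows):
    """True if every pair of columns shows each level pair equally often."""
    expected = len(rows) / LEVELS ** 2
    for i, j in combinations(range(FEATURES), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        if len(counts) != LEVELS ** 2:
            return False
        if any(count != expected for count in counts.values()):
            return False
    return True

cards = l9_array()
print(len(full_factorial()), "->", len(cards), "profiles")
for card in cards:
    print(card)
print("orthogonal:", is_orthogonal(cards))
```

Because each level of every feature is paired equally often with each level of every other feature, the main effect of each feature can be estimated from nine profiles instead of 81.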
Paper V described a new procedure of multi-criteria product evaluation and one application of it. The procedure was a combination of three methods: use-value analysis (Pahl & Beitz 1988), the Dutch method (Roozenburg & Eekels 1995) and paired comparison (Mitchell 1992). The main features of the procedure include combined evaluation by experts and users (as subjects or participating partners), paired comparison, calculation of weighting factors and a possibility to utilise user trials. The procedure helps to make optimal trade-offs in the context of user-centred design.
The experimental part of paper V (Fig. 15) consisted of: (a) making a multi-criteria objective model for a chair by experts, (b) determining the importance of each criterion (done by the elderly using the Mitchell method, paper IV), (c) assigning scores to three chair alternatives (prototypes) with regard to each property corresponding to a criterion as defined by the experts, and (d) determining the overall value of each chair prototype according to the model.
Three researchers (acting as “pilot subjects”) rated the chairs and their individual characteristics based on the criteria after having got acquainted with them for 5 minutes (small-scale user trial). The scoring could also be done by the elderly. A special form was designed for assigning the scores (in this case only subjective verbal assessments). Calculation of the mean was preferred to trying to make a consensus decision concerning the scores.
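Steps (b) to (d) can be sketched as a weighted sum: each criterion's score is the mean over the raters, and the overall value is the sum of the weighted mean scores. That paper V aggregated the scores in exactly this way is an assumption; the criterion names, weights and scores below are invented for illustration, and only three criteria are shown instead of eight.

```python
from statistics import mean

def overall_value(weights, rater_scores):
    """Weighted sum of mean scores.

    weights:      criterion -> weighting factor (factors sum to 1)
    rater_scores: criterion -> list of scores, one per rater
    """
    return sum(weights[c] * mean(rater_scores[c]) for c in weights)

# Hypothetical weighting factors, e.g. derived from paired comparisons.
weights = {"seat height": 0.5, "armrests": 0.3, "stability": 0.2}

# Hypothetical scores given by three raters to two chair prototypes.
chair_a = {"seat height": [4, 3, 5], "armrests": [2, 2, 2], "stability": [3, 4, 5]}
chair_b = {"seat height": [3, 3, 3], "armrests": [4, 5, 3], "stability": [4, 4, 4]}

value_a = overall_value(weights, chair_a)
value_b = overall_value(weights, chair_b)
print(f"chair A: {value_a:.2f}, chair B: {value_b:.2f}")
```

Taking the mean per criterion mirrors the preference stated above for averaging the scores over seeking a consensus decision.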
The statistical methods used in the papers are shown in Tables 6 and 7. Regarding statistics, the main emphasis was placed on four points: case results as such, significance of the results, consistency and validity. These points are exceedingly important as far as experimental ergonomic evaluation procedures are concerned and simultaneously less dependent on the particular case.
For the processing and presentation of case results, the following methods were used: tabulation, cross-tabulation, matrix, conjoint analysis, regression, correlation, scattergram, the arithmetic mean (for example, as the score value of each task-surface height) and the standard deviation (as a measure of the distribution of judgements).
Table 6. The significance of the results was clarified with the help of different statistical methods.
| Method (paper) | Statistical method used | Reference |
|---|---|---|
| Rating (II): | the difference between the expert’s and the subjects’ scores was analysed by the t-test for paired samples. | Howell 1992 |
| Rating (II): | to compare the differences between the subjects’ ratings of two heights of the same furniture or fixture, Wilcoxon’s matched-pairs signed-ranks test was used. | Howell 1992 |
| Ranking (VI): | the validity of the rankings given by the experts was estimated by Kendall’s Tc coefficient. | Siegel & Castellan 1988 |
| Ranking (VI): | the validity of the rankings given by the experts was estimated by Kendall’s rank order correlation coefficient τ. | Siegel & Castellan 1988 |
| Ranking (VI): | to determine if the agreement between the respondents was higher than it would have been by chance, a method for assessing the differences in the sums of individual rankings between the groups (Doornbos-Prins test) was used. | Lokki 1980 |
| Rating (VII): | χ² test of the correlation between the statements. | |
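As an illustration of one of the measures in Table 6, Kendall's rank order correlation coefficient τ between two rankings without ties is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The rankings below are hypothetical and are not the data of paper VI.

```python
from itertools import combinations

def kendall_tau(rank_x, rank_y):
    """Kendall's tau for two rankings of the same items, assuming no ties."""
    n = len(rank_x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings of four telephones by two groups.
experts = [1, 2, 3, 4]
elderly = [2, 1, 3, 4]
tau = kendall_tau(experts, elderly)
print(f"tau = {tau:.3f}")
```

A value of 1 means identical rankings and −1 means exactly reversed rankings.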
Table 7. Consistency of the results.
| Method (paper) | Statistical method used | Reference |
|---|---|---|
| Inter-rater: | ||
| Rating (II): | The indicator of the reliability of the subjects’ ratings was the overall mean score. | |
| Conjoint (III, IV): | With the half-split test, the relative significances of the attributes were calculated for both subject groups and compared with the results of all subjects to assess their stability. | Cushman & Rosenberg 1991 |
| Mitchell (IV, V): | To estimate the agreement between the subjects in their Mitchell preferences, the method of the Kendall coefficient of agreement u for paired comparisons was used. | Siegel & Castellan 1988 |
| Multi-criteria (V) Ranking (VI): | The interpersonal variability of scores between the three researchers (paper V) and between the rankings made within each group (paper VI) was determined by using the Kendall coefficient of concordance W. | Siegel & Castellan 1988 |
| Intra-rater: | ||
| Objective (I): | The degree of consistency with which a certain foot was used to mount the first step was determined for the different mounting occasions of each subject. | |
| Rating (II): | Six of the subjects repeated the experiment after a couple of weeks, and their ratings of the furniture heights at the different times were compared with Wilcoxon’s matched-pairs signed-ranks test. | Howell 1992 |
| Mitchell (IV): | 11 subjects repeated the experiment after four months, and their paired comparisons of the chair criteria at the different times were compared with kappa (κ). | Fleiss 1973 |
| Mitchell (IV): | As a measure of the differences between rankings by the same subjects on two occasions, Spearman’s rank-order correlation rs was used. | Siegel & Castellan 1988 |
| Mitchell (IV): | The proportion of consistent assessments CR shows the degree of agreement. | Fleiss 1973 |
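As an illustration of one of the inter-rater measures in Table 7, the Kendall coefficient of concordance W for m raters ranking n items without ties is W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the items' rank sums from their mean. The rankings below are hypothetical, and the sketch does not include a correction for tied ranks.

```python
def kendall_w(rankings):
    """Kendall's W for a list of rankings (one list of ranks per rater)."""
    m = len(rankings)     # number of raters
    n = len(rankings[0])  # number of ranked items
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rank_sum - mean_sum) ** 2 for rank_sum in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of four items by three raters.
rankings = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
w = kendall_w(rankings)
print(f"W = {w:.3f}")
```

W equals 1 when all raters give identical rankings and approaches 0 as agreement disappears.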