4.3. Methods and procedures

The methods used in the different studies are a result of the author’s and her team’s own choices and experiments. The methods were described briefly in the literature review, and they have been presented in more detail in the original papers. This chapter focuses on the use of the methods in a study context enabling EEE development.

The measurements included direct objective ones with standardised units. For the subjective assessments, various scales were needed. The scaling methods that are commonly used in product design research with short descriptions are listed in Table 5 together with their connections with the cases described in the papers. Ranking, rating, category scaling, paired comparisons and magnitude estimates are unidimensional scaling methods that may be used to construct ordinal scales, interval scales and ratio scales. Multidimensional scaling, on the other hand, provides scale values for each stimulus on several unidimensional scales. The semantic differential is a specialised form of multidimensional scaling that consists of a series of bipolar unidimensional graphic rating scales. (Cushman & Rosenberg 1991)

The choice of the specific scaling method depends on the research objective, the number of stimuli to be scaled, the number of subjects, and the time and money available. Category scales are the most common scaling tools for the quantification of human experiences (Müller 1996). Rating, category scaling, and magnitude estimation are very simple procedures and especially suitable if either the number of stimuli or the number of subjects is very large. Ranking, paired comparisons, semantic differential and multidimensional scaling are more time-consuming procedures, but they may provide more useful information. (Cushman & Rosenberg 1991)

Table 5. Comparison of the scaling methods commonly used in product design research (Cushman & Rosenberg 1991) and the paper in which the method is used in this study.

| Method | Subject’s task | Result | Paper |
|---|---|---|---|
| Ranking | Subject places stimuli in a rank order with respect to some attribute. | Ordinal scale; interval scale | VI |
| Rating | Subject rates each stimulus on a scale of 1 – n with respect to some attribute. | Ordinal scale; interval scale | II, VII |
| Category scaling | Subject places each stimulus in one of n categories, which are arranged with respect to some attribute. | Ordinal scale; interval scale (including category boundaries) | I |
| Paired comparisons | Stimuli are presented in pairs. Subject chooses the stimulus with the greater amount of some attribute. | Interval scale | III, IV, V |
| Magnitude estimation | Subject assigns a number to each stimulus to indicate the amount of some attribute. | Ratio scale | |
| Semantic differential | Subject rates stimuli on n continuous bipolar scales. | Stimulus profiles | II, VII |
| Multidimensional scaling | Subject indicates similarities between stimuli or preferences. | Positions of stimuli in n-dimensional psychological space; preference vector for each subject | |

Fig. 9 shows the different scaling methods and the numbers used for calculations in this study. Likert scales are scales on which participants register their agreement or disagreement with a statement (Rubin 1994).

Figure 9. An example of the different scaling methods used in the experiments.

4.3.1. Unit measurement by an instrument (I, II, III, IV)

Measurement with an instrument was applied to the subjects’ anthropometric dimensions: stature and elbow heights (paper II) and sitting height with an adjustable chair (paper IV). The furniture and products to be used were also measured, including the step height of the tractors used by the farmers (paper I). The height of the steps was chosen according to the recommendations. The same procedure was also applied to the furniture and fixture heights (paper II) and the microwave oven workstation arrangements (paper III). The match between the end-user and the technological product (Pheasant 1996), i.e. the physical fit to the interface, can be, at least in theory, objectively checked. The smaller the difference between the product’s physical features and the corresponding human measures, the better the fit (EEE1, Fig. 10). How this fit was perceived by each subject could be subjectively rated (EEE2) (Roozenburg & Eekels 1995).

Figure 10. An example of an EEE1 procedure with generalisation: the process of measuring the suitable work surface height in a task-surface experiment (paper II).

4.3.2. Rating (II, VII)

Rating was used to evaluate the heights of different furniture and facilities after experiments by the elderly (paper II). Rating was also used to find out the patients’ satisfaction with the video consultation in a real medical consultation (paper VII). The procedure of evaluation with the rating is presented in Fig. 11.

In the task-surface height experiment (paper II), the elderly subjects performed tasks typical of a home environment. The tasks were performed twice, with different task-surface heights in the second trial. After each task, the subject rated the task-surface height. Afterwards, the expert evaluated the heights from the video recordings. In the videophone experiment (paper VII), the patients were in real video consultation with a physician in a health care center and a specialist in a central hospital. The patients or their escort afterwards rated the statements in the questionnaire according to their experiences of the consultation and sent the questionnaire to the experimenter.

For rating purposes, three scales of different types were used. The rating in the videophone experiment (paper VII) was accompanied by a 3-step Likert scale and a 3-step semantic differential. In the task-surface experiment (paper II), a 7-step semantic differential was used. The ratings were made on different scales, depending on the factor to be rated as follows:

Paper II:

the heights of chairs and fixtures: much too high (1), too high (2), slightly too high (3), suitable (4), slightly too low (5), too low (6) and much too low (7).

Paper VII:

the video consultation statements: not agree (1), cannot say (2) and agree (3), and

the system of video consultation: bad (1), cannot say (2) and good (3).
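The scale codes listed above can be kept in small lookup tables so that verbal ratings are converted consistently into the numbers used in the calculations. The following is only a minimal sketch: the names `HEIGHT_SCALE`, `AGREEMENT_SCALE`, `SYSTEM_SCALE` and `encode`, as well as the sample ratings, are illustrative assumptions and are not taken from the papers.

```python
from statistics import mean

# Paper II: 7-step scale for the heights of chairs and fixtures.
HEIGHT_SCALE = {
    "much too high": 1,
    "too high": 2,
    "slightly too high": 3,
    "suitable": 4,
    "slightly too low": 5,
    "too low": 6,
    "much too low": 7,
}

# Paper VII: 3-step scales for the statements and for the system.
AGREEMENT_SCALE = {"not agree": 1, "cannot say": 2, "agree": 3}
SYSTEM_SCALE = {"bad": 1, "cannot say": 2, "good": 3}


def encode(ratings, scale):
    """Convert a list of verbal ratings into their numeric codes."""
    return [scale[rating] for rating in ratings]


# Hypothetical ratings of one task-surface height by four subjects.
sample = ["too high", "suitable", "slightly too low", "suitable"]
codes = encode(sample, HEIGHT_SCALE)

# Arithmetic mean of the codes; a value near 4 indicates "suitable".
mean_score = mean(codes)
```

On the 7-step scale the midpoint (4) is the target value, so a mean below 4 suggests that the surface was judged too high and a mean above 4 that it was judged too low.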

Figure 11. An example of an EEE2 procedure with generalisation: the process of rating fixture height in the home simulator and observations by an expert in the task-surface experiment (paper II).

4.3.3. Ranking (VI)

Ranking was applied to find a suitable telephone for the elderly among four different telephones (paper VI). The ranking was done by the elderly themselves and also by two expert groups: (a) geriatric nurses and (b) gerontechnology researchers (Fig. 12). The elderly subjects were allowed to examine the telephone concepts (three telephones of each type) for as long as they wanted before the judgement. The experts made their ranking according to pictures with descriptions of the telephone types.

Figure 12. An example of an EEE3 procedure with generalisation: the process of ranking telephone types for the elderly in the telephone experiment (paper VI).

4.3.4. Mitchell’s paired comparison (IV, V)

In the chair experiment (paper IV), Mitchell’s paired comparison method (1992) was used to weight eight chair criteria (Fig. 13, the results were used later in paper V). In the weighting process, altogether 28 pairs of criteria were compared. The comparison was made easier for the participants by utilising supporting figures for each criterion pair (see paper V Fig. 6). In paper IV, the Mitchell method was continued with an evaluation of three different chair prototypes. The subjects compared two chairs at a time for every criterion and chose the chair that was the best on each criterion.
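The figure of 28 pairs follows from comparing each of the eight criteria with every other criterion once (8 × 7 / 2 = 28). The sketch below enumerates the pairs and derives weights from a simple count of wins. This win-count weighting is an assumption chosen for illustration and may differ from the exact calculation in Mitchell’s method; the criterion names and the choices are placeholders.

```python
from itertools import combinations

# Eight placeholder names; the actual chair criteria are given in the papers.
criteria = [f"criterion_{i}" for i in range(1, 9)]

# Every unordered pair of criteria: 8 * 7 / 2 = 28 comparisons.
pairs = list(combinations(criteria, 2))


def win_count_weights(choices):
    """Turn paired choices into weights that sum to 1.

    `choices` maps each pair (a, b) to the criterion the subject preferred.
    """
    wins = {c: 0 for c in criteria}
    for pair, winner in choices.items():
        if winner not in pair:
            raise ValueError(f"{winner!r} is not a member of {pair!r}")
        wins[winner] += 1
    total = sum(wins.values())
    return {c: wins[c] / total for c in criteria}


# Hypothetical subject who always prefers the criterion listed first.
choices = {pair: pair[0] for pair in pairs}
weights = win_count_weights(choices)
```

With these placeholder choices the first criterion wins all seven of its comparisons and the last one wins none, which shows how the count of wins spreads the weight across the criteria.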

Figure 13. An example of an EEE4 procedure with generalisation: the process of Mitchell’s paired comparison in the chair experiment (paper IV).

4.3.5. Conjoint analysis (III, IV)

Conjoint analysis was used to help in the evaluation of microwave oven workstations by the elderly (paper III). On the basis of the defined features and feature levels, the React Customer software (React!… 1995) created ten conjoint cards, where one profile represented one possible product concept. The cards were a simplification of all the possible combinations of the attribute features and their levels. The simplification was based on a 90% level of confidence calculated by orthogonal arrays and utilised for statistical design of the experiments (SDE). The ten combinations of conjoint cards were simulated with the help of four microwave ovens placed at three different heights. The workstations were evaluated by a representative group of elderly subjects. The subjects were allowed about 20 minutes to get acquainted with all the oven arrangements with the task assignment.

After the user trials, the software calculated the relative importance of each feature and the utility of each feature level. According to these results, the best product combination was attained. The trade-offs of different feature levels and the scores for the possible combination of chairs can also be derived with the software.

In paper IV, the conjoint method was used for chair development by reducing the alternatives from 81 possible chairs to ten, comprising 9 actual alternative choices and one reference chair. All ten chairs were made of wood. The elderly tested the chairs, compared each of the nine alternatives with the reference chair and divided 100 points between the two chairs of each pair. The conjoint analysis was done with the Market Maker software (Market Maker 1996), which is a revised version of the React! customer software used in paper III. Fig. 14 illustrates the procedure used in the microwave oven experiment (paper III).
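A reduction from 81 alternatives to nine is consistent with four features at three levels each (3^4 = 81) being cut down by an orthogonal array. This reading is an assumption, since the number of features and levels is not stated here, and the design actually generated by the conjoint software is not documented in this chapter. Under that assumption, the standard L9 array can be sketched as follows, with levels coded 0, 1 and 2.

```python
from itertools import combinations, product

LEVELS = 3    # assumed number of levels per feature
FEATURES = 4  # assumed number of features

# Full factorial design: every combination of levels, 3**4 = 81 profiles.
full_factorial = list(product(range(LEVELS), repeat=FEATURES))

# Standard L9 construction: two free columns a and b, plus two columns
# derived from them with arithmetic modulo 3.
l9 = [
    (a, b, (a + b) % LEVELS, (a + 2 * b) % LEVELS)
    for a in range(LEVELS)
    for b in range(LEVELS)
]


def is_orthogonal(design):
    """Check that every pair of columns contains all 9 level pairs."""
    for i, j in combinations(range(FEATURES), 2):
        seen = {(row[i], row[j]) for row in design}
        if len(seen) != LEVELS ** 2:
            return False
    return True
```

Because the nine rows cover every pair of levels for every pair of features exactly once, the main effect of each feature can be estimated without testing all 81 profiles.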

Figure 14. An example of an EEE5 procedure with generalisation: the process of conjoint analysis in the assessment of microwave oven workstations (paper III).

4.3.6. Multi-criteria evaluation procedure (V)

Paper V described a new procedure of multi-criteria product evaluation and one application of it. The procedure was a combination of three methods: use-value analysis (Pahl & Beitz 1988), the Dutch method (Roozenburg & Eekels 1995) and paired comparison (Mitchell 1992). The main features of the procedure include combined evaluation by experts and users (as subjects or participating partners), paired comparison, calculation of weighting factors and a possibility to utilise user trials. The procedure helps to make optimal trade-offs in the context of user-centred design.

The experimental part of paper V (Fig. 15) consisted of: (a) making a multi-criteria objective model for a chair by experts, (b) determining the importance of each criterion (done by the elderly using the Mitchell method, paper IV), (c) assigning scores to three chair alternatives (prototypes) with regard to each property corresponding to a criterion as defined by the experts, and (d) determining the overall value of each chair prototype according to the model.

Three researchers (acting as “pilot subjects”) rated the chairs and their individual characteristics based on the criteria after having got acquainted with them for 5 minutes (small-scale user trial). The scoring could also be done by the elderly. A special form was designed for assigning the scores (in this case only subjective verbal assessments). Calculation of the mean was preferred to trying to make a consensus decision concerning the scores.
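Steps (b) to (d) can be illustrated with a short sketch in which the scores of the three raters are averaged for each criterion and then combined with the weighting factors into one overall value per prototype. The weighted sum used here is an assumption in the spirit of use-value analysis; the criterion names, weights and scores are hypothetical and are not results from paper V.

```python
from statistics import mean

# Hypothetical weighting factors (step b); they sum to 1.
weights = {"criterion_a": 0.5, "criterion_b": 0.3, "criterion_c": 0.2}

# Hypothetical scores given by three raters to each prototype (step c).
scores = {
    "prototype_1": {
        "criterion_a": [4, 3, 5],
        "criterion_b": [2, 2, 2],
        "criterion_c": [5, 4, 3],
    },
    "prototype_2": {
        "criterion_a": [3, 3, 3],
        "criterion_b": [5, 4, 3],
        "criterion_c": [2, 2, 2],
    },
}


def overall_value(prototype_scores, weights):
    """Weighted sum of the mean rater score on each criterion (step d)."""
    return sum(w * mean(prototype_scores[c]) for c, w in weights.items())


values = {name: overall_value(s, weights) for name, s in scores.items()}
best = max(values, key=values.get)
```

Averaging the raters' scores before weighting mirrors the preference stated above for the mean over a consensus decision.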

Figure 15. An example of an EEE6 procedure with generalisation: the process of multi-criteria evaluation in the chair experiment (paper V).

4.3.7. Statistical methods

The statistical methods used in the papers are shown in Tables 6 and 7. Regarding statistics, the main emphasis was placed on four points: case results as such, significance of the results, consistency and validity. These points are exceedingly important as far as experimental ergonomic evaluation procedures are concerned and simultaneously less dependent on the particular case.

For the processing and presentation of case results, the following methods were used: tabulation, cross-tabulation, matrix, mean, standard deviation, conjoint analysis, regression, correlation, arithmetic mean for the score value of each task-surface height, standard deviation as a measure of the distribution of judgements, scattergram.

Table 6. The significance of the results was clarified with the help of different statistical methods.

| Method (paper) | Statistical method used | Reference |
|---|---|---|
| Rating (II) | The difference between the expert’s and the subjects’ scores was analysed by the t-test for paired samples. | Howell 1992 |
| Rating (II) | To compare the differences between the subjects’ ratings of two heights of the same furniture or fixture, Wilcoxon’s matched-pairs signed-ranks test was used. | Howell 1992 |
| Ranking (VI) | The validity of the rankings given by the experts was estimated by Kendall’s Tc coefficient. | Siegel & Castellan 1988 |
| Ranking (VI) | The validity of the rankings given by the experts was estimated by Kendall’s rank order correlation coefficient τ. | Siegel & Castellan 1988 |
| Ranking (VI) | To determine if the agreement between the respondents was higher than it would have been by chance, a method for assessing the differences in the sums of individual rankings between the groups (Doornbos-Prins test) was used. | Lokki 1980 |
| Rating (VII) | χ² test of the correlation between the statements. | |
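As an illustration of the first row of Table 6, the statistic of the t-test for paired samples can be computed from the differences between the expert’s and the subjects’ scores. The sketch below uses only the Python standard library; the function name `paired_t` and the scores are hypothetical, and the p-value would still have to be read from a t-distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return the paired-samples t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical scores on the 7-step height scale for five fixtures.
expert = [4, 3, 5, 4, 4]
subject = [3, 3, 4, 2, 3]

t_stat, dof = paired_t(expert, subject)
```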

Table 7. Consistency of the results.

| Method (paper) | Statistical method used | Reference |
|---|---|---|
| Inter-rater: | | |
| Rating (II) | The indicator of the reliability of the subjects’ ratings was the overall mean score. | |
| Conjoint (III, IV) | With the half-split test, the stability of the relative significances of the attributes for both subject groups was calculated and compared with the results of all subjects. | Cushman & Rosenberg 1991 |
| Mitchell (IV, V) | To estimate the agreement between the subjects in their Mitchell preferences, the Kendall coefficient of agreement u for paired comparisons was used. | Siegel & Castellan 1988 |
| Multi-criteria (V), Ranking (VI) | The interpersonal variability of scores between the three researchers (paper V) and between the rankings made within each group (paper VI) was determined by using the Kendall coefficient of concordance W. | Siegel & Castellan 1988 |
| Intra-rater: | | |
| Objective (I) | The degree of consistency with which a certain foot was used to mount the first step was determined for the different mounting occasions of each subject. | |
| Rating (II) | Six of the subjects repeated the experiment after a couple of weeks, and their ratings of the furniture heights at the different times were compared with Wilcoxon’s matched-pairs signed-ranks test. | Howell 1992 |
| Mitchell (IV) | 11 subjects repeated the experiment after four months, and their pair comparisons of the chair criteria at the different times were compared with kappa (κ). | Fleiss 1973 |
| Mitchell (IV) | As a measure of the differences between rankings by the same subjects on two occasions, Spearman’s rank-order correlation rs was used. | Siegel & Castellan 1988 |
| Mitchell (IV) | The proportion of consistent assessments CR shows the degree of agreement. | Fleiss 1973 |
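The Kendall coefficient of concordance W listed in Table 7 can be computed from the rank sums of the stimuli. The sketch below uses the standard formula for rankings without ties, W = 12S / (m²(n³ − n)), where m is the number of raters, n the number of stimuli and S the sum of squared deviations of the rank sums from their mean. The function name and the rankings are hypothetical; a correction for ties is not included.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    `rankings` holds one list per rater, each giving the ranks 1..n
    assigned to the same n stimuli in the same order.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each stimulus received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of four stimuli by three raters.
full_agreement = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
partial_agreement = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]

w_full = kendalls_w(full_agreement)
w_partial = kendalls_w(partial_agreement)
```

W ranges from 0 (no agreement) to 1 (identical rankings), so the first example gives the maximum value and the second a lower one.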