2.4. Collagen genes

Knowledge of the structure of collagen genes has a number of important applications. One is to provide the database needed to identify the mutations in collagen genes that cause human diseases; these are discussed in section 2.5. While screening for mutations in collagen genes, researchers often encounter normal variations, knowledge of which is fundamental to understanding the functional properties of the protein. In addition, some of these normal variations may even turn out to be potent predisposing factors for common diseases (for examples, see Kivirikko, 1993). Knowledge of gene structures across distant phyla can be used in evolutionary studies and in identifying functionally important domains in the protein structure and within the regulatory regions. Genomic sequences also provide a necessary tool for many molecular biological studies, for example of gene regulation and of protein function through the generation of genetically modified animals (see 2.6.2.).

Collagen genes and their loci are named with the prefix COL, followed by an Arabic number denoting the collagen type, the letter A, and another Arabic number denoting the α-chain in question. Gene names are usually written in italics. Those encoding human polypeptides are written in capital letters, whereas lowercase letters are used for the corresponding genes in mouse or chicken. The 34 collagen genes characterized to date, excluding those of the most recently identified collagen types XX-XXIII, are dispersed throughout the genome and are located on 15 human and 13 mouse chromosomes. The collagen genes in human and mouse, their chromosomal locations, and characteristic features are presented in Table 2 and discussed briefly below.
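
The naming convention can be illustrated with a short Python sketch that splits a gene symbol into its collagen type and α-chain number. The names `CollagenGene`, `parse_collagen_symbol`, `to_roman`, and `describe` are hypothetical helpers introduced here for illustration only, and the species hint rests solely on the capitalization rule stated above.

```python
import re
from dataclasses import dataclass

# COL + collagen type + A + alpha-chain number (capitals: human genes).
_HUMAN = re.compile(r"COL([1-9]\d*)A([1-9]\d*)")
# Same pattern in lowercase letters (mouse or chicken genes).
_LOWER = re.compile(r"Col([1-9]\d*)a([1-9]\d*)")


@dataclass(frozen=True)
class CollagenGene:
    symbol: str
    collagen_type: int   # Arabic number in the symbol, e.g. 18
    alpha_chain: int     # number after the letter A, e.g. 1
    species_hint: str    # inferred from capitalization only


def to_roman(n: int) -> str:
    """Convert a small positive integer to a Roman numeral (valid up to 39)."""
    pairs = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def parse_collagen_symbol(symbol: str) -> CollagenGene:
    """Split a collagen gene symbol into collagen type and alpha-chain number."""
    for pattern, hint in ((_HUMAN, "human"), (_LOWER, "mouse or chicken")):
        match = pattern.fullmatch(symbol)
        if match:
            return CollagenGene(symbol, int(match.group(1)), int(match.group(2)), hint)
    raise ValueError(f"not a recognized collagen gene symbol: {symbol!r}")


def describe(gene: CollagenGene) -> str:
    """Render the parsed symbol using the Roman-numeral collagen type."""
    return (f"type {to_roman(gene.collagen_type)} collagen, "
            f"alpha{gene.alpha_chain} chain ({gene.species_hint})")


if __name__ == "__main__":
    for name in ("COL1A2", "Col2a1", "COL18A1"):
        print(name, "->", describe(parse_collagen_symbol(name)))
```

The sketch rejects mixed-case symbols such as `COL2a1`, since the convention described above uses either capitals or lowercase letters throughout.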

Table 2. Collagen genes and their chromosomal locations*.

| Gene | Exons | Size (kb) | Chromosome* | References |
|---|---|---|---|---|
| COL1A1 | 51 | 18 | 17q21.3-q22 | Chu et al., 1985; D'Alessio et al., 1988; Määttä et al., 1991; Westerhausen et al., 1991 |
| COL1A2 | 52 | 38 | 7q21.3-q22 | de Wet et al., 1987; Körkkö et al., 1998 |
| COL2A1 | 54 | 31 | 12q13-q14 | Ala-Kokko & Prockop, 1990 |
| Col2a1 | 54 | 28.9 | 15 | Metsäranta et al., 1991 |
| COL3A1 | 51 | 44 | 2q24.3-q31 | Chu & Prockop, 1993 |
| Col3a1 | 51 | 37.6 | 1 | Toman & de Crombrugghe, 1994 |
| COL4A1 | 52 | >100 | 13q34 | Soininen et al., 1989 |
| COL4A2 | 47 | >100 | 13q34 | Heikkilä & Soininen, 1996 |
| Col4a2 | 47 | >90 | 8 | Buttice et al., 1990 |
| COL4A3 | 52 | 250 | 2q34-q37 | Heidet et al., 2001 |
| COL4A4 | 48 | >113 | 2q35-q37 | Boye et al., 1998 |
| COL4A5 | 51 | 140 | Xq22 | Zhou et al., 1994 |
| COL4A6 | 46 | 425 | Xq22 | Oohashi et al., 1995; Zhang et al., 1996 |
| COL5A1 | 66 | 750 | 9q34.2-q34.3 | Takahara et al., 1995 |
| COL6A1 | 36 | 29 | 21q22.3 | Heiskanen et al., 1995; Saitta et al., 1991; Trikka et al., 1997 |
| COL6A2 | 36 | 30 | 21q22.3 | Saitta et al., 1991; Saitta et al., 1992 |
| COL7A1 | 118 | 31.1 | 3p21 | Christiano et al., 1994 |
| Col7a1 | 118 | 31 | 9 | Kivirikko et al., 1996 |
| COL9A1 | 38 | 90 | 6q12-q14 | Pihlajamaa et al., 1998 |
| COL9A2 | 32 | 15 | 1p32 | Pihlajamaa et al., 1998 |
| Col9a2 | 32 | 16 | 4 | Perälä et al., 1994 |
| COL9A3 | 32 | 23 | 20q13.3 | Paassilta et al., 1999 |
| COL10A1 | 3 | 6.2 | 6q21-q22 | Apte et al., 1992; Thomas et al., 1991 |
| Col10a1 | 3 | 7.2 | 10 | Apte & Olsen, 1993 |
| COL11A1 | 68 | >150 | 1p21 | Annunen et al., 1999 |
| COL11A2 | 66 | >28 | 6p21.2 | Lui et al., 1996; Vuoristo et al., 1995 |
| COL13A1 | 41/42 | 140 | 10q22 | Hägg et al., 1998; Tikka et al., 1991 |
| Col13a1 | 42 | 135 | 10 | Kvist et al., 1999 |
| COL15A1 | 42 | 145 | 9q21-q22 | see I |
| Col15a1 | 40 | 110 | 4 | see II |
| COL17A1 | 56 | 52 | 10q24.3 | Gatalica et al., 1997 |
| COL18A1 | 43 | 105 | 21q22.3 | Elamaa et al., personal communication |
| Col18a1 | 43 | >102 | 10 | Rehn et al., 1996 |
| COL19A1 | 51 | >250 | 6q12-q14 | Khaleduzzaman et al., 1997 |

* The chromosomal locations of human and mouse genes were collected from the GeneCards and Mouse Genome databases, respectively.

Only completely characterized genes are listed; some genes whose chromosomal locations are known are therefore excluded.

Typically, genes encoding collagens span large genomic areas and consist of multiple exons that share some characteristics because of the repeating Gly-X-Y unit structure (see Vuorio & de Crombrugghe, 1990; Chu & Prockop, 1993, for reviews). Accordingly, the genes encoding fibril-forming collagens are similar in structure, whereas those encoding non-fibril-forming collagens are more heterogeneous. The region encoding the triple-helical domain of the major fibril-forming collagens, types I-III, consists of 41-42 exons, all of which have sizes that are multiples of 9 bp. Most exons are 54 bp in size, but exon sizes can also be multiples of 54 bp or combinations of 45 and 54 bp. Furthermore, each exon starts with a complete codon for glycine and therefore codes for a whole number of Gly-X-Y units. Because of the high evolutionary conservation among the fibrillar collagen genes, it has been proposed that the ancestral gene arose by amplification of a 54-bp exon unit. The genes encoding the minor fibrillar collagens, types V and XI, have a large number of 54-bp exons, which supports the hypothesis of a 54-bp ancestral exon, although their structures otherwise diverge considerably from those of the major ones, indicating a separate evolutionary pathway (Takahara et al., 1995; Vuoristo et al., 1995).
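
The exon-size arithmetic can be made concrete with a minimal Python sketch. One Gly-X-Y unit is three codons, i.e. 9 bp, so a 54-bp exon encodes six units. The function names and the example sizes in the loop are illustrative assumptions, not measurements from any particular gene.

```python
GLY_X_Y_UNIT_BP = 9  # one Gly-X-Y unit = 3 codons = 9 bp


def gly_xy_units(exon_bp: int):
    """Return the number of whole Gly-X-Y units in an exon, or None if the
    exon size is not a multiple of 9 bp."""
    if exon_bp <= 0:
        raise ValueError("exon size must be positive")
    units, remainder = divmod(exon_bp, GLY_X_Y_UNIT_BP)
    return units if remainder == 0 else None


def decompose_45_54(exon_bp: int):
    """Return (n45, n54) such that 45*n45 + 54*n54 == exon_bp, preferring as
    many 54-bp units as possible, or None if no such combination exists."""
    if exon_bp <= 0:
        raise ValueError("exon size must be positive")
    for n54 in range(exon_bp // 54, -1, -1):
        rest = exon_bp - 54 * n54
        if rest % 45 == 0:
            return (rest // 45, n54)
    return None


if __name__ == "__main__":
    # Illustrative exon sizes only.
    for size in (45, 54, 99, 108, 162, 36, 63):
        print(size, "bp:", gly_xy_units(size), "Gly-X-Y units;",
              "45/54 combination:", decompose_45_54(size))
```

Under these definitions a 99-bp exon decomposes as one 45-bp plus one 54-bp unit, whereas 36- and 63-bp exons are multiples of 9 bp but cannot be built from 45- and 54-bp units.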

The triple helix-encoding regions of non-fibril-forming collagen genes do not reflect the 54-bp exon motif common in fibril-forming collagens, but contain 36- and 63-bp exons or exons of other sizes that are multiples of 9 bp, or deviate slightly from such multiples. Imperfections in the Gly-X-Y sequences, together with split codons at the 5’- or 3’-ends of exons, some of them involving the first G residue of a Gly codon, further account for the variation in exon sizes (see Vuorio & de Crombrugghe, 1990; Chu & Prockop, 1993, for reviews).

Type XIII collagen was the first collagen shown to be modified by alternative splicing (Pihlajaniemi et al., 1987), but subsequent results have indicated that variant transcripts are the rule rather than the exception in the collagen family. The alternative transcripts are generated in various ways, ranging from the use of alternative promoters to exon skipping and the utilization of internal splice sites (reviewed by Pihlajaniemi & Rehn, 1995). In most cases alternative splicing affects the N- and C-terminal NC domains, with the exception of type XIII collagen, where both NC and COL domains are affected. Although the significance of these modifications is not fully understood, the tissue- and developmental stage-specific expression patterns of the variant forms reported, for example, for collagen II (Sandell et al., 1991 and 1994; Lui et al., 1995a), collagen IX (Liu et al., 1993), collagen XI (Sugimoto et al., 1998; Iyama et al., 2001), and collagen XII (Böhme et al., 1995) have been suggested to confer different functional properties (see also 2.7.).

To ensure that the various collagen types are expressed at controlled rates in their specific locations in adult (see 2.1. and 2.2.) and developing tissues (see 2.7.), the coordinated function of a multiplicity of regulatory elements located in the core promoter areas, in 5’-flanking sequences, and within introns is required. Collagen gene expression is further modulated by various cytokines and hormones (for a review, see Vuorio & de Crombrugghe, 1990). Recently, type XV collagen expression was reported to be enhanced by transforming growth factor-β (TGF-β) and reduced by tumor necrosis factor-α (TNF-α) and interleukin-1β (IL-1β) (Kivirikko et al., 1999).

Structurally, the collagen genes, like other genes, can be roughly divided into two categories based on the characteristics of their core promoter areas. These categories are “tissue-specific genes”, which have TATA boxes specifying the precise position of transcription initiation, and “housekeeping genes”, which lack TATA boxes but instead have a high GC content and multiple transcription start sites. The genes belonging to the latter category are transcribed widely in many tissues, but at low RNA levels. Of the collagen genes, those encoding the major fibrillar collagens, COL1A1 (Bornstein et al., 1987), COL2A1 (Metsäranta et al., 1991), and COL3A1 (Benson-Chanda et al., 1989), as well as COL10A1, which encodes the highly specialized collagen of hypertrophic chondrocytes (Apte & Olsen, 1993), and the downstream promoter initiating the synthesis of the cornea-specific transcript of collagen IX (Pihlajamaa et al., 1998), belong to the tissue-specific gene category. COL4A3-A4 (Momota et al., 1998), COL5A1 (Lee & Greenspan, 1995), COL7A1 (Christiano et al., 1994), COL9A2-A3 (Pihlajamaa et al., 1998; Paassilta et al., 1999), COL11A1 (Yoshioka et al., 1995), COL11A2 (Vuoristo et al., 1995), Col13a1 (Kvist et al., 1999), promoter 1 of Col18a1 (Rehn et al., 1996), and the downstream promoter of COL6A2 (Saitta et al., 1992) all belong to the housekeeping gene category. Furthermore, some collagen promoters, such as those of COL4A5-6 (Sugimoto et al., 1994) and promoter 2 of Col18a1 (Rehn et al., 1996), lack both TATA and GC boxes but contain CCAAT boxes. Others lack all of the above-mentioned proximal promoter elements; examples are COL6A1 (Bonaldo et al., 1993) and the upstream promoter of the cartilage-specific transcript of collagen IX (Pihlajamaa et al., 1998).
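
The grouping of promoters described above amounts to a simple decision rule, sketched here in Python. The enum `PromoterClass`, the function `classify_promoter`, and the order in which the features are tested are assumptions made for illustration; the feature values in the examples are those stated in the text, and features that the text does not mention are left as `None`.

```python
from enum import Enum
from typing import Optional


class PromoterClass(Enum):
    TISSUE_SPECIFIC = "TATA box present"
    HOUSEKEEPING = "no TATA box, GC-rich"
    CCAAT_ONLY = "no TATA or GC boxes, CCAAT box present"
    NO_PROXIMAL_ELEMENTS = "none of the listed proximal elements"


def classify_promoter(has_tata: Optional[bool],
                      gc_rich: Optional[bool] = None,
                      has_ccaat: Optional[bool] = None) -> PromoterClass:
    """Assign a core promoter to one of the four groups described in the text.

    None means "not stated" and is treated like False by the rule below.
    """
    if has_tata:
        return PromoterClass.TISSUE_SPECIFIC
    if gc_rich:
        return PromoterClass.HOUSEKEEPING
    if has_ccaat:
        return PromoterClass.CCAAT_ONLY
    return PromoterClass.NO_PROXIMAL_ELEMENTS


if __name__ == "__main__":
    examples = {
        "COL1A1": dict(has_tata=True),
        "COL5A1": dict(has_tata=False, gc_rich=True),
        "Col18a1 promoter 2": dict(has_tata=False, gc_rich=False, has_ccaat=True),
        "COL6A1": dict(has_tata=False, gc_rich=False, has_ccaat=False),
    }
    for name, features in examples.items():
        print(name, "->", classify_promoter(**features).name)
```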

There are several ways to identify and characterize regulatory elements. As described in publications I and II, putative regulatory elements can be identified simply by sequencing the 5’-flanking areas of the genes and searching for binding sites for known transcription factors, although the functional significance of such sites must be determined by other means. Several studies have utilized hints provided by the phylogenetic conservation of critical regulatory elements (collagens I, II, V and X) (Vikkula et al., 1992; Truter et al., 1993; Thomas et al., 1995; Antoniv et al., 2001, and see below). An important experimental system for studying elements conferring tissue specificity in intact animals is provided by transgenic mice, or lately also by nematodes, frogs, and zebrafish. Typically, a potential regulatory sequence is fused to a reporter gene, such as β-galactosidase, luciferase, or green fluorescent protein (GFP), and introduced into the mouse germline, and the expression of the reporter gene is monitored in tissues (for a review, see Hogan et al., 1994). This strategy has been used, for example, to identify the chondrocyte-specific elements in the first intron of the Col2a1 gene (Zhou et al., 1995; Zhou et al., 1998) and the osteoblast-specific elements in the promoter of the Col1a1 gene (Rossert et al., 1996), as well as to study isoform specificity in the expression patterns of collagen XVIII (cle-1) in C. elegans (Ackley et al., 2001). Similarly, promoter efficacy can be studied in vitro in transient transfection assays using reporter gene constructs, which, when coupled with cotransfection, gel-shift, and footprinting assays or mutagenesis, reveal functional characteristics of the promoter, such as the cis-acting elements implicated in gene regulation. This strategy has been used successfully, for example, to identify the regulatory elements conferring the liver specificity of promoter 2 of the Col18a1 gene (Liétard et al., 2000).
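
The first approach, searching a sequenced 5’-flanking area for known binding sites, can be sketched in Python as a naive exact-match scan on both strands. This is a toy illustration under stated assumptions: the three consensus strings are simplified textbook forms, the test sequence is made up, and `find_motifs` is a hypothetical helper. Actual analyses generally rely on curated binding-site matrices rather than exact string matches, and any hit remains putative until tested functionally.

```python
# Simplified consensus strings (assumptions for illustration only).
MOTIFS = {
    "TATA box": "TATAAA",
    "CCAAT box": "CCAAT",
    "GC box": "GGGCGG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motifs(sequence: str, motifs=MOTIFS):
    """Return (motif name, strand, 0-based start) for every exact match.

    The '-' strand is searched by looking for the reverse complement of the
    motif on the given strand; palindromic motifs are reported once.
    """
    seq = sequence.upper()
    hits = []
    for name, motif in motifs.items():
        patterns = [("+", motif)]
        rc = reverse_complement(motif)
        if rc != motif:
            patterns.append(("-", rc))
        for strand, pattern in patterns:
            start = seq.find(pattern)
            while start != -1:
                hits.append((name, strand, start))
                start = seq.find(pattern, start + 1)
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Made-up sequence, not taken from any collagen gene.
    flank = "GGGCGGAATTGGCCTATAAAGC"
    for name, strand, start in find_motifs(flank):
        print(f"{name} on {strand} strand at position {start}")
```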