In recent years the collection of statistical data has grown at such a rate that it would be impossible to keep up with even a small part of the things that directly affect our lives unless this information were disseminated in "predigested" or summarized form. The whole matter of putting large masses of data into a usable form has always been important, but its importance has grown greatly in the last few decades. This has been due partly to the development of computers, which have made it possible to do work that was previously left undone because it would have taken months or years, and partly to the deluge of data generated by the increasingly quantitative approach of the sciences, especially the behavioral and social sciences, where nearly every aspect of human life is nowadays measured in one way or another.

 

The most common method of summarizing data is to present them in condensed form in tables or charts, and at one time this took up the better part of an elementary course in statistics. Nowadays, there is so much else to learn in statistics that very little time is devoted to this kind of work. In a way this is unfortunate, because one does not have to look far in newspapers, magazines, and even professional journals to find unintentionally or intentionally misleading statistical charts.

 

In Sections 3.1 and 3.2 we shall present ways of listing data so that they present a good overall picture and, hence, are easy to use. By listing we are referring to any kind of treatment that preserves the identity of each value (or item). In other words, we rearrange but do not change. A speed of 63 mph remains a speed of 63 mph, a salary of $75,000 remains a salary of $75,000, and when sampling public opinion, a National remains a National and a Wafdy remains a Wafdy. In Sections 3.3 and 3.4, we shall discuss ways of grouping data into a number of classes, intervals, or categories and presenting the result in the form of a table or a chart. This will leave us with data in a relatively compact and easy-to-use form, but it does entail a substantial loss of information. Instead of a person's weight, we may know only that he or she weighs anywhere from 160 to 169 pounds, and instead of an actual pollen count we may know only that it is medium (11-25 parts per cubic meter).


3.1 Listing Numerical Data

 

Listing, and thus organizing, the data is usually the first task in any kind of statistical analysis. As a typical situation, consider the following data, representing the lengths (in centimeters) of 60 sea trout caught by a commercial trawler in Bay Area:

 

19.2  19.6  17.3  19.3  19.5  20.4  23.5  19.0  19.4  18.4
19.4  21.8  20.4  21.0  21.4  19.8  19.6  21.5  20.2  20.1
20.3  19.7  19.5  22.9  20.7  20.3  20.8  19.8  19.4  19.3
19.5  19.8  18.9  20.4  20.2  21.5  19.9  21.7  19.5  20.9
18.1  20.5  18.3  19.5  18.3  19.0  18.2  21.9  17.0  19.7
20.7  21.1  20.6  16.6  19.4  18.6  22.7  18.5  20.1  18.6

 

The mere gathering of this information is no small task, but it should be clear that more must be done to make the numbers comprehensible.

 

What can be done to make this mass of information more usable? Some persons find it interesting to locate the extreme values, which are 16.6 and 23.5 for this list. Occasionally, it is useful to sort the data in an ascending or descending order. The following list gives the lengths of the trout arranged in an ascending order.

 

16.6  17.0  17.3  18.1  18.2  18.3  18.3  18.4  18.5  18.6
18.6  18.9  19.0  19.0  19.2  19.3  19.3  19.4  19.4  19.4
19.4  19.5  19.5  19.5  19.5  19.5  19.6  19.6  19.7  19.7
19.8  19.8  19.8  19.9  20.1  20.1  20.2  20.2  20.3  20.3
20.4  20.4  20.4  20.5  20.6  20.7  20.7  20.8  20.9  21.0
21.1  21.4  21.5  21.5  21.7  21.8  21.9  22.7  22.9  23.5

 

Sorting a large set of numbers in an ascending or descending order can be a surprisingly difficult task. It is simple, though, if we can use a computer or a graphing calculator. In that case, entering the data is the most tedious part. Then, with a graphing calculator we press STAT and 2, fill in the list where we put the data, press ENTER, and the display screen spells out DONE.
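On a computer, the sorting step can be sketched in a few lines of Python. The variable names below are our own and purely illustrative; the data are the 60 trout lengths listed earlier in this section.

```python
# Lengths (in centimeters) of the 60 sea trout, in the order recorded.
lengths = [
    19.2, 19.6, 17.3, 19.3, 19.5, 20.4, 23.5, 19.0, 19.4, 18.4,
    19.4, 21.8, 20.4, 21.0, 21.4, 19.8, 19.6, 21.5, 20.2, 20.1,
    20.3, 19.7, 19.5, 22.9, 20.7, 20.3, 20.8, 19.8, 19.4, 19.3,
    19.5, 19.8, 18.9, 20.4, 20.2, 21.5, 19.9, 21.7, 19.5, 20.9,
    18.1, 20.5, 18.3, 19.5, 18.3, 19.0, 18.2, 21.9, 17.0, 19.7,
    20.7, 21.1, 20.6, 16.6, 19.4, 18.6, 22.7, 18.5, 20.1, 18.6,
]

# sorted() returns a new list and leaves the original order untouched,
# so the listing is rearranged but no value is changed.
ascending = sorted(lengths)
descending = sorted(lengths, reverse=True)

# The extreme values are the first and last entries of the sorted list.
smallest, largest = ascending[0], ascending[-1]
print(len(lengths), smallest, largest)
```

As in the text, entering the data remains the tedious part; the sort itself is a single call.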

 

If a set of data consists of relatively few values, many of which are repeated, we simply count how many times each value occurs and then present the results in the form of a table or a dot diagram. In such a diagram we indicate by means of dots how many times each value occurs.

 

Example (1)

An audit of twenty tax returns revealed 0, 2, 0, 0, 1, 3, 0, 0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 1, and 0 mistakes in arithmetic.

 

(a) Construct a table showing the number of tax returns with 0, 1, 2, and 3 mistakes in arithmetic.

(b) Draw a dot diagram displaying the same information.

 

Solution:

Counting the number of 0's, 1's, 2's, and 3's, we find that they are, respectively, 12, 5, 2, and 1. This information is displayed as follows, first in tabular form and then in graphical form.

 

Number of mistakes    Number of returns
        0                    12
        1                     5
        2                     2
        3                     1

•
•
•
•
•
•
•
•   •
•   •
•   •
•   •   •
•   •   •   •
0   1   2   3

Number of mistakes
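The counting and the dot diagram of this example can also be produced by a short program. The following is only a sketch in Python; the function name `dot_diagram` and the column spacing are our own choices, not part of the example.

```python
from collections import Counter

# Mistakes in arithmetic found in the twenty audited tax returns.
mistakes = [0, 2, 0, 0, 1, 3, 0, 0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 1, 0]

# Count how many times each value occurs.
counts = Counter(mistakes)
values = sorted(counts)
frequencies = [counts[v] for v in values]


def dot_diagram(values, frequencies, symbol="•"):
    """Return the lines of a vertical dot diagram, top row first."""
    lines = []
    # Work downward from the tallest column; a column gets a dot on a
    # given level only if its frequency reaches that level.
    for level in range(max(frequencies), 0, -1):
        row = "".join((symbol if f >= level else " ").ljust(4) for f in frequencies)
        lines.append(row.rstrip())
    # The last line labels the columns with the values themselves.
    lines.append("".join(str(v).ljust(4) for v in values).rstrip())
    return lines


for value, frequency in zip(values, frequencies):
    print(value, frequency)
print("\n".join(dot_diagram(values, frequencies)))
```

Passing a different `symbol` gives the modified diagrams with other marks in place of dots.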

 

There are various ways in which dot diagrams can be modified. For instance, instead of dots we can use other symbols, such as x's or asterisks. Also, we could align the dots horizontally rather than vertically.

 

The methods we used to display relatively few numerical values, many of which are repeated, can also be used to display categorical data.

 

Example (2)

The faculty of a university's mathematics department consists of four professors, six associate professors, eleven assistant professors, and nine instructors. Display this information in the form of a horizontally aligned dot diagram.

 

Solution:

Faculty Rank

Professor              * * * *
Associate professor    * * * * * *
Assistant professor    * * * * * * * * * * *
Instructor             * * * * * * * * *

 

 

 

Another way of modifying dot diagrams is to replace the dots with rectangles whose lengths are proportional to the respective numbers of dots. Such diagrams are referred to as bar charts, and the rectangles are often supplemented with the corresponding frequencies (number of symbols), as shown in the figure of Example (3).

 

Example (3)

Draw a bar chart for the data of Example (1); that is, for the numbers of mistakes in arithmetic in the twenty tax returns.

  

Solution:

Bar Chart of Mistakes in Arithmetic in Tax Returns


3.2 Stem-And-Leaf-Display

 

Dot diagrams are impractical and ineffective when a set of data contains many different values or categories, or when some of the values or categories require too many dots to yield a coherent picture. To give an example, consider the first-round scores in a PGA tournament, where the lowest score was a 62, the highest score was an 88, and 27 of the 126 golfers shot a par 72. This illustrates both of the reasons cited previously for not using dot diagrams. There are too many different values from 62 to 88, and at least one of them, 72, requires too many dots.

 

In recent years, an alternative method of listing data has been proposed for the exploration of relatively small sets of numerical data. It is called a stem-and-leaf display, and it also yields a good overall picture of the data without any appreciable loss of information. Again, each value retains its identity, and the only information we lose is the order in which the data were obtained.

 

To illustrate this technique consider the following data on the number of rooms occupied each day in a resort hotel during a recent month of June:

 

55  49  37  57  46  40  64  35  73  62
61  43  72  48  54  69  45  78  46  59
40  58  56  52  49  42  62  53  46  81

 

The smallest and largest values are 35 and 81, so that a dot diagram would require that we allow for 47 possible values. Actually, only 25 of the values occur, but in order to avoid having to allow for that many possibilities, let us combine all the values beginning with a 3, all those beginning with a 4, all those beginning with a 5 and so on. This would yield

37  35
49  46  40  43  48  45  46  40  49  42  46
55  57  54  59  58  56  52  53
64  62  61  69  62
73  72  78
81

 

 

 

 

 

 

 

 

 

 

 

This arrangement is quite informative, but it is not the kind of diagram we use in actual practice. To simplify it further, we show the first digit only once for each row, on the left and separated from the other digits by means of a vertical line. This leaves us with

 

3 | 7 5
4 | 9 6 0 3 8 5 6 0 9 2 6
5 | 5 7 4 9 8 6 2 3
6 | 4 2 1 9 2
7 | 3 2 8
8 | 1

 

 

 

 

 

 

 

 

 

 

 

and this is what we refer to as a stem-and-leaf display. In this arrangement, each row is called a stem, each number on a stem to the left of the vertical line is called a stem label, and each number on a stem to the right of the vertical line is called a leaf. As we shall see later, there is a certain advantage to arranging the leaves on each stem according to size, and for our data this would yield

 

3 | 5 7
4 | 0 0 2 3 5 6 6 6 8 9 9
5 | 2 3 4 5 6 7 8 9
6 | 1 2 2 4 9
7 | 2 3 8
8 | 1

 

 

 

 

 

 

 

 

 

 

 

A stem-and-leaf display is actually a hybrid kind of arrangement obtained in part by grouping and in part by listing. The values are grouped into the six stems, and yet each value retains its identity. Thus, from the preceding stem-and-leaf display, we can reconstruct the original data as 35, 37, 40, 40, 42, 43, 45, 46, 46, 46, 48, 49, 49, 52, 53, ..., and 81, though not in their original order.
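The construction just described, splitting each value into a stem label and a leaf and then ordering the leaves, can be sketched in Python as follows. The function name `stem_and_leaf` is our own, and the sketch assumes two-digit values so that the tens digit is the stem label and the units digit is the leaf.

```python
from collections import defaultdict

# Rooms occupied each day in the resort hotel during June.
rooms = [
    55, 49, 37, 57, 46, 40, 64, 35, 73, 62,
    61, 43, 72, 48, 54, 69, 45, 78, 46, 59,
    40, 58, 56, 52, 49, 42, 62, 53, 46, 81,
]


def stem_and_leaf(data):
    """Map each stem label (tens digit) to its leaves in ascending order."""
    stems = defaultdict(list)
    # Sorting first guarantees that the leaves on each stem come out ordered.
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


display = stem_and_leaf(rooms)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))

# Each value retains its identity: the data can be rebuilt from the display,
# although the order in which they were obtained is lost.
recovered = [10 * stem + leaf for stem in sorted(display) for leaf in display[stem]]
```

The list `recovered` equals the original data in sorted order, which is exactly the sense in which no information other than the original ordering is lost.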

 

There are various ways in which stem-and-leaf displays can be modified. For instance, the stem labels or the leaves could be two-digit numbers, so that

    24 | 0 2 5 8 9

would represent the numbers 240, 242, 245, 248, and 249, and

    2 | 31 45 70 88

would represent the numbers 231, 245, 270, and 288.

Now suppose that in the room occupancy Example we had wanted to use more than six stems. Using each stem label twice, if necessary, once to hold the leaves from 0 to 4 and once to hold the leaves from 5 to 9, we would get

 

3 | 5 7
4 | 0 0 2 3
4 | 5 6 6 6 8 9 9
5 | 2 3 4
5 | 5 6 7 8 9
6 | 1 2 2 4
6 | 9
7 | 2 3
7 | 8
8 | 1
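A display in which each stem label may be used twice can be sketched in the same way. In the Python fragment below, which is only an illustration with names of our own choosing, each row is identified by the stem label together with a flag that is 0 for the leaves from 0 to 4 and 1 for the leaves from 5 to 9.

```python
# Rooms occupied each day in the resort hotel during June.
rooms = [
    55, 49, 37, 57, 46, 40, 64, 35, 73, 62,
    61, 43, 72, 48, 54, 69, 45, 78, 46, 59,
    40, 58, 56, 52, 49, 42, 62, 53, 46, 81,
]


def split_stem_and_leaf(data):
    """Group leaves by (stem label, half), where half 0 holds 0-4 and half 1 holds 5-9."""
    rows = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        half = 0 if leaf <= 4 else 1
        rows.setdefault((stem, half), []).append(leaf)
    return rows


rows = split_stem_and_leaf(rooms)
# Rows with no leaves are simply absent, so a stem label appears twice
# only "if necessary", as in the text.
for stem, half in sorted(rows):
    print(stem, "|", " ".join(str(leaf) for leaf in rows[(stem, half)]))
```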

 

 

 

 

 

 


3.3 Frequency Distributions

 

When we deal with large sets of data, and sometimes even when we deal with not so large sets of data, it can be quite a problem to get a clear picture of the information that they convey. As we saw in Sections 3.1 and 3.2, this usually requires that we rearrange and/or display the raw (untreated) data in some special form. Traditionally, this involves a frequency distribution or one of its graphical presentations, where we group or classify the data into a number of categories or classes.

 

Following are two examples. A recent study of the total billings (rounded to the nearest dollar) of law firms yielded data for a sample of 4,757 firms. Rather than providing printouts of the 4,757 values, the information is disseminated by means of the following table:

 

Total billings            Number of law firms
Less than $300,000              2,405
$300,000 to $499,999            1,088
$500,000 to $749,999              271
$750,000 to $999,999              315
$1,000,000 or more                678
Total                           4,757

 

This distribution does not show much detail, but it may well be adequate for most practical purposes. This should also be the case in connection with the following table, which summarizes the 2,439 complaints received by an airline about comfort-related characteristics of its airplanes:

  

Nature of complaint                   Number of complaints
Inadequate leg room                           719
Uncomfortable seats                           914
Narrow aisles                                 146
Insufficient carry-on facilities              218
Insufficient restrooms                         58
Miscellaneous other complaints                384
Total                                       2,439

 

When data are grouped according to numerical size, as in the first example, the resulting table is called a numerical or quantitative distribution. When they are grouped into nonnumerical categories, as in the second example, the resulting table is called a categorical or qualitative distribution.

 

Frequency distributions present data in a relatively compact form, give a good overall picture, and contain information that is adequate for many purposes, but, as we said previously, there is some loss of information. Some things that can be determined from the original data cannot be determined from a distribution. For instance, in the first example the distribution does not tell us the exact size of the lowest and the highest billings, nor does it provide the total of the billings of the 4,757 law firms. Similarly, in the second example we cannot tell how many of the complaints about uncomfortable seats pertained to their width or how many complaints about insufficient carry-on facilities applied to luggage of a particular size. Nevertheless, frequency distributions present information in a generally more usable form, and the price we pay for this, the loss of certain information, is usually a fair exchange.

 

The construction of a frequency distribution consists essentially of three steps:

1. Choosing the classes (intervals or categories)

2. Sorting or tallying the data into these classes

3. Counting the number of items in each class

 

Since the second and third steps are purely mechanical, we concentrate here on the first, namely, that of choosing a suitable classification.

 

For numerical distributions, this consists of deciding how many classes we are going to use and from where to where each class should go. Both of these choices are essentially arbitrary, but the following rules are usually observed:

 

 We seldom use fewer than 5 or more than 15 classes; the exact number we use in a given situation depends largely on how many measurements or observations there are.

 

Clearly, we would lose more than we gain if we group five observations into 12 classes with most of them empty, and we would probably discard too much information if we group a thousand measurements into three classes.

 

We always make sure that each item (measurement or observation) goes into one and only one class.

 

To this end, we must make sure that the smallest and largest values fall within the classification, that none of the values can fall into a gap between successive classes, and that the classes do not overlap, namely, that successive classes have no values in common.

 

Whenever possible, we make the classes cover equal ranges of values.

 

Also, if we can, we make these ranges multiples of numbers that are easy to work with, such as 5, 10, or 100, since this will tend to facilitate the construction and the use of a distribution.

 

If we assume that the law firm billings were all rounded to the nearest dollar, only the third of these rules was violated in the construction of the billings distribution shown earlier in this section. However, had the billings been given to the nearest cent, then a billing of, say, $499,999.54 would have fallen between the second class and the third class, and we would also have violated the second rule. The third rule was violated because the classes do not all cover equal ranges of values; in fact, the first class and the last class have, respectively, no specified lower and upper limits.

 

Classes of the "less than," "or less," "more than," or "or more" variety are referred to as open classes, and they are used to reduce the number of classes that are needed when some of the values are much smaller than or much greater than the rest. Generally, open classes should be avoided, however, because they make it impossible to calculate certain values of interest, such as averages or totals.

 

Insofar as the second rule is concerned, we have to watch whether the data are given to the nearest dollar or to the nearest cent, whether they are given to the nearest inch or to the nearest tenth of an inch, whether they are given to the nearest ounce or to the nearest hundredth of an ounce, and so on. For instance, if we want to group the weights of certain animals, we might use the first of the following classifications when the weights are given to the nearest kilogram, the second when the weights are given to the nearest tenth of a kilogram, and the third when the weights are given to the nearest hundredth of a kilogram:

 

 

Weight          Weight          Weight
(kilograms)     (kilograms)     (kilograms)
10-14           10.0-14.9       10.00-14.99
15-19           15.0-19.9       15.00-19.99
20-24           20.0-24.9       20.00-24.99
25-29           25.0-29.9       25.00-29.99
30-34           30.0-34.9       30.00-34.99
etc.            etc.            etc.

 

To illustrate what we have been discussing in this section, let us now go through the actual steps of grouping a set of data into a frequency distribution.

 

Example (4)

Based on 1997 figures, the following are 110 "waiting times" (in minutes) between eruptions of the Old Faithful Geyser in Yellowstone National Park:

 

 81   83   94   73   78   94   73   89  112   80
 94   89   35   80   74   91   89   83   80   82
 91   80   83   91   89   82  118  105   64   56
 76   69   78   42   76   82   82   60   73   69
 91   83   67   85   60   65   69   85   65   82
 53   83   62  107   60   85   69   92   40   71
 82   89   76   55   98   74   89   98   69   87
 74   98   94   82   82   80   71   73   74   80
 60   69   78   74   64   80   83   82   65   67
 94   73   33   87   73   85   78   73   74   83
 83   51   67   73   87   85   98   91   73  108

 

Construct a frequency distribution.

 

Solution:

Since the smallest value is 33 and the largest value is 118, we have to cover an interval of 86 values, and a convenient choice would be to use the nine classes 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99, 100-109, and 110-119. These classes will accommodate all of the data, they do not overlap, and they are all of the same size. There are other possibilities (for instance, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95-104, 105-114, and 115-124), but it should be apparent that our first choice will facilitate the tally.

 

We now tally the 110 values and get the result shown in the following table:  

 

Waiting times between
eruptions (minutes)    Tally                                              Frequency
30-39                  ||                                                      2
40-49                  ||                                                      2
50-59                  ||||                                                    4
60-69                  ||||| ||||| ||||| ||||                                 19
70-79                  ||||| ||||| ||||| ||||| ||||                           24
80-89                  ||||| ||||| ||||| ||||| ||||| ||||| ||||| ||||         39
90-99                  ||||| ||||| |||||                                      15
100-109                |||                                                     3
110-119                ||                                                      2
                                                               Total         110
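The three steps of choosing the classes, tallying, and counting can be checked by machine. The following Python sketch is one way to do it under the assumption of equal-width classes with whole-number limits; the function `frequency_distribution` and its parameters are hypothetical names of our own.

```python
# The 110 waiting times (in minutes) between eruptions of Old Faithful.
waiting_times = [
    81, 83, 94, 73, 78, 94, 73, 89, 112, 80,
    94, 89, 35, 80, 74, 91, 89, 83, 80, 82,
    91, 80, 83, 91, 89, 82, 118, 105, 64, 56,
    76, 69, 78, 42, 76, 82, 82, 60, 73, 69,
    91, 83, 67, 85, 60, 65, 69, 85, 65, 82,
    53, 83, 62, 107, 60, 85, 69, 92, 40, 71,
    82, 89, 76, 55, 98, 74, 89, 98, 69, 87,
    74, 98, 94, 82, 82, 80, 71, 73, 74, 80,
    60, 69, 78, 74, 64, 80, 83, 82, 65, 67,
    94, 73, 33, 87, 73, 85, 78, 73, 74, 83,
    83, 51, 67, 73, 87, 85, 98, 91, 73, 108,
]


def frequency_distribution(data, lower, width, n_classes):
    """Tally whole-number data into n_classes equal classes starting at lower.

    Returns a list of ((lower limit, upper limit), frequency) pairs.
    """
    counts = [0] * n_classes
    for value in data:
        index = (value - lower) // width
        # Each item must go into one and only one class.
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} falls outside the classification")
        counts[index] += 1
    limits = [(lower + i * width, lower + (i + 1) * width - 1) for i in range(n_classes)]
    return list(zip(limits, counts))


distribution = frequency_distribution(waiting_times, lower=30, width=10, n_classes=9)
for (low, high), frequency in distribution:
    print(f"{low}-{high}: {frequency}")
```

The check inside the loop enforces the rule that the smallest and largest values must fall within the classification; with integer data and consecutive whole-number limits, gaps and overlaps cannot occur.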

 

The numbers given in the right-hand column of this table, which show how many values fall into each class, are called the class frequencies. The smallest and largest values that can go into any given class are called its class limits, and for the distribution of the waiting times between eruptions they are 30 and 39, 40 and 49, 50 and 59, ..., and 110 and 119. More specifically, 30, 40, 50, ..., and 110 are called the lower class limits, and 39, 49, 59, ..., and 119 are called the upper class limits.

 

The amounts of time that we grouped in our example were all given to the nearest minute, so that 30 actually includes everything from 29.5 to 30.5, 39 includes everything from 38.5 to 39.5, and the class 30-39 includes everything from 29.5 to 39.5. Similarly, the second class includes everything from 39.5 to 49.5, ..., and the class at the bottom of the distribution includes everything from 109.5 to 119.5. It is customary to refer to 29.5, 39.5, 49.5, ..., and 119.5 as the class boundaries or the real class limits of the distribution. Although 39.5 is the upper boundary of the first class and also the lower boundary of the second class, 49.5 is the upper boundary of the second class and also the lower boundary of the third class, and so forth, there is no cause for alarm. The class boundaries are by choice impossible values that cannot occur among the data being grouped. If we assume again that the law firm billings grouped in the distribution shown earlier in this section were all rounded to the nearest dollar, the class boundaries $299,999.50, $499,999.50, $749,999.50, and $999,999.50 are also impossible values.

 

We emphasize this point because, to avoid gaps in the continuous number scale, some statistics texts, some widely used computer programs, and some graphing calculators (MINITAB, for example, and the TI-83) include in each class its lower boundary, and the highest class also includes its upper boundary. They would include 29.5 but not 39.5 in the first class of the preceding distribution of waiting times between eruptions of Old Faithful. Similarly, they would include 39.5 but not 49.5 in the second class, ..., but 109.5 as well as 119.5 in the highest class. This makes no difference as long as the class boundaries are impossible values that cannot occur among the data being grouped. Especially for this reason, the use of impossible class boundaries cannot be overemphasized.

 

Numerical distributions also have what we call class marks and class intervals. Class marks are simply the midpoints of the classes, and they are found by adding the lower and upper limits of a class (or its lower and upper boundaries) and dividing by 2. A class interval is merely the length of a class, or the range of values it can contain, and it is given by the difference between its boundaries. If the classes of a distribution are all equal in length, their common class interval, which we call the class interval of the distribution, is also given by the difference between any two successive class marks. Thus, the class marks of the waiting-time distribution are 34.5, 44.5, 54.5, ..., and 114.5, and the class intervals and the class interval of the distribution are all equal to 10.
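These definitions translate directly into arithmetic. The short Python sketch below, with variable names of our own, derives the class boundaries, class marks, and class intervals from the class limits of the waiting-time distribution, assuming as in the text that the data are given to the nearest minute.

```python
# Class limits of the waiting-time distribution: 30-39, 40-49, ..., 110-119.
class_limits = [(30 + 10 * i, 39 + 10 * i) for i in range(9)]

# The data are recorded to the nearest minute, so each limit extends
# half a unit in either direction.
unit = 1

boundaries = [(low - unit / 2, high + unit / 2) for low, high in class_limits]
class_marks = [(low + high) / 2 for low, high in class_limits]
class_intervals = [upper - lower for lower, upper in boundaries]

print(boundaries[0], class_marks[0], class_intervals[0])
```

Note that the class interval is the difference between the boundaries (10), not the difference between the limits (9).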

 

There are essentially two ways in which frequency distributions can be modified to suit particular needs. One way is to convert a distribution into a percentage distribution by dividing each class frequency by the total number of items grouped, and then multiplying by 100.

 

Example (5)

Convert the waiting-time distribution of Example (4) into a percentage distribution.

 

Solution:

The first class contains

$\frac{2}{110} \cdot 100 = 1.82\%$

of the data (rounded to two decimals), and so does the second class. The third class contains

$\frac{4}{110} \cdot 100 = 3.64\%$

of the data, the fourth class contains

$\frac{19}{110} \cdot 100 = 17.27\%$

of the data, ..., and the bottom class again contains 1.82% of the data. These results are shown in the following table:

 

Waiting times between
eruptions (minutes)      Percentage
30-39                       1.82
40-49                       1.82
50-59                       3.64
60-69                      17.27
70-79                      21.82
80-89                      35.45
90-99                      13.64
100-109                     2.73
110-119                     1.82

 

The percentages total 100.01, with the difference, of course, due to rounding.
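The conversion to percentages, including the small rounding discrepancy in the total, can be reproduced with the following Python sketch; the variable names are our own.

```python
# Class frequencies of the waiting-time distribution, 30-39 through 110-119.
frequencies = [2, 2, 4, 19, 24, 39, 15, 3, 2]
total = sum(frequencies)

# Divide each class frequency by the total, multiply by 100,
# and round to two decimals as in the text.
percentages = [round(100 * f / total, 2) for f in frequencies]

print(percentages)
# The rounded percentages need not add up to exactly 100.
print(round(sum(percentages), 2))
```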

 

The other way of modifying a frequency distribution is to convert it into a "less than," "or less," "more than," or "or more" cumulative distribution. To construct a cumulative distribution, we simply add the class frequencies, starting either at the top or at the bottom of the distribution.

 

Example (6)

Convert the waiting-time distribution of Example (4) into a cumulative "less than" distribution.

 

Solution:

Since none of the values is less than 30, 2 of the values are less than 40, 2 + 2 = 4 of the values are less than 50, 2 + 2 + 4 = 8 of the values are less than 60, ..., and all 110 of the values are less than 120, we get

 

Waiting times between
eruptions (minutes)      Cumulative frequency
Less than 30                     0
Less than 40                     2
Less than 50                     4
Less than 60                     8
Less than 70                    27
Less than 80                    51
Less than 90                    90
Less than 100                  105
Less than 110                  108
Less than 120                  110
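Adding the class frequencies from the top of the distribution is a running sum, which can be sketched in Python with `itertools.accumulate` from the standard library. The variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies of the waiting-time distribution, 30-39 through 110-119.
frequencies = [2, 2, 4, 19, 24, 39, 15, 3, 2]

# "Less than" cut-offs: the lower limit of each class plus the value
# just beyond the upper limit of the last class.
cutoffs = [30 + 10 * i for i in range(10)]

# No value is less than 30, so the cumulative frequencies start at 0.
cumulative = [0] + list(accumulate(frequencies))

for cutoff, count in zip(cutoffs, cumulative):
    print(f"Less than {cutoff}: {count}")
```

Applying the same running sum to the percentages instead of the frequencies would give a cumulative percentage distribution.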

 

Note that instead of "less than 30" we could have written "29 or less," instead of "less than 40" we could have written "39 or less," instead of "less than 50" we could have written "49 or less," and so forth.

 

In the same way we can also convert a percentage distribution into a cumulative percentage distribution. We simply add the percentages instead of the frequencies, starting either at the top or at the bottom of the distribution.

 

So far we have discussed only the construction of numerical distributions, but the general problem of constructing categorical (or qualitative) distributions is about the same. Here again we must decide how many categories (classes) to use and what kind of items each category is to contain, making sure that all the items are accommodated and that there are no ambiguities. Since the categories must often be chosen before any data are actually collected, it is usually prudent to include a category labeled "others" or "miscellaneous."

  

For categorical distributions, we do not have to worry about such mathematical details as class limits, class boundaries, and class marks. On the other hand, there is often a serious problem with ambiguities and we must be very careful and explicit in defining what each category is to contain. For instance, if we had to classify items sold at a supermarket into "meats," "frozen foods," "baked goods," and so forth, it would be difficult to decide, for example, where to put frozen beef pies. Similarly, if we had to classify occupations, it would be difficult to decide where to put a farm manager, if our table contained (without qualification) the two categories "farmers" and "managers." For this reason, it is advisable, where possible, to use standard categories developed by the Bureau of the Census and other government agencies.


3.4 Graphical Presentation

 

When frequency distributions are constructed mainly to condense large sets of data and present them in an "easy to digest" form, it is usually most effective to display them graphically. As the saying goes, a picture speaks louder than a thousand words, and this was true even before the current proliferation of computer graphics. Nowadays, each statistical software package strives to outdo its competitors by means of more and more elaborate pictorial presentations of statistical data.

 

For frequency distributions, the most common form of graphical presentation is the histogram, like the one shown in Figures 3.1 and 3.2. Histograms are constructed by representing the measurements or observations that are grouped (in Figures 3.1-3.2 the waiting times between eruptions of Old Faithful) on a horizontal scale, the class frequencies on a vertical scale, and drawing rectangles whose bases equal the class intervals and whose heights are the corresponding class frequencies.

 

The markings on the horizontal scale of a histogram can be the class limits (as in Figures 3.1-3.2), the class marks, the class boundaries, or arbitrary key values. For practical reasons, it is usually preferable to show the class limits, even though the rectangles actually go from one class boundary to the next. After all, they tell us what values go into each class. Note that histograms cannot be drawn for distributions with open classes and that they require special care when the class intervals are not all equal.

 

The data that led to Figure 3.1 were easy to group because there were only 110 values in the sample. For really large sets of data, it may be convenient to construct histograms directly from raw data by using a suitable computer package or graphing calculator. We said that it may be convenient to use a computer package or a graphing calculator – in actual practice, just entering the data in a computer or a calculator can be more work than tallying the data manually and drawing the rectangles.
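When no graphics package is at hand, a rough picture of a distribution can still be printed as text. The Python sketch below is only an approximation of a histogram: it turns the rectangles on their side and labels them with the class limits, and the function name `text_histogram` is our own.

```python
# Class limits and frequencies of the waiting-time distribution.
class_limits = [(30 + 10 * i, 39 + 10 * i) for i in range(9)]
frequencies = [2, 2, 4, 19, 24, 39, 15, 3, 2]


def text_histogram(limits, freqs, symbol="#"):
    """Return one line per class: its limits, a bar, and the class frequency."""
    lines = []
    for (low, high), freq in zip(limits, freqs):
        label = f"{low}-{high}".rjust(7)
        # The length of each bar equals the class frequency.
        lines.append(f"{label} | {symbol * freq} {freq}")
    return lines


print("\n".join(text_histogram(class_limits, frequencies)))
```

Because all class intervals are equal here, bar length alone conveys the frequencies; unequal intervals would require the special care mentioned above.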

 

Figure 3.1: Histogram of waiting times between eruptions of Old Faithful Geyser

 

Also referred to at times as histograms are bar charts (see Section 3.1), such as the one shown in Figure 3.2. The heights of the rectangles, or bars, again represent the class frequencies, but there is no pretense of having a continuous horizontal scale.

 

Figure 3.2: Bar chart of the distribution of waiting times between eruptions of Old Faithful Geyser


3.5 Measures of Association

 

In Chapter 2, we presented scatter diagrams, which graphically depict variables that are related. In this section, we present two numerical measures of the linear relationship depicted in a scatter diagram. The two measures are the covariance and the coefficient of correlation.

 

Covariance

If we have all the observations that constitute a population, we can compute the population covariance. It is defined as follows.

Population covariance = $\mathrm{COV}(X, Y) = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$

where $\mu_x$ is the population mean of the first variable, X; $\mu_y$ is the population mean of the second variable, Y; and N is the size of the population. The sample covariance is defined similarly, where n is the number of pairs of observations in the sample.

Sample covariance = $\mathrm{cov}(X, Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

For convenience, we label the population covariance COV(X,Y) and the sample covariance cov(X,Y). To illustrate how covariance measures association, the following three sets of sample data are given.

 

 

Set 1

  x     y     (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
  2    13       -3         -7            21
  6    20        1          0             0
  7    27        2          7            14

cov(X,Y) = 17.5

Set 2

  x     y     (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
  2    27       -3          7           -21
  6    20        1          0             0
  7    13        2         -7           -14

cov(X,Y) = -17.5

Set 3

  x     y     (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
  2    20       -3          0             0
  6    27        1          7             7
  7    13        2         -7           -14

cov(X,Y) = -3.5
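The three covariances can be verified with a few lines of Python. The function below is a minimal sketch of the sample covariance, that is, the sum of the products of the deviations divided by n - 1; the name `sample_covariance` is our own.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
cov_set1 = sample_covariance(x, [13, 20, 27])  # y increases with x
cov_set2 = sample_covariance(x, [27, 20, 13])  # y decreases as x increases
cov_set3 = sample_covariance(x, [20, 27, 13])  # no particular pattern

print(cov_set1, cov_set2, cov_set3)
```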

 

In set 1, as x increases, so does y. In this case, when x is larger than its mean, y is at least as large as its mean. Thus $(x_i - \bar{x})$ and $(y_i - \bar{y})$ have the same sign or are zero, which means that each product is either positive or zero. Consequently, the covariance is a positive number. In general, if two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number. Figure 3.3 depicts a scatter diagram of one such case.

 

Next, consider set 2. As x increases, y decreases. Thus, the signs of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ are opposite. As a result, the covariance is a negative number. If, as one variable increases, the other generally decreases, the covariance will be a large negative number. See Figure 3.4 for an illustrative scatter diagram.

 

Now consider set 3. As x increases, y exhibits no particular pattern. One product is positive, one is negative, and the third is zero. Consequently, the covariance is a small number. Generally speaking, if the two variables are unrelated (as one increases, the other shows no pattern), the covariance will be close to zero (either positive or negative). Figure 3.5 depicts the movement of two unrelated variables.

 

As a measure of association, covariance suffers from a major drawback. It is usually difficult to judge the strength of the relationship from the covariance. For example, suppose that you have been told that the covariance of two variables is 250. What does this tell you about the relationship between the two variables? The sign, which is positive, tells you that as one increases, the other also generally increases. However, the degree to which the two variables move together is difficult to ascertain because we don't know whether 250 is a large number. To overcome this shortcoming, statisticians have produced another measure of association, which is based on the covariance. It is called the coefficient of correlation.

 

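Figure 3.3: Scatter diagram A, cov(X,Y) > 0
Figure 3.4: Scatter diagram B, cov(X,Y) < 0
Figure 3.5: Scatter diagram C, cov(X,Y) = 0
Figure 3.6: Scatter diagram D, r = +1
Figure 3.7: Scatter diagram E, r = -1
Figure 3.8: Scatter diagram F, r = 0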

 

 

Coefficient of Correlation

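The coefficient of correlation is the covariance divided by the product of the standard deviations of X and Y. The population coefficient of correlation is labeled $\rho$ (the Greek letter rho) and is defined as

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}$$

where cov(X, Y) is the population covariance and $\sigma_x$ and $\sigma_y$ are the standard deviations of X and Y, respectively. We label the sample coefficient of correlation r, which we define as

$$r = \frac{\mathrm{cov}(X, Y)}{s_x s_y}$$

where cov(X, Y) is the sample covariance and $s_x$ and $s_y$ are the sample standard deviations of X and Y, respectively.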

 

Solution:

We begin by calculating the sample means and standard deviations: $\bar{x} = 18.0$, $\bar{y} = 217.0$, $s_x = 4.02$, and $s_y = 63.88$.

We then compute the deviations from the mean for both x and y, and their products. The following table shows these calculations.

 

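| x | y | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|---|
| 20.0 | 219 | 2.0 | 2.0 | 4.0 |
| 14.8 | 190 | -3.2 | -27.0 | 86.4 |
| 20.5 | 199 | 2.5 | -18.0 | -45.0 |
| 12.5 | 121 | -5.5 | -96.0 | 528.0 |
| 18.0 | 150 | 0.0 | -67.0 | 0.0 |
| 14.3 | 198 | -3.7 | -19.0 | 70.3 |
| 24.9 | 334 | 6.9 | 117.0 | 807.3 |
| 16.5 | 188 | -1.5 | -29.0 | 43.5 |
| 24.3 | 310 | 6.3 | 93.0 | 585.9 |
| 20.2 | 213 | 2.2 | -4.0 | -8.8 |
| 22.0 | 288 | 4.0 | 71.0 | 284.0 |
| 19.0 | 312 | 1.0 | 95.0 | 95.0 |
| 12.3 | 186 | -5.7 | -31.0 | 176.7 |
| 14.0 | 173 | -4.0 | -44.0 | 176.0 |
| 16.7 | 174 | -1.3 | -43.0 | 55.9 |
| | | | Total | 2,859.2 |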

 

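Thus, the sample covariance is

$$\mathrm{cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{2{,}859.2}{14} = 204.2286$$

The coefficient of correlation is

$$r = \frac{\mathrm{cov}(X, Y)}{s_x s_y} = \frac{204.2286}{(4.018)(63.88)} = 0.7957$$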
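As a check on the hand calculation, the same steps can be sketched in Python using only the standard library. The variable names are illustrative assumptions; the data are the 15 pairs of x and y values listed in the table.

```python
from math import sqrt

# The 15 (x, y) observations from the table.
x = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 24.9, 16.5,
     24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]
y = [219, 190, 199, 121, 150, 198, 334, 188,
     310, 213, 288, 312, 186, 173, 174]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of the products of the deviations (last column of the table).
cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample covariance and sample standard deviations use n - 1.
cov_xy = cross_products / (n - 1)
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

r = cov_xy / (s_x * s_y)

print(round(cross_products, 1))  # 2859.2
print(round(cov_xy, 4))          # 204.2286
print(round(r, 4))               # 0.7957
```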

 

           

 

USING THE COMPUTER

 

Excel Output for the Example

 

 

 

 

 

| | A | B | C |
|---|---|---|---|
| 1 | | Size | Price |
| 2 | Size | 15.0667 | |
| 3 | Price | 190.6133 | 3808.667 |

 

Excel prints the population covariance and variances. Thus, cov(X, Y) = 190.6133, $\sigma_x^2$ = 15.06667, and $\sigma_y^2$ = 3,808.667. To compute the corresponding sample statistics, multiply each by n/(n - 1). Therefore, the sample covariance is cov(X, Y) = 190.6133(15/14) = 204.2286.
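The conversion from population to sample statistics can be sketched as follows. This is an illustrative Python version, not a description of how Excel works internally; the helper name `population_cov` is an assumption, and the data are the size and price values of the example.

```python
def population_cov(xs, ys):
    """Population covariance: divide by n rather than n - 1.
    Calling it with the same list twice gives the population variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n

size = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 24.9, 16.5,
        24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]
price = [219, 190, 199, 121, 150, 198, 334, 188,
         310, 213, 288, 312, 186, 173, 174]

n = len(size)
pop_cov = population_cov(size, price)         # about 190.6133
pop_var_size = population_cov(size, size)     # about 15.0667
pop_var_price = population_cov(price, price)  # about 3808.667

# Multiply by n / (n - 1) to obtain the sample statistics.
factor = n / (n - 1)
sample_cov = pop_cov * factor                 # about 204.2286
sample_var_size = pop_var_size * factor
sample_var_price = pop_var_price * factor

# The factor cancels in the correlation, so r is the same either way.
r_from_population = pop_cov / (pop_var_size * pop_var_price) ** 0.5
r_from_sample = sample_cov / (sample_var_size * sample_var_price) ** 0.5
print(round(sample_cov, 4), round(r_from_population, 4), round(r_from_sample, 4))
```

Note that the factor n/(n - 1) appears once in the numerator and once (as a square root taken twice) in the denominator of r, which is why the coefficient of correlation does not depend on which version is used.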

 

| COMMANDS | COMMANDS FOR EXAMPLE |
|---|---|
| 1 Type or import the data into two columns. | Open file |
| 2 Click Tools, Data Analysis …, and Covariance. | |
| 3 Specify the coordinates of the data. | A1:B16 |

 

From the output, we observe that r = 0.795716.

 

Commands

Repeat the steps above, except click Correlation instead of Covariance.

 

| | A | B | C |
|---|---|---|---|
| 1 | | Size | Price |
| 2 | Size | 1 | |
| 3 | Price | 0.795716 | 1 |

 

The covariance provides very little useful information other than telling us that the two variables are positively related. The coefficient of correlation informs us that there is a strong positive relationship. This information can be extremely useful to real estate agents, insurance brokers, and all potential home purchasers.

 

Excel Output for the Example

 

 

| | A | B | C |
|---|---|---|---|
| 1 | | Odometer | Price |
| 2 | Odometer | 1 | |
| 3 | Price | -0.806307604 | 1 |

 

Excel prints the coefficient of correlation. The test can manually

 

| COMMANDS | COMMANDS FOR EXAMPLE |
|---|---|
| 1 Type or import the data into adjacent columns. | Open file |
| 2 Click Tools, Data Analysis …, and Correlation. | |
| 3 Specify the input range. Click Labels in First Row (if necessary). Click OK. | A1:B101 |

 

Interpreting the results

 

There is overwhelming evidence to infer that the two variables are correlated.

 

 

Spearman Rank Correlation Coefficient

 

In the previous sections of this chapter, we have dealt only with quantitative variables and have assumed that all of the conditions for the validity of the hypothesis tests and confidence interval estimates have been met. In many situations, however, one or both variables may be ranked, or if both variables are quantitative, the normality requirement may not be satisfied. In such cases, we measure and test to determine if a relationship exists by employing a nonparametric technique, the Spearman rank correlation coefficient.

 

The Spearman rank correlation coefficient is calculated, like all of the previously introduced nonparametric methods, by first ranking the data. We then calculate the Pearson correlation coefficient of the ranks.

 

The population Spearman correlation coefficient is labeled $\rho_s$, and the sample statistic used to estimate its value is labeled $r_s$.

 

Sample Spearman Rank Correlation Coefficient

$$r_s = \frac{\mathrm{cov}(a, b)}{s_a s_b}$$

where a and b are the ranks of the data.
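The two steps, ranking and then computing the Pearson coefficient of the ranks, can be sketched in Python as follows. The function names and the data are hypothetical, and the treatment of ties (tied values share the average of their ranks) is a common convention assumed here rather than one stated in the text.

```python
from math import sqrt

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Sample coefficient of correlation: cov(a, b) / (s_a * s_b)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)
    s_a = sqrt(sum((p - mean_a) ** 2 for p in a) / (n - 1))
    s_b = sqrt(sum((q - mean_b) ** 2 for q in b) / (n - 1))
    return cov / (s_a * s_b)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson coefficient of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y increases with x, but not along a straight line.
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 25]
print(pearson(x, y))   # less than 1, because the relationship is not linear
print(spearman(x, y))  # essentially 1, because the ranks agree exactly
```

The example illustrates why the rank-based coefficient is useful when the relationship is monotone but not linear.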


3.6 Summarizing Two-Variable Data

 

So far we have dealt only with situations involving one variable: the room occupancies in Section 2.2, the waiting times between eruptions of Old Faithful in Example 2.4, and so on. In actual practice, many statistical methods apply to situations involving two variables, and some of them apply even when the number of variables cannot be counted on one's fingers and toes. An example, not quite so extreme, would be a problem in which we want to study the values of one-family homes, taking into consideration their age, their location, the number of bedrooms, the number of baths, the size of the garage, the type of roof, the number of fireplaces, the lot size, the value of nearby properties, and the accessibility of schools.

 

Leaving some of this work to later chapters and, in fact, most of it to advanced courses in statistics, we shall treat here only the display, listing, and grouping of data involving two variables; that is, problems dealing with the display of paired data. In most of these problems, the main objective is to see whether there is a relationship, and if so what kind of relationship, so that we can predict one variable, denoted by the letter y, in terms of the other variable, denoted by the letter x. For instance, the x's might be family incomes and the y's might be family expenditures on medical care, they might be annealing temperatures and the hardness of steel, or they might be the time that has elapsed since the chemical treatment of a swimming pool and the remaining concentration of chlorine.

 

Paired data are usually denoted by ordered pairs (x, y), in the same way in which we denote points in the plane, with x and y being their x- and y-coordinates. When we actually plot the points corresponding to paired values of x and y, we refer to the resulting graph as a scatter diagram, a scatter plot, or a scattergram. As their name implies, such graphs are useful tools in the analysis of whatever relationship there may exist between the x's and the y's, namely, in judging whether there are any discernible patterns.

 

Example (7)

Raw materials used in the production of synthetic fiber are stored in a place that has no humidity control. Following are measurements of the relative humidity in the storage place, x, and the moisture content of a sample of the raw material, y, on 15 days:

 

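| x (percent) | y (percent) | x (percent) | y (percent) |
|---|---|---|---|
| 36 | 12 | 3 | 14 |
| 27 | 11 | 32 | 13 |
| 24 | 10 | 19 | 11 |
| 50 | 17 | 34 | 12 |
| 1 | 10 | 38 | 17 |
| 23 | 12 | 21 | 8 |
| 45 | 18 | 16 | 7 |
| 44 | 16 | | |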

Construct a scattergram.

Solution

Scattergrams are easy enough to draw, yet the work can be simplified by using appropriate computer software or a graphing calculator. The one shown in Figure 3.9 was reproduced from the display screen of a TI-83 graphing calculator.

Figure 3.9: Scattergram of humidity and water content data

As can be seen from the diagram, the points are fairly widely scattered, yet there is evidence of an upward trend; that is, increases in the water content of the raw material seem to go with increases in humidity. In Figure 3.9 the dots are squares with their centers removed, but they can also be circles, x's, dots, or other kinds of symbols. (The units are not marked on either scale, but on the horizontal axis the tick marks are at 10, 20, 30, 40, and 50, and on the vertical axis they are at 5, 10, 15, and 20.)

 

Some difficulties arise when two or more of the data points are identical. In that case, the TI-83 graphing calculator shows only one point, and so do some of the printouts obtained with statistical software. However, MINITAB has a special scattergram to take care of situations like this. Its so-called character plot prints the number 2 instead of the usual plotting symbol (such as x) to indicate that there are two identical data points, and it would print a 3 if there were three. This is illustrated by the following example.
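The idea of printing a count where points coincide can be sketched in a few lines of Python. This is only an illustration of the principle and does not reproduce MINITAB's actual output format; the function name `character_plot`, the choice of `*` for a single point, and the data are all assumptions.

```python
from collections import Counter

def character_plot(points):
    """Return a text scatter plot of integer-valued points.
    A cell shows '*' for one point, or the count (capped at 9)
    when several identical points fall on the same position."""
    counts = Counter(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    lines = []
    for y in range(max(ys), min(ys) - 1, -1):  # top row is the largest y
        row = ""
        for x in range(min(xs), max(xs) + 1):
            c = counts.get((x, y), 0)
            if c == 0:
                row += " "
            elif c == 1:
                row += "*"
            else:
                row += str(min(c, 9))
        lines.append(f"{y:3d} |{row}")
    return "\n".join(lines)

# Hypothetical data: (2, 3) occurs twice and (4, 4) occurs three times.
pts = [(1, 1), (2, 3), (2, 3), (3, 2), (4, 4), (4, 4), (4, 4)]
print(character_plot(pts))
```

In the printed grid, the position of (2, 3) shows a 2 and the position of (4, 4) shows a 3, so no observation is hidden behind another.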

 

Example (8)

 

Following are the scores which 40 students obtained on both parts of the test, with the scores on the even-numbered problems denoted by x and the scores on the odd-numbered problems denoted by y.

 

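| x | y | x | y | x | y | x | y |
|---|---|---|---|---|---|---|---|
| 40 | 39 | 32 | 23 | 37 | 34 | 32 | 28 |
| 45 | 45 | 45 | 35 | 41 | 38 | 40 | 34 |
| 27 | 24 | 42 | 36 | 35 | 33 | 37 | 37 |
| 42 | 39 | 44 | 42 | 34 | 30 | 47 | 45 |
| 42 | 29 | 41 | 35 | 38 | 40 | 44 | 40 |
| 49 | 40 | 48 | 45 | 42 | 34 | 35 | 35 |
| 36 | 28 | 44 | 39 | 32 | 35 | 44 | 35 |
| 39 | 39 | 40 | 28 | 38 | 27 | 43 | 38 |
| 43 | 38 | 50 | 48 | 36 | 37 | 37 | 35 |
| 39 | 34 | 37 | 39 | 43 | 42 | 43 | 33 |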

 

Choosing the five classes 26-30, 31-35, 36-40, 41-45, and 46-50 for x and the six classes 21-25, 26-30, 31-35, 36-40, 41-45, and 46-50 for y, group these data into a two-way frequency distribution.

 

Solution

 

Performing the tally, we find that the first pair of values, 40 and 39, goes into the cell belonging to the third column and the fourth row, the second pair of values, 45 and 45, goes into the cell belonging to the fourth column and the fifth row, and so on. We thus get

 

 

 

 

 

 

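| y \ x | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 |
|---|---|---|---|---|---|
| 21-25 | / | / | | | |
| 26-30 | | // | /// | / | |
| 31-35 | | /// | //// | ///// | |
| 36-40 | | | ///// / | ///// // | / |
| 41-45 | | | | /// | // |
| 46-50 | | | | | / |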

 

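and, hence, the following two-way frequency distribution:

| y \ x | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 |
|---|---|---|---|---|---|
| 21-25 | 1 | 1 | | | |
| 26-30 | | 2 | 3 | 1 | |
| 31-35 | | 3 | 4 | 5 | |
| 36-40 | | | 6 | 7 | 1 |
| 41-45 | | | | 3 | 2 |
| 46-50 | | | | | 1 |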
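The tally can also be carried out programmatically. The following Python sketch is one possible way to do it; the names `class_index`, `x_classes`, and `y_classes` are illustrative. The pairs are read down the columns of the data listing, and the y value of the fifth pair is taken to be 29, the value implied by the tally for the cell with x in 41-45 and y in 26-30; this reading is an assumption.

```python
from collections import Counter

# The 40 (x, y) pairs, read down each pair of columns of the data listing.
# Assumption: the fifth pair is (42, 29), as implied by the tally.
pairs = [
    (40, 39), (45, 45), (27, 24), (42, 39), (42, 29),
    (49, 40), (36, 28), (39, 39), (43, 38), (39, 34),
    (32, 23), (45, 35), (42, 36), (44, 42), (41, 35),
    (48, 45), (44, 39), (40, 28), (50, 48), (37, 39),
    (37, 34), (41, 38), (35, 33), (34, 30), (38, 40),
    (42, 34), (32, 35), (38, 27), (36, 37), (43, 42),
    (32, 28), (40, 34), (37, 37), (47, 45), (44, 40),
    (35, 35), (44, 35), (43, 38), (37, 35), (43, 33),
]

x_classes = [(26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]
y_classes = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50)]

def class_index(value, classes):
    """Return the position of the class whose limits contain the value."""
    for i, (low, high) in enumerate(classes):
        if low <= value <= high:
            return i
    raise ValueError(f"{value} does not fall into any class")

# Count the pairs falling into each (row, column) cell: rows are y classes,
# columns are x classes, as in the table of the text.
counts = Counter(
    (class_index(y, y_classes), class_index(x, x_classes)) for x, y in pairs
)
table = [
    [counts.get((row, col), 0) for col in range(len(x_classes))]
    for row in range(len(y_classes))
]

for (low, high), row in zip(y_classes, table):
    print(f"{low}-{high}", row)
```

Each printed row corresponds to one y class, with zeros standing for the empty cells of the table.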