My statistics class must have felt a great deal of sympathy with Eliza at the conclusion of a recent investigation into the styles of different authors. Not only did they study words in their English and Creative Writing courses, they were also inundated with words in statistics.
The experiment was stimulated by the presence on staff of a successful author, Jack Hodgins. The aim was to see if we could distinguish Jack’s style from that of other authors, using methods that were strictly numerical. In Probability and Statistics, John Durran treats statistics connected with sentence length and variety of noun usage. These two areas were therefore chosen for the first investigation.
A statistic for the variety of noun usage
One of the exercises in Probability and Statistics is based on material from G. U. Yule’s Statistical Study of Literary Vocabulary (CUP, 1944). The exercise develops a statistic that measures the variety of noun usage. The procedure is as follows:
b. For the selected passage, write down each noun as it occurs.
c. Record the number of repetitions of each noun until exactly 100 different nouns have been collected. At this stage you will have many nouns that occur just once, some that occur twice, three times, etc., and a few nouns that occur many times.
d. Record the results in a frequency table similar to that shown for a passage from Hemingway’s For Whom the Bell Tolls.
x is the number of times a noun occurred
f is the number of nouns that occurred x times
e. Calculate Σfx and Σfx², and from them the statistic m = 10⁴ × (Σfx² − Σfx) / (Σfx)².
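The tallying in steps b to d can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, the nouns are assumed to have been picked out by hand already, and the stopping rule (stop as soon as the 100th different noun is recorded) is one reading of step c.

```python
from collections import Counter


def noun_frequency_table(nouns, target_distinct=100):
    """Tally nouns in order of occurrence and return {x: f}, where
    f is the number of nouns that occurred x times.

    Assumption: counting stops as soon as `target_distinct` different
    nouns have been recorded.
    """
    counts = Counter()
    for noun in nouns:
        counts[noun.lower()] += 1
        if len(counts) == target_distinct:
            break
    # Turn "noun -> repetitions" into "repetitions -> number of nouns".
    return dict(sorted(Counter(counts.values()).items()))


def sums(table):
    """Return (sum of f*x, sum of f*x^2) for a frequency table {x: f}."""
    sum_fx = sum(f * x for x, f in table.items())
    sum_fx2 = sum(f * x * x for x, f in table.items())
    return sum_fx, sum_fx2


# Toy example with a target of 3 different nouns instead of 100.
toy_table = noun_frequency_table(
    ["dog", "cat", "dog", "bird", "cat", "dog", "fish"], target_distinct=3
)
print(toy_table)        # {1: 2, 2: 1}
print(sums(toy_table))  # (4, 6)
```

In the toy example counting stops at "bird", the third different noun, so "dog" has been seen twice and "cat" and "bird" once each.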
Yule has given reasons why this statistic will vary with the variety of nouns used in a passage. Two examples may illustrate this. The extreme case of greatest variety would occur when a passage contains 100 nouns each of which is used once. In this case:
Σfx = Σfx² = 100 and m = 0
A passage from Children of Dune by Frank Herbert came close to this extreme, having 94 nouns used once and 6 nouns used twice.
Σfx = 106, Σfx² = 118 and m = 10.7
The opposite extreme would be an elementary reader. A few nouns would be used many times before 100 different nouns are collected. In this case Σfx² is much greater than Σfx and m is very high. Using the factor 10⁴ produces values of m commonly in the range 0 < m < 200. The Hemingway passage mentioned earlier has so far given the highest value of m, approximately 193.
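The two worked cases can be checked with a short Python sketch. It assumes that m is Yule's characteristic, m = 10⁴ × (Σfx² − Σfx) / (Σfx)², which reproduces the value of 10.7 quoted for the Children of Dune passage; the function name yule_m is invented for this illustration.

```python
def yule_m(table):
    """Noun-variety statistic for a frequency table {x: f}.

    Assumed form (Yule's characteristic):
        m = 10^4 * (sum f*x^2 - sum f*x) / (sum f*x)^2
    """
    sum_fx = sum(f * x for x, f in table.items())
    sum_fx2 = sum(f * x * x for x, f in table.items())
    return 1e4 * (sum_fx2 - sum_fx) / sum_fx ** 2


# Greatest variety: 100 nouns, each used once.
greatest_variety = {1: 100}
# Children of Dune passage: 94 nouns used once, 6 used twice.
children_of_dune = {1: 94, 2: 6}

print(round(yule_m(greatest_variety), 1))  # 0.0
print(round(yule_m(children_of_dune), 1))  # 10.7
```

Heavy repetition inflates Σfx² much faster than Σfx, which is why a passage that repeats a few nouns many times gives a large m.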
Students each chose a different author and calculated m for at least 5 passages from one of his or her novels. The results for two of the novels are shown below: The Invention of the World by Jack Hodgins and Around the World in 80 Days by Jules Verne.
Clearly, it would not be easy to distinguish between passages from these two novels solely on the basis of the one statistic measuring noun usage.
This situation, however, changes when a second variable is considered, in this case, a statistic for sentence length.
A statistic for sentence length
Again in Probability and Statistics, John Durran mentions a study by C. B. Williams that shows that, for most authors, the distribution of the logarithms of sentence length (number of words) is nearly normal. Twenty sentences from each passage were randomly selected and the mean of log10(sentence length) was calculated. This statistic fell in the range 0.8 to 1.4, which corresponds to typical sentence lengths of roughly 6 to 25 words.
Students calculated this statistic for each of the passages that they had already examined for noun usage.
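As a rough illustration, the sentence-length statistic could be computed as follows. This is a sketch under simplifying assumptions: sentences are split naively on full stops, question marks and exclamation marks, words are whatever is separated by whitespace, and the function names are invented here.

```python
import math
import random
import re


def sentence_lengths(text):
    """Return the number of words in each sentence of `text`.

    Assumption: a sentence ends at '.', '!' or '?'. Abbreviations and
    quoted speech would need more careful handling.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def mean_log10_length(lengths, sample_size=20, seed=None):
    """Mean of log10(sentence length) over a random sample of sentences.

    If fewer than `sample_size` sentences are available, all are used.
    """
    rng = random.Random(seed)
    if len(lengths) > sample_size:
        sample = rng.sample(lengths, sample_size)
    else:
        sample = list(lengths)
    return sum(math.log10(n) for n in sample) / len(sample)


text = "One two three. Four five! Six seven eight nine ten?"
lengths = sentence_lengths(text)
print(lengths)  # [3, 2, 5]
print(mean_log10_length(lengths))
```

Averaging the logarithms rather than the raw lengths follows the observation that the logarithms are nearly normally distributed, so one very long sentence has less influence on the statistic.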
Discriminating between authors
Students now have a pair of statistics for each sample from the book. Collecting the results together on a graph makes the picture much clearer.
Some quite distinct patterns appear, and the data for The Invention of the World and Around the World in 80 Days exhibit much more distinct differences than was the case when noun usage alone was considered.
The final part of the project was to provide each student with a short story by one of the 6 authors and ask him or her to identify the author. The data for this passage have also been included on the graph. The identification is left to the reader as an exercise.
Many more statistics of style are possible and a follow-up to this project is planned looking at two such statistics.
Maurice Kendall, in an address to the American Statistical Association, describes a fog index related to sentence length and long words (3 or more syllables). A fog index of 10 or less is very readable, but considerable determination is required for fog indices greater than 25.
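The article does not give the formula for the fog index. The sketch below assumes the widely used Gunning form, 0.4 × (average words per sentence + percentage of words with 3 or more syllables), which may differ in detail from the version Kendall described. The syllable counter is a crude heuristic that counts groups of vowels, and the function names are invented for this illustration.

```python
import re


def count_syllables(word):
    """Very rough syllable count: the number of vowel groups, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fog_index(text):
    """Fog index, assuming the Gunning form:
    0.4 * (words per sentence + 100 * long words / words),
    where a long word has 3 or more syllables.

    The text must contain at least one word and one sentence.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    long_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    percent_long = 100 * len(long_words) / len(words)
    return 0.4 * (words_per_sentence + percent_long)


print(fog_index("The cat sat. The dog ran."))
```

The six-word example has two sentences and no long words, so its index is 0.4 × 3, far below the threshold of 10 quoted above.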
Jack Hodgins has suggested that the proportion of adverbs and adjectives in sentences might also be very revealing.
What support, or otherwise, these projects give to the contention that mathematicians are illiterate (!!!) remains to be seen.