Letter frequency matters for many endeavors, not just keyboard design. But how do we calculate it? An exact figure is of course impossible, because that would require a record of every single thing that has ever been typed. But we can get close.
The goal of my letter frequency data is to represent real typing accurately, emphasizing quality over quantity. Some sources, such as the Brown Corpus, have a huge quantity of text but are heavily biased toward professional writing. I try instead to maximize the quality of the text, and so I draw on a broad range of categories.
I have noticed that different types of text have very different letter frequencies, and that people do more of some types of typing than of others. Some people do a lot of programming, while others write a lot of emails. So the letter frequency must be customizable. To allow for this, I sort text into five categories: prose, casual, programming, formal, and news. (The basic idea for categorization came from Arensito.)
My justification for prose is that its writing style is more formalized than casual writing, but in different ways than formal writing; also, prose is frequently older, so it contains unusual words and conventions. Casual requires no justification; this category includes things such as email and blogs. Programming is significantly different from anything else, because of its vastly different syntax. Formal writing, which includes scientific papers and the like, has a different writing style and frequently uses technical jargon. News, which I use to mean anything from a newspaper, is similar to casual but is somewhat more formal and follows certain syntax conventions; I include news mainly because I find that it reflects the expected letter frequency well (by "expected", I mean expected by the letter frequency statistics that I have found online).
Each of these categories is noticeably different. Here are the letter rankings, from most to least frequent, that I obtained for each category.
Prose: e t a o n i h s r d l u m w c f g y p b v k x j q z
Casual: e t a o i n s r h l d c u m g y f p w b v k x j q z
Programming: e t n s a o r i l d c p u m f h g b v y w x q k j z
Formal: e t a i o n s r h l d c u f m p g y w b v k x j q z
News: e t a i o n s r h l d c u m p f g y w b v k x j z q
Or, the complete character rankings, including punctuation and digits:
Prose: e t a o n i h s r d l u m w c f g y , p b . v k " ' - ! ; x ? j q z : ) ( 1 < > 0 2 3 8 4 * 5 9 6 7 ] [ + / & = { } % @ # $ ~ _
Casual: e t a o i n s r h l d c u m g y f p w b . , v k 0 - ' x ) ( 1 j 2 : q " / 5 ! ? z 3 4 6 8 7 9 % ] [ * = + | _ ; \ > $ # ^ & @ < ~ { } `
Programming: e t n s a o r i l d c _ p u m f " . , = h ' ( : ) g b v > y w < [ ] / 1 x @ q k 0 \ 2 | ? { } 3 - j 5 4 z 6 7 % 9 8 + ! * & $ ; # ^ ~ `
Formal: e t a i o n s r h l d c u f m p g y w b , v . k - x " ; 1 j q 0 2 ' ) ( z : 9 [ ] 3 4 5 6 8 7 ? ` _ / ! & ^ + % = { * } | ~ > # < @ $
News: e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ( ) $ ; | ? / ! & [ ] % _ @ > = < * + #
They may look similar, but these differences are significant. (It is still worth noting that the differences here are smaller than some differences between supposedly comprehensive letter frequencies I have found online, which calls the reliability of those online frequencies into question.) When characters other than letters are included, the differences are even greater. For example, programming uses far more semicolons than any other form of typing.
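A ranking like the ones above can be produced by counting the characters in a body of text and sorting them by count. The following is a minimal Python sketch, not the program used for these results; the function names `char_counts` and `ranking`, the case folding, and the choice to skip whitespace are all assumptions made for illustration.

```python
from collections import Counter


def char_counts(text, letters_only=False):
    """Count characters in text, ignoring case and whitespace.

    Assumption: upper- and lowercase letters are merged, and spaces,
    tabs and newlines are not counted at all.
    """
    counts = Counter()
    for ch in text.lower():
        if ch.isspace():
            continue
        if letters_only and not ("a" <= ch <= "z"):
            continue
        counts[ch] += 1
    return counts


def ranking(counts):
    """Return characters from most to least frequent.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


# Hypothetical sample text; a real run would read a whole category of files.
sample = "The quick brown fox; the lazy dog."
print(" ".join(ranking(char_counts(sample, letters_only=True))))
print(" ".join(ranking(char_counts(sample))))
```

Running the same two functions over each category's text separately would give one ranking per category, which is the form the lists above take.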
I think these categories adequately cover the different styles of typing. The next question is, how much should each be weighted? The answer will obviously differ from person to person, so the letter frequency calculation program I am writing, and will be releasing soon, leaves open the option to weight these categories differently. But I also want to create a single letter frequency that is the best for the most people. This makes the weighting trickier.
Each category gets a multiplier n: every occurrence of a letter in that category is treated as n occurrences. For example, these multipliers
Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1
mean that each category is weighted equally, and
Prose = 2, Casual = 1, Programming = 1, Formal = 1, News = 0
mean that prose counts twice as much as each of the others, while news is completely ignored.
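The multiplier scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's program: the function name `combine`, the `normalize` option, and the toy counts are all hypothetical.

```python
from collections import Counter


def combine(category_counts, multipliers, normalize=False):
    """Merge per-category character counts using one multiplier per category.

    Each occurrence in a category counts as `multiplier` occurrences.
    With normalize=True (an assumption, not something the text specifies),
    each category's counts are first divided by that category's total, so a
    large corpus does not outweigh a small one.
    """
    combined = Counter()
    for name, counts in category_counts.items():
        weight = multipliers.get(name, 0)
        if weight == 0:
            continue  # a multiplier of 0 ignores the category entirely
        total = sum(counts.values())
        scale = weight / total if normalize and total else weight
        for ch, n in counts.items():
            combined[ch] += n * scale
    return combined


# Toy counts, invented purely for illustration.
category_counts = {
    "Prose": Counter({"e": 12, "t": 9, "h": 6}),
    "Programming": Counter({"e": 5, "_": 4, ";": 1}),
    "News": Counter({"e": 10, "t": 8, "i": 7}),
}
multipliers = {"Prose": 2, "Casual": 1, "Programming": 1, "Formal": 1, "News": 0}

combined = combine(category_counts, multipliers)
order = sorted(combined, key=lambda ch: (-combined[ch], ch))
print(" ".join(order))
```

With the second set of multipliers above, the prose counts are doubled and the news counts never enter the total, so "i", which appears only in the toy news data, is absent from the result. Whether to normalize by category size is a separate design choice: without it, whichever category has the most text dominates regardless of its multiplier.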
Since I am still trying to determine the best weightings, here are several candidates. Each includes the frequency of punctuation and numbers as well as letters.
Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1: e t a o i n s r h l d c u m f p g y w b , \ . v k _ " ( ) ' - ; = x $ 0 : 1 / q j > { } 2 [ ] z * ? < ! 3 5 @ | 4 9 8 + 6 7 & # % ^ ~ `
The equal weighting above is definitely not accurate; for one thing, most people type far less formal text than casual text.
Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (unweighted): e t a o i n s r h l d c u m f p g y w b . , v k _ ( ) ; " = ' - $ x / 0 : { } 1 j * > q 2 [ ] z ! \ ? < + 3 @ | 5 4 # & 6 8 9 7 % ~ ^ `
Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (weighted): e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ) ( $ ; | ? / ! & [ ] % _ @ > = * < + # ` ^ { } ~ \
The above proportions seem reasonable, and the resulting letter frequency is very close to that of letterfrequency.org. I think that programming and prose may be overstated. Nonetheless, these are the proportions that I have used for the letter frequency on my letter frequency page, as I consider them fairly reliable.