Character Frequency in Dogpile Searches
(or: search engine users don't know about the SHIFT key)
Out of curiosity, I ran a character frequency analysis against my ~97 megabyte archive of dogpile searchspy results. I was curious to see if 'internet English' varied statistically from proper English. Note that the statistics are likely to be somewhat skewed: the raw text file was first run through unix 'sort' and 'uniq' to remove any repeated searches. As I have no way of confirming whether dogpile shows a single search multiple times on searchspy in the first place, I don't think this is necessarily good or bad.
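For readers who want to reproduce the de-duplication step without the shell tools, here is a minimal Python sketch of what 'sort' piped into 'uniq' does. The function name `unique_searches` and the sample queries are made up for illustration, and Python sorts by code point, which may differ from the ordering 'sort' produces under a given locale.

```python
def unique_searches(lines):
    """Return the distinct search strings in sorted order, roughly like `sort | uniq`."""
    # Drop only the trailing newline; other characters (including any stray
    # carriage returns) are kept as part of the search string.
    stripped = (line.rstrip("\n") for line in lines)
    # A set removes exact duplicates; the comparison is case-sensitive.
    return sorted(set(stripped))


# Hypothetical sample input, not taken from the real archive.
sample = ["britney spears\n", "free mp3\n", "britney spears\n", "Free MP3\n"]
print(unique_searches(sample))
```

Because the comparison is case-sensitive, "free mp3" and "Free MP3" count as two different searches, just as they would with 'uniq'.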
What is interesting about this, if anything, is the nearly absolute lack of capital letters in search terms. Capital letters came in at the extreme bottom of the scale, when they appeared at all. Even then, the instances I investigated didn't suggest that anyone was using the shift key: the source of the capital letter 'A', for instance, was the search "Server: Apache/1.3.12 (Unix) Resin/2.1.2", which is obviously a cut-and-paste job.
The raw data, at the time this was generated, covered Feb-12-2006 through Mar-19-2006. The script I used to do much of the heavy lifting (not polished at all; the output was touched up by hand) is here.
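The counting itself is simple enough to sketch. The following is not the script mentioned above, just a minimal Python illustration of the same idea: drop the newlines, count every remaining character, and report each one as a percentage of the total. The function name `char_frequencies` and the sample text are assumptions for the example.

```python
from collections import Counter


def char_frequencies(text):
    """Count every character except newlines; return (percent, count, char) rows."""
    counts = Counter(text.replace("\n", ""))
    total = sum(counts.values())
    # most_common() yields characters from most to least frequent.
    return [(100.0 * n / total, n, ch) for ch, n in counts.most_common()]


# Hypothetical sample text standing in for the real archive.
for percent, count, ch in char_frequencies("free mp3\nfree music\n"):
    print(f"{percent:7.4f}  {count:6d}  {ch!r}")
```

For a file of around 97 megabytes it would probably be gentler on memory to update the `Counter` one line at a time rather than reading the whole file into a single string.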
    %~/dogpile$ ls -alh uniques
    -rw-r--r-- 1 root root 97M 2006-03-19 10:56 uniques
    %~/dogpile$ wc uniques
    4132663 14327267 101415738 uniques
After adjusting for newlines, there were 97283075 characters remaining.
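The adjustment is plain subtraction on the 'wc' figures shown above: 'wc' prints lines, words and bytes, so removing one newline per line from the byte count gives the number of characters left. A quick check in Python, assuming every line ends in exactly one newline byte:

```python
# Figures copied from the `wc uniques` output.
line_count = 4132663    # first column: number of newlines
byte_count = 101415738  # third column: number of bytes

# Assumption: one newline byte per line, and nothing else is discarded.
remaining = byte_count - line_count
print(remaining)  # 97283075
```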
The results, after removing the newlines from the text file...
Letters only, from most to least frequent: e a o i t r n s l c d u m h p g f b y w k v q x j z
| Percentage | Count | Character | Chart |
|---|---|---|---|
| 10.4800 | 10195284 | space (32) | |
| 8.9807 | 8736729 | e | |
| 7.6550 | 7447017 | a | |
| 6.9163 | 6728353 | o | |
| 6.3526 | 6179991 | i | |
| 6.2801 | 6109449 | t | |
| 6.2481 | 6078315 | r | |
| 5.8841 | 5724190 | n | |
| 5.8756 | 5715978 | s | |
| 4.2972 | 4180409 | l | |
| 3.8968 | 3790961 | c | |
| 2.7991 | 2723074 | d | |
| 2.6968 | 2623570 | u | |
| 2.6789 | 2606070 | m | |
| 2.5150 | 2446702 | h | |
| 2.2793 | 2217325 | p | |
| 2.0371 | 1981770 | g | |
| 1.5680 | 1525428 | f | |
| 1.5266 | 1485113 | b | |
| 1.3995 | 1361434 | y | |
| 1.3424 | 1305918 | w | |
| 0.9719 | 945517 | k | |
| 0.9114 | 886669 | v | |
| 0.5384 | 523765 | q | |
| 0.4629 | 450364 | ; | |
| 0.4589 | 446406 | & | |
| 0.3135 | 304993 | . | |
| 0.2977 | 289599 | 0 | |
| 0.2824 | 274721 | x | |
| 0.2615 | 254358 | j | |
| 0.1910 | 185824 | 1 | |
| 0.1858 | 180762 | z | |
| 0.1837 | 178700 | 2 | |
| 0.1379 | 134184 | , | |
| 0.1246 | 121259 | - | |
| 0.1043 | 101473 | 3 | |
| 0.1041 | 101309 | 5 | |
| 0.0973 | 94643 | 9 | |
| 0.0964 | 93748 | + | |
| 0.0910 | 88521 | 6 | |
| 0.0884 | 86019 | ' | |
| 0.0867 | 84351 | 4 | |
| 0.0697 | 67850 | 8 | |
| 0.0637 | 61946 | 7 | |
| 0.0480 | 46699 | / | |
| 0.0316 | 30706 | ? | |
| 0.0160 | 15586 | # | |
| 0.0138 | 13466 | : | |
| 0.0108 | 10524 | ( | |
| 0.0107 | 10385 | ) | |
| 0.0075 | 7258 | % | |
| 0.0059 | 5702 | _ | |
| 0.0044 | 4329 | @ | |
| 0.0035 | 3437 | * | |
| 0.0033 | 3214 | \ | |
| 0.0028 | 2750 | = | |
| 0.0027 | 2655 | ! | |
| 0.0015 | 1419 | $ | |
| 0.0013 | 1248 | ] | |
| 0.0011 | 1095 | ` | |
| 0.0010 | 947 | [ | |
| 0.0003 | 331 | ~ | |
| 0.0002 | 197 | \| | |
| 0.0001 | 127 | ^ | |
| 0.0001 | 120 | M | |
| 0.0001 | 119 | { | |
| 0.0001 | 110 | } | |
| 0.0001 | 109 | T | |
| 0.0001 | 89 | carriage return (13) | |
| 0.0001 | 84 | G | |
| 0.0001 | 64 | S | |
| 0.0001 | 56 | F | |
| 0.0001 | 56 | " | |
| 0.0000 | 45 | C | |
| 0.0000 | 42 | D | |
| 0.0000 | 36 | W | |
| 0.0000 | 3 | delete (127) | |
| 0.0000 | 2 | U | |
| 0.0000 | 1 | R | |
| 0.0000 | 1 | P | |
| 0.0000 | 1 | H | |
| 0.0000 | 1 | A |
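
To put a number on how rare capitals are, the uppercase rows of the table can simply be added up. This is a small sketch using only the counts listed in the table and the character total given earlier; the variable names are mine, and any capital that never appeared has no row and so contributes nothing.

```python
# Uppercase counts copied from the table rows.
upper_counts = {
    "M": 120, "T": 109, "G": 84, "S": 64, "F": 56, "C": 45, "D": 42,
    "W": 36, "U": 2, "R": 1, "P": 1, "H": 1, "A": 1,
}
total_chars = 97283075  # characters remaining after removing newlines

upper_total = sum(upper_counts.values())
upper_percent = 100.0 * upper_total / total_chars

print(upper_total)  # 562
print(f"{upper_percent:.6f}% of all characters")
```

That is 562 capital letters in total, fewer than the 947 occurrences of '[' on its own.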