OFF-TOPIC: How many words?

I am fascinated by language and communication. Because I am British – or, rather, English, to be precise – I was fortunate to be born to speak the most useful language on the planet. I dabble with other languages and enjoy learning about them. I have a few words here and there and a little bit of skill with French, German and Italian, but I really could not hold a conversation. My excuse [apart from laziness] is that English gets me by in most places, but also I have no idea which other language to invest time in. It would be obvious if I spent a lot of time in a specific foreign country, but I do not. Maybe my impending retirement will change matters …

Why is English so omnipotent [or, at least, omnipresent]? The answers are complex and mainly accidental. The language evolved in quite a haphazard way, importing vocabulary and grammar from many sources. The result is that it is very rich, having considerably more words than any comparable language. Its spread around the planet was aided by my ancestors’ predilection for subjugating any other countries that they thought were interesting or useful. A common use of English nowadays is as an auxiliary language. For example, if a Frenchman wants to speak with a German and neither speaks the other’s language, they will most likely communicate quite well in English. It seems to me that this works because the language withstands damage. You can speak very bad English and still communicate; most other languages are rendered useless more readily.

I started to wonder about the efficiency of the language, compared with some other options. I have noticed that French people tend to speak incredibly fast, which makes it almost impossible for me to follow. [In the French-speaking part of Switzerland, however, they seem to converse at a more leisurely pace and I find that I can understand quite a lot.] I therefore wondered whether French was actually a more efficient communications medium. I could not figure out a simple way to test this hypothesis as there are too many variables in a spoken language. However, I thought that I might explore written text and see how the efficiency of different languages varies.

I wondered whether other languages take up more or less space on the page for a given piece of text. I write a lot of articles and commonly have a target length to work to. Sometimes this is expressed in words, occasionally in characters. I figured that, if I took some text, I could get its word/character count from Word. Then, if I [Google] translated it into another language, I could perform another measurement and compare. In an attempt to be somewhat scientific, I chose 3 pieces of text:

  • a randomly selected news article from the BBC
  • a randomly selected news article from the Guardian newspaper
  • an article by a favourite author, Garrison Keillor [to include some American English]

They are all of similar length and long enough, I believe, to be a meaningful test. I cannot share the text, for obvious copyright reasons.

I chose 5 languages:

  • French [not too far from English]
  • German [somewhat like English, but a lot of compound words]
  • Finnish [from a different language group]
  • Esperanto [I thought that it might be interesting to see how a synthetic language performed]
  • Arabic [a very widely used language, with a different alphabet; I hope the comparison is valid]

Here are the results of my measurements:

I show the words/characters for each piece in each language with a percentage difference from English. Initially, I used the character count without spaces, but realized that this would be less useful; so I added one character for each word to give a more sensible result.
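The counting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure actually used [the real counts came from Word]. The function names are hypothetical, the two sample sentences are made up for the example and are not taken from the test articles, and splitting on whitespace is an assumption that may not match Word's counts exactly.

```python
def text_metrics(text: str) -> tuple[int, int]:
    """Return (word count, adjusted character count) for a piece of text.

    Assumption: words are whitespace-separated tokens. The adjusted
    character count is the number of characters without spaces, plus
    one character per word, as described in the article.
    """
    words = text.split()
    word_count = len(words)
    chars_without_spaces = sum(len(word) for word in words)
    return word_count, chars_without_spaces + word_count


def percent_difference(value: float, baseline: float) -> float:
    """Percentage difference of value relative to baseline.

    A negative result means the other language used fewer words or
    characters than the English baseline.
    """
    return (value - baseline) / baseline * 100.0


# Hypothetical sample sentences, purely for illustration.
english = "The cat sat on the mat"
french = "Le chat s'est assis sur le tapis"

en_words, en_chars = text_metrics(english)
fr_words, fr_chars = text_metrics(french)

print(f"English: {en_words} words, {en_chars} characters")
print(f"French:  {fr_words} words, {fr_chars} characters")
print(f"Words:      {percent_difference(fr_words, en_words):+.1f}%")
print(f"Characters: {percent_difference(fr_chars, en_chars):+.1f}%")
```

Adding one character per word approximates the space that follows each word, so a language with many short words is not flattered by having its spaces ignored.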

I have highlighted examples of where another language “performs better” than English. French and German do not do well. Finnish and Esperanto seem to be on a par with English. Arabic appears to show a significant increase in efficiency. I did do some tentative tests with Japanese, Chinese and Korean and these seemed to show very great increases in efficiency, but I am not confident in interpreting the results.

I was challenged when doing this research because Google Translate felt that I was doing too many translations and must be a robot. This was rather frustrating. How scientific this research was is open to question. Partly this depends on how good the translation was, but also, in particular with Arabic, whether my assumptions about measurement were correct.

If anyone wants to suggest that I should measure another language [and can explain why], I would be interested to hear by comment, email or via social media.



One thought about “OFF-TOPIC: How many words?”
  • I once read that the reason why e.g. Spanish sounds so fast, and Chinese sounds so slow, is because the information density per syllable varies a lot. Where Spanish has only 5 vowels, Chinese has many more when you include its tonal system. So a Chinese syllable has many more possible permutations than a Spanish one does, and therefore a Spanish speaker has to hurry up, whereas a Chinese speaker can take his time.

    Moreover, I can imagine that while your analysis is on written text, ultimately only the spoken language matters. Arabic might have fewer characters, but they don’t include the vowels. Arabic vowels are added by the speaker through sheer knowledge of the context of the word.

    Also, spoken Germanic languages are less “true to spelling”, whereas Romance languages and Finnish much more so. So the information density ratio of spoken/written Germanic languages will be higher than spoken/written Romance or Finnish.

    In my native language/dialect, Limburgish, we also use two distinct vowel tones (dragging and pushing) plus we have a myriad of vowels, but you write only parts of it. I think that this is the reason why to a Dutch speaker our language sounds slow and lazy, plus it sounds like we’re singing. Fun fact: the word “with” is spelled “biej” and the word “bee” is spelled “biej”, but pronounced differently because it carries a different tone. Similarly, we use tone differences to distinguish between singular and plural; compare “sjtein”/“sjtein” for “stone”/“stones” 🙂

    Anyway, I liked your article and it got me thinking. Thanks!


This article first appeared on the Siemens Digital Industries Software blog at