In an article in the September 2023 edition of *Cognition* (ScienceDirect), researchers Alexey Koshevoy, Helena Miton, and Olivier Morin released a paper: ["Zipf's Law of Abbreviation holds for individual characters across a broad range of writing systems"](https://www.sciencedirect.com/science/article/pii/S0010027723001610?via%3Dihub). If the findings are reproducible, it would be possible to identify languages with a straightforward method: simply comparing character frequencies. This method has been used to identify the language of cryptographic passphrases and to deduce the language of a given cipher block.
- The correlation is much weaker across language families, even within the same script
Orthography is a cultural artifact, and it does not always reflect the language it represents well. Languages that are closely linked culturally and historically become hard to tell apart, as with our example of Portuguese and Spanish. Two Indo-European languages written in the Latin script will also look similar, especially if you include the vowels A, E, I, O, U. Still, orthographies tend to clump and cluster around language families and groups when they share a writing system, so they are not always wholly distinct for an average sample.
To overcome this limitation, a machine-learning-driven algorithm is added to increase accuracy. This helps distinguish similar languages that share a writing system.
Here is a comparison of the output, run across the following languages: English, French, Indonesian, and Swahili.
French contributes as large a share of English vocabulary as Greek and Latin do, and French has a very deep historical connection with English. So it makes sense that mistakes increase by about 15% when there is no word layer.
It does hold up in a general sense, roughly 80% of the time.
# The process
All data is sourced from Wikipedia. The data is scraped in 3 parts.
This data is used to make a model of character occurrences for each language.
Where possible, a second dataset is created for each language using an algorithm: a set of words that are both most common in and most unique to the given language. The purpose and effects of this were detailed in the passage above.
## Character occurrences
A slope is created where each character in a language is converted to a percentage of the total characters found. For example, if a language's corpus contained 1,000 characters and one of them was "a", then "a" = 0.1%. Characters above the 0.1% cutoff are counted as "in this language", and the others are discarded. The same process is applied to any given sample string being identified.
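As a rough illustration, here is a minimal Python sketch of that conversion and cutoff. The function and variable names are hypothetical, not the project's actual code.

```python
from collections import Counter

# Sketch only: turn raw character counts into percentages of the total,
# then keep only characters at or above the 0.1% cutoff described above.
def character_profile(corpus: str, cutoff_pct: float = 0.1) -> dict:
    counts = Counter(corpus)
    total = sum(counts.values())
    percentages = {ch: 100 * n / total for ch, n in counts.items()}
    return {ch: pct for ch, pct in percentages.items() if pct >= cutoff_pct}

# The same profiling is applied to the sample string being identified.
sample_profile = character_profile("an example sample string to identify")
```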
The next step is, for each character, to take the absolute distance between the sample's occurrence rate and the language's occurrence rate.
For example, if a sample has a "w" occurrence rate of 2% and the candidate language's rate is 0.5%, the distance is 1.5. This distance is then run through a cubic function, so that 1.5 becomes roughly 1.11.
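To make that step concrete, here is a hedged sketch. The exact cubic the project uses is not given above, so the dampening function below is a stand-in with a similar shape, and all names are illustrative.

```python
# Absolute distance between a sample's and a language's occurrence rates,
# followed by a dampening transform.
def character_distance(sample_pct: float, language_pct: float) -> float:
    return abs(sample_pct - language_pct)

def dampen(distance: float) -> float:
    # Placeholder for the unspecified cubic function (which maps 1.5 to ~1.11).
    return distance ** (1 / 3)

# "w": 2% in the sample vs 0.5% in the candidate language -> distance 1.5.
print(dampen(character_distance(2.0, 0.5)))  # ~1.14 with this placeholder
```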
After character occurrences are scored, each word selected for each language is iterated through. If the word is found in the sample, that language's score is decreased by multiplying it by 0.7 (as of writing).
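A minimal sketch of that adjustment, with hypothetical names; only the 0.7 factor comes from the description above, and it assumes a lower score means a closer match.

```python
# Sketch only: every signature word of a candidate language found in the
# sample multiplies that language's distance-based score by 0.7.
def apply_word_layer(score: float, sample: str, language_words: list,
                     factor: float = 0.7) -> float:
    for word in language_words:
        if word in sample:
            score *= factor  # lower score = closer match in this sketch
    return score

score = apply_word_layer(1.8, "le chat est sur la table", ["le", "est", "sur"])
```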
## Word selection

For each language, words and word segments (split from "root words") are chosen from the words database and compared against all of the Wikipedia samples. Two variables are created from this comparison:
The top *n* best-scoring words are output and stored in `data/mostCommonWords.json`.
Words are tested in four stages, and a word's testing stops early depending on how well it is performing. After each stage, the worst-performing 50% of words are dropped. Each stage covers 25% of the total samples, so only 12.5% of all words end up being tested against the entire database.
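Here is a rough sketch of how that staged filtering could be implemented. Only the 25% sample batches, the 50% cut per stage, and the `data/mostCommonWords.json` output path come from the description above; `score_fn`, the ranking direction (higher = better here), and all names are assumptions for illustration.

```python
import json

def staged_word_test(words, samples, score_fn, stages=4, top_n=100):
    chunk = max(1, len(samples) // stages)
    survivors = list(words)
    totals = {w: 0.0 for w in survivors}
    for stage in range(stages):
        # Each stage tests the surviving words against the next 25% of samples.
        batch = samples[stage * chunk:(stage + 1) * chunk]
        for w in survivors:
            totals[w] += sum(score_fn(w, s) for s in batch)
        survivors.sort(key=lambda w: totals[w], reverse=True)
        if stage < stages - 1:  # drop the worst-performing half each stage
            survivors = survivors[: max(1, len(survivors) // 2)]
    best = survivors[:top_n]
    with open("data/mostCommonWords.json", "w", encoding="utf-8") as f:
        json.dump(best, f, ensure_ascii=False, indent=2)
    return best
```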