I was playing around with the idea of an automatic language detection script based on ngrams. The idea is to use a sample corpus for each language to build language profiles. For a sentence whose language is to be detected, a profile consisting of ngrams with relative frequency scores is built and then compared to the existing language profiles. The output is a normalized ranking score for each language profile, with 100 being the score of the best match.
Since I am comparing ngrams, the comparison is done on an orthographic level.
I refrained from googling the idea, since I figured it must already have been done and I wanted to figure out how to do it myself. That said, I know that Google does it much better.
Language Profile
The language profile is built using ngrams of three to five letters. Sample texts taken from random websites (news sites, Wikipedia etc.) in the language in question are chopped up into all possible ngram combinations, and the frequency of each ngram is counted and normalized (the most frequent ngram gets the score 100). The top 200 ngrams are stored in a text file as the language profile.
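A minimal sketch of that profile-building step could look like this (a reconstruction, not the actual script from the download below; the function name and the word-splitting regex are my own assumptions):

```python
import re
from collections import Counter

def build_profile(text, n_min=3, n_max=5, top=200):
    """Top n-grams (3-5 letters) with scores normalized so that
    the most frequent n-gram gets 100."""
    counts = Counter()
    for word in re.findall(r"\w+", text.lower(), re.UNICODE):
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    if not counts:
        return {}
    max_count = counts.most_common(1)[0][1]
    # Keep only the most frequent n-grams (the top 200 by default).
    return dict((ng, 100.0 * c / max_count) for ng, c in counts.most_common(top))
```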
When building the language profiles, all ngrams for all languages are calculated first. Each ngram is also given the score one in a global ngram dictionary for every language in which it occurs. Each ngram then gets a discount exponentially proportional to the number of languages in which it occurs.
This approach is similar to tf-idf score calculations.
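A sketch of the discount step, assuming each profile is a plain ngram-to-score dictionary (the discount base of 0.5 is a made-up constant, not necessarily what the script uses):

```python
from collections import Counter

def discount_shared_ngrams(profiles, base=0.5):
    """Scale down n-grams that occur in several language profiles,
    exponentially in the number of languages that share them."""
    languages_per_ngram = Counter()
    for profile in profiles.values():
        for ngram in profile:
            languages_per_ngram[ngram] += 1
    for profile in profiles.values():
        for ngram in profile:
            # An n-gram unique to one language keeps its score (base ** 0).
            profile[ngram] *= base ** (languages_per_ngram[ngram] - 1)
    return profiles
```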
The Detection Process
When an example text is given to the program, a language profile for that text is calculated and then compared to all the existing language profiles. The comparison is simple: each ngram that occurs in both the language profile and the text profile adds a constant score, plus a score proportional to that ngram's score in the text profile. Comparing the text and language profile scores for each ngram did not improve the detection precision.
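In code, the comparison could look roughly like this (again my own sketch, reusing build_profile from above; the constant bonus and the proportionality weight are assumptions):

```python
def detect(text, profiles, hit_bonus=1.0, weight=0.01):
    """Score the text against each language profile and normalize
    so that the best match gets 100."""
    text_profile = build_profile(text)
    raw = {}
    for lang, profile in profiles.items():
        score = 0.0
        for ngram, text_score in text_profile.items():
            if ngram in profile:
                # Constant bonus for the match, plus a contribution
                # proportional to the n-gram's score in the text profile.
                score += hit_bonus + weight * text_score
        raw[lang] = score
    best = max(raw.values()) if raw else 0.0
    if best == 0.0:
        return dict((lang, 0) for lang in raw)
    return dict((lang, int(round(100 * s / best))) for lang, s in raw.items())
```

Running a sentence through detect gives the kind of ranking shown in the example below.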
Example
The following text is run through the language detection system:
Es ist Heute schönes Wetter. Ich glaube, daß der Frühling unterwegs ist.
The scores are as follows:
de score: 100
da score: 57
sv score: 30
es score: 21
en score: 13
is score: 6
fr score: 0
hu score: 0
sk score: 0
German (de) is detected as the most probable language, followed by Danish (da) and Swedish (sv). French (fr), Hungarian (hu) and Slovak (sk) end up with a score of 0. This reflects the similarities between languages regarding their orthographic structure. The last three languages contain many diacritics that are not part of the German language, and hence the similarities in ngrams are few, if any. For an idea of how the languages look orthographically, here are the top 10 ngrams for the top three and bottom three languages:
| de | da | sv | fr | hu | sk |
|---|---|---|---|---|---|
| cht | læs | för | pour | cso | kelt |
| icht | der | att | rés | csonk | ých |
| sich | artik | och | pré | cson | pred |
| eine | rtik | det | ait | sonk | ého |
| ich | skat | ätt | our | onk | tick |
| sch | kke | rätt | que | cik | kelti |
| werb | ere | säg | pou | cikk | elti |
| der | nde | äge | eur | szó | kelts |
| wer | nsk | till | dans | hog | ltsk |
| das | det | äger | iards | ócikk | eltsk |
Considering that ngrams present in more than one language are given a score discount, it is interesting to note that German and Danish have “der” in common among their top ngrams, and Danish and Swedish share “det”. However, please note that this observation is purely anecdotal and does not prove anything.
Problems
The script can only discriminate between the languages it has profiles for. If a text is written in a language that does not have a profile, the system is not able to tell. It would be possible to implement some kind of threshold mechanism, but the problem is that some languages generally end up with higher scores than others, for all texts. A normalization for each language profile would have to be done first.
For example, the German language profile always ends up with a disproportionately high score for texts in most languages. The exact reason for this is at the moment not quite clear to me.
Another example of this disproportion is that a Norwegian language profile ended up with higher scores for a Swedish sentence than the Swedish language profile itself.
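One conceivable way to do such a normalization, sketched under the assumption that a held-out set of mixed-language texts is available: measure how high each profile tends to score on average, and divide by that before applying any threshold, so that a chronically high-scoring profile like the German one is pulled back down.

```python
from collections import Counter

def average_scores(profiles, held_out_texts):
    """Average score each language profile receives over a mixed
    held-out set of texts; dividing raw scores by these averages
    would even out profiles that score high on everything."""
    totals = Counter()
    for text in held_out_texts:
        for lang, score in detect(text, profiles).items():
            totals[lang] += score
    return dict((lang, total / float(len(held_out_texts)))
                for lang, total in totals.items())
```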
Conclusion
Using linguistic and/or statistical/mathematical theory to inform the implementation would probably yield better precision, but this shows that it is possible to create something that works moderately well just by trial and error.
Downloads
langdetect.tar.gz - Source code with language profiles
For copyright reasons I cannot provide any corpus, but you can just copy and paste texts in different languages if you want to build your own language profiles.
UPDATE 2010-08-11: I got an e-mail from someone who tried to use the script but ran into problems, probably because he was running a too old version of Python; the script uses the built-in sorted(). I have tried it with Python 2.6 and that works.