I was playing around with an idea for an automatic language detection script using ngrams. The idea is to build a language profile from a sample corpus for each language. For a sentence whose language is to be detected, a profile consisting of ngrams with relative frequency scores is built and then compared to the existing language profiles. The output is a normalized ranking score for each language profile, with 100 being the score of the best match.
Since I am comparing ngrams, the comparison is done on an orthographic level.
I refrained from googling the idea; I figured it must already have been done, and I wanted to work out how to do it myself. That said, I know Google does it much better.
Language Profile
The language profile is built using ngrams of three to five letters. Sample texts taken from random websites (news sites, Wikipedia etc.) in the given language are chopped up into all possible ngram combinations, and the frequency of each ngram is counted and normalized (the most frequent ngram gets a score of 100). The top 200 ngrams are stored in a text file as the language profile.
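The profile-building step above can be sketched roughly as follows. This is a minimal illustration, not the post's actual script; the function name and parameters are my own.

```python
from collections import Counter

def build_profile(text, n_min=3, n_max=5, top=200):
    """Build a language profile: the `top` most frequent ngrams
    of lengths n_min..n_max, normalized so the most frequent
    ngram scores 100."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    if not counts:
        return {}
    top_ngrams = counts.most_common(top)
    max_count = top_ngrams[0][1]  # frequency of the most common ngram
    return {ng: 100.0 * c / max_count for ng, c in top_ngrams}
```

The same function can be reused on an input sentence to build the text profile during detection.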
When building language profiles, all ngrams for all languages are calculated first. Each ngram is also given a score of one in a global ngram dictionary for every language in which it occurs. Each ngram's score is then discounted exponentially in proportion to the number of languages in which it occurs.
This approach is similar to tf-idf score calculations.
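The discount step could look something like the sketch below. The post does not state the exponential base, so halving per extra language (base 2) is an assumption of mine, as are the names.

```python
from collections import Counter

def discount_profiles(profiles):
    """Discount each ngram's score exponentially in the number of
    language profiles it occurs in. `profiles` maps language code
    to {ngram: score}. Base 2 is an assumed choice: an ngram shared
    by k languages is divided by 2**(k - 1)."""
    lang_count = Counter()  # in how many languages does each ngram occur?
    for prof in profiles.values():
        for ng in prof:
            lang_count[ng] += 1
    return {
        lang: {ng: score / 2 ** (lang_count[ng] - 1)
               for ng, score in prof.items()}
        for lang, prof in profiles.items()
    }
```

An ngram unique to one language keeps its full score, while one shared across many languages (like "der" in German and Danish) contributes much less, mirroring the tf-idf intuition mentioned above.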
The Detection Process
When an example text is given to the program, a language profile for that text is calculated and then compared to all the existing language profiles. The score is simply a constant addition for each ngram that occurs in both the language profile and the text profile, plus a contribution proportional to the ngram's score in the text profile. Comparing the text and language profile scores for each ngram did not improve detection precision.
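In code, the comparison described above might look like this. The constant bonus and the proportionality weight are illustrative values, not the ones the actual script uses.

```python
def score_against(text_profile, lang_profile, bonus=1.0, weight=0.01):
    """For each ngram present in both profiles, add a constant bonus
    plus a term proportional to the ngram's score in the text profile."""
    score = 0.0
    for ng, text_score in text_profile.items():
        if ng in lang_profile:
            score += bonus + weight * text_score
    return score

def detect(text_profile, lang_profiles):
    """Score the text profile against every language profile and
    normalize so the best match gets 100."""
    raw = {lang: score_against(text_profile, prof)
           for lang, prof in lang_profiles.items()}
    best = max(raw.values()) or 1.0  # avoid division by zero
    return {lang: round(100 * s / best) for lang, s in raw.items()}
```

Languages sharing no ngrams with the text end up at 0, which matches the fr/hu/sk scores in the example below.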
Example
The following text is run through the language detection system:
Es ist Heute schönes Wetter. Ich glaube, daß der Frühling unterwegs ist.
The scores are as follows:
de score: 100
da score: 57
sv score: 30
es score: 21
en score: 13
is score: 6
fr score: 0
hu score: 0
sk score: 0
German (de) is detected as the most probable language, followed by Danish (da) and Swedish (sv). French (fr), Hungarian (hu) and Slovak (sk) end up with a score of 0. This reflects the orthographic similarities between the languages. The last three languages contain many diacritics that are not part of German, and hence the similarities in ngrams are few, if any. For an idea of how the languages look orthographically, here are the top 10 ngrams for the top three and bottom three languages:
| de | da | sv | fr | hu | sk |
|---|---|---|---|---|---|
| cht, icht, sich, eine, ich, sch, werb, der, wer, das | læs, der, artik, rtik, skat, kke, ere, nde, nsk, det | för, att, och, det, ätt, rätt, säg, äge, till, äger | pour, rés, pré, ait, our, que, pou, eur, dans, iards | cso, csonk, cson, sonk, onk, cik, cikk, szó, hog, ócikk | kelt, ých, pred, ého, tick, kelti, elti, kelts, ltsk, eltsk |
Considering that ngrams present in more than one language are given a score discount, it is interesting to note that German and Danish both have "der" among their top ngrams, and Danish and Swedish share "det". However, this observation is anecdotal and doesn't prove anything.
Problems
The script can only discriminate between the languages it has profiles for. If a text is written in a language without a profile, the system cannot say so. Some kind of threshold mechanism could be implemented, but the problem is that some languages generally end up with higher scores than others across all texts, so each language profile would first have to be normalized.
For example, the German language profile consistently ends up with a disproportionately high score for most languages. The exact reason for this is not yet clear to me.
Another example of this disproportion: a Norwegian language profile scored higher on a Swedish sentence than the Swedish language profile itself.
Conclusion
Using linguistic and/or statistical/mathematical theory to inform the implementation would probably yield better precision, but this shows that it is possible to create something that works moderately well just using trial and error.
Downloads
langdetect.tar.gz - Source code with language profiles
For copyright reasons I cannot provide any corpus for you, but you can just copy and paste texts for different languages if you want to build your own language profiles.
UPDATE 2010-08-11: I got an e-mail from someone who had trouble running the script, probably because he was using too old a version of Python; the script uses the built-in sorted. I have tried it with Python 2.6 and that works.