Cognitive Science and Coding: Automatic Language Identification using Python

I was playing around with the idea of a script that automatically detects the language of a text using ngrams. The idea is to use a sample corpus for each language to build a language profile. For a sentence whose language is to be detected, a profile consisting of ngrams with relative frequency scores is built and then compared to the existing language profiles. The output is a normalized ranking score for each language profile, with 100 being the score of the best match.

Since I am comparing ngrams, the comparison is done on an orthographic level.

I refrained from googling the idea, since I figured that it must already have been done and I wanted to work out how to do it myself. However, I know that Google does it much better.

Language Profile

The language profile is built using ngrams of three to five letters. Sample texts in the language in question, taken from random websites (news sites, Wikipedia, etc.), are chopped up into all their possible ngrams, and the frequency of each ngram is counted and normalized (the score is 100 for the most frequent ngram). The top 200 ngrams are stored in a text file as the language profile.
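As a rough illustration of the profile-building step, here is a minimal sketch in Python 3 (the original script was written for Python 2). The function names, the choice to take ngrams inside single words, and the punctuation stripping are my assumptions for this example, not necessarily what the downloadable script does.

```python
from collections import Counter

def extract_ngrams(text, n_min=3, n_max=5):
    """Yield every letter ngram of length n_min..n_max found inside each word."""
    for word in text.lower().split():
        # Assumption: ngrams do not span word boundaries or include punctuation.
        word = word.strip(".,;:!?\"'()[]")
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                yield word[i:i + n]

def build_profile(text, top=200):
    """Return {ngram: score} for the `top` most frequent ngrams.

    Scores are scaled so that the most frequent ngram gets 100.
    """
    counts = Counter(extract_ngrams(text))
    if not counts:
        return {}
    most_common = counts.most_common(top)
    best = most_common[0][1]  # count of the most frequent ngram
    return {ngram: 100.0 * count / best for ngram, count in most_common}
```

Calling `build_profile(sample_text)` on a corpus for one language gives a dictionary that could then be written to a text file, one ngram and score per line.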

When building the language profiles, all ngrams for all languages are calculated first. Each ngram is also given a score of one in a global ngram dictionary for each language in which it occurs. Each ngram then gets a discount that grows exponentially with the number of languages in which it occurs.

This approach is similar to tf-idf score calculations.
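The discount step could be sketched like this. The exact discount formula is not given above, so dividing by `base` once for every additional language is purely an assumption for illustration, as are the function name and the `base` parameter. Whether the discount is applied before or after picking the top 200 ngrams is also left open here.

```python
from collections import Counter

def discount_shared_ngrams(profiles, base=2.0):
    """Discount ngrams that occur in several language profiles.

    profiles: {language: {ngram: score}}
    Returns new profiles; the input is left unchanged.
    """
    # Global dictionary: +1 for each language in which the ngram occurs.
    language_count = Counter()
    for profile in profiles.values():
        language_count.update(profile.keys())

    # Assumed formula: divide the score by `base` once per extra language.
    return {
        lang: {
            ngram: score / base ** (language_count[ngram] - 1)
            for ngram, score in profile.items()
        }
        for lang, profile in profiles.items()
    }
```

With `base=2.0`, an ngram found in only one language keeps its score, one shared by two languages is halved, and one shared by three is quartered.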

The Detection Process

When an example text is given to the program, a profile for that text is calculated and then compared to all the existing language profiles. The comparison is simple: a constant score is added for each ngram that occurs in both the language profile and the text profile. In addition, a score proportional to the ngram's score in the text's profile is added. Comparing the text and language profile scores for each ngram did not improve the detection precision.
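A minimal sketch of this comparison, in Python 3, might look as follows. The names `score_text`, `hit_bonus` and `weight` and their default values are made up for this example; the actual constants in the script may differ.

```python
def score_text(text_profile, language_profiles, hit_bonus=1.0, weight=0.01):
    """Rank language profiles against a text profile.

    text_profile: {ngram: score} for the text to be identified
    language_profiles: {language: {ngram: score}}
    Returns {language: score}, normalized so that the best match gets 100.
    """
    raw = {}
    for lang, profile in language_profiles.items():
        total = 0.0
        for ngram, text_score in text_profile.items():
            if ngram in profile:
                # Constant bonus for a shared ngram, plus a part that is
                # proportional to the ngram's score in the text profile.
                total += hit_bonus + weight * text_score
        raw[lang] = total

    best = max(raw.values(), default=0.0)
    if best == 0:
        return {lang: 0 for lang in raw}
    return {lang: round(100 * score / best) for lang, score in raw.items()}
```

Note that the language profile's own score for the ngram is deliberately not used in this sketch, in line with the remark that comparing the two scores did not help.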

Example

The following text is run through the language detection system:

Es ist Heute schönes Wetter. Ich glaube, daß der Frühling unterwegs ist.

The scores are as follows:

de score: 100
da score: 57
sv score: 30
es score: 21
en score: 13
is score: 6
fr score: 0
hu score: 0
sk score: 0

German (de) is detected as the most probable language, followed by Danish (da) and Swedish (sv). French (fr), Hungarian (hu) and Slovakian (sk) end up with a score of 0. This reflects how similar the languages are in their orthographic structure. The last three languages contain many diacritics that are not part of German, and hence the ngrams they share with it are few, if any. For an idea of how the languages look orthographically, here are the top 10 ngrams for the top three and bottom three languages:

de	cht icht sich eine ich sch werb der wer das
da	læs der artik rtik skat kke ere nde nsk det
sv	för att och det ätt rätt säg äge till äger
fr	pour rés pré ait our que pou eur dans iards
hu	cso csonk cson sonk onk cik cikk szó hog ócikk
sk	kelt ých pred ého tick kelti elti kelts ltsk eltsk

Considering that ngrams present in more than one language are given a score discount, it's interesting to note that German and Danish have “der” in common among their top ngrams, and Danish and Swedish “det”. However, please note that this observation is anecdotal and doesn't prove anything.

Problems

The script can only discriminate between the languages it has profiles for. If a text is written in a language that does not have a profile, the system has no way of saying so. It would be possible to implement some kind of threshold mechanism, but the problem is that some languages generally end up with higher scores than others for all texts. The scores for each language profile would first have to be normalized.

For example, the German language profile always ends up with a disproportionately high score for texts in most languages. The exact reason for this is not quite clear to me at the moment.

Another example of this disproportion is that a Norwegian language profile ended up with a higher score for a Swedish sentence than the Swedish language profile itself.
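The threshold idea, combined with a per-profile normalization, could look roughly like the sketch below. This is not implemented in the script. The `baselines` dictionary (a typical raw score per language profile, for instance measured on texts in other languages), the function name and the threshold value are all hypothetical.

```python
def detect_with_threshold(raw_scores, baselines, threshold=1.5):
    """Return the best-matching language, or None if no profile stands out.

    raw_scores: {language: unnormalized score for the text}
    baselines: {language: typical raw score of that profile on unrelated text}
    """
    # Normalize per profile, so that profiles which score high on
    # everything (German, in my case) do not win by default.
    adjusted = {
        lang: score / baselines[lang]
        for lang, score in raw_scores.items()
        if baselines.get(lang, 0) > 0
    }
    if not adjusted:
        return None
    best = max(adjusted, key=adjusted.get)
    # Only report a language if it clearly beats its own baseline.
    return best if adjusted[best] >= threshold else None
```

A return value of `None` would then mean "probably a language without a profile". Finding sensible baselines and a sensible threshold would again be a matter of trial and error.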

Conclusion

Using linguistic and/or statistical/mathematical theory to inform the implementation would probably yield better precision, but this shows that it is possible to create something that works moderately well using just trial and error.

Downloads

langdetect.tar.gz - Source code with language profiles

For copyright reasons I cannot provide any corpus for you, but you can just copy and paste texts for different languages if you want to build your own language profiles.

UPDATE 2010-08-11: I got an e-mail from someone who tried to use the script but ran into problems, probably because he was running too old a version of Python: the script uses the built-in sorted. I've tried it with Python 2.6, and that works.

19 comments:

Slavik said...: It seems that download link to langdetect.tar.gz is broken.; November 3, 2011 at 3:51 PM
Unknown said...: I am using the web service from http://www.whatlanguage.net. It integrates nicely with my Python code and outputs JSON or XML. It can detect 100+ languages and is as accurate as the Google web service.; March 2, 2013 at 10:17 AM

3/18/09