3/18/09

Automatic Language Identification using Python

I was playing around with the idea of a script that automatically detects the language of a text using ngrams. The idea was to use a sample corpus for each language to build language profiles. For a sentence whose language is to be detected, a profile consisting of ngrams with relative frequency scores is built and then compared to the existing language profiles. The output is a normalized ranking score for each language profile, with 100 being the score of the best match.

Since I am comparing ngrams, the comparison is done on an orthographic level.

I refrained from googling the idea, since I figured that it must already have been done and I wanted to work out how to do it myself. However, I know that Google does it much better.

Language Profile

The language profile is built using ngrams of three to five letters. Sample texts in the actual language, taken from random websites (news sites, Wikipedia etc.), are chopped up into all their possible ngrams, and the frequency of each ngram is counted and normalized (the score is 100 for the most frequent ngram). The top 200 ngrams are stored in a text file as a language profile.
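As a rough illustration of the profile-building step, here is a minimal sketch in Python 3. It is not the code from the download; the function names, the choice to lowercase the text and the choice to keep ngrams inside word boundaries are my assumptions for the example.

```python
import re
from collections import Counter


def extract_ngrams(text, min_n=3, max_n=5):
    """Yield every letter ngram of length min_n..max_n inside each word."""
    # Assumption: lowercase the text and never let an ngram span two words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    for word in words:
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                yield word[i:i + n]


def build_profile(text, top=200):
    """Return {ngram: score} for the `top` most frequent ngrams.

    Scores are normalized so that the most frequent ngram gets 100.
    """
    counts = Counter(extract_ngrams(text))
    if not counts:
        return {}
    most_common = counts.most_common(top)
    max_count = most_common[0][1]
    return {ngram: 100.0 * count / max_count for ngram, count in most_common}
```

Calling build_profile on a large sample text for one language gives a dictionary that could then be written to a text file, one ngram and score per line.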

When building the language profiles, the ngrams for all languages are calculated first. Each ngram is also given one point in a global ngram dictionary for each language in which it occurs. Each ngram then gets a discount that grows exponentially with the number of languages in which it occurs.

This approach has similarities to tf-idf score calculations.
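The discount step could be sketched roughly as follows. The exact formula is an assumption for illustration: here a score is divided by base raised to the number of additional languages sharing the ngram, with a hypothetical base of 2. The function name and the dictionary layout are also made up for the example.

```python
def apply_language_discount(profiles, base=2.0):
    """Discount ngrams that occur in several language profiles.

    profiles maps a language code to a {ngram: score} dictionary.
    Assumed formula: score / base ** (number_of_languages - 1).
    """
    # Global dictionary: one point per language in which the ngram occurs.
    language_count = {}
    for ngrams in profiles.values():
        for ngram in ngrams:
            language_count[ngram] = language_count.get(ngram, 0) + 1

    discounted = {}
    for lang, ngrams in profiles.items():
        discounted[lang] = {
            ngram: score / base ** (language_count[ngram] - 1)
            for ngram, score in ngrams.items()
        }
    return discounted
```

An ngram that is unique to one language keeps its full score, while one shared by three languages is divided by four, which is roughly the role the inverse document frequency plays in tf-idf.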

The Detection Process

When an example text is given to the program, a profile for that text is calculated and then compared to all the existing language profiles. The comparison is simple: a constant score is added for each ngram that occurs in both the language profile and the text profile. Additionally, a score proportional to the ngram's score in the text profile is added. Comparing the text and language profile scores for each ngram did not improve the detection precision.
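A minimal sketch of this comparison might look like the following. The constants match_bonus and weight are hypothetical values chosen for the example, not the ones used in the script, and the final step scales the results so that the best match gets 100.

```python
def score_text(text_profile, language_profiles, match_bonus=1.0, weight=0.01):
    """Rank language profiles against a text profile.

    text_profile is {ngram: score}; language_profiles maps a language
    code to its own {ngram: score} dictionary.
    """
    raw = {}
    for lang, profile in language_profiles.items():
        total = 0.0
        for ngram, text_score in text_profile.items():
            if ngram in profile:
                # Constant bonus for a shared ngram, plus a part that is
                # proportional to the ngram's score in the text profile.
                total += match_bonus + weight * text_score
        raw[lang] = total

    best = max(raw.values(), default=0.0)
    if best == 0:
        return {lang: 0 for lang in raw}
    # Normalize so that the best match gets a score of 100.
    return {lang: round(100 * score / best) for lang, score in raw.items()}
```

Note that the language profile's own score for the ngram is deliberately ignored here, in line with the observation that comparing the two scores did not help.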

Example

The following text is run through the language detection system:

Es ist Heute schönes Wetter. Ich glaube, daß der Frühling unterwegs ist.

The scores are as follows:

de score: 100
da score: 57
sv score: 30
es score: 21
en score: 13
is score: 6
fr score: 0
hu score: 0
sk score: 0

German (de) is detected as the most probable language, followed by Danish (da) and Swedish (sv). French (fr), Hungarian (hu) and Slovak (sk) end up with a score of 0. This reflects the similarities between the languages in their orthographic structure. The last three languages contain many diacritics that are not part of German orthography, and hence the similarities in ngrams are few, if any. For an idea of how the languages look orthographically, look at the top 10 ngrams for the top three and bottom three languages:


de     da      sv     fr      hu      sk
cht    læs     för    pour    cso     kelt
icht   der     att    rés     csonk   ých
sich   artik   och    pré     cson    pred
eine   rtik    det    ait     sonk    ého
ich    skat    ätt    our     onk     tick
sch    kke     rätt   que     cik     kelti
werb   ere     säg    pou     cikk    elti
der    nde     äge    eur     szó     kelts
wer    nsk     till   dans    hog     ltsk
das    det     äger   iards   ócikk   eltsk

Considering that ngrams that are present in more than one language are given a score discount, it's interesting to note that German and Danish have “der” in common among their top ngrams, and Danish and Swedish “det”. However, please note that this observation is anecdotal and doesn't prove anything.

Problems

The script can only discriminate between the languages it has profiles for. If a text is written in a language that does not have a profile, the system is not able to tell. It would be possible to implement some kind of threshold mechanism, but the problem is that some language profiles generally end up with higher scores than others for all texts. A normalization for each language profile would have to be done first.

For example, the German language profile always ends up with a disproportionately high score for texts in most languages. The exact reason for this is not quite clear to me at the moment.

Another example of this disproportion is that a Norwegian language profile ended up with a higher score for a Swedish sentence than the Swedish language profile itself.
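The threshold idea could be tried along these lines. This is only a sketch of something the script does not do: the baselines dictionary (a typical raw score each profile gets on texts in other languages), the threshold value and the function name are all assumptions, and the raw scores are meant to be the unscaled ones, before the best match is set to 100.

```python
def detect_with_threshold(raw_scores, baselines, threshold=1.5):
    """Return the best language, or None if no profile clearly matches.

    raw_scores maps a language code to its raw score for the text.
    baselines maps a language code to the score that profile typically
    gets on unrelated text (hypothetical calibration data).
    """
    # Per-profile normalization: how far above its usual level is each one?
    adjusted = {
        lang: score / baselines[lang]
        for lang, score in raw_scores.items()
        if baselines.get(lang, 0) > 0
    }
    if not adjusted:
        return None
    best = max(adjusted, key=adjusted.get)
    return best if adjusted[best] >= threshold else None
```

With such a normalization, a profile that scores high on everything (like the German one) would need a correspondingly higher raw score before it is accepted as the answer.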

Conclusion

Using linguistic and/or statistical/mathematical theory to inform the implementation would probably yield better precision, but this shows that it is possible to create something that works moderately well just using trial and error.

Downloads

langdetect.tar.gz - Source code with language profiles

For copyright reasons I cannot provide any corpus for you, but you can just copy and paste texts for different languages if you want to build your own language profiles.

UPDATE 2010-08-11: I got an e-mail from someone who tried to use the script but ran into problems, probably because he was running too old a version of Python: the script uses the built-in sorted. I've tried it with Python 2.6, and that works.


19 comments:

Slavik said...

It seems that download link to langdetect.tar.gz is broken.

Unknown said...

I am using the web service from http://www.whatlanguage.net. It integrates nicely with my Python code, as it outputs JSON or XML. It can detect 100+ languages and is as accurate as the Google web service.
