Cognitive Science and Coding

4/22/11

LyX and LaTeX

It's been said that one should never use MS Word to write a thesis, so I heeded that advice and started writing my thesis in LaTeX, using LyX since I'm on Windows. That's where the problems started.

Here's a list of problems I've encountered, and their solutions:

Escape _ in URLs and DOIs when using natbib in author-year mode

When using the natbib citation engine with the author-year style, for some reason you have to make sure that there are no unescaped underscores _ in URLs and DOIs, and probably other fields too. To escape an underscore, put a backslash \ in front of it, so for example

doi = {10.1007/978-3-540-72559-6_4}

should be

doi = {10.1007/978-3-540-72559-6\_4}
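If the .bib file has many entries, the escaping can be scripted. The following is only a minimal Python 3 sketch: the function names and the FIELDS tuple are made up for illustration, and it assumes that each doi/url field sits on one line and uses braces around the value.

```python
import re

# Fields to patch. The post names URLs and DOIs; extend the tuple if other
# fields turn out to cause the same problem (assumption, not tested here).
FIELDS = ("doi", "url")

# Matches e.g. `doi = {10.1007/..._4}` with an optional trailing comma.
# Assumes brace-delimited values that fit on a single line.
FIELD_RE = re.compile(
    r"^(\s*(?:%s)\s*=\s*\{)(.*)(\}\s*,?\s*)$" % "|".join(FIELDS),
    re.IGNORECASE,
)


def escape_underscores(line):
    """Escape bare underscores in a doi/url field line; leave other lines alone."""
    match = FIELD_RE.match(line)
    if not match:
        return line
    head, value, tail = match.groups()
    # The negative lookbehind skips underscores that are already escaped.
    value = re.sub(r"(?<!\\)_", r"\\_", value)
    return head + value + tail


def escape_bib_text(text):
    """Apply escape_underscores to every line of a .bib file's contents."""
    return "\n".join(escape_underscores(line) for line in text.split("\n"))
```

Running it twice is harmless, since underscores that already have a backslash in front are skipped. Keep a backup of the .bib file anyway.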

Hack your .lyx file to get author-year working with natbib

I kept getting (author?) in my files instead of a proper reference when using natbib in author-year mode, until I found out how it should be done:

Edit your .lyx file in a plain text editor, and find the entry that starts with:

\begin_inset CommandInset bibtex
LatexCommand bibtex
...
\end_inset

and change the option field to:

options "plainnat"

(At least for LyX 2.0 this works)
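The same edit can be done with a few lines of Python instead of a text editor. This is a hedged sketch, not an official LyX tool: set_bibtex_style is a hypothetical helper, and it assumes the inset already contains a line starting with options, as described above. If there is no such line, the text comes back unchanged.

```python
def set_bibtex_style(lyx_text, style="plainnat"):
    """Return lyx_text with the options line of every bibtex inset set to style."""
    out = []
    in_bibtex_inset = False
    for line in lyx_text.split("\n"):
        stripped = line.strip()
        if stripped == r"\begin_inset CommandInset bibtex":
            in_bibtex_inset = True
        elif stripped == r"\end_inset":
            in_bibtex_inset = False
        elif in_bibtex_inset and stripped.startswith("options "):
            # Only option lines inside the bibtex inset are rewritten.
            line = 'options "%s"' % style
        out.append(line)
    return "\n".join(out)
```

Read the .lyx file into a string, pass it through the function and write the result to a copy, so the original file stays intact if something goes wrong.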

It's sad that you have to go to such lengths to track down errors. If you wanna use LyX you need to learn some LaTeX too, unfortunately.

I figured this out by downloading a LyX thesis template and comparing the template with my file. The link to the template is: http://sites.google.com/site/lyxthesistemplate/

I hope that this information can help somebody struggling with the same hard-to-track-down errors.

3/18/09

Automatic Language Identification using Python

I was playing around with the idea of a script that automatically detects languages using ngrams. The idea was to use a sample corpus for each language to build language profiles. For a sentence whose language is to be detected, a profile consisting of ngrams with relative frequency scores is built and then compared to the existing language profiles. The output is a normalized ranking score for each language profile, with 100 being the score of the best match.

Since I am comparing ngrams, the comparison is done on an orthographic level.

I refrained from googling the idea, since I figured that it must already have been done and I wanted to figure out how to do it myself. However, I know that Google does it much better.

Language Profile

The language profile is built using ngrams of three to five letters. Sample texts taken from random websites (news sites, Wikipedia etc.) in the language in question are chopped up into all their possible ngrams, and the frequency of each ngram is counted and normalized (the score is 100 for the most frequent ngram). The top 200 ngrams are stored in a text file as a language profile.
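The profile building described above could look roughly like this in Python 3. It is a sketch and not the code from the download: the function names are invented, and cutting ngrams only inside words (never across spaces or punctuation) is an assumption.

```python
from collections import Counter


def extract_ngrams(text, n_min=3, n_max=5):
    """Yield every ngram of n_min..n_max letters found inside the words of text."""
    for word in text.lower().split():
        # Keep letters only, so punctuation never ends up inside an ngram.
        word = "".join(ch for ch in word if ch.isalpha())
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                yield word[i:i + n]


def build_profile(text, top=200):
    """Return {ngram: score} for the top most frequent ngrams.

    Scores are scaled so that the most frequent ngram gets 100.
    """
    counts = Counter(extract_ngrams(text))
    if not counts:
        return {}
    most_common = counts.most_common(top)
    best = most_common[0][1]
    return {ngram: 100.0 * count / best for ngram, count in most_common}
```

The same build_profile function can be used both for a language corpus and for the text whose language is to be detected.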

When building the language profiles, first all ngrams for all languages are calculated. Each ngram is also given a score of one in a global ngram dictionary for each language in which it occurs. Then each ngram gets a discount exponentially proportional to the number of languages in which it occurs.

This approach has similarities to tf-idf score calculations.
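The exact discount formula is not given here, so the following Python 3 sketch just picks one plausible form: the score is multiplied by decay once for every additional language that shares the ngram. The decay value of 0.5 and the function name are assumptions.

```python
def discount_shared_ngrams(profiles, decay=0.5):
    """Discount ngrams that occur in several languages.

    profiles maps a language code to {ngram: score}. A new dict of the same
    shape is returned; the input is not modified.
    """
    # Global dictionary: ngram -> number of languages in which it occurs.
    language_count = {}
    for profile in profiles.values():
        for ngram in profile:
            language_count[ngram] = language_count.get(ngram, 0) + 1

    # An ngram unique to one language keeps its score (decay ** 0 == 1).
    return {
        lang: {
            ngram: score * decay ** (language_count[ngram] - 1)
            for ngram, score in profile.items()
        }
        for lang, profile in profiles.items()
    }
```

With decay=0.5, an ngram found in two languages keeps half its score and one found in three languages keeps a quarter.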

The Detection Process

When an example text is given to the program, a language profile for that text is calculated and then compared to all the existing language profiles. The comparison simply adds a constant score for each ngram that occurs in both the language profile and the text profile. Additionally, a score proportional to the ngram's score in the text profile is added. Comparing the text and language profile scores for each ngram did not improve the detection precision.
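A hedged Python 3 sketch of that comparison follows. The constants hit_bonus and weight are made-up values (the real ones are not stated), and the function names are hypothetical. Profiles are plain {ngram: score} dicts.

```python
def match_score(text_profile, language_profile, hit_bonus=1.0, weight=0.01):
    """Raw score of one language profile against the text profile."""
    score = 0.0
    for ngram, text_score in text_profile.items():
        if ngram in language_profile:
            # Constant bonus for the hit, plus a part proportional to the
            # ngram's score in the text profile.
            score += hit_bonus + weight * text_score
    return score


def rank_languages(text_profile, language_profiles):
    """Return [(language, score)] sorted best first, with the best match at 100."""
    raw = {
        lang: match_score(text_profile, profile)
        for lang, profile in language_profiles.items()
    }
    best = max(raw.values()) if raw else 0.0
    if best == 0:
        # Nothing matched at all: every language gets 0.
        return [(lang, 0) for lang in sorted(raw)]
    ranked = sorted(raw.items(), key=lambda item: item[1], reverse=True)
    return [(lang, int(round(100 * value / best))) for lang, value in ranked]
```

Dividing by the best raw score gives the normalized ranking, so the top language always ends up at 100 whenever at least one ngram matched.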

Example

The following text is run through the language detection system:

Es ist Heute schönes Wetter. Ich glaube, daß der Frühling unterwegs ist.

The scores are as follows:

de score: 100
da score: 57
sv score: 30
es score: 21
en score: 13
is score: 6
fr score: 0
hu score: 0
sk score: 0

German (de) is detected as the most probable language, followed by Danish (da) and Swedish (sv). French (fr), Hungarian (hu) and Slovakian (sk) end up with a score of 0. This reflects how similar the languages are in their orthographic structure. The last three languages contain many diacritics that are not part of German, and hence the shared ngrams are few, if any. For an idea of how the languages look orthographically, look at the top 10 ngrams for the top three and bottom three languages:

de      da      sv      fr      hu      sk
cht     læs     för     pour    cso     kelt
icht    der     att     rés     csonk   ých
sich    artik   och     pré     cson    pred
eine    rtik    det     ait     sonk    ého
ich     skat    ätt     our     onk     tick
sch     kke     rätt    que     cik     kelti
werb    ere     säg     pou     cikk    elti
der     nde     äge     eur     szó     kelts
wer     nsk     till    dans    hog     ltsk
das     det     äger    iards   ócikk   eltsk

Considering that ngrams that are present in more than one language are given a score discount, it's interesting to note that German and Danish have “der” in common among their top ngrams, and Danish and Swedish “det”. However, please note that this observation is anecdotal and doesn't prove anything.

Problems

The script can only discriminate between the languages it has profiles for. If a text is written in a language that does not have a profile, the system cannot tell. It would be possible to implement some kind of threshold mechanism, but the problem is that some languages generally end up with higher scores than others for all texts. The scores for each language profile would first have to be normalized.

For example, the German language profile always ends up with a disproportionately high score for texts in most languages. The exact reason for this is not quite clear to me at the moment.

Another example of this disproportion is that a Norwegian language profile ended up with a higher score for a Swedish sentence than the Swedish language profile itself.

Conclusion

Using linguistic and/or statistical/mathematical theory to inform the implementation would probably yield better precision, but this shows that it is possible to create something that works moderately well just using trial and error.

Downloads

langdetect.tar.gz - Source code with language profiles

For copyright reasons I cannot provide any corpus for you, but you can just copy and paste texts for different languages if you want to build your own language profiles.

UPDATE 2010-08-11: I got an e-mail from a guy who tried to use it but ran into problems, probably because he was running too old a version of Python: the script uses the built-in sorted. I've tried it with Python 2.6 and that works.

4/12/07

Beware of The on* Attributes In The img Tag

Remember to always validate user input server-side, especially if you allow HTML in posts. I will use FCKeditor as an example of how a false sense of security might leave an application vulnerable.

I recently discovered a scary vulnerability on a site using FCKeditor. FCKeditor has this feature that lets you drag and drop an image into the edit field. That's neat, but all the attributes in the image tag are copied along. That means that if you drag and drop an image with, say, the onmouseout attribute set into the edit window, you can easily inject any JavaScript code you want. I won't publish any example code here, because that would only help the script kiddies.

FCKeditor generates XHTML snippets that are convenient to just publish on a forum, guestbook or whatever. I guess many people do. The problem is that the generated snippets are not safe.

Now, FCKeditor comes with server-side modules/scripts for various clients. I haven't scrutinized all of those scripts, but as far as I could tell, many (if not all) of them lacked server-side validation functions. So the problem is not really FCKeditor, but the lack of server-side validation on many sites. FCKeditor just lowers the effort needed to inject code.

The lesson learned is nothing new, but it needs to be repeated:

Input validation must always be done server-side. Don't trust the client to do that. It's easy to manipulate the data sent in by using plugins such as Tamper Data for Firefox. FCKeditor's drag-and-drop functionality just makes the problem easier to exploit.

Also, when you validate the HTML code, filter out all tags and attributes except those you trust. Don't build a filter based on filtering OUT tags you think are dangerous, because you might forget some, and new tags and attributes that are potential threats might be introduced in the future.
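To make the whitelist idea concrete, here is a minimal server-side sketch in Python 3, using only the standard library's html.parser. The ALLOWED table, the SAFE_SCHEMES tuple and the class and function names are illustrative assumptions, not part of FCKeditor. It illustrates the principle and is not a hardened sanitizer; for a real site a well-maintained sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag -> attributes that may be kept.
# Anything not listed here is dropped, including every on* attribute.
ALLOWED = {
    "p": set(), "br": set(), "b": set(), "i": set(),
    "em": set(), "strong": set(),
    "a": {"href"},
    "img": {"src", "alt", "width", "height"},
}
URL_ATTRS = {"href", "src"}
SAFE_SCHEMES = ("http://", "https://")  # relative URLs are dropped as well
VOID_TAGS = {"br", "img"}


class WhitelistSanitizer(HTMLParser):
    """Rebuilds the markup from scratch, keeping only whitelisted parts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            # URL attributes must start with an explicitly trusted scheme.
            if name in URL_ATTRS and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        closing = " /" if tag in VOID_TAGS else ""
        self.parts.append("<%s%s%s>" % (tag, "".join(kept), closing))

    def handle_endtag(self, tag):
        if tag in ALLOWED and tag not in VOID_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped, so it can never turn into markup again.
        self.parts.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text reduced to whitelisted tags and attributes."""
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Because the output is assembled only from what the whitelist permits, an attribute or tag introduced in the future is dropped by default instead of slipping through. The text inside a dropped tag is kept, but only as escaped plain text.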