Dictionaries research

To create a rich, extensible VN-EN dictionary (and dictionaries for other languages in the future), I researched the current open source landscape.

Dictionary formats (rich text, XML metalanguage):

We intend our dictionaries to be extensible by the community, wiki style. For that to work, the dictionaries must be well structured so that we can recognize units like NOUN, VERB, DEFINITIONS, EXAMPLES, etc.

XDXF (XML Dictionary Exchange Format) is one of the best and most well-structured formats. It lets the computer understand what each element is, so each one can be rendered with specific colors, sizes, etc. I hate seeing dictionaries in boring plain text all the time.
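To make the idea concrete, here is a minimal Python sketch of turning tagged elements into styled output. It assumes a simplified XDXF-like article using the tags `ar`, `k`, `gr`, `def` and `ex`; the `STYLES` table and the `render` function are hypothetical names invented for this illustration, and the real tag set should be checked against the XDXF specification.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical style table: XDXF-like tag name -> CSS style.
STYLES = {
    "k": "font-weight: bold; font-size: 1.2em",
    "gr": "color: green; font-style: italic",
    "def": "color: black",
    "ex": "color: gray",
}

def render(element):
    """Recursively turn an element into HTML spans styled by tag name."""
    parts = [escape(element.text or "")]
    for child in element:
        parts.append(render(child))
        parts.append(escape(child.tail or ""))
    inner = "".join(parts)
    style = STYLES.get(element.tag)
    if style is None:
        # Unknown or purely structural tags (such as <ar>) add no styling.
        return inner
    return f'<span style="{style}">{inner}</span>'

# Simplified sample article (an assumption, not a full XDXF document).
ARTICLE = (
    "<ar><k>play</k><gr>noun</gr>"
    "<def>the act of using a sword<ex>sword play</ex></def></ar>"
)

html_out = render(ET.fromstring(ARTICLE))
print(html_out)
```

Because the style is looked up by tag name, changing how every example or headword looks only requires editing the table, not the dictionary content.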

The second advantage is that communities can extend the dictionaries. Let’s look at this example:

play
A noun
  1 play, swordplay
    the act of using a sword (or other weapon) vigorously and skillfully
  2 play, child’s play
    play by children that is guided more by imagination than by fixed rules; “Freud believed in the utility of play to a small child”

Let’s say somebody wants to add the meaning of “play” when used as a verb. Instead of messing up the content by adding new lines, changing colours, etc., the dictionary software should provide the necessary tools (buttons) to let the user add new XML structures to the content.
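As a rough sketch of what such a button could do behind the scenes, the Python snippet below appends a new verb sense to the “play” article as a structured element. The markup is a simplified XDXF-like form (nested `def` groups with a `gr` grammar tag), which is an assumption for illustration; `add_sense` is a hypothetical helper, and the verb definition and example are sample text, not taken from a real dictionary.

```python
import xml.etree.ElementTree as ET

# Simplified XDXF-like article for "play" (noun senses only).
ARTICLE = """<ar><k>play</k>
<def><gr>noun</gr>
<def>the act of using a sword (or other weapon) vigorously and skillfully</def>
<def>play by children that is guided more by imagination than by fixed rules</def>
</def>
</ar>"""

def add_sense(article, part_of_speech, definition, example=None):
    """Append a new part-of-speech group holding one definition."""
    group = ET.SubElement(article, "def")
    ET.SubElement(group, "gr").text = part_of_speech
    sense = ET.SubElement(group, "def")
    sense.text = definition
    if example is not None:
        ET.SubElement(sense, "ex").text = example
    return group

article = ET.fromstring(ARTICLE)
add_sense(
    article,
    "verb",
    "participate in games or sport",
    example="We played hockey all afternoon",
)
print(ET.tostring(article, encoding="unicode"))
```

The user only supplies the part of speech, the definition and an optional example; the existing noun senses are left untouched and the new content stays machine-readable.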

StarDict format: the structure is not XML; everything is byte-coded. It can contain images, audio, XDXF, HTML and wiki links. I don’t really like it because it is tied to StarDict, and almost all the dictionaries are just plain text that doesn’t take advantage of these features.

Here is a list of other formats.


Dictionary engines (compression, memory usage):

DICT: widely used because StarDict files are compressed using this engine. Info, index and content are kept in separate files. The StarDict project has optimized the engine by adding some extra files (cache, collation, etc.).

Sdictionary: info, index and content are compressed in the same file. There are tools to create files from HTML content.

Both engines keep memory usage to a minimum, because only the index is loaded into memory and only the requested content is read from the compressed file.
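The general technique can be sketched in a few lines of Python. This is not the actual DICT or Sdictionary file layout; it is a simplified illustration that assumes each article is compressed separately with zlib, while an in-memory index maps each word to the offset and length of its compressed article. The names `build` and `lookup` are hypothetical.

```python
import io
import zlib

def build(entries):
    """Return (index, blob); index maps word -> (offset, length)."""
    index = {}
    blob = io.BytesIO()
    for word, article in entries.items():
        data = zlib.compress(article.encode("utf-8"))
        index[word] = (blob.tell(), len(data))
        blob.write(data)
    return index, blob.getvalue()

def lookup(index, content_file, word):
    """Read and decompress only the requested article."""
    entry = index.get(word)
    if entry is None:
        return None
    offset, length = entry
    content_file.seek(offset)
    return zlib.decompress(content_file.read(length)).decode("utf-8")

# Sample entries (illustrative content only).
index, blob = build({
    "play": "noun: the act of using a sword",
    "chơi": "to play",
})

# BytesIO stands in for a content file opened on disk.
content_file = io.BytesIO(blob)
print(lookup(index, content_file, "chơi"))
```

Only the small `index` dictionary has to stay in memory; each lookup seeks to one article and decompresses just those bytes, which is why memory usage stays low even for large dictionaries.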


Open source dictionary applications:

| Name | Language | Status | Features |
|---|---|---|---|
| STARDICT | C / GTK | Reactivated in 2007 (2 developers) | Many dictionaries, plain-text format. Supports the DICT, Sdictionary and Babylon formats. |
| JALINGO | Java | Inactive since 2006 (we’ll reactivate it) | Very nice interface. Supports multisearch. Formats: Sdictionary, MOVA, etc. (no DICT yet). Supports rich text. |
| SDIQT | Python / Qt | Inactive since 2006 | Interface is not nice. Only supports the Sdictionary format. |
| KTRANSLATOR | C++ / Qt | Inactive since 2006 | StarDict, FreeDict and DICT formats. KDE component. |
| QSTARDICT | C / Qt4 | Active (1 developer) | A clone of StarDict using Qt4. |

Our decision: unless somebody changes their mind, I think we’ll go with the XDXF / Sdictionary / Jalingo combination to create the VN-EN project.
