[HOWTO] Create Gingerbread Keyboard Dictionary for your Language (Croatian example)
I really liked the new Gingerbread keyboard, but it bugged me that it was missing a Croatian dictionary, so I tried to figure out how to create one.
I made one, and it's working great, so I decided to share the procedure here so others can make use of it.
Here it goes...
What you need:
1. Good source for word list frequency
A good prediction dictionary relies on word frequencies, as defined by the AOSP dictionary format.
So, you need a source from which you can extract how often different words appear. After some thinking, googling, and trial and error, I came to the conclusion that for smartphone usage there is no better source than a big national forum. That's what I used, anyway.
2. OpenOffice (and MS Office) dictionary for your language
You can find it here:
You don't want misspelled words in the dictionary, right? So, after creating the word list from your source, you'll want to throw out the words that are not in this list.
Just to be sure I'd keep all the 'good' words in the list, I also ran the MS Office spell check through it. I'll explain that later on.
3. Tools - GNU Utilities, MS Office, Ultraedit, Wget (HTTrack)...
There are no more powerful tools for stream editing than Unix tools. Period.
At first I tried to do it without them, and once I learned a bit about them I realized how great they are for a task like this. Get them here:
Windows comes with its own 'sort' command, but you'll want the one from the GNU utilities, so put it in the directory you run your commands from.
You'll need to download that forum that I mentioned earlier somehow. I used wget:
It was pretty slow (took like two days to mirror the part of the forum with posts). When I was near the end of the download I learned about HTTrack:
I tried it out briefly and it seems a lot faster (it can do multiple connections!).
Get it here:
For Windows, you need makedict_Windows.bat and makedict.jar
I don't have experience with HTML, so at first I had to study how the vBulletin forum I was targeting is structured. I wanted to download just the pages that contain posts, and not member lists etc. In the end I came up with this wget syntax:
wget -k -m -E -p -np -R member.php*,memberlist.php*,calendar.php*,faq.php*,printthread.php*,newreply.php*,search.php*,*sendtofriend*,sendmessage.php*,*goto=nextnewest*,misc.php*,forumdisplay.php*,showpost.php*,announcement.php*,image.php*,viewonline.php*,showthread.php*mode*,showthread.php*s=*,showthread.php*page* -o log.txt http://xxxxxxx.hr/
I'm not sure if the syntax is entirely correct, but it worked for me, so I never looked back. wget downloaded only the stuff I wanted: thread pages from the forum. It took a long time to collect 9 GB of data. Have a look at HTTrack; I think it can do it much faster.
Now you want to extract only the message text from the HTML:
cat showthread* | sed -n "/<!-- message -->/,/<!-- \/ message -->/p" > forum0.txt
Check out what you got. You don't want the quotes included, because they would pump up the counts for words that appear in them, so strip those out too:
sed "s/<[^:]*said://g" forum0.txt > forum1.txt
Finally, strip out the rest of HTML code:
sed -e "s/<[^>]*>//g" forum1.txt > forum2.txt
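By the way, before letting this loose on 9 GB of data, you can sanity-check the three sed steps on a tiny fake page. Everything below is made up; only the message markers are the real vBulletin ones used above:

```shell
# Fake one-post page to check the extract / de-quote / strip-tags steps
cat > sample.html <<'EOF'
<div>header junk</div>
<!-- message -->
<div class="quote">user1 said:quoted text here</div>
Hello <b>world</b>
<!-- / message -->
<div>footer junk</div>
EOF

sed -n "/<!-- message -->/,/<!-- \/ message -->/p" sample.html \
  | sed "s/<[^:]*said://g" \
  | sed "s/<[^>]*>//g"
```

The header and footer lines and all the HTML tags should be gone from the output.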
I noticed that I had some leftover Croatian characters represented by their HTML entity codes, so I replaced those too:
cat forum2.txt | sed "s/&#353;/š/g" | sed "s/&#273;/đ/g" | sed "s/&#269;/č/g" | sed "s/&#263;/ć/g" | sed "s/&#382;/ž/g" | sed "s/&#352;/Š/g" | sed "s/&#272;/Đ/g" | sed "s/&#268;/Č/g" | sed "s/&#262;/Ć/g" | sed "s/&#381;/Ž/g" > forum.txt
Found the codes here:
Now you can start building your word list by throwing out everything but the words
cat forum.txt | tr "[:punct:][:blank:][:digit:]" "\n" | grep "^." > unsortedallwordslist.txt
and counting how often they appear
cat unsortedallwordslist.txt | tr "A-Z" "a-z" | tr "ŠĐČĆŽ" "šđčćž" | sort | uniq -c | sort -nr > words.txt
I got around 205.000 counted words after this.
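If you want to check the tokenizing and counting on something small first, here's a toy run (the sample file and its contents are invented):

```shell
# One made-up sentence to sanity-check the tokenize-and-count pipeline
printf 'Pas i mačka. Pas laje, mačka prede; pas spava.\n' > sample.txt

# split on punctuation/blanks/digits, drop empty lines
tr "[:punct:][:blank:][:digit:]" "\n" < sample.txt | grep "^." > unsorted_sample.txt

# lowercase, then count and sort by frequency, most common first
tr "A-Z" "a-z" < unsorted_sample.txt | sort | uniq -c | sort -nr
```

The top line should be 'pas' with a count of 3.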
Now that you have it all nicely counted and sorted, you want to throw out the misspelled and incorrect words. I used Excel for that. But first, I took the OpenOffice word list (you can simply unzip the .oxt file) and cleaned it up a bit.
First, you need it in the correct Windows encoding. UltraEdit can do it. In my case I had to convert from ISO-8859-2 to Win-1250: open the ISO-8859-2 document, go to View > Set Code Page and choose ISO-8859-2, then go to File > Conversions and choose ASCII to Unicode. You will then see all the characters correctly, but when you want to save the edited text you must convert it back, so choose Unicode to ASCII and save it. That's it.
It also had suffixes such as "/AE" here and there, so I removed those too
sed "s/\/[A-Z]*//g" hr_HR.dic > hr_HR.txt
and made it all lowercase
cat hr_HR.txt | tr "[A-Z]" "[a-z]" | tr "[ŠĐČĆŽ]" "[šđčćž]" > hr_big.txt
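Here's a quick toy check of both cleanup steps together (file name and entries are invented):

```shell
# Made-up three-entry dictionary: strip the "/XX" suffixes, then lowercase
printf 'Auto/AE\nKuća\nPas/N\n' > hr_sample.dic
sed "s/\/[A-Z]*//g" hr_sample.dic | tr "[A-Z]" "[a-z]"
```

That should leave just 'auto', 'kuća', and 'pas', one per line.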
Now I imported both lists into Excel and simply checked whether the words from my forum word list are correct, by looking them up in the OpenOffice dictionary.
After that's finished, copy just the values into a new column and delete the column with the formulas, so it doesn't recalculate everything again. Sort by the new values and keep the words that passed this 'spell check' (I got around 90.000 words in this step).
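If you'd rather skip the Excel round trip for this lookup, the same membership check can be done in the shell with awk. This is just a sketch with made-up sample files (first file = dictionary words, one per line; second = the 'count word' lines from the sorted list):

```shell
# Toy inputs: a cleaned dictionary and a counted forum word list
printf 'auto\nkuća\npas\n' > dict_sample.txt
printf '     12 pas\n      5 kuuća\n      3 auto\n' > counts_sample.txt

# keep only the counted words whose word column appears in the dictionary
awk 'NR==FNR { ok[$1]=1; next } ok[$2]' dict_sample.txt counts_sample.txt
```

The misspelled 'kuuća' line should be dropped, and 'pas' and 'auto' kept with their counts.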
Cut & paste the rows that have zeros in them into a new worksheet. I wanted to compare those against the MS dictionary, so I wouldn't throw out anything just because it isn't in the OpenOffice dictionary. Here is the function I used:
Public Function SpellCheck(rng As Excel.Range) As Boolean()
    ' Returns one True/False spell-check result per cell in rng.
    ' Note: creates a new Excel instance on every call, which is slow.
    Dim i As Long, size As Long
    Dim objExcel As New Excel.Application
    Dim result() As Boolean
    size = rng.Cells.Count
    ReDim result(1 To size)
    For i = 1 To size
        result(i) = objExcel.CheckSpelling(rng.Cells(i).Text)
    Next i
    SpellCheck = result()
End Function
The function I found was originally written to act as an array function, but I never managed to get that to work. It did work as a normal function, though, so I just invoked it that way.
This took a reaaaally loooong time. Again, copy the values into a new column and delete the column with the formulas so it won't recalculate. Delete the rows with 'FALSE' in them. Check through the rest and clean it up a bit - the MS spell check can act funny sometimes. I got another 20.000 words from this list that weren't recognized by the OpenOffice spell check, merged the two lists, and the final word count for the dictionary was around 110.000. I believe that's optimal, maybe a little on the big side, but the final main.dict is just under 900 kB, which is more than acceptable.
Now you have to distribute the frequencies into 255 classes for the Gingerbread prediction engine. You could do it just by dividing every count by a factor (the top word count divided by 255), but look at a scatter plot of the counts and you'll notice that a linear mapping burns through the top classes very fast. So I optimized the distribution a bit in a separate calculation: I set the word count in the top class to 1 and calculated the rest with the formula nextclasswordcount = previousclasswordcount * factor^4, using Excel Solver to find the factor. The total word count had to match the original (in my case 110.000), obviously. I even corrected it a bit, so that the sum of the new distribution is only 70.000 (1 word in the first class and 2000 in the last); that smooths out the distribution nicely for the more frequent words and lets the remaining 40.000 fall into class 1. It took some tweaking, and you could probably use a better formula, but this worked much better for me than dividing everything by the same factor.
I had this calculation in two separate rows and returned the classes back next to the words:
Maybe it's best to d/l xlsm from here
so I don't have to explain a lot...
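If you don't have Excel handy, here's a rough shell-only alternative sketched with awk: a simple log-scale mapping from counts to classes 1-255. It is my own simplification, not the Solver-tuned distribution from the spreadsheet, but it has the same general shape of not burning through the top classes (toy data, invented file name):

```shell
# counts_toy.txt: "count word" pairs, sorted descending (made-up sample)
printf '1000 pas\n100 kuća\n10 auto\n1 xyz\n' > counts_toy.txt

# class = 1 + 254 * log(count/min) / log(max/min), rounded down
awk 'NR==1 { max=$1 }
     { cnt[NR]=$1; word[NR]=$2; min=$1; n=NR }
     END { for (i=1; i<=n; i++) {
             f = (max==min) ? 255 : int(1 + 254*log(cnt[i]/min)/log(max/min))
             print f, word[i] } }' counts_toy.txt
```

On the toy data this maps 1000 -> 255, 100 -> 170, 10 -> 85, and 1 -> 1.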
The rest is easy. Create the string needed for the correct XML format: wrap the word list with "<wordlist>" in the first row and "</wordlist>" in the last row, add "<?xml version="1.0" encoding="UTF-8" ?>" at the top, and finally compile the .dict file:
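The wrapping step can be scripted too. Here's a sketch with made-up file names (from_sample.xml stands in for the real from.xml); the <w f="...">word</w> element is, as far as I can tell, what makedict expects in its XML source:

```shell
# freq_sample.txt: "class word" pairs (toy data); build the makedict XML source
printf '255 pas\n170 kuća\n' > freq_sample.txt

{
  echo '<?xml version="1.0" encoding="UTF-8" ?>'
  echo '<wordlist>'
  awk '{ printf "  <w f=\"%s\">%s</w>\n", $1, $2 }' freq_sample.txt
  echo '</wordlist>'
} > from_sample.xml

cat from_sample.xml
```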
makedict_Windows.bat from.xml > main.dict
Phew! A lot of typing.
And here is the LatinIME.apk with the Croatian layout and dictionary that I got this way:
I used mobilix's layout (thank you!) with just a few corrections of my own (fixed the '&' glitches in the symbol keyboard).
And here are the resources I got my inspiration from (thanks, Gert Schepens):
So much for now. Enjoy.
Now, the next step would be to try to get my work included in the official AOSP or maybe the CyanogenMod source. I could use a few pointers on how to do that; I'd prefer to keep it simple. I registered on GitHub, but that's as far as I've got. I have to do some more reading about it...