I made it, and it's working great so I decided to share the procedure here so others can make use of it.
Here it goes...
What you need:
1. Good source for word list frequency
A good prediction dictionary relies on word frequency, as defined by the AOSP.
So, you need a source from which you can extract how often different words appear. After some thinking, googling, and trial and error, I came to the conclusion that for smartphone usage there is no better place than a big national forum. That's what I used, anyway.
2. OpenOffice (and MS Office) dictionary for your language
You can find it here:
You don't want to have misspelled words in the dictionary, right? So, after creating the word list from the source, you'll want to throw out the words that are not in this list.
Just to be sure that I keep all the 'good' words in the list, I also ran the MS Office spelling check over it. I'll explain that later on.
3. Tools - GNU Utilities, MS Office, Ultraedit, Wget (HTTrack)...
There are no more powerful tools for stream editing than Unix tools. Period.
At first I tried to do without them, but once I learned a bit about them I realized how great they are for a task like this. Get them here:
Windows comes with its own 'sort' command, but you'll want to use the one from the GNU utilities, so put it in the directory you run your commands from.
You'll need to download that forum that I mentioned earlier somehow. I used wget:
It was pretty slow (it took about two days to mirror the part of the forum with posts). When I was near the end of the download I learned about HTTrack:
I tried it out briefly and it seems a lot faster (it can use multiple connections!)
Get it here:
For Windows, you need makedict_Windows.bat and makedict.jar
I don't have experience with HTML, so at first I had to study how the vBulletin forum I was aiming at is structured. I wanted to download just the pages that contain posts, and not member lists etc. In the end I came up with this syntax for wget:
wget -k -m -E -p -np -R member.php*,memberlist.php*,calendar.php*,faq.php*,printthread.php*,newreply.php*,search.php*,*sendtofriend*,sendmessage.php*,*goto=nextnewest*,misc.php*,forumdisplay.php*,showpost.php*,announcement.php*,image.php*,viewonline.php*,showthread.php*mode*,showthread.php*s=*,showthread.php*page* -o log.txt http://xxxxxxx.hr/
Now you want to extract only the message text from the HTML:
cat showthread* | sed -n "/<!-- message -->/,/<!-- \/ message -->/p" > forum0.txt
sed "s/<[^:]*said://g" forum0.txt > forum1.txt
sed -e "s/<[^>]*>//g" forum1.txt > forum2.txt
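The last step turns the HTML character entities for Croatian letters back into the letters themselves (numeric entities are assumed here; adjust the patterns to whatever your forum emits):
cat forum2.txt | sed "s/&#353;/š/g" | sed "s/&#273;/đ/g" | sed "s/&#269;/č/g" | sed "s/&#263;/ć/g" | sed "s/&#382;/ž/g" | sed "s/&#352;/Š/g" | sed "s/&#272;/Đ/g" | sed "s/&#268;/Č/g" | sed "s/&#262;/Ć/g" | sed "s/&#381;/Ž/g" > forum.txt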
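For anyone who would rather not chain sed calls, the same extraction can be sketched in Python. This is only a rough equivalent, not what I actually ran: the function name extract_messages is made up for illustration, the message markers are taken from the sed command above, and the quote-header cleanup ("... said:") is left out. html.unescape decodes every entity at once, so no per-letter replacements are needed.

```python
import html
import re

# Markers taken from the sed command above; the forum wraps each post body in them.
MESSAGE_RE = re.compile(r"<!-- message -->(.*?)<!-- / message -->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def extract_messages(page):
    """Return the plain text of every post found in one saved thread page."""
    posts = []
    for body in MESSAGE_RE.findall(page):
        text = TAG_RE.sub(" ", body)   # drop markup but keep word boundaries
        text = html.unescape(text)     # &#353; -> š, &amp; -> &, and so on
        posts.append(" ".join(text.split()))
    return posts
```

Tags are replaced with a space rather than with nothing, so that words separated only by a tag such as a line break do not get glued together.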
Now you can start making your word list by throwing out everything but words:
cat forum.txt | tr "[:punct:][:blank:][:digit:]" "\n" | grep "^." > unsortedallwordslist.txt
cat unsortedallwordslist.txt | tr "A-Z" "a-z" | tr "ŠĐČĆŽ" "šđčćž" | sort | uniq -c | sort -nr > words.txt
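The split, lowercase, count and sort steps above can also be sketched in a few lines of Python. Treat it as an illustration under assumptions: count_words is a name I made up, and the regular expression keeps runs of letters only, which is close to (but not exactly) what the tr character classes do.

```python
import re
from collections import Counter

# Runs of letters only: word characters minus digits and underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Python's lower() already handles Š, Đ, Č, Ć and Ž, so the separate tr call for the Croatian capitals has no counterpart here.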
Now that you have it all nicely counted and sorted, you want to throw out misspelled and incorrect words. I used Excel for that. But first, I took the OpenOffice word list (you can simply unzip the .oxt file) and cleaned it up a bit.
First, you need it in the correct Windows encoding. UltraEdit can do it. In my case I had to convert from iso-8859-2 to win-1250. Open the iso-8859-2 document, go to "view/set code page" and choose "iso-8859-2", then go to "file/conversions" and choose ASCII to UNICODE; now you will see all characters correctly. When you want to save the edited text you must convert it back, so choose UNICODE to ASCII and save it. That's it.
Also, it had suffixes such as "/AE" here and there, so I removed those too:
sed "s/\/[A-Z]*//g" hr_HR.dic > hr_HR.txt
cat hr_big.dic | tr "[A-Z]" "[a-z]" | tr "[ŠĐČĆŽ]" "[šđčćž]"
After that's finished, copy just the values into a new column and delete the column with formulas, so it doesn't go through it again. Sort by the new values you got and keep the ones that passed this 'spell check' (I got around 90,000 words in this step).
Cut and paste the rows that have zeros in them into a new worksheet. I wanted to compare those with the MS dictionary so that I don't throw out anything just because it is not in the OpenOffice dictionary. Here is the function I used:
Public Function SpellCheck(rng As Excel.Range) As Boolean()
    Dim i As Long, size As Long
    Dim objExcel As New Excel.Application
    Dim result() As Boolean
    size = rng.Cells.Count
    ReDim result(1 To size)
    For i = 1 To size
        result(i) = objExcel.CheckSpelling(rng.Cells(i).Text)
    Next i
    SpellCheck = result()
    objExcel.Quit
End Function
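The first pass of the filtering (keep what is in the OpenOffice list, set the rest aside for the MS Office check) does not strictly need Excel. Here is a minimal Python sketch of that idea; the function names are hypothetical, and it assumes the .dic file follows the usual Hunspell layout, with an entry count on the first line and optional "/FLAGS" after each word.

```python
def load_dic_words(lines):
    """Strip affix flags such as "/AE" and lowercase every dictionary entry."""
    words = set()
    for line in lines:
        entry = line.strip().split("/", 1)[0].lower()
        # The first line of a .dic file is just the number of entries; skip it.
        if entry and not entry.isdigit():
            words.add(entry)
    return words


def split_by_dictionary(counted, known):
    """Split (word, count) pairs into those found in `known` and the rest."""
    accepted = [(w, c) for w, c in counted if w in known]
    rejected = [(w, c) for w, c in counted if w not in known]
    return accepted, rejected
```

The rejected list corresponds to the rows with zeros that I moved to a separate worksheet and ran through the MS Office spell check.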
Now you have to distribute the frequencies into 255 classes for the Gingerbread prediction engine. You could do it simply by dividing every count by a factor you get by dividing the top word count by 255. But look at a scatter plot of the result and you'll notice that you use up the top classes very quickly that way. So I optimized the distribution a bit in a separate calculation. I set the word count in the top class to 1 and calculated the rest using the formula "nextclasswordcount=previousclasswordcount*factor^4". I used Excel Solver to find the factor. The total word count had to match the original word count (in my case 110,000), obviously. I even adjusted it a bit so that the sum in the new distribution is only 70,000 (1 in the first class and 2,000 in the last); that smooths out the distribution nicely for the more frequent words and lets the remaining 40,000 fall into class "1". It took some tweaking and maybe you could use a better formula, but this worked much better for me than just dividing by the same factor.
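To make the idea concrete, here is a rough Python sketch of such a distribution. It is not my spreadsheet: the function names are invented, a single growth ratio per class stands in for the factor^4 term, and plain bisection stands in for Excel Solver. Class sizes start at 1 word for the top class and grow geometrically until they add up to the target total; any words beyond that total land in class 1.

```python
def class_sizes(n_classes, target_total):
    """Geometric class sizes (top class first) that sum to about target_total."""
    def total(ratio):
        return sum(ratio ** k for k in range(n_classes))

    lo, hi = 1.0, 2.0
    while total(hi) < target_total:   # widen the bracket if 2.0 is too small
        hi *= 2
    for _ in range(100):              # bisection on the growth ratio
        mid = (lo + hi) / 2
        if total(mid) < target_total:
            lo = mid
        else:
            hi = mid
    ratio = (lo + hi) / 2
    return [ratio ** k for k in range(n_classes)]


def assign_classes(ranked_words, n_classes=255, target_total=70000):
    """Map words (most frequent first) to classes n_classes down to 1."""
    sizes = class_sizes(n_classes, target_total)
    result = {}
    index = 0                # 0 is the top class
    boundary = sizes[0]      # rank at which the current class is full
    for rank, word in enumerate(ranked_words):
        while rank >= boundary and index < n_classes - 1:
            index += 1
            boundary += sizes[index]
        result[word] = n_classes - index
    return result
```

With 255 classes and a total of 70,000 the last class comes out at roughly 2,000 words, which is in line with the numbers above.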
I kept this calculation in two separate rows and brought the classes back next to the words:
The rest is easy. Create the string needed for the correct XML format, save it as from.xml and run makedict:
makedict_Windows.bat from.xml > main.dict
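If you have the words and classes outside Excel, the XML input can be generated with a short script. The sketch below assumes that makedict expects a wordlist root element with one w element per word and the class in its f attribute; I believe that is what the Gingerbread tool reads, but check it against the makedict version you use. The function name is made up.

```python
from xml.sax.saxutils import escape


def to_wordlist_xml(word_classes):
    """Render (word, class) pairs as <w f="class">word</w> lines inside <wordlist>."""
    lines = ["<wordlist>"]
    for word, freq in word_classes:
        # escape() protects &, < and > in case any slipped through the cleanup
        lines.append('  <w f="%d">%s</w>' % (freq, escape(word)))
    lines.append("</wordlist>")
    return "\n".join(lines) + "\n"
```

Write the returned string to from.xml (UTF-8 is assumed here) before running the command above.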
And here is the LatinIME.apk with the Croatian layout and the dictionary that I got this way:
I used mobilix's layout (thank you!) with just a few corrections of my own (fixed the & glitches in the symbols keyboard).
And here are the resources I got my inspiration from (thanks, Gert Schepens):
So much for now. Enjoy.
Now, the next step would be to try to get my work included in the official AOSP or maybe the Cyanogen source. I could use a few pointers on how to do that. I would prefer to keep it simple. I registered on GitHub, but that's as far as I've got for now. I have to do some more reading about it...