
[HOWTO] Create Gingerbread Keyboard Dictionary for your Language (Croatian example)

OP navdra

8th April 2011, 11:44 AM   |  #1  
OP Senior Member
Thanks Meter: 31
 
118 posts
Join Date:Joined: Jun 2010
I really liked the new Gingerbread keyboard, but it bugged me that it had no Croatian dictionary, so I tried to figure out how to create one.
It worked, and the result is working great, so I decided to share the procedure here so others can make use of it.

Here it goes...


What you need:

1. A good source for word frequencies
A good prediction dictionary relies on word frequencies, as defined by the AOSP:
http://android.git.kernel.org/?p=pla...275504;hb=HEAD
So you need a source from which you can extract how often different words appear. After some thinking, googling, and trial and error I came to the conclusion that for smartphone usage there is no better source than a big national forum. That's what I used, anyway.

2. OpenOffice (and MS Office) dictionary for your language
You can find it here:
http://extensions.services.openoffic...n/dictionaries
You don't want misspelled words in the dictionary, right? So, after creating the word list from your source, you'll want to throw out the words that are not in this dictionary.
Just to be sure that I kept all the 'good' words, I also ran the MS Office spell checker over the list. I'll explain that later on.

3. Tools - GNU Utilities, MS Office, UltraEdit, Wget (HTTrack)...
There are no more powerful tools for stream editing than Unix tools. Period.
At first I tried to manage without them, but once I learned a bit about them I realized how great they are for a task like this. Get them here:
http://sourceforge.net/projects/unxutils
Windows comes with its own 'sort' command, but you'll want to use the one from the GNU utilities, so put it in the directory you run your commands from.
You'll also need some way to download the forum I mentioned earlier. I used wget:
http://www.gnu.org/software/wget/
It was pretty slow (it took about two days to mirror the part of the forum that contains posts). When I was near the end of the download I learned about HTTrack:
http://www.httrack.com
I tried it briefly and it seems a lot faster (it can use multiple connections!)

4. Makedict
Get it here:
http://softkeyboard.googlecode.com/s...ctionaryTools/
For Windows, you need makedict_Windows.bat and makedict.jar


PROCEDURE:
I don't have experience with HTML, so first I had to study how the vBulletin forum I was targeting is structured. I wanted to download just the pages that contain posts, and not member lists etc. In the end I came up with this wget command:
Code:
wget -k -m -E -p -np -R member.php*,memberlist.php*,calendar.php*,faq.php*,printthread.php*,newreply.php*,search.php*,*sendtofriend*,sendmessage.php*,*goto=nextnewest*,misc.php*,forumdisplay.php*,showpost.php*,announcement.php*,image.php*,viewonline.php*,showthread.php*mode*,showthread.php*s=*,showthread.php*page* -o log.txt http://xxxxxxx.hr/
I'm not sure the syntax is entirely correct, but it worked for me, so I never looked back. Wget downloaded only the stuff I wanted: thread pages from the forum. It took a long time to collect 9 GB of data. Have a look at HTTrack; I think it can do it much faster.
Now you want to extract only the message text from the HTML:
Code:
cat showthread* | sed -n "/<!-- message -->/,/<!-- \/ message -->/p" > forum0.txt
Check what you got. You don't want quoted posts included, because they would inflate the count for the words that appear in them, so strip those out too:
Code:
sed "s/<[^:]*said://g" forum0.txt > forum1.txt
Finally, strip out the rest of HTML code:
Code:
sed -e "s/<[^>]*>//g" forum1.txt > forum2.txt
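For anyone who would rather script this step, here is a minimal Python sketch of the same idea: keep only what sits between the `<!-- message -->` and `<!-- / message -->` markers, then drop the remaining tags. The function name and the sample page are made up for illustration, and unlike the sed command it replaces each tag with a space so that words separated only by a tag do not get glued together. It does not try to remove quoted posts.

```python
import re

# Markers taken from the sed command above; everything else is illustrative.
MESSAGE_RE = re.compile(r"<!-- message -->(.*?)<!-- / message -->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def extract_messages(page_html):
    """Return the plain text of every post found on one thread page."""
    posts = []
    for body in MESSAGE_RE.findall(page_html):
        text = TAG_RE.sub(" ", body)        # replace tags with a space
        posts.append(" ".join(text.split()))  # collapse runs of whitespace
    return posts


sample = (
    "<html><body>menu"
    "<!-- message --><div>Dobar <b>dan</b> svima</div><!-- / message -->"
    "footer"
    "<!-- message -->Hvala<br />lijepa<!-- / message -->"
    "</body></html>"
)
print(extract_messages(sample))
```

In practice you would loop over the downloaded showthread files, read each one with the forum's encoding, and append the returned posts to one big text file.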
I noticed some leftover Croatian characters represented by their Unicode codes, so I replaced those too:
Code:
cat forum2.txt | sed "s/&#353;/š/g" | sed "s/&#273;/đ/g" | sed "s/&#269;/č/g" | sed "s/&#263;/ć/g" | sed "s/&#382;/ž/g" | sed "s/&#352;/Š/g" | sed "s/&#272;/Đ/g" | sed "s/&#268;/Č/g" | sed "s/&#262;/Ć/g" | sed "s/&#381;/Ž/g" > forum.txt
Found the codes here:
http://yorktown.cbe.wwu.edu/sandvig/docs/unicode.aspx
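If you script this in Python instead, there is no need to look up the codes one by one. As a sketch: the standard library function `html.unescape` decodes every numeric character reference in one pass. Note that it also decodes named entities such as `&amp;`, which may or may not be what you want at this stage. The wrapper function name and the sample string are only illustrative.

```python
import html


def decode_entities(text):
    """Turn numeric references like '&#273;' into the real characters."""
    return html.unescape(text)


encoded = "&#269;etvrtak, gra&#273;anin, ku&#263;a, &#353;uma, &#382;ivot"
print(decode_entities(encoded))
```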
Now you can start to make your word list by throwing out all but words
Code:
cat forum.txt | tr "[:punct:][:blank:][:digit:]" "\n" | grep "^." > unsortedallwordslist.txt
and counting how often they appear
Code:
cat unsortedallwordslist.txt | tr "A-Z" "a-z" | tr "ŠĐČĆŽ" "šđčćž" | sort | uniq -c | sort -nr  > words.txt
I got around 205.000 counted words after this.
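The split-and-count step can also be sketched in a few lines of Python, which sidesteps the separate lowercase mapping for the Croatian letters because `str.lower()` handles them. This is only an illustration under my own assumptions: the regular expression treats any run of letters as a word, and the function name and sample sentence are invented.

```python
import re
from collections import Counter

# One or more letters; digits, underscores and punctuation act as separators.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text):
    """Count how often each lowercased word appears in the text."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


counts = count_words("Dobar dan! Dan je dobar, a ŠUMA je 100% zelena.")
for word, n in counts.most_common():
    print(n, word)
```

`most_common()` returns the words sorted by descending count, which corresponds to the final `sort -nr` in the pipeline.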
Now that you have it all nicely counted and sorted, you want to throw out the misspelled and incorrect words. I used Excel for that. But first, I took the OpenOffice word list (you can simply unzip the .oxt file) and cleaned it up a bit.
First, you need it in the correct Windows encoding. UltraEdit can do this. In my case I had to convert from ISO-8859-2 to Windows-1250. Open the ISO-8859-2 document, go to "view/set code page" and choose "iso-8859-2", then go to "file/conversions" and choose ASCII to UNICODE. Now all characters display correctly. Before saving the edited text you must convert it back, so choose UNICODE to ASCII and save it. That's it.
Also, it had suffixes such as "/AE" here and there, so I removed those too:
Code:
sed "s/\/[A-Z]*//g" hr_HR.dic > hr_HR.txt
and made it all lowercase (the output file name is just an example):
Code:
cat hr_HR.txt | tr "A-Z" "a-z" | tr "ŠĐČĆŽ" "šđčćž" > hr_HR_lower.txt
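Now I imported both lists into Excel and simply checked whether the words in my forum word list are correct by looking them up in the OpenOffice dictionary. Use absolute references for the dictionary range so that it does not shift when you fill the formula down:
Code:
=COUNTIF('openofficedic'!$A$1:$A$375541;B1)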
Once that's finished, copy just the values into a new column and delete the column with the formulas, so Excel doesn't recalculate them. Sort by the new values and keep the words that passed this 'spell check' (I got around 90.000 words in this step).
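The same lookup is quick to do outside Excel with a set. The sketch below is not what I ran, just an illustration: it strips the "/AE"-style flags from the dictionary lines, lowercases the words, skips the numeric first line that .dic files usually start with, and then splits the counted words into accepted and rejected. All function names and sample data are invented.

```python
def load_dictionary_words(lines):
    """Build a set of lowercased words from .dic lines, dropping '/FLAGS'."""
    words = set()
    for line in lines:
        word = line.strip().split("/", 1)[0].lower()
        if word and not word.isdigit():  # skip blanks and the word-count line
            words.add(word)
    return words


def split_by_dictionary(counts, known_words):
    """Return (accepted, rejected) dicts of word -> count."""
    accepted = {w: n for w, n in counts.items() if w in known_words}
    rejected = {w: n for w, n in counts.items() if w not in known_words}
    return accepted, rejected


dic_lines = ["3", "kuća/AE", "Zagreb", "šuma/B"]
forum_counts = {"kuća": 12, "zagreb": 5, "kuca": 7}

known = load_dictionary_words(dic_lines)
accepted, rejected = split_by_dictionary(forum_counts, known)
print(accepted)
print(rejected)
```

The rejected words are the ones you would then run through a second spell checker, as described next.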
Cut and paste the rows with zeros into a new worksheet. I wanted to check those against the MS Office dictionary so that I wouldn't throw out valid words just because they are missing from the OpenOffice dictionary. Here is the function I used:
Code:
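' Returns one Boolean per cell in rng: True if the cell text passes the MS Office spell check
Public Function SpellCheck(rng As Excel.Range) As Boolean()
   Dim i As Long, size As Long
   ' A separate Excel instance is used for the CheckSpelling calls
   Dim objExcel As New Excel.Application
   Dim result() As Boolean

   size = rng.Cells.Count
   ReDim result(1 To size)

   For i = 1 To size
      result(i) = objExcel.CheckSpelling(rng.Cells(i).Text)
   Next i

   SpellCheck = result()
   ' Close the helper instance so it does not linger in the background
   objExcel.Quit
End Function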
The function I found was originally written to be used as an array function, but I never managed to get that to work. It worked as a normal function, though, so I just invoked it with
Code:
=SpellCheck(B1)
This took a reaaaally loooong time. Again, copy the values to a new column and delete the column with the formulas so Excel won't recalculate them. Delete the rows containing 'FALSE'. Look through the rest and clean it up a bit, because the MS spell checker can act funny sometimes. I got another 20.000 words from this list that weren't recognized by the OpenOffice spell check, merged the two lists, and the final word count for the dictionary was around 110.000. I believe that's about optimal, maybe a little on the big side, but the final main.dict is just under 900 kB, which is more than acceptable.
Now you have to distribute the frequencies into 255 classes for the Gingerbread prediction engine. You could simply divide every word count by a fixed factor (the top word count divided by 255). But look at a scatter plot of the result and you'll notice that you use up the top classes very fast that way. So I optimized the distribution a bit in a separate calculation. I set the number of words in the top class to 1 and calculated the rest with the formula "nextclasswordcount=previousclasswordcount*factor^4". I used Excel Solver to find the factor. The total over all classes had to match the original word count (110.000 in my case), obviously. I then adjusted it a bit so that the sum of the new distribution is only 70.000 (1 word in the first class and 2000 in the last). This smooths out the distribution nicely for the more frequent words and lets the remaining 40.000 fall into class "1". It took some tweaking, and there may be a better formula, but this worked much better for me than dividing everything by the same factor.
I had this calculation in two separate rows and returned the classes back next to the words:
Code:
=IF(ISNA(HLOOKUP(A6;$J$4:$JD$5;2;FALSE));D5;HLOOKUP(A6;$J$4:$JD$5;2;FALSE))
Maybe it's best to d/l xlsm from here so I don't have to explain a lot...
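For those who don't want to dig through the spreadsheet, here is a rough Python sketch of the class distribution idea. It is a reconstruction under assumptions, not the exact contents of the xlsm: the top class holds 1 word, every lower class holds `ratio` times as many words as the one above it (a single `ratio` stands in for "factor^4"), a simple bisection replaces Excel Solver, and I assume the curve covers classes 255 down to 2 while everything left over gets class 1. All names are invented.

```python
def series_sum(ratio, n_classes, cap):
    """Sum 1 + ratio + ratio**2 + ... over n_classes terms, stopping past cap."""
    total, term = 0.0, 1.0
    for _ in range(n_classes):
        total += term
        if total > cap:   # early exit keeps the numbers from overflowing
            break
        term *= ratio
    return total


def solve_ratio(curve_total, n_classes):
    """Bisection for the ratio whose class sizes add up to curve_total."""
    lo, hi = 1.0, float(max(curve_total, 2))
    for _ in range(200):
        mid = (lo + hi) / 2
        if series_sum(mid, n_classes, curve_total) < curve_total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def assign_classes(words_by_rank, curve_total, top_class=255, bottom_class=2):
    """Map each word (most frequent first) to a frequency class."""
    ratio = solve_ratio(curve_total, top_class - bottom_class + 1)
    classes = {}
    boundary, size, idx = 0.0, 1.0, 0
    for cls in range(top_class, bottom_class - 1, -1):
        boundary += size                      # cumulative word count so far
        while idx < len(words_by_rank) and idx < round(boundary):
            classes[words_by_rank[idx]] = cls
            idx += 1
        size *= ratio                         # next class is larger
    for word in words_by_rank[idx:]:          # the long tail gets class 1
        classes[word] = 1
    return classes


# Tiny demo: 20 ranked words, 4 curve classes (5..2) covering 15 words.
words = ["w%d" % i for i in range(20)]
demo = assign_classes(words, curve_total=15, top_class=5, bottom_class=2)
print(demo)
```

With the real numbers from this post (a curve total of 70.000 spread over 254 classes) the solved ratio comes out at roughly 1.03 per class, which fits the "1 in the first class and 2000 in the last" figures.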
The rest is easy. Create the string needed for correct xml format,
Code:
=CONCATENATE("<w f=";CHAR(34);D2;CHAR(34);">";F2;"</w>")
wrap the word list in "<wordlist>" (first row) and "</wordlist>" (last row), add "<?xml version="1.0" encoding="UTF-8" ?>" at the very top, and finally compile the .dict file:
Code:
makedict_Windows.bat from.xml > main.dict
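If your word classes live in a script rather than in Excel, the XML source file can be generated directly. A minimal sketch, assuming a dict of word to class like the one built in the steps above; the function name and sample words are illustrative, and the element layout simply mirrors the CONCATENATE formula.

```python
from xml.sax.saxutils import escape, quoteattr


def build_wordlist_xml(word_classes):
    """Render word -> frequency class pairs in the makedict input format."""
    lines = ['<?xml version="1.0" encoding="UTF-8" ?>', "<wordlist>"]
    # Highest class first, then alphabetical, just to keep the file readable.
    ordered = sorted(word_classes.items(), key=lambda kv: (-kv[1], kv[0]))
    for word, freq in ordered:
        lines.append("<w f=%s>%s</w>" % (quoteattr(str(freq)), escape(word)))
    lines.append("</wordlist>")
    return "\n".join(lines) + "\n"


xml_text = build_wordlist_xml({"je": 255, "kuća": 180, "šuma": 1})
print(xml_text)
```

Write the result with `open("from.xml", "w", encoding="utf-8")` so the file really is UTF-8, as the XML declaration claims.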
Phew! A lot of typing.

And here is the LatinIME.apk with the Croatian layout and dictionary that I got this way:
HR_Gingerbread_keyboard-1.0.apk
I used mobilix's layout (thank you!) with just a few corrections of my own (fixed the &amp glitches in the symbols keyboard).

And here are the resources I got inspiration from (thanks, Gert Schepens):
http://www.gertschepens.be/android-dictionary-files
http://blog.cone.be/2010/08/19/andro...-dictionaries/

So much for now. Enjoy.

Now, the next step would be to try to get my work included in the official AOSP or maybe the Cyanogen source. I could use a few pointers on how to do that. I would prefer to keep it simple. I registered on GitHub, but that's as far as I've got for now. I have to do some more reading about it...
10th April 2011, 12:50 AM   |  #2  
Junior Member
Thanks Meter: 1
 
13 posts
Join Date:Joined: Mar 2010
Thanks for your guide!
I successfully created a Danish dictionary, but how do I put the main.dict file I just created into the LatinIME.apk?
In the LatinIME.apk file, I tried creating the folder /res/raw-da and putting the main.dict file there, but it didn't work.
Last edited by anders4431; 10th April 2011 at 09:14 AM.
10th April 2011, 07:54 AM   |  #3  
OP Senior Member
Thanks Meter: 31
 
118 posts
Join Date:Joined: Jun 2010
I'm glad you made it!
Be sure to create a nice, clean main.dict, which we will hopefully add to AOSP.
I used APK Manager to decompile the .apk that already had the Croatian layout, add a 'raw-hr' folder with my main.dict, then recompile and sign it. There was a bug in the .apk I used that prevented language switching, and I noticed that the bug was widespread in many LatinIME.apk versions floating around. I don't know where this bug comes from, but the problem was in the default main.dict file in the 'raw' folder, which had to be replaced with a proper one (I took it from the CyanogenMod version of LatinIME, but you can use my .apk).
20th April 2011, 04:20 PM   |  #4  
Stile35's Avatar
Senior Member
Thanks Meter: 7
 
192 posts
Join Date:Joined: Dec 2010
Could you please send me this already built Croatian dictionary file somehow, so I can incorporate it into my Gingerbread keyboard?

Thanks.
21st April 2011, 06:20 AM   |  #5  
OP Senior Member
Thanks Meter: 31
 
118 posts
Join Date:Joined: Jun 2010
I presume you're after main.dict...
Use APK Manager, decompile my apk and you'll find it in 'raw-hr' folder.

Sent from my HTC Desire
15th June 2011, 09:27 AM   |  #6  
Senior Member
Flag San Pedro
Thanks Meter: 11
 
139 posts
Join Date:Joined: Jan 2011
hi there, i was so grateful to have found this thread after googling for almost 7hours for a tagalog dictionary

although your method of downloading a whole forum would not work for me, as I have no fast internet connection.

so i would like to verify if, upon continuous usage, will the Gingerbread keyboard modify the word frequency over time?
i mean i could modify a script to just assign 0 as frequency value for all words and use make_dict, (or to avoid problems, just assign any random value from 0 - 255)
and as i use my keyboard, will it edit those frequency scores eventually?

anyway, im trying it out right now and would post my results here too.

again thank you very much for your insight
18th June 2011, 01:39 PM   |  #7  
Junior Member
Flag Zagreb
Thanks Meter: 4
 
9 posts
Join Date:Joined: Nov 2008
Thank you very much! (Hvala!)
20th June 2011, 01:12 AM   |  #8  
Senior Member
Flag Manila
Thanks Meter: 2
 
309 posts
Join Date:Joined: Jun 2006
Quote:
Originally Posted by lockzackary

hi there, i was so grateful to have found this thread after googling for almost 7hours for a tagalog dictionary

although your method of downloading a whole forum would not work for me, as I have no fast internet connection.

so i would like to verify if, upon continuous usage, will the Gingerbread keyboard modify the word frequency over time?
i mean i could modify a script to just assign 0 as frequency value for all words and use make_dict, (or to avoid problems, just assign any random value from 0 - 255)
and as i use my keyboard, will it edit those frequency scores eventually?

anyway, im trying it out right now and would post my results here too.

again thank you very much for your insight

I'm very much looking forward to your results..
it's driving me crazy everytime i reflash my ROM that i need to rebuild my Tagalog user dictionary.
20th June 2011, 02:18 AM   |  #9  
Senior Member
Flag San Pedro
Thanks Meter: 11
 
139 posts
Join Date:Joined: Jan 2011
@ytsejam_
Hey there, I'm done with the dictionary, although I have yet to test it with a compatible ROM. To further complicate things, it's too tedious to populate the dictionary with Tagalog text-speak (e.g. cnu, sno),
as there are so many variations of a single word, hehe. I'm still building it up, so I hope other Pinoys can wait for it.
As far as I know, this approach to creating dictionaries only works on Samsung devices and not necessarily on Android in general, so I hope they still own their Galaxy by then, hehe
Sent from my GT-I9000 using XDA App
25th June 2011, 03:21 PM   |  #10  
Junior Member
Flag Tehran
Thanks Meter: 0
 
15 posts
Join Date:Joined: May 2010
Need a keyboard not just dictionary
Hi,

I just want to add a new language (in this case, Persian) to the keyboard.

I use Cyanogen 7 and it supports Persian and Arabic very well, but it has no layout for Persian (it does have one for Arabic). Can you please help me with this?

Just some clue.

thanks
