[HOWTO] Create Gingerbread Keyboard Dictionary for your Language (Croatian example)


revan17

Senior Member
Dec 23, 2010
Zagreb
Hi, when using your modded LatinIME, most themes don't get the keyboard themed properly. So I tried to put your raw-hr folder with main.dict inside into my LatinIME, but it doesn't get recognized (there was already a Croatian layout present).

any ideas?

Thanks for the dictionary, BTW. :good::good::good:
 

underlines

Senior Member
Aug 26, 2011
Bangkok
Sorry to bother you guys again. I'm really struggling to get this working.

What I want: The original LatinImeGoogle.apk (from JB 4.1.1) with an additional dictionary.
What I tried:
  1. Create a Swiss German text corpus, because none exists for this dialect
  2. Create an XML with the format:
    Code:
    <dictionary>
    <w f="255">word</w>
    ...
    <w f="1">anotherword</w>
    </dictionary>
  3. Use makedict.jar from ICS (I haven't found a JB version yet) to create the main_de_ch.dict file (DE stands for the German language, CH is the abbreviation for Switzerland)
  4. Use the newest apktool_1.5.0 to unpack the LatinImeGoogle.apk
  5. Copy the main_de_ch.dict into ..\res\raw\ folder
  6. Open ..\res\xml\spellchecker.xml and add:
    Code:
    <subtype android:label="@string/subtype_generic" android:subtypeLocale="de_ch" />
  7. Go to ..\res\ and make a copy of values-de named values-de_ch
  8. Use the newest apktool_1.5.0 to build LatinImeGoogle.apk
  9. Copy to system, apply proper permissions

Result: the keyboard does not appear in Settings > Input.
I see several possible causes:
1) Wrong naming at step 7, because the existing folders are values-en-rGB and values-es-rUS
2) Maybe I'm missing some file where I must declare that naming?
3) The apk isn't signed (?)
4) Maybe I don't need apktool at all, and can just open the apk as a zip and make the modifications without unpacking/repacking?

To test which of the four possible causes above is the problem, I renamed my main_de_ch.dict to main_de.dict and overwrote the original one in LatinImeGoogle.apk, so it would basically use Swiss German in the German keyboard for predictions and corrections, right?

Result: The apk can be pushed to /system, I see the input method in Settings, everything is OK, but predictions/corrections are German, not Swiss German. HOW IS THAT EVEN POSSIBLE? :( :( :( I've overwritten main_de.dict with my Swiss German word list, so how can it show German words?

Any help/hint would be very much appreciated.

-------

EDIT: It seems that makedict.jar from different firmware versions produces different binary formats.
I guess I have the wrong version of makedict.jar, and I can't find a current version from AOSP 4.1.1 Jelly Bean.
Any help?

Here is the raw XML that I want to convert to .dict
 

Egy-bluE

Member
Dec 20, 2013
Hello. I have my stock ROM backup files for the Alcatel 985D, and I'm using a custom ROM that doesn't support Arabic as the system locale language. How can I copy the language files from my stock ROM into the custom ROM I'm using now? (I can't restore my stock ROM to the phone, which is why I want to replace the files, but I don't know which files in the Android system contain the language.)
 

Top Liked Posts

  • 18
    I really liked the new Gingerbread keyboard but was bugged by the fact that it was missing a Croatian dictionary, so I tried to figure out how to create one.
    I managed it, and it's working great, so I decided to share the procedure here so others can make use of it.

    Here it goes...


    What you need:

    1. A good source of word frequencies
    A good prediction dictionary relies on word frequencies, as defined by the AOSP:
    http://android.git.kernel.org/?p=pl...33b63a8b8a1043fceae592b567b93ee275504;hb=HEAD
    So, you need a source from which you can extract how often different words appear. After some thinking, googling, and trial and error, I came to the conclusion that for smartphone usage there is no better source than a big national forum. That's what I used, anyway.

    2. OpenOffice (and MS Office) dictionary for your language
    You can find it here:
    http://extensions.services.openoffice.org/en/dictionaries
    You don't want misspelled words in the dictionary, right? So, after creating the word list from your source, you'll want to throw out the words that are not in this list.
    Just to be sure I keep all the 'good' words, I also ran the MS Office spelling procedure through it. I'll explain that later on.

    3. Tools - GNU utilities, MS Office, UltraEdit, wget (HTTrack)...
    There are no more powerful tools for stream editing than the Unix tools. Period.
    At first I tried to get by without them, and once I learned a bit about them I realized how great they are for a task like this. Get them here:
    http://sourceforge.net/projects/unxutils
    Windows comes with its own 'sort' command, but you'll want the one from the GNU utilities, so put it in the directory you run your commands from.
    You'll need to somehow download that forum I mentioned earlier. I used wget:
    http://www.gnu.org/software/wget/
    It was pretty slow (it took about two days to mirror the part of the forum with posts). When I was near the end of the download I learned about HTTrack:
    http://www.httrack.com
    I tried it out briefly and it seems a lot faster (it can do multiple connections!)

    4. Makedict
    Get it here:
    http://softkeyboard.googlecode.com/svn/trunk/DictionaryTools/
    For Windows, you need makedict_Windows.bat and makedict.jar


    PROCEDURE:
    I don't have experience with HTML, so at first I had to study how the vBulletin forum I was targeting is structured. I wanted to download just the pages that contain posts, not member lists etc. In the end I came up with this syntax for wget:
    Code:
    wget -k -m -E -p -np -R member.php*,memberlist.php*,calendar.php*,faq.php*,printthread.php*,newreply.php*,search.php*,*sendtofriend*,sendmessage.php*,*goto=nextnewest*,newreply.php*,misc.php*,forumdisplay.php*,showpost.php*,announcment.php*,image.php*,viewonline.php*,showthread.php*mode*,showthread.php*s=*,showthread.php*page* -o log.txt http://xxxxxxx.hr/
    I'm not sure the syntax is entirely correct, but it worked for me, so I never looked back. wget started to download only the stuff I wanted: thread pages from the forum. It took a long time to collect 9 GB of data. Have a look at HTTrack; I think it can do it much faster.
    Now you want to extract just the message text from the HTML:
    Code:
    cat showthread* | sed -n "/<!-- message -->/,/<!-- \/ message -->/p" > forum0.txt
    Check what you got. You don't want quoted text included, because quotes would inflate the counts for words that appear in them, so strip those out too:
    Code:
    sed "s/<[^:]*said://g" forum0.txt > forum1.txt
    Finally, strip out the rest of HTML code:
    Code:
    sed -e "s/<[^>]*>//g" forum1.txt > forum2.txt
    I noticed that I had some leftover Croatian characters represented by their numeric character codes, so I replaced those too:
    Code:
    cat forum2.txt | sed "s/&#353;/š/g" | sed "s/&#273;/đ/g" | sed "s/&#269;/č/g" | sed "s/&#263;/ć/g" | sed "s/&#382;/ž/g" | sed "s/&#352;/Š/g" | sed "s/&#272;/Đ/g" | sed "s/&#268;/Č/g" | sed "s/&#262;/Ć/g" | sed "s/&#381;/Ž/g" > forum.txt
    Found the codes here:
    http://yorktown.cbe.wwu.edu/sandvig/docs/unicode.aspx
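If you'd rather not write one sed per character, Python's standard library can decode these numeric HTML entities in one go. This is just a sketch, not part of the original procedure, and the sample string is made up:

```python
# Decode numeric HTML entities like &#269; back into real characters,
# as an alternative to the chain of sed substitutions above.
from html import unescape

sample = "&#269;okolada i &#353;e&#263;er"
print(unescape(sample))  # -> čokolada i šećer
```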
    Now you can start building your word list by throwing out everything but words
    Code:
    cat forum.txt | tr "[:punct:][:blank:][:digit:]" "\n" | grep "^." > unsortedallwordslist.txt
    and counting how often they appear
    Code:
    cat unsortedallwordslist.txt | tr "A-Z" "a-z" | tr "ŠĐČĆŽ" "šđčćž" | sort | uniq -c | sort -nr > words.txt
    I got around 205.000 counted words after this.
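The tokenize-lowercase-count-sort pipeline above can be mirrored in a few lines of Python if you want to sanity-check the results (a sketch; the sample text is made up):

```python
# Python equivalent of: tr (split on punctuation/digits) | tr A-Z a-z |
# sort | uniq -c | sort -nr
import re
from collections import Counter

def count_words(text):
    # keep only runs of letters (this also keeps Croatian diacritics),
    # lowercase everything, then count occurrences
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

counts = count_words("Dobar dan! Dan je dobar, dan je lijep.")
print(counts.most_common())  # most frequent first, like `sort -nr`
```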
    Now that you have it all nicely counted and sorted, you want to throw out the misspelled and incorrect words. I used Excel for that. But first, I took the OpenOffice word list (you can simply unzip the .oxt file) and cleaned it up a bit.
    First, you need it in the correct Windows encoding; UltraEdit can do it. In my case I had to convert from ISO-8859-2 to Win-1250: open the ISO-8859-2 document, go to View > Set Code Page and choose "iso-8859-2", then go to File > Conversions and choose ASCII to Unicode. You will then see all the characters correctly. When you want to save the edited text you must convert it back, so choose Unicode to ASCII and save it. That's it.
    It also had suffixes such as "/AE" here and there, so I removed those too
    Code:
    sed "s/\/[A-Z]*//g" hr_HR.dic > hr_HR.txt
    and made it all lowercase
    Code:
    cat hr_HR.txt | tr "[A-Z]" "[a-z]" | tr "[ŠĐČĆŽ]" "[šđčćž]" > hr_HR_lower.txt
    Now I imported both lists into Excel and simply checked whether the words from my forum word list are correct, by checking whether they can be found in the OpenOffice dictionary:
    =COUNTIF('openofficedic'!A1:A375541;B1)
    After that finishes, copy just the values into a new column and delete the column with the formulas, so Excel doesn't recalculate it again. Sort by the new values and keep the words that passed this 'spell check' (I got around 90.000 words in this step).
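The same COUNTIF filtering can be sketched outside Excel as a simple set-membership check (the sample words are hypothetical):

```python
# Keep only counted words that appear in the OpenOffice word list;
# everything else goes into a 'suspects' list for the second spell check.
dictionary = {"dan", "dobar", "je", "lijep"}      # from the OpenOffice list
counted = [("dan", 3), ("dobr", 2), ("je", 2)]    # (word, count) from the forum

kept = [(w, n) for w, n in counted if w in dictionary]
suspects = [(w, n) for w, n in counted if w not in dictionary]
print(kept)      # [('dan', 3), ('je', 2)]
print(suspects)  # [('dobr', 2)]
```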
    Cut & paste the rows with zeros into a new worksheet. I wanted to compare those against the MS dictionary, so that I don't throw out anything just because it isn't in the OpenOffice dictionary. Here is the function I used:
    Code:
    Public Function SpellCheck(rng As Excel.Range) As Boolean()
       Dim i As Long, size As Long
       Dim objExcel As New Excel.Application
       Dim result() As Boolean

       size = rng.Cells.Count
       ReDim result(1 To size)

       ' run each cell's text through the Office spell checker
       For i = 1 To size
          result(i) = objExcel.CheckSpelling(rng.Cells(i).Text)
       Next i

       SpellCheck = result
       objExcel.Quit
    End Function

    The function I found was originally written to act as an array function, but I never managed to get that working. It worked as a normal function though, and I just invoked it with:
    Code:
    =SpellCheck(B1)
    This took a reaaaally loooong time. Again, copy the values into a new column and delete the column with the formulas so it won't recalculate. Delete the rows with 'FALSE' in them. Check through the rest and clean it up a bit; the MS spell check can act funny sometimes. I got another 20.000 words from this list that weren't recognized by the OpenOffice spell check, merged the two lists, and the final word count for the dictionary was around 110.000. I believe that's optimal, maybe a little on the big side, but the final main.dict is just under 900 kB, which is more than acceptable.
    Now you have to distribute the frequencies into 255 classes for the Gingerbread prediction engine. You could do it just by dividing every count by a factor you get from dividing the top word count by 255, but look at a scatter plot of that and you'll notice you spend the top classes very quickly.
    So I optimized the distribution a bit in a separate calculation: I set the word count in the top class to 1 and calculated the rest with the formula nextclasswordcount = previousclasswordcount * factor^4, using Excel Solver to find the factor. The total word count had to match the original word count (in my case 110.000), obviously. I then corrected it a bit, so that the sum of the new distribution is only 70.000 (1 word in the first class and 2000 in the last); that smooths out the distribution nicely for the more frequent words and lets the remaining 40.000 fall into class 1. It took some tweaking, and you could maybe use a better formula, but this worked much better for me than dividing everything by the same factor.
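For what it's worth, the Solver search can be reproduced with a simple bisection. This is my own reconstruction of the calculation described above (255 classes, 1 word in the top class, each following class holding factor^4 times the previous count, total constrained to 70.000), not the original workbook:

```python
# Reconstruction of the Solver step (assumptions: 255 frequency classes,
# top class holds 1 word, next class = previous * factor**4,
# total word count constrained to 70000).
def total_words(factor, classes=255):
    ratio = factor ** 4          # per-class growth of the word count
    count, total = 1.0, 0.0
    for _ in range(classes):
        total += count
        count *= ratio
    return total

def solve_factor(target=70000.0):
    lo, hi = 1.0, 1.1            # total_words is increasing in factor
    for _ in range(100):
        mid = (lo + hi) / 2
        if total_words(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

factor = solve_factor()
print(factor)
```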
    I had this calculation in two separate rows and returned the classes back next to the words:
    Code:
    =IF(ISNA(HLOOKUP(A6;$J$4:$JD$5;2;FALSE));D5;HLOOKUP(A6;$J$4:$JD$5;2;FALSE))
    It's probably easiest to download the xlsm from here so I don't have to explain everything...
    The rest is easy. Create the string needed for the correct XML format:
    Code:
    =CONCATENATE("<w f=";CHAR(34);D2;CHAR(34);">";F2;"</w>")
    close the word list with "<wordlist>" in the first row and "</wordlist>" in the last, add "<?xml version="1.0" encoding="UTF-8" ?>" at the top, and finally compile the .dict file:
    Code:
    makedict_Windows.bat from.xml > main.dict
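If you'd rather skip the CONCATENATE step, the same XML can be assembled from a (class, word) list in a few lines (a sketch; the sample rows are made up):

```python
# Build the makedict input XML from (frequency class, word) pairs,
# with the <wordlist> wrapper and XML declaration described above.
rows = [(255, "dan"), (200, "dobar"), (1, "lijep")]

lines = ['<?xml version="1.0" encoding="UTF-8" ?>', "<wordlist>"]
lines += ['<w f="{}">{}</w>'.format(f, w) for f, w in rows]
lines.append("</wordlist>")
xml = "\n".join(lines)
print(xml)
```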

    Phew! A lot of typing.

    And here is the LatinIME.apk with the Croatian layout and dictionary that I got this way:
    HR_Gingerbread_keyboard-1.0.apk
    I used mobilix's layout (thank you!) with just a few corrections of my own (fixed the &amp; glitches in the symbolic keyboard).

    And here are the resources I got inspiration from (thanks, Gert Schepens):
    http://www.gertschepens.be/android-dictionary-files
    http://blog.cone.be/2010/08/19/android-keyboard-dictionaries/

    So much for now. Enjoy.

    Now, the next step would be to try to get my work included in the official AOSP, or maybe the CyanogenMod source. I could use a few pointers on how to do that; I'd prefer to keep it simple. I registered on GitHub, but that's as far as I've got. I have to do some more reading about it...
    1
    I presume you're after main.dict...
    Use APK Manager to decompile my apk, and you'll find it in the 'raw-hr' folder.

    Sent from my HTC Desire
    1
    @navdra
    I added a post about your work on the new Oxygen forum:
    http://forum.oxygen.im/viewtopic.php?id=464
    Hope that's OK with you?!
    1
    Thanks for the help, navdra, but I failed again. I recompiled the whole CM again and edited spellchecker.xml to include Hungarian, but it still doesn't work. I've attached my XML and dict files; could you perhaps take a look at them? I'm completely clueless about what I'm doing wrong... Thanks in advance!

    Just to clarify, here's what I did:
    1. Downloaded the Hungarian Webcorpus from here: http://mokk.bme.hu/en/eszkozok/ (it's an open-licensed word list with frequencies included)
    2. Edited the corpus (kept only the most relevant frequency, removed HTML and other special characters and words with a frequency lower than 500, converted to UTF-8, etc.)
    3. Created an XML from the result
    4. Used reweigh.pl from here to map the frequencies into the 0-254 range (this step seems no longer necessary; compiling without it gives the same .dict file)
    5. Compiled the XML to a dict
    6. Put the dict into CyanogenMod's overlay directory, into raw-hu, where the other main.dict files are also present
    7. Edited spellchecker.xml to include hu as a supported language
    8. Compiled CM9
    9. Failed, because I'm not getting suggestions :(
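Step 4 above, mapping raw corpus frequencies into makedict's 0-254 range, can be sketched as a simple linear rescale. This is not reweigh.pl itself, just an illustration with made-up words:

```python
# Linearly rescale raw corpus frequencies so the most frequent word
# gets 254, with a floor of 1 so rare words keep a nonzero frequency.
def rescale(freqs, top=254):
    m = max(freqs.values())
    return {w: max(1, round(f * top / m)) for w, f in freqs.items()}

print(rescale({"ma": 1000, "nap": 500, "szó": 1}))
# {'ma': 254, 'nap': 127, 'szó': 1}
```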

    Update: I tried compiling the same XML with the new makedict (from the ICS source), and it complains about duplicate words (for words which are also in the dictionary capitalized). Perhaps this is the problem...

    Update 2: It works! It seems you have to use the ICS makedict to get a working dictionary on ICS. As soon as I managed to compile it with the new makedict (I had to convert the frequencies to the 0-254 range), it worked. Thanks for the help.
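The duplicate-word complaint from the ICS makedict (the same word present both capitalized and lowercase) can be avoided by merging case variants before compiling, e.g. keeping the higher frequency (a sketch; the sample words are made up):

```python
# Merge case-duplicate entries, keeping the highest frequency per word,
# so the compiled word list contains each word only once.
def dedupe(entries):
    merged = {}
    for word, freq in entries:
        key = word.lower()
        merged[key] = max(merged.get(key, 0), freq)
    return merged

print(dedupe([("Budapest", 200), ("budapest", 150), ("ma", 80)]))
# {'budapest': 200, 'ma': 80}
```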