The four keys at the bottom of the phone are monitored by a Melfas touchkey chip (http://www.melfas.com/english/touch/sensor.asp) that connects to the main processor via an I2C bus (http://en.wikipedia.org/wiki/i2c). The Melfas chip generates an interrupt whenever one of the keys is touched or released. The processor then reads the key value from this chip over the I2C bus. The problem is that the touchkey chip is located right next to the 3G antenna. When the phone is accessing the 3G network, RF energy gets coupled onto the interrupt line and the I2C clock and data lines, causing false interrupts to occur. The processor responds to each false interrupt by reading the key value from the touchkey chip. The symptoms occur more frequently in low signal areas because the phone transmits at a higher RF level in those situations, which causes more RF interference on the interrupt line.
Most of the time when a false interrupt has occurred, the touchkey chip will return a value of zero for the key and the driver will recognize this as a bad key press and ignore it. Sometimes the RF interference on the I2C clock and/or data line causes a valid value to be returned, and the driver reports a key press to the application. In the case where the driver reports a ‘back’ key down with no matching key up, the software sees this as the back key being held down, so when you press the power button you get a screenshot. The easiest workaround is to always press and release the back key before pushing the power button. This causes the software to see both a key down and a key up event, which cancels the screenshot mode.
This RFI-induced touchkey interrupt happens hundreds of times per second when the phone is using 3G. It produces lots of different symptoms, including applications that always seem to shut down. A wide variety of problems can be attributed to this failure. In addition, the processor spends a lot of time servicing these bogus interrupts, which takes CPU time away from the other applications. This can make the phone appear to be slow or even freeze up for short periods of time. There’s a good chance that most people have experienced this to some degree without realizing the root cause.
Solution one. Fix the driver.
Since this is a true hardware failure, a software solution is going to be less than perfect. After dozens of experiments rewriting the interrupt service routines in the driver, I’ve settled on a combination of fixes.
The first is to re-test the interrupt input line several times. In normal operation, when you touch or release a button, the touchkey chip drives the interrupt line low and keeps it low until the driver reads data over the I2C interface. Because the RF interference is a sine wave, the sampled interrupt line appears to go high and low at a fast rate. Sampling the line multiple times in software increases the chance of catching it in the high state, which marks the interrupt as false. This is done both in the interrupt handler and then again in the interrupt thread. About 90% of the false interrupts are filtered out by testing the line in the handler. If the interrupt handler doesn’t find the line high after 10 samples, it masks the interrupt so that another falling edge doesn’t produce another interrupt. In testing I noticed that the interrupt handler would run multiple times before the interrupt thread was even called. Once in a while, so many interrupts would get stacked up that the phone would just reboot. It was probably a stack or buffer overflow that wasn’t being handled. Remember, this interrupt can happen many hundreds of times a second. About 90% of the remaining false interrupts are filtered out by sampling the line in the thread. That leaves about 1% of the false interrupts that need further testing.
The second test is to read the data from the chip and discard anything that isn’t a valid key press value. This is easily done with a case statement.
Finally, since a bogus but valid-looking value will occasionally get through, I set up a timer so that any key down event that doesn’t have a corresponding key up event within 3 seconds is canceled by calling the all_keys_up routine.
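For anyone who wants to see the three filtering stages in one place, here is a rough model of the logic in Python. This is not the driver code (the real thing is C in the kernel tree); the key codes, class name and function names are made up for illustration, and only the 10-sample count and the 3 second timeout come from the description above.

```python
# Hypothetical model of the three filter stages; not the actual driver code.
VALID_KEY_CODES = {1, 2, 3, 4}   # assumed codes for the four keys
HANDLER_SAMPLES = 10             # from the post: 10 samples in the handler
KEY_TIMEOUT_S = 3.0              # from the post: cancel a key down after 3 s


def line_is_real_interrupt(read_line, samples=HANDLER_SAMPLES):
    """Stage one: accept the interrupt only if the line reads low every time.

    A genuine interrupt holds the line low until the key data is read.
    RF noise makes the line bounce, so a single high sample rejects it.
    """
    return all(read_line() == 0 for _ in range(samples))


def is_valid_key(code):
    """Stage two: discard anything that is not a known key value."""
    return code in VALID_KEY_CODES


class KeyWatchdog:
    """Stage three: release a key whose 'up' event never arrives."""

    def __init__(self, timeout=KEY_TIMEOUT_S):
        self.timeout = timeout
        self.down_since = {}     # key code -> time of its key-down event

    def key_down(self, code, now):
        self.down_since[code] = now

    def key_up(self, code):
        self.down_since.pop(code, None)

    def expire(self, now):
        """Return the keys held past the timeout and forget them."""
        stuck = [k for k, t in self.down_since.items()
                 if now - t >= self.timeout]
        for k in stuck:
            del self.down_since[k]
        return stuck
```

In this model a real key press looks like `line_is_real_interrupt(lambda: 0)`, and a phantom 'back' key down that never gets its key up is returned by `expire()` once 3 seconds have passed, which is where the driver would call all_keys_up.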
This combination all but eliminates the symptoms produced by this failure. The only drawback is that the processor still spends a considerable amount of time servicing the false interrupts, and on rare occasions a phantom keypress does get through. In all, it’s a fairly good piece of duct tape and JB Weld.
During my experiments I used a copy of the kgb kernel. My version with the modified driver is on GitHub at https://github.com/dmriley/kgb. If you want to try this yourself, be sure to use the ‘dev’ branch.
Solution two. Fix the hardware.
There are three signals that connect the Melfas touchkey chip to the processor. Two are the I2C lines: scl, which is the clock, and sda, which is the data. The third line is the interrupt. In troubleshooting this problem, I took my phone apart and put oscilloscope probes on the three lines. This allowed me to see the real cause of the problem. Since the interference is RFI (or EMI), the only real way to fix the problem is to either remove the RF or make the impedance of the signals much lower.
Removing the RF is easy if you don’t need to use 3G. When the phone is using wifi (or no network connectivity at all) the problem does not exist. Also, when you are very close to a cell tower, the phone transmits at a much lower level. This lower level greatly reduces the RFI.
Lowering the impedance is a little harder. I2C uses active pull down and passive pull up for the logic levels on both sda and scl. This means that the impedance is mostly governed by the pull up resistor. This resistor value is typically upwards of 1 kohm and probably as high as 3 kohms (I didn’t measure it in this phone). Since the impedance only needs to be lowered for the 3G frequencies of around 800 MHz, a capacitor can be added from the signal line to signal ground. At 800 MHz a 100 pF cap is about 2 ohms (1 / (2*pi*f*C)). That’s a couple of orders of magnitude lower than the pull up resistor alone, and much too low for the RF signal to induce any significant voltage on the line. The capacitance is also small enough not to interfere with the signal rise and fall times on the interrupt line. In the case of the interrupt line, the Melfas chip drives the signal low and keeps it low until the interrupt is serviced. Discharging a 100 pF cap with a 2 mA driver takes a microsecond or less. This much delay is not noticeable when touching the key and is much less than the amount of time that the processor takes to service the interrupt.
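If you want to check the numbers for a different cap or frequency, the arithmetic is easy to script. The sketch below uses the 100 pF, 800 MHz and 2 mA figures from above. The logic voltages (1.8 V and 3.3 V) are my assumption, since the post doesn't say what level the interrupt line runs at, and the discharge estimate treats the driver as a constant current sink (t = C*V/I).

```python
import math


def capacitive_reactance(freq_hz, cap_farads):
    """Magnitude of a capacitor's impedance: Xc = 1 / (2*pi*f*C)."""
    return 1.0 / (2.0 * math.pi * freq_hz * cap_farads)


def discharge_time(cap_farads, volts, current_amps):
    """Time for a constant-current sink to pull the cap down: t = C*V/I."""
    return cap_farads * volts / current_amps


CAP = 100e-12        # 100 pF cap, from the post
F_3G = 800e6         # roughly 800 MHz, from the post
I_DRIVE = 2e-3       # 2 mA driver, from the post

xc = capacitive_reactance(F_3G, CAP)
print(f"Xc at 800 MHz: {xc:.2f} ohms")            # about 1.99 ohms

for volts in (1.8, 3.3):                          # assumed logic levels
    t = discharge_time(CAP, volts, I_DRIVE)
    print(f"discharge from {volts} V: {t * 1e9:.0f} ns")
```

Either way the cap looks like roughly 2 ohms to the RF, and the chip can pull it low in a small fraction of a microsecond.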
Adding the cap to the interrupt line eliminates false interrupts. A chance does exist that a valid key event during 3G access could cause an incorrect key value to be returned due to RFI on the clock and data lines. Caps could be added to those lines as well, since the I2C protocol is designed to tolerate capacitive loading on the lines. Although the extra capacitance would stretch the clock period significantly, it would still only take milliseconds to read the key data from the chip. The difference would be imperceptible. To date I have only added the cap to the interrupt line and have yet to experience an invalid key press.
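To put a rough number on the loading effect, the rise time of an I2C line is set by the pull up resistor and the total capacitance on the line. The sketch below uses the usual 30% to 70% rise time for an RC charge, t = ln(0.7/0.3) * R * C, with the 1 kohm and 3 kohm pull up guesses from above. It counts only the added 100 pF cap; whatever capacitance is already on the line in this phone is unknown and is left out, so treat the results as a lower bound.

```python
import math


def i2c_rise_time(pullup_ohms, bus_cap_farads):
    """30%-to-70% rise time of an RC-charged line: ln(0.7/0.3) * R * C."""
    return math.log(0.7 / 0.3) * pullup_ohms * bus_cap_farads


ADDED_CAP = 100e-12                   # the 100 pF cap, from the post

for pullup in (1000.0, 3000.0):       # pull up range guessed in the post
    t_r = i2c_rise_time(pullup, ADDED_CAP)
    print(f"{pullup:.0f} ohm pull up: rise time about {t_r * 1e9:.0f} ns")
```

The I2C specification allows rise times up to 1000 ns in standard mode and 300 ns in fast mode, so a few hundred nanoseconds is slow but still in the range the bus is designed to cope with.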
I’ll post pictures of the cap mod.
Most people will be satisfied using the software fix. I think that a couple of the kernel devs are incorporating some or most of the driver mods outlined in this document. Both comradesven (kgb dev) and ssewk2x aka Efpophis (glitch dev) were involved in the test and debug process. Much appreciation is given to both of them for the help that they gave me and for allowing me to use and hack up their code on github. Efpophis saved me hours of searching through code. Without their help, I’d still be unable to build a kernel.
UPDATE: 30 Mar 2012
The phone had been working fine since the mod. I hadn't seen a screen capture or any of the other symptoms. Then, a couple of nights ago, while I was running maps on 3G (a data intensive app), the touchkey backlights started flashing rapidly like the phone was having a little seizure. And then it happened: the voice search popped up. A couple of debug kernels later I've come to the conclusion (and I'm never wrong) that the clock line (SCL) going to the Melfas chip was being toggled by the same RF interference that was causing the false interrupts. A random clock along with random data was causing the chip to turn the backlights on and off as well as generate a false interrupt. I was able to reliably duplicate the problem in a couple of really low signal level areas (not hard to find when you live out in the boonies).
I tore the phone apart (again) today and added a 100 pF cap to the SCL line right next to the chip. I also added another cap in parallel with the 100 pF on the interrupt line. I spent about half an hour tonight running 3G data apps in the same location where the problem first appeared. So far, no problems, and none of the debug messages have shown up in dmesg.
If anyone wants pics of the added cap I'll open it back up, no problem, otherwise if you look at this photo you can see which pin is scl (although I incorrectly labeled it SDC in the photo). http://forum.xda-developers.com/atta...4&d=1332117055
If anyone tries these mods I'd be real interested in your results.