eMMC sudden death research

Oranav

Senior Member
Oct 9, 2010
53
262
0
Update from Feb 17th:
Samsung has started to upgrade eMMC firmwares on the field - only for GT-I9100 for now.
See post #79 for additional details.

Update from Feb 13th:
If you want to dump the eMMC's RAM yourself, go ahead to post #72.
I'm looking for a dump of firmware revision 0xf7 if you've got one.
-----------------------


Since it's very likely that the recent eMMC firmware patch by Samsung is their patch for the "sudden death" issue, it would be very nice to understand what is really going on there.

According to a leaked moviNAND datasheet, it seems that MMC CMD62 is vendor-specific command that moviNAND implements.
If you issue CMD62(0xEFAC62EC), then CMD62(0xCCEE) - you can read a "Smart report". To exit this mode, issue CMD62(0xEFAC62EC), then CMD62(0xDECCEE).


So what are they doing in their patch?

1. Whenever an MMC is attached:
a. If it is "VTU00M", revision 0xf1, they read a Smart report.
b. The DWORD at Smart[324:328] represents a date (little-endian); if it is not 0x20120413, they don't patch the firmware. (Maybe only chips from 2012/04/13 are buggy?)
2. If the chip is buggy, whenever an MMC is attached or the device is resumed:
a. Issue CMD62(0xEFAC62EC) CMD62(0x10210000) to enter RAM write mode. Now you can write to RAM by issuing MMC_ERASE_GROUP_START(Address to write) MMC_ERASE_GROUP_END(Value to be written) MMC_ERASE(0).
b. *(0x40300) = 10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00
c. *(0x5C7EA) = E3 F7 89 FD
d. Exit RAM write mode by issuing CMD62(0xEFAC62EC) CMD62(0xDECCEE).
10 B5 looks like a common Thumb push (in ARM architecture). Disassembling the bytes that they write to 0x40300 yields the following code:
Code:
ROM:00040300                 PUSH    {R4,LR}
ROM:00040302                 LDR     R2, =0x59D73
ROM:00040304                 BLX     R2
ROM:00040306                 CMP     R0, #0
ROM:00040308                 BNE     locret_4030C
ROM:0004030A
ROM:0004030A loc_4030A                               ; CODE XREF: ROM:loc_4030Aj
ROM:0004030A                 B       loc_4030A
ROM:0004030C ; ---------------------------------------------------------------------------
ROM:0004030C
ROM:0004030C locret_4030C                            ; CODE XREF: ROM:00040308j
ROM:0004030C                 POP     {R4,PC}
ROM:0004030C ; ---------------------------------------------------------------------
Disassembling what they write to 0x5C7EA yields this:
Code:
ROM:0005C7EA                 BL      0x40300
Looks like it is indeed Thumb code.
If we could dump the eMMC RAM, we would understand what has been changed.


By inspecting some code, it seems that we know how to dump the eMMC RAM:
Look at the function mmc_set_wearlevel_page in line 206. It patches the RAM (using the method mentioned before), then it validates what it has written (in lines 255-290). Seems that the procedure to read the RAM is as following:
1. CMD62(0xEFAC62EC) CMD62(0x10210002) to enter RAM reading mode
2. MMC_ERASE_GROUP_START(Address to read) MMC_ERASE_GROUP_END(Length to read) MMC_ERASE(0)
3. MMC_READ_SINGLE_BLOCK to read the data
4. CMD62(0xEFAC62EC) CMD62(0xDECCEE) to exit RAM reading mode


I don't want to run this on my device, because I'm afraid - messing with the eMMC doesn't sound like a very good idea on my device (I don't have a spare one).
Does someone have a development device which he doesn't mind to risk, and want to dump the eMMC firmware from it? :)
 
Last edited:

sponix2ipfw

Senior Member
Jul 21, 2012
58
27
0
Since it's very likely that the recent eMMC firmware patch by Samsung is their patch for the "sudden death" issue, it would be very nice to understand what is really going on there.

According to a leaked moviNAND datasheet, it seems that MMC CMD62 is vendor-specific command that moviNAND implements.
If you issue CMD62(0xEFAC62EC), then CMD62(0xCCEE) - you can read a "Smart report". To exit this mode, issue CMD62(0xEFAC62EC), then CMD62(0xDECCEE).


So what are they doing in their patch?

1. Whenever an MMC is attached:
a. If it is "VTU00M", revision 0xf1, they read a Smart report.
b. The DWORD at Smart[324:328] represents a date (little-endian); if it is not 0x20120413, they don't patch the firmware. (Maybe only chips from 2012/04/13 are buggy?)
2. If the chip is buggy, whenever an MMC is attached or the device is resumed:
a. Issue CMD62(0xEFAC62EC) CMD62(0x10210000) to enter RAM write mode. Now you can write to RAM by issuing MMC_ERASE_GROUP_START(Address to write) MMC_ERASE_GROUP_END(Value to be written) MMC_ERASE(0).
b. *(0x40300) = 10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00
c. *(0x5C7EA) = E3 F7 89 FD
d. Exit RAM write mode by issuing CMD62(0xEFAC62EC) CMD62(0xDECCEE).
10 B5 looks like a common Thumb push (in ARM architecture). Disassembling the bytes that they write to 0x40300 yields the following code:
Code:
ROM:00040300                 PUSH    {R4,LR}
ROM:00040302                 LDR     R2, =0x59D73
ROM:00040304                 BLX     R2
ROM:00040306                 CMP     R0, #0
ROM:00040308                 BNE     locret_4030C
ROM:0004030A
ROM:0004030A loc_4030A                               ; CODE XREF: ROM:loc_4030Aj
ROM:0004030A                 B       loc_4030A
ROM:0004030C ; ---------------------------------------------------------------------------
ROM:0004030C
ROM:0004030C locret_4030C                            ; CODE XREF: ROM:00040308j
ROM:0004030C                 POP     {R4,PC}
ROM:0004030C ; ---------------------------------------------------------------------
Disassembling what they write to 0x5C7EA yields this:
Code:
ROM:0005C7EA                 BL      0x40300
Looks like it is indeed Thumb code.
If we could dump the eMMC RAM, we would understand what has been changed.


By inspecting some code, it seems that we know how to dump the eMMC RAM:
Look at the function mmc_set_wearlevel_page in line 206. It patches the RAM (using the method mentioned before), then it validates what it has written (in lines 255-290). Seems that the procedure to read the RAM is as following:
1. CMD62(0xEFAC62EC) CMD62(0x10210002) to enter RAM reading mode
2. MMC_ERASE_GROUP_START(Address to read) MMC_ERASE_GROUP_END(Length to read) MMC_ERASE(0)
3. MMC_READ_SINGLE_BLOCK to read the data
4. CMD62(0xEFAC62EC) CMD62(0xDECCEE) to exit RAM reading mode


I don't want to run this on my device, because I'm afraid - messing with the eMMC doesn't sound like a very good idea on my device (I don't have a spare one).
Does someone have a development device which he doesn't mind to risk, and want to dump the eMMC firmware from it? :)
:crying: --> **Ultimate GS3 sudden death thread** :crying:
Just wanted to link to a prior thread with some information/testing that as been done. Completely understand if you nuke it because it doesn't meet the proper criteria or is way to noobish to be posted here. Anyway, just though it _might_ help, so giving it a shot..
 

odewdney

New member
Feb 28, 2008
1
2
0
So I decided to do a small RAM dump after all.

Before the patch, 0x5C7EA reads FD F7 C2 FA, which is "BL 0x59D72".
As I thought, they replace a function call to the new one.

I will dump function 0x59D72 later this week.
So it looks like the new function calls the old function, and then if it returns ZERO then in goes into an INFINITE loop?!?

Seems like an odd fix, maybe self presevation?

Oli
 
  • Like
Reactions: Brunal and Oranav

Rob2222

Senior Member
Feb 18, 2008
413
306
0
So it looks like the new function calls the old function, and then if it returns ZERO then in goes into an INFINITE loop?!?
Seems like an odd fix, maybe self presevation?
Oli
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13 my 06/12 SGS3 had the _first freeze ever_ on XXELLA. Maybe its completely unrelated and was only a random thing.

But, could it be, that this fix temporary (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
But then in this cases you get a phone freeze, cause the eMMC is temporary unaviable so the phone crashes until you reboot it. But it avoided eMMC data structure damage.

Sounds not very logical, but when you have to fix a problem and only have a few bytes to patch it (cause it must be run on every emmc-start), and the problem only occurs on a hand full devices (out of millions) then it is maybe acceptable to have a freeze instead of a dead eMMC in that rare cases that it occurs.

But this is only a idea... don't now if it is like that.

BR
Rob

PS: Oranav, thank you very much for your effort.
 
Last edited:

Oranav

Senior Member
Oct 9, 2010
53
262
0
So it looks like the new function calls the old function, and then if it returns ZERO then in goes into an INFINITE loop?!?

Seems like an odd fix, maybe self presevation?

Oli
Right, haven't spotted this. Thanks for the observation.
Self preservation sounds possible.

WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13 my 06/12 SGS3 had the _first freeze ever_ on XXELLA. Maybe its completely unrelated and was only a random thing.

But, could it be, that this fix temporary (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
But then in this cases you get a phone freeze, cause the eMMC is temporary unaviable so the phone crashes until you reboot it. But it avoided eMMC data structure damage.

Sounds not very logical, but when you have to fix a problem and only have a few bytes to patch it (cause it must be run on every emmc-start), and the problem only occurs on a hand full devices (out of millions) then it is maybe acceptable to have a freeze instead of a dead eMMC in that rare cases that it occurs.

But this is only a idea... don't now if it is like that.

BR
Rob

PS: Oranav, thank you very much for your effort.
This could be possible - this patch looks like a quick and dirty fix, so maybe they didn't have the time to properly fix this. Instead, they just avoid the bug absolutely (with the cost of data corruption).
But I don't think this would cause lockups - I believe the chip has a watchdog...


All in all, I think the best thing we can do right now is to dump the whole firmware out of it. I will do it soon.
 
Last edited:

Bäcker

Senior Member
Jan 4, 2007
485
241
73
there is also a chance that this is just a temporary workaround to prevent further bricking - until there is a final fix.

As of now we asume that this is the fix as it directly adresses the eMMC in concern, but all this is just based on asumptions.
 

ivan4o5

Senior Member
Nov 24, 2010
77
6
0
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13 my 06/12 SGS3 had the _first freeze ever_ on XXELLA. Maybe its completely unrelated and was only a random thing.

But, could it be, that this fix temporary (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
But then in this cases you get a phone freeze, cause the eMMC is temporary unaviable so the phone crashes until you reboot it. But it avoided eMMC data structure damage.

Sounds not very logical, but when you have to fix a problem and only have a few bytes to patch it (cause it must be run on every emmc-start), and the problem only occurs on a hand full devices (out of millions) then it is maybe acceptable to have a freeze instead of a dead eMMC in that rare cases that it occurs.

But this is only a idea... don't now if it is like that.

BR
Rob

PS: Oranav, thank you very much for your effort.
I think we can prove that the fix is actualy locking into loop ,but you must risk your phone :/ . If you are in ->
1) Flash back to older version without the fix
2) Wait and pray
a) If your phone dies --> the guys were right about the fix and the loop
b) If it stays alive ,then ...

We wont know for sure ,but your phone is maybe in the "perfect" condition for the test :/ .
(Sorry if this makes no sence)
 

Rob2222

Senior Member
Feb 18, 2008
413
306
0
@ivan:
Sorry, can't do that. :( Cause of high air humidity my humidity indicator is already a little soaked. Cause of the warranty-repair reports in out local forums I am not sure if I would get warranty. I think theres a fair chance, that they would deny my warranty. Cause of that I don't want to take any extra risk. I am on unrooted stock at the moment, cause of that.

@Oranav:
In our local forum we get some reports about a rising count of locks and restarts on S3's in the last time. Some like my freeze.
It also seems that after a while this problems gets better and even disappear completely.
Cause of that I am thinking, if it could be, that the fix maybe locks the eMMC if it finds a bad data structure, then this locks maybe could bring a phone-freeze (already stated that), and in the same time it repairs the data structure in this block with the bad data structure.
At least this would explain some rising count of freezes with the fix and the point, that the freezes become less and less over time.

I have no idea if it's that way, I just wanted to post it as theory to think about.

BTW, do you think when the watchdog restarts the eMMC that it goes that fast that the phone isn't affected?

BR
Robert
 
Last edited:

Oranav

Senior Member
Oct 9, 2010
53
262
0
I have no idea if it's that way, I just wanted to post it as theory to think about.
The problem is that there are too many theories imaginable, but I can't think of no way to prove them but to reverse engineer the MoviNAND firmware.

BTW, do you think when the watchdog restarts the eMMC that it goes that fast that the phone isn't affected?
Certainly not. Watchdogs are slow, drivers running on a Cortex-A9 are blazing fast.
But I do think Linux's MMC driver can handle device restarts during an MMC operation.
 
  • Like
Reactions: liamR and Rob2222

odoto

Senior Member
Dec 20, 2010
56
21
0
@ivan:
Sorry, can't do that. :( Cause of high air humidity my humidity indicator is already a little soaked. Cause of the warranty-repair reports in out local forums I am not sure if I would get warranty. I think theres a fair chance, that they would deny my warranty. Cause of that I don't want to take any extra risk. I am on unrooted stock at the moment, cause of that.

@Oranav:
In our local forum we get some reports about a rising count of locks and restarts on S3's in the last time. Some like my freeze.
It also seems that after a while this problems gets better and even disappear completely.
Cause of that I am thinking, if it could be, that the fix maybe locks the eMMC if it finds a bad data structure, then this locks maybe could bring a phone-freeze (already stated that), and in the same time it repairs the data structure in this block with the bad data structure.
At least this would explain some rising count of freezes with the fix and the point, that the freezes become less and less over time.

I have no idea if it's that way, I just wanted to post it as theory to think about.

BTW, do you think when the watchdog restarts the eMMC that it goes that fast that the phone isn't affected?

BR
Robert
if those freezes are really caused by the firmware fix, I don't see how the would disappear over time...

I mean if it really is the case that the fix trades data corruption for eMMC survival, it would make sense to see freezes... but depending on what data is affected, they should only be treatable by reinstalling the affected app or deleting its data/cache.

updated theory:
for all we know the error condition where the eMMC dies is quite rare, since most devices have been used for month before they passed away. So under the assumption that the error condition appears randomly and that there is a chance of data corruption every time the condition appears with fixed kernel, we could expect to see freezes and other problems some time after the fix was applied. So that would explain the raising number of freezes reported. Furthermore I'd assume that people getting freezes would try to do something about it, like reinstalling/deleting apps wiping caches and/or data... or even reflashing, thus repairing the corrupted data. So freezes would disappear.

Wait, doesn't most evidence point to the fact that the error condition does NOT appear in a random fashion, since there were no cases in the beginning, and then a lot all of a sudden? Well, it might be that this is just the way we perceive the issue. Maybe there were cases before, but they weren't reported... phones died, people sent them in, got new ones and went on with their lives. But after some time the issue got known... bloggers wrote about it... and so on... people realized their phones died because of a wider problem... voila, steep raise in reported cases. Also the number of dying S3s would simply rise by a rising number of overall S3s, I mean Samsung kept selling phones, right?

But even under the assumption the bug is related to wear-levelling and not random, here is another idea: I have no clue how the algorithms work, but maybe it uses some sort of pseudo-random data to do whatever, with the same seed on all eMMCs... and thus all of them go through the same series of numbers. And now imagine the error condition is only triggered by a specific number or number set (say someone screwed up a boundary condition). Under this theory the error condition wouldn't appear randomly, but after a certain amount of write ops (or something).

Another question I asked myself is: shouldn't there be cases were data corruption does damage beyond all repair except for reflashing?
Well, it might be, but it seems reasonable to assume that it is a lot less likely than user-data corruption, since most critical files on the phone shouldn't be opened writeable (or are on a read-only mounted partition in the first place), hence shouldn't be affected by ****-ups during writes.

Like the previous poster I want to add that this is most likely all bull****... but it is what came to my mind looking for a theory that supports the data we got.
 

Oranav

Senior Member
Oct 9, 2010
53
262
0
Okay, got a RAM dump :)
I won't post it here (or anywhere else for that matter) because I don't want to get sued by Samsung.

I might release a kernel which allows you to dump the RAM yourself if there's enough demand, but I don't want to right now, because:
1. The code is ugly as hell, not implemented as a kernel module, not thread-safe etc.
2. It is highly dangerous (messing with the eMMC chip - I really don't know how much stable this thing is), so if you want to do it on your device, you should be an expert. In that case, you can write the code yourself (with little effort) :)


Anyway, I hope the FTL is Whimory, since I'm familiar with it. Would be easier.
I'll let you know if I find anything interesting.


PS I've attached a little teaser. (Yes, this is the patched function. 0x40300 is red because I've opened a partial RAM dump.)



EDIT - Some initial results:
0. The CPU is a Cortex-M3.
1. No strings at all :( Just some uninteresting release asserts ("REL_ASSERT")
2. Found the Smart Report generator function -> found the MMC command handlers.
3. Most MMC commands handlers are stored in a function table. There are 3 special commands: MMC60, MMC62, MMC64. Depends on the arguments these special commands are provided, they modify the function table (this is the so called "vendor mode").
4. There are a lot of possible arguments for MMC62, not the only ones we know.
5. If you trace back the function they patch all the way up the call stack, you get to MMC24 and MMC25 handler. These commands are MMC_WRITE_BLOCK and MMC_WRITE_MULTIPLE_BLOCK. Since the function they patch is deep down the call stack, it's very likely that it is the wear level.

Anyway, because of the lack of strings I guess it would be very hard to truly understand the SDS bug we're facing :(
 

Attachments

Last edited:

resore

Senior Member
Apr 12, 2011
115
25
0
Odp: eMMC sudden death research

i cant say i have an idea whats going on inside emmc but usually in this case of mistakes/failures debug or diagnostics code is used for release.
maybe some debug info repeatedly written triggers wear levelling failure
so fix has to simply disable it
 

Rebellos

Senior Recognized Developer
May 13, 2009
1,353
3,427
0
Gdańsk
Awesome research.
So we're dealing with bug in exactly the same eMMC subsystem as in faulty SGS2 eMMC chips, but in device that was released after proving SGS2 eMMCs to be faulty.


Oranav, for some reason I cannot send you PMs. Could you send me your dump? Does your eMMC come from faulty serie?
 

owl74

Senior Member
Nov 15, 2012
328
585
0
Hi all, after reading this thread, I am now scared....

I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3

The above all include the fix for SDS and Exynos hole.

I have been running the device for nearly 1 week I think. Last night I fully charged my phone, used it for 3 minutes surfing the forum (chrome) via wifi connection. After 3 minutes, I left the phone on the table for 3 hours. The only running app is Viber... when I tried to wake the phone up, it did wake up but it froze... no button worked, I tried for about 2 minutes... nothing worked except the power button which booted the device.

This is weird, never experienced this before... I am now scared the phone will die unexpectedly.
 

thealgorithm

Senior Member
Jan 3, 2007
69
18
0
Hi all, after reading this thread, I am now scared....

I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3

The above all include the fix for SDS and Exynos hole.

I have been running the device for nearly 1 week I think. Last night I fully charged my phone, used it for 3 minutes surfing the forum (chrome) via wifi connection. After 3 minutes, I left the phone on the table for 3 hours. The only running app is Viber... when I tried to wake the phone up, it did wake up but it froze... no button worked, I tried for about 2 minutes... nothing worked except the power button which booted the device.

This is weird, never experienced this before... I am now scared the phone will die unexpectedly.
How long were you using the system before you had updated to the 'fix'? None the less, it does not necessarily mean that the phone is getting near to the SDS. I have had a few android phones which would sometimes reboot or hang for other reasons.
 

Oranav

Senior Member
Oct 9, 2010
53
262
0
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.

Rebellos, I enabled the private messaging system for me. I do have the faulty chip.

Sent from my GT-I9300 using xda app-developers app
 

Rob2222

Senior Member
Feb 18, 2008
413
306
0
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Damn Oranav, nice work! :D
So does it mean you get a total screen freeze? Every time?

BR
Robert
 

owl74

Senior Member
Nov 15, 2012
328
585
0
How long were you using the system before you had updated to the 'fix'? None the less, it does not necessarily mean that the phone is getting near to the SDS. I have had a few android phones which would sometimes reboot or hang for other reasons.
About 2 weeks. Forgot to mention i reebooted the phone before I charged it. I will return this phone before it dies on me.... I think I will get an S3 but I will check if it has a new chip... otherwise I will return again and stick to my desire hd which is already running S3 rom....

---------- Post added at 02:24 PM ---------- Previous post was at 02:22 PM ----------

I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.

Rebellos, I enabled the private messaging system for me. I do have the faulty chip.

Sent from my GT-I9300 using xda app-developers app
So everyone's speculation is right about thw fix causing the freeze...
 

SamsungPisser

Senior Member
Jun 17, 2011
50
7
0
AW: eMMC sudden death research

Suppose this fix addresses wear leveling. If firmwares without this fix wear out the eMMC would then not the device still boot and then crash? As far as I know flash is still readable but not writable any more when worn out. Could it be that the wear leveling algorithm has a problem so that after some time it replaces cells from the bootloader and that causes the death?

In short: I want to know if it had a negative effect using the old firmware for some time because that old software caused extreme aging for the eMMC.