eMMC sudden death research

E:V:A · Feb 3, 2013

You can dump all of your "live" RAM with the LiME forensic tool:
http://code.google.com/p/lime-forensics/

But it may be overkill...

Oranav · Feb 4, 2013

Lime isn't relevant here, since the CPU's RAM and the MoviNAND's RAM aren't shared; they aren't even mapped to the same memory space.
In order to read the MoviNAND's RAM, we have to send vendor-specific eMMC commands and read eMMC data...

Entropy512 · Feb 4, 2013

Oranav said:
I have a Hex-Rays license. I actually reverse most of the time using it; I posted assembly code since it's easier to understand with these short snippets (in my point of view).

I won't post a RAM dump since it contains (probably?) licensed code.
I can however post the memory map:
0x00000000 - 0x00020000 BootROM (I guess it's a mask ROM)
0x00040000 - 0x00060000 Firmware (resides in RAM, the BootROM reads it from the NAND chip itself so it's upgradable!)
0x00060000 - 0x00080000 Data (no dynamic memory there BTW)
0x20000000 - 0x20028000 eMMC interface MMIO
0x20080000 - 0x20080400 I don't know, maybe another eMMC interface MMIO?
0x40000000 - 0x40010000 NAND interface MMIO
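
For anyone poking at a dump, the map above can be written down as a small lookup table. This is only a sketch: the struct, the function name and the region labels are made up for illustration, and it assumes the end addresses in the list are exclusive (which fits the Firmware and Data ranges being adjacent).

```c
#include <stddef.h>
#include <stdint.h>

/* One region of the MoviNAND controller's address space, taken from the
 * memory map in the post. End addresses are treated as exclusive, which
 * is an assumption. */
struct movi_region {
    uint32_t start;
    uint32_t end;
    const char *name;
};

static const struct movi_region movi_map[] = {
    { 0x00000000u, 0x00020000u, "BootROM" },
    { 0x00040000u, 0x00060000u, "Firmware (RAM)" },
    { 0x00060000u, 0x00080000u, "Data" },
    { 0x20000000u, 0x20028000u, "eMMC interface MMIO" },
    { 0x20080000u, 0x20080400u, "Unknown MMIO" },
    { 0x40000000u, 0x40010000u, "NAND interface MMIO" },
};

/* Hypothetical helper: returns the region name for addr, or NULL when the
 * address is not covered by any listed region. */
const char *movi_region_name(uint32_t addr)
{
    for (size_t i = 0; i < sizeof movi_map / sizeof movi_map[0]; i++) {
        if (addr >= movi_map[i].start && addr < movi_map[i].end)
            return movi_map[i].name;
    }
    return NULL;
}
```

Addresses between 0x00020000 and 0x00040000 are not in the list, so the helper returns NULL for them rather than guessing.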

I can send you my RAM dump over IRC if you'd like. Besides that, I contemplate posting a .ko which exports the RAM over a character device (this is how I dumped it).

And, yes, dumping the new firmwares to see what has changed is super-cool

A .ko (preferably with source) would be great. It would save me a lot of time implementing the dumper myself. Ideally I'd like to get dumps of VYL00M or MAG4FA 0x19, along with 0x25.

We do know these chips are upgradable; however, Samsung claims that:
1) To upgrade the firmware, you must completely wipe the chip including all bootloaders. (Interestingly enough, this fullwipe will resurrect Superbricked devices) - I believe this
2) The process is so dangerous that it fails frequently in a way that makes the chip 100% unrepairable - I'm a bit skeptical about this one, at least the claims of an absurdly high failure rate.

#2 is why we have no way to repair Superbricked devices.

E:V:A · Feb 5, 2013

At first I didn't quite understand what was going on, but now when I see what is happening... Excellent!

I'd love to see some tool to come out of this, to read eMMC RAM. I can see several cross-platform applications for this! Something like "viewmem", but "viewemmc" instead. Would this be feasible?

@ Entropy512: Could this be incorporated into your "Got Brickbug?" app, or something similar?

Entropy512 · Feb 5, 2013

E:V:A said:
At first I didn't quite understand what was going on, but now when I see what is happening... Excellent!

I'd love to see some tool to come out of this, to read eMMC RAM. I can see several cross-platform applications for this! Something like "viewmem", but "viewemmc" instead. Would this be feasible?

@ Entropy512: Could this be incorporated into your "Got Brickbug?" app, or something similar?

That's not my app. It also probably couldn't be integrated into any detection app, BUT it would be interesting to see what the differences are between MAG4FA 0x19 (BAD) and MAG4FA 0x25 (good other than a far less nasty "wear leveller randomly inserts 32kb of zeros" bug).

Maybe even if there were a way to write the RAM back to NAND, then the chip could be reset/wiped - we know this is possible but dangerous. It could only be researched by someone who has JTAG access to an affected device, since no affected device has any known way to boot from USB or SDCard.

Product F(RED) · Feb 5, 2013

What's sad about this is that you guys are probably doing more work than Samsung, trying to get to the bottom of the problem. But I guess that's also a good thing in a way; I still have faith in humanity.

E:V:A · Feb 6, 2013

Why doesn't someone just email whoever made the patch? Perhaps he/she could at least explain the reasoning behind it, without giving out all Samsung "secrets".

Rob2222 said:
...
In our local forum we have recently had a rising number of reports of lockups and restarts on S3s. Some are like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that I wonder whether the fix locks the eMMC when it finds a bad data structure; this lock could cause a phone freeze (as already stated), while at the same time the fix repairs the bad data structure in that block.
At least this would explain the rising number of freezes with the fix, and the fact that the freezes become less frequent over time...

odoto said:
...I have no clue how the algorithms work, but maybe it uses some sort of pseudo-random data to do whatever, with the same seed on all eMMCs... and thus all of them go through the same series of numbers. And now imagine the error condition is only triggered by a specific number or number set (say someone screwed up a boundary condition). Under this theory the error condition wouldn't appear randomly, but after a certain amount of write ops (or something).

I'm thinking the same thing. Just like the "bad sectors" on a good old HD, perhaps the bad "sectors" on an eMMC are being avoided or tagged as bad by the wear-leveling algorithm. These "tags" also have to be written somewhere, so if this function was screwed up somehow, I guess you'd get corruption even though most of the chip is still functioning. The "patch-fix" then slowly discovers these errors and avoids them, causing us to see a decrease in problems.

There must be something written somewhere about wear-leveling of eMMC's...

Oranav said:
* If someone has a BinDiff license and wants to help, it'd be great!

Zynamics BinDiff seems very nice... but with(out) a price tag.
(Again, write the guys and ask for a 99% XDA Developer discount. After all, the company has been acquired by Google and we're working for Android!)

But you can also try some free hex-based ones:
VBinDiff
Another BinDiff
DiffNow (for source code/text and web based)

Entropy512 said:
That's not my app.

My bad. Sleepy eyes. (Chainfire's app)

Oranav · Feb 6, 2013

Entropy512 said:
That's not my app. It also probably couldn't be integrated into any detection app, BUT it would be interesting to see what the differences are between MAG4FA 0x19 (BAD) and MAG4FA 0x25 (good other than a far less nasty "wear leveller randomly inserts 32kb of zeros" bug).

Maybe even if there were a way to write the RAM back to NAND, then the chip could be reset/wiped - we know this is possible but dangerous. It could only be researched by someone who has JTAG access to an affected device, since no affected device has any known way to boot from USB or SDCard.

I think it is possible to update the firmware.
Except for CMD62, there are 2 more vendor specific commands (CMD60 and CMD64). I think I saw somewhere a command which updates the firmware on the NAND; I'm not sure now but I'll check it later. The BootROM is also very small so it's easy to find exactly where the firmware is stored on the NAND.
About the danger with this process, I think it's mostly due to the risk of having no bootloader or no Movi firmware. However, I think the Movi BootROM has a recovery mode, so if we're somehow able to boot the device from the mmc1 bus (SD card), we're okay.

Anyway, later this week I'll write that .ko (currently I just edited the Linux MMC subsystem code) and push to Github.

E:V:A · Feb 6, 2013

@Oranav: Can you PM me a memory dump?

I'd like to see the Smart Report for a failing device versus a working one...

I have a bad feeling that this problem could be much greater than Samsung likes to admit. At least if this bug has anything to do with wear-leveling...

Also, can "someone" help me "fill in" the following:

a) what exact devices are having problems?
b) What exact eMMC cards do they have? (And size)
c) Is that leaked datasheet in OP, for any of those in (b)?
d) What die size "technology" are these eMMC's using? (25 nm, 34 nm or other?)
e) Do you know anything about how people with eMMC problems use their devices?
f) The Linux kernel version for problematic devices...

Entropy512 · Feb 6, 2013

E:V:A said:
Why doesn't someone just email whoever made the patch? Perhaps he/she could at least explain the reasoning behind it, without giving out all Samsung "secrets".

Not possible, since the patch was excised from a MASSIVE kernel update with thousands of lines of changes. There is zero commit history for the tarball drop. The tagging of Andrei's patch as "Samsung OSRC" was some custom hackery by him - he diffed two kernels, split up the commits, and set authorship to "Samsung OSRC".

I'm thinking the same thing. Just like the "bad sectors" on a good old HD, perhaps the bad "sectors" on an eMMC are being avoided or tagged as bad by the wear-leveling algorithm. These "tags" also have to be written somewhere, so if this function was screwed up somehow, I guess you'd get corruption even though most of the chip is still functioning. The "patch-fix" then slowly discovers these errors and avoids them, causing us to see a decrease in problems.

There must be something written somewhere about wear-leveling of eMMC's...

How an eMMC does internal wear levelling is up to the manufacturer - eMMC only defines the external interface.

Wear levelling algorithms are typically considered highly proprietary by the manufacturer.

So far, historically every "catastrophic" MMC failure we've dealt with on Samsung eMMCs has had nothing to do with bad/corrupt sectors - it has to do with bad/corrupt internal data. Think of it as a lower-level version of a corrupt ext4 filesystem... The underlying disk is fine, but the filesystem is useless without a reformat. Problem is, there's no documented way outside of a factory to completely reset an eMMC (e.g. "low level format").

In the case of Superbrick, a secure erase command issued to a region that contains sectors in a certain state (associated with a performance optimization, not with failure handling/recovery) would corrupt the wear leveller's internal data. Then, any time you tried to access nearby memory, the wear leveller would simply crash.

SDS seems to be a case of some function potentially returning 0 (maybe due to integer overflow? The statistics of the issue and how it suddenly "spiked" after a number of months of usage screams overflow to me), and that 0 then being treated as data instead of an error, corrupting data structures right and left.

Oranav said:
I think it is possible to update the firmware.
Except for CMD62, there are 2 more vendor specific commands (CMD60 and CMD64). I think I saw somewhere a command which updates the firmware on the NAND; I'm not sure now but I'll check it later. The BootROM is also very small so it's easy to find exactly where the firmware is stored on the NAND.
About the danger with this process, I think it's mostly due to the risk of having no bootloader or no Movi firmware. However, I think the Movi BootROM has a recovery mode, so if we're somehow able to boot the device from the mmc1 bus (SD card), we're okay.

Anyway, later this week I'll write that .ko (currently I just edited the Linux MMC subsystem code) and push to Github.

No internal firmware seems to be the risk that Samsung was most concerned about when they decided not to release Superbrick repair code - Supposedly if the firmware update doesn't go perfectly the chip is 100% toast. (However, language barrier and such could have really meant just that the device's bootloaders were hosed...)

There might also be some interface that allows the MMC to be programmed in the factory that isn't exposed once soldered to a board.

Unfortunately, as we don't have a security-dropped IBL that is signed for Exynos 4210, there is no SDCard or USB recovery available for 4210 devices like there is for Exynos3 and for 4412 devices. If you kill the bootloaders, JTAG is it.

Oranav · Feb 6, 2013

Entropy512 said:
SDS seems to be a case of some function potentially returning 0 (maybe due to integer overflow? The statistics of the issue and how it suddenly "spiked" after a number of months of usage screams overflow to me), and that 0 then being treated as data instead of an error, corrupting data structures right and left.

It doesn't seem like an integer overflow, at least not a straightforward one.
This is the function they patch:

Code:

int __fastcall f_to_be_patched_function(_DWORD *out, int val)
{
  int ret; // r2@1

  ret = 0;
  if ( *off_5FC60 == val )
  {
    *out = off_5FC60;
    return 1;
  }
  if ( *off_5FC64 == val )
  {
    *out = off_5FC64;
    return 1;
  }
  *out = 0;
  return ret;
}

Both off_5FC60 and off_5FC64 point to some FTL related contexts.
This is the wrapper function they write to the RAM:

Code:

void __fastcall f_new_function_by_patch(_DWORD *out, int val)
{
  if ( !f_to_be_patched_function(out, val) )
  {
    while ( 1 )
      ;
  }
}

The BL instruction that is being patched used to call the old function (f_to_be_patched_function), without checking its return value, hence the bug.
What's so strange about it is that "f_to_be_patched_function" is called from many other locations in the code, without checking the return value! So the bug exists in other locations as well.
Either the other locations don't cause internal metadata corruption, or they are just so rare that Samsung didn't even bother to patch them.
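
To make the failure mode concrete, here is a minimal model in plain C. It is a sketch only: `struct ftl_ctx`, its fields and all function names are invented stand-ins, since the real context layout is unknown. The lookup has the same shape as the decompiled function, and the caller shows what checking the return value looks like; an unchecked caller would simply carry on with a null context pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the two FTL contexts; the real layout is unknown. */
struct ftl_ctx {
    uint32_t id;
    uint32_t state;
};

static struct ftl_ctx ctx_a = { 1u, 0u };
static struct ftl_ctx ctx_b = { 2u, 0u };

/* Same shape as the decompiled lookup: returns 1 and a context on a match,
 * returns 0 and sets *out to NULL otherwise. */
int ftl_lookup(struct ftl_ctx **out, uint32_t id)
{
    if (ctx_a.id == id) { *out = &ctx_a; return 1; }
    if (ctx_b.id == id) { *out = &ctx_b; return 1; }
    *out = NULL;
    return 0;
}

/* Checked caller: reports failure instead of touching a NULL context.
 * The patched firmware spins forever at this point instead of returning. */
int ftl_mark_state(uint32_t id, uint32_t state)
{
    struct ftl_ctx *ctx;

    if (!ftl_lookup(&ctx, id))
        return -1;
    ctx->state = state;
    return 0;
}
```

Returning an error here is a design choice for the model only; the patch itself just hangs the controller, which suggests the firmware has no error path at that call site.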

Entropy512 said:
Unfortunately, as we don't have a security-dropped IBL that is signed for Exynos 4210, there is no SDCard or USB recovery available for 4210 devices like there is for Exynos3 and for 4412 devices. If you kill the bootloaders, JTAG is it.

Wait, so we do have a way to boot Exynos 4412 devices (Galaxy S3) from the mmc1 bus?
If so, why isn't SDS fixable?

E:V:A · Feb 7, 2013

Oranav said:
I think it is possible to update the firmware.
Except for CMD62, there are 2 more vendor specific commands (CMD60 and CMD64). I think I saw somewhere a command which updates the firmware on the NAND; I'm not sure now but I'll check it later.

There is no CMD64, because the command index only goes from 0 to 63. CMD60-63 are "Reserved for Manufacturers" and belong to the reserved Class 11.

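The 0-63 limit follows from the command token format: the command index is a 6-bit field that comes after the start bit and the transmission bit. A small sketch (the macro and function names are my own, not taken from any driver):

```c
#include <stdint.h>

/* The MMC command token carries a 6-bit command index. */
#define MMC_CMD_INDEX_BITS 6
#define MMC_CMD_INDEX_MAX  ((1u << MMC_CMD_INDEX_BITS) - 1u)   /* 63 */

/* Returns 1 if index fits in the 6-bit command index field. */
int mmc_cmd_index_valid(unsigned int index)
{
    return index <= MMC_CMD_INDEX_MAX;
}

/* Returns 1 for the manufacturer-reserved commands CMD60..CMD63. */
int mmc_cmd_is_vendor(unsigned int index)
{
    return index >= 60u && index <= MMC_CMD_INDEX_MAX;
}

/* First byte of the 48-bit command token: start bit 0, transmission bit 1,
 * then the 6-bit index. Anything above 63 is silently truncated. */
uint8_t mmc_cmd_first_byte(unsigned int index)
{
    return (uint8_t)(0x40u | (index & MMC_CMD_INDEX_MAX));
}
```

With this encoding CMD62 becomes 0x7E on the wire, while a hypothetical "CMD64" would truncate to 0x40, which is CMD0.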
But I agree, there has to be a way to update eMMC firmware. Although Entropy may be right about factory programming, I don't think this "interface" would only be available at that time. I have a strong belief that it should be possible to update. We know all the eMMC pins, and we know the basic interface and the basic technology within, but we don't know the firmware! Samsung's SSD firmwares can certainly be updated!

(We could look for the firmware in there.)

E:V:A said:
I'll just start filling in this myself...

a) what exact devices are having problems?
- GT-I9300/3 with 16 GB MoviNAND
b) What exact eMMC cards do they have? (Samsung part-no/name)
c) Is that leaked datasheet in OP, for any of those in (b)?
- <unknown>
d) What die size "technology" are these eMMC's using? (25 nm, 34 nm or other?)
e) Do you know anything about how people with eMMC problems use their devices?
f) The Linux kernel version for problematic devices...

Code:

Model:        Samsung GT-I9300
Chip:         KMVTU000LM-B503
Part No:      1108-000424 ?
Size:         eMMC(16GB)+MDDR(64MB)
eMMC ID:      VTU00M
eMMC FW Rev.: 0xF1

Oranav said:
It doesn't seem like an integer overflow, at least not a straightforward one. This is the function they patch:
...
Wait, so we do have a way to boot Exynos 4412 devices (Galaxy S3) from the mmc1 bus? If so, why isn't SDS fixable?

Unless you can somehow provide something more substantial than that reversed pseudo-C stuff, I cannot help much. (Or if you can post that module so that we can look for ourselves!)

We can certainly unbrick anything supported by the AdamOutler/Rebellos/Ralekdev unbrickable mods. They also have the boot-from-SD-card mod. In theory we should be able to unbrick the I9100 in the same way, but no one wants to waste more energy on that PoS device! (I know, because I have one... with the VYL00M brick bug!)

E:V:A · Feb 7, 2013

In case someone else would like to join in on this, here are some eMMC basics for reference (cut and pasted from various sources).

Also, I found it useful to understand that, from a low-level point of view, an eMMC and an SSD are essentially the same. An SSD is basically a huge eMMC, but one where the NAND chips are used in parallel, with an added DRAM cache buffer and a SATA interface operating at 5V. So the wear-leveling etc. works in the same way, even though the microcontroller in an SSD is much more advanced. (E.g. the Samsung SSD 840 Pro has a 3-core Cortex-R4 running at 300MHz!) Thus, any problem you encounter in the FTL of an eMMC you will likely also have in an SSD that uses similar NANDs, and vice versa.

The most important and relevant documents are those of the JEDEC standard.
However, our device conforms to (JESD84) v4.41 and not v4.51, AFAIK.
"JEDEC: Embedded MultiMediaCard(eMMC) Product Standard..." (JESD84-A441)
"JEDEC: Embedded MultiMediaCard(eMMC) Electrical Standard" (JESD84-B451)
"eMMC v4.41 and v4.5" (JEDEC presentation by Victor Tsai)

2013-02-09: ORIGINAL POST MOVED!

As was pointed out in the subsequent post, this is somewhat OT,
so I decided that a better home for it would be HERE.

DualJoe · Feb 7, 2013

Oranav said:
it's easy to find exactly where the firmware is stored on the NAND.

Correct me if I'm wrong... the firmware NAND you're talking about is the same as the eMMC (not a separate one), right?
If so, can you provide some signature bytes (maybe the first 32 bytes) and the firmware length, so we can dump the whole NAND with a Riffbox (AdamOutler?) and extract the firmware ourselves?

AndreiLux · Feb 7, 2013

EVA, what exactly are you trying to achieve? Seems more off-topic than actual on-topic discussion to me.

And SSDs have absolutely nothing in common with eMMC chips, there's a wholly independent controller on SSDs which simply doesn't exist in embedded devices. The firmwares we're talking about here are not even in the same device category.

E:V:A · Feb 7, 2013

@AndreiLux: Yes, you're right, that was a bit over-ambitious and OT. But I'm also preventing more OT from people who will eventually post speculations about wear-leveling, by giving them the documents and links to go research the topic by themselves. Increasing public knowledge will hopefully up the ~~level~~ speed of this discussion.

Also, the above can help explain why there are often large "empty" (non-user) partitions on Samsung phones. It could be that these act as moving "holes" to improve eMMC life. Thus if we remove them or keep our eMMC maxed out, we'll get problems much sooner than someone who has lots of space left.

But more importantly, I'm showing you that a P/E limit of ~3000 could very easily be reached by excessive writes, especially with eMMC firmware bugs. Also, I completely disagree that "SSDs have absolutely nothing in common with eMMC chips"; they certainly do have much in common, as I stated above. An SSD basically consists of an N x M RAID-0-like array of MLC NANDs, and each of those conforms to the exact same criteria as our eMMC in question. At the low level the individual wear-leveling must be the same or very similar. (Mind you, I'm ignoring the SATA "controller" + cache memory.)
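
To put the ~3000 P/E figure in perspective, here is a back-of-the-envelope estimate. It is a rough sketch that assumes perfectly even wear-leveling; the function name is made up, and the write amplification factor and the daily write volume are inputs you have to guess, not measured values.

```c
/* Rough lifetime estimate in days, assuming ideal wear-leveling:
 *
 *   days = capacity * P/E cycles / (write amplification * host writes per day)
 *
 * capacity_gb      usable NAND capacity in GB (16 for the chip discussed here)
 * pe_cycles        rated program/erase cycles per block (~3000 above)
 * write_amp        NAND bytes written per host byte written (assumption)
 * host_gb_per_day  host writes per day in GB (assumption)
 *
 * Returns -1.0 for non-positive write_amp or host_gb_per_day. */
double emmc_lifetime_days(double capacity_gb, double pe_cycles,
                          double write_amp, double host_gb_per_day)
{
    if (write_amp <= 0.0 || host_gb_per_day <= 0.0)
        return -1.0;
    return capacity_gb * pe_cycles / (write_amp * host_gb_per_day);
}
```

For example, 16 GB at 3000 cycles with an assumed write amplification of 10 and 4 GB of host writes per day gives 1200 days. A firmware bug that inflates write amplification, or wear-leveling that concentrates writes on a few blocks, shortens that proportionally.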

I could of course be completely wrong, but then I suggest that you provide some backup of your statement...

---

We should make a comparison of the "Smart Reports" from a working and a problematic eMMC. If these are very different, we could learn more...

Could someone dump such a report?

E:V:A · Feb 9, 2013

Not sure if this helps, but if there is any dependence on kernel version, we might figure it out from this list of kernel eMMC patches...

Code:

2.6.36
        • ERASE, SECURE ERASE, TRIM, and SECURE TRIM operations (JEDEC 4.4)
        • mmc_block: Discard and secure discard support
        • SD-combo (IO+mem) support
        • Performance tests
2.6.37
        • New sdhci-pxa driver for Marvell SoCs
        • MMC 4.4 DDR support
        • sdhci-pltfm: Platform driver for imx35/51
        • USB SD host controller (USHC) driver
2.6.39
        • mxs-mmc: MMC host driver for i.MX23/28
3.0
        • MMC CMD+ACMD passthrough IOCTL reliable write support
        • MMC boot partition support
        • New VUB300 USB-to-SD/SDIO/MMC driver
        • SD: Support for signal voltage switch procedure
3.2
        • Enabled HPI for MMC cards that support this feature
        • Cache control for e·MMC 4.5 devices
        • e·MMC hardware reset support
        • Random fault injection
        • General-purpose MMC partition support (JEDEC 4.4)
        • SDHCI: e·MMC hardware reset support
        • sdhci-pci: Runtime PM support
        • mmc-test: e·MMC hardware reset test

In the meantime I'm waiting with great expectations on the code for that kernel module...

koalauk · Feb 9, 2013

I am sorry, I know I am not supposed to post here as I am not a developer, but I thought I'd share my little finding about SDS.

Users with SDS always report the red sensor LED staying on. As I have been a bit worried about SDS (4.1.1 UK BTU, no update as of now), I noticed that every time I rebooted (or started from powered off) the bootloader seems to check the hardware: the red LED comes on for about 0.6 seconds and the boot sequence continues. I couldn't wait any longer for the BTU OTA update, so I have now updated to the N7100XXDMA6 N7100OJVDMA2 Turkey ROM, and I can clearly see that during startup the red LED does not come on, so the hardware is possibly not being checked! I hope I don't sound daft! Could SDS be a malfunction of the sensor board?

E:V:A · Feb 10, 2013

Whether eMMC firmware updates are possible is determined by the "Update_Disable" bit (bit 0) of the FW_CONFIG field, which is located at byte [169] of the EXT_CSD (Extended CSD) register.
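
Checking that bit from a raw register dump could look like the sketch below. It assumes you already have the 512-byte Extended CSD (EXT_CSD) block, which is where FW_CONFIG lives as far as I understand the JEDEC layout; the macro and function names are made up. FW_CONFIG also seems to be an eMMC v4.5 field, so on a v4.41 part byte 169 may simply be reserved and the result meaningless.

```c
#include <stdint.h>

#define EXT_CSD_SIZE             512    /* size of the EXT_CSD block in bytes */
#define EXT_CSD_FW_CONFIG        169    /* FW_CONFIG byte index */
#define FW_CONFIG_UPDATE_DISABLE 0x01u  /* bit 0: Update_Disable */

/* Returns 1 if the Update_Disable bit is clear (updates not disabled),
 * 0 if it is set. ext_csd must point to a full 512-byte EXT_CSD dump. */
int emmc_fw_update_allowed(const uint8_t ext_csd[EXT_CSD_SIZE])
{
    return (ext_csd[EXT_CSD_FW_CONFIG] & FW_CONFIG_UPDATE_DISABLE) == 0;
}
```

Only bit 0 is tested; the other bits of the byte are ignored.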

Entropy512 · Feb 11, 2013

Oranav said:
It doesn't seem like an integer overflow, at least not a straightforward one.
This is the function they patch:

Code:

int __fastcall f_to_be_patched_function(_DWORD *out, int val)
{
  int ret; // r2@1

  ret = 0;
  if ( *off_5FC60 == val )
  {
    *out = off_5FC60;
    return 1;
  }
  if ( *off_5FC64 == val )
  {
    *out = off_5FC64;
    return 1;
  }
  *out = 0;
  return ret;
}

Both off_5FC60 and off_5FC64 point to some FTL related contexts.
This is the wrapper function they write to the RAM:

Code:

void __fastcall f_new_function_by_patch(_DWORD *out, int val)
{
  if ( !f_to_be_patched_function(out, val) )
  {
    while ( 1 )
      ;
  }
}

The BL instruction that is being patched used to call the old function (f_to_be_patched_function), without checking its return value, hence the bug.
What's so strange about it is that "f_to_be_patched_function" is called from many other locations in the code, without checking the return value! So the bug exists in other locations as well.
Either the other locations don't cause internal metadata corruption, or they are just so rare that Samsung didn't even bother to patch them.

So it sorta looks like in the original firmware, it's (bear with me, this is really fugly pseudocode)

Code:

if( !some_sanity_check_here())
{
  crater_the_chip();
}

(where, obviously, crater_the_chip() is not actually a function, but it is what happens if that sanity check ever fails when called from that part of the code...)

Now it's

Code:

if( !some_sanity_check_here())
{
  hang_chip_until_reset();
}

Wait, so we do have a way to boot Exynos 4412 devices (Galaxy S3) from the mmc1 bus?
If so, why isn't SDS fixable?

It's possible to boot from the MMC1 bus.

SDS is still not fixable since at this point the internal eMMC is hosed at a very low level, unless we can figure out how to do a full reset/wipe of the eMMC chip from the main eMMC interface. We know that this is theoretically possible, as Ken Sumrall of Google had access to such a procedure, but he was not able to share it with us due to NDAs, so we have no examples of how it is performed. Same reason Superbricked devices can't even be repaired using JTAG.

Some SDSed devices behaved similarly to how many Superbricked devices behaved - parts of the chip worked OK (including the bootloader), others were hosed. Quite a few people who suffered from SDS were able to boot into download mode but not write to any part of the chip.
