Discussion thread for /data EMMC lockup/corruption bug

Esoteric68 · Jun 2, 2012

Samsung really needs to hire you, garwynn and Entropy.

Entropy512 · Jun 2, 2012

Hmm. CONFIG_MMC_DISCARD is enabled on I9100/I777 Gingerbread kernels, which is why I said that MMC_CAP_ERASE was enabled (MMC_CAP_ERASE is #ifdefed based on CONFIG_MMC_DISCARD) - I'll look elsewhere tomorrow to see if maybe the "chain of doom" was broken elsewhere.

It's a lot easier to find stuff when you know what you're looking for.

I believe that still, removing MMC_CAP_ERASE from mshci.c is guaranteed to render you safe - but even if it's enabled, there are other places in the kernel where the chain could be broken.

sfhub · Jun 2, 2012

Entropy512 said:
Hmm. CONFIG_MMC_DISCARD is enabled on I9100/I777 Gingerbread kernels, which is why I said that MMC_CAP_ERASE was enabled (MMC_CAP_ERASE is #ifdefed based on CONFIG_MMC_DISCARD) - I'll look elsewhere tomorrow to see if maybe the "chain of doom" was broken elsewhere.

I am not surprised I9100/I777 source tree differs from E4GT in some areas. I saw a bunch of #ifdef CONFIG_MMC_DISCARD_MERGE code which indicates to me they were (for our tree) in the process of merging the discard code so it probably could have gone either way depending on when GB for a particular platform was released.

Entropy512 said:
It's a lot easier to find stuff when you know what you're looking for.

It is easier to search when you know what to search for

In this case I wasn't starting with MMC_CAP_ERASE though.

I traced top-down from the wipe.c ioctl() from userspace to kernelspace and eventually down to mmc_erase() through the block interface indirection layers. Then I traced from mmc_erase() upwards to see who else might be making calls to it. It was all a bit messy because of the CONFIG_MMC_DISCARD_MERGE code and the resulting multiple definitions of functions.

I kept wondering why the ioctl() function was seemingly left undefined by the ifdefs, then realized the function table wasn't populated, so the ioctl() functionality was essentially disabled for the mmc driver (on our platform). I tried to stress a few times above that this was based on our platform and that I wasn't sure about others.

Entropy512 said:
I believe that still, removing MMC_CAP_ERASE from mshci.c is guaranteed to render you safe - but even if it's enabled, there are other places in the kernel where the chain could be broken.

Most likely that is correct.

My goal in the previous post was not necessarily to determine where to do a workaround in the kernel, but to answer the question of why the "unsafe" update-binary included with our CM9/AOKP installs didn't trigger mmc_erase() (and expose themselves to the EMMC lockup/superbrick) when run on a GB kernel/recovery which I thought had mmc_erase() enabled. I found that for our platform, the most immediate reason was the ioctl() was disabled.

The difference in the CONFIG_MMC_DISCARD ifdef status in E4GT and I9100/I777 might end up resulting in slightly different bricking behavior when coupling an "unsafe" update-binary with a stock GB kernel/recovery.

I say "might" because I am not sure if the issue is with ERASE, TRIM, SECDISCARD, all three, or some combination of the three. It is possible only SECDISCARD is problematic but ERASE and TRIM are ok because the SECDISCARD code wasn't even in the GB source (for our platform) so it would be something new introduced in ICS (and also we can see it is being used in wipe.c). It is also possible they are all bad.

garwynn · Jun 2, 2012

Esoteric68 said:
Samsung really needs to hire you, garwynn and Entropy.

Then I would have to leave my current job, which I can say is the only job in 17 years that I love. The biggest reasons are the managers and coworkers. Without the people and mindset in that office I could easily see working there being the polar opposite.

It is honestly a miracle how I've had opportunities that I did... Much like getting a response from Samsung, Sprint and Google on inquiries with no previous working relationship.

Sent from my SPH-D710 using XDA

sfhub · Jun 2, 2012

I posted this less technical writeup on agat63's cwm repacked thread but figured it would be useful to have here also. I am working with CM9/AOKP to have their install scripts replace format("/system") with delete_recursive("/system"). After that I think if we still have volunteers, we are ready to do some more testing.

I'll provide more details when the pieces become available, but if you'd like to take one for the team and help test, please post. If our understanding of the problem is accurate, you should be safe this time around, but there is always the chance we are not understanding the problem completely.

====

I would like to stress that the information I gave agat was based on code tracing and NOT based on real-world testing. You should treat this as a *testing period* to confirm the analysis. You MAY BRICK if the analysis is incorrect.

The following is all assuming you are repacked with ICS kernel (ie we aren't talking about the GB kernel)...

Background
The nature of the problem is a call to the function make_ext4fs(). This function isn't provided by the kernel, rather it is provided as a library (libext4_utils.a) that is used when compiling Recovery and the update installer. It does end up eventually calling kernel mmc driver routines, which then trigger the EMMC firmware lockup/superbrick bug.

The make_ext4fs() function changed between GB and ICS. In GB the function didn't try to erase the partition before creating the EXT4 fs. In ICS it tries to erase it first. That erase, issued via the kernel MMC driver, is triggering an EMMC firmware bug that was always there. GB is also "doubly" safe: not only do Recovery and update-binary never attempt the erase, but even if they did, the request to erase would be blocked in the GB kernel and never run.

The EMMC firmware bug will lock up your phone and corrupt internal EMMC meta-data which cannot be accessed or repaired at this time. It isn't crashing your hard drive per se; it is crashing your hard drive controller in a way that prevents the controller from accessing parts of your disk. We don't have any way of updating the EMMC hard drive controller at this time.

The EMMC firmware lockup/superbrick bug is likely contained in the wear-level firmware code which shifts mmc-internal memory block usage around to prevent any one area from overuse. The bug MIGHT NOT be triggered every time, so you can do the same operation with no issues then on your Nth attempt it bricks.

Details
So what does that mean for you? There are 2 executables we are concerned with, Custom Recovery and the update installer (update-binary)

Custom Recovery is responsible for 2 potential bricking points:
1) wipe data/factory reset
2) nandroid backup/restore

These are both handled by the Recovery itself so if your Recovery is "safe" then these operations should be safe. The nandroid backup is safe regardless. Our concern is only for wipe data/factory reset and nandroid restore. Both of these make the call to make_ext4fs(), so if they are using the GB-based version they are safe. If they are using the ICS version, they are not safe (when used with ICS kernel). Agat has made the effort to make sure the recovery he has provided is compiled against GB CM7 source.

You may ask what about Installing ROMs, you thought Recovery was responsible for that too?
This is only partially true. You use the menu option in Recovery to choose to install an update.zip. Recovery is responsible for providing the location of the update.zip and verifying the signature, but when it comes to actually "installing" the update.zip, Recovery uses a "helper app" called update-binary contained in the update.zip.

This update-binary helper app is responsible for running the Edify install script in the update.zip. It communicates with Recovery just to update the progress bar, output ui messages, and set up the updating of firmware. It handles the rest of the script functions directly by itself, so Recovery isn't involved.

update-binary also calls make_ext4fs() so it can also do potentially "unsafe" operations, just like we discussed for Recovery above. If the update-binary included in the update.zip was compiled using GB sources, then it is "safe". If it was compiled against ICS sources then there is one function in the Edify script that can potentially cause bricking, format().

To be clear, Recovery has no control over the update-binary that is included in the update.zip. Whoever built the ROM update.zip package made that decision. So this is why even with a "safe" Recovery, you can brick your phone installing ROMs (with an ICS kernel).

Even if the Recovery is "safe", if you ask it to use an "unsafe" update-binary to install a ROM AND that ROM install script chose to do a format(), then the EMMC lockup/superbrick bug can be triggered.

The reason why most stock-based ROMs don't brick in ICS is because
1) most of them probably include a GB-based update-binary
2) most of them are not performing a format() within their Edify updater-script

So a ROM builder has 2 ways to make a ROM update.zip install "safe" to install in ICS. Either package a known GB-based update-binary OR eliminate format(), if present, from the Edify install script.
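For the second option, a rough way to check whether an install script contains format() at all is sketched below. This is only an illustration, not a tool anyone in this thread has published: the function names are made up, and it assumes the script sits at the usual META-INF/com/google/android/updater-script path inside the update.zip. It is deliberately conservative, so a commented-out format() or one inside a ui_print string is reported too.

```python
import re
import zipfile

# Usual location of the Edify script inside an update.zip (assumed here).
SCRIPT_PATH = "META-INF/com/google/android/updater-script"

# Matches a call such as format(...) but not e.g. reformat(...).
FORMAT_CALL = re.compile(r'(?<![A-Za-z0-9_])format\s*\(')


def read_updater_script(zip_path):
    """Return the text of the Edify install script from an update.zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(SCRIPT_PATH).decode("utf-8", errors="replace")


def find_format_calls(script_text):
    """Return (line_number, line) pairs for every line that calls format()."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if FORMAT_CALL.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = 'ui_print("Installing");\nformat("/system");\n'
    print(find_format_calls(sample))
```

An empty result only says the script itself does not ask for a format; it says nothing about which sources the bundled update-binary was compiled against.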

So why does Calk's format_all seem to never brick even on ICS? Given the date on the update-binary and when he created the package, it is most likely using a GB-based update-binary

So why does CM9/AOKP seem to brick more often than stock-based ICS ROM installs? The Edify install script for CM9/AOKP uses new functions that were introduced in the ICS update-binary. This in turn is why they bundle the ICS-based update-binary. They could still potentially be safe, but in the install script a format("/system") is performed. If that format is run under an ICS kernel it will trigger the EMMC firmware lockup/superbrick bug. Under a GB-kernel, the request to erase "/system" is blocked by the GB-kernel.

What can CM9/AOKP do to make their installs "safe" to install in ICS? All they need to do is replace the format("/system") with delete_recursive("/system"). They could also replace the ICS-based update-binary with a GB-based update-binary, but that would require more rewrites to the install script. Replacing the format() call is simpler/easier.
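As a minimal sketch of that substitution (my own illustration, not the actual CM9/AOKP change), the helper below rewrites only the simple single-argument form quoted in this thread. The function name is hypothetical, and a real install script may call format() with different arguments, in which case nothing is changed and the count comes back as 0.

```python
import re

# Only the form quoted in the thread, format("/system"), is handled.
SYSTEM_FORMAT = re.compile(r'(?<![A-Za-z0-9_])format\s*\(\s*"/system"\s*\)')


def replace_system_format(script_text):
    """Swap format("/system") for delete_recursive("/system").

    Returns (new_text, number_of_replacements).
    """
    return SYSTEM_FORMAT.subn('delete_recursive("/system")', script_text)


if __name__ == "__main__":
    new_text, count = replace_system_format('format("/system");\n')
    print(count, new_text, end="")
```

Since delete_recursive() removes files rather than recreating the filesystem, the script would presumably need /system mounted at that point; that is an assumption worth checking against the real script before relying on it.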

Why are some superbricks blue-light specials and others only make ODIN hang at data.img? This likely has to do with whether you got your brick from

1) the format() in the CM9/AOKP install
2) restoring nandroid backup
3) doing the wipe data/factory reset in Recovery

The first two tend to be blue-light specials as they affect /system and/or kernel. The last one tends to affect /data and/or /cache.

So how do you make sure you are totally safe?
1) make sure you are using a "safe" recovery repacked with the stock ICS kernel. This is a Recovery that was compiled against GB-based libext4_utils.a (ie GB source). This will assure you that wipe data/factory reset and nandroid restores are safe
2) whenever you install a ROM for the first time, verify EITHER
a) the ROM install script is NOT performing any format() calls
b) the ROM install has bundled a GB-based update-binary

If neither 2a NOR 2b is true (ie ICS-based update-binary and the install performs format()) then you DO NOT want to flash that ROM in Recovery while on an ICS kernel. Flash that ROM on a GB-based kernel/recovery.
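The rules above boil down to three yes/no questions, which can be written out as a small truth function. This is just a restatement of the code-tracing analysis in this post, with made-up names, and it carries the same caveat: it has not been confirmed by real-world testing.

```python
def erase_can_reach_emmc(kernel_blocks_erase, built_from_ics_sources, calls_make_ext4fs):
    """True when, per the analysis above, the erase could reach the EMMC.

    All three conditions must line up: the caller runs make_ext4fs(),
    the caller was compiled against ICS sources (so make_ext4fs() tries
    to erase first), and the running kernel does not block the request.
    """
    return calls_make_ext4fs and built_from_ics_sources and not kernel_blocks_erase


def rom_install_looks_safe(kernel_blocks_erase, update_binary_is_ics, script_has_format):
    """Apply the same rule to a ROM install: format() is what calls make_ext4fs()."""
    return not erase_can_reach_emmc(kernel_blocks_erase, update_binary_is_ics, script_has_format)


if __name__ == "__main__":
    # Stock ICS kernel, ICS-based update-binary, script performs format().
    print(rom_install_looks_safe(False, True, True))
    # Same ROM flashed from a GB kernel, which blocks the erase request.
    print(rom_install_looks_safe(True, True, True))
```

The same first function covers Recovery: wipe data/factory reset and nandroid restore both call make_ext4fs(), so only the Recovery's source base and the kernel matter there.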

Hope that clears things up, and once again, remember, this analysis is only based on tracing code. I may have made a mistake in the analysis or our understanding of the problem could be wrong. We will not be sure if all these statements hold UNTIL WE DO REAL-WORLD TESTING.

_yupa_ · Jun 2, 2012

A simple question: how is deleting a large file suspected to trigger the bug? And how large does the file have to be?

sfhub · Jun 2, 2012

_yupa_ said:
A simple question: how is deleting a large file suspected to trigger the bug? And how large does the file have to be?

I don't think I ever stated that and I haven't found code which can support that statement with the bug as we understand it.

There may be a separate bug that supports the statement you are saying, but it isn't the bug we are dealing with here, though I really think we would have heard more complaints if such a bug existed.

Perhaps you are confusing this with the 32KB zero bug. I don't know if we are afflicted with that bug. fwrev 0x25 definitely is and possibly we are too (which is basically what Mr. Sumrall assumed), but that one would likely end up corrupting data in a recoverable manner (as in you might lose data, but you can reformat/reflash and things should be ok), rather than a superbrick.

RustedRoot · Jun 2, 2012

Once again, SF, you do a remarkable job of clarifying a nettlesome and thorny issue. Thanks my friend.

Sent from my SPH-D710 using XDA

musashiro · Jun 3, 2012

if i understand correctly, flashing CM9 or AOKP based roms over ICS kernel can trigger the bug..

while flashing directly from GB is safe...?

i tried CM9 on GB rooted and came back using nandroid backup just fine...

sfhub · Jun 3, 2012

musashiro said:
if i understand correctly, flashing CM9 or AOKP based roms over ICS kernel can trigger the bug..

It is not flashing them "over" ICS kernel per-se that triggers the bug.

It is running an ICS-based update-binary while in an ICS kernel and having that update-binary process a "format()" command in the Edify install script.

I think that is what you meant, but "flashing over" could have different meanings and I wanted to clarify that point.

Once the CM9 and AOKP get rid of the format() command in their Edify install scripts, then the above scenario should be safe.

musashiro said:
while flashing directly from GB is safe...?

Yes, that is correct. Even with the ICS-based update-binary and the format() currently in the Edify install script, if update-binary is running under a GB-based kernel, the problematic "erase" code initiated by make_ext4fs() will be blocked by the GB kernel, so no issue arises.

musashiro said:
i tried CM9 on GB rooted and came back using nandroid backup just fine...

Thanks for trying that. The behavior you describe is as expected. When you flashed CM9 on GB rooted, the GB kernel blocked the CM9 update-binary from making the problematic call when the install script asked to format("/system") so that is why there are no issues.

Then when you "came back" using a nandroid backup, that is being handled by Recovery. Assuming the Recovery was compiled against GB source, which I believe the Recoveries you get from here have been, then Recovery will not make the problematic "erase" call when it restores, even though the restore process is calling make_ext4fs(). The GB source for make_ext4fs() just doesn't try to erase at all.

BTW may I ask which Kernel/Recovery did you use to do your Nandroid restore?

The only "Recovery" I believe you need to worry about is CWM Touch fake-flashed onto an ICS kernel, as I believe (but have not confirmed because I can't find source code) that Recovery has been compiled with an ICS version of make_ext4fs().

musashiro · Jun 3, 2012

i tried CM9 long ago, and i never knew it would contribute anything..

before trying CM9, i messaged XplodWild if nandroid restore is safe and he boldly said "Yes". i trusted him and i tried CM9 for a day..

the recovery that i used is the one (if im not mistaken) that is included in the CM9 safe kernel. flashing their CM9.zip includes the kernel as stated by Entropy on cm9 thread

sfhub · Jun 3, 2012

I didn't even realize you were talking about Note until you mentioned some unfamiliar names and I looked at your sig.

My answer was based on E4GT.

To answer based on Note and your newly provided info, if it was a "safe" ICS kernel with a Recovery compiled against ICS sources, then your Recovery would have called the problematic make_ext4fs() when you did the Nandroid restore, but the erase command would have been blocked by your "safe" ICS kernel, thus your nandroid restore worked as expected.

Same result, but the in-between happenings were different.

Entropy512 · Jun 3, 2012

sfhub said:
I don't think I ever stated that and I haven't found code which can support that statement with the bug as we understand it.

There may be a separate bug that supports the statement you are saying, but it isn't the bug we are dealing with here, though I really think we would have heard more complaints if such a bug existed.

Perhaps you are confusing this with the 32KB zero bug. I don't know if we are afflicted with that bug. fwrev 0x25 definitely is and possibly we are too (which is basically what Mr. Sumrall assumed), but that one would likely end up corrupting data in a recoverable manner (as in you might lose data, but you can reformat/reflash and things should be ok), rather than a superbrick.

That's bringing back something I suspected long long ago back when we knew far less about this bug, but I'm still not sure... Obviously a userspace-triggered erase operation will fire ERASE commands at the card - but what about TRIM (which is a variant of ERASE) - is it blocked by another mechanism? (I suspect so but I'm not entirely sure.)

As to musashiro's questions - the CM9 kernel for N7000 has MMC_CAP_ERASE removed, which prevents attempts to erase from ever reaching the eMMC chip.

temporarium · Jun 3, 2012

What about using a different file system?

sfhub said:
Background
The nature of the problem is a call to the function make_ext4fs(). This function isn't provided by the kernel, rather it is provided as a library (libext4_utils.a) that is used when compiling Recovery and the update installer. It does end up eventually calling kernel mmc driver routines, which then trigger the EMMC firmware lockup/superbrick bug.

The make_ext4fs() function changed between GB and ICS. In GB the function didn't try to erase the partition before creating the EXT4 fs. In ICS it tries to erase it first. That erase, issued via the kernel MMC driver, is triggering an EMMC firmware bug that was always there. GB is also "doubly" safe: not only do Recovery and update-binary never attempt the erase, but even if they did, the request to erase would be blocked in the GB kernel and never run.

Apologies if this was already tried/suggested, but since we're dealing with a Gnu/Linux kernel, and if setting up the ext4 file system triggers the bug in the eMMC firmware, what about using a different one? (See en.wikipedia.org/wiki/Flash_file_system)

Entropy512 · Jun 3, 2012

temporarium said:
Apologies if this was already tried/suggested, but since we're dealing with a Gnu/Linux kernel, and if setting up the ext4 file system triggers the bug in the eMMC firmware, what about using a different one? (See en.wikipedia.org/wiki/Flash_file_system)

It has nothing directly to do with ext4 itself - it's just that recovery was modified such that it would issue ERASE commands prior to formatting the partition for privacy purposes.

sfhub · Jun 3, 2012

Entropy512 said:
That's bringing back something I suspected long long ago back when we knew far less about this bug, but I'm still not sure... Obviously a userspace-triggered erase operation will fire ERASE commands at the card - but what about TRIM (which is a variant of ERASE) - is it blocked by another mechanism? (I suspect so but I'm not entirely sure.)

The way I answered that question for E4GT is to observe that ERASE, TRIM, and DISCARD essentially all do the same mmc operation but pass different arguments to the mmc command. I believe in the ICS source tree they all get filtered down to mmc_erase(), so if you do something to short-circuit mmc_erase() (like not setting MMC_CAP_ERASE in the host capabilities), TRIM and DISCARD will get blocked also. They won't be blocked explicitly, but since mmc_erase() checks the host's MMC_CAP_ERASE capability first, it will return EOPNOTSUPP, presumably because a host that doesn't support ERASE wouldn't support TRIM and DISCARD either, since ERASE is the older base functionality.
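To illustrate the shape of that check, here is a toy model in Python rather than the kernel's C. It is not the kernel source: the capability bit value is a placeholder, the variant names are plain labels, and the "sent" list simply stands in for commands that would reach the card.

```python
import errno

# Placeholder bit for illustration; the real kernel constant has its own value.
MMC_CAP_ERASE = 1 << 0


def mmc_erase(host_caps, variant, sent):
    """Toy stand-in for the single entry point shared by all erase variants.

    variant is a label such as "ERASE", "TRIM" or "DISCARD". If the host
    does not advertise MMC_CAP_ERASE, nothing is sent, whatever the variant.
    """
    if not host_caps & MMC_CAP_ERASE:
        return -errno.EOPNOTSUPP
    sent.append(variant)
    return 0


if __name__ == "__main__":
    sent = []
    # Host without the capability: every variant is refused.
    for variant in ("ERASE", "TRIM", "DISCARD"):
        print(variant, mmc_erase(0, variant, sent))
    print(sent)
```

Because the capability test sits ahead of any variant-specific handling, clearing one flag blocks all of them, which is the point being made about removing MMC_CAP_ERASE.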

In the case of the E4GT GB source tree the only references I saw were having it called from userspace through the ioctl(). Since the ioctl() is disabled in our GB build, there is no way to eventually reach mmc_erase(), thus TRIM and DISCARD should never be reached.

Since our build has CONFIG_MMC_DISCARD disabled, the discard code isn't even compiled. I believe the TRIM and ERASE code is still there, but since the ioctl to switch to kernelspace to run mmc_erase() isn't present it won't ever get run.

I still am not sure if ERASE, TRIM, and DISCARD versions of erase are all borked or some smaller subset are the issues.

So to answer your question, if you have blocked mmc_erase() as you described earlier, by not enabling MMC_CAP_ERASE in the host capabilities, based on the generic ICS tree, that should block TRIM and DISCARD as well. I haven't looked in the Samsung specific source tree though.

sfhub · Jun 3, 2012

temporarium said:
Apologies if this was already tried/suggested, but since we're dealing with a Gnu/Linux kernel, and if setting up the ext4 file system triggers the bug in the eMMC firmware, what about using a different one? (See en.wikipedia.org/wiki/Flash_file_system)

I'm pretty sure the wipe code that was added between GB and ICS only gets called when creating ext4 filesystems. However I didn't specifically check if the code paths for the other filesystems would reach some variation of the wipe code.

It is probable the other filesystems aren't doing any wipes so you could theoretically change the FS and the bug wouldn't be triggered. You'd have to change all the config on your ROMs/kernels/recovery yourself to use a different FS though and it would be more intrusive than just making sure your recovery, update-binary, updater-script, and in some cases recompiled kernel are safe.

To be clear, it isn't a problem with the ext4 filesystem itself. It is that the userspace "utility" function make_ext4fs() (this is not part of the kernel filesystem code) eventually tries to erase the partition and this leads to a bad reaction with EMMC firmware that has the lockup/superbrick bug. make_ext4fs() could be renamed joes_erase_ext4_partition() for all we care. It is just some arbitrary utility function being used by Recovery and update-binary to create an ext4 filesystem. Don't let the "ext4fs" in the name lead you into thinking it represents some problem with generic ext4 code in the kernel.

sfhub · Jun 3, 2012

Entropy, do you recall how it came about that deleting large files was a possible suspect?

Was it just a theory or was there specific code people were looking at that led to that belief?

For completeness purposes I would like to research if this possibility exists.

blackhorses777 · Jun 3, 2012

sfhub said:
Entropy, do you recall how it came about that deleting large files was a possible suspect?

Was it just a theory or was there specific code people were looking at that led to that belief?

For completeness purposes I would like to research if this possibility exists.

im curious if its a possibility also. i regularly delete large files off my sd card and ive been using an ICS kernel for a couple of months now.

sfhub · Jun 3, 2012

Without looking at source code, my guess is you would hear of more bricks when people cleared out files if this was a likely scenario.

I don't like to completely rule something out until/unless I look over the source code to verify though.

Clearly the wipe data and format from ICS-based update-binary are the most likely scenarios where people see superbricks as the boards are littered with anecdotes for those two.
