Discussion thread for /data EMMC lockup/corruption bug

OP sfhub

9th May 2012, 01:08 PM   |  #1  
OP Recognized Contributor
This post will evolve over time as more info is found:

Latest Updates
6/24 Custom PIT repartition workaround posted by Jaymoon.
You lose 8GB (some of which might be recoverable with extra work), and any restore using the original PIT will lock up your phone again (a scenario that could happen if you brought your phone back to Sprint for some unrelated problem). If you have the opportunity to get your phone replaced at little or no cost, IMO that should be your primary option.
http://forum.xda-developers.com/show...9#post27852689

E4GT specific PIT file here (theoretically instead of losing 8GB, you'll only lose 2GB):
http://forum.xda-developers.com/show...&postcount=654

6/8 Update for other platforms waiting for fix
Codeworkx's contact with Samsung got the following response [discussion]
Quote:

Update 14:56 CEST:
Patches will be out in form of new official ROMs and also sourcecode releases after testing, which might take some time.

6/7 Update
Test plan posted - see bottom of post for results so far (esoteric68, krazy_smokezalot report success)
BIG THANKS to Esoteric68 (and robertm2011 before her) who took the plunge to benefit everyone else. She has completed the test plan and more. 6 flashes of CM9, 3 flashes of AOKP, 3 wipe data/factory resets, and 3 nandroid restores, 1 stock FF02 flash, all successful. We are ready to have more testers try out the test ROM installs. We are getting more confident the code analysis was correct.

6/2 Update
Less technical summary and preparation for new round of testing

5/31 Lots of discussion on the code path detailing how the problem occurs and where to put the workaround, select posts below
Call trace for CWM Recovery - wipe data/factory reset
Call trace for CWM Recovery - restore
Section of update-binary afflicted by same issue as wipe data/factory reset
Recap of where workarounds can be placed
MD5s of various update-binary executables
Pros/Cons of placing workaround in kernel vs libext4_utils.a
Are ICS nandroid backup/restores safe?
Are ICS recoveries safe?
Why do CM9/AOKP installs often brick in ICS but not in GB?

5/24 Update pretty much ties up all the loose ends - Thanks Mr. Sumrall, Garwynn, Entropy, and everyone else who pitched in!
http://forum.xda-developers.com/show...3#post26521643

Potentially very GOOD NEWS
It appears Sprint/Samsung tested the EMMC brick issue, confirmed the problem, and tested a fix that appears to resolve the problem:
http://forum.xda-developers.com/show...5#post26465085
Quote:
Originally Posted by thirdcoastraised

To clarify this...in testing done over the weekend, there was a small "subtest" group which consisted of 20 devices. This group was put together STRICTLY for the purpose of testing the emmc bug and fix. The devices were all programmed with the data known to have caused bricks when wiping. Of those 20, all but 6 also had the code patch to resolve that issue, so there was a possibility of 6 hard bricks. Only 4 actually bricked; therefore, on the build currently being tested, the "emmc break issue" has been deemed "resolved".



We now have an update on why this bug is happening and which PRV/fwrevs are affected. PRV/fwrev 0x19 is susceptible to the EMMC /data corruption issue (which should now be referred to as the EMMC lockup issue). PRV/fwrev 0x25 has the fix for the lockup issue but has a separate data corruption issue in which 32KB of zeros gets written; that one is being patched in the kernel (our kernels don't have that patch). All these problems are in the EMMC firmware. The firmware can potentially be updated, but no update is publicly available. The EMMC lockup issue is triggered when erasing the EMMC. The only piece we have not been able to explain is why GB-based kernels seem immune to the EMMC lockup problem whereas ICS-based kernels seem more susceptible to it. Presumably both are doing ERASE commands, but possibly in slightly different ways. See these posts for more details [#1 / #2]

To get your PRV/fwrev, you can use this (if you have busybox installed):
Quote:

shell@android:/ $ su
shell@android:/ # cd /sys/class/block/mmcblk0/device

shell@android:/sys/class/block/mmcblk0/device # cat cid | cut -b 19,20
19

If you don't have busybox installed, just visually parse the line: find the serial # (0xd3f24fe6 - example only - yours will be different) inside the cid, and look at the 2 hex digits immediately before it (19 in the example below).
Quote:

shell@android:/ $ su
shell@android:/ # cd /sys/class/block/mmcblk0/device

shell@android:/sys/class/block/mmcblk0/device # cat serial cid
0xd3f24fe6
1501004d414734464119d3f24fe68e8b
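
The manual parsing above can also be sketched in code. This is only a minimal Python sketch and not something posted in the thread: it assumes the 32-hex-digit cid string follows the byte layout implied by the example output (manufacturer ID, OEM ID, 6-character product name, PRV, serial, date, CRC) and that the date byte holds the month in its high nibble and the year as an offset from 1997 in its low nibble. The names decode_cid and describe_prv are made up for illustration, and the PRV notes only restate the 0x19 / 0x25 findings described earlier in this post.

```python
# Notes per PRV value, restating what this post reports (nothing more).
PRV_NOTES = {
    0x19: "susceptible to the EMMC lockup issue",
    0x25: "lockup fixed, but has the separate 32KB-of-zeros issue",
}


def decode_cid(cid_hex):
    """Split a 32-hex-digit eMMC cid string into its fields.

    Assumed byte layout (it matches the example cid in this post):
    byte 0 = manufacturer ID, bytes 1-2 = OEM ID, bytes 3-8 = product name,
    byte 9 = PRV, bytes 10-13 = serial, byte 14 = date, byte 15 = CRC.
    """
    raw = bytes.fromhex(cid_hex.strip())
    if len(raw) != 16:
        raise ValueError("expected 32 hex digits, got %r" % cid_hex)
    return {
        "manfid": raw[0],
        "oemid": int.from_bytes(raw[1:3], "big"),
        "name": raw[3:9].decode("ascii"),
        "prv": raw[9],
        "serial": int.from_bytes(raw[10:14], "big"),
        "month": raw[14] >> 4,            # assumed: high nibble is the month
        "year": 1997 + (raw[14] & 0x0F),  # assumed: low nibble is years since 1997
    }


def describe_prv(prv):
    """Return this thread's note for a PRV value, if it has one."""
    return PRV_NOTES.get(prv, "no information in this thread")


if __name__ == "__main__":
    info = decode_cid("1501004d414734464119d3f24fe68e8b")
    print(hex(info["prv"]), describe_prv(info["prv"]))
```

For the example cid this picks out the same two digits as the cut command, so it is only a convenience if you are comparing many cids posted by other people.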

It appears after looking at the code more closely and examining the results of the card info dumps, we do not have this fix in our kernel. It isn't clear whether the fix would resolve our /data EMMC brick issues, but the point is moot right now because we don't have the fix.

Possible BRICK here. Please do NOT do any more testing until further notice. Please do NOT use Wipe Data/Factory Reset. It is the main difference between first and 2nd round of testing and is the current suspect

FE10 repacks added to Resource section

Esoteric68, azyouthinkeyeiz, and robertm2011 are testing flashing different ROMs with FE07/FE10 repacked with unlocked recovery. We all owe them our thanks for risking their phones to help the community (taking one for the team). No bricks so far.

Separately we are still discussing whether the fix Samsung checked in will get applied to our phone. No firm conclusions yet. Even if it doesn't apply, the hope is the data we get from testing will help us produce more flexible "safe" flashing practices.

Please do NOT test CWM Touch for now. We want to isolate just the FE07 kernel and unlocked stock recovery before introducing new variables.


Executive Summary
Garwynn has found a recent checkin from Samsung in the kernel code handling EMMC memory that fixes a data corruption problem. It is possible this might fix the /data EMMC corruption we have been seeing, but we aren't sure if it is fixing the same problem. The first release to include that checked in code is FE07. There has been some communication with the developers in charge of that area to gather further info.

This thread's purpose is to foster discussion on the issue and to determine if the potential fix actually does fix our issue. Even if the fix doesn't address the issue, it is hoped in the process we are able to gather more info into specific "safe" and "unsafe" scenarios.

Please do NOT jump ahead and think it is fixed. It is TOO EARLY to make that claim.


Background
As many of you are aware, since ICS came out there has been a nagging issue where, in some situations, flashing ROMs with an ICS-based kernel and custom recovery has left the phone with EMMC corruption. This EMMC corruption is so far non-recoverable, even with JTAG bit blasting, which should bypass all but hardware issues.

This problem is NOT limited to the Epic 4G Touch. Other GS2 models as well as Galaxy Note are experiencing the same thing as can be seen by this Public Service Announcement in the Galaxy Note section.

The problem first cropped up when people used ROM Manager to temporarily "fake" flash CWM Touch onto an ICS-based kernel to do their flashing needs. In particular, wipe data/factory reset seemed to often trigger the /data EMMC corruption. However, we later found it wasn't limited to just CWM Touch and temporary flashing, as CWM repacks with the ICS-based kernel also exhibited that behavior, albeit not as often.

Even more frustrating is that this bug is not always deterministic, in that you could do some operation 3 times and have it work fine, then on the 4th, trigger the /data EMMC corruption.

Complicating the testing/debugging is the issue that once the problem is triggered, your phone is basically not recoverable. You can try and ODIN a stock ROM on top which will basically work for all the components except the /data partition. Once it reaches the /data partition, ODIN will hang. Similarly if you try and wipe data/factory reset, it will hang or timeout after a while. Attempts to repartition and reformat using ODIN have not changed this behavior. Attempts to edit the partition info manually have not been successful. JTAG bit blasting has not been successful.

You can read about the past experiences in the Stuck at "Data.img" thru odin thread. By the time you get to ODIN, the damage to /data EMMC is already done. ODIN is NOT causing the damage. ODIN is hanging on data.img because the hardware won't let it write successfully to that area of EMMC.

This has led to many custom ROMs giving special procedures to go back to a GB-based kernel repacked with CWM recovery to do all your flashing (EL26+CWM). It is also the motivation for the How Not To Brick Your E4GT thread.


Details
The code checkin that has piqued our interest concerns data corruption caused by a problem in the wear-leveling firmware code of the emmc. This is low-level code that runs on a processor in the emmc module. It basically tries to spread out the data writes evenly so that no one section of emmc memory gets worn out prematurely. This code apparently can corrupt data by writing 32KB of incorrect data under some situations.

https://bitbucket.org/franciscofranc...t/cea631bdac53

The code appears to restrict the firmware fix to only certain "affected" emmc modules. Also it is not able to persistently/permanently patch the firmware so this code must run at each startup. The following modules were identified in the code:

Name: VYL00M
HwRev: 0x0
FwRev: 0x25

Name: KYL00M
HwRev: 0x0
FwRev: 0x25

Name: MAG4FA
HwRev: 0x0
FwRev: 0x25
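
As a rough illustration of the matching described above, here is a minimal Python sketch. The kernel code itself is C and this is not a translation of it; the table just restates the three name/hwrev/fwrev entries listed above, and wl_patch_applies is a hypothetical helper name.

```python
# (name, hwrev, fwrev) combinations identified in the checkin, per this post.
PATCHED_MODULES = {
    ("VYL00M", 0x0, 0x25),
    ("KYL00M", 0x0, 0x25),
    ("MAG4FA", 0x0, 0x25),
}


def wl_patch_applies(name, hwrev, fwrev):
    """Return True if the module matches one of the listed entries."""
    return (name, hwrev, fwrev) in PATCHED_MODULES


if __name__ == "__main__":
    # Same module name, two different reported fwrev values.
    for fwrev in (0x0, 0x25):
        print("MAG4FA", hex(fwrev), wl_patch_applies("MAG4FA", 0x0, fwrev))
```

The check is an exact match on all three values, so a module that reports a different fwrev falls outside the list even if its name is on it.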

Unfortunately during ad-hoc polling we have found a case of an EMMC /data bricked phone with fwrev 0x0, so either we are not understanding what Samsung's fix is doing or they may not have addressed the full scope of the problem. Do NOT assume if your fwrev is 0x0 you are safe.

At this point, this does NOT mean the fix is not applicable. We might be looking at the wrong data. The kernel might not be exporting the data to us. The fix might need to be expanded to more modules. The fix could be for something else entirely but we might be able to avoid the bug anyway using stock recovery.

To determine what version you have (keep in mind we are at the preliminary stage, so this info might not be the right info to collect or could be meaningless for the /data EMMC corruption issue):
Quote:

shell@android:/ $ su
shell@android:/ # cd /sys/class/block/mmcblk0/device

shell@android:/sys/class/block/mmcblk0/device # cat name hwrev fwrev manfid oemid date type serial cid
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
0xd3f24fe6
1501004d414734464119d3f24fe68e8b
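
When comparing dumps posted by different people, it helps to label each line. A minimal Python sketch, assuming the output lines appear in the same order as the attributes passed to cat above; label_dump is a hypothetical helper, not an existing tool.

```python
# Attribute order taken from the cat command shown above.
ATTRS = ["name", "hwrev", "fwrev", "manfid", "oemid",
         "date", "type", "serial", "cid"]


def label_dump(text, attrs=ATTRS):
    """Pair each non-empty output line with the attribute it came from.

    Dumps that stop early (for example without serial and cid) are still
    labeled correctly, because only trailing attributes are missing.
    A dump with a line missing from the middle would be mislabeled.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) > len(attrs):
        raise ValueError("more output lines than attributes")
    return dict(zip(attrs, lines))


if __name__ == "__main__":
    sample = """MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
0xd3f24fe6
1501004d414734464119d3f24fe68e8b"""
    for key, value in label_dump(sample).items():
        print(key, "=", value)
```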

The comments for the code checkin give the following info:
Quote:

/*
* There is a bug in some Samsung emmc chips where the wear leveling
* code can insert 32 Kbytes of zeros into the storage. We can patch
* the firmware in such chips each time they are powered on to prevent
* the bug from occurring. Only apply this patch to a particular
* revision of the firmware of the specified chips. Date doesn't
* matter, so include all possible dates in min and max fields.
*/

The critical piece of code appears to be the following:
Code:
	/* set value 0x000000FF : It's hidden data
	 * When in vendor command mode, the erase command is used to
	 * patch the firmware in the internal sram.
	 */
	err = mmc_movi_erase_cmd(card, 0x0004DD9C, 0x000000FF);
	if (err) {
		pr_err("Fail to Set WL value1\n");
		goto err_set_wl;
	}
	/* set value 0xD20228FF : It's hidden data */
	err = mmc_movi_erase_cmd(card, 0x000379A4, 0xD20228FF);
	if (err) {
		pr_err("Fail to Set WL value2\n");
		goto err_set_wl;
	}

Action items
At this point we would like to

1) gather more info on which emmc modules folks have and see if we can detect any patterns. Please post your EMMC info, and optionally mention whether you are able to do testing (presumably because you have a way to replace your phone if it is damaged)

2) solicit one volunteer to try different flashing scenarios using the unlocked stock recovery and FE07 kernel repack (bigpeng indicated earlier he would be willing to do this for the community, but that was before the fwrev info, so he might have had a false sense of security, so no pressure on him if he changed his mind)

If we find that the volunteer does not see any corruption despite trying to do so, then we can expand testing to a few more people and also work on getting CWM repacks.

If the volunteer hits the bug, then we will know the issue is still there even with stock recovery and FE07 kernel.

Keep in mind, at some point someone will need to take one for the team or we will be forever in fear of bricking our phones using ICS-based kernels.


Resources

1) FE07-based repacks
Unlocked Recovery Only [update.zip / tar]
Plus (unlocked recovery, init.d, adb-root) [update.zip / tar]

2) FE10-based repacks
Unlocked Recovery Only [update.zip / tar]
Plus (unlocked recovery, init.d, adb-root) [update.zip / tar]

3) JEDEC eMMC documentation

Related threads
Galaxy Note CID investigation thread
Last edited by sfhub; 29th June 2012 at 11:32 PM.
9th May 2012, 01:15 PM   |  #2  
Account currently disabled
Good job sfhub. I am learning new stuff everyday

Sent from my SPH-D710 using xda premium
9th May 2012, 02:07 PM   |  #3  
Azrael.arach
Senior Member
This is mine. I can try to help but not till weekend when my other phone gets here.

MAG4FA
0x0
0x0
0x000015
0x0100
11/2011
MMC

Sent from my SPH-D710 using Tapatalk 2
9th May 2012, 02:44 PM   |  #4  
nivron
Senior Member
Ocala/Orlando, Florida
Sorry I don't have anything majorly different than the normal, but I have this on my phone:

MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
9th May 2012, 02:59 PM   |  #5  
Wabem
Member
Santa Monica
Mine is also the same,

MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
9th May 2012, 03:20 PM   |  #6  
mmark27
Senior Member
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC

I am not available to test. Sorry.
9th May 2012, 03:30 PM   |  #7  
Senior Member
Albany
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC

Also unable to risk the brick. Good luck guys.
9th May 2012, 04:03 PM   |  #8  
Senior Member
MAG4FA
0x0
0x0
0x000015
0x0100
10/2011
MMC

I'll take one for the team if needed. I've been eyeballing the 720p Evo.
Last edited by pbassjunk; 9th May 2012 at 04:08 PM.
9th May 2012, 04:04 PM   |  #9  
krazyflipj
Senior Member
Full readings from my /data bricked device. Let me know if you want me to check anything else out:


MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
9th May 2012, 04:07 PM   |  #10  
azyouthinkeyeiz
Senior Member
I can risk a brick to save future ones. I can help test.

/*
MAG4FA
0x0
0x000015
0x0100
12/2011
MMC*/
Sent from my SPH-D710 using Tapatalk 2
Last edited by azyouthinkeyeiz; 9th May 2012 at 06:20 PM.
