
Discussion thread for /data EMMC lockup/corruption bug

By sfhub, Senior Member on 9th May 2012, 01:08 PM
9th May 2012, 08:54 PM |#21  
garwynn
Retired Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Analysis/Opinion
OK, now that you've seen the comments from the Android team, have links to the code and we have confirmed that our devices are affected, let's try and walk this through:

Linux File System Terms:

inode:
An inode holds a file's metadata, the data about the data. Out of the many articles out there, I thought this might help without going too much into the technical side: http://www.linux-mag.com/id/8658/

Block Bitmap:
In order to account for the usage of the blocks on the filesystem, the ext2 filesystem keeps a block bitmap for each block group. It tracks which blocks have been used and which are free. Each bit in the Block Bitmap denotes an integral number of fragments, so if a bit is allocated to a file and marked as used, an entire set of fragments is allocated to it.

The Block Bitmap is a clever way to keep track of free and used blocks. To look for a block, one first checks the group to which the file belongs. The Block Bitmap of that group is then selected and searched for the required block.

(Source: http://freeos.com/node/41)
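
To make the bitmap idea concrete, here is a minimal sketch in C. It is not ext2's real code: the helper names are hypothetical, and it assumes the simplest possible layout of one bit per block, where a set bit means "in use".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not ext2 source. Assumed layout: one bit per
 * block, bit set = block in use, bit clear = block free. */
static bool block_in_use(const uint8_t *bitmap, size_t block)
{
    return (bitmap[block / 8] >> (block % 8)) & 1u;
}

static void mark_block_used(uint8_t *bitmap, size_t block)
{
    bitmap[block / 8] |= (uint8_t)(1u << (block % 8));
}

/* Returns the first free block below nblocks, or nblocks if none is free. */
static size_t find_free_block(const uint8_t *bitmap, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        if (!block_in_use(bitmap, b))
            return b;
    }
    return nblocks;
}
```

An allocator built this way trusts the bitmap completely: whatever the bits say is what it believes about the disk.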

Q: Can inserting 32 KB of zeros corrupt a file system?
Certainly - as mentioned by Mr. Sumrall, it just depends on where the insertion was made.
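
As a rough illustration of "it depends on where", suppose the zeros happen to land on a block bitmap. That placement is purely an assumption for the sake of the example, and the function names below are hypothetical. The sketch counts blocks marked in use before and after a run of zeros overwrites part of the bitmap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count blocks marked "in use" (bit set) among the first nblocks bits. */
static size_t count_used_blocks(const uint8_t *bitmap, size_t nblocks)
{
    size_t used = 0;
    for (size_t b = 0; b < nblocks; b++)
        used += (bitmap[b / 8] >> (b % 8)) & 1u;
    return used;
}

/* Overwrite len bytes at offset with zeros, clipped to the region size.
 * This stands in for the stray write; it is a simulation in memory only. */
static void insert_zeros(uint8_t *region, size_t region_len,
                         size_t offset, size_t len)
{
    if (offset >= region_len)
        return;
    if (len > region_len - offset)
        len = region_len - offset;
    memset(region + offset, 0, len);
}
```

Every bit that gets cleared makes a block that still holds file data look free, so the next allocation could hand it to a different file. The same zeros landing in the middle of a large media file would do far less harm.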

Q: Can this corruption cause an I/O error?
Again, yes - under the conditions as described by Mr. Sumrall

Q: Can this particular bug be repaired?
Mr. Sumrall says no, and that sounds awfully similar to our bricks.

Q: Why couldn't JTAG, with its bit blast, at least reset the values and allow it to go ahead?
The answer, as far as I can tell, is simply that the support is not available at the driver and/or eMMC controller level to handle this type of operation. Either the embedded controller chip simply cannot do so, or the driver has not yet been designed to use those instructions. As a result, all it can do is throw an error in frustration. It might be possible down the road, but not now. So instead of waiting for this to come, Samsung implemented a workaround to the wear-leveling (WL) logic to keep it from corrupting the filesystem.

I'm still stumped as to why we saw this particularly on the /data portion of the filesystem. It's the partition that sees file changes most often, so perhaps the wear leveling logic kicked in on this partition first. It's also interesting to note that bypassing that block restores the file system to something stable, as tested by drnull. But if this truly is the bug, skipping the bad blocks is not solving the problem; it's only extending the life at the cost of possibly further corrupting the file system. I'm optimistic that it may be possible in the future to save a device even after it's bricked. Filesystem corruption is not physical damage, so I consider it in the realm of possibility. Whether it is practical or cost-effective is up to Samsung. They may even have a solution already available, just not for end users.

Initial Summary:
Based on available information, this has significant credibility as the bug in question, and the patch is a rather clever attempt to work around the issue short of eMMC replacement. It should be tested for verification by a willing member of the community, so long as they can afford to brick and replace the device if necessary. If verified, the solution may not save an already bricked device at the moment, but it may avoid future bricks of this nature. It would also mean a high probability that Android 4.0.4_r1.1 (the first standard build with the fix) should be the minimum requirement for any device with this eMMC, if it can be supported.

*Disclaimer*
Comments and summary on this post, unless otherwise specified, should not be considered the definitive conclusion for this topic. Instead it is a summary of my observations - as such it should be reviewed and critiqued by others for possible improvement before the community comes to a conclusion.
9th May 2012, 11:25 PM |#22  
Entropy512
Senior Recognized Developer
Quote:
Originally Posted by garwynn

OK, now on to the bug that this fixes. This post will only contain the discussions between myself and Mr. Sumrall of the Android team.

Initial inquiry to Mr. Sumrall:


Initial Response:


We also knew this from the code:

Code:
	 * There is a bug in some Samsung emmc chips where the wear leveling
	 * code can insert 32 Kbytes of zeros into the storage.  We can patch
	 * the firmware in such chips each time they are powered on to prevent
	 * the bug from occurring.
Note: Snipped last part of comment as it is already covered.

OK, so it's potentially putting zeros in the storage, but it doesn't give us any clues as to where in storage that might be or how this could corrupt the filesystem. So I sent some follow-up questions and got the following responses. (Regular is question to Mr. Sumrall, bold is response.)



Thanks again to Mr. Sumrall for this information. More soon.

I don't think the question was appropriately phrased to Mr. Sumrall. Filesystem corruption like he describes is one thing - it's easy enough to fix, just reformat.

However, the symptoms we have been seeing are:
1) Parts of the eMMC are becoming inaccessible, to the point where they cannot be written to in any way, shape, or form. This goes way beyond filesystem I/O errors - a reformat would fix those.
2) This includes attempting to rewrite corrupted bootloaders using JTAG recovery methods.

Is there any chance you could send me the contact info for this person? I've been fairly "down in the dirt" with the hardbricks incurred on other devices.
The Following User Says Thank You to Entropy512 For This Useful Post: [ View ]
10th May 2012, 12:13 AM |#23  
azyouthinkeyeiz
Senior Member
Quote:
Originally Posted by Entropy512

1) Parts of the eMMC are becoming inaccessible, to the point where they cannot be written to in any way, shape, or form. This goes way beyond filesystem I/O errors - a reformat would fix those.

Q: Can inserting 32 KB of zeros corrupt a file system?
Certainly - as mentioned by Mr. Sumrall, it just depends on where the insertion was made.

/*In this case in the low-level bootloader firmware*/

Quote:
Originally Posted by Entropy512

2) This includes attempting to rewrite corrupted bootloaders using JTAG recovery methods.

Q: Why couldn't JTAG, with its bit blast, at least reset the values and allow it to go ahead?
The answer, as far as I can tell, is simply that the support is not available at the driver and/or eMMC controller level to handle this type of operation. Either the embedded controller chip simply cannot do so, or the driver has not yet been designed to use those instructions. As a result, all it can do is throw an error in frustration. It might be possible down the road, but not now. So instead of waiting for this to come, Samsung implemented a workaround to the wear-leveling (WL) logic to keep it from corrupting the filesystem.

Quote:
Originally Posted by Entropy512

I don't think the question was appropriately phrased to Mr. Sumrall. Filesystem corruption like he describes is one thing - it's easy enough to fix, just reformat.

That's why he clarified with
Quote:
Originally Posted by garwynn

The explanation in the bugfix mentions 32 KB of zeros being added to storage. But I can't see this causing an I/O error unless it was doing this in the storage containing the instruction set. Was this somehow corrupting the I/O instruction set contained within the firmware? I have spent several weeks defending the opinion that this was not a hardware failure but software-based.

And the answer was: it's corrupting the low-level firmware that handles EMMC, which makes the partition inaccessible.

Nothing has been written at that level yet to correct that with JTAG.

Both of your questions are answered in the two quotes.

10th May 2012, 01:05 AM |#24  
sfhub (OP)
Senior Member
Quote:
Originally Posted by azyouthinkeyeiz

That's why he clarified with
And the answer was: it's corrupting the low-level firmware that handles EMMC, which makes the partition inaccessible.

Nothing has been written at that level yet to correct that with JTAG.

Both of your questions are answered in the two quotes.

This is the part I don't understand about this answer. If they cannot permanently patch the EMMC firmware (presumably because it is not writable), how is the 32 KB of zeros corrupting that same unwritable firmware? If the firmware is kept on EMMC it should be writable, though I find it difficult to believe the EMMC firmware (essentially the "OS" handling the blocks on the EMMC) is being stored on the regular EMMC itself, subject to wear-leveling algorithms.

If that stuff about the firmware was just miscommunication and we are back to talking about filesystem corruption, I can write 32 KB of zeros to any of the filesystems just using "dd" and "/dev/zero". That kind of damage can be repaired by reformatting the partition. There is some "special" corruption or damage going on here, IMO.

It may be that the bug they are fixing is causing all this. I just cannot picture the mechanism yet with the descriptions being given.
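
For anyone who wants to reproduce the "plain" kind of corruption without dd, here is a minimal C sketch that overwrites a byte range of a file with zeros. The function name is hypothetical, and it is meant to be pointed at a scratch image file (for example a copy of a partition dump), not at a live block device.

```c
#include <stdio.h>
#include <string.h>

/* Overwrite len bytes starting at offset in an existing file with zeros.
 * Returns 0 on success, -1 on any error.
 * Assumption: path names a scratch image file, never a real device node. */
static int zero_range(const char *path, long offset, size_t len)
{
    unsigned char zeros[4096];
    FILE *f = fopen(path, "r+b");
    int rc = -1;

    if (f == NULL)
        return -1;
    memset(zeros, 0, sizeof zeros);
    if (fseek(f, offset, SEEK_SET) != 0)
        goto out;
    while (len > 0) {
        size_t chunk = len < sizeof zeros ? len : sizeof zeros;
        if (fwrite(zeros, 1, chunk, f) != chunk)
            goto out;
        len -= chunk;
    }
    rc = 0;
out:
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Calling zero_range("scratch.img", 16384, 32 * 1024) on a filesystem image and then running fsck against it shows the repairable case. What this cannot reproduce is the part that puzzles me: storage that refuses further writes afterwards.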
10th May 2012, 01:13 AM |#25  
Senior Member
Quote:
Originally Posted by sfhub

This is the part I don't understand about this answer. If they cannot permanently patch the EMMC firmware (presumably because it is not writable), how is the 32 KB of zeros corrupting that same unwritable firmware? If the firmware is kept on EMMC it should be writable, though I find it difficult to believe the EMMC firmware (essentially the "OS" handling the blocks on the EMMC) is being stored on the regular EMMC itself, subject to wear-leveling algorithms.

If that stuff about the firmware was just miscommunication and we are back to talking about filesystem corruption, I can write 32 KB of zeros to any of the filesystems just using "dd" and "/dev/zero". That kind of damage can be repaired by reformatting the partition. There is some "special" corruption or damage going on here, IMO.

It may be that the bug they are fixing is causing all this. I just cannot picture the mechanism yet with the descriptions being given.

Dear Mr. sfhub,
There is a new sheriff in town. Or soon will be.
It's going to be either Evo LTE or some kind of a variation.
Can we count on you to provide hacking/unlocking support for this puppy?
10th May 2012, 01:20 AM |#26  
sfhub (OP)
Senior Member
Probably not any time soon. I have turtle DNA in my family tree and move very slowly between phones
10th May 2012, 01:23 AM |#27  
azyouthinkeyeiz
Senior Member
Quote:
Originally Posted by sfhub

This is the part I don't understand about this answer. If they cannot permanently patch the EMMC firmware (presumably because it is not writable), how is the 32 KB of zeros corrupting that same unwritable firmware? If the firmware is kept on EMMC it should be writable, though I find it difficult to believe the EMMC firmware (essentially the "OS" handling the blocks on the EMMC) is being stored on the regular EMMC itself, subject to wear-leveling algorithms.

If that stuff about the firmware was just miscommunication and we are back to talking about filesystem corruption, I can write 32 KB of zeros to any of the filesystems just using "dd" and "/dev/zero". That kind of damage can be repaired by reformatting the partition. There is some "special" corruption or damage going on here, IMO.

It may be that the bug they are fixing is causing all this. I just cannot picture the mechanism yet with the descriptions being given.

That's what I had started to suppose in the post I removed. It would have to be binary inserts between the software and firmware that got corrupted. Otherwise, reformatting would correct it. But if it went into panic, it could have locked down the partition, and, like garwynn suggested, the controller hasn't been defined.

Also, there was a certain subset of variables that had to have been met during initial custom development of our device, that would have had the same consequences. We would have had at least one, and I don't think I ever heard of a similar brick before ICS. It could be an issue with this bug and the conversion between two different kernels, which is likely since it is the ICS recovery intents that are borking the partition.

---------- Post added at 07:23 PM ---------- Previous post was at 07:21 PM ----------

Quote:
Originally Posted by sfhub

I have turtle DNA in my family tree

/Epic
10th May 2012, 01:29 AM |#28  
sfhub (OP)
Senior Member
Quote:
Originally Posted by garwynn

If I look at this part of code alone:

Code:
cid_rev(0, 0x25, 1997, 1)
...this tells me to look for a definition of cid_rev. So I do and get here:
Code:
#define cid_rev(hwrev, fwrev, year, month)      \
        (((u64) hwrev) << 40 |                  \
         ((u64) fwrev) << 32 |                  \
         ((u64) year) << 16 |                   \
         ((u64) month))
...
Code:
#define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end,        \
                   _cis_vendor, _cis_device,                            \
                   _fixup, _data)                                       \
        {                                                  \
                .name = (_name),                           \
                .manfid = (_manfid),                       \
                .oemid = (_oemid),                         \
                .rev_start = (_rev_start),                 \
                .rev_end = (_rev_end),                     \
                .cis_vendor = (_cis_vendor),               \
                .cis_device = (_cis_device),               \
                .vendor_fixup = (_fixup),                  \
                .data = (_data),                           \
         }
So to properly identify what is affected:

Model Name (.name) -> VYL00M, KYL00M or MAG4FA
Manu. Firmware ID: --> 0x15
OEM ID: Any ID (easy extrapolation of CID_OEMID_ANY)
Revision Start (Range): The result of cid_rev(0, 0x25, 1997, 1). The date indicates a low limit value.
Revision End (Range): The result of cid_rev(0, 0x25, 2012, 12). The date indicates a high limit value.
Fixup: Function to call to add the fixup - in this case, add_quirk_mmc (data.h linked above)
Data: MMC_QUIRK_SAMSUNG_WL_PATH (Not 100% but looks like a label to me. Can't find a definition in change.)
...
The corrected eMMCs affected involve VYL00M, KYL00M or MAG4FA at Manufacturer Firmware ID 0x15.
...

But even if name, manfid, and oemid match, presumably the reason they give a revision start/end range is because it is actually being used to limit which revisions get the patch. Given that, our fwrev is 0x0 and they provided 0x25. Bitwise shift those over 32 bits and 0x0 still won't be within the range of 0x25.

Am I missing something?

If they wrote the range like this, then it would make more sense why it would apply to our phones:
Quote:

cid_rev(0, 0x0, 1997, 1), cid_rev(0, 0x25, 2012, 12)
instead of this
cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12)
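
To check the arithmetic, here is a small C sketch of the packing and the range test. cid_rev uses the same bit layout as the macro quoted above, written as a function. rev_in_range is a hypothetical helper that assumes the kernel applies the fixup when the card's packed revision falls between the start and end values, inclusive. The 09/2011 manufacturing date in the usage note is only an example value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit layout as the cid_rev macro quoted above:
 * hwrev at bit 40, fwrev at bit 32, year at bit 16, month at bit 0. */
static uint64_t cid_rev(uint64_t hwrev, uint64_t fwrev,
                        uint64_t year, uint64_t month)
{
    return (hwrev << 40) | (fwrev << 32) | (year << 16) | month;
}

/* Assumption: a fixup entry matches when start <= rev <= end. */
static bool rev_in_range(uint64_t rev, uint64_t start, uint64_t end)
{
    return rev >= start && rev <= end;
}
```

With fwrev 0x0 and an example date of 09/2011, the packed value is cid_rev(0, 0x0, 2011, 9). Because fwrev sits in higher bits than the year, that value is below cid_rev(0, 0x25, 1997, 1) no matter what the date is, so the range as published excludes it. Lowering only the start to cid_rev(0, 0x0, 1997, 1) would include it.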

10th May 2012, 01:40 AM |#29  
azyouthinkeyeiz
Senior Member
Quote:
Originally Posted by sfhub

But even if name, manfid, and oemid match, presumably the reason they give a revision start/end range is because it is actually being used to limit which revisions get the patch. Given that, our fwrev is 0x0 and they provided 0x25. Bitwise shift those over 32 bits and 0x0 still won't be within the range of 0x25.

Am I missing something?

I wonder what the control group and test group in their lab are. You said the scope was incomplete? It could include a higher range?
10th May 2012, 01:44 AM |#30  
sfhub (OP)
Senior Member
Quote:
Originally Posted by garwynn

So to properly identify what is affected:

Model Name (.name) -> VYL00M, KYL00M or MAG4FA
Manu. Firmware ID: --> 0x15
OEM ID: Any ID (easy extrapolation of CID_OEMID_ANY)
Revision Start (Range): The result of cid_rev(0, 0x25, 1997, 1). The date indicates a low limit value.
Revision End (Range): The result of cid_rev(0, 0x25, 2012, 12). The date indicates a high limit value.
Fixup: Function to call to add the fixup - in this case, add_quirk_mmc (data.h linked above)
Data: MMC_QUIRK_SAMSUNG_WL_PATH (Not 100% but looks like a label to me. Can't find a definition in change.)
...
The corrected eMMCs affected involve VYL00M, KYL00M or MAG4FA at Manufacturer Firmware ID 0x15.

BTW according to linux kernel documentation, manfid is "Manufacturer ID", not "Manufacturer Firmware ID"

http://www.kernel.org/doc/Documentat...-dev-attrs.txt

Quote:

fwrev Firmware/Product Revision (from CID Register) (SD and MMCv1 only)
hwrev Hardware/Product Revision (from CID Register) (SD and MMCv1 only)
manfid Manufacturer ID (from CID Register)

Also they mention a restriction on fwrev and hwrev of "SD and MMCv1 only". Is it possible we don't have MMCv1 and that is affecting what information the kernel is providing for fwrev and hwrev? I'm grasping at straws a little bit here, because based on folks' responses, it still doesn't sound like any of our EMMC falls into the range they are talking about. If the information provided by the kernel is accurate, we all seem to have the same EMMC revs, just with different manufacturing dates.
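
For anyone collecting these values, here is a hedged C sketch of parsing the attribute text the kernel exposes. It assumes the attributes are read from somewhere like /sys/block/mmcblk0/device/ (the exact path varies by device), that manfid, fwrev and hwrev are printed as hex strings such as "0x000015", and that date is printed as "MM/YYYY". The function names are hypothetical, and they take strings so the parsing can be tried without a phone attached.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a hex attribute such as "0x000015\n" (manfid, fwrev, hwrev).
 * Returns 0 on success, -1 if no digits could be parsed. */
static int parse_hex_attr(const char *text, unsigned long *out)
{
    char *end;
    unsigned long v = strtoul(text, &end, 16); /* base 16 accepts "0x" */

    if (end == text)
        return -1;
    *out = v;
    return 0;
}

/* Parse a date attribute, assumed to be in "MM/YYYY" form.
 * Returns 0 on success, -1 if the text does not match. */
static int parse_date_attr(const char *text, unsigned *month, unsigned *year)
{
    return sscanf(text, "%u/%u", month, year) == 2 ? 0 : -1;
}
```

The parsed fwrev, hwrev, year and month are the four inputs to cid_rev, so comparing them against the table entry shows directly whether a given chip would get the patch.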
10th May 2012, 01:49 AM |#31  
sfhub (OP)
Senior Member
Quote:
Originally Posted by azyouthinkeyeiz

I wonder what the control group and test group in their lab is? You said the scope was incomplete? It could include a higher range?

I mentioned that only as a possibility to rationalize the scenario where they put out a fix to address the corruption we are seeing, but for some reason it doesn't seem to be applied to our EMMC units based on the information provided by our kernels and their table.

I didn't mean I unconditionally knew their scope was incomplete. Another explanation could be that their fix has nothing to do with the /data corruption we are seeing. That is also a reasonable explanation why the fwrevs don't match ours.