
Discussion thread for /data EMMC lockup/corruption bug

By sfhub, Senior Member on 9th May 2012, 01:08 PM
9th May 2012, 04:19 PM |#11  
garwynn, Retired Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Sorry for the delay on giving more details. I've got info that I'll be passing along soon, just want to read up on something a little more before I post it out here.

Regardless of whether this fix turns out to be related to the issue we've been looking at, I want to throw this in now before getting into the weeds:

Big thanks to Ken Sumrall from the Android team as he's been good enough to share info with us about this bugfix. As the person who signed off on the change commit he's one of the best resources on this issue. Also want to give credit to Mr. Min of Samsung who developed the fix in question.

I'll be posting more shortly. Those with dev experience, particularly with C++ and/or Assembly may be able to help us as well.
9th May 2012, 04:28 PM |#12  
robertm2011, Senior Member
MAG4FA
0x0
0x0
0x000015
0x0100
11/2011
MMC

If need be, will test.
9th May 2012, 05:55 PM |#13  
garwynn
Source Notes - Part 1
(Please feel free to skip if you're not interested in the programming)

This section is just to document what models and versions are affected.
I'm posting this in rather lengthy detail in part for peer review and also as I misread this the first time.
If you want to see the results of this documentation you can skip to the bottom.

Again, you'll need the link to the change:
https://bitbucket.org/franciscofranc...t/cea631bdac53

If I look at this part of code alone:
Code:
cid_rev(0, 0x25, 1997, 1)
...this tells me to look for a definition of cid_rev. So I do and get here:
Code:
#define cid_rev(hwrev, fwrev, year, month)      \
        (((u64) hwrev) << 40 |                  \
         ((u64) fwrev) << 32 |                  \
         ((u64) year) << 16 |                   \
         ((u64) month))
It's not included in this change as this was introduced previously.
But you can see the definition here:
https://bitbucket.org/franciscofranc...nux/mmc/card.h
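
To make the packing concrete, here is a minimal user-space sketch (not kernel code). It reuses the macro with uint64_t standing in for the kernel's u64. The helper functions cid_rev_hwrev, cid_rev_fwrev, cid_rev_year and cid_rev_month are made up for illustration and do not exist in the kernel.

```c
#include <stdint.h>

/* Same packing as the kernel macro, with uint64_t standing in for u64. */
#define cid_rev(hwrev, fwrev, year, month)      \
        (((uint64_t) (hwrev)) << 40 |           \
         ((uint64_t) (fwrev)) << 32 |           \
         ((uint64_t) (year)) << 16 |            \
         ((uint64_t) (month)))

/* Hypothetical helpers (not in the kernel) that pull the fields back out. */
static inline unsigned cid_rev_hwrev(uint64_t rev)
{
        return (unsigned)((rev >> 40) & 0xff);
}

static inline unsigned cid_rev_fwrev(uint64_t rev)
{
        return (unsigned)((rev >> 32) & 0xff);
}

static inline unsigned cid_rev_year(uint64_t rev)
{
        return (unsigned)((rev >> 16) & 0xffff);
}

static inline unsigned cid_rev_month(uint64_t rev)
{
        return (unsigned)(rev & 0xffff);
}
```

With this layout, cid_rev(0, 0x25, 1997, 1) works out to 0x2507CD0001 and cid_rev(0, 0x25, 2012, 12) to 0x2507DC000C. The hardware and firmware revisions sit in the high bits, so they dominate any numeric comparison, and the year and month only order values that share the same revisions.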

OK, so I should look for HW revision 0x0, FW revision 0x25, right?
Nope. This was nested in another macro and I didn't read it correctly:
Code:
	MMC_FIXUP_REV("VYL00M", 0x15, CID_OEMID_ANY,
		      cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
		      add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
	MMC_FIXUP_REV("KYL00M", 0x15, CID_OEMID_ANY,
		      cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
		      add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
	MMC_FIXUP_REV("MAG4FA", 0x15, CID_OEMID_ANY,
		      cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
		      add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
OK, so back to the card.h file to get my definition of MMC_FIXUP_REV:

Code:
#define MMC_FIXUP_REV(_name, _manfid, _oemid, _rev_start, _rev_end,     \
                      _fixup, _data)                                    \
        _FIXUP_EXT(_name, _manfid,                                      \
                   _oemid, _rev_start, _rev_end,                        \
                   SDIO_ANY_ID, SDIO_ANY_ID,                            \
                   _fixup, _data)                                       \
I also want to look at _FIXUP_EXT next:
Code:
#define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end,        \
                   _cis_vendor, _cis_device,                            \
                   _fixup, _data)                                       \
        {                                                  \
                .name = (_name),                           \
                .manfid = (_manfid),                       \
                .oemid = (_oemid),                         \
                .rev_start = (_rev_start),                 \
                .rev_end = (_rev_end),                     \
                .cis_vendor = (_cis_vendor),               \
                .cis_device = (_cis_device),               \
                .vendor_fixup = (_fixup),                  \
                .data = (_data),                           \
         }
So to properly identify what is affected:

Model Name (.name): VYL00M, KYL00M or MAG4FA
Manufacturer ID (.manfid): 0x15
OEM ID: Any ID (easy extrapolation of CID_OEMID_ANY)
Revision Start (Range): The result of cid_rev(0, 0x25, 1997, 1). The date indicates a low limit value.
Revision End (Range): The result of cid_rev(0, 0x25, 2012, 12). The date indicates a high limit value.
Fixup: Function to call to add the fixup - in this case, add_quirk_mmc (card.h linked above)
Data: MMC_QUIRK_SAMSUNG_WL_PATCH (Not 100% but looks like a label to me. Can't find a definition in change.)
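
For anyone who wants to see how a table like this could be checked against a card, here is a rough user-space sketch. It is not the kernel's matching code: the struct layouts, the table name samsung_wl_fixups and the function card_needs_wl_patch are all assumptions made for illustration. The OEM ID check is left out because the entries use CID_OEMID_ANY. How the real kernel fills in the revision fields for a given chip is outside this sketch.

```c
#include <stdint.h>
#include <string.h>

#define cid_rev(hwrev, fwrev, year, month)      \
        (((uint64_t) (hwrev)) << 40 |           \
         ((uint64_t) (fwrev)) << 32 |           \
         ((uint64_t) (year)) << 16 |            \
         ((uint64_t) (month)))

/* Simplified stand-in for the card's CID data (assumed layout). */
struct card_id {
        const char *name;
        unsigned manfid;
        unsigned hwrev, fwrev;
        unsigned year, month;
};

/* Simplified stand-in for one fixup table entry (assumed layout). */
struct fixup_entry {
        const char *name;
        unsigned manfid;
        uint64_t rev_start, rev_end;
};

/* The three entries from the change, minus the OEM ID and callback. */
static const struct fixup_entry samsung_wl_fixups[] = {
        { "VYL00M", 0x15, cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12) },
        { "KYL00M", 0x15, cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12) },
        { "MAG4FA", 0x15, cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12) },
};

/* Returns 1 if the card matches any entry: same name, same manufacturer
 * ID, and a packed revision inside [rev_start, rev_end]. */
static int card_needs_wl_patch(const struct card_id *c)
{
        uint64_t rev = cid_rev(c->hwrev, c->fwrev, c->year, c->month);
        size_t i;

        for (i = 0; i < sizeof(samsung_wl_fixups) / sizeof(samsung_wl_fixups[0]); i++) {
                const struct fixup_entry *f = &samsung_wl_fixups[i];

                if (strcmp(f->name, c->name) == 0 &&
                    f->manfid == c->manfid &&
                    rev >= f->rev_start && rev <= f->rev_end)
                        return 1;
        }
        return 0;
}
```

In this model, both ends of the range use the same hardware and firmware revision. Only the date can vary inside the range, and 1/1997 through 12/2012 is wide enough that the date effectively stops mattering.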

Note:
The fix includes a note right above the list of affected models: "Date doesn't matter, so include all possible dates in min and max fields." I misread how they were getting the low and high limits of the range.


The corrected list of affected eMMCs: VYL00M, KYL00M or MAG4FA with Manufacturer ID 0x15.

I would like to apologize for providing inaccurate info the first time; after going through the code another time I'm fairly certain the correction to the affected model list is accurate.

This also confirms that those who have posted are in the affected list, which we knew but couldn't confirm until now.
9th May 2012, 06:06 PM |#14  
Azrael.arach, Senior Member
So does this mean the new code in the kernel is to help this problem and what would be the steps to testing it?

Sorry if dumb questions, just trying to learn.

Sent from my SPH-D710 using Tapatalk 2
9th May 2012, 06:39 PM |#15  
robertm2011
So theoretically, those of us who have posted are able to make full use of the ability to flash, backup, etc. with the proper modification to our kernels? (Hope I got this right.)

Sent from my SPH-D710 using Tapatalk 2
9th May 2012, 06:56 PM |#16  
garwynn
Quote:
Originally Posted by Azrael.arach

So does this mean the new code in the kernel is to help this problem and what would be the steps to testing it?

Sorry if dumb questions, just trying to learn.

I'm of the thought that 99% of all questions are never dumb. And that 1% is extremely rare.

The thread is something of a peer review and discussion of what we've found, so that as a community we can possibly confirm whether this is the bug causing the bricks (and by doing so confirm that this is the fix).

Quote:
Originally Posted by robertm2011

So theoretically, those of us who have posted are able to make full use of the ability to flash, backup, etc. with the proper modification to our kernels? (Hope I got this right.)

It means that the fix applies to those devices. What still isn't closed is whether the bug that this squishes is *the* bug (causing the ICS based bricks). I'm going to be posting more about that part here shortly for feedback and discussion.
9th May 2012, 07:06 PM |#17  
Senior Member
Very interesting but so far over my head....
9th May 2012, 07:21 PM |#18  
azyouthinkeyeiz, Senior Member
/* Missed the first paragraph of the details section. Low level corruption would do what I was questioning. Nvm
Sent from my SPH-D710 using Tapatalk 2
9th May 2012, 07:51 PM |#19  
garwynn
Discussion with Android Team
OK, now on to the bug that this fixes. This post will only contain the discussions between myself and Mr. Sumrall of the Android team.

Initial inquiry to Mr. Sumrall:
Quote:
Originally Posted by Garwynn

1) Was the bug that this patched causing the eMMC failures on Samsung devices using a 3.0+ kernel?

2) If #1 is yes, is it known if this corrects the I/O errors already experienced? Or is this perhaps preventative in nature?

Initial Response:
Quote:
Originally Posted by Ken Sumrall - Android Team

The bug was in the emmc firmware which ran on a small microprocessor inside the emmc chip, and it didn't matter what kernel was running on the device to which it was attached. However, it may be the case that a particular kernel version was more likely to trigger the bug.

With this patch, the bug is worked around, and the emmc chip should no longer corrupt data.

We also knew this from the code:
Code:
	 * There is a bug in some Samsung emmc chips where the wear leveling
	 * code can insert 32 Kbytes of zeros into the storage.  We can patch
	 * the firmware in such chips each time they are powered on to prevent
	 * the bug from occurring.
Note: Snipped last part of comment as it is already covered.

OK, so it's potentially putting zeros in the storage, but it doesn't give us any clues as to where in the storage this happens or how it could corrupt the filesystem. So I sent some follow-up questions and got the following responses. (Each numbered question to Mr. Sumrall is followed by his response.)

Quote:

1) Can this release fix a device where the bug has already been triggered (resulting in I/O error)?

No. The 32 Kbytes of zeros have already been inserted into the filesystem (usually in a particularly bad place, like the inode or block bitmaps, or the inode table) and the filesystem is now corrupt.

2) What would happen if a device is rolled back to a previous kernel - one without this fix? Would it be exposed again to the bug?

Yes, the corruption could happen on older kernels. The fix doesn't permanently fix the firmware, it patches the firmware every time the device is powered on (initial power-on, wakeup from sleep).

The next question was one that has been bugging me so I figured it wouldn't hurt to learn more about this bug. Sorry if this rubs some people the wrong way.

3) The explanation in the bugfix mentions 32 KB of zeros being added to storage. But I can't see this causing an I/O error unless it was doing this in the storage containing the instruction set. Was this somehow corrupting the I/O instruction set contained within the firmware? I have spent several weeks defending the opinion that this was not a hardware failure but software-based.

When the ext4 filesystem detects an error, and the filesystem is set to panic or re-mount read-only on error, the function ext4_handle_error() will record an EIO in the journal:

Code:
static void ext4_handle_error(struct super_block *sb)
{
  if (sb->s_flags & MS_RDONLY)
    return;

  if (!test_opt(sb, ERRORS_CONT)) {
    journal_t *journal = EXT4_SB(sb)->s_journal;

    EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
    if (journal)
      jbd2_journal_abort(journal, -EIO);
  }
  .
  .
  .
  .
}
So it doesn't have to be an actual low-level IO error to cause EIO to be recorded in the journal.

As for hardware vs. software failure, it is a bug in the firmware of the emmc chip, and this kernel patch enables a work-around to prevent the problem from happening.

Thanks again to Mr. Sumrall for this information. More soon.
9th May 2012, 07:56 PM |#20  
keithatoz, Member
Here's what I have - hope it helps. I would help test, but not until after the weekend. Thanks for all the work and info.


MAG4FA
0x0
0x0
0x000015
0x0100
10/2011
MMC
9th May 2012, 08:54 PM |#21  
garwynn
Analysis/Opinion
OK, now that you've seen the comments from the Android team, have links to the code and we have confirmed that our devices are affected, let's try and walk this through:

Linux File System Terms:

inode:
An inode holds a file's metadata, the data about the data (for example owner, permissions, size and the location of the file's data blocks). Out of the many articles out there, I thought this might help without going too much into the technical side: http://www.linux-mag.com/id/8658/

Block Bitmap:
In order to account for the usage of the blocks on the filesystem, the ext2 filesystem consists of a block bitmap. This keeps track of blocks that have been used and those that are free. Each bit in the Block Bitmap denotes an integral number of fragments. So if a bit is allocated to a file and marked as used, then an entire set of fragments are allocated to it.

The Block Bitmap is a clever way to keep track of new empty and old used ones. In order to look for a block, one needs to check the group to which the file belongs. Then the Block Bitmap of the appropriate `Group’ is selected and searched for the required block.

(Source: http://freeos.com/node/41)
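
As a rough illustration of why zeros landing on a block bitmap are so damaging, here is a small user-space sketch. It is not filesystem code. The 4 KB block size is an assumption, and the struct and function names (block_bitmap, bitmap_mark_used, bitmap_clobber_with_zeros and so on) are made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define FS_BLOCK_SIZE    4096                        /* assumed block size */
#define BLOCKS_PER_GROUP ((size_t)FS_BLOCK_SIZE * 8) /* one bit per block */

/* One bitmap block: bit N set means block N of the group is in use. */
struct block_bitmap {
        unsigned char bits[FS_BLOCK_SIZE];
};

static void bitmap_mark_used(struct block_bitmap *bm, size_t block)
{
        bm->bits[block / 8] |= (unsigned char)(1u << (block % 8));
}

static int bitmap_is_used(const struct block_bitmap *bm, size_t block)
{
        return (bm->bits[block / 8] >> (block % 8)) & 1;
}

static size_t bitmap_count_used(const struct block_bitmap *bm)
{
        size_t block, used = 0;

        for (block = 0; block < BLOCKS_PER_GROUP; block++)
                used += (size_t)bitmap_is_used(bm, block);
        return used;
}

/* Models the effect of the firmware bug when the zeros land on the
 * bitmap: every "in use" bit is wiped, the file data is untouched. */
static void bitmap_clobber_with_zeros(struct block_bitmap *bm)
{
        memset(bm->bits, 0, sizeof(bm->bits));
}
```

After bitmap_clobber_with_zeros(), every block in the group looks free even though files still point at them. Under the 4 KB assumption, 32 KB of zeros covers eight such blocks, so a single hit in the metadata area could wipe the allocation records for a large stretch of the partition.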

Q: Can inserting 32 KB of zeros corrupt a file system?
Certainly - as mentioned by Mr. Sumrall, it just depends on where the insertion was made.

Q: Can this corruption cause an I/O error?
Again, yes - under the conditions as described by Mr. Sumrall

Q: Can this particular bug be repaired?
Mr. Sumrall says no, and that sounds awfully similar to what we see with our bricks.

Q: Why couldn't JTAG, with its bit blast, at least reset the values and allow it to go ahead?
The answer, as far as I can tell, is simply that the support is not available at a driver and/or eMMC controller level to handle this type of operation. This is either because the embedded controller chip simply cannot do so or because the driver has not yet been designed to use those instructions. As a result, all it can do is throw an error in frustration. It might be possible down the road but not now. So instead of waiting for this to come, Samsung implemented a workaround to the wear leveling (WL) logic to keep it from corrupting the filesystem.

I'm still stumped as to why we saw this particularly on the /data portion of the filesystem. It's the partition most likely to see frequent file changes, so perhaps the wear leveling logic kicked in on this partition first. It's also interesting to note that bypassing that block restores the file system to something stable, as tested by drnull. But if this truly is the bug, skipping the bad blocks is not solving the problem; it's only extending the life at the cost of possibly further corrupting the file system. I'm optimistic that it may be possible in the future to save a device even after it's bricked: filesystem corruption is not physical damage, so I consider it in the realm of possibility. Whether it is practical or cost-effective is up to Samsung; they may even have a solution already available, just not for end users.

Initial Summary:
Based on available information, this has significant credibility as the bug in question, and the patch is a rather clever attempt to work around the issue short of eMMC replacement. It should be tested for verification by a willing member of the community, so long as they can afford to brick and replace the device if necessary. If verified, the solution may not save an already bricked device at the moment, but it may avoid future bricks of this nature. It would also mean that Android 4.0.4_r1.1 (the first standard build with the fix) should very probably be the minimum version for any device with this eMMC, if it can be supported.

*Disclaimer*
Comments and summary on this post, unless otherwise specified, should not be considered the definitive conclusion for this topic. Instead it is a summary of my observations - as such it should be reviewed and critiqued by others for possible improvement before the community comes to a conclusion.