
Discussion thread for /data EMMC lockup/corruption bug

By sfhub, Senior Member on 9th May 2012, 01:08 PM
17th May 2012, 05:52 AM |#151  
OP Senior Member
Thanks Meter: 7,219
 
Quote:
Originally Posted by garwynn

@sfhub,

Typing this real quick while the little one is sleeping. #1 answer is here:
http://forum.xda-developers.com/show...php?p=25925957

Yes, I understand that post is explaining why our fwrev is 0x0.

My point above is you are suggesting fwrev being 0x0 is *an error in the kernel source* and it really should be set to 0x19 in the example above. I'm asking how you came to the conclusion PRV should be used to populate fwrev, as there are some inconsistencies with making that assumption and also I didn't see anywhere it says explicitly it should be the case that for v4 MMCA/JEDEC, PRV should be used as fwrev. There are some bit-width differences and also differences in interpretation. Just re-read the post I made and you'll see them detailed.


Quote:
Originally Posted by garwynn

But if that is there, why isn't it picked up? That is why I wanted the other thread linked as Entropy512 brought up a good point on that... And why I changed my approach for now.

Could you point me to the post for the good point from Entropy512 you are referring to?
 
 
17th May 2012, 06:21 AM |#152  
garwynn's Avatar
Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Flag Okinawa
Thanks Meter: 8,635
 
Quote:
Originally Posted by sfhub

Yes, I understand that post is explaining why our fwrev is 0x0.

My point above is you are suggesting fwrev being 0x0 is *an error in the kernel source* and it really should be set to 0x19 in the example above. I'm asking how you came to the conclusion PRV should be used to populate fwrev, as there are some inconsistencies with making that assumption and also I didn't see anywhere it says explicitly it should be the case that for v4 MMCA/JEDEC, PRV should be used as fwrev. There are some bit-width differences and also differences in interpretation. Just re-read the post I made and you'll see them detailed.

Attached wrong link... how about this instead?
http://forum.xda-developers.com/show...7&postcount=56

That should clear that part up. Sorry about the previous link.

Quote:
Originally Posted by sfhub

Could you point me to the post for the good point from Entropy512 you are referring to?

Quoted post from Entropy and my reply:
http://forum.xda-developers.com/show...&postcount=232
17th May 2012, 08:36 AM |#153  
OP Senior Member
Thanks Meter: 7,219
 
Quote:
Originally Posted by garwynn

Attached wrong link... how about this instead?
http://forum.xda-developers.com/show...7&postcount=56

That should clear that part up. Sorry about the previous link.

Ah, ok, that clears things up regarding stuffing PRV into fwrev.

So then based on that, there is a very high likelihood the original conclusion that we have the emmc workaround in our kernel is not correct. Was that still in question? I've lost track of the timeline in the various threads.

If we had the emmc fix in our kernel, then it shouldn't be returning 0x0 for fwrev and instead should have returned 0x19 (because stuffing PRV into fwrev came in the same changeset as the emmc fix). Since it didn't return 0x19, there is a very high likelihood we don't have the other change (emmc fix/workaround)

Quote:
Originally Posted by garwynn

Quoted post from Entropy and my reply:
http://forum.xda-developers.com/show...&postcount=232

Quote:
Originally Posted by garwynn

If a Google SE - with 20+ years experience (based on his LinkedIn profile) and extensive knowledge on his OS - says this would corrupt the file system to an uncorrectable point I have to give credence to it. The fact that the fix came from Samsung and he applied it would also suggest he's had at least some, if not extended, dialogue with their dev team on this issue.

My reading of Sumrall's response was the filesystem data could not be recovered to the pre-corruption state, after it had been corrupted by the writing of 32KB of zeros.

I didn't read his response as "the emmc is so screwed after the writing of 32KB of zeros that that section of EMMC can never be written to again"
17th May 2012, 04:50 PM |#154  
garwynn's Avatar
Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Flag Okinawa
Thanks Meter: 8,635
 
Quote:
Originally Posted by sfhub

My reading of Sumrall's response was the filesystem data could not be recovered to the pre-corruption state, after it had been corrupted by the writing of 32KB of zeros.

I didn't read his response as "the emmc is so screwed after the writing of 32KB of zeros that that section of EMMC can never be written to again"

You're reading it the same as I am - perhaps uncorrectable was not the right word to use there. I think it's not possible to correct this at the moment and that it's simply due to a lack of the necessary software solution to manipulate the controller to do what is necessary. (I noted this in the same post that you quoted)

In terms of experience and educational background, I'll admit that I'm probably a lightweight compared to several others involved in the discussion. But the basic premise I remember from my classes should stand: an electronic device is limited either by physical means or by the lack of a software solution to accomplish the goal. There doesn't seem to be a physical explanation for these bricks (which is where I have disagreed with MobileTechVideos on the topic), so the only logical alternative is a lack of software support to correct the issue.
17th May 2012, 07:49 PM |#155  
OP Senior Member
Thanks Meter: 7,219
 
Quote:
Originally Posted by garwynn

There doesn't seem to be a physical explanation for these bricks (which is where I have disagreed with MobileTechVideos on the topic) so the only logical alternative is a lack of software support to correct the issue.

Well, if I program the hard drive firmware to spin faster than the hardware can handle and it crashes, it was a software problem that caused hardware damage that can't be corrected in software.

The firmware on these EMMC chips has to handle low-level stuff like voltage levels, time-sensitive operations, etc. If there is a bug in certain operations, they can cause hardware issues. If those operations are avoided by using alternate operations, then you won't know there is any problem.

I've had a bunch of same-brand SD cards that would hard crash with one particular card reader (actually the particular chipset controller in that reader), because the voltage levels were not what the cards expected during write operations. Switch card readers and everything worked perfectly.
17th May 2012, 07:57 PM |#156  
OP Senior Member
Thanks Meter: 7,219
 
Quote:
Originally Posted by garwynn

You're reading it the same as I am - perhaps uncorrectable was not the right word to use there. I think it's not possible to correct this at the moment and that it's simply due to a lack of the necessary software solution to manipulate the controller to do what is necessary. (I noted this in the same post that you quoted)

Actually, I was thinking the 32KB zero problem is not what we are experiencing. I could be wrong, and it might still be the cause, but based on the explanation of the problem they are fixing, I can't see a mechanism by which it would leave EMMC blocks that can never be written to again. That is making me lean towards thinking what we are experiencing is a different problem, but possibly also in the EMMC firmware.

I also feel, based on the data provided, for our problem, this is either permanent hardware damage, or damage where we won't be able to repair outside of a lab (which is basically the same thing for our purposes)

I know you don't agree with this assessment. Not knowing the true cause can allow for multiple possibilities, and we all have opinions based on interpretation of the data.
17th May 2012, 09:19 PM |#157  
garwynn's Avatar
Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Flag Okinawa
Thanks Meter: 8,635
 
Quote:
Originally Posted by sfhub

Actually I was of the thinking the 32KB zero problem is not what we are experiencing. I could be wrong, and it might still be the cause, but I can't see the mechanism where it would result in EMMC blocks which can't be written to again, based on the explanation of the problem they are fixing. That is making me lean towards thinking what we are experiencing is a different problem, but possibly also in the EMMC firmware.

I also feel, based on the data provided, for our problem, this is either permanent hardware damage, or damage where we won't be able to repair outside of a lab (which is basically the same thing for our purposes)

I know you don't agree with this assessment. Not knowing the true cause can allow for multiple possibilities, and we all have opinions based on interpretation of the data.

There always exists the case that there are multiple issues going on with these eMMCs, but then what we're doing now should help at the least by process of elimination. It would certainly explain why we can't fill all of the proverbial holes with what we know. Source will also help us greatly once we get it - but that also depends on if they'll give the whole thing or hold back on the proprietary code.

Have we gotten any better visibility into the "superbrick" at the time that it happens? Maybe try and have someone run a logcat window while performing these operations?
18th May 2012, 12:05 AM |#158  
OP Senior Member
Thanks Meter: 7,219
 
Well, my takeaway so far is I'm willing to wipe my system using delete_recursive. Format is a possibility also, but I'm not completely sure. However, I will completely avoid wipe data/factory reset. GB-based kernel and ODIN appear to be safe also, but we already knew that.
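To make the distinction concrete, here is a minimal Python sketch of what a file-level wipe in the spirit of delete_recursive does: it unlinks entries one by one and leaves the filesystem itself in place, as opposed to formatting or erasing the whole partition. This is only an illustration run against a throwaway temp directory. The helper name `delete_contents` is made up for this example, and this is not the recovery's actual implementation. Whether a given kernel turns file deletions into eMMC erase commands is a separate question this sketch does not answer.

```python
import os
import shutil
import tempfile


def delete_contents(root: str) -> int:
    """Remove everything inside root, one entry at a time, keeping root itself.

    Hypothetical helper for illustration; returns the number of top-level
    entries that were removed.
    """
    removed = 0
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)   # recurse into real directories
        else:
            os.unlink(entry.path)       # files and symlinks
        removed += 1
    return removed


# Demo on a scratch directory, never on a real /data partition.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "app", "cache"))
with open(os.path.join(scratch, "app", "cache", "x.bin"), "wb") as f:
    f.write(b"\x00" * 16)
with open(os.path.join(scratch, "settings.db"), "w") as f:
    f.write("dummy")

removed_count = delete_contents(scratch)
remaining = os.listdir(scratch)
print(removed_count, remaining)
os.rmdir(scratch)
```

The point of the design is simply that the directory (standing in for the mounted filesystem) survives and only its contents go away.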

I'll go out on a limb and just predict that we won't find a "bug" in the kernel even when source is released. We will probably figure out that the ICS kernel produces a different set of commands to the EMMC to achieve certain functions and then we'll revert to the GB way of doing things and avoid the bug. IMO for this type of crash to happen, the likelihood is the core problem is in the EMMC firmware and the best we can hope for is to avoid the problem (which should be good enough for almost everyone)
18th May 2012, 03:13 PM |#159  
garwynn's Avatar
Forum Moderator / Inactive Recognized Developer / XDA Portal Team
Flag Okinawa
Thanks Meter: 8,635
 
Morning Update
Well, it's been some time but thankfully Mr. Sumrall from Android did get back to us on our questions. I think the community will find that this was worth the wait.

Issue: fwrev not set properly.
As we suspected the bugfix is not in our build. (The patch applies this unconditionally.)
Quote:
Originally Posted by Ken Sumrall

The patch includes a line in mmc.c setting fwrev to the right bits from the cid register. Before this patch, the file /sys/class/block/mmcblk0/device/fwrev was not initialized from the CID for emmc devices rev 4 and greater, and thus showed zero.

(On second inquiry)
fwrev is zero until the patch is applied.
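For anyone who wants to check their own device, here is a rough Python sketch of pulling the product revision (PRV) byte out of a raw CID. It assumes the v4+ eMMC CID layout where PRV sits in bits 55:48 of the 128-bit register, and that the `cid` attribute lives next to `fwrev` in the sysfs directory quoted above. The sample CID below is synthetic, built for illustration only, and the helper names are mine, not from the kernel patch.

```python
def prv_from_cid(cid_hex: str) -> int:
    """Return the 8-bit PRV field (assumed bits 55:48) of a 128-bit CID hex string."""
    cid_hex = cid_hex.strip()
    if len(cid_hex) != 32:
        raise ValueError("expected 32 hex digits (128 bits)")
    cid = int(cid_hex, 16)
    return (cid >> 48) & 0xFF


def read_sysfs(name: str, dev: str = "/sys/class/block/mmcblk0/device") -> str:
    """Read one attribute (e.g. 'cid' or 'fwrev') from the device's sysfs directory."""
    with open(f"{dev}/{name}") as f:
        return f.read().strip()


# Synthetic CID (NOT from a real device): MID, CBX, OID, 6-byte name, PRV, PSN, MDT, CRC.
SAMPLE_CID = "15" + "01" + "00" + "585858585858" + "19" + "12345678" + "3f" + "01"

print(hex(prv_from_cid(SAMPLE_CID)))  # 0x19
```

On a device you would compare `prv_from_cid(read_sysfs("cid"))` against `read_sysfs("fwrev")`. If fwrev reads 0x0 while the PRV byte is non-zero, that matches the unpatched behavior described in the quote.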

Question: Revision didn't match the fix
(Emphasis mine in red as it discusses the superbrick issue.)
Quote:
Originally Posted by Ken Sumrall

You probably have the bug, but rev 0x19 was a previous version of the firmware we had in our prototype devices, but we found it had another bug that if you issued an mmc erase command, it could screw up the data structures in the chip and lead to the device locking up until it was power cycled. We discovered this when many of our developers were doing a fastboot erase userdata while we were developing ICS. So Samsung fixed the problem and moved to firmware revision 0x25. Yes, it is very annoying that 0x19 is decimal 25, and that led to lots of confusion when trying to diagnose emmc firmware issues. I finally learned to _ALWAYS_ refer to emmc version in hexadecimal, and precede the number with 0x just to be unambiguous.

However, even though 0x19 probably has the bug that can insert 32 Kbytes of zeros into the flash, you can't use this patch on devices with firmware revision 0x19. This patch does a very specific hack to two bytes of code in the revision 0x25 firmware, and the patch most likely will not work on 0x19, and will probably cause the chip to malfunction at best, and lose data at worst. There is a reason the selection criteria are so strict for applying this patch to the emmc firmware.

I passed on our results a few days later mentioning that the file system didn't corrupt until the wipe. This is a response to that follow-up.

As I mentioned in the previous post, firmware rev 0x19 has a bug where the emmc chip can lock up after an erase command is given. Not every time, but often enough. Usually, the device can reboot after this, but then lock up during the boot process. Very rarely, it can lock up even before fastboot is loaded. Your tester was unlucky. Since you can't even start fastboot, the device is probably bricked. :( If he could run fastboot, then the device could probably be recovered with the firmware update code I have, assuming I can share it. I'll ask.
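The hex/decimal mix-up Sumrall describes is easy to reproduce. A quick Python sketch (the `fmt_rev` helper is just a name picked for this example) shows why "25" is ambiguous and why always writing the 0x prefix helps:

```python
def fmt_rev(rev: int) -> str:
    """Format a firmware revision the unambiguous way: hex with a 0x prefix."""
    return f"0x{rev:02x}"


old_rev = 0x19   # the earlier firmware from the quote
new_rev = 0x25   # the fixed firmware from the quote

print(old_rev, fmt_rev(old_rev))   # 25 0x19
print(new_rev, fmt_rev(new_rev))   # 37 0x25

# "25" read as decimal is the OLD firmware; read as hex it is the NEW one.
print(fmt_rev(int("25", 10)), fmt_rev(int("25", 16)))   # 0x19 0x25
```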

Question: Why the /data partition?
Quote:
Originally Posted by Ken Sumrall (Android SE)

Because /data is the place on the chip that experiences the most write activity. /system is never written to (except during a system update) and /cache is rarely used (mostly for receiving OTAs).

Question: Why JTAG won't work?
Quote:
Originally Posted by Ken Sumrall

As I mention above, the revision 0x19 firmware had a bug that after an emmc erase command, it could leave the internal data structures of the emmc chip in a bad state that causes the chip to lock up when a particular sector was accessed. The only fix was to wipe the chip, and update the firmware. I have code to do that, but I don't know if I can share it. I'll ask.

Question: Can a corrupted file system be repaired (on the eMMC)?
Quote:
Originally Posted by Ken Sumrall

e2fsck can repair the filesystem, but often the 32 Kbytes were inserted at the start of a block group, which erased many inodes, and thus running e2fsck would often result in many files getting lost.
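To get a feel for why so many files could go missing, here is some back-of-the-envelope arithmetic in Python. The 32 Kbyte figure is from the quote. The 256-byte inode size and 4 KiB block size are assumptions (common ext4 settings), not values confirmed for these devices, and the worst case assumes the whole run of zeros lands inside the inode table.

```python
ZERO_RUN_BYTES = 32 * 1024   # from the quote: 32 Kbytes of zeros
INODE_SIZE = 256             # ASSUMPTION: common ext4 inode size
BLOCK_SIZE = 4096            # ASSUMPTION: common ext4 block size


def inodes_lost(zero_run: int = ZERO_RUN_BYTES, inode_size: int = INODE_SIZE) -> int:
    """Worst case: every zeroed byte falls inside the inode table."""
    return zero_run // inode_size


def blocks_hit(zero_run: int = ZERO_RUN_BYTES, block_size: int = BLOCK_SIZE) -> int:
    """Number of filesystem blocks covered by the run of zeros."""
    return zero_run // block_size


print(inodes_lost())  # 128
print(blocks_hit())   # 8
```

Under those assumptions, one bad write could wipe on the order of a hundred inodes, each of which may describe a file or a directory, which fits the "many files getting lost" outcome after e2fsck.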

So, while the fix doesn't apply to us at the moment, we've been given a great insight into the superbrick issue as well as information that a fix is already developed (hopefully we'll see it released!). The bug likely applies to us and assuming the fix for the 0x19 firmware is given then it would apply to our devices.

On a lighter note, I wanted to include his close:
Quote:
Originally Posted by Ken Sumrall

You are getting a glimpse into the exciting life of an Android kernel developer. :) Turns out the job is mostly fighting with buggy hardware. At least, it seems that way sometimes.

18th May 2012, 04:18 PM |#160  
azyouthinkeyeiz's Avatar
Senior Member
Thanks Meter: 122
 
I also think (after that last quote, garwynn) this is a good time to explain to those who are constantly complaining about waiting for releases that these issues are examples of what holds that process up. I always hear, "Oh, Samsung sucks. They are so slow with updates."

Well, innovation takes time. When you are going above and beyond to make quality software, and not just keeping it as close to the source as you can, it takes time.

Would you guys rather risk bricks, (after updating to unvetted software) than have just a little patience?



Sent from my SPH-D710 using Tapatalk 2
18th May 2012, 04:26 PM |#161  
Esoteric68's Avatar
Senior Member
Flag Hellabama
Thanks Meter: 1,482
 
Quote:
Originally Posted by azyouthinkeyeiz

I also think (after that last quote, garwynn) this is a good time to explain to those who are constantly complaining about waiting for releases that these issues are examples of what holds that process up. I always hear, "Oh, Samsung sucks. They are so slow with updates."

Well, innovation takes time. When you are going above and beyond to make quality software, and not just keeping it as close to the source as you can, it takes time.

Would you guys rather risk bricks, (after updating to unvetted software) than have just a little patience?



Sent from my SPH-D710 using Tapatalk 2

As eager as I am to have source, I'd much prefer they take whatever time they need to get the big issues in hand.

Little bugs I can live with; there are always follow-on updates to correct them or workable ways to avoid them. But this bricking issue is something we need resolved, no matter how long it takes.

I appreciate the immense amount of time you guys are spending on this trying to help the process along.