Samsung working on Emmc brick fix

Mario1968 · Sep 12, 2012

jetmarc said:
Sure: https://bitbucket.org/franciscofranco/android-tuna-omap/changeset/cea631bdac53/raw/

The value 0xD20228FF looks suspiciously like

cmp r0,#0xff
bcs there
...
...
...
there:
...

The other value 0x000000FF looks suspiciously related to that interpretation, too.

There also seems to be code to dump firmware, which is necessary to analyse where to hook, to learn how to do IO and raw array access inside the chip, and also to make sense of the existing data structures. We could also compare vulnerable firmware to fixed firmware, highlighting which portions are relevant for the damage mechanism.

Hi Paul, what do you think about this fix/code? Is it enough to prevent the eMMC brick? And what about the bricked devices: can this code be used to revive those devices?
Sorry for the noob questions, but I prefer to know what the experts think about this fix.
Thank you
Mario

Sent from my Galaxy S2 with Tapatalk2®
XWLPF Stock
Siyah 4.1.5
Jkay V14.1

aletheus · Sep 13, 2012

+1
you guys have a few readers trying to wrap their heads around this problem. superbricked my p6800 a couple of weeks ago.... it was like losing a pet.....

definitely interested in understanding this problem here.

PaulF8080 · Sep 13, 2012

garwynn said:
Thank you for the clarification - I've made it more than clear that I'm the amateur in the group. You are certainly right to criticize those remarks and I was wrong to post that, no matter what your opinion or background was. (Actually in this case it embarrasses me far more than you for clearly showing ignorance and a lack of knowledge on the subject)

I will revise that statement with an apology immediately.

As for further comments from me I think I'm better off holding any further comments/questions for now while the experts in the room hash out the more technical aspects. I've made myself look stupid enough for the day, no?

Please do not be embarrassed. I always tell people HW debug is a very weird job. You are wrong 90% of the time. Debuggers always have their own guesses. The rules are: there are no stupid guesses, and listen to everyone. I was trying to get off the "FW is borked" track.

Sent from my SAMSUNG-SGH-I717 using xda app-developers app

Entropy512 · Sep 13, 2012

garwynn said:
Paul,

JEDEC standards specify the requirement for an embedded controller as part of the eMMC package, thus why a standardized command set. (Edit: At least as far as I have read and understood it - I can certainly be wrong here as proven already today)
(Sentence following this removed as it was offensive and did not take into account Paul's background and experience, which is far, FAR more than mine in these matters. My apologies for posting it in the first place and hope this will not detract from the discussion any further. )

---------- Post added at 01:32 PM ---------- Previous post was at 01:21 PM ----------

Refresher: Back and Forth with Mr. Sumrall (5/22/12)
Bold is me, non-bold is Mr. Sumrall's response:

A couple of things if I re-read this:
1) I suspect this is what jetmarc and Entropy are suggesting can only be done once (maybe CMD26 is a part of this?). But if so how did they develop a patch for it in the first place to update to revision 0x25? And if it was that unstable did they burn through a lot of chips or GNex phones just to fix this for Google to continue their kernel development? That seems awfully... cost insensitive.

2) Don't know if what Mr. Sumrall is mentioning is the same thing that Entropy was given access to. It's never been definitively confirmed nor can I assume it will be.

I wasn't given access to anything. The original hope was that Samsung and I would be providing this information when we met - but the guy working with the fix (He was at the meeting) was experiencing a very high failure rate. I think back in Korea he had a fairly large pile of dead chips. (They had an I9100 and an N7000 with socketed eMMC. It was pretty cool tbh.)

The problem is we'll likely never know exactly what Google had access to and if it even was the same as what the storage/memory guys gave to the engineer that was at the meeting with myself and my primary contact. (There were a total of three Samsung people there.) This stuff is all considered highly proprietary by the storage guys and there's also clearly a lot of issues with internal coordination/communication at Samsung (even within the same division, let alone cross-division.)

The eMMC standard allows vendors to implement proprietary extended commands, the update/repair process likely used one of these. (I'd link to the relevant kernel patch but I can't find it at the moment... drivers/mmc has a LONG changelog...)

It was my thinking at the time that this was being patched into RAM outside of the eMMC (since I thought it didn't have any usable RAM, just the controller and storage) - but if I put it into context as jetmarc suggested then I can see that his analysis makes more sense to me.

Question though: Would you have to re-patch the eMMC firmware again upon wake if indeed the patch took place in the SRAM (perhaps this is like a cache?) for the eMMC?

The internal controller within the eMMC (which translates MMC commands into whatever the underlying memory technology understands) appears, based on all of these patches, to have SRAM. It seems like it shadows its firmware image to SRAM on powerup. You won't see anything about this in the JEDEC spec because this is all internal manufacturer-specific implementation stuff.

Mario1968 said:
Hi Paul, what do you think about this fix/code? Is it enough to prevent the eMMC brick? And what about the bricked devices: can this code be used to revive those devices?
Sorry for the noob questions, but I prefer to know what the experts think about this fix.
Thank you
Mario

Sent from my Galaxy S2 with Tapatalk2®
XWLPF Stock
Siyah 4.1.5
Jkay V14.1

That code will do nothing for the eMMC bug. Here's a summary of what we know from Ken for the VYL00M/MAG4FA:
Fwrev 0x19 has two bugs - one that can cause the eMMC's internal data structures to get severely corrupted when a secure erase is issued (Superbrick), and a second one where the wear leveller will occasionally insert 32kb of zeros that shouldn't be there.
Fwrev 0x25 has one known bug - the 32kb of zeroes one (it's immune to the secure erase issue). The patch that is changing the firmware in SRAM is issuing a fix for this bug. It won't fix the secure erase bug and in fact cannot be applied to fwrev 0x19.
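For readers who want to check which case applies to their own chip, here is a minimal Python sketch. It assumes the firmware revision is the PRV byte of the eMMC CID register (bits [55:48] in the JEDEC layout, i.e. the 19th and 20th hex digits), and that the CID can be read as a 32-digit hex string, typically from `/sys/block/mmcblk0/device/cid` on Linux. The function names and `EXAMPLE_CID` are made up for illustration (the serial, date and CRC bytes in it are placeholders), and the verdict strings only restate the summary above.

```python
# Sketch only: classify an eMMC by the firmware revision byte in its CID.
# The verdicts restate the forum summary; this is not an official Samsung list.

AFFECTED_NAMES = {"VYL00M", "MAG4FA"}

# Hypothetical CID: Samsung MID 0x15, name "VYL00M", PRV 0x19, then
# placeholder serial/date/CRC bytes.
EXAMPLE_CID = "15010056594c30304d19deadbeef8f00"


def parse_cid(cid_hex):
    """Split a 32-hex-digit CID string into the fields used here."""
    raw = bytes.fromhex(cid_hex.strip())
    if len(raw) != 16:
        raise ValueError("CID must be 16 bytes (32 hex digits)")
    return {
        "manfid": raw[0],                                    # MID, bits [127:120]
        "name": raw[3:9].decode("ascii", errors="replace"),  # PNM, bits [103:56]
        "fwrev": raw[9],                                     # PRV, bits [55:48]
    }


def describe(cid_hex):
    """Return a one-line verdict for the chip described by cid_hex."""
    cid = parse_cid(cid_hex)
    if cid["name"] not in AFFECTED_NAMES:
        return "chip not covered by this summary"
    if cid["fwrev"] == 0x19:
        return "fwrev 0x19: secure-erase (Superbrick) bug and 32kb-of-zeros bug"
    if cid["fwrev"] == 0x25:
        return "fwrev 0x25: 32kb-of-zeros bug only"
    return "fwrev 0x%02x: not covered by this summary" % cid["fwrev"]


if __name__ == "__main__":
    print(describe(EXAMPLE_CID))
```

Treat the result as a hint only: a chip that falls outside this summary is not thereby proven safe.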

PaulF8080 said:
Please do not be embarrassed. I always tell people HW debug is a very weird job. You are wrong 90% of the time. Debuggers always have their own guesses. The rules are: there are no stupid guesses, and listen to everyone. I was trying to get off the "FW is borked" track.

Sent from my SAMSUNG-SGH-I717 using xda app-developers app

The problem is, we know for a FACT that it is the firmware that is borked.

Seriously, you, as an I717 owner, could fire secure erase commands at your eMMC all day long without ever worrying, because all I717 units that shipped contained updated eMMC firmware that is immune to this bug.

garwynn · Sep 13, 2012

Entropy512 said:
That code will do nothing for the eMMC bug. Here's a summary of what we know from Ken for the VYL00M/MAG4FA:

Fwrev 0x19 has two bugs - one that can cause the eMMC's internal data structures to get severely corrupted when a secure erase is issued (Superbrick), and a second one where the wear leveller will occasionally insert 32kb of zeros that shouldn't be there.
Fwrev 0x25 has one known bug - the 32kb of zeroes one (it's immune to the secure erase issue). The patch that is changing the firmware in SRAM is issuing a fix for this bug. It won't fix the secure erase bug and in fact cannot be applied to fwrev 0x19.

I want to bring this up again because I'm noticing something that might be explained away by this.
Lately on flashes people are reporting random hangs - and it's not quite certain what the cause is. But if I think about the possibility for corruption by the above WL bug... a corrupted apk could definitely cause it to crash.

So if people are noticing odd behavior with an application it may be worth reinstalling just to make sure this bug isn't to blame. Just a thought for consideration.....

alvinasnow · Sep 13, 2012

Well.... looks like all this 2-3 months wait is gonna go astray. I am losing hope to find any solution for my superbricked eMMC. The best thing is that now, people have stopped giving us hope too

Dryblow · Sep 13, 2012

Hi, is it fixed on the LRQ firmware?

p.s. sorry for my english.

baz77 · Sep 13, 2012

dekelm8 said:
I know that a regular user cannot replace it at home, but I worked in a cellphone lab for 8 years.
Anyway, I purchased the eMMC chip over the net (around 70 euro) and flashed it with my RIFF Box.
Now it works like a charm, all 16GB just like new.

Thanks anyway to all the forum about the info...

so cool, maybe walkthrough photos?

Sent from my GT-N7000 using Tapatalk 2

friedje · Sep 13, 2012

garwynn said:
I want to bring this up again because I'm noticing something that might be explained away by this.
Lately on flashes people are reporting random hangs - and it's not quite certain what the cause is. But if I think about the possibility for corruption by the above WL bug... a corrupted apk could definitely cause it to crash.

So if people are noticing odd behavior with an application it may be worth reinstalling just to make sure this bug isn't to blame. Just a thought for consideration.....

There has been a bug in Mobile Odin with the latest ROMs. It caused /cache to become corrupted, and I think most flash hangs reported lately came from this problem. Chainfire has updated MO since.

---------- Post added at 08:32 PM ---------- Previous post was at 08:23 PM ----------

baz77 said:
so cool, maybe walkthrough photos?

Sent from my GT-N7000 using Tapatalk 2

Somehow, i don't think so....
But then again I might be paranoid lol

But someone asking on a sunday where to buy the chip and if it is maybe a sandisk.. and 2 days later not only he ordered and received the chip, but also replaced and reflashed it... I have a hard time believing it in such a short timespan.

However if it's done, that would be good news for the community. If there is enough space then it could be an option to get the chip desoldered and then resolder a socket for the replacement chip.

AndroidSlave · Sep 14, 2012

I had the mobile Odin hang with lrq... was fine with PC Odin

Sent from my GT-N7000 using Tapatalk 2

fleurdelisxliv · Sep 14, 2012

So I've read a few things on this. Is it even safe to do a factory reset?

MARBLE WHITE GALAXY SIII

PaulF8080 · Sep 14, 2012

Entropy512 said:
The internal controller within the eMMC (which translates MMC commands into whatever the underlying memory technology understands) appears, based on all of these patches, to have SRAM. It seems like it shadows its firmware image to SRAM on powerup. You won't see anything about this in the JEDEC spec because this is all internal manufacturer-specific implementation stuff.

That code will do nothing for the eMMC bug. Here's a summary of what we know from Ken for the VYL00M/MAG4FA:
Fwrev 0x19 has two bugs - one that can cause the eMMC's internal data structures to get severely corrupted when a secure erase is issued (Superbrick), and a second one where the wear leveller will occasionally insert 32kb of zeros that shouldn't be there.
Fwrev 0x25 has one known bug - the 32kb of zeroes one (it's immune to the secure erase issue). The patch that is changing the firmware in SRAM is issuing a fix for this bug. It won't fix the secure erase bug and in fact cannot be applied to fwrev 0x19.

The problem is, we know for a FACT that it is the firmware that is borked

Are you insisting there is a microprocessor on the chip that executes FW even though the implementation is unknown to you? I think we have a language problem. I view a controller as logic that controls some external signals. An embedded controller does the same except the signals are internal. You can buy microcontrollers (not embedded controllers) that control slow signals, and they are just microprocessors with a lot of I/O pins. Not all controllers are CPUs. There is Exynos/ARM FW stored on the eMMC. Surely you are not saying there is a CPU on the eMMC chip that executes ARM FW code. Here is where the language comes in. Sometimes there is code implemented in the silicon, but that is called microcode, not FW, in my world. Even microcode is too slow for high-speed signals.

I am a retired engineer who did debug for a very long time and I am sorry, I can't seem to stop.

Please note: I always use "guess" instead of "fact"(an old debugger habit). I am glad my i717 is safe, thanks.

baz77 · Sep 14, 2012

friedje said:
Somehow, i don't think so....
But then again I might be paranoid lol
But someone asking on a sunday where to buy the chip and if it is maybe a sandisk.. and 2 days later not only he ordered and received the chip, but also replaced and reflashed it... I have a hard time believing it in such a short timespan.

However if it's done, that would be good news for the community. If there is enough space then it could be an option to get the chip desoldered and then resolder a socket for the replacement chip.

I am pretty gullible I guess

Sent from my GT-N7000 using Tapatalk 2

PaulF8080 · Sep 14, 2012

friedje said:
Somehow, i don't think so....
But then again I might be paranoid lol
But someone asking on a sunday where to buy the chip and if it is maybe a sandisk.. and 2 days later not only he ordered and received the chip, but also replaced and reflashed it... I have a hard time believing it in such a short timespan.

However if it's done, that would be good news for the community. If there is enough space then it could be an option to get the chip desoldered and then resolder a socket for the replacement chip.

It's the SanDisk guy? That seals it. Good catch! I pointed out to him that there are no sector marks on a RAM disk. Being paranoid explains a reply to one of my posts. You have no way of knowing who people are here, I guess.

friedje · Sep 14, 2012

PaulF8080 said:
Are you insisting there is a microprocessor on the chip that executes FW even though the implementation is unknown to you? I think we have a language problem. I view a controller as logic that controls some external signals. An embedded controller does the same except the signals are internal. You can buy microcontrollers (not embedded controllers) that control slow signals, and they are just microprocessors with a lot of I/O pins. Not all controllers are CPUs. There is Exynos/ARM FW stored on the eMMC. Surely you are not saying there is a CPU on the eMMC chip that executes ARM FW code. Here is where the language comes in. Sometimes there is code implemented in the silicon, but that is called microcode, not FW, in my world. Even microcode is too slow for high-speed signals.

I am a retired engineer who did debug for a very long time and I am sorry, I can't seem to stop. Please note: I always use "guess" instead of "fact"(an old debugger habit). I am glad my i717 is safe, thanks.

Although this is off topic, Firmware = Microcode.

So I don't see the point in spending 23453455445643 posts on insisting that we DON'T call it firmware but microcode. By definition, firmware IS microcode that bridges between actual software calls and the underlying hardware.

Now it doesn't matter if this is inside a CPU, a GPU, an Embedded controller or even an Electronic Toiletflusher.

jetmarc · Sep 14, 2012

PaulF8080 said:
Are you insisting there is a microprocessor on the chip that executes FW even though the implementation is unknown to you?

Although you didn't direct this to me, I'll reply.

The implementation is indeed unknown to us, so we are all guessing. You said your guess is that it's implemented as random logic, just like you did for a QPI interface implementation in the past. I don't agree with this guess, but likewise you don't agree with mine. There wouldn't be any problem if it weren't for the several pages of flame posts that follow. Those make it difficult for others to find real information here.

Therefore I'll bite and give reasons why I don't agree with your guess. Please defend your guess on a technical level, not a personal one, so that the thread stays on topic. I certainly don't want to attack you (or anyone else). If it may appear so, then it is not intentional. English is not my native language and this can cause all kinds of misunderstandings.

1. All (useful!) standards are made to solve real-world problems, with real-world tools. Often a reference implementation is made (public or not) and the standard is drafted with data observed with the reference implementation. QPI is a high-speed bus with low latency. Response time is defined in clock cycles. There is no way to implement QPI in software anytime soon (except when you don't want to interface to real hardware on the other side, like a simulator for example). Therefore QPI is clearly meant to be implemented in hardware, just like you did.

The EMMC spec is different. It does mandate timings, but only those of the IO interface itself are specified in clock cycles. All other timings are given in ms and some even in seconds. Those who wrote the spec didn't have a big and fast Verilog FSM in mind. The mandated features are easy to implement in software and still meet the timing spec. All you need is a peripheral that implements the IO interface. This is very much like the situation on USB, where an IO peripheral implements the "data pump", and all the rest is usually done in software.

Along the same lines, there is another hint apart from the timing: feature diversity. Since QPI is clearly meant to be implemented in hardware, it won't request extra features that can't be implemented well in hardware. EMMC on the other hand, does require lots of stupid extra stuff that vastly complicates a pure logic implementation, for almost no benefit.

For example, the EMMC spec includes a security feature, that is probably not even used in many devices with EMMC. Yet this feature makes it necessary to implement an HMAC with SHA256. Adding those in firmware is easy, it takes just a few extra kb of code. However, a direct hardware implementation suffers a lot from this (mostly unnecessary) burden. It will take lots of extra silicon area (and IP licences) for a feature that is rarely used.

For these and similar reasons I believe that EMMC is meant to be implemented in software, not hardware. That doesn't make it impossible to do it in hardware alone, but it certainly won't be cost effective. The utilization of the various (mandatory) circuits will be too low.

2. Samsung clearly says that their EMMC devices use "firmware", even giving their firmware names and version numbers. For example "VYL00M" 0x25.

3. The official Android kernel patch mentioned earlier contains comments which reveal information. It says "We can patch the firmware in such chips", "the wear leveling code can insert 32 Kbytes of zeros", "Patch the firmware in certain Samsung emmc chips", "fix bug in wear leveling firmware", "patch the firmware in the internal sram", etc.

Clearly there is firmware (code) running in the device.

4. We know two 32-bit words of a valid firmware image, which are contained in the same Android kernel patch. One of them decodes to two valid Thumb instructions. Coincidence?

5. Also from the patch, we learn something about the internal SRAM size. The patch accesses 0x4dd9c and 0x379a4. Therefore the SRAM is certainly at least 0x163f8 bytes big. Applying the typical rules about how memory is implemented, it is probably at least 0x80000 bytes big. Whoever put that much SRAM into the device was counting on quite a complex firmware.

6. Considering that there is firmware and it runs from sram, the only way to maintain your guess of a pure hardware implementation, is to use programmable hardware. CPLDs are not complex enough to require such big firmware images, leaving only FPGAs. It's certainly possible to patch FPGA bitstreams, and indeed they "execute from" (or better: are stored in) SRAM cells. However, all FPGA architectures that I know use configuration frames instead of 32-bit data words and 32-bit aligned memory addresses. Those configuration frames are typically longer, and the population of databits within the frame is very sparse. To me, the value 0xD20228FF (from the android kernel patch) doesn't look like a typical piece of FPGA configuration. I would expect something like 0x00110044, or 0x40840001.

7. FPGAs are used to get to the market quickly. But they are not cost effective in a mass market, because they contain lots of flexibility (=extra area and layers) that is not needed once the functional requirements are frozen. EMMCs are clearly a mass market product. Including an FPGA with a bitstream size of 512KB is not the way to earn money. 512KB bitstream size is about equivalent to a XC3S1400A or a XC4VLX15.

Of course we don't have much official information, but what we have seems to support an embedded CPU core and at least 512KB of SRAM available to it. For the reasons above, it is more likely than anything else.
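The decoding claimed in point 4 can be checked with a few lines of Python. This is only a sketch: it handles just the two Thumb encodings needed here (CMP Rd, #imm8 and the conditional branch), it assumes the 32-bit word is stored little-endian so the low halfword executes first, and the function names are made up for illustration.

```python
# Sketch: decode the two 16-bit Thumb instructions packed into 0xD20228FF.
# Any encoding other than CMP Rd,#imm8 or B<cond> is reported as raw data.

COND = ["eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
        "hi", "ls", "ge", "lt", "gt", "le"]


def decode_thumb16(hw, addr):
    """Decode one Thumb halfword located at byte address addr."""
    if hw >> 11 == 0b00101:                      # CMP Rd, #imm8
        rd = (hw >> 8) & 0x7
        return "cmp r%d, #0x%x" % (rd, hw & 0xFF)
    cond = (hw >> 8) & 0xF
    if hw >> 12 == 0b1101 and cond < 14:         # B<cond> label
        off = hw & 0xFF
        if off & 0x80:                           # sign-extend the 8-bit offset
            off -= 0x100
        target = addr + 4 + off * 2              # PC reads as instruction address + 4
        return "b%s 0x%x" % (COND[cond], target)
    return ".hword 0x%04x" % hw


def decode_word(word, addr=0):
    """Split a little-endian 32-bit word into two halfwords and decode both."""
    low, high = word & 0xFFFF, (word >> 16) & 0xFFFF
    return [decode_thumb16(low, addr), decode_thumb16(high, addr + 2)]


if __name__ == "__main__":
    for line in decode_word(0xD20228FF):
        print(line)
```

With the word placed at address 0, the branch sits at address 2 and targets address 0xa, which skips exactly three 16-bit instructions. That matches the three elided lines in the disassembly quoted at the top of the thread.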

Entropy512 · Sep 14, 2012

Also note that eMMC stands for "embedded MMC" - For all practical purposes, it is an MMC/SD card but in a BGA package with some additional features to take advantage of the fact that it is a fixed installation with extra pins available.

We also know that almost all SD/MMC cards on the market use embedded controllers that execute some sort of firmware (they are clearly NOT an FPGA and, as the above poster stated, almost surely not a CPLD because you couldn't fit stuff like this in a CPLD), often on a separate die from the NAND flash memory - see http://www.bunniestudios.com/blog/?page_id=1022

We also know for a fact that these chips can have their firmware updated, however the exact method of updating the firmware is unknown AND the update method is known to be unreliable/dangerous when performed in-system.

letters_to_cleo · Sep 14, 2012

Doesn't the eMMC have some form of embedded memory controller that is optimized to take advantage of specific NAND flash performance features, including program and read caching? I read somewhere that it's even possible to boot directly from eMMC... If so, then I would surmise that this embedded memory controller (block-level interface, error management and wear leveler) is the one controlling and handling the chip itself?

PaulF8080 · Sep 14, 2012

I guess I can give a technical response without revealing secrets. I was called an RTLer. RTL is register-transfer level logic described in a hardware description language (HDL). I used Verilog HDL at the end of my career.

Here is some pseudo code that can easily be converted to Verilog:

Code:

Case of State
    Idle:
        Case of eMMC_cmd
            Wipe:
                State = InitWipe
            more commands ...
        end case
    more states ...
end case

The synthesizer would generate tens of transistors. That anyone would use hundreds of thousands of transistors for a CPU core and associated FW to do the same thing (interpret eMMC commands) seems very unlikely to me, but not impossible.

As a side note the chip I worked on also added an on chip memory controller and they just grabbed the RTL from the chip set people.

As far as the 2 hex digit chip revision, Intel would name the revisions, too. eg. 0x05 could be called "B1 step"

The only reason I brought up the FW issue is that an embedded programmer was discussing his guesses. It was very interesting to me until he got side tracked to "what if the eMMC FW got corrupted".

jetmarc · Sep 15, 2012

Paul,

The official (!) firmware patch addresses suggest an SRAM size of at least 512KB, which translates into 25M transistors (6T). This is a given, unless you insist that Samsung will not align memory to natural boundaries.
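The arithmetic behind these numbers can be sketched in Python. The two addresses are the ones quoted from the kernel patch earlier in the thread; everything else is an assumption made for illustration: that the SRAM starts at address 0 in the patch's address space, that each patch writes one 32-bit word, that the SRAM size is a power of two, and that each bit is a classic 6-transistor cell with no redundancy or peripheral circuitry counted.

```python
# Sketch: patch addresses -> minimum span -> naturally aligned size -> 6T count.

PATCH_ADDRS = (0x4DD9C, 0x379A4)   # addresses quoted from the kernel patch
WORD_BYTES = 4                     # assumption: each patch writes one 32-bit word


def min_span(addrs):
    """Distance between the lowest and highest patched address."""
    return max(addrs) - min(addrs)


def natural_size(addrs):
    """Smallest power-of-two size (from address 0) that covers every patched word."""
    last_byte = max(addrs) + WORD_BYTES - 1
    return 1 << last_byte.bit_length()


def sram_transistors(size_bytes, per_cell=6):
    """Transistors in the bit cells alone, at per_cell transistors per bit."""
    return size_bytes * 8 * per_cell


if __name__ == "__main__":
    size = natural_size(PATCH_ADDRS)
    print(hex(min_span(PATCH_ADDRS)), hex(size), sram_transistors(size))
```

Under those assumptions the span is 0x163f8 bytes, the aligned size is 0x80000 bytes (512KB), and the cell count comes to 25,165,824 transistors, in line with the 25M figure.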

Adding an extra half million (?) for an ARM7 isn't that much. The IP issues weigh in more than the silicon area. Also, ARM7 isn't a given - it's just my guess. Samsung has a 16-bit CPU core of their own, and of course there are the usual suspects too.

I think you are not deeply familiar with the (disturbing) details of NAND flash, which contradict the harddisk-like guarantees that EMMC must give. EMMC is *not* just a memory controller or IO interface.

A simple NAND chip would be more or less equivalent to what you seem to have in mind. NAND client controllers accept commands like "fetch page, die=0 core=0 block=7 page=23" and then you (or your peripheral) poll a status register. Once the page is fetched, you access it through a memory-mapped 512-byte or 4K window (or your peripheral DMAs it into main memory). This kind of controller is indeed very simple, and will be implemented in logic (Verilog if you want).

On the surface, EMMC looks similar. The spec lists quite a few commands and a few waveform diagrams. But the difference is the *guarantees* it gives. EMMC offers a harddisk emulation, where 512-byte sectors can be written and retrieved without caring about the limitations of flash!

The three most important NAND limitations are:

1. Pages can be written only once per erase-cycle (meaning that "harddisk sectors" can't be changed individually, without affecting their neighbors).

2. Erasing flash blocks wears them out. Using FAT32 on a NAND without "wear leveling" stresses it beyond spec, simply because each block is allowed to fail once it has been erased/written more than X times. Filesystems like YAFFS try to address this issue natively (=they work with pure NAND). EMMC on the other hand wants to support traditional filesystems such as FAT32, which write the same sectors over and over (FAT=file allocation table, those sectors are updated during *every* write). This is why EMMC exists after all! It offers harddisk compatibility on NAND flash memory, without extra software.

3. Appending additional pages into a block can damage previously written pages. Usually there is a spec of how many times pages can be appended before you have to abandon the block.
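To make the first limitation concrete, here is a toy Python model. It is not XSR and not any real eMMC firmware: `ToyNand` and `ToyFtl` are hypothetical names, there is no garbage collection, no wear leveling and no bad-block handling. It only shows why rewriting the same "harddisk sector" needs a translation layer that redirects each write to a fresh page.

```python
# Toy model only: write-once pages plus a naive logical-to-physical map.


class ToyNand:
    """NAND with write-once pages and block-granular erase."""

    def __init__(self, blocks=4, pages_per_block=4):
        self.ppb = pages_per_block
        self.pages = [None] * (blocks * pages_per_block)   # None means erased
        self.erase_count = [0] * blocks

    def program(self, page, data):
        if self.pages[page] is not None:
            raise RuntimeError("page already programmed; erase the block first")
        self.pages[page] = data

    def erase(self, block):
        start = block * self.ppb
        self.pages[start:start + self.ppb] = [None] * self.ppb
        self.erase_count[block] += 1                        # every erase adds wear


class ToyFtl:
    """Maps logical sectors to physical pages so no page is programmed twice."""

    def __init__(self, nand):
        self.nand = nand
        self.map = {}         # logical sector -> physical page
        self.next_free = 0    # append-only allocator, no garbage collection

    def write(self, sector, data):
        if self.next_free >= len(self.nand.pages):
            raise RuntimeError("out of free pages; a real FTL would reclaim stale ones")
        self.nand.program(self.next_free, data)
        self.map[sector] = self.next_free    # the previous page is now stale
        self.next_free += 1

    def read(self, sector):
        return self.nand.pages[self.map[sector]]


if __name__ == "__main__":
    ftl = ToyFtl(ToyNand())
    ftl.write(0, b"old")
    ftl.write(0, b"new")      # same sector, different physical page
    print(ftl.read(0), ftl.map)
```

Even this toy needs a mapping table that must survive power loss, and reclaiming the stale pages would mean copying live data and erasing blocks, which is the bookkeeping a real translation layer has to get right.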

To put your 10s of transistors into the real world:

Samsung has invented and promoted the most popular flash wear leveling system and flash translation layer. It is called XSR. I would be surprised to find anything different used in a Samsung EMMC.

It took me more than 16 KB of ARM Thumb code to do an incomplete black-box implementation of XSR (incomplete as in: full read support but only limited write support). It was tedious to do in software, but doing it in hardware would have been a nightmare. It's really not hardware friendly. Many of the structures are clearly invented by a software engineer, rather than by an "RTL guy" like you.

You wouldn't be able to do all that in RTL with a straight face. And even if you did, they would just hand you a new spec and ask you to "update". XSR is currently widely deployed in version 1.7. I wouldn't try to keep pace with software people's cancerous "improvements" while being handicapped by wrong implementation decisions.

Of course you and I aren't in this situation, but the Samsung EMMC designers are. The fact that they keep selling product suggests that they made the right decisions (within reason, see superbrick bug).

I also want to take advantage of the attention and repeat my offer: If somebody prepares one of the EMMC chips (damaged or not), soldered to a breakout board with 2.54mm spacing for the relevant pins, I promise to spend time on it. Any takers?
