[Bug report] Music player stops playing music from sd after some time

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
I was referring to this line, doing bit-operations on pointers:
Code:
__u8 *read_buffer = (__u8 *) ((uintptr_t)(handler->read_buffer + PAGESIZE) & ~((uintptr_t)PAGESIZE-1));
Although I'm sure that when you read the corresponding documentation it will become clear.
nothing strange. adds a page size to the address, then zero out all address bits that are offsets within a page and leaves the higher bits (ie page number) intact. in effect selects the start address of the next page.

it'd be more "correct" to do (X + (PS-1)) & ~(PS-1)
this would return X (instead of the current X+PS) if X is already at the start of a page
but for this case, both results are equally valid

---------- Post added at 06:13 AM ---------- Previous post was at 06:09 AM ----------

As a side note, I wonder why CyanogenMod's sdcard: Respect check_caller_access_to_name on readdir() calls or an equivalent fix seems not to be included in AOSP.
maybe they didn't upstream?

---------- Post added at 06:14 AM ---------- Previous post was at 06:13 AM ----------

And since the struct's size seems to matter, could the AOSP commit Use the correct fuse_init_out structure size. be related?
no. that's an important commit but the struct is only used in handle_init.

---------- Post added at 06:47 AM ---------- Previous post was at 06:14 AM ----------

I tried many other variations, with putting in a dummy buffer before the read buffer to prevent the overlap. But the conclusions of that were that even a dummy buffer much larger than the request buffers didn't stop the problem, just made it less likely; only when the dummy buffer was PAGESIZE, the problem disappeared. It turns out that for us it is not needed to really start on page boundaries, but the real trick is that we need the read buffer to start on the next page.
so i'm looking at the disassembly code. its a bit of a mess due to heavy inlining and optimizations. plus not knowing the instructions, addressing modes nor calling conventions also slows me down.

i'm seeing something so far compatible with a compiler bug, but i need to see more code to confirm or disprove. unfortunately i'll have to leave it for later because this is not what i should be doing right now.

the problem i'm having is that your observations contradict this bug, so i want to make sure i understood you correctly. you said "I tried many other variations, with putting in a dummy buffer before the read buffer to prevent the overlap. But the conclusions of that were that even a dummy buffer much larger than the request buffers didn't stop the problem". how exactly did you do this? did you do something like this:

Code:
    union {
        __u8 request_buffer[MAX_REQUEST_SIZE];
        struct {
            __u8 dummy[...];
            __u8 read_buffer[MAX_READ];
        };
    };
or like this:

Code:
    union {
        __u8 request_buffer[MAX_REQUEST_SIZE];
        __u8 dummy[...];
        __u8 read_buffer[MAX_READ];
    };
of course the second option would have no effect whatsoever. sorry to ask this question, sometimes people just make simple mistakes, i need to rule this out.
 

GidiK

Senior Member
Jan 8, 2013
113
103
0
Best
the problem i'm having is that your observations contradict this bug, so i want to make sure i understood you correctly. you said "I tried many other variations, with putting in a dummy buffer before the read buffer to prevent the overlap. But the conclusions of that were that even a dummy buffer much larger than the request buffers didn't stop the problem". how exactly did you do this? did you do something like this:

Code:
    union {
        __u8 request_buffer[MAX_REQUEST_SIZE];
        struct {
            __u8 dummy[...];
            __u8 read_buffer[MAX_READ];
        };
    };
or like this:

Code:
    union {
        __u8 request_buffer[MAX_REQUEST_SIZE];
        __u8 dummy[...];
        __u8 read_buffer[MAX_READ];
    };
of course the second option would have no effect whatsoever. sorry to ask this question, sometimes people just make simple mistakes, i need to rule this out.
I'm quite prone to simple mistakes, but in this case I used the first option, with struct inside the union.

As the size for the dummy I took the size of the two request buffers added together (this failed somewhere in the second run if I remember correctly), 8 times these request buffer sizes (this failed at the end of the sixth run, from my memory), the buffer sizes + PAGESIZE (this didn't fail). One run is reading two nandroids at full speed. The read speed is around 16MB/s. I also did a variation where I used just the AOSP commit, but added "13" to the calculated read buffer address to force a misalignment (but still in the next page): this also didn't fail.

Without any modifications the test usually fails somewhere in the first run, after having read some GB already. I observe nothing in common happening about when it fails, though it is on a live phone; in fact, with the same build it tends to fail around the same moment if you reboot and try again.

The missing "-1" in the AOSP commit actually quite helps us...
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
CM11-M10 sdcard.c (hammerhead)

source:
https://github.com/CyanogenMod/andr...d078e477baba9583544d9b26b00f9/sdcard/sdcard.c

gcc command:
prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.7/bin/arm-linux-androideabi-gcc -I system/core/sdcard -I /home/XXX/android/system/out/target/product/hammerhead/obj/STATIC_LIBRARIES/libsdcard_intermediates -I libnativehelper/include/nativehelper -isystem system/core/include -isystem hardware/libhardware/include -isystem hardware/libhardware_legacy/include -isystem hardware/ril/include -isystem libnativehelper/include -isystem frameworks/native/include -isystem frameworks/native/opengl/include -isystem frameworks/av/include -isystem frameworks/base/include -isystem external/skia/include -isystem /home/XXX/android/system/out/target/product/hammerhead/obj/include -isystem device/lge/hammerhead/kernel-headers -isystem hardware/qcom/msm8x74/kernel-headers -isystem bionic/libc/arch-arm/include -isystem bionic/libc/include -isystem bionic/libstdc++/include -isystem bionic/libc/kernel/common -isystem bionic/libc/kernel/arch-arm -isystem bionic/libm/include -isystem bionic/libm/include/arm -isystem bionic/libthread_db/include -c -fno-exceptions -Wno-multichar -msoft-float -fpic -fPIE -ffunction-sections -fdata-sections -funwind-tables -fstack-protector -Wa,--noexecstack -Werror=format-security -D_FORTIFY_SOURCE=2 -fno-short-enums -mcpu=cortex-a15 -mfloat-abi=softfp -mfpu=neon -include build/core/combo/include/arch/linux-arm/AndroidConfig.h -I build/core/combo/include/arch/linux-arm/ -Wno-unused-but-set-variable -fno-builtin-sin -fno-strict-volatile-bitfields -Wno-psabi -mthumb-interwork -DANDROID -fmessage-length=0 -W -Wall -Wno-unused -Winit-self -Wpointer-arith -DNO_SECURE_DISCARD -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -DNDEBUG -g -Wstrict-aliasing=2 -fgcse-after-reload -frerun-cse-after-loop -frename-registers -DNDEBUG -UDEBUG -mthumb -Os -fomit-frame-pointer -fno-strict-aliasing -Wall -Wno-unused-parameter -MD -MF /home/XXX/android/system/out/target/product/hammerhead/obj/STATIC_LIBRARIES/libsdcard_intermediates/sdcard.d -o /home/XXX/android/system/out/target/product/hammerhead/obj/STATIC_LIBRARIES/libsdcard_intermediates/sdcard.o system/core/sdcard/sdcard.c​

listing:
https://gist.github.com/Lanchon/9bbd717f96346c7e15b0

objdump:
https://gist.github.com/Lanchon/8bb37d20e493c75e10b4
 
Last edited:

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0

patrickph

Senior Member
Jun 4, 2014
122
48
0
bug (sort of) found!

fuse_reply receives unique in r2,r3 and stores it.

inlined handle_read, after calling pread64, restores unique into r2,r3 from a double-precision floating-point register before invoking fuse_reply.

because of two levels of inlining, unique is actually saved in handle_fuse_requests and the intermediate redundant save of handle_read is optimized away.

the interrupts are clobbering the VFP registers, not the stack!
Great news, dose this mean the bug is real fixed? No more work around?

Sent from my GT-I9100 using XDA Free mobile app
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
Great news, dose this mean the bug is real fixed? No more work around?

Sent from my GT-I9100 using XDA Free mobile app
well its a start!

i did the analysis on the code generated for my phone, hammerhead. i need to repeat it for the i9100 when i have time.

we can now build sdcard or the whole system so that it wont be affected by the corruption, but still 3rd party apps will. so we need to find what interrupt is clobbering the FP regs and fix it.

meanwhile, i'll share an interesting tidbit:
Note for GCC (and possibly other compiler) users: Some GCC libraries optimise memory copy and memory set (and possibly other) functions by making use of the wide floating point registers. Therefore, by default, any task that uses functions such as memcpy(), memcmp() or memset() will inadvertently corrupt the floating point registers.
 
  • Like
Reactions: Hank87 and sharp87

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
CM11-M10 sdcard.c (i9100)

source:
https://github.com/CyanogenMod/andr...d078e477baba9583544d9b26b00f9/sdcard/sdcard.c

gcc command:
prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.7/bin/arm-linux-androideabi-gcc -I device/samsung/galaxys2-common/include -I system/core/sdcard -I /home/XXX/android/system/out/target/product/i9100/obj/STATIC_LIBRARIES/libsdcard_intermediates -I libnativehelper/include/nativehelper -isystem system/core/include -isystem hardware/libhardware/include -isystem hardware/libhardware_legacy/include -isystem hardware/ril/include -isystem libnativehelper/include -isystem frameworks/native/include -isystem frameworks/native/opengl/include -isystem frameworks/av/include -isystem frameworks/base/include -isystem external/skia/include -isystem /home/XXX/android/system/out/target/product/i9100/obj/include -isystem bionic/libc/arch-arm/include -isystem bionic/libc/include -isystem bionic/libstdc++/include -isystem bionic/libc/kernel/common -isystem bionic/libc/kernel/arch-arm -isystem bionic/libm/include -isystem bionic/libm/include/arm -isystem bionic/libthread_db/include -c -fno-exceptions -Wno-multichar -msoft-float -fpic -fPIE -ffunction-sections -fdata-sections -funwind-tables -fstack-protector -Wa,--noexecstack -Werror=format-security -D_FORTIFY_SOURCE=2 -fno-short-enums -mcpu=cortex-a9 -mfloat-abi=softfp -mfpu=neon -include build/core/combo/include/arch/linux-arm/AndroidConfig.h -I build/core/combo/include/arch/linux-arm/ -Wno-unused-but-set-variable -fno-builtin-sin -fno-strict-volatile-bitfields -Wno-psabi -mthumb-interwork -DANDROID -fmessage-length=0 -W -Wall -Wno-unused -Winit-self -Wpointer-arith -DEXYNOS4_ENHANCEMENTS -DEXYNOS4210_ENHANCEMENTS -DDISABLE_HW_ID_MATCH_CHECK -DFORCE_SCREENSHOT_CPU_PATH -DWORKAROUND_BUG_10194508 -DHAVE_ISO -DSAMSUNG_CAMERA_HARDWARE -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -DNDEBUG -g -Wstrict-aliasing=2 -fgcse-after-reload -frerun-cse-after-loop -frename-registers -DNDEBUG -UDEBUG -mthumb -Os -fomit-frame-pointer -fno-strict-aliasing -Wall -Wno-unused-parameter -MD -MF /home/XXX/android/system/out/target/product/i9100/obj/STATIC_LIBRARIES/libsdcard_intermediates/sdcard.d -o /home/XXX/android/system/out/target/product/i9100/obj/STATIC_LIBRARIES/libsdcard_intermediates/sdcard.o system/core/sdcard/sdcard.c​

listing:
https://gist.github.com/Lanchon/1996581ebd30c3a2c877

objdump:
https://gist.github.com/Lanchon/c8599c7a4535e8420803

---------- Post added at 02:21 AM ---------- Previous post was at 02:08 AM ----------

bug find confirmed!

the build for i9100 saves and restores unique in the same way as the previously commented hammerhead build.

bug confirmed to be clobbering the FP registers.
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
CM11-M10 Music Bug Fix

this is just sdcard.c compiled with "-mfloat-abi=soft" instead of "-mfloat-abi=softfp". disassembly confirms that regular instead of FP registers are used to temporarily save unique.

sdcard.c is usually left alone in roms. so although this fix is for CM11-M10, it probably works fine on all kitkat roms.
@GidiK, could you please test? just flash from recovery.
thank you!
 

Attachments

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
@Entropy512, now that we know that something (interrupts? context switches?) is clobbering the FP registers, you have any idea of where to go?

my problem is that i dont know the first thing about arm arch, linux kernel, or android.

compiling all of android with "-mfloat-abi=soft" is out of the question. plus it wouldnt help the asm parts or 3rd party apps.

i thought of a tool to instrument arm ISRs: prolog to save FP regs on entry, epilog to verify that FP regs were undisturbed or properly restored by the ISR. i googled it but instead i found that 1) ISRs dont use FP, and 2) the kernel doesnt use FP. so there's no such tool.

code can inadvertently use FP regs by using lib functions such as memcpy(), memcmp() or memset(), but those should not be used from ISRs or the kernel anyway.

maybe some driver or part of the kernel is being erroneously compiled with "-mfloat-abi=softfp"? can you think of any reason why FP regs would get clobbered in 4210 but not 4412? different drivers? different libs? different compile options?

it would be trivial to make a clobber detector in userland. the detector would trigger immediately, as soon as it's scheduled again. is there a way we can find out which was the last ISR run? (of course there are races, but this is informative.) is there a way to log ISR invocation times and then match them to clobbering times?

are there closed-source ISRs in our kernel? what about drivers?

is there a way to instrument ISRs or otherwise install ISR wrappers in linux? we could detect clobbering, or save/restore FP regs in the wrapper even for closed-source ISRs.

how about instrumenting driver entry points with similar motives?

there must be some kind of tool that can help.

thanks!
 

jnetode

Senior Member
Mar 19, 2011
233
106
0
I did!
Should it really work so easilly?
I flashed on recovery and restarted. Than I made copies from internal to external SDCard, which always gets stuck.
I copied about 4GB and no hangs.
I mounted the SDCard on PC and also copied about 1GB, it would also get stuck.
It worked flawlessly too.
 

Dasmikko

Senior Member
Jun 13, 2010
235
32
0
Going to try it too later!

UPDATE #1:
Flashed it, everything boots as usual.
Will listen to music for the next 3 hours and see if it hangs.
 
Last edited:
  • Like
Reactions: Lanchon

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
I did!
Should it really work so easilly?
I flashed on recovery and restarted. Than I made copies from internal to external SDCard, which always gets stuck.
I copied about 4GB and no hangs.
I mounted the SDCard on PC and also copied about 1GB, it would also get stuck.
It worked flawlessly too.
there is a widespread problem in the platform. this only shields from it the fuse process that usually hangs, so only that particular process will be ok.

in fact not even that: none of the libraries that the process uses are shielded, so the process is not really safe. but in practice, only userland code (kernel should not use FP regs, mostly) and only floating point calculations, or some uses of long ints, or functions with spilling (too many) registers should be affected. the fuse process probably uses none of these things in the libraries, and will probably be in the clear even though the libs are not shielded.

this is stopgap. we need to fix the platform.

---------- Post added at 12:17 PM ---------- Previous post was at 12:16 PM ----------

How can i apply this patch?
Just copy sdcard to \system\bin or i need something else?
yes, you can just copy sdcard. flashing the zip does the same thing.
 

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
@Entropy512, now that we know that something (interrupts? context switches?) is clobbering the FP registers, you have any idea of where to go?
Well, other than it specifically being the FP registers, we've known for a while it was a register-clobbering issue. Someone else (CGX I think?) already tried a different approach to this, by making the variables in question array members, which causes the compiler to put them into RAM instead of sticking them in registers. It did successfully work around this issue.

my problem is that i dont know the first thing about arm arch, linux kernel, or android.

compiling all of android with "-mfloat-abi=soft" is out of the question. plus it wouldnt help the asm parts or 3rd party apps.

i thought of a tool to instrument arm ISRs: prolog to save FP regs on entry, epilog to verify that FP regs were undisturbed or properly restored by the ISR. i googled it but instead i found that 1) ISRs dont use FP, and 2) the kernel doesnt use FP. so there's no such tool.

code can inadvertently use FP regs by using lib functions such as memcpy(), memcmp() or memset(), but those should not be used from ISRs or the kernel anyway.

maybe some driver or part of the kernel is being erroneously compiled with "-mfloat-abi=softfp"? can you think of any reason why FP regs would get clobbered in 4210 but not 4412? different drivers? different libs? different compile options?

it would be trivial to make a clobber detector in userland. the detector would trigger immediately, as soon as it's scheduled again. is there a way we can find out which was the last ISR run? (of course there are races, but this is informative.) is there a way to log ISR invocation times and then match them to clobbering times?

are there closed-source ISRs in our kernel? what about drivers?

is there a way to instrument ISRs or otherwise install ISR wrappers in linux? we could detect clobbering, or save/restore FP regs in the wrapper even for closed-source ISRs.

how about instrumenting driver entry points with similar motives?

there must be some kind of tool that can help.

thanks!
I THINK, but I'm not sure, that CGX already had some sort of evidence pointing to modem ISRs as being the ones doing the clobbering.

Which might explain why this only happens on 4210s - one of the items that had to be merged into the 4412 kernel was a few additional modem drivers, so there's a possibility one of these was mismerged.

And I'm positive this clobbering issue is occurring elsewhere - it's just FAR more obvious in this case when it happens because the failure happens to be unrecoverable without manual intervention (such as killing the sdcard daemon).

That's why I've been hoping someone might take a crack at bringing the old smdk4210 kernel tree up to KitKat standards - if that results in a reliable system, it points to a mismerge.