[Bug report] Music player stops playing music from sd after some time

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
Afte a lot of try & error I finally managed to read the s0 (lower? half of d0) register in assembler, and printed a message when they do not match (and forced reloading of the registers (even when the message is not printed)). I also added printks to the hotplug and cpu_pm notifiers (also merged the two commits I mentioned). Then I ran the FPBug detector and my message got printed far more often than I expected (it also got printed before a few times).

Here are all relevant files: the patches, the kmsg and also a translation from assembler to ("pseudo")-C of the vfp_support_entry procedure, that I manually made (read: may contain errors).

As soon as the screen is turned off, it seems that the FPU register values of CPU0 (!) always get clobbered between the invocation of the vfp_support_entries. Noob question: Do the cores have separate FPU registers?
to access the VFP reg set look into vfphw.S, there are vfp_get/put_float/double functions. you can get the asm from there.

the code certainly looks like each core has an fpu! and i always suspected smp implied that
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
to access the VFP reg set look into vfphw.S, there are vfp_get/put_float/double functions. you can get the asm from there.
Yes I initially got the code from there, but got stuck in a bootloop. The solution was then to use p11 instead of p10 (this specifies the coprocessor to use and is also used in VFPFSTMIA macro, that is used for reloading the state).

Meanwhile I have two more insights:
1. I compared the i9300_defconfig with the i9100 one and noticed that CONFIG_EXYNOS_PM_HOTPLUG is disabled there. The i9100 also does not exhibit the bug in this configuration.

2. I found a vfp_enable implementation and calls in mach-exynos/cpuidle-exynos4.c. Commenting them out did not change anything though.
Having become suspicious of the file, I added prinks at the beginning of the exynos4_enter_core0_aftr and the exynos4_enter_core0_lpa functions that called vfp_enable. Indeed, when I get my flood of "HW state (=%x) for %p on CPU0 not in sync" messages, I also get a flood of messages from "exynos4_enter_core0_aftr". Since this function is only called from exynos4_enter_lowpower, I disabled the CONFIG_EXYNOS4_LOWPWR_IDLE configuration option. Result: The FPU corruption seems to be gone (and my phone is still able to wake up this time :) )
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
to access the VFP reg set look into vfphw.S, there are vfp_get/put_float/double functions. you can get the asm from there.

the code certainly looks like each core has an fpu! and i always suspected smp implied that
zei, this weekend ill try to get ahold of an S3 for a few minutes and verify that it's really not affected. then ill post patches that can be merged in soc-shared trees. i got 200+ kernel downloads yesterday and no probs reported so i think its ready. i wanted to merge the two commits u found though, they sure look relevant, but i dont want to push anything untested. did u find any problems?

anyway we should look for the cause but i cant do much without a device :( in the meantime i think its better to merge asap and then reenable the restore state optimization when the cause is found.
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
zei, this weekend ill try to get ahold of an S3 for a few minutes and verify that it's really not affected. then ill post patches that can be merged in soc-shared trees. i got 200+ kernel downloads yesterday and no probs reported so i think its ready. i wanted to merge the two commits u found though, they sure look relevant, but i dont want to push anything untested. did u find any problems?.
Maybe. Android sometimes nearly hangs, it gets so slugish that it causes a soft reboot but even the boot animation is slugish and I have to restart. It seems to affect just userland though and I can get a kmsg (looks normal) or logcat (constant stream of messages, could be normal though) via adb. I'm not sure though if this is cause by these commits, or if I did something wrong with my debugging messages, but for now it would be better not to merge these two commits.
 
  • Like
Reactions: Lanchon

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
Maybe. Android sometimes nearly hangs, it gets so slugish that it causes a soft reboot but even the boot animation is slugish and I have to restart. It seems to affect just userland though and I can get a kmsg (looks normal) or logcat (constant stream of messages, could be normal though) via adb. I'm not sure though if this is cause by these commits, or if I did something wrong with my debugging messages, but for now it would be better not to merge these two commits.
thank you. ill follow your advice. plus, i didnt have time to look into those commits yet but at first glance it looks like more are needed anyway. its better if leave kernel updating to people that can test anyway.

speaking of which, i got an invite a month back and bought a oneplus one as a present for a friend who lives in barcelona. and soon he'll be sending me his old S2 with a friend that is coming to visit (i live in argentina), so ill be able to test :)
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
@zeitferne @Lanchon i've merged those two commits into my kernel. I am running it since yesterday but I did not realize those performance issues yet.
The performance cost seems to be really marginal. If you want to measure it, an ideal benchmark would look something like that:
1. In one thread, do FPU operations. The speed of this thread is the benchmark's result.
2. Setup other threads that do not use the FPU (make sure to compile with -msoft-float) but which do use the CPU, so that as many context switches as possible occur for the first thread.
Without this patch, Linux would notice that the other threads did not use the FPU and therefore will skip writing the data it saved when it switched away from that thread to the FPU registers again, since it is (or should be; that's the bug) already there. With the patch, it writes the data to the registers anyway (but on the other hand, the patch saves the checking if it has to write the data).

[...] I disabled the CONFIG_EXYNOS4_LOWPWR_IDLE configuration option. Result: The FPU corruption seems to be gone (and my phone is still able to wake up this time :) )
Now I have reverted to stock CM kernel and applied only this single configuration change for which there is one single #ifdef in mach-exynos/cpuidle-exynos4.c (and no references in any makefile) and the bug is still gone. If my thinking is right, this means, our bug can be summarized succinctly as:

FPU registers on CPU0 get corrupted in Low Power Idle State. EDIT3: More specifically in AFTR (sic!) mode. Googling for "Exynos 4210 AFTR" reveals interesting results. [/EDIT3]

There are quite a few branches on soc_is_exynos4210() in the related functions, which would be a good explanation why this happens only on these devices.
EDIT2: Since this Low Power Idle Mode can only be entered if only one CPU is online (see this line) also explains why disabling hotplug gets rid of the bug.

EDIT: A file that is properly unifdefed for i9100 is here (ca. 200 lines ~ 10% less; all remaining #if-branches are active).
 
Last edited:

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
@zeitferne @Lanchon i've merged those two commits into my kernel. I am running it since yesterday but I did not realize those performance issues yet.
top performance loss is one fpu context load per timeslice (only the load itself, not the whole trap). its probably under 1%.

you can try antutu before and after, but be sure to run it many times: variation will probably occlude the real impact.
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
Alternative fix, one level deeper

After my investigations in my post above, I just had to put the pieces together:
in our case somehow the tracking fails (across power management state changes, or during CPU migration of tasks, or who knows)

I based my kernel on @Lanchon's (this one), removed the last commit that always reloads the FPU and also merged three other commits related to CPU_PM. Here is the gist with the patches:
https://gist.github.com/9fb17c20e635bbffcb7f

I hope the corruption does not come back after a second restart ;) but so far the FPBug app displayed nothing.

I don't know if these low power states are supposed to corrupt the FPU registers, but if they are, this fix should actually adress the root cause.

EDIT: I have added a flashable ZIP with this kernel (MD5: 55b591535854b66ef0de3245183eb33e).
EDIT2: Updated the ZIP to include driver modules, so that WiFi now works. New MD5: 4d9cd2b9021a4652f7d14ea5a101291d.
EDIT3: Fixed a potential bug. Since the kernel was running fine for (almost?) everyone, users don't need to update, but kernel developers should take a look at the change: https://gist.github.com/Oberon00/9f...xynos-call-pm-notifiers-w-irqs-disabled-patch. New MD5: 4852de8e7c3b77878290a72519b8004d

EDIT4: Added a flashable ZIP for Note1/N7000. MD5: df4fb01a03f10ac88309616cddc787ce. WARNING: As I do not own this phone, I cannot test the kernel or the flash-procedure there. Better make a backup before you flash it!

For a full patchset (includes Lanchon's required changes), see here: http://forum.xda-developers.com/showpost.php?p=57654708&postcount=833.
 

Attachments

Last edited:

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
Pro: "FP-PM-fixed-kernel-CM11-i9100.zip" runs FPBug.app with no corruption.
Con: WiFi is not working anymore on my s2.
It was and is working on Lanchon's latest.
OK, sorry, now I know what this magic.patch was for ^^. The kernel refuses to load the WiFi driver which lies in a separately compiled module, because it has a different kernel version. Let's see if I can quickly fix this…

EDIT: I updated the ZIP, WiFi should work now.
 
Last edited:

snow

Senior Member
Oct 14, 2010
204
305
103
26
www.mozilla.ro
OK, sorry, now I know what this magic.patch was for ^^. The kernel refuses to load the WiFi driver which lies in a separately compiled module, because it has a different kernel version. Let's see if I can quickly fix this…

EDIT: I updated the ZIP, WiFi should work now.
Wi-Fi is working. Don't know if it's placebo, but in my case it seems phone is more responsive (or maybe a little bit). :D

Can you share the commits necesarry to apply to default kernel? Because you made a lot of posts and I don't fully understand what to do (I'm a newbie and maybe a clean post with all commits and/or patch would be easier to "digest" for me). :)

Thanks in advance! ;)
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
Can you share the commits necesarry to apply to default kernel? Because you made a lot of posts and I don't fully understand what to do (I'm a newbie and maybe a clean post with all commits and/or patch would be easier to "digest" for me). :)

Thanks in advance! ;)
Here you go! https://gist.github.com/88964a724d19f2b1e808 (apply with "git am *.patch")

It is based on the current CM11 Kernel.

Up to and including the 8th, these are the patches that Lanchon applied minus his last one that always reloads the FPU registers.
Next come commits related to CPU_PM (including actually enabling them in the kernel configuration) and finally the last one is the essential one: Calling cpu_pm_enter/exit when entering/leaving the low power states.

EDIT: Made a small update, see here: http://forum.xda-developers.com/showpost.php?p=57643086&postcount=829
 
Last edited:

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
How should we positively determine this? Do you have an idea?
A very good hint that this is supposed to be that way is that the current Linux kernel also calls cpu_pm_enter() when entering the AFTR state (didn't see the LPA (Low Power Audio?) though, maybe it is Android-specific); see here. For a real confirmation, one would probably have to look into the Exynos documentation, if there is any (publicly accessible).
 
Last edited:

pilgrim011

Senior Member
May 3, 2008
1,639
456
0
Belgrade
www.tvojsajt.com
A very good hint that this is supposed to be that way is that the current Linux also kernel calls cpu_pm_enter() when entering the AFTR state (didn't see the LPA (Low Power Audio?) though, maybe it is Android-specific); see here. For a real confirmation, one would probably have to look into the Exynos documentation, if there is any (publicly accessible).
Ran out of thanks, so - thank you very much. :)
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
Here you go! https://gist.github.com/88964a724d19f2b1e808 (apply with "git am *.patch")

It is based on the current CM11 Kernel.

Up to and including the 8th, these are the patches that Lanchon applied minus his last one that always reloads the FPU registers.
Next come commits related to CPU_PM (including actually enabling them in the kernel configuration) and finally the last one is the essential one: Calling cpu_pm_enter/exit when entering/leaving the low power states.
hey zei, THIS IS GREAT!!!!!

i wanted to work out why the restore optimization broke when i had the time but it looks like you've done it all!! didn't have the opportunity to look into the code you posted but i don't even feel the need to do it now lol! well unless a problem shows up, let's hope not :) this is why i love free software, only question is... when i finally get my i9100 what should i do with it now?? anybody needing a phone here :( ?? lol

i'll edit my project and point to your discoveries. if you can, make sure people test and when ready give a green light to kernel maintainers to merge. i won't be pushing my fix to CM after these news, my published kernels already work as stopgap so no rush. but when you green light your changes, if you don't wanna do it yourself i can push them to CM, just holler.

regarding your patchset, please consider doing this: merge my workaround commit then revert it in a commit of yours. i worry that if you don't do this, both fixes might work their way into some kernels and we don't want that: it'll work but it's not optimal, all your hard work fixing the state restore optimization will go to waste if my fix disables it.

and while you are at it, consider using git format-patch --stdout .. >xxx.patch to put out a single file and reduce chances of mistakes (remember to exclude magic.patch!)

once this is shown stable, maybe it's time for somebody to figure out why CONFIG_EXYNOS_PM_HOTPLUG is disabled in the 4412 kernels. maybe it's part of samsung's workaround for this kernel bug that they couldn't find and it might make sense to revert it.

---------- Post added at 03:08 PM ---------- Previous post was at 02:58 PM ----------

a couple more things for you to look at if you want. last week when i was looking at the code i remember seeing this:

1) vfp_enable (i think) was duplicated in two files (very yuck!) and one of the versions had a commit over it that fixed some bug, while the other didn't

2) does soc_is_exynos4210() even work? superficially at least it looked like there was a problem, can't remember which, looked like not implemented i think
 

snow

Senior Member
Oct 14, 2010
204
305
103
26
www.mozilla.ro
hey zei, THIS IS GREAT!!!!!

i wanted to work out why the restore optimization broke when i had the time but it looks like you've done it all!! didn't have the opportunity to look into the code you posted but i don't even feel the need to do it now lol! well unless a problem shows up, let's hope not :) this is why i love free software, only question is... when i finally get my i9100 what should i do with it now?? anybody needing a phone here :( ?? lol

i'll edit my project and point to your discoveries. if you can, make sure people test and when ready give a green light to kernel maintainers to merge. i won't be pushing my fix to CM after these news, my published kernels already work as stopgap so no rush. but when you green light your changes, if you don't wanna do it yourself i can push them to CM, just holler.

regarding your patchset, please consider doing this: merge my workaround commit then revert it in a commit of yours. i worry that if you don't do this, both fixes might work their way into some kernels and we don't want that: it'll work but it's not optimal, all your hard work fixing the state restore optimization will go to waste if my fix disables it.

and while you are at it, consider using git format-patch --stdout .. >xxx.patch to put out a single file and reduce chances of mistakes (remember to exclude magic.patch!)

once this is shown stable, maybe it's time for somebody to figure out why CONFIG_EXYNOS_PM_HOTPLUG is disabled in the 4412 kernels. maybe it's part of samsung's workaround for this kernel bug that they couldn't find and it might make sense to revert it.
You could be a great maintainer for i9100 because you have proved you can fix things without having a i9100 device. :D Imagine if you will receive the device what other great things you can do :D

"Fun" for i9100 (and Exynos4 platform) is not near to end, buddy. ;) @GeeckoDev which owns N7000 tries to implement hwc, gralloc and Mali driver (r3p2-01rel0 with dmabuf, not UMP) from Insignal source. HWC for our platform it seems to do nothing, so his work will make Project Butter to be a reality and our devices should be smoother than even before! :D

Look here:
http://forum.xda-developers.com/showpost.php?p=57565566&postcount=810