[Bug report] Music player stops playing music from sd after some time

notabenem

Senior Member
Jul 17, 2009
118
75
0
After reading http://lists.infradead.org/pipermail/linux-arm-kernel/2014-June/264993.html I was wondering, if it isn't indeed the mct.c that messes up the 64bit registers in the FPU. However, since clumsy is still seeing hangs with CONFIG_EXYNOS_MCT disabled, it's probably not this code.

... maybe we have enough info now to try the ARM Linux mailing list a second time?
I would encourage you to do so. And I'd also suggest to try to get in touch with the gentlemen from Samsung that submitted the patches into the samsung kernel. I mean the authors of this:
http://git.kernel.org/cgit/linux/ke.../?id=3a062281129229b50e06547af3110f8eccd2f4e4
 

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
I wish I remembered why I was looking at the MCT a few years ago. All of the patches I'm looking through look familiar. I do recall that there were various things that had me thinking backporting the upstream changes was nearly impossible.

Do note that even when attempting to disable it in the defconfig, it may be getting force-enabled by another option.

You can't, to my knowledge, explicitly set something to "no" in the defconfig - you'll notice that any disabled options are commented out, effectively "not there".

It is interesting that some of the upstream patches seem to be related to stopping the MCT and/or at least disabling the interrupt in certain situations, like https://git.kernel.org/cgit/linux/k...c?id=e248cd5d5f1c28c82781d4122dddd1b7c3d7b46f - it might be that one of those is what we need. Problem is that I think backporting that patch could be easier said than done.

zeitferne's printk() is the smoking gun here IMO...

Edit: https://git.kernel.org/cgit/linux/k...c?id=a7fadac10ffbfd16cc7ccf951eab1ecf85e1abdf

could probably be combined with
https://git.kernel.org/cgit/linux/k...t?id=e700e41d9abfbf9fee01e979a41b185695132c19
and
https://git.kernel.org/cgit/linux/k...c?id=e248cd5d5f1c28c82781d4122dddd1b7c3d7b46f
 
Last edited:

clumsy

Member
Sep 2, 2006
14
5
0
And do you mean the usual sdcard daemon hangs?
I guess so, i coulndnn't check. various applications randomly "don't respond" anymore and get killed

What happens with CONFIG_LOCAL_TIMERS unset?
tried that today, however that's even worse now. within half a day the device did not wake from sleep when power button was pressed and had to be resetted/powercycled.

I'll try again with MCT disabled and try to find out what exactly hangs when the apps freeze.

Do note that even when attempting to disable it in the defconfig, it may be getting force-enabled by another option.
You can't, to my knowledge, explicitly set something to "no" in the defconfig - you'll notice that any disabled options are commented out, effectively "not there".
I agree with that. However it really seems to make a difference. It runs smoother and quicker and does not completely freeze (until now), so I guess undefining MCT does have an impact.... IMHO...
 
Last edited:

notabenem

Senior Member
Jul 17, 2009
118
75
0
(almost off topic): I wish we could simply get the 3.18 kernel from Linaro (for Android) and make it work with SII. I guess all the drivers we have are completely incompatible, aren't they?
 

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
(almost off topic): I wish we could simply get the 3.18 kernel from Linaro (for Android) and make it work with SII. I guess all the drivers we have are completely incompatible, aren't they?
It would be a massive undertaking - you're kind of getting into the reason why most of the original Exynos4 guys moved to other platforms.

One of Insignal or Hardkernel's latest reference board BSPs would probably be the best bet, but no one has the time for those either. (geecko attempted to backport a bunch of insignal/hardkernel kernel stuff to allow their latest HALs to be used but it didn't work out too well...)

As an example for Qualcomm devices, a typical bringup effort is something like:
Grab OEM kernel source
Based on Android version and chip used, pull a list of Qualcomm CAF release tags to check
Run a script that compares that kernel source against the list of tags and finds the one that is closest. (Note, this step and the previous one are usually unnecessary with Sony and Oppo, since they actually TELL developers the answer to this upon request)
Check out that CAF tag, drop OEM source on top of it, make a commit out of it
Split this commit into smaller "digestible" chunks that can be independently committed to a new baseline and also analyzed - https://github.com/Entropy512/kernel_find7_reference/commits/oppo_kernel for example
Take those commits, determine which ones are actually needed (some aren't and just contain fugly hacks, like https://github.com/Entropy512/kernel_find7_reference/commit/82536a234f5a2578fbb7fd46e0f9e0e7b3ee34fb ), and apply them to a newer CAF reference baseline

The problem with Exynos4 is - No production device release can be traced back to any public reference release. The end result is that the deltas against the nearest possible references (usually mainline) are MASSIVE, and splitting them up to apply to a new baseline is extremely difficult and time consuming - more effort than anyone is willing to put into these old devices any more.

I still wish I could remember why it was I was looking at MCT the last time I was looking at that code... I KNOW I've had a problem that seemed to be located here in the past but I don't remember what!
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
(Probably) bad news: The arch/arm/kernel/irq.c (mainline history) in our kernel is at a4841e39f7. I merged all commits up to and including 78359cb86: ARM: CPU hotplug: ensure we migrate all IRQs off a downed CPU. This commit also makes sure that no per CPU IRQs are migrated. I then added the IRQF_PERCPU flag to the flags of mct_tick0_event_irq and mct_tick1_event_irq. Indeed my printk about the tick interrupt being called on the wrong CPU is not triggered then and "IRQ112 no longer affine" messages no longer show up in kmsg. So far so good, the problem is that the rsync test still hangs :(
In the worst case, CONFIG_LOCAL_TIMERS and/or CONFIG_EXYNOS_PM_HOTPLUG are not more directly related to the bug than FAT32 vs exFAT.
 
Last edited:
  • Like
Reactions: m1trand1r

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
(Probably) bad news: The arch/arm/kernel/irq.c (mainline history) in our kernel is at a4841e39f7. I merged all commits up to and including 78359cb86: ARM: CPU hotplug: ensure we migrate all IRQs off a downed CPU. This commit also makes sure that no per CPU IRQs are migrated. I then added the IRQF_PERCPU flag to the flags of mct_tick0_event_irq and mct_tick1_event_irq. Indeed my printk about the tick interrupt being called on the wrong CPU is not triggered then and "IRQ112 no longer affine" messages no longer show up in kmsg. So far so good, the problem is that the rsync test still hangs :(
In the worst case, CONFIG_LOCAL_TIMERS and/or CONFIG_EXYNOS_PM_HOTPLUG are not more directly related to the bug than FAT32 vs exFAT.
Did you try backporting the MCT changes I linked a few posts ago?
 

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
Did you try backporting the MCT changes I linked a few posts ago?
Sort of, but there were many dependencies, so I ended up with this git log (sorry for the bad order):

2011-07-22 82fefb4 arm: remove several unnecessary module.h include instances
2011-07-21 59ebeeb ARM: CPU hotplug: ensure we migrate all IRQs off a downed CPU
2011-07-11 34b6370 ARM: introduce handle_IRQ() not to dump exception stack
2011-07-21 14af1ec ARM: CPU hotplug: pass in proper affinity mask on IRQ migration
2011-11-16 3f8ea71 genirq: Don't allow per cpu interrupts to be suspended [from here up probably not strictly necessary]
2011-07-11 092a5f5 irq: Track the owner of irq descriptor [required for be0fede; link]
2011-10-26 de16503 irq: Fix comment typo ist->is [just for completeness]
2011-10-04 584ad45 genirq: Fix fatfinered fixup really [probably required]
2011-09-30 72243c7 genirq: percpu: allow interrupt type to be set at enable time [also required]
2014-12-11 6be37ad ARM: Exynos MCT: Correct IRQ name. [The #define for the exynos5 interrupt in mainline has a different name than in the smd4412 kernel]
2011-09-23 be0fede genirq: Add support for per-cpu dev_id interrupts [required for b929ac9, 54bbe0e, b929ac9]
2011-11-03 94ba3d8 ARM: EXYNOS4: convert MCT to percpu interrupt API [that's basically the commit you linked; see here]
2011-07-22 b929ac9 ARM: gic, local timers: use the request_percpu_irq() interface [probably required for 94ba3d8]
2014-12-11 54bbe0e ARM: gic: consolidate PPI handling [in preparation of b929ac9]
2011-07-21 d2ff782 ARM: CPU hotplug: fix abuse of irqdesc->node [probably not strictly necessary]
2011-07-21 2a08af5 ARM: GIC: avoid routing interrupts to offline CPUs [probably not strictly necessary]

See here for arch/arm/common/gic.c, here kernel/irq/manage.c

It even booted, but oopsed with a NULL pointer immediately followed by a panic as soon as I turned the screen off. Here's the last_kmsg. Either I have done something wrong while resolving conflicts, or I missed some important commits, or the changes that Samsung made to the GIC handling (whatever that is :D ) are semantically incompatible with the mainline changes. Maybe I will look at it again.

EDIT: Maybe just this commit is additionally needed?
EDIT2: A quick gist with the patches (generated by git format-patch): https://gist.github.com/anonymous/71cbf9acb5095797f79d.
 
Last edited:

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
Sort of, but there were many dependencies, so I ended up with this git log (sorry for the bad order):

2011-07-22 82fefb4 arm: remove several unnecessary module.h include instances
2011-07-21 59ebeeb ARM: CPU hotplug: ensure we migrate all IRQs off a downed CPU
2011-07-11 34b6370 ARM: introduce handle_IRQ() not to dump exception stack
2011-07-21 14af1ec ARM: CPU hotplug: pass in proper affinity mask on IRQ migration
2011-11-16 3f8ea71 genirq: Don't allow per cpu interrupts to be suspended [from here up probably not strictly necessary]
2011-07-11 092a5f5 irq: Track the owner of irq descriptor [required for be0fede; link]
2011-10-26 de16503 irq: Fix comment typo ist->is [just for completeness]
2011-10-04 584ad45 genirq: Fix fatfinered fixup really [probably required]
2011-09-30 72243c7 genirq: percpu: allow interrupt type to be set at enable time [also required]
2014-12-11 6be37ad ARM: Exynos MCT: Correct IRQ name. [The #define for the exynos5 interrupt in mainline has a different name than in the smd4412 kernel]
2011-09-23 be0fede genirq: Add support for per-cpu dev_id interrupts [required for b929ac9, 54bbe0e, b929ac9]
2011-11-03 94ba3d8 ARM: EXYNOS4: convert MCT to percpu interrupt API [that's basically the commit you linked; see here]
2011-07-22 b929ac9 ARM: gic, local timers: use the request_percpu_irq() interface [probably required for 94ba3d8]
2014-12-11 54bbe0e ARM: gic: consolidate PPI handling [in preparation of b929ac9]
2011-07-21 d2ff782 ARM: CPU hotplug: fix abuse of irqdesc->node [probably not strictly necessary]
2011-07-21 2a08af5 ARM: GIC: avoid routing interrupts to offline CPUs [probably not strictly necessary]

See here for arch/arm/common/gic.c, here kernel/irq/manage.c

It even booted, but oopsed with a NULL pointer immediately followed by a panic as soon as I turned the screen off. Here's the last_kmsg. Either I have done something wrong while resolving conflicts, or I missed some important commits, or the changes that Samsung made to the GIC handling (whatever that is :D ) are semantically incompatible with the mainline changes. Maybe I will look at it again.

EDIT: Maybe just this commit is additionally needed?
EDIT2: A quick gist with the patches (generated by git format-patch): https://gist.github.com/anonymous/71cbf9acb5095797f79d.
That "ARM: EXYNOS: Fix for stall in case of cpu hotplug or sleep" is the most interesting of the commits to me with respect to this whole thing. Of most interest - It's inside the local timers ifdef AND it affects 4210s but not 4412s
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
(Probably) bad news: The arch/arm/kernel/irq.c (mainline history) in our kernel is at a4841e39f7. I merged all commits up to and including 78359cb86: ARM: CPU hotplug: ensure we migrate all IRQs off a downed CPU. This commit also makes sure that no per CPU IRQs are migrated. I then added the IRQF_PERCPU flag to the flags of mct_tick0_event_irq and mct_tick1_event_irq. Indeed my printk about the tick interrupt being called on the wrong CPU is not triggered then and "IRQ112 no longer affine" messages no longer show up in kmsg. So far so good, the problem is that the rsync test still hangs :(
In the worst case, CONFIG_LOCAL_TIMERS and/or CONFIG_EXYNOS_PM_HOTPLUG are not more directly related to the bug than FAT32 vs exFAT.
been reading the news. the problem in my mind with interrupts and timers is, given that there is no code in the kernel that touches the FPU, i can't figure out a credible scenario for FP corruption as a result of a bug in that area.

unless the bug affects the trap that does the lazy FPU state restore. but i imagine that this code depends exclusively on arm IP, not exynos peripheral weirdness, so i discount that. (disclaimer: i dont know the linux kernel, the arm arch, and much less exynos.) i'm also assuming that the cores work well when they are not transitioning to/from a power saving state; i think i saw enough evidence for this.

basically i imagine 2 scenarios (feel free to add):
-fpu state crosstalk between userland processes
-plain old clobbering with random (external) data

to differentiate we need to view the corruption in action. now that people are working on this again, i'll mod my corruption detector over the weekend to sleep between checks, and hopefully it will work this time. the detector will clearly show if there's crosstalk and the extent, and also the values if it is just clobbering.

my hunch here is we'll find clobbering.

assuming it's clobbering, these are the possible scenarios i can think of (feel free to add):

#1) the state is not restored correctly
one scenario would be this: bug in silicon: the (FPU?) core is powered up and the kernel starts using it. but the voltages are not stable enough when the kernel restores the FPU state and very few times the state is corrupted. in this case we would likely see mostly 00s or FFs in the registers.
how to test: in the code that loads the fpu state from ram, we immediately test after loading it by comparing against the copy in ram.
for the record, this is very very unlikely to be the cause i think.

#2) the state is not saved correctly
one scenario, bug in silicon: the kernel saves the FPU state to ram. the save is a "big" block of ram, maybe fills a cache line or more, and thus the write is handled differently by the hardware than other writes. then the core is powered down before the core-local cache state is fully flushed to whatever comes next. it affects only the FPU because all other writes done by the kernel prior to core shutdown are scattered; this one is different.
how to test: enlarge the save structure and write the register bank twice; check consistency of the two copies in the restore code.

#3) the save/restore machinery fails
one scenario, bug in silicon: imagine arm arch requires that after core power up the FPU is disabled. the linux kernel rightly omits disabling the FPU after powering up a core. but this buggy soc sometimes wakes up with FPU enabled. linux dispatches a userland thread, the thread uses the FPU, and because its enabled, the core doesnt trap. linux cant restore state and the thread sees the power up state (mostly 00s or FFs i guess). alternatively: linux dispatches a userland thread and the thread doesnt use the FPU in its slice. during next context switch linux sees the FPU enabled and thus saves state, clobbering the correct saved state in ram. next time the thread uses the FPU it sees the FPU power up state that was saved.
how to test: check the FPU enable flag after powering up a core.

there might be many other scenarios like these.

what would be potential scenarios involving interrupts or timers that would corrupt FPU state?

i'll post a detector soon.
 
  • Like
Reactions: pirate11n11

XxPixX

Senior Member
Dec 19, 2012
678
526
0
25
Warsaw
Maybe this will help?
https://groups.google.com/forum/#!topic/xvisor-devel/aJXwarmjH3A

There is some issue with latest QEMU due to which we get MCT L0 timer
interrupts on CPU1 instead of CPU0 despite explicitly setting irq affinity.

May be this is due to missing GIC combiner code for Exynos SOCs but
for now minor/harmless fix added by this patch fixes the Xvisor crash
on Exynos4 emulated by QEMU.

EDIT:
If the bug is related to CPU hotplugging then there have been some issues with it lately on Exynos4210:
[PATCH 1/3] clocksource: exynos_mct: Fix stall after CPU hotplugging

clocksource: Exynos_mct: Register clock event after request_irq()
clocksource: Exynos_mct: Use irq_force_affinity() in cpu bringup
irqchip: Gic: Support forced affinity setting
genirq: Allow forcing cpu affinity of interrupts
 
Last edited:

zeitferne

Senior Member
Jul 15, 2014
153
684
0
Upper Austria
oberon00.github.io
That "ARM: EXYNOS: Fix for stall in case of cpu hotplug or sleep" is the most interesting of the commits to me with respect to this whole thing. Of most interest - It's inside the local timers ifdef AND it affects 4210s but not 4412s
OK, after adding this commit on top of the others, the system runs fine now, but the rsync test still fails with a hanging sdcard daemon.

---------- Post added at 01:35 PM ---------- Previous post was at 12:38 PM ----------

I added a "return IRQ_HANDLED;" after my printk, but the sdcard daemon still hung. The second commit you linked is the one Entropy512 just mentioned. All other commits, especially "hclocksource: Exynos_mct: Register clock event after request_irq()" indeed sound like they fix the problem that the IRQ is called on the wrong CPU, but there are huge changes in between, so it would be hard to apply these commits. Additionally I'm not even sure now if that is really the root cause of the FPU register corruption.
 

Entropy512

Senior Recognized Developer
Aug 31, 2007
14,095
25,085
0
Owego, NY
OK, after adding this commit on top of the others, the system runs fine now, but the rsync test still fails with a hanging sdcard daemon.

---------- Post added at 01:35 PM ---------- Previous post was at 12:38 PM ----------



I added a "return IRQ_HANDLED;" after my printk, but the sdcard daemon still hung. The second commit you linked is the one Entropy512 just mentioned. All other commits, especially "hclocksource: Exynos_mct: Register clock event after request_irq()" indeed sound like they fix the problem that the IRQ is called on the wrong CPU, but there are huge changes in between, so it would be hard to apply these commits. Additionally I'm not even sure now if that is really the root cause of the FPU register corruption.
Yeah, there seems to be more and more evidence that says the MCT isn't the root cause. It could be that disabling CONFIG_LOCAL_TIMERS just alters the timing of things enough to mask the issue, just like various other things (such as keeping the screen on) mask the issue. The various MCT funkiness could just be unrelated minor (potentially just cosmetic) issues...

However, it seems like hotplug is definitely connected to this, and it IS one thing that explains how the registers could get corrupted. If some task is running on core1, then core1 plugs out and the task migrates to core0 - that might explain the issue. (would be interesting to see if the sdcard process is migrating cores when the issue occurs?)

Which could potentially also explain why using exfat (or any other FUSE-based FS) for the external SD masks the issue. This results in an additional userspace process being awake when reading data, which could potentially be enough additional load to prevent core1 from getting unplugged during the read.
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
FPBug detector

this detector runs 4 parallel processes that randomly sleep and check for corruption in all FP regs. sleeps are 0 to 250ms in a uniform-squared distribution (favors short waits). initial reg values depend on reg number and process number to detect process crosstalk and random corruption. each process outputs changed regs and dies upon detecting corruption. this detector can be run in non-rooted phones.

there is no proper cleanup of processes by the app, so after EACH invocation you should reboot, because: a) your battery will die soon if you don't, and 2) there will be more processes running, crosstalk among them might go undetected, and debug output in the event of a detection might be confusing.

the app is not a service, so it MUST remain in foreground once started. it is most probably required to turn off the screen for the bug to be triggered, so run the app and turn off the screen.

if the bug doesn't show up, you are welcome to try concurrent MTP transfers, taring/untaring stuff via adb shell, or whatever you like as long as the app remains in foreground.

(source included)

EDIT: i've added 2-process and 1-process versions of the app. the original 4-process version failed to detect corruption in the one instance in which it was tested so far (see below). rationale: maybe the 4-process version doesn't let the system turn off a core and the others do. note that more than one process is required to detect crosstalk.

NOTE: if ALL processes die after a run that detects corruption, YOU DONT NEED TO REBOOT to run the test again (or continue using the phone). this is the only exception to the reboot rule.
 

Attachments

Last edited:

drdays

Senior Member
Aug 17, 2011
84
26
48
Accra
FPBug detector

this detector runs 4 parallel processes that randomly sleep and check for corruption in all FP regs. sleeps are 0 to 250ms in a uniform-squared distribution (favors short waits). initial reg values depend on reg number and process number to detect process crosstalk and random corruption. each process outputs changed regs and dies upon detecting corruption. this detector can be run in non-rooted phones.

there is no proper cleanup of processes by the app, so after EACH invocation you should reboot, because: a) your battery will die soon if you don't, and 2) there will be more processes running, crosstalk among them might go undetected, and debug output in the event of a detection might be confusing.

the app is not a service, so it MUST remain in foreground once started. it is most probably required to turn off the screen for the bug to be triggered, so run the app and turn off the screen.

if the bug doesn't show up, you are welcome to try concurrent MTP transfers, taring/untaring stuff via adb shell, or whatever you like as long as the app remains in foreground.

(source included)
Thanks. Can the FPBug detector be run when the sdcardfix has been flashed on the system?

Sent from my GT-I9100 using Tapatalk
 

drdays

Senior Member
Aug 17, 2011
84
26
48
Accra
My setup is 13th December nightly of CM11. Internal sd is vfat formatted and external sd is fat32 formatted.
I copied 200mb worth of pictures from internal sd to pc and 1gb of documents (pdf and chm) to external sd via mtp.
FPBug was running in the foreground with screen off. Everything seemed to go smoothly. Only thing is I can't access settings section of FPBug.
 

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
My setup is 13th December nightly of CM11. Internal sd is vfat formatted and external sd is fat32 formatted.
I copied 200mb worth of pictures from internal sd to pc and 1gb of documents (pdf and chm) to external sd via mtp.
FPBug was running in the foreground with screen off. Everything seemed to go smoothly. Only thing is I can't access settings section of FPBug.
lol there are no settings in the app.

so i guess your device gets the bug without the fix right? can you confirm that? that'd show that this doesnt work.

if so i should reduce the number of processes to 2 and see what happens.
 
  • Like
Reactions: sharp87

Lanchon

Senior Member
Jun 19, 2011
2,703
4,455
0
i've added 2-process and 1-process versions of the detector app (edit to original post). maybe the 4-process version doesn't let the system turn off a core and the others do. note that more than one process is required to detect crosstalk.