Dev help needed debugging ramoops from bootlooping Nexus 6P

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
I managed to get my Nexus working by enabling only the little cores; however, I would like to try to get the big cores working.

Here's the console-ramoops I pulled from my device: https://pastebin.com/ddinyPzz

The first major error relating to the BIG CPU that I noticed was at lines 317 and 318: "_cpu_up: attempt to bring up CPU 4 failed"

However, the fatal error seems to occur at lines 439-451. Multiple errors about pll_clk_enable occur; here are some lines that I noticed.
Line 440: "variable_rate_pll_clk_enable: PLL a57_pll1 didn't lock after enabling for L value 0x50!"
And then at line 451: "Kernel panic - not syncing: failed to lock a57_pll1 PLL". From that point on, the kernel appears to go through the shutdown process.
For those who don't know, the Cortex-A57 cores are the ones that make up the BIG CPU.
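
A rough way to read the "L value 0x50" in that message, as a sketch only: in an integer-mode PLL the output rate is the reference clock multiplied by L. The 19.2 MHz reference used below is an assumption (it is not stated anywhere in this thread or taken from the log), and `pll_rate_khz` is a hypothetical helper, not a function from the kernel driver.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 19.2 MHz reference clock; this value is not from the thread. */
#define ASSUMED_REF_KHZ 19200u

/* Integer-mode PLL output in kHz for a given L (multiplier) value. */
static uint32_t pll_rate_khz(uint32_t l_val)
{
    /* Refuse inputs that would overflow the 32-bit result. */
    assert(l_val <= UINT32_MAX / ASSUMED_REF_KHZ);
    return l_val * ASSUMED_REF_KHZ;
}
```

Under that assumption, `pll_rate_khz(0x50)` gives 1536000 kHz, so the failing lock attempt would be for a rate of roughly 1.5 GHz. If the real reference clock differs, the number changes accordingly.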

I tried to do some research on what a PLL is (disclaimer: I am no expert whatsoever, so what I say may be wrong).
From what I could find, PLL stands for phase-locked loop, and its purpose is to control the frequencies of a CPU.
Intel has a post on possible causes for PLL losing lock: https://www.altera.com/support/support-resources/operation-and-testing/pll-and-clock-management/pll-loss-lock.html

I did some digging around in the kernel source code, and there are entries for "pll_clk_disable" in the PLL driver https://android.googlesource.com/kernel/msm/+/android-msm-angler-3.10-o-preview-3/drivers/clk/qcom/clock-pll.c

So maybe this means there's a way to somehow disable PLL?
So if any Devs, or anyone with experience on this, have an idea on how to possibly fix this, please give your thoughts. This problem is a relatively prominent one in this device, and it would be awesome if we could fix it.

Here are my questions for anyone who knows more about this:
Would it be possible to disable PLL?
Is PLL hardware based, or software?
And would it be possible to somehow build a kernel to fix this problem?
 

Nathan-K

Member
Nov 28, 2016
14
32
Bay Area, CA
plus.google.com
Caveat: I'm a mechanical, not electrical engineer. Take this with a huge grain of salt.

Phase locked loops can't be "disabled". They are a very important, low-level aspect of modern electronics. As you point out, they are useful for signal synchronization, such as on a CPU bus. They also have many important electronic components; the failure of any single one will result in a cascade of failures. So basically, I think your logs just further suggest it is a fundamental hardware issue.

As a theoretical example (this is completely made up by the way, my own hypothetical amateur understanding): let's say a solder pad is broken on the CPU or tin whiskers formed on one of the two signals for which the PLL comparator is operating. One signal will be a nice sine wave, the other will be random noise. The comparator output will be random noise. So in this case, the "PLL lock failure" will indicate a broken solder joint.

As another theoretical example (this is completely made up, etc.): let's say the Voltage Controlled Oscillator circuit develops a non-monotonic output, which is described in the video as sometimes occurring. (I'll just say "silicon wear".) This would result in system instability at certain frequencies. Ergo, another hardware problem. (In this case, maybe downclocking the CPU would be a test? Do the Big cores operate at different frequencies at boot?)

All in all I think it's great you got it this far, and it is brilliant work. Way to go. I hope (*cough*) Google considers signing a "hotfix" firmware image that implements your workaround, so that people with locked/stock bootloaders may possibly have the opportunity to fix their phones. I'll pass this on to Techno Bill, our firmware volunteer guy. He's the one who successfully argued for publishing signed Full OTA "Rescue images" way back when. I'll ping whomever I can, and keep my fingers crossed.

Here's some helpful references:
https://www.electronics-notes.com/articles/radio/pll-phase-locked-loop/tutorial-primer-basics.php
https://www.youtube.com/watch?v=A9qt0JYdvFU

TL;DR: Likely still a hardware problem. I am out of the loop (pardon the pun) on this topic, and am only a volunteer so my views are my own, etc. However I suspect Google was being honest about it from the get go.

If you have a bootlooping phone, the official line from volunteers like me is to call in and ask about Warranty Service.
 

robcore

Senior Member
Jul 27, 2012
919
740
Samsung Galaxy Note 3
Google Pixel XL
Hey man, do you have your current kernel tree online? I a) don't own the device and b) am no expert. However, I am a very resourceful kernel hacker, and working around the impossible is my specialty. Just looking at the clk driver from the same source you shared, I can see some glaring issues in online cpu refcounting. Also the logs lean closer to device tree/platform data errors than anything else. Though the following init sequences should definitely be checking for the population instead of just progressing based on trust. I would be happy to submit some patches if you have a build up :)

Sent from my Note 3 using XDA Labs
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
The Nougat kernel source for the 6P is here; I'm not sure if that's the same thing as a kernel tree, so correct me if I'm wrong. Thanks for working on this!
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
OK, that makes more sense; thanks for clearing that up. I do know that some people have said simply underclocking their big cores has worked. Personally, it didn't work for me, but maybe I just need to clock them lower; I'm not sure. Regardless, it shocks me how blatantly faulty the Nexus 5X/6P devices are, and that Google/Huawei/Qualcomm/LG haven't done anything to address or fix the problem, other than Google and Huawei getting into a pissing contest over whose fault it is.
 

robcore

Senior Member
Jul 27, 2012
919
740
Samsung Galaxy Note 3
Google Pixel XL
I'm thinking of working on a kernel optimized for using only the little cores; for now, though, I'm just trying to get this fix out to as many devices as I can. Apparently Qualcomm 808/810 SoCs have tons of problems.
If you do, please reach out to me. Judging from your logs, the big cores aren't being found during initialization, and the other errors (pll, clocks) are from assuming that the cores are already registered with the driver. Now, given that the big cores have been identified as the culprit for the loops, maybe we could work on a custom solution to delay their initialization to a later initialization stage.
While upstreaming a legacy device of mine, I had to rewrite the Qualcomm cpufreq driver in order to make it register with the cpufreq core, and it was a learning experience that would be a pity to keep to myself!
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
This sounds promising. It's weird, though, how devices would randomly bootloop, not even right after an update.
 

kronflux

Senior Member
Jun 6, 2012
515
501
33
Edmonton
Samsung Galaxy S7
Google Pixel 2
My thought is that, since this is clearly a hardware issue, would there be a way to trace it to a specific core? Then we could simply disable -that- core, rather than disabling all the "big cores," as they're being referred to.

I have minimal experience related to this, but that's just what's on my mind at the moment.
 

trax7

Senior Member
May 15, 2012
952
344
The issue is hardly any specific core but rather kernel/driver related to the whole BIG cluster as specified above. I think that if there were any real production defects they would have noticed them by now.
I have a Sony Z5C, which uses the same SoC and has a similar issue but is nowhere near as severe. Our hotplugging is inefficient and the CPU readings and IRQs were botched. Until recently the only viable/stable CPU governor was interactive (even on custom kernels)...:rolleyes: Anything else gave erroneous data, locked frequencies, deteriorated performance and didn't survive a reboot (or 10 minutes of use) without defaulting back to interactive. We don't have reboots or lockups or anything but the issue's still quite similar once you think about it.
 

Pineapplelaw

Member
Oct 8, 2011
20
4
Toronto
Is there any way to enable just one core in the big cluster? If only part of the big cluster is bad, we can isolate it and use the other cores.
By booting with one big core at a time, we can test which one is the problem.
Maybe we could disable the bad core, or set its frequency to 0? Or its voltage to 0?
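
To make the one-core-at-a-time idea concrete, here is a minimal sketch of the test plan in C. It assumes CPUs 0-3 are the little cores and CPUs 4-7 are the big ones (consistent with the "bring up CPU 4 failed" line in the log, but still an assumption), and it uses the standard Linux CPU hotplug file `/sys/devices/system/cpu/cpuN/online`. The helper names `mask_with_one_big` and `cpu_online_path` are made up for illustration. The code only computes which cores to keep and which file to write; it does not touch the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* ASSUMPTION: CPUs 0-3 are the little cores and CPUs 4-7 the big ones. */
#define LITTLE_MASK   0x0Fu
#define FIRST_BIG_CPU 4
#define LAST_BIG_CPU  7

/* Bitmask of CPUs to keep online: all little cores plus one big core. */
static unsigned int mask_with_one_big(int big_cpu)
{
    assert(big_cpu >= FIRST_BIG_CPU && big_cpu <= LAST_BIG_CPU);
    return LITTLE_MASK | (1u << big_cpu);
}

/*
 * Writes the standard Linux CPU hotplug sysfs path for `cpu` into buf.
 * Returns the path length, or -1 if buf is too small.
 */
static int cpu_online_path(int cpu, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

On a running system, writing "1" or "0" to that file as root brings a core online or takes it offline. Whether onlining a single big core after boot avoids the PLL panic seen in the log is untested here, since the failure in the log happens during boot.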

Sent from my Nexus 6P using XDA-Developers Legacy app
 

trax7

Senior Member
May 15, 2012
952
344
Look at my last post.
 

robcore

Senior Member
Jul 27, 2012
919
740
Samsung Galaxy Note 3
Google Pixel XL
Strange? Yes. Workable? Perhaps!
Something tells me the HMP scheduling has been a largely failed experiment as well. Also, the Google/OEM sources are far from perfect, and every device needs its own specific workarounds for these types of issues.
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
I'm seeing this error: "msm_thermal:Failed reading node=/soc/qcom,msm-thermal", and CPU-Z doesn't report any temps for my device whatsoever. Is this universal for 6Ps, or is it another problem with the SoC?

Edit: I'm also seeing this message (not sure if it is an error) come up a lot: "(name of module/driver)0 <--> 0 mV". Do you think it's possible that the BIG CPUs are somehow not getting power?
 
Jan 8, 2013
49
19
@XCnathan32 see ./drivers/clk/qcom/clock-cpu-8994.c
The PLL lock bit is defined as BIT(31), which is the MSB of the register pll->status_reg.
In the log you posted, pll->status_reg is 0xca000100, which has the MSB set. This suggests the PLL had locked by the time that debug message was printed.

Conclusion: the CPU is not giving the PLL enough time to lock.

Suggested possible solution: change line 53 in clock-pll.c from #define ENABLE_WAIT_MAX_LOOPS 200 to #define ENABLE_WAIT_MAX_LOOPS 100000 to give the PLL 100 ms to attempt a lock.
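
The reasoning above can be sketched as a small userspace simulation. This is not the driver code: `simulated_status` and `wait_for_lock` are hypothetical stand-ins, and the "PLL" here is simply assumed to report lock after a chosen number of polls. The only values taken from the post are the lock bit position (bit 31), the status value 0xca000100, and the two loop limits. The 100 ms figure implies about 1 µs per poll, which is inferred from the post rather than checked against the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lock flag position as described in the post: bit 31 (the MSB). */
#define PLL_LOCK_BIT (UINT32_C(1) << 31)

/* True when the lock bit is set in a raw status-register value. */
static bool pll_is_locked(uint32_t status)
{
    return (status & PLL_LOCK_BIT) != 0;
}

/*
 * Hypothetical status register: reads 0xca000100 (MSB set) once `poll`
 * reaches `lock_at`, and the same value with the MSB cleared before that.
 */
static uint32_t simulated_status(long poll, long lock_at)
{
    return poll >= lock_at ? UINT32_C(0xca000100) : UINT32_C(0x4a000100);
}

/* Bounded wait: poll up to max_loops times and report whether lock was seen. */
static bool wait_for_lock(long lock_at, long max_loops)
{
    assert(lock_at >= 0 && max_loops >= 0);
    for (long poll = 0; poll < max_loops; poll++) {
        if (pll_is_locked(simulated_status(poll, lock_at)))
            return true;
    }
    return false; /* timed out; the real driver reports this as a lock failure */
}
```

With a simulated PLL that needs 5000 polls to lock, `wait_for_lock(5000, 200)` gives up early while `wait_for_lock(5000, 100000)` succeeds. That is the situation the post describes: the status read shortly after the timeout already shows the lock bit set.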
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
Will try it.
 

XCnathan32

Senior Member
May 30, 2013
445
1,011
Texas
Just realized this, would I need to uncomment "#define"? Because I would think that line would be ignored when commented.
 

Vency77

Senior Member
Apr 5, 2012
265
175
Plovdiv
Damn, I hope I'm not offending anyone but, I feel very sad that I don't have a broken 5x/6p (and my nexus 5 does not want to die) to work on this fix.
Will follow this thread and help with whatever I can (which is limited to some C programming and a few linux kernel builds...).
Good luck and congrats for this progress.
 

Top Liked Posts

    4
    @XCnathan32 see . / drivers / clk / qcom / clock-cpu-8994.c
    The pll lock bit is defined as BIT(31), which is the MSB of register pll->status_reg.
    On the log you posted, pll->status_reg is 0xca000100 with MSB set, this suggests the pll has locked by the time that debug message has set.

    Conclusion: the cpu is not giving the pll enough time to lock.

    Suggested possible solution: change line 53 in clock-pll.c from #define ENABLE_WAIT_MAX_LOOPS 200 to #define ENABLE_WAIT_MAX_LOOPS 100000 to give the pll 100ms to attempt a lock.

    HOLY **** DUDE!! I think you fixed it! You're a goddamn genius. It didn't boot the first time I flashed it, but after checking the log, I saw that the PLL clocks registered successfully. I just forgot to build the kernel without dm-verity, so a dm-verity error is why it didn't boot this time. I'm gonna rebuild the kernel with dm-verity disabled and try it again, but I really think you might have fixed this!
    3
    Update: Sorry for kind of abandoning this project. With the core utilization fix, the device runs so well now that I feel like getting a fix for all 8 cores is not necessary. I would rather focus my time and attention on improving and maintaining the 4 core fix. With all 8 cores enabled, the 6P gets rather hot, which makes it more uncomfortable to hold and can impact the longevity of other components. Increasing voltage would also make the thermal issue worse and reduce battery life further, so I feel like the better option for longevity is to use the 4 core fix.

    However, many bright minds have given good suggestions in this thread, so if anyone else would like to pick up the torch on development, feel free to go ahead.