Dev help needed debugging ramoops from bootlooping Nexus 6P

By XCnathan32, Senior Member on 23rd July 2017, 01:55 AM
I managed to get my Nexus 6P working by enabling only the little cores. However, I would like to try to get the big cores working as well.

Here's the console-ramoops I pulled from my device: https://pastebin.com/ddinyPzz

The first major error relating to the big CPU cluster that I noticed is at lines 317 and 318: "_cpu_up: attempt to bring up CPU 4 failed"

However, the fatal error seems to occur at lines 439-451, where multiple errors about pll_clk_enable appear. Here are some lines that I noticed.
Line 440: "variable_rate_pll_clk_enable: PLL a57_pll1 didn't lock after enabling for L value 0x50!"
Line 451: "Kernel panic - not syncing: failed to lock a57_pll1 PLL". From that point on, the kernel appears to go through the shutdown process.
For those who don't know, the Cortex-A57 cores are the ones that make up the big cluster.
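
For anyone who wants to check their own dump for the same messages, here is a minimal sketch. It only looks for the three message fragments quoted above. The sample text is assembled from those quotes for illustration and is not a copy of the real console-ramoops, so line numbers will differ from the pastebin.

```python
import re

# Message fragments quoted in this post; extend the list as needed.
PATTERNS = [
    r"_cpu_up: attempt to bring up CPU \d+ failed",
    r"PLL \S+ didn't lock",
    r"Kernel panic - not syncing",
]

def find_errors(log_text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    regex = re.compile("|".join(PATTERNS))
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if regex.search(line):
            hits.append((number, line.strip()))
    return hits

# Hypothetical sample built from the quoted messages, not the real log.
sample = "\n".join([
    "_cpu_up: attempt to bring up CPU 4 failed",
    "some unrelated line",
    "variable_rate_pll_clk_enable: PLL a57_pll1 didn't lock after enabling for L value 0x50!",
    "Kernel panic - not syncing: failed to lock a57_pll1 PLL",
])

for number, line in find_errors(sample):
    print(number, line)
```

To use it on a real dump, read the saved console-ramoops file into a string and pass that to find_errors instead of the sample.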

I tried to do some research on what a PLL is (disclaimer: I am no expert whatsoever, so what I say may be wrong).
From what I could find, PLL stands for phase-locked loop, and its purpose is to generate and control the clock frequencies of a CPU.
Intel has a post on possible causes for a PLL losing lock: https://www.altera.com/support/suppo...loss-lock.html
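
As a rough illustration of what the "L value" in the log could mean: in an integer PLL, the output frequency is the reference frequency multiplied by an integer. The sketch below assumes a 19.2 MHz reference clock. That number is an assumption on my part and does not come from the log or the driver source, so treat the result as a ballpark only.

```python
# Assumption: the PLL multiplies a 19.2 MHz reference by the integer
# "L value". The reference frequency is NOT taken from the log.
ASSUMED_REF_KHZ = 19_200

def pll_output_khz(l_value, ref_khz=ASSUMED_REF_KHZ):
    """Output frequency in kHz for an integer-L PLL: f_out = L * f_ref."""
    return l_value * ref_khz

l_value = 0x50  # from "didn't lock after enabling for L value 0x50"
print(l_value, "->", pll_output_khz(l_value) / 1000, "MHz")
```

Under that assumption, L value 0x50 (80 decimal) would correspond to roughly 1536 MHz, which suggests the kernel is asking the A57 PLL for a fairly high frequency when the lock fails.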

I did some digging around in the kernel source code, and there are entries for "pll_clk_disable" in the PLL driver: https://android.googlesource.com/ker...om/clock-pll.c

So maybe this means there's a way to somehow disable the PLL?
If any devs, or anyone with experience on this, have an idea on how to possibly fix this, please share your thoughts. This problem is a relatively common one on this device, and it would be awesome if we could fix it.

Here are my questions for anyone who knows more about this:
Would it be possible to disable the PLL?
Is the PLL hardware based, or software?
And would it be possible to somehow build a kernel to fix this problem?
23rd July 2017, 09:11 PM |#2
Nathan-K, Junior Member, Bay Area, CA
Caveat: I'm a mechanical engineer, not an electrical engineer. Take this with a huge grain of salt.

Phase-locked loops can't simply be "disabled". They are a very important, low-level aspect of modern electronics. As you point out, they are used for signal synchronization, such as clocking a CPU bus. They also consist of many electronic components, and the failure of any single one will result in a cascade of failures. So basically, I think your logs just further suggest it is a fundamental hardware issue.

As a theoretical example (this is completely made up by the way, my own hypothetical amateur understanding): let's say a solder pad is broken on the CPU, or tin whiskers have formed on one of the two signals that the PLL comparator operates on. One signal will be a nice sine wave, the other will be random noise. The comparator output will then also be random noise. So in this case, the "PLL lock failure" would indicate a broken solder joint.
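
To make the clean-signal versus noise idea concrete, here is a toy simulation. It is a deliberately simplified phase-domain model with made-up gains and thresholds, not a model of the actual a57_pll1 hardware. The "lock" rule (phase error under 0.05 rad for 100 samples in a row) is also just an illustrative assumption.

```python
import math
import random

def wrap(phase):
    """Wrap a phase in radians into [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def pll_locks(noisy_reference, steps=2000, seed=0):
    """Simulate a toy phase-domain PLL; return True if it reports lock.

    Lock here means the phase error stayed under 0.05 rad for
    100 consecutive samples (thresholds chosen for illustration).
    """
    rng = random.Random(seed)
    kp, ki = 0.1, 0.005              # proportional / integral loop gains
    ref_step, vco_step = 0.20, 0.15  # rad per sample; VCO starts off-frequency
    ref_phase = vco_phase = integrator = 0.0
    in_tolerance = 0
    for _ in range(steps):
        if noisy_reference:
            # Broken input: the detector sees a random phase each sample.
            ref_phase = rng.uniform(-math.pi, math.pi)
        error = wrap(ref_phase - vco_phase)              # phase detector ("comparator")
        integrator += ki * error                         # loop filter
        vco_phase += vco_step + kp * error + integrator  # steer the oscillator
        if not noisy_reference:
            ref_phase += ref_step                        # clean reference advances steadily
        in_tolerance = in_tolerance + 1 if abs(error) < 0.05 else 0
        if in_tolerance >= 100:
            return True
    return False

print("clean reference locks:", pll_locks(noisy_reference=False))
print("noisy reference locks:", pll_locks(noisy_reference=True))
```

With a clean reference the loop pulls the oscillator onto the reference frequency and the error settles, so lock is reported. With random noise at the input the error never settles, which is the software-visible symptom a "didn't lock" message would describe in this hypothetical scenario.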

As another theoretical example (this is completely made up, etc.): let's say the voltage-controlled oscillator circuit develops a non-monotonic output, which the video linked below describes as sometimes occurring. (I'll just say "silicon wear".) This would result in system instability at certain frequencies. Ergo, another hardware problem. (In this case, maybe downclocking the CPU would be a test? Do the big cores operate at different frequencies at boot?)

All in all, I think it's great you got it this far; it is brilliant work. Way to go. I hope (*cough*) Google considers signing a "hotfix" firmware image that implements your workaround, so that people with locked/stock bootloaders may have the opportunity to fix their phones. I'll pass this on to Techno Bill, our firmware volunteer guy. He's the one who successfully argued for publishing signed Full OTA "Rescue images" way back when. I'll ping whomever I can, and keep my fingers crossed.

Here are some helpful references:
https://www.electronics-notes.com/ar...mer-basics.php
https://www.youtube.com/watch?v=A9qt0JYdvFU

TL;DR: Likely still a hardware problem. I am out of the loop (pardon the pun) on this topic, and am only a volunteer, so my views are my own, etc. However, I suspect Google was being honest about it from the get-go.

If you have a bootlooping phone, the official line from volunteers like me is to call in and ask about Warranty Service.
23rd July 2017, 09:38 PM |#3
robcore, Senior Member
Quote:
Originally Posted by XCnathan32

Hey man, do you have your current kernel tree online? I a) don't own the device and b) am no expert. However, I am a very resourceful kernel hacker, and working around the impossible is my specialty. Just looking at the clk driver from the same source you shared, I can see some glaring issues in online CPU refcounting. Also, the logs lean closer to device tree/platform data errors than anything else. Still, the init sequences that follow should definitely be checking that the cores are actually populated instead of just progressing based on trust. I would be happy to submit some patches if you have a build up.

24th July 2017, 12:11 AM |#4
XCnathan32, OP Senior Member, Texas
Quote:
Originally Posted by robcore

The Nougat kernel source for the 6P is here. I'm not sure if that's the same thing as a kernel tree; correct me if I'm wrong. Thanks for working on this!
24th July 2017, 12:16 AM |#5
XCnathan32, OP Senior Member, Texas
Quote:
Originally Posted by Nathan-K

OK, that makes more sense, thanks for clearing that up. I do know that some people have said simply underclocking their big cores worked. Personally, it didn't work for me, but maybe I just need to clock them lower; I'm not sure. Regardless, it shocks me how blatantly faulty the Nexus 5X/6P devices are, and that Google/Huawei/Qualcomm/LG somehow haven't done anything to address or fix the problem, other than Google and Huawei getting into a pissing contest over whose fault it is.
24th July 2017, 04:03 AM |#6
robcore, Senior Member
Quote:
Originally Posted by XCnathan32

For sure. I meant more whether you were personally working on a kernel for it.

24th July 2017, 04:07 AM |#7
XCnathan32, OP Senior Member, Texas
Quote:
Originally Posted by robcore

I'm thinking of working on a kernel optimized for using only the little cores. For now, though, I'm just trying to get this fix out to as many devices as I can. Apparently the Qualcomm Snapdragon 808/810 SoCs have tons of problems.
24th July 2017, 04:21 AM |#8
robcore, Senior Member
Quote:
Originally Posted by XCnathan32

If you do, please reach out to me. Judging from your logs, the big cores aren't being found during initialization, and the other errors (PLL, clocks) come from code assuming that the cores are already registered with the driver. Now, given that the big cores have been identified as the culprit for the bootloops, maybe we could work on a custom solution that delays their initialization to a later stage.
While upstreaming a legacy device of mine, I had to rewrite the Qualcomm cpufreq driver in order to make it register with the cpufreq core, and it was a learning experience that would be a pity to keep to myself!
24th July 2017, 05:18 AM |#9
XCnathan32, OP Senior Member, Texas
Quote:
Originally Posted by robcore

This sounds promising. It's weird, though, how devices would randomly start bootlooping, not even right after an update.
24th July 2017, 06:21 AM |#10
kronflux
Given that this is clearly a hardware issue, would there be a way to track it to a specific core? Then you could simply disable -that- core, rather than disabling all the "big cores", as they're being referred to.

I have minimal experience related to this, but that's just what's on my mind at the moment.
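
A rough sketch of how per-core testing could look on a running Linux system, using the standard CPU hotplug files under /sys/devices/system/cpu. The CPU numbers 4-7 for the big cores are an assumption based on the "CPU 4" message in the log, and the helper names are made up for illustration. The example runs against a fake temporary directory, so it does not touch any real hardware.

```python
import tempfile
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # standard Linux CPU hotplug directory

def online_file(cpu, root=SYSFS_CPU):
    """Path of the hotplug control file for one CPU."""
    return Path(root) / f"cpu{cpu}" / "online"

def set_online(cpu, online, root=SYSFS_CPU):
    """Write 1 (online) or 0 (offline); needs root on a real device."""
    online_file(cpu, root).write_text("1" if online else "0")

def is_online(cpu, root=SYSFS_CPU):
    return online_file(cpu, root).read_text().strip() == "1"

def isolate(test_cpu, big_cpus, root=SYSFS_CPU):
    """Leave only one big core online so a failure can be pinned on it."""
    for cpu in big_cpus:
        set_online(cpu, cpu == test_cpu, root)

# Dry run against a fake directory tree, so nothing real is changed.
ASSUMED_BIG_CPUS = [4, 5, 6, 7]  # assumption: CPUs 4-7 are the A57 cores
fake_root = Path(tempfile.mkdtemp())
for cpu in ASSUMED_BIG_CPUS:
    (fake_root / f"cpu{cpu}").mkdir()
    set_online(cpu, True, fake_root)

isolate(5, ASSUMED_BIG_CPUS, fake_root)
print({cpu: is_online(cpu, fake_root) for cpu in ASSUMED_BIG_CPUS})
```

Note that this only helps if the phone boots far enough to reach a shell. In the log from the first post, the panic happens while the kernel itself is bringing the big cores up, so a test like this would probably need the cores kept off at boot and then enabled one at a time afterwards.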
24th July 2017, 06:59 AM |#11
trax7, Senior Member
Quote:
Originally Posted by kronflux

The issue is hardly limited to any specific core; rather, it is kernel/driver related and affects the whole big cluster, as described above. I think that if there were any real production defects, they would have noticed them by now.
I have a Sony Z5C, which uses the same SoC and has a similar issue, though nowhere near as severe. Our hotplugging is inefficient, and the CPU readings and IRQs were botched. Until recently the only viable/stable CPU governor was interactive (even on custom kernels). Anything else gave erroneous data, locked frequencies, and deteriorated performance, and didn't survive a reboot (or 10 minutes of use) without defaulting back to interactive. We don't have reboots or lockups or anything, but the issue is still quite similar once you think about it.