Slowness and kworker eating CPU with version #190400 of kernel 5.8.18-olimex

Started by mossroy, December 15, 2020, 05:43:07 PM


mossroy

The latest "mainline" image provided by Olimex (http://images.olimex.com/release/a64/A64-OLinuXino-buster-minimal-20201207-193928.img.7z) ships the following kernel:
Quote: Linux debian10 5.8.18-olimex #190400 SMP Mon Dec 7 19:05:17 UTC 2020 aarch64 GNU/Linux

If I install this image on an A64-OLinuXino-2Ge8G-IND board, it's very sluggish.
I'm monitoring my boards with Prometheus Node exporter, and it now takes 12 seconds to output ip:9100/metrics, where it used to take 1 to 2 seconds. As a result, Prometheus fails to monitor the board (timeout).
A simple "top" shows several kworker processes, each eating 20 to 50% of a CPU, even when the board is idle.


If I install the previous version of this image (A64-OLinuXino-buster-minimal-20201105-143953.img.7z) on the same board, which comes with this kernel:
Quote: Linux a64-olinuxino 5.8.18-olimex #140443 SMP Thu Nov 5 14:08:32 UTC 2020 aarch64 GNU/Linux
everything is fine: no kworker processes eating CPU, no slowness.
If I then run "apt upgrade", the kernel is upgraded, and after a reboot the slowness and the kworker processes appear.

Steps to reproduce :
  • install A64-OLinuXino-buster-minimal-20201207-193928.img on an A64-OLinuXino-2Ge8G-IND board (or upgrade the kernel on an older mainline image)
  • run "top"
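
To put numbers on the two symptoms described above, here is a minimal sketch. The function names are made up for illustration, and it assumes curl and a procps-style ps are installed (a minimal image may not ship curl). Note that ps reports CPU usage averaged over each process's lifetime, so the figures will not match the instantaneous values shown by top.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Olimex tooling.

# Sum the %CPU column of all kworker lines read from stdin.
# Expected input format: "<pcpu> <command name>" per line,
# as printed by: ps -eo pcpu,comm
sum_kworker_cpu() {
    LC_ALL=C awk '$2 ~ /^kworker/ { s += $1 } END { printf "%.1f\n", s }'
}

# Print how many seconds a full fetch of the given URL takes.
# %{time_total} is curl's total transfer time in seconds.
metrics_seconds() {
    curl -s -o /dev/null -w '%{time_total}\n' "$1"
}
```

Possible usage on the board (replace "ip" with the board's address, as in the post above):

ps -eo pcpu,comm | sum_kworker_cpu
metrics_seconds http://ip:9100/metrics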

LubOlimex

We are investigating. Meanwhile, try a different scaling governor. First try:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Test.

Then try with:

echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Report back your findings. It seems others have experienced a similar issue:

https://forum.armbian.com/topic/13117-pine64-lts-high-kworker-load-with-cpufreq-problems-after-upgrade-to-20021/
Technical support and documentation manager at Olimex
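
The commands above only write to cpu0. A minimal sketch that applies the same write to every CPU exposing a cpufreq directory is shown below. The function names and the optional root-directory argument are assumptions added for illustration; whether the A64 cores share a single cpufreq policy (in which case cpu0 alone is enough) is not stated in this thread. Writing to sysfs requires root.

```shell
#!/bin/sh
# Hypothetical helpers. The second argument (sysfs root) exists only so the
# functions can be tried against a scratch directory; it defaults to the
# standard location used in the commands above.

# Write the given governor name to every CPU that has a scaling_governor file.
set_governor() {
    governor="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        echo "$governor" > "$f"
    done
}

# Print the current governor of each CPU, one "path: value" line per CPU.
show_governors() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Possible usage, as root:

set_governor performance
show_governors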

mossroy

With the "performance" governor, the problem seems to disappear: no more kworker processes eating CPU, and the Prometheus node exporter takes ~1.5 seconds to respond.
With the "powersave" governor, the problem also disappears. The node exporter takes ~2.2 seconds.
With the "ondemand" governor (the default value), the problem reappears: kworker processes eat some CPU, and the node exporter takes ~11 seconds.

So switching to the performance or powersave governor is a workaround.

LubOlimex

Alright, thanks for your time and tests! We will restore the previous behavior, but this comes at a cost, because the issue is more serious than it seems. The sluggishness with the ondemand governor is caused by the workaround for the arch timer of the A64 chip: the A64 has a known design problem with its timer, and the timer might jump forward in time. Please check this series of patches:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898487.html

mossroy

Quote from: LubOlimex on December 17, 2020, 09:06:22 AM
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898487.html

This link returns a 404 Not Found error. But yes, I'm aware of the timer issue. I even reported it again recently: https://www.olimex.com/forum/index.php?topic=7238.msg29730#msg29730

I see that your latest image http://images.olimex.com/release/a64/A64-OLinuXino-buster-minimal-20201217-194545.img.7z ships an older version of the kernel:
Quote: Linux a64-olinuxino 5.8.18-olimex #122632 SMP Wed Dec 16 12:28:30 UTC 2020 aarch64 GNU/Linux

I suppose it's a workaround before finding a real fix?
I suppose this version #122632 still has this timer issue?

LubOlimex

This should be the link: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898487.html

>I suppose it's a workaround before finding a real fix?

There are no real fixes for bugs in the chip's design and silicon. I don't know whether a better workaround will be proposed.

If the kernel has the timer issue, our images will have it too. But the governor sluggishness should be gone.

mossroy

Quote from: LubOlimex on December 21, 2020, 08:25:45 AM
This should be the link: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898487.html

Your link is still not valid: the forum software adds "email" tags inside the URL. The following link looks OK, at least in the preview (I created a "url" tag instead of letting the forum do it automatically):
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898487.html

mossroy

I confirm that the timer issue is there with http://images.olimex.com/release/a64/A64-OLinuXino-buster-minimal-20201217-194545.img.7z

I just ran into it: one A64-OLinuXino-2Ge8G-IND board (installed yesterday) jumped to Thu 20 Feb 05:01:58 CET 2116 and, as usual, lost network access.

It's a blocker for me
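
For anyone who wants to catch such a jump when it happens, here is a minimal sketch of a watcher. It does not fix or work around the timer problem; it only compares successive wall-clock readings and reports a suspicious step. The function names, the 10-second sampling interval and the 60-second threshold are arbitrary assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for spotting a wall-clock jump.

# Succeed (exit status 0) when the clock went backwards or advanced by more
# than max_step seconds between two epoch samples.
clock_jumped() {
    prev="$1"
    now="$2"
    max_step="$3"
    delta=$((now - prev))
    [ "$delta" -lt 0 ] || [ "$delta" -gt "$max_step" ]
}

# Sample the clock every <interval> seconds and log any suspicious step
# to stderr. Runs until interrupted.
watch_clock() {
    interval="${1:-10}"
    max_step="${2:-60}"
    prev=$(date +%s)
    while sleep "$interval"; do
        now=$(date +%s)
        if clock_jumped "$prev" "$now" "$max_step"; then
            echo "clock jump detected: $prev -> $now" >&2
        fi
        prev=$now
    done
}
```

Possible usage: run watch_clock in the background and redirect stderr to a file on persistent storage, so the message survives the loss of network access.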