Unexpected shutdowns after upgrade to kernel 5.10.105

Started by mossroy, March 26, 2022, 02:36:41 PM

mossroy

Olimex has released a kernel version 5.10.105 that seems to solve the "Dirty Pipe" security issue. See https://github.com/OLIMEX/linux-olimex/pull/2

BUT it also seems to introduce a regression. I have many A64 boards that were unexpectedly shut down, with the following error message:

Quote: thermal thermal_zone0: critical temperature reached (90 C), shutting down

Within a few days of installing kernel 5.10.105, it happened on 4 of my A64-OLinuXino-2Ge8G-IND boards (running both buster and bullseye). It did not happen on the boards that are idle (or almost idle), only on the ones with noticeable CPU usage.
These boards had no such issue before. Or at least I did not notice it (either because it was much less frequent and/or because it triggered a reboot instead of a shutdown).

All of them have a heatsink: https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/. They are all in an apartment with a temperature around 20°C (It's spring here).

I reinstalled two of them with A64-OLinuXino-bullseye-minimal-20211130-145129.img (with its kernel 5.10.60) and the problem disappeared (even after installing the 5.10.60 version I compiled, with the "dirty pipe" patch), with the same workload.

That was to handle the emergency, and to check it comes from the kernel upgrade.
I'm not 100% sure yet, but I don't see any other explanation.

But I can't afford to reinstall all of them: is there a way to downgrade the kernel to 5.10.60? I did not dare to run a "sudo apt remove linux-image-5.10.105-olimex", because it would also remove package linux-image-olimex, for which the apt repository does not have the 5.10.60 version any more (see http://repository.olimex.com/pool/main/l/linux-5.10.60-olimex/ and http://repository.olimex.com/pool/main/l/linux-5.10.105-olimex/)

LubOlimex

How is the governor set?

One of the main reasons for this release was an unexpected hang on a large number of boards. The voltage tables were incorrect and not in line with the CPU frequencies used. Linux did not allow sufficient voltage to the cores of the main chip, so when the chip needed more voltage to operate it did not get it, threw errors, and restarted. This bug was introduced when we implemented the patch for the RTC leap, which also required changing the performance governor. The new governor values were not checked thoroughly. We found the cause, and when it was introduced, by testing older images that never hung on these boards.

Edit: I stress-tested quite a few A64 boards and can't reproduce a shutdown due to overheating. Can you be more specific about the conditions under which your boards shut down? How long did you stress them? Any information that might help me replicate the issue here is welcome.
Technical support and documentation manager at Olimex

LubOlimex

We just tested one A64-OLinuXino-2Ge8G-IND without any radiator, using the latest image from http://images.olimex.com/release/a64/, and it never went above 70 degrees C. Again, what exactly are you doing?

mossroy

The 4 boards had been installed with an older image version from Olimex (the previous one, for most of them, running kernel 5.10.60), then upgraded through apt to kernel 5.10.105 (from the Olimex repository).
Maybe it comes from something you changed in the latest image (other than the kernel) that is not provided as an update in your apt repository? It's just a possible reason that you could try to reproduce.

I'd be able to give more detail on the workload on these boards if necessary (in the following days: I'm not close to the boards right now). But it would be complicated to reproduce.
2 of them are k3s nodes (a Kubernetes lightweight distribution), running various pods: wordpress, mariadb, prometheus, munin. The CPU usage is not very high on them.
The 2 other ones are running Jenkins, which consumes a lot of CPU on startup.

They all have the heatsink mentioned above and are inside the official case https://www.olimex.com/Products/OLinuXino/A64/BOX-A64-BLACK/ (but I keep the cases open), with no other device plugged into them.

LubOlimex

Maybe we missed something. Is it possible to test with the unmodified image on one of the overheating boards and compare the behavior? The base image from here:

http://images.olimex.com/release/a64/

mossroy

I'll test that in the following days (probably tomorrow).

On your side, could you install your previous image (with kernel 5.10.60), upgrade it with apt (to get kernel 5.10.105), and run your stress test on it?

mossroy

I reinstalled one of my boards with the latest A64-OLinuXino-bullseye-minimal-20220321-223544.img (kernel 5.10.105), as you suggested, on another SD card.

After installing and running the same workload on it, I see the CPU temperature rising above 80°C, and after a few minutes it shuts down with the following message:

Quote: kernel:[ 1693.680831] thermal thermal_zone0: critical temperature reached (91 C), shutting down

After switching back to the original SD-Card (with kernel 5.10.60 + "dirty pipe" patch), and running the same workload again on it, the CPU temperature does not go above ~75°C (at some point, with 100% CPU on all cores), and the board does not shut down.

This is reproducible, as both SD cards were installed with the same Ansible playbook (to install k3s on them), and the same pods were scheduled on the board.
What consumes the most CPU is a mix of the following processes: unpigz (to extract container images), containerd, munin-*, coolwsd and coolforkit (parts of Collabora CODE). It lasts a few minutes (less than 5), then the CPU usage drops.

But I suppose it does not depend on the processes/technology. What stress test tool are you using? I could try to run the same one here.

LubOlimex

I did some testing too. Whether I run apt update and apt upgrade to go from the previous image to the latest one, or use the latest image directly, the tests go the same way. So I'd say the method of acquiring the image is not the issue.

Now about running the test:

I use the following:

stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1000s &
and then

watch -n1 cat /sys/devices/virtual/thermal/thermal_zone0/temp
So far I have tested three A64 boards that I have around, revisions E, F and G. Two of them don't overheat; they don't go above 80 degrees. The third one, however, ends up with the same message as you get! It is the hardware revision F board!

What are the hardware revisions of your boards that shut down?
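
To compare boards or kernels, it may help to log the temperature rather than only watch it. Below is a minimal sketch, assuming the same sysfs file as in the watch command above and that it reports millidegrees Celsius; the names ZONE, to_celsius and log_temp are illustrative, not part of any Olimex tooling.

```shell
#!/bin/sh
# Sketch: log the SoC temperature once per second while a load runs.
# ZONE, to_celsius and log_temp are illustrative names (assumptions).
ZONE="${ZONE:-/sys/devices/virtual/thermal/thermal_zone0/temp}"

# Convert a millidegree reading (e.g. 81997) to degrees C with one decimal.
to_celsius() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print "<sample index> <temperature in C>" COUNT times, one sample per second.
# $1 = temperature file, $2 = number of samples.
log_temp() {
    file="$1"; count="$2"; i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$i" "$(to_celsius "$(cat "$file")")"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep 1; fi
    done
}

# Example (uncomment on the board):
# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1000s &
# log_temp "$ZONE" 1000 | tee temp.log
```

The resulting temp.log can then be compared between a 5.10.60 run and a 5.10.105 run on the same board.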

mossroy

All the A64-OLinuXino-2Ge8G-IND boards I have here are revision E (not F), according to what is written on them: Rev.E (c) 2017

mossroy

I tested your stress tool, and have the same behavior as with my workload:
  • with kernel 5.10.60, the temperature does not exceed 70 to 75°C
  • with kernel 5.10.105, it reaches 90°C in a few seconds, and shuts down

LubOlimex

We have to think of a solution. Previously the power for the CPU core was too low, and some boards were suddenly restarting due to lack of power when needed. Now that we have increased it to suitable levels, some boards overheat and shut down during prolonged operation.

As a temporary fix you can lower the max clock to keep the temperature lower, for example roughly 20% lower, at 816 MHz (instead of the default 1056 MHz):

echo 816000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
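
If a different cap is wanted, a small helper can choose the highest supported frequency that does not exceed it. This is only a sketch: it assumes the cpufreq driver exposes scaling_available_frequencies next to scaling_max_freq, and pick_freq and CPUFREQ are made-up names for illustration.

```shell
#!/bin/sh
# Sketch: pick the highest supported frequency not above a cap (values in kHz).
# pick_freq and CPUFREQ are illustrative names (assumptions).
CPUFREQ="${CPUFREQ:-/sys/devices/system/cpu/cpu0/cpufreq}"

# $1 = cap in kHz, $2 = whitespace-separated list of frequencies in kHz.
# Prints the chosen frequency, or nothing (and returns non-zero) if none fits.
pick_freq() {
    cap="$1"; best=""
    for f in $2; do
        if [ "$f" -le "$cap" ]; then
            if [ -z "$best" ] || [ "$f" -gt "$best" ]; then best="$f"; fi
        fi
    done
    [ -n "$best" ] && echo "$best"
}

# Example (run in a root shell on the board):
# new=$(pick_freq 816000 "$(cat "$CPUFREQ/scaling_available_frequencies")")
# echo "$new" > "$CPUFREQ/scaling_max_freq"
# cat "$CPUFREQ/scaling_max_freq"    # confirm the new cap
```

Note that a value written to sysfs this way is lost on reboot, so it has to be reapplied at each boot.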

jch

I'm seeing random stalls (see the next thread) after upgrading to .105, but I'm not seeing the symptoms described above.  When putting load on the board, the temperature rises up to 70 degrees, then cooling_device0 gets into state 1 and the temperature reliably drops back.

@mossroy, could you please wait until the board reaches 70 degrees or so, then run the following?
cat /sys/class/thermal/cooling_device0/cur_state /sys/class/thermal/thermal_zone*/temp
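
It can also be useful to see at which temperatures throttling and the critical shutdown are configured. A minimal sketch, assuming the standard sysfs thermal layout (trip_point_N_type and trip_point_N_temp files, with temperatures in millidegrees); list_trips is an illustrative helper name.

```shell
#!/bin/sh
# Sketch: list the trip points of one thermal zone as "<type> <temp in C>".
# list_trips is an illustrative helper name (assumption).
# $1 = thermal zone directory, e.g. /sys/class/thermal/thermal_zone0
list_trips() {
    zone="$1"
    for t in "$zone"/trip_point_*_temp; do
        [ -e "$t" ] || continue            # no trip points exposed
        type_file="${t%_temp}_type"        # matching trip_point_N_type file
        printf '%s %s\n' "$(cat "$type_file")" "$(( $(cat "$t") / 1000 ))"
    done
}

# Example on the board:
# list_trips /sys/class/thermal/thermal_zone0
```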

mossroy

I saw that Olimex has released a new kernel: a new build of version 5.10.105 in which the voltage changes have been reverted (based on http://images.olimex.com/changelog.txt).

I've quickly tested it on my A64 boards: their behavior has improved significantly, but I still manage to reproduce the unexpected shutdown.

With my usual workload, the temperature is around 60-70°C and sometimes reaches around 80°C, but the board does not shut down.

But if I run the stress tool for a few minutes, the board temperature is around 80°C, slowly increases, reaches 90°C, and shuts down with the usual error message:
kernel:[ 2358.596960] thermal thermal_zone0: critical temperature reached (90 C), shutting down
@jch : with this newer kernel, when running the stress tool, here is the output of your command:

$ cat /sys/class/thermal/cooling_device0/cur_state /sys/class/thermal/thermal_zone*/temp
2
81997
73690
72403
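
For reading this output: the first line is the cooling device state, and the next three are the thermal zone temperatures in millidegrees Celsius, so 81997 is about 82.0 °C. A small sketch that prints each zone in degrees with its margin to a limit; report_zones is an illustrative name, and the 90 °C default is an assumption taken from the shutdown message above.

```shell
#!/bin/sh
# Sketch: print each thermal zone's temperature in C and its margin to a limit.
# report_zones is an illustrative name; the 90000 default (90 C) is an
# assumption taken from the kernel shutdown message.
# $1 = base directory (e.g. /sys/class/thermal), $2 = limit in millidegrees.
report_zones() {
    base="$1"; limit_mc="${2:-90000}"
    for z in "$base"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        awk -v n="${z##*/}" -v t="$(cat "$z/temp")" -v l="$limit_mc" \
            'BEGIN { printf "%s %.1f C (margin %.1f C)\n", n, t / 1000, (l - t) / 1000 }'
    done
}

# Example on the board:
# report_zones /sys/class/thermal
```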

LubOlimex

1. Is this with the default settings and the aluminum radiator over the CPU?

2. How long did it take to reach 90 degrees?