Olimex has released a kernel version 5.10.105 that seems to solve the "Dirty Pipe" security issue. See https://github.com/OLIMEX/linux-olimex/pull/2
BUT it also seems to introduce a regression. I have many A64 boards that were unexpectedly shut down, with the following error message:
thermal thermal_zone0: critical temperature reached (90 C), shutting down
Within a few days of installing this kernel 5.10.105, it happened on 4 of my A64-OLinuXino-2Ge8G-IND boards (running both buster and bullseye). It did not happen on the boards that are idle (or almost idle), only on the ones that have some noticeable CPU usage.
These boards had not shown this issue before. Or at least I did not notice it (either because it was much less frequent and/or because it triggered a reboot instead of a shutdown).
All of them have a heatsink: https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/. They are all in an apartment with a temperature around 20°C (It's spring here).
I reinstalled two of them with A64-OLinuXino-bullseye-minimal-20211130-145129.img (with its kernel 5.10.60) and the problem disappeared (even after installing the 5.10.60 version I compiled, with the "dirty pipe" patch), with the same workload.
That was to handle the emergency, and to check it comes from the kernel upgrade.
I'm not 100% sure yet, but I don't see any other explanation.
But I can't afford to reinstall all of them: is there a way to downgrade the kernel to 5.10.60? I did not dare to run a "sudo apt remove linux-image-5.10.105-olimex", because it would also remove package linux-image-olimex, for which the apt repository does not have the 5.10.60 version any more (see http://repository.olimex.com/pool/main/l/linux-5.10.60-olimex/ and http://repository.olimex.com/pool/main/l/linux-5.10.105-olimex/)
How is the governor set?
One of the main reasons for this release was an unexpected hang on a large number of boards. The voltage value tables were not correct and not in line with the CPU frequency used. Linux didn't allow sufficient voltage to the cores of the main chip, so when the chip required more voltage to operate it didn't get it, threw errors and restarted. This bug was introduced when we implemented the patch for the RTC leap, which also required changing to the performance governor. The new governor values were not checked thoroughly. The hint about what the problem was, and when it was introduced, came from testing older images that never hung on these boards.
Edit: I stress tested quite a few A64 boards and can't reproduce a shutdown due to overheating. Can you be more specific about the conditions under which your boards hang? How long did you stress them? Any information that might help me replicate the issue here is welcome.
We just tested one A64-OLinuXino-2Ge8G-IND without any radiator, with the latest image from http://images.olimex.com/release/a64/, and it never went above 70 degrees C. Again, what exactly are you doing?
The 4 boards had been installed with an older image version from Olimex (the previous one, for most of them, running kernel 5.10.60), then upgraded through apt to kernel 5.10.105 (from olimex repository).
Maybe it comes from something you changed in the latest image (other than the kernel) that is not provided as an update in your apt repository? It's just a possible cause that you could try to reproduce.
I'd be able to give more detail on the workload on these boards if necessary (in the following days: I'm not close to the boards right now). But it would be complicated to reproduce.
2 of them are k3s nodes (a Kubernetes lightweight distribution), running various pods: wordpress, mariadb, prometheus, munin. The CPU usage is not very high on them.
The 2 other ones are running Jenkins, which consumes a lot of CPU on startup.
They all have the heatsink mentioned above and are inside the official case https://www.olimex.com/Products/OLinuXino/A64/BOX-A64-BLACK/ (but I keep the cases open), with no other device plugged into them.
Maybe we missed something. Is it possible to test with the unmodified image on one of the overheating boards and compare the behavior? The base image from here:
I'll test that in the following days (probably tomorrow).
On your side, you could try installing your previous image (with kernel 5.10.60), upgrading it with apt (to get kernel 5.10.105), and running your stress test on it?
I reinstalled one of my boards with latest A64-OLinuXino-bullseye-minimal-20220321-223544.img (kernel 5.10.105), as you suggested, on another SD-Card.
After installing and running the same workload on it, I see the CPU temperature rising above 80°C, and after a few minutes, it shuts down with the following message:
kernel:[ 1693.680831] thermal thermal_zone0: critical temperature reached (91 C), shutting down
After switching back to the original SD-Card (with kernel 5.10.60 + "dirty pipe" patch), and running the same workload again on it, the CPU temperature does not go above ~75°C (at some point, with 100% CPU on all cores), and the board does not shut down.
This is done in a reproducible way, as the SD-cards were installed with the same Ansible playbook (to install k3s on them), and the same pods were scheduled on the board.
What consumes the most CPU is a mix of the following processes: unpigz (to extract container images), containerd, munin-*, coolwsd and coolforkit (parts of Collabora CODE). It lasts a few minutes (less than 5), then CPU usage drops.
But I suppose it does not depend on the processes/technology. What is the stress test tool you are using? I might try to run the same here
I did some testing too. Whether I run apt update and apt upgrade from the previous image to the latest one, or use the latest image directly, the tests give the same results. So I'd suggest the method of acquiring the image is not the issue.
Now about running the test:
I use the following:
stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1000s &
watch -n1 cat /sys/devices/virtual/thermal/thermal_zone0/temp
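A note on reading the output: the temp file reports millidegrees Celsius, so a value of 90000 corresponds to the 90 C in the shutdown message. The following is only a minimal sketch of a helper that prints whole degrees; the function names are hypothetical, and the default path is simply the one used in the command above.

```sh
#!/bin/sh
# Hypothetical helper: convert a sysfs thermal reading (millidegrees Celsius)
# into whole degrees Celsius (integer division, so 71234 becomes 71).
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Read a thermal zone temp file and print its value in whole degrees Celsius.
# $1: path to the temp file (defaults to the path used in this thread).
read_temp_c() {
    zone_file="${1:-/sys/devices/virtual/thermal/thermal_zone0/temp}"
    millideg_to_c "$(cat "$zone_file")"
}
```

With these defined, `watch -n1 cat ...` could be replaced by a loop calling `read_temp_c` once per second.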
I tested so far three A64 boards that I have around, revisions E, F, G. Two of them don't overheat, they don't go above 80 degrees. The third one however ends up with the same message as you get! It is the hardware revision F board!
What are the hardware revisions of the boards that shut down?
All the A64-OLinuXino-2Ge8G-IND boards I have here are revision E (not F), according to what is written on them: Rev.E (c) 2017
I tested your stress tool, and have the same behavior as with my workload:
- with kernel 5.10.60, the temperature does not exceed 70 to 75°C
- with kernel 5.10.105, it reaches 90°C in a few seconds, and shuts down
We have to think of a solution. Previously the power for the CPU core was too low, and some boards were suddenly restarting due to lack of power when needed. Now that we have increased it to suitable levels, some boards overheat and shut down during prolonged operation.
As a temporary fix you can lower the max clock to keep the temperature lower, for example set it 20% lower to 816 MHz (instead of the default 1056 MHz):
echo 816000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
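Note that 20% below 1056 MHz is 844.8 MHz, and the value written should be one of the frequencies the driver supports. As a rough sketch, the helper below picks the highest supported frequency that does not exceed a ceiling. The function name is hypothetical, and it is an assumption that this kernel exposes scaling_available_frequencies (a standard cpufreq attribute when the driver uses a frequency table). A value written to scaling_max_freq does not survive a reboot.

```sh
#!/bin/sh
# Print the highest frequency (kHz) from a space-separated list that does not
# exceed the requested ceiling. Prints nothing if no entry qualifies.
# $1: ceiling in kHz, $2: list of available frequencies in kHz
pick_max_freq() {
    ceiling="$1"
    best=""
    for f in $2; do
        if [ "$f" -le "$ceiling" ]; then
            if [ -z "$best" ] || [ "$f" -gt "$best" ]; then
                best="$f"
            fi
        fi
    done
    if [ -n "$best" ]; then
        echo "$best"
    fi
}

# Possible usage as root (untested on these boards):
# cpufreq=/sys/devices/system/cpu/cpu0/cpufreq
# avail=$(cat "$cpufreq/scaling_available_frequencies")
# freq=$(pick_max_freq 844800 "$avail")
# [ -n "$freq" ] && echo "$freq" > "$cpufreq/scaling_max_freq"
```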
I'm seeing random stalls (see the next thread) after upgrading to .105, but I'm not seeing the symptoms described above. When putting load on the board, the temperature rises up to 70 degrees, then cooling_device0 gets into state 1 and the temperature reliably drops back.
@mossroy, could you please wait until the board reaches 70 degrees or so, then run the following?
cat /sys/class/thermal/cooling_device0/cur_state /sys/class/thermal/thermal_zone*/temp
I saw that a new kernel has been released by Olimex. A new version 5.10.105 where the voltage changes have been reverted (based on http://images.olimex.com/changelog.txt)
I've quickly tested it on my A64 boards: their behavior has improved significantly, but I still manage to reproduce the unexpected shutdown.
With my usual workload, the temperature is around 60-70°C and sometimes reaches around 80°C, but the board does not shut down.
But if I run the stress tool for a few minutes, the board temperature is around 80°C, slowly increases, reaches 90°C, and shuts down with the usual error message:
kernel:[ 2358.596960] thermal thermal_zone0: critical temperature reached (90 C), shutting down
@jch : with this newer kernel, when running the stress tool, here is the output of your command:
$ cat /sys/class/thermal/cooling_device0/cur_state /sys/class/thermal/thermal_zone*/temp
1. This is with default settings and aluminum radiator over the CPU?
2. How much time did it take to reach 90?
1. It is with default settings from image A64-OLinuXino-bullseye-minimal-20211130-145129.img (kernel 5.10.60), upgraded to latest kernel through apt. With the aluminium heatsink https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/ on the CPU, and inside the official case https://www.olimex.com/Products/OLinuXino/A64/BOX-A64-BLACK/ , in an apartment room at ~22°C
2. From a cold (unplugged) device, it took 10 minutes to reach 90°C (and shut down) with the stress tool
NB: if I leave the case open (so that heat is not trapped inside), it takes a bit more time but also reaches 90°C
We did a lot of tests with revision E boards in the same box as you with the same aluminum radiator, with this image http://images.olimex.com/release/a64/A64-OLinuXino-bullseye-minimal-20220413-094751.img.7z
It never reaches 90 degrees, we saw 85 degrees tops.
Can you test without the box? Maybe the box is the reason?
I just tested without the box.
With the stress test tool, it quickly reaches 80°C, then 85°C, then slowly increases, and finally reaches 90°C and shuts down. It took around 10 minutes.
Here is a photo of this board, with its heatsink: http://mossroy.free.fr/olimex/IMG_20220424_135434.jpg
I left the spacers below, so that there is some air under the board : http://mossroy.free.fr/olimex/IMG_20220424_141444.jpg
It's spring in France, and the apartment is not over-heated.
Even if I doubt it makes a difference, could you test with A64-OLinuXino-bullseye-minimal-20211130-145129.img + kernel upgrade?
We did, and it does not behave like this at all here. We tested for a few days.
Mossroy, that's weird: I'm running with no heat sink, and the board doesn't overheat, it correctly slows down the CPU when it reaches 70°C. The test you did previously indicates that the thermal zone mechanism is working properly for you (/sys/class/thermal/cooling_device0/cur_state increases when the board heats).
What about the CPU speed? What happens if you run
sudo watch cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
You should see the CPU frequency decrease as the temperature crosses 70°C.
If the CPU speed does drop, then it indicates that for some reason slowing down the CPU is not enough to bring the temperature down in your case. Perhaps you're putting load on the GPU, or on some other component?
I have no HDMI cable plugged into my board. I do have my usual processes eating some CPU on it. They use the network (and the microSD card filesystem), but I doubt they use the GPU.
I've run the stress test again (with no case, as in the photos), while running the watch command suggested by @jch (I also added the cur_state value):
sudo watch cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq /sys/class/thermal/cooling_device0/cur_state
Before running the stress command-line (with my usual workload), the output is:
Then I run the stress tool. After less than 10 seconds, it goes above 70°C, which reduces the frequency:
After 1 minute and 30 seconds, it reaches 80°C, which reduces the frequency again:
After 5 minutes:
After 7 minutes:
After 9 minutes:
After 10 minutes:
Message from syslogd@srv11 at May 6 13:08:07 ...
kernel:[1034171.930899] thermal thermal_zone0: critical temperature reached (90 C), shutting down
If you believe it's necessary, I might reinstall the board again, do not put my usual workload on it, and run the stress test again. Should I even record a video?
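Instead of noting readings by hand at intervals, the same three values could be logged automatically so the whole run up to the shutdown is kept. This is only a sketch: the function name and the CSV layout are my own choices, and it reads exactly the three sysfs files from the watch command above (cpuinfo_cur_freq normally needs root, hence the sudo there).

```sh
#!/bin/sh
# Print one CSV row: epoch seconds, temperature (millidegrees C),
# CPU frequency (kHz), cooling device state.
# $1: optional sysfs root (defaults to /sys); overridable for dry runs.
log_thermal_row() {
    root="${1:-/sys}"
    temp=$(cat "$root/class/thermal/thermal_zone0/temp")
    freq=$(cat "$root/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq")
    state=$(cat "$root/class/thermal/cooling_device0/cur_state")
    echo "$(date +%s),$temp,$freq,$state"
}

# Possible usage as root, one row every 10 seconds; sync so the last rows
# are on disk when the board powers off:
# while true; do log_thermal_row >> /root/thermal.csv; sync; sleep 10; done
```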
My board behaves like yours, except that bringing down the frequency to 900MHz reliably stops the board from overheating. I'm running it without a box.
Perhaps some DT guru can indicate how to build an overlay to make the board slow down even more when it reaches 80°C?
The fact that the board shuts down is a big problem because the board does not provide any service any more, and requires a manual intervention to restart.
Whatever the workload or usage conditions, it should not shut down IMHO (provided that the external temperature is within the industrial range officially supported by the 2Ge8G-IND board : -40°C to 85°C based on https://www.olimex.com/Products/OLinuXino/A64/A64-OLinuXino/open-source-hardware, and the usage conditions are "reasonable" like using the official case)
My scenario puts a maximal workload, but in "normal" conditions (temperature of an apartment in spring).
As suggested by @jch, maybe the frequency should be reduced more when reaching 80°C. Or (if it's technically possible), an extra step could be added at 85°C to reduce the frequency more significantly?
Well, you can lower it further, as I suggested before:
"As a temporary fix you can lower the max clock to keep the temperature lower, for example set it 20% lower to 816 MHz (instead of the default 1056 MHz):
echo 816000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"
Maybe try even more conservative values like 648000 or 480000.
@LubOlimex, this is not about the max frequency, which is fine as it is. It is about how much the board throttles when it overheats. The former is handled by cpufreq, the latter is handled by the thermal system, the two are pretty much orthogonal.
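To see what the thermal system is configured to do, the trip points of the zone can be read from sysfs through the standard trip_point_N_temp and trip_point_N_type attributes. A minimal sketch follows; the function name is hypothetical, and which trip points actually exist on this board is not something I have checked.

```sh
#!/bin/sh
# List the trip points of a thermal zone as "index type degrees_C" lines.
# $1: thermal zone directory (defaults to thermal_zone0).
list_trip_points() {
    zone="${1:-/sys/class/thermal/thermal_zone0}"
    i=0
    while [ -r "$zone/trip_point_${i}_temp" ]; do
        kind=$(cat "$zone/trip_point_${i}_type")
        temp=$(cat "$zone/trip_point_${i}_temp")
        # sysfs reports millidegrees Celsius
        echo "$i $kind $(( temp / 1000 ))"
        i=$(( i + 1 ))
    done
}
```

Running `list_trip_points` on an affected board would show at which temperatures throttling steps are defined and where the critical shutdown sits.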
It sounds like Linux has not got the throttling right, in which case report it as a bug to the distro / etc.
@LubOlimex: as you say, it can only be a temporary workaround. I would indeed probably need to lower the frequency to 648000 or 480000, as the temperature is still rising at 816 MHz. And things will probably get worse when summer arrives.
I agree with @jch that the need is to tune the throttling.
Lowering the CPU max frequency like @LubOlimex suggests would (of course) kill its performance, and is not necessary when it's not overheating (i.e. when the CPU is not working a lot for a long time)
@JohnS I don't see how this could be a bug of the distro. We're talking about configuring the kernel CPU thermal throttling, as Olimex already did in the past.
The problem for the moment is this:
- if we enable a proper governor with which the board would never hang, then the processor bug with the clock would manifest, i.e. the time would jump forward
- if we apply the fix for the time jump, we can't use a throttling governor, only the performance one, and the board might overheat
If you are not worried about the time jump, you can use the more conservative governors.
@LubOlimex, I understand.
But it might not be necessary to switch to a different governor.
Currently, I see that the CPU frequency is reduced when its temperature reaches certain thresholds: 70°C => 912 MHz and 80°C => 816 MHz
I suppose this might be tuned, without switching to another governor.
Ideally (if it's possible), I would add a new threshold, like:
70°C => 912 MHz, 80°C => 816 MHz and 85°C => 480 MHz (for example)
If it's not possible, changing the frequency for the second threshold should work, too, like:
70°C => 912 MHz and 80°C => 648 MHz (for example)
NB: having the time jump issue again is not an option, at least for me
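To make the first proposal concrete, here is the three-step table written as a small function. It is only a userspace illustration of the mapping, not how the kernel does it: the kernel's throttling comes from the thermal zone configuration, and a loop like the commented one would be a stopgap at best. The function name is hypothetical, the values are the example ones above, and it is an assumption that 480000 is a frequency this board supports.

```sh
#!/bin/sh
# Map a temperature (whole degrees C) to a frequency cap in kHz, following
# the proposed table: 70 => 912 MHz, 80 => 816 MHz, 85 => 480 MHz.
# Below 70 the default maximum (1056 MHz) is kept.
cap_for_temp() {
    t="$1"
    if [ "$t" -ge 85 ]; then
        echo 480000
    elif [ "$t" -ge 80 ]; then
        echo 816000
    elif [ "$t" -ge 70 ]; then
        echo 912000
    else
        echo 1056000
    fi
}

# Possible stopgap loop as root (untested on these boards):
# while true; do
#     t=$(( $(cat /sys/class/thermal/thermal_zone0/temp) / 1000 ))
#     cap_for_temp "$t" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
#     sleep 5
# done
```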
@LubOlimex, when will Olimex release a kernel with a solution for this issue?
Summer is here, and my boards are sometimes shutting down because of this.
To me, it would probably work to "only" add a new temperature threshold in the current governor, or (at worst) lower the frequency for the last threshold. See my last post for more detail.
Always lowering the maximum CPU frequency could only be, as you said, a "temporary workaround".