Unexpected shutdowns after upgrade to kernel 5.10.105

Started by mossroy, March 26, 2022, 02:36:41 PM

mossroy

1. It is with default settings from image A64-OLinuXino-bullseye-minimal-20211130-145129.img (kernel 5.10.60), upgraded to latest kernel through apt. With the aluminium heatsink https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/ on the CPU, and inside the official case https://www.olimex.com/Products/OLinuXino/A64/BOX-A64-BLACK/ , in an apartment room at ~22°C

2. From a cold (unplugged) device, it took 10 minutes to reach 90°C (and shut down) with the stress tool

NB: if I leave the case open (so that heat is not trapped inside), it takes a bit longer but still reaches 90°C

LubOlimex

We did a lot of tests with revision E boards in the same box as you with the same aluminum radiator, with this image http://images.olimex.com/release/a64/A64-OLinuXino-bullseye-minimal-20220413-094751.img.7z

It never reaches 90 degrees, we saw 85 degrees tops.

Can you test without the box? Maybe the box is the reason?


Technical support and documentation manager at Olimex

mossroy

I just tested without the box.
With the stress test tool, it quickly reaches 80°C, then 85°C, then slowly increases, and finally reaches 90°C and shuts down. It took around 10 minutes.

Here is a photo of this board, with its heatsink: http://mossroy.free.fr/olimex/IMG_20220424_135434.jpg
I left the spacers below, so that there is some air under the board : http://mossroy.free.fr/olimex/IMG_20220424_141444.jpg

It's spring in France, and the apartment is not over-heated.

Even if I doubt it makes a difference, could you test with A64-OLinuXino-bullseye-minimal-20211130-145129.img + kernel upgrade?

LubOlimex

We did; it does not behave like this at all here. We tested for a few days.

jch

Mossroy, that's weird: I'm running with no heat sink, and the board doesn't overheat, it correctly slows down the CPU when it reaches 70°C.  The test you did previously indicates that the thermal zone mechanism is working properly for you (/sys/class/thermal/cooling_device0/cur_state increases when the board heats).

What about the CPU speed?  What happens if you run
sudo watch cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
You should see the CPU frequency decrease as the temperature crosses 70°C.

If the CPU speed does drop, then it indicates that for some reason slowing down the CPU is not enough to bring the temperature down in your case.  Perhaps you're putting load on the GPU, or on some other component?

mossroy

I have no HDMI cable plugged into my board. I do have my usual processes eating some CPU on it. They use the network (and the microSD card filesystem), but I doubt they use the GPU.

I've run the stress test again (with no case, as in the photos), while running the watch command suggested by @jch (I also added the cur_state value):
sudo watch cat /sys/class/thermal/thermal_zone0/temp /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq /sys/class/thermal/cooling_device0/cur_state
Before running the stress command-line (with my usual workload), the output is:
50758
960000
0

Then I run the stress tool. After less than 10 seconds, it goes above 70°C, which reduces the frequency:
71701
912000
1

After 1 minute and 30 seconds, it reaches 80°C, which reduces the frequency again:
80242
816000
2

After 5 minutes:
86326
816000
2

After 7 minutes:
87496
816000
2

After 9 minutes:
88432
816000
2

After 10 minutes:
89485
816000
2

and finally:
Message from syslogd@srv11 at May  6 13:08:07 ...
 kernel:[1034171.930899] thermal thermal_zone0: critical temperature reached (90 C), shutting down

If you believe it's necessary, I can reinstall the board, leave my usual workload off it, and run the stress test again. Should I also record a video?
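
For longer runs, the same three values can be logged instead of being read off the watch output by hand. This is only a minimal sketch, assuming the sysfs paths from the watch command above. The function name thermal_snapshot and the SYSFS_ROOT override are made up for illustration (the override exists only so the function can be tried against a fake directory tree).

```shell
#!/usr/bin/env bash
# Sketch: print one timestamped line with temperature, CPU frequency and
# cooling state. SYSFS_ROOT is a hypothetical override, not a kernel feature.

thermal_snapshot() {
    local root="${SYSFS_ROOT:-/sys}"
    local temp freq state
    temp=$(cat "$root/class/thermal/thermal_zone0/temp") || return 1
    freq=$(cat "$root/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq") || return 1
    state=$(cat "$root/class/thermal/cooling_device0/cur_state") || return 1
    # temp is reported in millidegrees Celsius, freq in kHz
    printf '%s temp=%d.%03dC freq=%dMHz cooling_state=%s\n' \
        "$(date +%T)" "$((temp / 1000))" "$((temp % 1000))" \
        "$((freq / 1000))" "$state"
}

# Example (as root, since cpuinfo_cur_freq is normally readable by root only):
#   while thermal_snapshot; do sleep 10; done | tee thermal.log
```

The resulting log can be attached to a post as-is, which makes it easier to compare two boards than a series of hand-copied readings.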

jch

My board behaves like yours, except that bringing down the frequency to 900MHz reliably stops the board from overheating.  I'm running it without a box.

Perhaps some DT guru can indicate how to build an overlay to make the board slow down even more when it reaches 80°C?

mossroy

The fact that the board shuts down is a big problem because the board does not provide any service any more, and requires a manual intervention to restart.

Whatever the workload or usage conditions, it should not shut down IMHO (provided that the external temperature is within the industrial range officially supported by the 2Ge8G-IND board : -40°C to 85°C based on https://www.olimex.com/Products/OLinuXino/A64/A64-OLinuXino/open-source-hardware, and the usage conditions are "reasonable" like using the official case)
My scenario puts a maximal workload, but in "normal" conditions (temperature of an apartment in spring).

As suggested by @jch, maybe the frequency should be reduced more when reaching 80°C. Or (if it's technically possible), an extra step could be added at 85°C to reduce the frequency more significantly?

LubOlimex

Well, you can lower it further, as I suggested before:

"As a temporary fix you can lower the max clock to keep the temperature lower, for example set it roughly 20% lower, to 816 MHz (instead of the default 1056 MHz):

echo 816000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
"

Maybe try even more conservative values like 648000 or 480000.
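
Before writing a different value, it can help to check that the value is one the cpufreq driver actually offers. A minimal sketch, assuming this kernel exposes scaling_available_frequencies next to scaling_max_freq (table-based cpufreq drivers usually do, but that is not confirmed in this thread). The function name set_max_freq and the CPUFREQ_DIR override are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: set scaling_max_freq only if the value is listed as available.
# CPUFREQ_DIR is a hypothetical override used for trying this on a fake tree.

set_max_freq() {
    local khz="$1"
    local dir="${CPUFREQ_DIR:-/sys/devices/system/cpu/cpu0/cpufreq}"
    local f
    for f in $(cat "$dir/scaling_available_frequencies"); do
        if [ "$f" = "$khz" ]; then
            echo "$khz" > "$dir/scaling_max_freq"
            return 0
        fi
    done
    echo "set_max_freq: $khz kHz is not an available frequency" >&2
    return 1
}

# Example (from a root shell): set_max_freq 648000
```

The write has to happen in a root shell: with "sudo echo ... > file" the redirection is performed by the calling, unprivileged shell and fails. The setting is also lost on reboot.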

jch

@LubOlimex, this is not about the max frequency, which is fine as it is.  It is about how much the board throttles when it overheats.  The former is handled by cpufreq, the latter is handled by the thermal system, the two are pretty much orthogonal.
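
To see what the thermal side is configured to do, the trip points of the zone can be listed from sysfs. A minimal sketch, assuming the zone exposes the usual trip_point_N_temp and trip_point_N_type attributes; the function name list_trip_points is made up, and the actual values on a given kernel may differ from the 70/80/90°C observed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list the trip points of one thermal zone (default: thermal_zone0).

list_trip_points() {
    local zone="${1:-/sys/class/thermal/thermal_zone0}"
    local f n temp kind
    for f in "$zone"/trip_point_*_temp; do
        [ -e "$f" ] || continue
        n=${f##*/trip_point_}   # strip directory and prefix
        n=${n%_temp}            # strip suffix, leaving the trip index
        temp=$(cat "$f")
        kind=$(cat "$zone/trip_point_${n}_type")
        # temperatures are stored in millidegrees Celsius
        printf 'trip %s: %d C (%s)\n' "$n" "$((temp / 1000))" "$kind"
    done
}

# Example: list_trip_points
```

Passive trips are the ones that drive throttling through the cooling device; the critical trip is the one that triggers the shutdown message shown earlier.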

JohnS

It sounds like Linux has not got the throttling right, in which case report it as a bug to the distro / etc.

John

mossroy

@LubOlimex: as you say, it can only be a temporary workaround. I would indeed probably need to lower the frequency to 648000 or 480000, as the temperature is still rising at 816000 kHz (816 MHz). And things will probably get worse when summer arrives.

I agree with @jch that the need is to tune the throttling.
Lowering the CPU max frequency like @LubOlimex suggests would (of course) kill its performance, and is not necessary when it's not overheating (i.e. when the CPU is not working a lot for a long time)

@JohnS I don't see how this could be a bug of the distro. We're talking about configuring the kernel CPU thermal throttling, as Olimex already did in the past.

LubOlimex

The problem for the moment is this:

- if we enable a proper throttling governor, with which the board would never hang, then the processor's clock bug manifests, i.e. the system time jumps forward

- if we apply the fix for the time jump, we cannot use a throttling governor, only the performance one, and the board might overheat

If you are not worried about the time jump, you can use the more conservative governors.

mossroy

@LubOlimex, I understand.

But it might not be necessary to switch to a different governor.

Currently, I see that the CPU frequency is reduced when its temperature reaches certain thresholds: 70°C => 912 MHz and 80°C => 816 MHz

I suppose this might be tuned, without switching to another governor.

Ideally (if it's possible), I would add a new threshold, like:
70°C => 912 MHz, 80°C => 816 MHz and 85°C => 480 MHz (for example)

If it's not possible, changing the frequency for the second threshold should work, too, like:
70°C => 912 MHz and 80°C => 648 MHz (for example)

NB: having the time jump issue again is not an option, at least for me
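
Until the thresholds can be changed in the kernel, the extra 85°C step proposed above could be approximated from user space by lowering scaling_max_freq only while the board is hot. This is a rough sketch and a stopgap, not the kernel-side fix: it polls, so it reacts more slowly than the thermal framework would. The names next_cap, throttle_step and SYSFS_ROOT are hypothetical, the 80°C release point is an assumption added to avoid flapping, and 480000 and 1056000 are the example and default values mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch: cap the CPU at 480 MHz from 85 C upwards, release the cap below 80 C.

# Decide the frequency cap (kHz) from the temperature (millidegrees C)
# and the cap currently in place.
next_cap() {
    local temp_mc="$1"
    local current_khz="$2"
    if [ "$temp_mc" -ge 85000 ]; then
        echo 480000            # hot: apply the extra throttling step
    elif [ "$temp_mc" -lt 80000 ]; then
        echo 1056000           # cooled down: restore the default maximum
    else
        echo "$current_khz"    # between 80 and 85 C: keep the current cap
    fi
}

# Read the sensors once and update scaling_max_freq if the cap changed.
throttle_step() {
    local root="${SYSFS_ROOT:-/sys}"
    local policy="$root/devices/system/cpu/cpu0/cpufreq"
    local temp current wanted
    temp=$(cat "$root/class/thermal/thermal_zone0/temp") || return 1
    current=$(cat "$policy/scaling_max_freq") || return 1
    wanted=$(next_cap "$temp" "$current")
    if [ "$wanted" != "$current" ]; then
        echo "$wanted" > "$policy/scaling_max_freq"
    fi
}

# Example (as root): while throttle_step; do sleep 2; done
```

The existing 70°C and 80°C steps stay under kernel control; the loop only adds the last step, so performance is untouched as long as the board stays below 85°C.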

mossroy

@LubOlimex, when will Olimex release a kernel with a solution for this issue?

Summer is here, and my boards are sometimes shutting down because of this.

To me, it would probably work to "only" add a new temperature threshold in the current governor, or (at worst) lower the frequency for the last threshold. See my last post for more detail.

Always lowering the maximum CPU frequency could only be, as you said, a "temporary workaround".