Some temperature sensors are missing in recent images

Started by mossroy, March 23, 2021, 03:43:25 PM


mossroy

Older images provided by Olimex exposed several temperature sensors, in particular some coming from the CPU.

Example with a board installed with A64-OLinuXino-buster-minimal-20200601-131837.img (running kernel 5.8.18):
$ sensors
axp813_adc-isa-0000
Adapter: ISA adapter
temp1:        +35.7°C 

gpu1_thermal-virtual-0
Adapter: Virtual device
temp1:        +48.3°C 

cpu0_thermal-virtual-0
Adapter: Virtual device
temp1:        +48.0°C  (crit = +90.0°C)

axp20x_battery-isa-0000
Adapter: ISA adapter
in0:          +0.00 V 
curr1:        +0.00 A 

axp813_ac-isa-0000
Adapter: ISA adapter
in0:              N/A  (min =  +4.00 V)

gpu0_thermal-virtual-0
Adapter: Virtual device
temp1:        +48.7°C

But most of these sensors have disappeared in more recent images.

For example, on a board installed with A64-OLinuXino-buster-minimal-20201207-193928.img (upgraded to kernel 5.10.23):
$ sensors
axp813_adc-isa-0000
Adapter: ISA adapter
temp1:        +36.7°C

The CPU sensors are valuable for checking that the CPU is not overheating, particularly now that the CPU governor is set to "performance" by default.

mossroy

This is unfortunately not fixed with an upgrade to the latest kernel provided in the olimex repo (5.10.36).

Could you please fix this regression? (and/or tell us if there is a workaround)

jch

They're now managed by the thermal subsystem, which allows the kernel to throttle the CPU when it overheats.  You can get the values from sysfs:
$ for t in /sys/devices/virtual/thermal/thermal_zone*; do cat $t/type $t/temp; done
cpu0-thermal
43972
gpu0-thermal
44323
gpu1-thermal
44206
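
The sysfs values are in millidegrees Celsius, so 43972 is roughly 44°C. As a minimal sketch (not an Olimex-provided tool), the loop above can be extended to print degrees directly. The helper name millideg_to_c is made up for illustration, and it assumes non-negative readings:

```shell
#!/bin/sh
# Hypothetical helper: convert a non-negative millidegree Celsius value
# (as read from sysfs) into degrees with three decimals.
millideg_to_c() {
    printf '%d.%03d\n' "$(($1 / 1000))" "$(($1 % 1000))"
}

# Print "<zone type>: <temperature>" for every thermal zone that can be read.
for zone in /sys/devices/virtual/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    temp=$(cat "$zone/temp" 2>/dev/null) || continue
    [ -n "$temp" ] || continue
    printf '%s: %s°C\n' "$(cat "$zone/type")" "$(millideg_to_c "$temp")"
done
```

On a machine without thermal zones the loop prints nothing.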

mossroy

Thanks for the info.

Unfortunately, this is not natively supported by the tools I was happily using so far:
  • sensors command-line (even after running sensors-detect)
  • prometheus
  • glances
  • and probably more

All were working fine with previous versions of the kernel.

And all were working fine with the (unstable) mainline Debian Bullseye I managed to install (see https://www.olimex.com/forum/index.php?msg=31474), which is based on kernel 5.10.x too.

It's good news that the CPU is throttled when it overheats, but there has been a regression that seems specific to recent Olimex images.


mossroy

About the CPU throttling when overheating, it seems to be not always working.

Today, I put a very heavy load on one of my A64 boards, and it shut down with the following syslog message:
> kernel:[ 5506.549851] thermal thermal_zone0: critical temperature reached (90 C), shutting down

This board had very recently been installed with the latest image, A64-OLinuXino-buster-minimal-20210513-112230.img, with kernel 5.10.36. It's in a room at an average temperature.

I would have much preferred CPU throttling to a brutal shutdown that forces a manual restart.
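
To find out whether this has happened before, the kernel log can be searched for the same message. A rough sketch: the function name critical_events is made up for illustration, and the pattern only assumes the message format quoted above. It reads log lines on standard input and prints the zone and temperature of each critical event:

```shell
#!/bin/sh
# Hypothetical filter: extract "<zone> <temperature>" from kernel log lines
# that report a critical thermal shutdown, in the format quoted above.
critical_events() {
    sed -n 's/.*thermal \(thermal_zone[0-9]*\): critical temperature reached (\([0-9]*\) C).*/\1 \2/p'
}

# Example with the line from this thread:
printf '%s\n' 'kernel:[ 5506.549851] thermal thermal_zone0: critical temperature reached (90 C), shutting down' | critical_events
```

The same filter can be fed from a saved log, for example critical_events < /var/log/syslog, assuming the system writes kernel messages there.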

jch

> About the CPU throttling when overheating, it seems to be not always working.

Strange, it works for me. At four threads, the board will reliably throttle at 70°C, and temperature remains stable.

> kernel:[ 5506.549851] thermal thermal_zone0: critical temperature reached (90 C), shutting down

My understanding is that the kernel should start throttling the CPU at 70°C, throttle it further at 80°C, and shut it down at 90°C.  If the CPU reached 90°C, then shutting down is the correct behaviour, but it shouldn't have happened in the first place.  Could you please check the value of
cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_*_temp
It should say 70000 80000 90000; if that's not the case, there's something wrong with your device tree.

Can you please confirm that throttling is happening?  Run
watch cat /sys/devices/virtual/thermal/thermal_zone0/temp /sys/devices/virtual/thermal/cooling_device0/cur_state
You should see the second value switch to 1 when the temperature goes above 70000.
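
For reference, the three stages described above can be written down as a small script. This is only a sketch of the expected behaviour, not of what the kernel does internally: the trip_stage name and the stage labels are made up, the thresholds are the ones quoted in this post, and treating a reading exactly equal to a trip point as "reached" is an assumption.

```shell
#!/bin/sh
# Trip points quoted in this thread, in millidegrees Celsius.
PASSIVE_1=70000
PASSIVE_2=80000
CRITICAL=90000

# Hypothetical helper: name the stage a given reading falls into.
trip_stage() {
    if [ "$1" -ge "$CRITICAL" ]; then
        echo critical
    elif [ "$1" -ge "$PASSIVE_2" ]; then
        echo throttle-more
    elif [ "$1" -ge "$PASSIVE_1" ]; then
        echo throttle
    else
        echo normal
    fi
}

# Report the current reading and stage for zone 0, if it can be read.
zone=/sys/devices/virtual/thermal/thermal_zone0
if [ -r "$zone/temp" ]; then
    temp=$(cat "$zone/temp" 2>/dev/null)
    [ -n "$temp" ] && echo "$temp $(trip_stage "$temp")"
fi
```

Comparing the printed stage with cur_state from the watch command shows whether throttling kicks in where it is expected to.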

mossroy

cat /sys/devices/virtual/thermal/thermal_zone0/trip_point_*_temp
gives your expected output.

Regarding the throttling, I'll have to generate a heavy load again to check.

mossroy

After generating some heavy load again, I confirm that your second value switches to 1 each time the temperature exceeds 70000, and switches back to 0 when it comes back below.

jch

Then it looks like everything is working as intended. I've got no explanation for what could have gone wrong before.  If you find a reliable way to reproduce the issue, I'll be glad to have a look.

mossroy

The heavy load was produced by a compilation, probably multi-threaded on the 4 cores.

My board was inside the official metal box https://www.olimex.com/Products/OLinuXino/A64/BOX-A64-BLACK/ , with a heatsink https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/ , in an apartment room with average spring temperature.