Board restarts because of CPU overheat, when on high load

Started by mossroy, March 07, 2020, 10:54:14 pm


mossroy

When the CPU load is very high (on all 4 cores), the CPU temperature rises above 100°C (according to the "sensors" command).
When it reaches 110°C, the board automatically restarts.

I would expect the board to be stable, i.e. to do whatever is necessary to keep the CPU temperature reasonable and avoid the reboot, including reducing the CPU frequency if necessary (or any other solution).

I saw on https://linux-sunxi.org/Linux_mainlining_effort that the "Thermal" and "DVFS" features were merged into mainline kernel 5.6. It also seems that something similar has been committed to the latest version of upstream Armbian (version 20.02): see https://github.com/armbian/build/commit/ac8291987aeb92ecbcc018692d49a756fdf2a38e
Would it be a solution? If so, when will the official Olimex images have this feature?

I'm using an A64-OLinuXino-2Ge8G-IND board with the latest Armbian_5.92.1_Olinuxino-a64_Debian_buster_next_5.2.5.7z image. It's in a room at around 22°C, without any fan or heatsink (and I would like to keep the board fanless).

LubOlimex

Technical support and documentation manager at Olimex

mossroy

I'll try, but it requires reinstalling a board, which I cannot easily do for now.

But I could compare the behavior with a recent upstream Armbian image.

I ran this simple program (with 4 threads, in a bash loop of 10 iterations) to use all the CPU: https://gist.github.com/AstromechZA/38ab4d097aecf12bcc372575eb01613c
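For readers who want a comparable load without building the linked program, here is a minimal Python sketch. It is not the gist above, only a rough stand-in: it uses processes rather than threads (an assumption made so that all 4 cores are really busy), and the names burn and load_all_cores and the default durations are made up for illustration.

```python
import multiprocessing as mp
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1
    return count


def load_all_cores(workers: int = 4, seconds: float = 60.0, rounds: int = 10) -> list:
    """Run `rounds` passes, each keeping `workers` processes busy for `seconds`.

    The defaults (4 workers, 10 rounds) mirror the test described in the post;
    the 60 s duration per round is an assumption.
    """
    results = []
    for _ in range(rounds):
        with mp.Pool(processes=workers) as pool:
            results.append(pool.map(burn, [seconds] * workers))
    return results
```

Call load_all_cores() from a script, under an if __name__ == "__main__": guard so that it also works where processes are spawned rather than forked, and watch the temperature in a second terminal.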

With upstream Armbian 20.02 (on kernel 5.4.20), it reaches 70°C.
With ftp://staging.olimex.com/Allwinner_Images/a64-olinuxino/linux/1.latest_images/buster/images/Armbian_5.92.1_Olinuxino-a64_Debian_buster_next_5.2.5.7z , it exceeds 70°C in a few seconds, and reaches 110°C.

I'm wondering whether this could come from an incorrect temperature reading?

mossroy

@LubOlimex : I finally ran the test you suggested.

TL;DR: the detected overheating and reboot occur on both Armbian 5.92 images distributed by Olimex, but not on the latest Armbian (20.02.7), nor on the old a64olinuxino_ubuntu_16.04.3_20180523_rel5.

I tested with the same protocol I had used in March: running a program that uses all 4 CPU cores for a few minutes. All tests were run on the same board, on the same day.

With ftp://staging.olimex.com/Allwinner_Images/a64-olinuxino/linux/1.latest_images/bionic/images/Armbian_5.92.1_Olinuxino-a64_Ubuntu_bionic_next_5.2.5_desktop.7z, the CPU temperature is displayed when connecting with SSH, and seems to correspond to /sys/class/thermal/thermal_zone0/temp. When idle, it's around 53°C. With all cores loaded, it quickly goes to 80°C, then reaches 110°C and the board reboots.

I tested again with ftp://staging.olimex.com/Allwinner_Images/a64-olinuxino/linux/1.latest_images/buster/images/Armbian_5.92.1_Olinuxino-a64_Debian_buster_next_5.2.5.7z, looking at /sys/class/thermal/thermal_zone0/temp: when idle, it's around 55°C. With all cores loaded, it quickly goes to 80°C, then reaches 110°C and the board reboots (same result as in March, with a freshly installed image).

When the board reboots, there is this kernel message:
 kernel:[  661.632645] thermal thermal_zone0: critical temperature reached (110 C), shutting down
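
To check after the fact whether a reboot was caused by this critical trip, a saved kernel log can be searched for the message. A minimal Python sketch follows; the pattern is modelled only on the single line quoted above, so other kernel versions may word the message differently, and the name find_critical_events is made up for illustration.

```python
import re

# Pattern built from the one log line quoted in this thread (an assumption:
# other kernels may phrase the message differently).
CRITICAL_RE = re.compile(
    r"thermal (?P<zone>\S+): critical temperature reached \((?P<temp>-?\d+) C\)"
)


def find_critical_events(log_text: str) -> list:
    """Return (zone, temperature_C) pairs for every critical-trip line found."""
    return [
        (m.group("zone"), int(m.group("temp")))
        for m in CRITICAL_RE.finditer(log_text)
    ]
```

If persistent journaling is enabled, the output of "journalctl -k -b -1" (kernel messages of the previous boot) is one possible input for it.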


With ftp://staging.olimex.com/Allwinner_Images/a64-olinuxino/linux/2.archived_images/ubuntu/a64olinuxino_ubuntu_16.04.3_20180523_rel5.zip , "cat /sys/class/thermal/thermal_zone0/temp" reports something that looks like a temperature. When idle, it reports 57. With all cores loaded, it quickly goes to 80 and reaches 92, but does not go higher.

With Armbian_20.02.7_Lime-a64_buster_current_5.4.28 (latest from https://dl.armbian.com/lime-a64/Buster_current), looking at /sys/class/thermal/thermal_zone0/temp : when idle, it's around 34°C. With all CPUs used, it quickly goes to 50°C, and reaches 70°C.
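
For anyone who wants to reproduce these readings, here is a minimal Python sketch that reads the same sysfs file. Mainline kernels report this value in millidegrees Celsius. Treating values below 1000 as whole degrees is an assumption made here so that a reading such as the "57" from the old image is handled too. The names to_celsius and read_temp are made up for illustration.

```python
from pathlib import Path

# Path quoted in this thread; other boards may expose more than one zone.
ZONE0 = Path("/sys/class/thermal/thermal_zone0/temp")


def to_celsius(raw: str) -> float:
    """Convert a raw sysfs reading to degrees Celsius.

    Mainline kernels report millidegrees (e.g. "55000"). Assumption: a value
    below 1000 is already in whole degrees, as the old image appears to do.
    """
    value = int(raw.strip())
    return value / 1000.0 if abs(value) >= 1000 else float(value)


def read_temp(path: Path = ZONE0) -> float:
    """Read the current temperature of one thermal zone, in degrees Celsius."""
    return to_celsius(path.read_text())
```

Calling read_temp() once per second during the load test gives a simple temperature curve to compare between images.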

I don't know whether the overheating is real or the temperature is not detected correctly. But, in any case, the reboot under high load is a big issue.
I think Olimex needs to provide a newer image with this issue fixed.

JohnS

It's clear from linux-sunxi ML that thermal management support needs considerable work.

Maybe figure out your own safe limits and use those or else contribute to linux-sunxi for everyone.

John
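
One way to read "figure out your own safe limits" is a small userspace policy that maps the measured temperature to a CPU frequency cap. The Python sketch below only shows the selection logic. The thresholds and frequencies in EXAMPLE_STEPS are hypothetical illustration values, not limits from Olimex, Armbian or linux-sunxi, and the name pick_max_freq_khz is made up.

```python
def pick_max_freq_khz(temp_c: float, steps: list) -> int:
    """Return the frequency cap of the highest threshold that temp_c has reached.

    `steps` is a list of (threshold_C, max_freq_khz) pairs. The entry with the
    lowest threshold acts as the default when no threshold is reached.
    """
    ordered = sorted(steps)
    cap = ordered[0][1]
    for threshold, freq in ordered:
        if temp_c >= threshold:
            cap = freq
    return cap


# Hypothetical limits for illustration only; real values must come from testing.
EXAMPLE_STEPS = [(0, 1152000), (75, 816000), (85, 648000), (95, 480000)]
```

Applying the result would typically mean writing it, as root, to the cpufreq scaling_max_freq file in sysfs. That step is left out here because the exact path and the available frequencies depend on the kernel in use.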

mossroy

It seems that the "safe limits" of the latest Armbian are fine, and that its kernel includes significant thermal management improvements.
In my tests, these improvements are enough to keep the board stable under high load.

I can switch to this Armbian image, but I would prefer to use an official image from Olimex if possible, in order to have their full support.

JohnS

Maybe send an email to their support pointing it out, and hope they'll update.

They probably will, but as to when...

John

mossroy

Thanks JohnS.

From previous experience, the support e-mail did not seem very effective for this kind of request (see comments in https://olimex.wordpress.com/2019/03/08/a64-olinuxinogot-mainline-linux-kernel-5-0-images/).

I seem to get more answers from Olimex here, and it's also a way to share the discussion with other users/customers (like you, I suppose).

JohnS


LubOlimex

I am following this thread and we are investigating. It seems that some of the default values are high. We will try to improve this in future releases. Overall, the A64 gets hot even with older images, so if you expect heavy stress on all cores it might be a good idea to also consider some cooling. Even a simple piece of aluminium seems to lower the temperature by around 10°C.

mossroy

Thanks for your answer.

I don't expect heavy stress for long periods. But, in my case, it simply happens on startup. I run https://jenkins.io/ on it, which uses a lot of CPU on startup (at least on first startup, or after an upgrade). So, on the affected images (see previous post), the board ends up in an endless reboot loop if I don't kill the process soon enough.

Is https://www.olimex.com/Products/Components/Misc/ALUMINIUM-HEATSINK-20x20x6MM/ the kind of "piece of aluminium" you would suggest? The page only mentions A10 and A20, not A64. Can it be used on A64 boards?

LubOlimex

Technical support and documentation manager at Olimex