A64 stalls under network load with kernel .105

Started by jch, April 08, 2022, 11:44:08 pm

jch

April 08, 2022, 11:44:08 pm Last Edit: April 09, 2022, 12:47:44 am by jch
Hi,

Ever since upgrading to kernel .105, I'm seeing my A64 board fail under heavy network load (routing between Ethernet and WiFi).  The serial shell is still responsive, but the board is no longer routing until I reboot it.  The little console consoles itself with messages such as this:
[ 1841.000879] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... 1-... 2-... } 37157 jiffies s: 105 root: 0x7/.
[ 1841.012496] rcu: blocking rcu_node structures:
[ 1890.618734] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1890.624321] rcu:     1-....: (54120 ticks this GP) idle=27e/1/0x4000000000000002 softirq=6037/6037 fqs=26244
[ 1904.486155] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... 1-... 2-... } 53029 jiffies s: 105 root: 0x7/.
[ 1904.497762] rcu: blocking rcu_node structures:
[ 1953.628024] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1953.633612] rcu:     1-....: (69871 ticks this GP) idle=27e/1/0x4000000000000002 softirq=6037/6037 fqs=34118

LubOlimex

This looks like the problem I wrote about here; it was supposed to be fixed:

https://www.olimex.com/forum/index.php?topic=8643.msg33463#msg33463

1) What do these three commands return:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

2) Are you using the latest image from here: https://images.olimex.com/release/a64/
Technical support and documentation manager at Olimex
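
For reference, the three reads requested above can be collected for every core in one pass. This is only a minimal sketch, assuming the standard Linux cpufreq sysfs layout; the function name show_cpufreq and its optional root-directory argument are made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# Print current frequency, maximum frequency and governor for every CPU
# that exposes a cpufreq directory.
# $1: sysfs CPU root (hypothetical optional argument; defaults to the
#     standard location).
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip the unexpanded glob when no CPU exposes cpufreq.
        [ -d "$dir" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        cur=$(cat "$dir/scaling_cur_freq")
        max=$(cat "$dir/scaling_max_freq")
        gov=$(cat "$dir/scaling_governor")
        # cpufreq reports frequencies in kHz.
        echo "$cpu cur=${cur}kHz max=${max}kHz governor=$gov"
    done
}
```

Calling show_cpufreq with no argument reads the real sysfs tree and prints one line per core; nothing is printed if no core exposes cpufreq.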

jch

It's using the performance governor, with the stock max frequency.

Just to be clear: the board is *not* overheating, it never reaches 70°C.  The WiFi interface just randomly hangs under network load (not CPU load).

LubOlimex

If the whole board stalls, then it is not just the WiFi.

Performance governor basically makes the board ignore temperature settings.

Try the ondemand governor. Or maybe even powersave, to test whether it improves reliability:

echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

jch

April 30, 2022, 04:07:03 pm #4 Last Edit: April 30, 2022, 04:14:13 pm by jch
> If the whole board stalls, then it is not just the WIFI.

Please read my initial posting again.  "The serial shell is still responsive, but the board is no longer routing until I reboot it."

> Performance governor basically makes the board ignore temperature settings.

I'm pretty sure that's not correct.  I've just confirmed that the CPU frequency slows down when the board reaches 70°C even with the performance governor.
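
A check of this kind can be repeated by sampling temperature and clock speed side by side while the board is under load. What follows is a rough sketch, not the exact procedure used above: it assumes that thermal_zone0 is the CPU sensor on this board (unverified), and the function name sample_thermal and its two optional file arguments are hypothetical.

```shell
#!/bin/sh
# Print one "temperature frequency" sample.
# $1: thermal zone temp file (assumed to be the CPU sensor),
# $2: scaling_cur_freq file; both are optional, hypothetical arguments.
sample_thermal() {
    temp_file="${1:-/sys/class/thermal/thermal_zone0/temp}"
    freq_file="${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
    millideg=$(cat "$temp_file") || return 1
    khz=$(cat "$freq_file") || return 1
    # sysfs reports millidegrees Celsius and kHz; convert to whole units.
    echo "$((millideg / 1000))C $((khz / 1000))MHz"
}
```

Running `while sample_thermal; do sleep 1; done` during a load test gives a simple log in which any drop in frequency can be matched against the temperature at that moment.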

jch

June 02, 2022, 01:51:32 am #5 Last Edit: June 02, 2022, 01:56:37 am by jch
It looks like it's this bug: https://bugzilla.kernel.org/show_bug.cgi?id=215542, which is fixed in Linux 5.10.106.  Olimex, could we please have an update with 5.10.106 or later?
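
To tell whether a given image already carries the fix, the running kernel version can be compared against 5.10.106. This is a sketch under assumptions: it relies on `sort -V` (available in GNU coreutils), the helper name version_at_least is made up here, and the comparison is only meaningful for kernels from the 5.10.y series.

```shell
#!/bin/sh
# Succeed when version $1 is at least version $2 (dotted numeric versions).
# Hypothetical helper; relies on version sorting via `sort -V`.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Drop anything after the first "-" in `uname -r` (for example a local
# build suffix) so that only the numeric version is compared.
running=$(uname -r | cut -d- -f1)
if version_at_least "$running" 5.10.106; then
    echo "kernel $running is at least 5.10.106"
else
    echo "kernel $running is older than 5.10.106"
fi
```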

LubOlimex

Nice find. Sure, we will update it; if you check the branches, you can see that we update it regularly:

https://github.com/OLIMEX/linux-olimex