A64 stalls under network load with kernel .105

Started by jch, April 08, 2022, 11:44:08 PM


jch

Hi,

Ever since upgrading to .105, I'm seeing my A64 board fail under heavy network load (routing between Ethernet and WiFi).  The serial shell is still responsive, but the board is no longer routing until I reboot it.  The little console consoles itself with messages such as these:
[ 1841.000879] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... 1-... 2-... } 37157 jiffies s: 105 root: 0x7/.
[ 1841.012496] rcu: blocking rcu_node structures:
[ 1890.618734] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1890.624321] rcu:     1-....: (54120 ticks this GP) idle=27e/1/0x4000000000000002 softirq=6037/6037 fqs=26244
[ 1904.486155] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... 1-... 2-... } 53029 jiffies s: 105 root: 0x7/.
[ 1904.497762] rcu: blocking rcu_node structures:
[ 1953.628024] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1953.633612] rcu:     1-....: (69871 ticks this GP) idle=27e/1/0x4000000000000002 softirq=6037/6037 fqs=34118
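
For anyone trying to confirm they are hitting the same symptom, a minimal sketch for counting these stall reports in a kernel log follows. The function name `count_rcu_stalls` is hypothetical, and the pattern only matches the two message forms quoted above; other kernel versions may word the warnings differently.

```shell
# Hypothetical helper: count RCU stall reports in kernel log text read from stdin.
# Matches only the two "rcu_sched" message forms seen in the log above.
count_rcu_stalls() {
    grep -c -E 'rcu: INFO: rcu_sched (self-detected stall|detected expedited stalls)'
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | count_rcu_stalls
```

A count that keeps growing while the serial shell stays responsive would be consistent with the behaviour described in this post.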

LubOlimex

This looks like the problem I wrote about here; it was supposed to be fixed:

https://www.olimex.com/forum/index.php?topic=8643.msg33463#msg33463

1) What do these three commands return?

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

2) Are you using the latest image from here: https://images.olimex.com/release/a64/
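
The three reads from question 1 can be collected in one go. This is only a convenience sketch: the function name `show_cpufreq` is made up for illustration, and the default path assumes cpu0 as in the commands above.

```shell
# Hypothetical helper: print the three cpufreq values asked for above.
# The directory defaults to cpu0's cpufreq node; pass another path to override.
show_cpufreq() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpufreq}"
    for f in scaling_cur_freq scaling_max_freq scaling_governor; do
        if [ -r "$dir/$f" ]; then
            # Frequencies are reported in kHz; the governor is a plain name.
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: (not readable)\n' "$f"
        fi
    done
}

# Typical use on the board:
#   show_cpufreq
```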
Technical support and documentation manager at Olimex

jch

It's using the performance governor, with the stock max frequency.

Just to be clear: the board is *not* overheating, it never reaches 70°C.  The WiFi interface just randomly hangs under network load (not CPU load).

LubOlimex

If the whole board stalls, then it is not just the WIFI.

Performance governor basically makes the board ignore temperature settings.

Try the ondemand governor, or maybe even powersave, to test whether it improves reliability:

echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
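
The command above only writes to cpu0. If the cores on a given board do not share a single cpufreq policy (an assumption worth checking, since on many boards they do), a sketch like the following applies the governor to every CPU that exposes the file and prints what the kernel actually accepted. The function name `set_governor` is hypothetical, and it needs root to write to sysfs.

```shell
# Hypothetical helper: set a cpufreq governor on every CPU that exposes one.
# $1 = governor name, $2 = sysfs CPU base directory (defaults to the real one).
set_governor() {
    gov="$1"
    base="${2:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without a writable governor file (or an unmatched glob).
        [ -w "$f" ] || continue
        echo "$gov" > "$f"
        # Read the value back so a rejected governor is visible.
        printf '%s -> %s\n' "$f" "$(cat "$f")"
    done
}

# Typical use (as root):
#   set_governor powersave
```

The governors a kernel offers are listed in `scaling_available_governors` in the same sysfs directory, which is worth reading before picking one.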

jch

> If the whole board stalls, then it is not just the WIFI.

Please read my initial posting again.  "The serial shell is still responsive, but the board is no longer routing until I reboot it."

> Performance governor basically makes the board ignore temperature settings.

I'm pretty sure that's not correct.  I've just confirmed that the CPU frequency slows down when the board reaches 70°C even with the performance governor.
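
A check like this can be reproduced by logging temperature and frequency side by side. The sketch below is not the exact procedure used here; it assumes that `thermal_zone0` is the CPU zone (this varies by board and kernel) and that temperatures are non-negative. The function name `log_temp_freq` is hypothetical.

```shell
# Hypothetical helper: print one line with CPU temperature and current frequency.
# $1 = thermal zone temp file (millidegrees Celsius), $2 = scaling_cur_freq file (kHz).
log_temp_freq() {
    thermal="${1:-/sys/class/thermal/thermal_zone0/temp}"
    freq="${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
    t=$(cat "$thermal")
    f=$(cat "$freq")
    # Convert to degrees with one decimal place and to MHz.
    printf '%d.%d C  %d MHz\n' "$((t / 1000))" "$(( (t % 1000) / 100 ))" "$((f / 1000))"
}

# Typical use while loading the board:
#   while sleep 1; do log_temp_freq; done
```

If throttling is active, the MHz column should drop once the temperature column approaches the trip point, whichever governor is selected.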

jch

It looks like it's this bug: https://bugzilla.kernel.org/show_bug.cgi?id=215542, which is fixed in Linux 5.10.106.  Olimex, could we please have an update with 5.10.106 or later?
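
To check whether a given image already carries the fix, the running kernel version can be compared against 5.10.106. A minimal sketch, assuming GNU `sort -V` is available (it is on Debian) and that any local suffix in `uname -r` follows a hyphen; the function name `kernel_at_least` is hypothetical, and the comparison is only meaningful within the 5.10.y series.

```shell
# Hypothetical helper: succeed if the kernel version is at least the wanted one.
# $1 = minimum version, $2 = version to test (defaults to the running kernel).
kernel_at_least() {
    want="$1"
    have="${2:-$(uname -r)}"
    # Drop any local suffix after the first hyphen, e.g. "5.10.105-foo" -> "5.10.105".
    have="${have%%-*}"
    # Version-sort both; the wanted version must come first (or be equal).
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Typical use:
#   kernel_at_least 5.10.106 && echo "fix should be present" || echo "kernel predates the fix"
```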

LubOlimex

Nice find. Sure we will update it, if you check the branches you can see we update it regularly:

https://github.com/OLIMEX/linux-olimex

jch

> Nice find. Sure we will update it, if you check the branches you can see we update it regularly:

I've just reflashed my board with A64-OLinuXino-bullseye-minimal-20230515-130040, and the kernel is still 5.10.105.

@Olimex, I reported this issue almost a year ago... may I please ask that you provide an updated kernel?

LubOlimex

@jch I sent you a personal message more than a month ago with an experimental image with a newer kernel, but you didn't respond. Can you check your inbox?

jch

Indeed, I missed it.  I'll try to find time to test it this weekend, and report back.

jch

@LubOlimex, I've just reflashed a board with the experimental 5.10.180 image that you provided, and configured it as an AP, the exact same configuration that would freeze under .105.

As far as I can tell, it's rock solid: I've downloaded 200MB of data through it, and it's still up (I'm actually posting this message through it).

LubOlimex

Thanks for the feedback; I will forward the info to the developers. Let me know if you notice anything strange with that image.