A20 hangs with rcu: INFO: rcu_sched detected stalls on CPUs/tasks errors

Started by honken, May 23, 2021, 07:44:51 PM

Previous topic - Next topic

honken

I have an A20 (A20-OLinuXino-LIME2 Rev.K) that unfortunately hangs more or less completely on a regular basis. This happens somewhat randomly, after a few days at most, and at that point the board becomes completely unreachable over the network. What I've done so far:

  • Set up a watchdog to reboot the board if the network becomes unreachable or the load average is too high. Disconnecting the network results in a reboot, so the watchdog works as it should.
  • Somewhat in desperation, set up a cron job to reboot it each night, in case the issue accumulates over time. This has not helped.
  • Connected a serial console and started logging all console output to a logfile. Sometimes I do get log output (see below); sometimes it just hangs completely without any output.
  • Searched this forum for similar issues without any luck. The closest I got was this issue for the A64. I tried setting the governor to performance, but still get crashes.
  • Tried magic sysrq over serial. This has worked in some cases, at least allowing me to reset the board. In other cases not even sysrq has worked, and in general the only command that has any effect is the reset; remounting read-only and the like do nothing.
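For reference, this kind of watchdog can be set up with the standard Debian watchdog package; a minimal sketch of the relevant /etc/watchdog.conf entries (the ping target and threshold here are placeholders, not my exact setup):

```
# /etc/watchdog.conf -- illustrative excerpt
watchdog-device = /dev/watchdog
# reboot if this address stops answering pings
ping            = 192.168.1.1
interface       = eth0
# reboot if the 1-minute load average exceeds this
max-load-1      = 24
```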

As mentioned, I got some logging output during some of the crashes, but not all. At the most recent one the log filled up with this message:

[78766.050175] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[78766.056136] rcu:     1-...0: (2 ticks this GP) idle=566/1/0x40000000 softirq=5235681/5235682 fqs=2625
[78829.070164] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[78829.076127] rcu:     1-...0: (2 ticks this GP) idle=566/1/0x40000000 softirq=5235681/5235682 fqs=9251
[78839.098691] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... } 6009 jiffies s: 545 root: 0x2/.
[78839.109235] rcu: blocking rcu_node structures:
[78892.090185] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[78892.096150] rcu:     1-...0: (2 ticks this GP) idle=566/1/0x40000000 softirq=5235681/5235682 fqs=15873

Later on the message changed to this:

[294231.429218] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[294231.435228] rcu:    1-...0: (2 ticks this GP) idle=566/1/0x40000000 softirq=5235681/5235682 fqs=105777
[294241.445267] rcu: rcu_sched kthread starved for 53619940 jiffies! g8663913 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[294241.455948] rcu:    Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[294241.464981] rcu: RCU grace-period kthread stack dump:

This message kept recurring every minute or so. The console was completely unresponsive apart from this output and magic sysrq, which in this instance at least partially worked, allowing me to reset the board. At an earlier crash I got a somewhat more informative message:

[160986.415593] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[160986.421647] rcu:    1-...0: (6 ticks this GP) idle=176/1/0x40000000 softirq=11046871/11046873 fqs=15875
[160986.431034]         (detected by 0, t=36762 jiffies, g=18625165, q=6060)
[160986.437216] Sending NMI from CPU 0 to CPUs 1:
[160996.443712] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... } 6586 jiffies s: 973 root: 0x2/.
[160996.454359] rcu: blocking rcu_node structures:
[160996.458921] Task dump for CPU 1:
[160996.462256] task:zma             state:R  running task     stack:    0 pid:29639 ppid:  1133 flags:0x00000002
[160996.472275] Backtrace:
[160996.474829] [<c0e9c8f4>] (0xc0e9c8f4) from [<00000000>] (0x0)
[160996.480681] Backtrace aborted due to bad frame pointer <8a7e5869>
[161049.437840] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[161049.443889] rcu:    1-...0: (6 ticks this GP) idle=176/1/0x40000000 softirq=11046871/11046873 fqs=22495
[161049.453275]         (detected by 0, t=52517 jiffies, g=18625165, q=8896)
[161049.459456] Sending NMI from CPU 0 to CPUs 1:
[161061.370373] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... } 22818 jiffies s: 973 root: 0x2/.
[161061.381130] rcu: blocking rcu_node structures:
[161061.385690] Task dump for CPU 1:
[161061.389049] task:zma             state:R  running task     stack:    0 pid:29639 ppid:  1133 flags:0x00000002
[161061.399071] Backtrace:
[161061.401841] [<c0e9c8f4>] (0xc0e9c8f4) from [<00000000>] (0x0)
[161061.408147] Backtrace aborted due to bad frame pointer <8a7e5869>

I use the board as a remote server, running Zoneminder and OpenHAB. It has a hard drive attached with swap on it, and around 500 MB of free memory. The CPU is fairly heavily loaded, but normally idles 30-40% of the time. Disk reads/writes are fairly low. I run Debian 10 using the image provided by Olimex (A20-OLinuXino-buster-minimal-20210127-100834.img was the image I used), and it has been kept updated, so the kernel and all other packages are the latest versions.

I've tried searching for similar issues outside of this forum and found some pointers to kernel bugs that are several years old, so I'm assuming those should be resolved by now? Has anyone experienced similar issues, and would you have any advice on what the cause might be?

honken

I'm unfortunately still having issues with this board, and in recent days they have become quite extreme (more or less daily crashes, often even more frequent than that). What I found after my initial post was that disconnecting the battery actually made the board a whole lot more stable, although it would still crash every 1-2 weeks. Crashes sometimes came with kernel panics or similar on the serial console; other times they were completely silent. Magic sysrq to reset the board worked a few times, but often the only thing that helps is pulling the power from the board completely.

As mentioned, the board recently became even more unstable. Some crashes happened completely silently; others produced messages like the following:

[ 3285.028048] 8<--- cut here ---
[ 3285.031147] Unable to handle kernel paging request at virtual address f00e3824
[ 3285.038385] pgd = 71ac8d41
[ 3285.041117] [f00e3824] *pgd=00000000
[ 3285.044739] Internal error: Oops: 5 [#1] SMP ARM

Another recent crash gave this:

[22715.712699] 8<--- cut here ---
[22715.715808] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[22715.723920] pgd = b7011945
[22715.726645] [00000004] *pgd=00000000
[22715.730249] Internal error: Oops: 5 [#1] SMP ARM
[22715.734874] Modules linked in: xt_MASQUERADE nf_conntrack_netlink nft_counter nft_chain_nat xt_addrtype nft_compat nf_tables nfnetlink xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge overlay rfkill evdev axp20x_adc axp20x_battery axp20x_ac_power industrialio lima at24 sun4i_can gpu_sched can_dev sun4i_codec sun4i_ss sunxi_cedrus(C) v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common libdes videodev mc sun4i_drm_hdmi uio_pdrv_genirq uio cpufreq_dt sun4i_ts hwmon usb_f_acm u_serial g_serial libcomposite ip_tables x_tables pinctrl_axp209 spidev realtek
[22715.789415] CPU: 1 PID: 790 Comm: dockerd Tainted: G         C        5.10.60-olimex #123632
[22715.797853] Hardware name: Allwinner sun7i (A20) Family
[22715.803103] PC is at rb_insert_color+0x24/0x174
[22715.807649] LR is at timerqueue_add+0x7c/0xb8
[22715.812013] pc : [<c053f3a8>]    lr : [<c0545a24>]    psr: 60070093
[22715.818281] sp : d93a7dd4  ip : d93a7de8  fp : d93a7de4
[22715.823505] r10: 00000000  r9 : 00000001  r8 : 00000000
[22715.828726] r7 : ef69c894  r6 : d93a7f00  r5 : 000014a8  r4 : ea664a93
[22715.835251] r3 : c2bf7f00  r2 : 00000000  r1 : ef69c894  r0 : d93a7f00
[22715.841784] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[22715.849006] Control: 10c5387d  Table: 592c006a  DAC: 00000051
[22715.854768] Process dockerd (pid: 790, stack limit = 0x23e8231c)
[22715.860789] Stack: (0xd93a7dd4 to 0xd93a8000)
[22715.865162] 7dc0:                                              ea664a93 d93a7e0c d93a7de8
[22715.873352] 7de0: c0545a24 c053f390 ef69c880 d93a7f00 ef69c880 00000000 c0f04fc8 00000001
[22715.881534] 7e00: d93a7e24 d93a7e10 c01ae9b0 c05459b4 d93a7f00 ef69c840 d93a7e84 d93a7e28
[22715.889709] 7e20: c01af8cc c01ae974 c0f04fc8 00000000 00000000 00000000 00000000 ffffffff
[22715.897892] 7e40: 00000040 000014a8 d951b800 ef69c880 20070013 7cbbb4fa 00000080 d93a7f00
[22715.906078] 7e60: ffffe000 00011170 00000000 c10558ac c0f04fc8 00000001 d93a7eec d93a7e88
[22715.914262] 7e80: c0a7f32c c01af644 0000c350 00000000 00000001 c01aeb24 c0f04fc8 c01aee28
[22715.922440] 7ea0: 20070193 00000003 d93a7f00 00000001 00000001 00000000 c0f04fc8 7cbbb4fa
[22715.930613] 7ec0: d93a7eec 00004e20 00000000 00011170 00000000 c0f04fc8 0000c350 00000001
[22715.938788] 7ee0: d93a7f6c d93a7ef0 c01b0444 c0a7f288 d93a7f0c 00000000 0000c350 00000000
[22715.946966] 7f00: c2bf7f00 00000000 00000000 00000000 ea664a93 000014a8 ea658743 000014a8
[22715.955146] 7f20: c01aef78 ef69c880 00000001 00000000 d951b800 00000000 c0f04fc8 7cbbb4fa
[22715.963326] 7f40: c01c8aa0 c0f04fc8 00000000 00000000 00000000 c0100264 d93a6000 000000a2
[22715.971506] 7f60: d93a7fa4 d93a7f70 c01b0588 c01b038c 00000000 00000000 00004e20 7cbbb4fa
[22715.979687] 7f80: 00000000 7cbbb4fa 00000096 00000000 00000000 000000a2 00000000 d93a7fa8
[22715.987878] 7fa0: c0100060 c01b04d0 00000096 00000000 a51a3c78 00000000 00000001 00000014
[22715.996068] 7fc0: 00000096 00000000 00000000 000000a2 00000006 00000000 11000540 00002710
[22716.004255] 7fe0: 00000018 a51a3c74 0d4dca84 0d4dcaa4 80070010 a51a3c78 00000000 00000000
[22716.012435] Backtrace:
[22716.014916] [<c053f384>] (rb_insert_color) from [<c0545a24>] (timerqueue_add+0x7c/0xb8)
[22716.022921]  r4:ea664a93
[22716.025473] [<c05459a8>] (timerqueue_add) from [<c01ae9b0>] (enqueue_hrtimer+0x48/0xac)
[22716.033477]  r9:00000001 r8:c0f04fc8 r7:00000000 r6:ef69c880 r5:d93a7f00 r4:ef69c880
[22716.041229] [<c01ae968>] (enqueue_hrtimer) from [<c01af8cc>] (hrtimer_start_range_ns+0x294/0x3d8)
[22716.050101]  r5:ef69c840 r4:d93a7f00
[22716.053696] [<c01af638>] (hrtimer_start_range_ns) from [<c0a7f32c>] (do_nanosleep+0xb0/0x1e4)
[22716.062230]  r10:00000001 r9:c0f04fc8 r8:c10558ac r7:00000000 r6:00011170 r5:ffffe000
[22716.070066]  r4:d93a7f00
[22716.072619] [<c0a7f27c>] (do_nanosleep) from [<c01b0444>] (hrtimer_nanosleep+0xc4/0x144)
[22716.080725]  r10:00000001 r9:0000c350 r8:c0f04fc8 r7:00000000 r6:00011170 r5:00000000
[22716.088563]  r4:00004e20
[22716.091114] [<c01b0380>] (hrtimer_nanosleep) from [<c01b0588>] (sys_nanosleep_time32+0xc4/0x104)
[22716.099903]  r10:000000a2 r9:d93a6000 r8:c0100264 r7:00000000 r6:00000000 r5:00000000
[22716.107735]  r4:c0f04fc8
[22716.110287] [<c01b04c4>] (sys_nanosleep_time32) from [<c0100060>] (ret_fast_syscall+0x0/0x54)
[22716.118809] Exception stack(0xd93a7fa8 to 0xd93a7ff0)
[22716.123869] 7fa0:                   00000096 00000000 a51a3c78 00000000 00000001 00000014
[22716.132044] 7fc0: 00000096 00000000 00000000 000000a2 00000006 00000000 11000540 00002710
[22716.140221] 7fe0: 00000018 a51a3c74 0d4dca84 0d4dcaa4
[22716.145283]  r7:000000a2 r6:00000000 r5:00000000 r4:00000096
[22716.150952] Code: 0a00003b e5932000 e3120001 189da810 (e592c004)
[22716.157063] ---[ end trace 2059b098b968d6da ]---

I've tried monitoring the power usage in case the power supply is somehow the culprit, logging the content of /sys/class/power_supply/axp20x-*/uevent every second. From what I can tell everything looks quite normal there; here's the last output just before a crash:

POWER_SUPPLY_TYPE=Mains
POWER_SUPPLY_HEALTH=Good
POWER_SUPPLY_PRESENT=1
POWER_SUPPLY_ONLINE=1
POWER_SUPPLY_VOLTAGE_NOW=4982000
POWER_SUPPLY_CURRENT_NOW=420000
POWER_SUPPLY_NAME=axp20x-battery
POWER_SUPPLY_TYPE=Battery
POWER_SUPPLY_PRESENT=1
POWER_SUPPLY_ONLINE=1
POWER_SUPPLY_STATUS=Not charging
POWER_SUPPLY_VOLTAGE_NOW=4133000
POWER_SUPPLY_CURRENT_NOW=0
POWER_SUPPLY_CONSTANT_CHARGE_CURRENT=1200000
POWER_SUPPLY_CONSTANT_CHARGE_CURRENT_MAX=1200000
POWER_SUPPLY_HEALTH=Good
POWER_SUPPLY_VOLTAGE_MAX_DESIGN=4200000
POWER_SUPPLY_VOLTAGE_MIN_DESIGN=2900000
POWER_SUPPLY_CAPACITY=97

As I use the board for remote monitoring, its instability is getting to the point where I'll probably have to consider replacing it altogether with something else. What I'm going to try next, even though it is discouraged in the docs, is powering the board through the USB OTG port with a 2A USB charger. Are there any other things I could try here?

honken

What might be worth adding as well is that the recent, more extreme stability issues coincided with much colder weather, meaning drier air. The temperature where the board is located is probably quite stable, around 12-16C, but the humidity has dropped a lot with the cold weather. All other electronic devices are unaffected, so this might just be a coincidence.

soenke

Steps I would recommend:

1) do a memory test
2) disable all power saving functions
3) set cpu governor to performance
4) build some kind of case around it to increase ambient temperature (just to rule that out)
5) check cpu temperatures and install a cooler if too hot
6) check dram frequencies, maybe try lower clocks
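Points 2), 3) and 5) can be checked from a shell along these lines (a sketch assuming the usual cpufreq/thermal sysfs paths, which may differ or be absent on a given kernel):

```shell
#!/bin/sh
# Force the performance governor on all CPUs; skips silently if the
# cpufreq sysfs nodes are missing or not writable.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -w "$g" ] && echo performance > "$g"
done

# Thermal sysfs reports millidegrees Celsius; small conversion helper.
millideg_to_c() {
    awk -v v="$1" 'BEGIN { printf "%.1f", v / 1000 }'
}

t=/sys/class/thermal/thermal_zone0/temp
if [ -r "$t" ]; then
    echo "SoC temperature: $(millideg_to_c "$(cat "$t")") C"
else
    echo "no thermal zone exposed on this kernel"
fi
```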

honken

Many thanks for the great suggestions! Unfortunately the board seems to have just hung again, and I will not be able to get to it for a few days to test any of this :(

Quote1) do a memory test

On doing a memory test, any suggestions for the best way of doing that? I tried reading up on what's included in U-Boot; would that be a reasonable option?
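From what I've read so far, U-Boot has a built-in `mtest` command when it is built with CONFIG_CMD_MEMTEST, which exercises a DRAM range from the U-Boot prompt. A sketch of an invocation (addresses assume the A20's DRAM base at 0x40000000; the range, pattern and iteration count here are just examples, and the range must not overlap memory U-Boot itself is using):

```
=> mtest 0x42000000 0x48000000 0 10
```

An alternative under Linux would be the Debian memtester package, e.g. `memtester 256M 5`, though that can only test memory not currently in use by the kernel.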

Quote2) disable all power saving functions
3) set cpu governor to performance

I have not tried disabling power saving yet, but I earlier in the year tried changing the CPU governor to different settings (including performance) which unfortunately did not help.

Quote4) build some kind of case around it to increase ambient temperature (just to rule that out)
5) check cpu temperatures and install a cooler if too hot

I've checked the CPU temperatures earlier and they seemed low and stable, and given that it is colder now, the fact that the issue is more frequent seems a bit odd. I will try raising the temperature of the board though, to rule that out.

Quote6) check dram frequencies, maybe try lower clocks

Good idea, will try that once I've ruled out the memory error part above.

LubOlimex

Is this happening while using the latest Olimage Linux image? Can you test with it? You can find it on our FTP.
Technical support and documentation manager at Olimex

honken

Quote from: LubOlimex on November 30, 2021, 09:59:12 AM: Is this happening while using the latest Olimage Linux image? Can you test with it? You can find it on our FTP.

The image I'm using currently is A20-OLinuXino-buster-minimal-20210127-100834.img. I've updated it continuously since, though, so the kernel etc. are all the most recent versions. I will try the latest image from the FTP though, thank you for the suggestion!

LubOlimex

Maybe try with the base image to narrow down the possibilities. Also, we are now retiring Ubuntu builds; only Debian will be supported starting in 2022.
Technical support and documentation manager at Olimex