LIME2 Rev.L lock-up freeze instability

Started by kimfaint, June 13, 2023, 04:56:47 AM


kimfaint

We have about 160 devices deployed remotely that use an A20-OLinuXino-LIME2 Rev.E board.  They have all been very reliable, running 24/7 for many years now.

Last year we ordered some new boards and received Rev.L. Due to the change in Ethernet chipset, we therefore needed to move from the Debian 7 (wheezy) OS we were using to the latest Debian 11 (bullseye) OS. The new OS now runs on both the Rev.E and Rev.L boards, both running our app (graphical information displayed on an HDMI-connected display).

The Rev.E boards are still very stable, with uptimes measured in months; they are only restarted when there is a power outage. However, most of the Rev.L boards run for anywhere from a few hours up to two weeks before the board randomly freezes and locks up.

We have ruled out environmental factors such as temperature and power supply as the cause: we monitor the temperature, and a 4G modem connected to the same power supply remains up when these Olimex lockups occur.
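
For reference, a simple way to keep an eye on the SoC temperature from the running system is the thermal sysfs interface (a rough sketch, assuming the kernel exposes a thermal zone for the A20; the zone index may differ):

# print the SoC temperature (reported in millidegrees Celsius)
cat /sys/class/thermal/thermal_zone0/temp
# or log it once a minute for long-running monitoring
while true; do echo "$(date '+%F %T') $(cat /sys/class/thermal/thermal_zone0/temp)"; sleep 60; done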

We have also set up a Rev.E board and a Rev.L board on the bench, both powered from the same supply and both running the same OS and software. The Rev.E board currently has an uptime of more than 40 days; the Rev.L board locks up every few days.

When this occurs the symptoms are:
  • all LEDs on the board (red, green and Ethernet port) are still ON
  • the HDMI display (which shows a clock) is frozen at the time the lock-up occurred
  • the device cannot be pinged over the Ethernet port
  • no interaction is possible over the serial port

We are calling this "Zombie mode".

The only way to recover is to travel to the remote site and cycle the power.  Pressing the RESET button on the Olimex board also works.

I have seen a few other posts on this forum with similar issues, but no concrete solution.

We are not using the T2-IND (industrial) version of the board, as we do not have demanding temperature requirements and the Rev.E boards did not require an extended temperature range.

We are also considering whether the LIME board might be worth trying, in case it is more stable than the LIME2. It seems to be form-factor and pinout compatible, and we don't really need the additional RAM and Ethernet speed of the LIME2. It has also had fewer hardware revisions.

LubOlimex

As a workaround, did you try employing a watchdog, so that the board is restarted automatically when it hangs?

Hard to say which of the hardware changes affected your setup. There are a lot of changes.

1. Can you do one other test? I know the Ethernet won't work with the older kernel image, but run the Debian 7 (wheezy) image on the newer revision L boards and check if it hangs. The purpose of this test is to show whether the older u-boot behaves more solidly. Let me know how it goes.

2. Did you use the Olimex image as a basis? Did you perform tests with the latest image ("A20-OLinuXino-bullseye-base-20230515-130040.img.7z")? Can you confirm that at the start of boot the board is properly listed as A20-OLinuXino-LIME2 and the revision listed is L (this can be seen over the serial console after a fresh boot or after a reboot)? If it is not listed as LIME2 revision L, how is it listed?
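
If it is easier, the model string that actually got loaded can also be checked from the running system (a small sketch; the serial boot log remains the definitive place to see the revision detection):

# prints the model property of the device tree that u-boot handed to the kernel
cat /proc/device-tree/model; echo
# same information via the sysfs path
cat /sys/firmware/devicetree/base/model; echo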
Technical support and documentation manager at Olimex

kimfaint

Thanks for your response. I did try to arm the watchdog device with `echo 1 > /dev/watchdog`, and I got the following in syslog:

watchdog: watchdog0: watchdog did not stop!
But after a few more minutes nothing happened. There does not seem to be any watchdogd process running. Are there any more details on how to enable the software watchdog?
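
One further test I may try is to hold the device open and then stop feeding it, to see whether the hardware watchdog fires at all. A rough sketch, assuming nothing else already holds /dev/watchdog and the driver's default timeout applies:

# opening /dev/watchdog arms it; writing anything (other than a magic 'V'
# right before close) feeds it
exec 3> /dev/watchdog
for i in 1 2 3; do echo 1 >&3; sleep 5; done   # feed it a few times
# now stop feeding; if the hardware watchdog works, the board should reset
# itself once the timeout expires (reportedly around 16 s on the A20)
sleep 120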

I will try running the Debian 7 image on some of the Rev.L boards that we have removed from the field because they were suffering from the zombie state every few days. I will also try running with the latest 20230515 image from the Olimex download page. Unfortunately, these types of tests all take time to set up and then much longer to run. To date we have been focused on workarounds, but I will report back with these results.

As for the revision detection at boot, that is interesting. We are building our own image and also our own u-boot; the reasons for this are discussed in this post. To build our image we:
  • Clone the https://github.com/OLIMEX/olimage repo
  • Patch config to include additional deb packages and also include an install script for our app
  • Build a modified A20 minimal image
  • Clone the https://github.com/OLIMEX/u-boot-olinuxino repo
  • Apply a patch to u-boot-olinuxino to change the default board type from LIME to LIME2, specifically LIME2 Rev.E
  • Build the modified u-boot binary.
  • Write the modified u-boot binary to the boot sector of the modified A20 minimal image (roughly as sketched below).
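
The last step is essentially a dd of the SPL+u-boot binary into the image at the conventional Allwinner offset of 8 KiB; roughly the following (file names are placeholders):

# write the rebuilt u-boot (with SPL) into the boot area of the card image;
# 8 KiB is the offset the A20 boot ROM expects
dd if=u-boot-sunxi-with-spl.bin of=a20-olinuxino-minimal.img bs=1024 seek=8 conv=notrunc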

The image we are using was built on 2022-10-04.

We had to rebuild the u-boot binary because of errors reading from the EEPROM, caused by a conflicting device on our baseboard (a tilt switch) that shares the I2C pin used by the EEPROM. When u-boot was unable to read the EEPROM, it defaulted to LIME and the Ethernet did not work. We chose a default of LIME2 Rev.E because that is the minimum revision we use, and the Ethernet on Rev.L still seems to work when the board is detected as Rev.E.

Modifying our baseboards to remove the conflicting device, so the EEPROM can be read, is an option; it is a device we don't actually use. However, we would rather avoid doing this because modifying/re-manufacturing more baseboards and fitting them in the field would be costly.

Is there any reason that detection of the board as LIME2 Rev.E would cause a LIME2 Rev.L board to be unstable?

LubOlimex

I think this is the likely culprit of the problems: wrong settings are being loaded for the boards. Different revisions have different hardware that requires different software support. My advice for debugging the issue is to perform tests with the latest image as a basis, without connecting the conflicting I2C device, and check if it still hangs. That would at least point in a meaningful direction.

Hard to say what the changes are; probably compare the sources of the older and newer images and browse through the commits at our GitHub.

I think enabling the watchdog requires enabling kernel modules or even rebuilding the kernel. We haven't done it ourselves, so I can't give you anything useful about the process. Still, I've seen people on the forums and over e-mail who managed to get it working, so maybe search the forums or online. It was just an idea of something you can try.

Technical support and documentation manager at Olimex

LubOlimex

You noticed that Ethernet doesn't work with some configurations, and that is an easy empirical way to determine that using the wrong config can lead to trouble, but some changes are not so obvious. It is not so easy to spot issues when it comes to RAM memory timings, for example; I know we have used different RAM memories between revisions, each coming with its own software configuration.
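
If RAM is suspected, a simple (though not conclusive) check would be to stress the memory for a while, for example with memtester; adjust the size so the rest of the system still has room:

apt install memtester
# lock 256 MB of RAM and run 4 test passes; reported errors or a hang
# during the run would hint at a memory/timing problem
memtester 256M 4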
Technical support and documentation manager at Olimex

KriszK

I have a similar problem with an A20 Rev.L board. It freezes every 1-2 days using the latest official Olimage Linux with the latest kernel (5.10.180-20230725-092646).

This Olimage Linux was installed one week ago and immediately updated (apt update / apt upgrade) to the latest software level.

I am using it in a High Availability setup, which means the same software level and the same SW/HW configuration are used on two A20 boards (only the board revisions differ):

Board 1: T2-OLinuXino-LIME2-e8Gs16M-IND Rev.K2
Board 2: T2-OLinuXino-LIME2-e8Gs16M-IND Rev.L1

Board 1 is very stable. Board 2 is always freezing.

One SATA SSD disk and one USB Zigbee stick are connected to these boards. I have tried connecting the USB stick both with and without a USB hub (with its own power supply), but with the same result.

A LI-PO battery (3.7 V, 1400 mAh) is connected to each board.

Power supply for both boards: SY1005E (5V/2A)


LubOlimex

@KriszK

1) Is there any message or error thrown in the log? Maybe some kernel panic or anything abnormal in the messages, or does it just stop?

2) Are you sure that the whole board hangs, and not only your debug medium (if you use SSH, maybe only the Ethernet hung; if you use the serial port, maybe only your adapter hung)?

3) Do you boot from the SD card or the eMMC? Is the SSD used only as storage?

4) Are you using the same USB port for the USB device? The two USB ports are configured differently - one can provide double the current of the other.

5) How do you power the disk - is it from the board? If you are powering everything from the adapter, the supply might be insufficient. Can you recreate the hang with the disk powered from another source, or without the disk or the USB device attached?
Technical support and documentation manager at Olimex

KriszK

Hi!

Thanks for the quick response.

I have made some tests.

Test 1.:
I have deployed the previous version of the kernel (5.10.105-20230217-181328) with the following command:
apt reinstall linux-image-5.10.105-olimex

Then I rebooted the system and waited...

Unfortunately, the board totally froze again after 1 day, but in a different way. First I noticed that I was not able to log in using SSH. At that time other functions, like Home Assistant and the Zigbee USB stick, were still working, and I was able to log in to the Home Assistant frontend using Firefox, but the Home Assistant application could not reach the weather service over the internet (as if the default gateway were missing...).
Some hours later the board froze completely.

Logs:

Failed SSH logins:
Sep  2 17:41:53 18087-hass-ssd-2 sshd[529]: error: beginning MaxStartups throttling
Sep  2 17:41:53 18087-hass-ssd-2 sshd[529]: drop connection #10 from [192.168.xxx.xxx]:34306 on [192.168.xxx.xxx]:22 past MaxStartups

Just before freezing and rebooting:
Sep  3 03:01:00 18087-hass-ssd-2 bash[274]: 2023-09-03 03:01:00.317 WARNING (MainThread) [homeassistant.components.media_player] Updating spotify media_player took longer than the scheduled update interval 0:00:30
Sep  3 03:01:00 18087-hass-ssd-2 bash[274]: 2023-09-03 03:01:00.340 WARNING (MainThread) [homeassistant.components.switch] Updating wake_on_lan switch took longer than the scheduled update interval 0:00:30
Sep  3 03:01:01 18087-hass-ssd-2 kernel: [119192.495577] usb 1-1: USB disconnect, device number 2
Sep  3 03:01:01 18087-hass-ssd-2 kernel: [119193.529292] usb 1-1.4: ch341_read_int_callback - usb_submit_urb failed: -19
[a long run of NUL (^@) bytes in the log at the point of the freeze, then the next boot begins]
Sep  3 02:17:07 18087-hass-ssd-2 systemd-modules-load[230]: Inserted module 'g_serial'
Sep  3 02:17:07 18087-hass-ssd-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0
Sep  3 02:17:07 18087-hass-ssd-2 kernel: [    0.000000] Linux version 5.10.105-olimex (root@runner-cpbkaozn-project-1-concurrent-0) (arm-linux-gnueabihf-gcc (Debian 8.3.0-2) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #181328 SMP Fri Feb 17 18:14:43 UTC 2023

Test 2:
I replaced only the board, from Rev.L1 to Rev.K2. I didn't change anything else in the environment.

Procedure:
1.) Shut down the system.
2.) Replace the board with the Rev.K2
3.) Boot from SD card with the official Olimage Linux
4.) Write u-boot to SPI flash:
        u-boot-install /dev/mtd0
5.) Remove the SD card and reboot
6.) The system rebooted from SPI and the SATA SSD disk
7.) Run olinuxino-reset (and reboot again)
8.) apt reinstall linux-image-5.10.180-olimex (to get the latest kernel) and reboot

It has been up and running for 2 days without any problem.



1.) Unfortunately there are no error messages in the logs; the log entries simply stop when the freeze happens. I have a console connected to the board permanently from another PC to see what happens, but so far I have seen only a black screen.

Example:
Sep  1 16:10:00 18087-hass-ssd-2 npm[560]: Zigbee2MQTT:info  2023-09-01 16:10:00: MQTT publish: topic 'zigbee2mqtt/0x0c4314fffe4488b2', payload '{"linkquality":123,"power_on_behavior":"off","state":"ON","update":{"installed_version":587765297,"latest_version":587765297,"state":"idle"},"update_available":null}'
Sep  1 16:10:00 18087-hass-ssd-2 npm[560]: Zigbee2MQTT:info  2023-09-01 16:10:00: MQTT publish: topic 'zigbee2mqtt/0x0c4314fffe4488b2', payload '{"linkquality":123,"power_on_behavior":"off","state":"ON","update":{"installed_version":587765297,"latest_version":587765297,"state":"idle"},"update_available":null}'
Sep  1 16:10:04 18087-hass-ssd-2 systemd[1]: Starting system activity accounting tool...
!!!! FREEZING !!!!
!!!! POWER OFF/POWER ON from now !!!!
Sep  1 15:17:06 18087-hass-ssd-2 systemd-modules-load[230]: Inserted module 'g_serial'
Sep  1 15:17:06 18087-hass-ssd-2 systemd-modules-load[230]: Inserted module 'sun4i_ts'
Sep  1 15:17:06 18087-hass-ssd-2 systemd-random-seed[237]: Kernel entropy pool is not initialized yet, waiting until it is.
Sep  1 15:17:06 18087-hass-ssd-2 fake-hwclock[211]: Fri 01 Sep 2023 01:17:01 PM UTC
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Starting Flush Journal to Persistent Storage...
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Finished Set the console keyboard layout.
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Reached target Local File Systems (Pre).
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Reached target Local File Systems.
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Starting Set console font and keymap...
Sep  1 15:17:06 18087-hass-ssd-2 systemd[1]: Finished Flush Journal to Persistent Storage.


2.)
Ethernet hangs and the Zigbee USB stick hangs... there is no response to Zigbee network commands and no SSH either. I'm not sure the whole board hangs, but it seems completely unresponsive.

3.)
Booting from SPI. The SSD is used as the root filesystem; it was created by the olinuxino-sd-to-sata script.

4.)
I tried both USB ports with the same result. I have also tried a USB hub with its own power supply, but the result is the same.

5.)
The SSD disk is powered from the board.

Maybe the Rev.L1 board needs more power than the Rev.K2? Both Rev.K2 boards are now running stable in the same configuration with the same power source.
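
To compare the power draw of the two boards I may also try dumping the AXP209 PMIC readings from sysfs (a sketch; the exact node names depend on the kernel and device tree):

# list the power-supply nodes registered by the PMIC driver
ls /sys/class/power_supply/
# dump voltage/current readings (values are in microvolts / microamps)
grep . /sys/class/power_supply/*/voltage_now /sys/class/power_supply/*/current_now 2>/dev/null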

kimfaint

Following up on my issue. We have tried the following:

  • Removed the conflicting I2C device, to allow the EEPROM to be read and the correct board to be detected by u-boot.
  • Switched to higher-spec SanDisk "Endurance" SD cards (we were using SanDisk "Extreme"; we may switch to SanDisk "Industrial" SD cards once we find a supplier)
  • Reduced how often our app writes to the SD card
  • Switched the /tmp dir to tmpfs
  • Enabled systemd watchdog control

There has been some improvement in the stability of our Rev.L and Rev.L1 boards, but we still get lockups.

The procedure to enable systemd watchdog control is:
mkdir /etc/systemd/system.conf.d
cat <<EOS > /etc/systemd/system.conf.d/watchdog.conf
[Manager]
RuntimeWatchdogSec=10
EOS
systemctl daemon-reexec

For the tmpfs /tmp dir:
cp -vs /usr/share/systemd/tmp.mount /etc/systemd/system/
systemctl --now enable tmp.mount
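
To confirm that systemd actually took over the hardware watchdog, something like the following can be used (wdctl comes from util-linux and may report the device as busy while systemd holds it open):

# show the watchdog device, its timeout and current state
wdctl
# systemd logs whether the runtime watchdog was armed
journalctl -b | grep -i watchdog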

LubOlimex

There are some different components between K2 and L1; notable ones are:

- eMMC is different (klmag2gend-b031 was replaced by klmag1jetd-b041003)
- RAM memory is different (mt41k256m16ha-125 was replaced with k4b4g1646e)
- voltage stability and reset IC is different (MCP120 was replaced with VDA4510CTA)
Technical support and documentation manager at Olimex

votroto

Same problem. New board Rev.L with eMMC. It has locked up three times so far, both while using an SD card and with the internal eMMC storage.

Latest kernel and everything: 5.10.180-olimex #092646
Listed as A20-OLinuXino-LIME2-e16Gs16M Rev.L1
SN: 0002AF30


LubOlimex

@votroto Can you reproduce the issue reliably? How often does it hang, and are you doing anything in particular when it hangs? Are any logs available?
Technical support and documentation manager at Olimex

votroto

Yep, if I wait long enough, it freezes. Five times so far.
Of course there is nothing in the logs; I have to shut it down by holding the power button, so no logs near the freeze have a chance to get saved.

It's just the board with the 7" LCD and a simple Python script which pulls data over SNMP. It's running the standard minimal image, updated and upgraded using apt. The only difference is that /tmp is a tmpfs in fstab.
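
The fstab entry for that is along these lines (the size limit is optional):

# /etc/fstab - keep /tmp in RAM to avoid constant SD/eMMC writes
tmpfs  /tmp  tmpfs  defaults,noatime,nosuid,size=100M  0  0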

LubOlimex

@votroto Is the power supply solid? Is it possible that it is insufficient to power the setup?

Does the same setup work fine with older revision LIME2 boards (as in the case of user "kimfaint")? Or do you have just one board?

If the problem remains, you might contact me at support@olimex.com with a link to this thread. I might recommend returning the whole setup so we can try to reproduce the hang here in similar conditions and analyze it.
Technical support and documentation manager at Olimex
