SD card errors on A64-OLinuXino-2Ge8G-IND

Started by ilario, September 13, 2023, 05:00:11 PM


ilario

Dear all,
I have two A64-OLinuXino-2Ge8G-IND each one with a Kingston SDCIT2/32GB SD card and running A64-OLinuXino-bullseye-base-20230515-130040 image with kernel 5.10.180-olimex.

I tested one of the cards with badblocks -n after connecting it through the USB SD card reader available on the Olimex website, and found no errors.

On both A64-OLinuXino-2Ge8G-IND units, after some random amount of time with disk activity, I get errors in dmesg such as these:

mmc_erase: group start error -110, status 0x0
sunxi-mmc 1c0f000.mmc: data error, sending stop command
sunxi-mmc 1c0f000.mmc: send stop command failed
blk_update_request: I/O error, dev mmcblk0, sector 333704 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
EXT4-fs warning (device mmcblk0p1): ext4_end_bio:351: I/O error 10 writing to inode 272917 starting block 41714)
Buffer I/O error on device mmcblk0p1, logical block 39153
blk_update_request: I/O error, dev mmcblk0, sector 47925289 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0

The internal eMMC memory has never shown any such messages.
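
For anyone wanting to watch for the same symptoms, here is a minimal sketch that keeps only the storage-related lines of a kernel log. The function name filter_mmc_errors is made up for illustration, and the patterns are simply taken from the messages quoted above, so they are not a complete list of MMC failure messages.

```shell
#!/bin/sh
# Hypothetical helper: keep only storage-related kernel messages read from stdin.
# The patterns come from the log lines quoted in this post; other MMC
# failures may use different wording.
filter_mmc_errors() {
    grep -E 'mmc_erase|sunxi-mmc|I/O error'
}

# Typical use on the board (reading the kernel log may need root):
#   dmesg | filter_mmc_errors
```

Piping journalctl -k into the same function should work as well, since it only looks at the message text.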

I tried to reproduce these very nice instructions but cannot find any of the files mentioned (and I am not brave enough to create a dtb overlay by myself).

My setup is somewhat unusual, but I believe the errors I am observing are due to the SD card driver, the bus speed, or something similar. The 32 GB SD card is partitioned into a 24 GB and an 8 GB partition. The second partition has exactly the same size in sectors as the partition on the internal eMMC memory, and these two partitions are joined in a RAID1 (mirror), with the original idea of minimizing the risk of data loss. The resulting md0 device stores the database of a Prometheus instance running on the A64-OLinuXino-2Ge8G-IND.
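
With a mirror like this, an I/O error on one member can drop it from the array without the system stopping, so it is worth checking the array state from time to time. A minimal sketch, assuming the usual /proc/mdstat layout where the status field looks like [UU] for a healthy two-disk mirror and contains "_" for a missing member; the function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array whose status field contains "_",
# i.e. at least one member is missing. Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                      # remember the current array
        /\[[U_]*_[U_]*\]/ && name != "" { print name }   # status such as [U_] or [_U]
    '
}

# Typical use:
#   degraded_arrays < /proc/mdstat
```

An empty result means that no array reports a missing member; mdadm --detail /dev/md0 gives the full picture for a single array.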

Thanks!
Ilario

LubOlimex

Do you perform hardware power offs? Maybe just the fs on the EEPROM gets corrupted?

The instructions you found in the forum no longer apply. You are using the latest image that already has the fixes applied. Please ignore the instructions from the forum.

My first recommendation is:

1) Check the cards with F3 or H2testw software beforehand, this will show any errors with the cards.

2) Write the cards with BalenaEtcher or USBImager; both programs are free and allow verification after writing.

Let me know if the issue persists after that.
Technical support and documentation manager at Olimex

ilario

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> Do you perform hardware power offs?

A few times I forgot to issue the poweroff command before unplugging, could this cause the issue?

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> Maybe just the fs on the EEPROM gets corrupted?

EEPROM? How do I check that?

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> My first recommendation is:
>
> 1) Check the cards with F3 or H2testw software beforehand, this will show any errors with the cards.

Ok, I will. Up to now I just checked one of the 2 SD cards using badblocks -n (non-destructive write test). I will test both with F3.

Quote from: LubOlimex on September 14, 2023, 08:53:18 AM
> 2) Write the cards with BalenaEtcher or USBImager; both programs are free and allow verification after writing.

I made one of the two SD cards with BalenaEtcher on Windows 11 and the second one with dd on Arch Linux.

Just one more piece of information, that I forgot to add in my initial post:

From dmesg the speed of the SD card and eMMC memory are:
mmc0: new high speed SDHC card at address 5048
mmcblk0: mmc0:5048 SDCIT 29.9 GiB
 mmcblk0: p1 p2
mmc1: new high speed MMC card at address 0001
mmcblk1: mmc1:0001 Q2J55L 7.09 GiB
mmcblk1boot0: mmc1:0001 Q2J55L partition 1 16.0 MiB
mmcblk1boot1: mmc1:0001 Q2J55L partition 2 16.0 MiB
 mmcblk1: p1

(interestingly, I cannot mount mmcblk1boot0 nor mmcblk1boot1, but this is another topic)

I will update you as soon as I manage to complete the F3 tests on both cards.

LubOlimex

> A few times I forgot to issue the poweroff command before unplugging, could this cause the issue?

Yes. To prevent that, avoid cutting the power without a clean shutdown. If that is hard to achieve, consider using a Li-Po battery as a backup.

ilario

Updates:

Using the USB SD card reader, on both SD cards, I did:
  • test the OS partition with f3write and f3read, no errors for either of the cards
  • test the RAID partition with badblocks -n (it was easier than looking for a way to mount the partition), no errors for either of the cards

Then I put the SD cards back into the Olinuxino units and tested in the same way from there:
  • the OS partition with f3write and f3read (write speed 18 MB/s, read speed 23 MB/s), no errors
  • the RAID partition (without the RAID running, as it degraded due to the previous errors) tested with badblocks -n, no errors

It is quite surprising that I see errors during normal operation but not during these tests.
After adding the SD card partition back to the RAID, I ran F3 on the RAID (17 MB/s write, 30 MB/s read; not bad) and observed no errors on either of the A64 units.

I will let them run over the weekend, to see if the problem appears again.

Another (very likely unrelated) odd behaviour I observed is that mmcblk0 and mmcblk1 randomly swap between boots (sometimes mmcblk0 is the SD card and sometimes it is the internal eMMC). Is this expected?
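
Since the numbering cannot be relied on, a minimal sketch for telling the two devices apart at runtime: the kernel exposes the card type in sysfs, "SD" for an SD card and "MMC" for eMMC. The function name list_mmc_types and its optional directory argument are only there for illustration.

```shell
#!/bin/sh
# List each mmcblk device with the card type the kernel reports
# ("SD" for an SD card, "MMC" for eMMC). The optional argument is the
# sysfs block directory; it defaults to /sys/block.
list_mmc_types() {
    root=${1:-/sys/block}
    for f in "$root"/mmcblk*/device/type; do
        [ -r "$f" ] || continue      # the glob matched nothing
        dev=${f#"$root"/}            # strip the directory prefix
        dev=${dev%%/*}               # keep only the device name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Typical use:
#   list_mmc_types
```

For /etc/fstab, referring to filesystems by UUID= instead of /dev/mmcblkXpY avoids depending on the numbering. The RAID itself is assembled from the md superblocks, so it should not care which name each member gets.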

ilario

Checking journalctl logs, I saw that the errors started after this line:

systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...

This message comes from the fstrim.timer systemd unit, which launches fstrim.service, which in turn runs this command:

/sbin/fstrim --listed-in /etc/fstab

Running either of these commands manually

fstrim -v /

or

fstrim -v /mnt/raid1

triggered the same errors in dmesg.

So, for now I disabled the fstrim.timer systemd unit with:
systemctl disable fstrim.timer
systemctl stop fstrim.timer
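
Since the errors follow discard requests (note the mmc_erase line in the log), it may help to see which block devices advertise discard support at all. A minimal sketch, relying on the queue/discard_max_bytes attribute in sysfs, which is 0 when a device does not support discard; the function name discard_supported and the optional second argument are hypothetical.

```shell
#!/bin/sh
# Report whether a block device advertises discard (TRIM) support.
# discard_max_bytes is 0 when the device does not support discard.
# $1: device name (e.g. mmcblk0 or md0)
# $2: sysfs block directory, defaults to /sys/block
discard_supported() {
    f="${2:-/sys/block}/$1/queue/discard_max_bytes"
    [ -r "$f" ] || { echo "unknown"; return 1; }
    if [ "$(cat "$f")" -gt 0 ]; then echo "yes"; else echo "no"; fi
}

# Typical use:
#   discard_supported mmcblk0
#   discard_supported md0
```

lsblk --discard shows the same information for all devices in one table.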

I will let you know if the problem appears again.

Does this problem happen to everyone?
What is the right solution? Is this a driver bug?
I tried adding a "nodiscard" option in /etc/fstab, but it made no difference (it seems that nodiscard is an option accepted by mount but not recognized in fstab).