NAND corrupted after power failure

Started by martinayotte, September 04, 2013, 04:23:40 AM

martinayotte

Hi everyone,

Yesterday my NAND rootfs got corrupted after a power failure during a storm, my second corruption in the last 2 months. Unfortunately, like the first time, fsck.ext3 didn't help recover the partition; I had to erase everything and restore from backup, and this time my backup was 3 weeks old ... Anyway, I'm getting back on track since my project sources are in Subversion, but all the apt-get installs I've done during those weeks need to be redone ... :-(

But this situation is starting to scare me: what will happen with my final product, since it won't be on a UPS?
Sure, I can start thinking about making the rootfs read-only, but I will still need a writable partition, which can still get corrupted ... Why can an almost idle A10s board corrupt its NAND so easily?

Any thoughts ?


olimex

Hi
this is neither a PIC nor an AVR; a power failure on a Linux computer is always bad, especially if you are in the middle of a write operation.
you can try to harden your Linux by making everything read-only, if this is possible

martinayotte

Hi Olimex,

I understand your answer: I've been an engineer for 25 years, and I know that a Linux partition can get corrupted if the system is not halted properly.

But, unfortunately, after more investigation and analysis, I'm still scared of using the NAND in my final product.
Here is what I've concluded:

Since this was my second NAND corruption, I backed up the corrupted partition before restoring the previous backup. I discovered that tons of the corrupted files were located in X11 libs, such as XFCE and Gtk. None of those files had been opened for writing; they had only been read, since my A10s board was running X11. The power failure occurred while I was preparing supper, so the board was idle and X11 had been in screensaver mode for an hour.

So, this investigation leads me to the following thoughts:
From my engineering background, I presume that the voltage level on the NAND is a bit tricky: the NAND may get corrupted because the Allwinner A10s is still trying to do transactions with it, even read cycles, which look like write cycles to the NAND because the WriteEnable voltage drops too low before the CPU is halted by the current AXP209 supervisor ...
If that is the case, and you can consult your engineering team about it, a dedicated voltage supervisor should be added to disable any NAND access under low-voltage conditions, maybe a cheap one with a hard-coded voltage threshold, available in a 3-pin form factor. This could perhaps be planned for the next PCB revision.

To prove that, I will try to set up a read-only partition alongside a writable partition in the same NAND. If the read-only partition still gets corrupted, this will prove that the problem is a chip-level write, since both partitions will be on the same chip. But for now, since I've lost time on my current development, I won't do those tests in the coming days; they will be added to my task list ...

olimex

please let us know about your tests, we are always open to suggestions about improving the hardware

a simple brute-force solution would be to add a LiPo battery and force a shutdown when the main power supply voltage is missing

a more clever solution would be a comparator which monitors the 5V input power and, when it drops to 4V, asserts the WP signal on the NAND flash, but this will not protect the SD card if one is present too
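
The comparator idea can be sketched in software terms, purely as an illustration of the threshold logic and not as a design. The 4.0 V trip point is the one mentioned above; the 4.2 V release threshold (hysteresis, so the output does not chatter around the trip point) and the function name are assumptions made for this example.

```python
def wp_asserted_trace(samples, trip_v=4.0, release_v=4.2):
    """Model a comparator with hysteresis watching the 5V input rail.

    Returns one boolean per voltage sample: True while write-protect
    would be asserted. trip_v is the threshold from the post; release_v
    is a hypothetical hysteresis level chosen for illustration.
    """
    asserted = False
    trace = []
    for v in samples:
        if not asserted and v < trip_v:
            asserted = True       # rail dropped below the trip point
        elif asserted and v > release_v:
            asserted = False      # rail recovered above the release level
        trace.append(asserted)
    return trace


# A rail that sags during a brown-out and then recovers.
print(wp_asserted_trace([5.0, 4.5, 3.9, 4.1, 4.3]))
```

Note that at 4.1 V the protection stays asserted: the rail has to climb above the release level before writes are allowed again.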

martinayotte

Although I haven't had the chance/time to do the read-only partition tests to confirm my suspicions about the corruption, I've looked at the NAND connection on the A10-A20 schematic in the meantime, until the A10s NAND schematic becomes available:

I see that WP is tied to an R-C circuit, which means that WP isn't protecting the NAND from corruption during power down, since the R-C voltage decays gradually/slowly before the whole board shuts down ...
I suggest that, in future PCB revisions, this WP signal on the NAND should be either connected to RESET (managed by the AXP209) or controlled by a dedicated voltage supervisor such as the LTC2915 or MC3416x
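
As a rough illustration of why an R-C network reacts slowly, here is a small Python sketch. It assumes the simplest possible model, a capacitor discharging through the resistor toward 0 V, so V(t) = V0 * exp(-t / RC) and the time to reach a threshold is RC * ln(V0 / Vth). The component values and the threshold below are hypothetical, not taken from the schematic, and the real rail does not collapse instantly, so treat the number as an order of magnitude only.

```python
import math


def rc_fall_time(r_ohms, c_farads, v_start, v_threshold):
    """Seconds for an R-C node discharging toward 0 V to fall
    from v_start down to v_threshold (simple exponential model)."""
    if not (0 < v_threshold < v_start):
        raise ValueError("need 0 < v_threshold < v_start")
    return r_ohms * c_farads * math.log(v_start / v_threshold)


# Hypothetical values: 10 kOhm, 100 nF, 3.3 V rail, 0.8 V logic-low threshold.
t = rc_fall_time(10e3, 100e-9, 3.3, 0.8)
print(f"WP stays above the threshold for about {t * 1e3:.2f} ms")
```

With these assumed values WP would stay high for more than a millisecond after the supply is gone, which is a long time compared to a NAND bus cycle.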


haydn87

Hello,
my team and I are experiencing problems with corrupted NAND after power failures as well.

In order to verify that the NAND corruption is caused by the WP signal, as suggested by martinayotte, we have come up with the idea of blocking any writes to the NAND by short-circuiting the capacitor in the R-C circuit, tying the write-protect signal to ground. Is this method going to work?

If the problem is the WP signal and we choose to use a voltage supervisor to disable any NAND writes under low-voltage conditions, is it sufficient for the supervisor to connect the pin PC16/NVP directly to ground, or does it need a resistor?

Moreover, martinayotte suggests the problem is the slowly decaying voltage on the R-C circuit. Is it possible to replace the capacitor of the R-C circuit with a resistor?

Thank you.

MBR

I think that modifying the hardware to ensure read-only is a little overkill, especially when Linux has sundry read-only filesystems (e.g. SquashFS), where corruption of the filesystem (by writing into it) is utterly impossible. And the only non-modifiable partition - the first FAT32 one - doesn't have to be mounted at all.
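
To check which partitions can still be written once such a read-only setup is in place, a minimal Python sketch follows. It parses text in the /proc/mounts format (device, mount point, filesystem type, options, ...) and lists the mount points whose options include rw. The device names and mount points in the sample are hypothetical and only there to make the example self-contained.

```python
def writable_mounts(mounts_text):
    """Return the mount points mounted read-write, given text in
    the /proc/mounts format (one mount per line)."""
    writable = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "rw" in options:
            writable.append(mount_point)
    return writable


# Hypothetical layout: SquashFS root, one small writable data partition.
SAMPLE = """\
/dev/nandb / squashfs ro,relatime 0 0
/dev/nandc /data ext3 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid 0 0
"""
print(writable_mounts(SAMPLE))
```

On a real board the same function could be fed the contents of /proc/mounts; anything it reports on the NAND is still exposed to corruption from interrupted writes.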