Board occasionally will not boot or reset

Started by thom_nic, May 26, 2023, 02:47:17 PM


thom_nic

We are using the STMP1-SOM and occasionally the board will not boot. We have LEDs that are switched on in u-boot and in the kernel, and they never turn on. Worse still, the reset pin doesn't appear to have any effect. We have had to do a full power cycle for the unit to come back.

We are booting from SD card. After the three u-boot partitions there is a /boot partition with boot.scr, kernel, dtb and initramfs, which is kept read-only, and the fifth partition is the rootfs.

I've occasionally also seen a device which would not boot on power cycle. Eventually I removed the SD card, inserted it in an SD reader and connected it to a Linux machine. I did not mount or fsck anything on the host machine. dmesg showed the partitions enumerated correctly on the dev machine. Then I put the SD card back into the SOM and it booted on the next power up.

I would understand if the device could not boot because it could not read the boot partition but I don't believe that is the case.

LubOlimex

When this happens here it is usually one of these two:

1. Worn out power jack connector. Since we empirically test each Olimex-made board, we've noticed that the power plugs that go into the power jacks are not eternal and wear out. Just the other day a colleague complained that the board he was testing started in only 30% of the cases; the solution was to change his power jack, after which he confirmed it worked 50 out of 50 times.

2. Power down followed up by power up in quick succession. Doing so leaves some capacitors charged and improper voltage levels. There should be at least 5-10 seconds between power down and power up, else you are bound to have hiccups.
Technical support and documentation manager at Olimex

thom_nic

Thanks Lub.  This is a STMP1 SOM and the product has an internal AC-DC power converter so the issue is not an external power connector.

Is there any way to deal with capacitance so the unit can recover on its own?  This is an industrial product which may occasionally encounter power interruptions.  It is often inconvenient (and service interrupting) for the customer to dispatch a service call to power cycle the unit to restore normal operation.

LubOlimex

The simplest solution is to have a small Li-Po battery connected at all times. This would exclude sudden power loss and fix many problems related to the main power supply (glitches in supply, occasional drops, noise). Battery monitoring can probably be implemented, so that when the battery goes low (if the main power supply is missing for a long time) the board can perform a software shutdown.
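
A minimal sketch of that monitoring idea, not a tested implementation. The sysfs paths below are assumptions (hypothetical names for the PMU's power-supply entries; check /sys/class/power_supply on the actual image), and the 15% threshold is an arbitrary illustration.

```python
from pathlib import Path

# ASSUMPTION: hypothetical sysfs locations. The real names depend on the
# kernel and PMU driver in use, so verify them on the target board.
BATTERY_CAPACITY = Path("/sys/class/power_supply/axp20x-battery/capacity")
MAINS_ONLINE = Path("/sys/class/power_supply/axp20x-ac/online")


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def should_shutdown(mains_online, capacity_pct, threshold_pct=15):
    """Decide whether a clean software shutdown should be requested.

    mains_online: 1 if main power is present, 0 if absent, None if unknown.
    capacity_pct: battery charge in percent, or None if unknown.
    threshold_pct: ASSUMPTION, an arbitrary example value.
    """
    if mains_online == 1:
        return False  # main supply present, nothing to do
    if capacity_pct is None:
        return False  # unknown battery level: do not act on missing data
    return capacity_pct <= threshold_pct
```

A supervising loop could poll both files every few seconds with `read_int` and trigger the system's normal shutdown procedure once `should_shutdown` returns True.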

Aside from that there are hardware solutions; the simplest would be placing a resistor to help discharge the capacitor faster. The main drawback of adding such a resistor is that the current consumption would increase by, let's say, 20mA/30mA/50mA (depending on the resistor used). Of course you'd probably want to try different resistor values and verify the result empirically.

Before committing to one solution or the other, it is a good idea to replicate the issue in your lab, that is, to confirm it is related to a power down followed quickly by a power up. Just remove power and apply it again almost immediately.

thom_nic

Which capacitor or power rail needs to be discharged? Is it the 3v3? 

I've been able to replicate this if I kill power for ~2s.  I don't know what the lower limit is since I've been doing it by hand.  If power is lost for >2s (roughly) then it boots every time. 
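
A rough sketch of how this off-time sweep could be automated instead of done by hand. It assumes some way to switch input power (a relay or programmable supply) and some way to detect a successful boot (for example a heartbeat on the serial console). `set_power` and `wait_for_boot` are hypothetical callbacks that the test rig would have to supply; nothing here is specific to the SOM.

```python
import time


def sweep_off_times(set_power, wait_for_boot, off_times_s, cycles=20,
                    recovery_s=10.0, sleep=time.sleep):
    """Count boot failures for each power-interruption length.

    set_power(on): hypothetical callback that switches input power.
    wait_for_boot(): hypothetical callback, returns True if the board booted.
    off_times_s: interruption lengths to try, in seconds.
    recovery_s: long outage used to recover a hung board before continuing.
    Returns a dict mapping each off time to its number of failed boots.
    """
    results = {}
    for off_s in off_times_s:
        failures = 0
        for _ in range(cycles):
            set_power(False)
            sleep(off_s)  # the interruption length under test
            set_power(True)
            if not wait_for_boot():
                failures += 1
                # Recover with a long outage so the next cycle starts clean.
                set_power(False)
                sleep(recovery_s)
                set_power(True)
                wait_for_boot()
        results[off_s] = failures
    return results
```

Sweeping something like 0.1 s to 5 s in small steps would show where the failure window actually starts and ends, and the same run can be repeated after any hardware change to compare before and after.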

Is there a way to program the AXP209 to delay turning on?  Assuming the AXP is not also affected by this.

It seems like, if nothing else, it could be possible for us to add a switch that delays enabling power to the SOM.

Adding a LiPo seems like a non-starter.  It would affect our environmental and failure ratings and require mechanical redesign. Plus whenever we decide to shut down completely (say, no input power for > 30s) there would still be that 2s window after we shut down, if power were restored, that would "brick" it.

LubOlimex

Glad that it was identified and confirmed, this is usually the hardest part of fixing an issue.

If you are not going to use a battery (if your design doesn't allow for it), you can test by adding a 330 Ohm resistor on the IPS line (rated at least 100 milliwatts; any PTH size works, and if using SMT then 0805 or larger). If it were me, the easiest place to attach it is the C37 location above the reset button; C37 is not populated by default, which leaves free pads for such a resistor.
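
As a sanity check on those numbers, here is a small sketch of the arithmetic. The 5 V rail voltage and the 100 uF capacitance are assumptions for illustration only (neither value is stated in this thread), and the discharge formula treats the resistor as the only load.

```python
import math


def bleed_current_ma(v_rail, r_ohm):
    """Extra current drawn by the bleed resistor, in mA (Ohm's law)."""
    return 1000.0 * v_rail / r_ohm


def bleed_power_mw(v_rail, r_ohm):
    """Power dissipated in the bleed resistor, in mW (P = V^2 / R)."""
    return 1000.0 * v_rail ** 2 / r_ohm


def discharge_time_s(r_ohm, c_farad, v_start, v_end):
    """Time for a capacitor to decay from v_start to v_end through r_ohm alone."""
    return r_ohm * c_farad * math.log(v_start / v_end)


if __name__ == "__main__":
    V_IPS = 5.0      # ASSUMPTION: rail voltage, not given in the thread
    C_BULK = 100e-6  # ASSUMPTION: placeholder capacitance, not a board value
    R = 330.0        # resistor value suggested above
    print(f"current: {bleed_current_ma(V_IPS, R):.1f} mA")
    print(f"power:   {bleed_power_mw(V_IPS, R):.1f} mW")
    print(f"5 V to 0.5 V: {discharge_time_s(R, C_BULK, V_IPS, 0.5) * 1000:.0f} ms")
```

At an assumed 5 V, 330 Ohm draws about 15 mA and dissipates about 76 mW, which is consistent with the 100 mW minimum rating. The discharge time depends entirely on the real capacitance on the rail, so that figure has to come from the schematic or from a scope measurement.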

Edit: I have actually tested this now and it seems to me that it improves the start-up behavior; I haven't noticed a major downside so far.

We will also do our own investigation and evaluation of this behavior, but I am not sure it can be completely fixed without sacrifices. Overall, it is impossible to avoid it completely: any device that gets powered down in an instant and then powered up a second later is bound to malfunction after a number of such cycles. There are a lot of things going on, like start-up currents and so on.

thom_nic

Please let me know what you conclude in your investigation.  Anything we can do to minimize the time window of this occurring will be beneficial. 

We're buying these in quantity from you already and if there's an SMT mod on the SOM we may see if we can order these from you with the mod in place.  I'm going to talk with the HW/EE lead today (I'm the SW guy) and will follow up.

LubOlimex

We tested a few different hardware values but they don't fix this sort of behavior completely. I still managed to hang it every time, although it was harder to do.

My advice is to test with the 330 Ohm 0805 resistor on the C37 location and see if that works better for you. However, note again that I managed to hang the board with it too.

You might need to also think of an alternative solution.


thom_nic

Thanks for the update.

We will do some testing with a C37 resistor at some point when we have a free tech to run the setup and test, to quantify the before and after behavior. 

Any hardware mitigation on our side will take months to get into production due to where we are in the production lifecycle. Any update to the SOM that reduces the time window (and probability) of the hangup occurring would be welcome. The change would need to come from Olimex as we won't do rework on a third-party OEM component. But we would welcome the option to purchase an updated rev from Olimex.

If it helps I can provide our company name so you can look up our purchase history. We don't buy in huge quantities but we've been regular purchasers of your SOMs for years.

LubOlimex

Quote: Any hardware mitigation on our side will take months to get into production due to where we are in the production lifecycle. Any update to the SOM that reduces the time window (and probability) of the hangup occurring would be welcome. The change would need to come from Olimex as we won't do rework on a third-party OEM component. But we would welcome the option to purchase an updated rev from Olimex.

As long as you have tested and confirmed it works for you we can probably do it, but I am not the person who can promise it. It has to reach the boss, so please drop an e-mail to support@olimex.com describing the issue and linking this thread. Keep me updated on how testing with the resistor goes.