Olimex STMP1 SOM - Error starting up

Started by SteffenFuchs, January 29, 2024, 11:53:57 PM

LubOlimex

We are investigating thoroughly, thanks to the four boards with this issue that we received back from customers. On such boards it seems that the AXP209 sometimes hangs, and u-boot can't find it or communicate with it, so the boot process stops. Our tests and hardware changes so far haven't provided the definitive answers we are seeking. Currently, we are prototyping a number of test boards around the AXP209 to exhaust all hardware scenarios. I will update this thread when I have more news, but I welcome any feedback, experience, or description of the boot behavior of STMP boards - any specific details are helpful.

One workaround is to exclude any AXP209 initialization from u-boot and kernel, the AXP209 still hangs but the board would boot all the time (you'd lose access to any features of the PMU tho, like power management, battery status, etc). It is not clear if it is safe to exclude the AXP209 entirely since it controls a lot of things - like voltage on chip, voltage on HDMI, etc. It might be unsafe to disable the AXP209. Disabling the AXP209 from u-boot or kernel might lead to bigger issues and problems in the long run. More empirical tests are needed to confirm if it is safe.

We will post here some more details on how to exclude the AXP209 if you wish to test that.

Technical support and documentation manager at Olimex

thom_nic

Quote from: Titomax on March 27, 2024, 06:17:44 PM
We initially thought we broke the modules in some way but they started working again after about 20-24 hours after removing them from the motherboard and let them disconnected.

This is exactly the same behavior we've observed as well.  After attempting many times to boot, sometimes leaving the device powered off overnight - I pull off the SOM and replace with a different one (same SD card) and it boots. Must be a bad SOM, right? 

Then a few days later, I take another look at the "bad" SOM on the EVB or I throw it back in the product and now it boots!

LubOlimex

Here, this image has AXP209 and PMU removed from u-boot and kernel, give it a try:

https://ftp.olimex.com/TEMP/SOM-NO-AXP209/STM32MP1-OLinuXino-SOM-bullseye-minimal-20240328-133932.img.7z

Again:

Quote:
One workaround is to exclude any AXP209 initialization from u-boot and kernel, the AXP209 still hangs but the board would boot all the time (you'd lose access to any features of the PMU tho, like power management, battery status, etc). It is not clear if it is safe to exclude the AXP209 entirely since it controls a lot of things - like voltage on chip, voltage on HDMI, etc. It might be unsafe to disable the AXP209. Disabling the AXP209 from u-boot or kernel might lead to bigger issues and problems in the long run. More empirical tests are needed to confirm if it is safe.

thom_nic

Quote from: LubOlimex on March 29, 2024, 01:41:42 PM
Here, this image has AXP209 and PMU removed from u-boot and kernel, give it a try:

Thank you Lub. Could you push your branches to https://github.com/OLIMEX/u-boot-olinuxino and https://github.com/OLIMEX/linux-olimex so we can see the changeset?

LubOlimex

I can give you the diff files with the changes, find them here:

https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/

Not sure if we will make a fork or branch in the public repo without AXP209 since it is a workaround that might suit people with startup issues but would hinder people that need AXP209 tools.

Removing the AXP209 for the boards with HDMI connector is especially risky (like the STMP157-OLinuXino-LIME2 and STMP157-BASE-SOM-EVB).

thom_nic

Thank you.  Is there a plan, or ongoing investigation to solve the root cause and get the AXP209 reliably working again?

LubOlimex

Yes, we are working on it. As I wrote previously: "Currently, we are prototyping a number of test boards around the AXP209 to exhaust all hardware scenarios." These PCBs have been manufactured and tests are currently being carried out. This can't be done very quickly since it takes a lot of hardware tinkering, soldering, and empirical testing to get to the bottom of the issue.

LubOlimex

An update on this issue so far.

Summary:

This appears to be a somewhat rare issue, since we couldn't find any such boards in our warehouse. This means not all boards are affected. Our guess is that different AXP209 chips have different tolerances, and that these, combined with the tolerances of all other related components, lead to only a small percentage of boards being affected by start-up issues. Furthermore, these start-up issues differ in intensity - some boards might fail to start in 10% of power-ups, others in 50% of power-ups, and some can rarely start up at all.
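
Because the failure rate differs so much between boards, the number of power-ups needed to catch an affected board differs too. A minimal sketch of that arithmetic, assuming each power-up fails independently with a fixed probability (a simplifying assumption for illustration, not something established in this thread; the function names are made up for this example):

```python
import math


def prob_at_least_one_failure(fail_rate: float, power_ups: int) -> float:
    """Chance of seeing at least one failed start in `power_ups` attempts."""
    return 1.0 - (1.0 - fail_rate) ** power_ups


def power_ups_needed(fail_rate: float, confidence: float) -> int:
    """Smallest number of power-ups that shows a failure with the given confidence."""
    if not 0.0 < fail_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("fail_rate and confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - fail_rate)**n >= confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))
```

Under that assumption, a board that fails 10% of its power-ups needs 29 attempts for a 95% chance of showing at least one failure, while a board that fails 50% of the time needs only 5.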

We got 4 boards back from two customers posting in this thread (each sent us two boards). This helped a lot, thanks again. All four boards had start-up issues on power-up to some degree. Two of the boards were incredibly bad cases that only very rarely started up - we managed to get both of these boards to boot via hardware changes: one board had the AXP209 replaced and started working fine, the other started booting after changing R64 from 10k to 100k.

The nature of the issue is that the I2C communication between the AXP209 and the main chip dies. Since the official Olimage images include AXP209 support, when the software can't find the AXP209 over I2C, the Linux image won't start. It is not clear what causes that hang in the I2C communication and why only some boards are affected. We are still working on hardware solutions for future hardware revisions.
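
On an image that boots without the AXP209 driver, one way to check whether the chip is still answering on the bus is to look at the output of `i2cdetect -y <bus>` from i2c-tools. The sketch below only parses that text. It assumes the AXP209 sits at its usual 7-bit address 0x34; the bus number is not given in this thread and has to be supplied by whoever runs the scan. The helper names are hypothetical.

```python
AXP209_I2C_ADDRESS = 0x34  # usual 7-bit AXP209 address; bus number is board-specific


def cell_for_address(i2cdetect_output: str, address: int) -> str:
    """Return the i2cdetect table cell for a 7-bit address ("" if the row is missing)."""
    row_prefix = f"{address & 0xF0:02x}:"
    column = address & 0x0F
    for line in i2cdetect_output.splitlines():
        if line.startswith(row_prefix):
            # After the "30:" prefix, every cell is one space plus two characters.
            start = len(row_prefix) + 3 * column + 1
            return line[start:start + 2].strip()
    return ""


def describe_axp209(i2cdetect_output: str) -> str:
    """Classify the AXP209 cell of an i2cdetect scan."""
    cell = cell_for_address(i2cdetect_output, AXP209_I2C_ADDRESS)
    if cell == "UU":
        return "claimed by a kernel driver"  # probing skipped, driver is bound
    if cell == f"{AXP209_I2C_ADDRESS:02x}":
        return "responding"
    if cell == "--":
        return "not responding"
    return "unknown"
```

A "not responding" result on a board that otherwise boots would be consistent with the hang described above, though it does not by itself explain the cause.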

The workarounds we found so far:

1. Remove all drivers for the AXP209 from Linux. This will make the board boot, but all AXP209 control over I2C would be gone. Some functions would be missing (like reading battery charge, setting the board in low power modes, etc). We already created such a Linux image without any initialization of the AXP209; it is available here:

https://ftp.olimex.com/TEMP/SOM-NO-AXP209/STM32MP1-OLinuXino-SOM-bullseye-minimal-20240328-133932.img.7z

Differences can be seen here:

https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/

2. Hardware workaround - adding some hardware delay by increasing the value of resistor R62 from 10k to 100k; with this change the boards don't hang on power-up (as long as there are at least two seconds of power-down between power-ups). Keep in mind R62 is a bit hard to replace (size 0402, located under a plastic connector); we bend the plastic connector before replacing it, then bend it back. So far this seems like an acceptable workaround, and all boards manufactured since the start of April will leave the factory with R64 = 100k.
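
For a rough feel of why a larger resistor adds delay: if the resistor is part of a simple first-order RC network (an assumption, since the thread does not describe the circuit), the time constant scales linearly with R, so going from 10k to 100k stretches the delay tenfold. The capacitor value below is purely hypothetical and is not taken from the schematic.

```python
import math


def rc_time_constant(resistance_ohm: float, capacitance_farad: float) -> float:
    """Time constant tau = R * C of a first-order RC network, in seconds."""
    return resistance_ohm * capacitance_farad


def time_to_reach(fraction: float, tau: float) -> float:
    """Time for a charging RC node to reach `fraction` of its final voltage."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -tau * math.log(1.0 - fraction)


C_HYPOTHETICAL = 100e-9  # 100 nF, illustrative only, NOT from the Olimex schematic

tau_old = rc_time_constant(10e3, C_HYPOTHETICAL)   # original 10k resistor
tau_new = rc_time_constant(100e3, C_HYPOTHETICAL)  # replacement 100k resistor
```

Whatever the real capacitance is, the ratio `tau_new / tau_old` is 10 under this model; only the absolute delay depends on the assumed capacitor.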

If the software workaround doesn't work for your goals, or if you are unable to perform the hardware workaround, please contact us at support@olimex.com - we will arrange return and replacement of R64 with 100k.

We are still running hardware tests and trying different solutions, and hope to fix this issue completely in the next hardware revision. I will update you when that happens.

thom_nic

Thanks very much for the update Lub.  Couple questions:

- Did you rev the hardware with the resistor change? We will not make hardware mods to a third-party component including the Olimex. It must come with the fix.  Can you please confirm the YYWW date code and lot code when this change was made?  We will likely inform our CMs to stop using older SOMs and replace them ASAP.

- Can you please provide a full list of lost capability due to disabling the AXP209 in software?  You said battery charging, low power states, and I think you said elsewhere that some video output capability would be lost.  Is that everything?  I just want to ensure I fully understand the possible effects of disabling the PMIC.

That aside, I am still struggling to understand the root cause of the issue.  Virtually every STMP1 SOM we have on hand in engineering, we can cause this in about 10 minutes or less of power cycling (1 minute off between each power up.)  After it gets to this state, the unit may be left off for hours (overnight) and still will not boot.  Swapping the SOM always resolves it.  It seems like infrequently a "bad" SOM will boot again after it has been pulled off the mainboard for a while.  But usually it will hang again after the next  power cycle or two.
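
The power-cycling procedure described above could be automated along these lines. This is only a sketch: how power is switched and how a successful boot is detected are left as caller-supplied functions, since neither is specified in this thread, and all names here are hypothetical.

```python
import time
from typing import Callable, List


def run_power_cycles(
    power_on: Callable[[], None],
    power_off: Callable[[], None],
    has_booted: Callable[[], bool],
    cycles: int,
    off_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Power-cycle a board `cycles` times and record whether each boot succeeded."""
    results = []
    for _ in range(cycles):
        power_on()
        results.append(has_booted())  # e.g. wait for a login prompt on the serial console
        power_off()
        sleep(off_seconds)  # off time between power-ups (1 minute by default)
    return results


def first_failure(results: List[bool]) -> int:
    """Return the 1-based cycle number of the first failed boot, or 0 if none failed."""
    for number, ok in enumerate(results, start=1):
        if not ok:
            return number
    return 0
```

Logging the cycle number of the first failure per SOM would make it easier to compare boards and to check whether a hardware change actually moves that number.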

Have you already or do you plan to also add this improvement to the next board revision?

thom_nic

Lub I'm looking at the diff files you posted above.  Both diffs appear to be from the same file in the linux kernel source.  Should one of them be for the u-boot source?  I am guessing modules/stmp1-boot/u-boot-olimex/board/olimex/stm32mp1_olinuxino/spl.c

LubOlimex

Good catch, my bad! Proper diff files are now uploaded at the same location:

https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/

Quote from: thom_nic
Can you please provide a full list of lost capability due to disabling the AXP209 in software?  You said battery charging, low power states, and I think you said elsewhere that some video output capability would be lost.  Is that everything?  I just want to ensure I fully understand the possible effects of disabling the PMIC.

It is mainly the battery-related software functions. The HDMI should be fine for the SOM board (I mainly mentioned it if somebody performs similar removal of AXP support for the LIME or BASE SOM images). Not much should be lost from lack of it. We are not fully aware ourselves of all implications of removing AXP209 driver from the Linux, but on first and second look and tests so far it seems fine.

Drop me an e-mail at support@olimex.com to discuss the details about the affected boards and how to ensure future purchases have the hardware fix applied (and also to discuss returning the two boards that you sent back).

LubOlimex

The boards with the 100k resistor will be revision D1, indicated only on the box (the PCB marking will still say D). All boards purchased directly from our web-shop will be revision D1, so they will have the 100k resistor.

I will update the schematic later today.

thom_nic

Edit: disregard, I was looking at the wrong SOM product page.

Hi Lub - I do not see the updated PDF schematic on the product page and I also cannot find any repo on https://github.com/OLIMEX/ for the STMP1 SOM hardware.  In addition to just schematic it would be helpful to know where to find the hardware changelog for this SOM.  For example many of the SOMs are here but not the STMP1.  Thanks.