At the end of last year we purchased approx. 30 STM32MP1 SOMs to use in a pilot series of our measuring equipment.
We encountered the following issue that we wanted to bring to your attention and ask,
- If you experience the same behavior, and
- If you can advise us how to avoid the issue
Our observation has been:
- When powering up the SOM the system sometimes does not boot
- When observing the output on the provided UART connector, it stops after only a few lines, the last one printed being "Init AXP209 PMIC"
This happens:
- with at least one of the boards purchased all the time,
- some more boards when they have been disconnected from power supply for less than a minute, and
- nearly all boards when disconnected from power for only a few (e.g. less than 3) seconds, but not all the boards all the time
The behavior was the same with different setups connecting the SOM to:
- Our application board
- Your Olimex STMP1(A13)-EVB
- A simple board just supplying GND and 5 V to the LCD-TR1 connector
- The behavior has been the same with out-of-the-box boards that have never been connected to our application board, to rule out any possible defect caused by us
The software revisions used are your images STM32MP1-OLinuXino-SOM-bullseye-minimal both from
- July 25th 2023
- December 22nd 2023
In order to investigate further I added a few printf statements to the U-Boot code from December 22nd to see what is happening, and found this (all mentioned files relative to the U-Boot root directory):
- board/st/stm32mp1/spl.c, function board_early_init_f: calls axp_init
- arch/arm/mach-stm32mp/axp209.c, function axp_init: calls axp_bus_init
- arch/arm/mach-stm32mp/axp209.c, function axp_bus_init:
first calls uclass_get_device_by_seq, which is OK,
then calls dm_i2c_probe
- drivers/i2c/i2c-uclass.c, function dm_i2c_probe: calls i2c_probe_chip
- drivers/i2c/i2c-uclass.c, function i2c_probe_chip:
eventually calls function pointer ops->xfer, with one message of zero length, to probe whether the chip answers to its address on the I²C-bus
ops->xfer happens to point to stm32_i2c_xfer, as defined in drivers/i2c/stm32f7_i2c.c, struct dm_i2c_ops stm32_i2c_ops
- drivers/i2c/stm32f7_i2c.c, function stm32_i2c_xfer:
first calls stm32_i2c_check_device_busy, which is OK
then calls stm32_i2c_message_xfer
- drivers/i2c/stm32f7_i2c.c, function stm32_i2c_message_xfer:
first calls stm32_i2c_message_start, which puts the address on the I²C-bus and returns OK
since the message has zero length, the while loop with all calls to readb and writeb is skipped
finally stm32_i2c_check_end_of_message is called
- drivers/i2c/stm32f7_i2c.c, function stm32_i2c_check_end_of_message:
calls stm32_i2c_wait_flags, which waits for one of these flags to be set: STM32_I2C_ISR_ERRORS | STM32_I2C_ISR_NACKF | STM32_I2C_ISR_STOPF
then it checks for these flags and additionally for STM32_I2C_ISR_ARLO
If the error occurs, it finds flag STM32_I2C_ISR_NACKF set
the function then returns -EIO, which is -5
The returned value is propagated all the way back to board_early_init_f, where it is ignored.
In board_early_init_f, in case of error, one of the subsequent calls to axp_set_aldo3 and axp_set_aldo2 halts the board. I did not investigate this further, as the previous function already showed issues that are not dealt with. As a result, board_early_init_f never terminates and startup never continues.
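To illustrate the pattern described above, here is a minimal, self-contained C sketch. It is not the U-Boot code: fake_pmic, fake_i2c_probe and probe_with_retry are hypothetical names, and the fake chip simply NACKs its first few probes. It only shows that a probe result of -EIO can be retried and handed back to the caller instead of being ignored; whether retrying helps on the real hardware is untested.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the PMIC on the bus: it NACKs its address
 * for the first `nacks_left` probes and ACKs afterwards. */
struct fake_pmic {
    int nacks_left;
};

/* Models a zero-length probe transfer: 0 on ACK, -EIO on NACK. */
static int fake_i2c_probe(struct fake_pmic *chip)
{
    if (chip->nacks_left > 0) {
        chip->nacks_left--;
        return -EIO;
    }
    return 0;
}

/* Probe up to max_tries times and return the last result to the caller
 * instead of dropping it. */
static int probe_with_retry(struct fake_pmic *chip, int max_tries)
{
    int ret = -EIO;

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        ret = fake_i2c_probe(chip);
        if (ret == 0)
            break; /* chip answered, stop retrying */
    }
    return ret;
}
```

With a fake chip that NACKs twice and five tries allowed, probe_with_retry returns 0. With a chip that never answers it returns -EIO, which the caller can then act on rather than continuing as if the PMIC were present.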
Reading up on I²C-bus I found that a NACK directly after address is sent occurs under these circumstances:
- the I²C data line is driven open-collector by all participants, with a pull-up holding the line at 1 if no participant drives it low
- normally after a chip has been addressed the master does not drive data low
- but the slave that has been addressed should, indicating that it is present and ready to receive data
- in case of the error the AXP209 however does not, which can be due to these issues:
- The address sent is wrong (at least as far as software goes the value is the same whether the error occurs or not, but I did not oscillograph the actual bus as it is difficult to access)
- The chip addressed is not present
- The chip does not listen to the bus
- The chip is in a different state expecting other data
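The wired-AND behavior of the data line described above can be modeled in a few lines of C. This is only an illustration of the bus principle, not driver code; sda_level and address_acked are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Open-drain line: it reads high only if no participant pulls it low,
 * because the pull-up resistor is the only thing driving it high. */
static bool sda_level(const bool pulls_low[], size_t n)
{
    assert(pulls_low != NULL);
    for (size_t i = 0; i < n; i++) {
        if (pulls_low[i])
            return false; /* any single driver forces the line low */
    }
    return true; /* nobody drives: pull-up holds the line at 1 */
}

/* During the acknowledge clock after the address byte the master
 * releases the line. Low means ACK, high means NACK. */
static bool address_acked(bool slave_responds)
{
    const bool pulls_low[2] = {
        false,          /* master: released during the ACK slot */
        slave_responds, /* slave: pulls low only if it recognized its address */
    };
    return !sda_level(pulls_low, 2);
}
```

A slave that is absent, not listening, or in an unexpected state all look the same to the master in this model: the line stays high and the master sees a NACK.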
This is how far we have investigated the issue. If you like, we can send you the board that exhibits the error during each start, as well as one of the many boards that show the error only when the power is interrupted very briefly.
Best regards,
Steffen
Hello,
I have experienced a similar issue with the STMP157-OLinuXino-LIME2 SBC.
The board booted successfully a couple of times when it arrived, with different images burned to the SD card. However, after two days of development it suddenly got stuck when trying to boot up. The only output that I get is:
U-Boot SPL 2021.04+olimex-1-20231222.140935 (Dec 22 2023 - 14:12:07 +0000)
Model: STM32MP1 OLinuXino-LIME
Init AXP209 PMIC
VDD Core set to: 1350 mv
This seems to be similar to what SteffenFuchs wrote above. Whatever I change, the SD card or the image (I tried the supplied image as well), nothing seems to fix the issue and the board remains unusable, stuck in a frozen boot. I would appreciate any form of help or advice here.
Kind Regards,
Jacob
You need to use the base image from here:
https://images.olimex.com/release/stm32-lime/
You need a solid 10W 5V DC power supply attached to the power jack.
I'm using a rev C STMP1-SOM with a rev B EVB. The 5V wall wart states 5VDC 2500mA. I'm seeing the same behavior:
U-Boot SPL 2021.04+olimex-1-20231222.140935 (Dec 22 2023 - 14:12:07 +0000)
Model: STM32MP1XX OLinuXino-SOM
Init AXP209 PMIC
U-Boot SPL 2021.04+olimex-1-20231222.140935 (Dec 22 2023 - 14:12:07 +0000)
Model: STM32MP1XX OLinuXino-SOM
Init AXP209 PMIC
It usually prints twice after I hit the RESET button. I'm using the stock image downloaded from this link (http://images.olimex.com/release/stm32-som/).
Thank you all for the replies! jper and thom_nic, that is exactly what I have encountered with several boards, though not all of them all the time, so it seems to be an issue that comes up more frequently.
In addition to all my tests above, I can confirm that I used exactly the images linked above by you, LubOlimex, and I have also tried the previous images from July with no difference.
Further the manual for the board found here (https://www.olimex.com/Products/SOM/STMP1/STMP157-SOM-512/resources/STMP15x-SOM.pdf) states on the top of page 7 that a 5 V 1 A power supply is needed - which is only 5 W instead of the 10 W mentioned above.
However I have encountered the same behavior with a 5 V 3 A 15 W power supply as with a smaller one.
So can you at Olimex please start investigating the issue? I think it comes down to the I²C communication between the STM32MP1 and the AXP209 you use on your board, which fails according to my research and testing as I have written above, but I have no access to any means to go further myself.
I've been pulling my hair out because we're randomly seeing products not boot. Sometimes it comes back after fiddling a bit. I've been thinking it was some sort of SD card issue (even though no writing occurs to the boot partitions!), but now I'm learning maybe it's board/power related.
For our part, we need to get to the bottom of this quickly, we have hundreds of devices out in the field and they are very costly to RMA (not to mention a poor customer experience) when one refuses to boot.
The problem for us is that we can't replicate this issue. The many boards we have tested always start up. Maybe only some boards experience this problem, maybe in specific conditions. Until we can replicate the issue here we can't solve anything. If you have boards that don't start, please contact us at support@olimex.com so we can arrange returns to experiment with them. Note that some reasonable fault rate for the start-up hang is expected. For example, if one board fails to start once when it is unpowered and repowered 1000 times in 10 days, that is not something we can replicate here.
Dear LubOlimex,
thank you for your reply.
I did send you that email on January 8th already, also offering to send back some boards we experienced problems with, but have not received any reply from support@olimex.com since, so I had to go through the forum.
Maybe it got lost somewhere around here, re-send it - we are interested in the returns of boards that experience the issue regularly so we can see the hang ourselves.
Just did.
At the moment I have one that consistently hangs. I'll get it returned soon.
Make sure to include a full description of the behavior and steps we should take to replicate it inside so we can pay closer attention. Also request RMA over support@olimex.com (maybe in the e-mail include link to this thread so we can link the request with the behavior described in the forum).
The two boards I returned should have arrived March 8th. Hope you can reproduce the behavior and find the cause.
Yeah thanks for the effort, they reached me. So far I can replicate the behavior you described 1:1 with the first board - we are not yet done with analyzing what could cause it but we are on it. Thanks for good packing and description of the behavior.
Hi all,
I can confirm the problem described at the beginning by SteffenFuchs (Steffen, thank you for your in-depth analysis). We are using many STMP157-SOM-512-EXT on our mainboard and sometimes after a normal shutdown the modules don't boot anymore. On the debug UART the boot procedure hangs after these messages:
U-Boot SPL 2021.04+olimex-1-20240312.134658 (Mar 12 2024 - 13:48:01 +0000)
Model: STM32MP1XX OLinuXino-SOM
Init AXP209 PMIC
We initially thought we broke the modules in some way, but they started working again about 20-24 hours after we removed them from the motherboard and left them disconnected.
Now only 2 modules of the initial 11 refuse to boot; all the others boot again.
It's alarming that we have no guarantee that, inside our final product, they won't stop booting again after a system restart.
Please investigate the issue as soon as possible, as we need to stop production if a hardware failure is identified.
Thank you in advance
We are investigating thoroughly thanks to the four boards with this issue that we received back from customers. In such boards it seems that the AXP209 hangs sometimes and the u-boot can't find it or communicate with it, so the boot process stops. Our tests and hardware changes so far haven't provided the definitive answers we are seeking. Currently, we are prototyping a number of test boards around the AXP209 to exhaust all hardware scenarios. I will update this thread when I have some more news, but I welcome any feedback, experience, or description regarding the boot of STMP boards - any specific details are helpful.
One workaround is to exclude any AXP209 initialization from u-boot and kernel, the AXP209 still hangs but the board would boot all the time (you'd lose access to any features of the PMU tho, like power management, battery status, etc). It is not clear if it is safe to exclude the AXP209 entirely since it controls a lot of things - like voltage on chip, voltage on HDMI, etc. It might be unsafe to disable the AXP209. Disabling the AXP209 from u-boot or kernel might lead to bigger issues and problems in the long run. More empirical tests are needed to confirm if it is safe.
Edit and update: As of end of June 2024 we haven't found problems nor received any reports of problems when using the Linux image without AXP209 support. It should be considered safe to use the software workaround.
We will post here some more details on how to exclude the AXP209 if you wish to test that.
Quote from: Titomax on March 27, 2024, 06:17:44 PM
> We initially thought we broke the modules in some way but they started working again after about 20-24 hours after removing them from the motherboard and let them disconnected.
This is exactly the same behavior we've observed as well. After attempting many times to boot, sometimes leaving the device powered off overnight - I pull off the SOM and replace with a different one (same SD card) and it boots. Must be a bad SOM, right?
Then a few days later, I take another look at the "bad" SOM on the EVB or I throw it back in the product and now it boots!
Here, this image has AXP209 and PMU removed from u-boot and kernel, give it a try:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/STM32MP1-OLinuXino-SOM-bullseye-minimal-20240328-133932.img.7z
Again:
Quote:
> One workaround is to exclude any AXP209 initialization from u-boot and kernel, the AXP209 still hangs but the board would boot all the time (you'd lose access to any features of the PMU tho, like power management, battery status, etc). It is not clear if it is safe to exclude the AXP209 entirely since it controls a lot of things - like voltage on chip, voltage on HDMI, etc. It might be unsafe to disable the AXP209. Disabling the AXP209 from u-boot or kernel might lead to bigger issues and problems in the long run. More empirical tests are needed to confirm if it is safe.
Quote from: LubOlimex on March 29, 2024, 01:41:42 PM
> Here, this image has AXP209 and PMU removed from u-boot and kernel, give it a try:
Thank you Lub. Could you push your branches to https://github.com/OLIMEX/u-boot-olinuxino and https://github.com/OLIMEX/linux-olimex so we can see the changeset?
I can give you the diff files with the changes, find them here:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/
Not sure if we will make a fork or branch in the public repo without AXP209 since it is a workaround that might suit people with startup issues but would hinder people that need AXP209 tools.
Removing the AXP209 for the boards with HDMI connector is especially risky (like the STMP157-OLinuXino-LIME2 and STMP157-BASE-SOM-EVB).
Thank you. Is there a plan, or ongoing investigation to solve the root cause and get the AXP209 reliably working again?
Yes, we are working on it. As I wrote previously: "Currently, we are prototyping a number of test boards around the AXP209 to exhaust all hardware scenarios." These PCBs were manufactured and tests are currently being carried out. This can't be done very quickly since there is a lot of hardware tinkering, soldering, and empirical testing needed to get to the bottom of the issue.
An update on this issue so far.
Summary:
This appears to be a somewhat rare issue since we couldn't find any such boards in our warehouse. This means not all boards are affected; we guess that different AXP209 chips may have different tolerances, and those, combined with the tolerances of all other related components, lead to only a small percentage of boards being affected by start-up issues. Furthermore, these start-up issues differ in intensity - some boards might fail to start in 10% of power ups, others in 50% of power ups, and some can rarely start up at all.
We got 4 boards back from two customers posting in this thread (each sent us two boards). This helped a lot, thanks again. All four boards had more or less pronounced start-up issues upon power up. Two of the boards were incredibly bad cases, only very rarely starting up - we managed to get both these boards to boot via hardware changes: one board had the AXP209 changed and started working fine, the other started booting after changing R62 from 10k to 100k.
The nature of the issue is that the I2C communication between the AXP209 and the main chip dies. Since the official Olimage images have some AXP209 support, when the software can't find the AXP209 over the I2C, the Linux image won't start. It is not clear what causes that hang in the I2C communication and why only some boards are affected. We are still working on some hardware solutions for future hardware revisions.
The workarounds we found so far:
1. Remove all drivers for AXP209 from Linux, this will make the board boot but all AXP209 control over I2C would be gone. Some functions would be missing (like reading battery charge, setting the board in low power modes, etc). We already created such a Linux image without any initialization of the AXP209, it is available here:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/STM32MP1-OLinuXino-SOM-bullseye-minimal-20240328-133932.img.7z
Differences can be seen here:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/
2. Hardware workaround - adding some hardware delay by increasing the value of resistor R62 from 10k to 100k; with this change the boards don't hang on power up (as long as there was at least two seconds power down between power ups). Keep in mind R62 is a bit hard to replace (size 0402 and located under a plastic connector); we bend the plastic connector before replacing the resistor, then bend it back. So far this seems like an acceptable workaround and all boards manufactured since start of April will leave the factory with R62 = 100k.
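As a rough illustration of why a larger resistor adds delay: if R62 forms an RC stage with some capacitance (an assumption; the exact circuit and capacitor value are not stated in this thread), the time constant R*C grows tenfold when R goes from 10k to 100k. The sketch below uses a made-up capacitance purely to show the arithmetic; rc_tau_us is a hypothetical helper name.

```c
#include <assert.h>

/* Time constant tau = R * C, returned in microseconds.
 * r_ohm: resistance in ohms, c_nf: capacitance in nanofarads.
 * ohm * nF gives nanoseconds, so dividing by 1000 gives microseconds. */
static unsigned long long rc_tau_us(unsigned long long r_ohm,
                                    unsigned long long c_nf)
{
    assert(r_ohm > 0 && c_nf > 0);
    return r_ohm * c_nf / 1000ULL;
}
```

With a hypothetical 1 uF (1000 nF), rc_tau_us(10000, 1000) gives 10000 us (10 ms) and rc_tau_us(100000, 1000) gives 100000 us (100 ms). The factor of ten between the two holds for any capacitance, which is the only part of this sketch that does not depend on the assumed value.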
If the software workaround doesn't work for your goals, or if you are unable to perform the hardware workaround, please contact us over support@olimex.com - we will arrange return and replacement of R62 with 100k.
We are still running hardware tests and trying different solutions, and hope to fix this issue completely in the next hardware revision. I will update you when that happens.
Thanks very much for the update Lub. Couple questions:
- Did you rev the hardware with the resistor change? We will not make hardware mods to a third-party component including the Olimex. It must come with the fix. Can you please confirm the YYWW date code and lot code when this change was made? We will likely inform our CMs to stop using older SOMs and replace them ASAP.
- Can you please provide a full list of lost capability due to disabling the AXP209 in software? You said battery charging, low power states, and I think you said elsewhere that some video output capability would be lost. Is that everything? I just want to ensure I fully understand the possible effects of disabling the PMIC.
That aside, I am still struggling to understand the root cause of the issue. Virtually every STMP1 SOM we have on hand in engineering, we can cause this in about 10 minutes or less of power cycling (1 minute off between each power up.) After it gets to this state, the unit may be left off for hours (overnight) and still will not boot. Swapping the SOM always resolves it. It seems like infrequently a "bad" SOM will boot again after it has been pulled off the mainboard for a while. But usually it will hang again after the next power cycle or two.
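For automated power-cycle testing like this, a small helper can flag a captured UART log that stops at the PMIC init. Below is a minimal sketch in C, assuming the log has already been captured into a string; log_ends_at_pmic_init is a hypothetical name, and the two marker strings are simply the last lines reported in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the last non-empty line of `log` starts with one of the
 * SPL messages after which the hang was reported in this thread. */
static bool log_ends_at_pmic_init(const char *log)
{
    static const char *const markers[] = {
        "Init AXP209 PMIC",
        "VDD Core set to",
    };
    size_t end, start;

    assert(log != NULL);
    end = strlen(log);

    /* Skip trailing newlines, carriage returns and spaces. */
    while (end > 0 && (log[end - 1] == '\n' || log[end - 1] == '\r' ||
                       log[end - 1] == ' '))
        end--;

    /* Walk back to the start of the last line. */
    start = end;
    while (start > 0 && log[start - 1] != '\n')
        start--;

    for (size_t i = 0; i < sizeof(markers) / sizeof(markers[0]); i++) {
        size_t len = strlen(markers[i]);
        if (end - start >= len &&
            strncmp(log + start, markers[i], len) == 0)
            return true;
    }
    return false;
}
```

A log captured too early during a good boot would match as well, so a test rig would have to wait for a timeout after power-up before classifying the capture.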
Have you already or do you plan to also add this improvement (https://www.olimex.com/forum/index.php?msg=37863) to the next board revision?
Lub I'm looking at the diff files you posted above (https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/). Both diffs appear to be from the same file in the linux kernel source. Should one of them be for the u-boot source? I am guessing modules/stmp1-boot/u-boot-olimex/board/olimex/stm32mp1_olinuxino/spl.c
Good catch, my bad! Proper diff files are now uploaded at the same location:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/DIFF-files/
> - Can you please provide a full list of lost capability due to disabling the AXP209 in software? You said battery charging, low power states, and I think you said elsewhere that some video output capability would be lost. Is that everything? I just want to ensure I fully understand the possible effects of disabling the PMIC.
It is mainly the battery-related software functions. The HDMI should be fine for the SOM board (I mainly mentioned it if somebody performs similar removal of AXP support for the LIME or BASE SOM images). Not much should be lost from lack of it. We are not fully aware ourselves of all implications of removing AXP209 driver from the Linux, but on first and second look and tests so far it seems fine.
Drop me an e-mail at support@olimex.com to discuss the details about the affected boards and how to ensure future purchases have the hardware fix applied (and also to discuss returning the two boards that you sent back).
The boards with the 100k resistor would be revision D1, indicated only on the box (the PCB marking would still say D). All boards purchased directly from our web-shop will be revision D1, i.e. they will have the 100k resistor.
I will update the schematic later today.
Edit: disregard, I was looking at the wrong SOM product page.
Hi Lub - I do not see the updated PDF schematic on the product page (https://www.olimex.com/Products/SOM/STMP1/STMP157-BASE-SOM-EXT/) and I also cannot find any repo on https://github.com/OLIMEX/ for the STMP1 SOM hardware. In addition to just schematic it would be helpful to know where to find the hardware changelog for this SOM. For example many of the SOMs are here (https://github.com/OLIMEX/SOM/tree/master) but not the STMP1. Thanks.
Lub at the bottom of your above message you wrote:
Quote:
> all boards manufactured since start of April will leave the factory with R64 = 100k.
Did you mean R62?
Yes, R62. I have edited previous post.
Hi!
I don't know if it's the same problem or if I have to create a new topic.
I have random boot errors at start-up with the STMP1 LIME.
On thunderstorm power cuts, some boards don't restart.
Of 20 STMP157-OLinuXino-LIME2H-EXT (Rev_B) boards in production, approximately 5 remain frozen (they are distributed across several places, so they do not suffer exactly the same power cuts).
I never have this problem when I reboot a board with a "shutdown -r now" (or "shutdown -r 00:00").
I did some tests and after some power cuts, the board stays frozen at:
U-Boot SPL 2021.04+olimex-1-20240312.134658 (Mar 12 2024 - 13:48:01 +0000)
Model: STM32MP1 OLinuXino-LIME
Init AXP209 PMIC
VDD Core set to: 1350 mv
When the board is frozen, the reset button doesn't help; the error returns.
A quick power cut doesn't help either.
For the error to disappear, the board needs to be unplugged for several seconds.
Quick power cuts seem to cause the problem every time.
NB : I tried with 1 A and 2 A power supply.
It sounds like the same issue. You can easily test if it is - try the image without AXP initialization just to confirm. If these five boards that never boot, boot with it, then it is the same issue. Image is here:
https://ftp.olimex.com/TEMP/SOM-NO-AXP209/STM32MP1-OLinuXino-SOM-bullseye-minimal-20240328-133932.img.7z
Contact us at support@olimex.com about the issue with the boards that won't start when power supply is applied.
> Quick power cuts seem to cause the problem every time.
This is expected by design. All boards should fail to start if there was not enough time for the capacitors to discharge. There should be at least 2 seconds between power ups.
Hi,
> This is expected by design. All boards should fail to start if there was not enough time for the capacitors to discharge. There should be at least 2 seconds between power ups.
I didn't know about this limitation. I can't use a LiPo battery to protect the boards against micro power cuts because LiPo batteries are not safe enough in the hot environments where the boards are used (wood boiler).
After some tests, I can't reproduce the problem with this image, even with quick power cuts: good news!
But I have a lot of errors during the first 100 seconds after boot:
[ 83.123763] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 83.846734] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 84.570037] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 85.293739] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 86.017742] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 86.741530] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 87.465436] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 88.188899] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 88.911906] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 89.700894] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 90.422235] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 91.143840] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 91.864692] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 92.585549] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 93.307060] es8328 0-0010: ASoC: error at snd_soc_component_update_bits on es8328.0-0010: -11
[ 94.030435] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 94.752109] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 95.474055] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 96.196230] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 96.918255] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 97.640112] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 98.361737] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 99.082735] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 99.804734] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 100.526133] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 101.247231] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 101.968004] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 102.688578] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
[ 103.409249] es8328 0-0010: ASoC: error at soc_component_read_no_lock on es8328.0-0010: -11
And I have no network :
olimex@stm32mp1-olinuxino-som:~$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
olimex@stm32mp1-olinuxino-som:~$
Edit : I understand the error messages, the kernel has been made for SOM not for LIME.
Wait, did you write that you have STMP LIME boards? The image and this forum thread are for STMP SOM boards, sorry, my bad. It is normal to have errors since the image is suitable only for SOM boards.
We haven't noticed similar problems with LIME boards. Yours may be a different problem.
So, should I create a new thread ?
Yes, better create a new thread.