ESP32 PoE ISO - Does not receive IP address via DHCP over ETH

Started by recursinging, July 15, 2022, 11:01:06 AM

recursinging

I've been evaluating the ESP32-PoE-ISO for some remote sensing work for about 6 months now. I have 4 units that are working as expected, but the 5th hangs while waiting for an IP address from the DHCP server. I see this is not the first mention of the problem on this forum.

I can say with certainty that it is hardware related, even though it doesn't make much sense considering the nature of the problem.  I have two units on my desk in front of me, both wired identically, running identical firmware, connected to the same switch. Only one of the two is hanging (SYSTEM_EVENT_ETH_GOT_IP never gets fired).

As described in this thread: https://www.olimex.com/forum/index.php?topic=7099.15 I'm also seeing the DHCP discover packets, but the returned offer does not seem to be handled by the ESP32-PoE-ISO.

What physical aspects on the ETH RX side in the hardware design might be able to influence the LAN8710A in such a way that it fails at such a high level? 

I'm not familiar with the initialization routine, nor with what actually must occur to fire the SYSTEM_EVENT_ETH_CONNECTED event.  Intuitively I'd expect RX to be verifiably working by that point. I also expect the 10/100 auto-negotiation to have occurred by then as well. 

Perhaps someone here with a little more insight has an idea for potential mitigations/workarounds?

recursinging

So after ordering another 10 units, I've determined 4 have this defect, making them useless for my needs. Incredibly frustrating. Does anyone else here have such a high failure rate for these things?

LubOlimex

What was the hardware revision of the boards, as printed on them?
Technical support and documentation manager at Olimex

recursinging

All 10 are "Rev.I" (or "Rev.1"?).

Is there a newer hardware revision out? I need another 10 working boards right now.

I've been sorting the bad ones out by flashing the "ETH_LAN8720" example.


LubOlimex

Revision I is good enough regarding the Ethernet; anything after revision E has the TVS1 protection on the Ethernet. The TVS1 was probably damaged by your setup somehow. You can probably fix the boards by replacing TVS1, but first measure whether it is damaged.

Taken from the revision changes:

https://github.com/OLIMEX/ESP32-POE-ISO/blob/master/HARDWARE/Hardware-changes-log.txt

"Hardware revision E (internal, unreleased):

1. Added TVS1, ESDS314DBVR(SOT-23-5) to protect the Ethernet's PHY from ESD and other transient voltages;
...
"

recursinging

So I received the next batch of 10 boards from Mouser.  This time all Revision "K".  Here's my test procedure.


I repeated this procedure for each of the ten boards I just received. Five of them work, and five do not.  Just to be sure I repeated the procedure a second time for all ten boards and the result was the same -  50% failure, 100% repeatability.

No, I didn't check the TVS diodes. I did this in our electronics lab on an ESD mat with a ground strap.  These units went directly from their box into this test procedure. The probability that five of these ten units experienced a critical ESD event is incredibly small.

I find it unlikely that this problem is systematic to this board per se - I'd imagine there would be many more complaints if 50% of all units shipped with defective Ethernet.

I feel there may be a problematic component in the design where the tolerances of the manufacturer don't match up with the requirements, and it seems that our Ethernet infrastructure creates ideal conditions for this mismatch to occur.  This is only a theory though.

I don't have the time or resources to find the root cause here. It's a pity because I've invested a lot in a remote sensing plan with hundreds of nodes that centers squarely around this device.  I'm afraid I'll have to go back to the drawing board.



LubOlimex

We test each board individually after manufacturing. We also perform an empirical Ethernet test. 5 faulty boards out of 10 (50% fault rate) reaching a customer is unheard of. We manufacture quite a lot of these boards and we don't have similar feedback from any other customer or from Mouser. There is something generally wrong in your setup or usage of the board. I believe that the issue lies somewhere in your network equipment or its settings (like firewall settings, QoS setup, any other filtering, or a maximum number of clients). However, it could also be an issue in the Arduino demo that you used, in combination with something left over or newer in the libraries. My advice is:

1) Get one or more of the boards that don't work and test at another network or location (at home or somewhere else).

2) Try installing a fresh Arduino IDE and ESP32 for Arduino package (or wipe them completely and start afresh). I have had problems when some packages didn't update properly and leftover data corrupted the whole Arduino IDE + ESP32 package installation. When done, test with the default Ethernet example from the ESP32 package here:

https://github.com/espressif/arduino-esp32/blob/master/libraries/Ethernet/examples/ETH_LAN8720/ETH_LAN8720.ino

Make sure to select Olimex ESP32-PoE-ISO from the board selector.

Meanwhile I will try to perform the same procedure as you here and see whether the problem might be in the Arduino code.

recursinging

In my opinion, the nature of my tests and the results (especially the repeatability) exclude software as a root cause.

As I mentioned above, I don't believe the boards are faulty - the failure rate is too high for that - but I do believe that the boards contain some hardware variation that renders them faulty in my company network.

This could be as simple as a MAC address filter that I'm unaware of, or as complicated as a PHY level incompatibility with our HP switches.   

In any case, it's a tricky problem.  I will test in another network when I get a chance and report back.

recursinging

I finally got around to looking deeper into this with our IT department.  In the end it was due to some strange DHCP server MAC filtering rules they had enabled.  The MAC address of the device was the determining factor in whether or not it would receive an IP address. I didn't dig into what exactly the problematic rule was.
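
For anyone chasing the same symptom, here is a minimal sketch of how the MAC addresses of working and failing boards could be compared before involving IT. It is plain C++ that runs on a PC, not firmware. The helper names (`parseMac`, `ouiOf`, `isLocallyAdministered`) are made up for illustration, and the actual filtering rule in this thread was never identified, so the two properties checked here (the vendor prefix and the locally-administered bit) are only common candidates for such a rule, not the confirmed cause.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

using Mac = std::array<std::uint8_t, 6>;

// Convert one hex digit to its value, or -1 if it is not a hex digit.
int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Parse "AA:BB:CC:DD:EE:FF" or "AA-BB-CC-DD-EE-FF" into six bytes.
// Returns false (and leaves 'out' unspecified) on malformed input.
bool parseMac(const std::string& text, Mac& out) {
    if (text.size() != 17) return false;
    for (int i = 0; i < 6; ++i) {
        const int hi = hexValue(text[3 * i]);
        const int lo = hexValue(text[3 * i + 1]);
        if (hi < 0 || lo < 0) return false;
        if (i < 5) {
            const char sep = text[3 * i + 2];
            if (sep != ':' && sep != '-') return false;
        }
        out[i] = static_cast<std::uint8_t>(hi * 16 + lo);
    }
    return true;
}

// The first three octets form the vendor prefix (OUI), formatted "AA:BB:CC".
std::string ouiOf(const Mac& mac) {
    char buf[9];
    std::snprintf(buf, sizeof buf, "%02X:%02X:%02X",
                  static_cast<unsigned>(mac[0]),
                  static_cast<unsigned>(mac[1]),
                  static_cast<unsigned>(mac[2]));
    return std::string(buf);
}

// Bit 1 of the first octet marks a locally administered address.
bool isLocallyAdministered(const Mac& mac) {
    return (mac[0] & 0x02) != 0;
}
```

Feeding the MAC of each board through `ouiOf` and `isLocallyAdministered` and listing the results next to "works" / "hangs" shows quickly whether the two groups split along either property. If they do, that is a concrete detail to hand to whoever maintains the DHCP server.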

All units now function as expected.  Thanks to LubOlimex for taking the time to help find the root cause here.

LubOlimex

Thanks for the update. I didn't reply for a while since I was testing dozens of ESP32-PoE-ISO units in depth with different PoE equipment, under the same conditions as yours, trying to reproduce the error. So far all my tests have shown no problem, and I was about to wrap it up this week and report here, but your update makes that redundant. Glad you found the root of the issue.