ESP32-POE/ISO unstable after 10 days of power up

Started by Symonn, August 06, 2020, 03:29:52 PM

Symonn

Hi, I have one ESP32-POE and one ESP32-POE-ISO, and I'm using the Olimex RS485 module to read a power meter that is placed near the ESP32 (5 meters of CAT5 cable, 9600 8N1).

My device has this workflow (it's powered via PoE using a Cisco 2960x switch):

  • Boot and acquire a DHCP address
  • Connect to the server and check whether there is a new OTA package to download
  • Read the power meter via Modbus (using the ModbusMaster library)
  • Prepare the data in JSON format using the ArduinoJson library and send it to the broker using PubSubClient
  • Wait five minutes using millis() and start the loop over

This is the main logic that "blocks" the loop and waits for a few minutes:


void loop() {
  wdt_reset();
  MQTT_connect();

  client_subscriber_loop(); // loop subscriber mqtt

  unsigned long currentMillis = millis();
  if (currentMillis - previousMillisTaskDataPublisher > MQTT_PUBLISH_EVERY) {
    if (DEBUG == 1)
      Serial.printf("Free internal heap on Loop Enter %u\n", ESP.getFreeHeap());
    //  OTA Check
    trueverit_ota_check_update();
    wdt_reset();

    publish_data();
    if (DEBUG == 1)
      Serial.printf("Free internal heap on Loop Exit %u\n", ESP.getFreeHeap());

    previousMillisTaskDataPublisher = currentMillis;
  }
}
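
A side note on the timing check above: on the ESP32 Arduino core `unsigned long` is 32 bits wide, so `millis()` wraps after roughly 49.7 days, and the subtraction form used in `loop()` should stay correct across that wrap. The following is only a minimal host-side sketch of why; the function names `interval_elapsed` and `interval_elapsed_unsafe` are hypothetical and not part of the original sketch.

```cpp
#include <cstdint>

// Wrap-safe check: unsigned subtraction is performed modulo 2^32, so
// (now - previous) is the true elapsed time even if the counter has
// wrapped once between the two readings.
bool interval_elapsed(uint32_t now_ms, uint32_t previous_ms, uint32_t interval_ms) {
    return static_cast<uint32_t>(now_ms - previous_ms) > interval_ms;
}

// Fragile variant, shown for contrast only: previous + interval can
// overflow, which makes the comparison fire too early near the wrap.
bool interval_elapsed_unsafe(uint32_t now_ms, uint32_t previous_ms, uint32_t interval_ms) {
    return now_ms > static_cast<uint32_t>(previous_ms + interval_ms);
}
```

If this holds on the target, the five-minute timer itself is an unlikely cause of a failure at 10 days, and the search can move on to other counters.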



As you can see, I also have the watchdog timer enabled, and it is fed in the loop and in every function that has to feed it (Modbus read, etc.).

Now here comes the issue: this works for exactly 10 days, and after that the MCU goes into a loop, the green LAN LED flashes indefinitely and nothing works, leaving the ESP32 in an undefined state.

This is the debug message that I see on the Arduino console:



There is no way to make it work again apart from powering the device off and on.

This is extremely inconvenient... In my case I have 25 of these devices placed far away from my office (600 km), and I can't travel there to reboot them every ten days.

Any thoughts?

JohnS

If it is repeatable and consistent as you say, it's 99.99% likely to be software so go hunting around your code & the Arduino code.

John

kyrk.5

First of all I am not an ESP expert :) However I usually get called when something does not work with an embedded device :)

To me it looks from the screenshot as if the ESP is trying to update itself. Since that ends with an error, it resets and tries again, endlessly. If you disabled this update feature, the endless loop would be broken. Otherwise you have to make sure that the update always succeeds.

The second question is why the ESP enters this loop at all. I guess the watchdog is not being fed. It could also be another kind of reset cause, like a short power failure. To check whether this is a watchdog reset, deactivate the watchdog, wait 10 days and check. If there is no reset, then yes, it is a watchdog reset.

Let us assume this was a watchdog reset. The third question is why the watchdog is not being fed. Here we have a problem. Either you need Chuck Norris to stare down the code until it confesses every bug, or you have to find the bugs yourself. Since the ESP is flashed via a bootloader and people seem to forget what a debugger is, this is not so easy. If you had a debugger, I would suggest deactivating the watchdog, waiting 10 days, then pressing stop in the debugger and checking the state of your software and registers. But I guess you do not have a debugger. I think flashing is not possible over JTAG, but debugging is. I am not familiar with Arduino, so I guess debugging it is not so easy.

So now the question is how to find the root cause of the watchdog not being fed. Try searching the forums. Maybe this is a known limitation of the ESP firmware, so that it hangs every 10 days and people just accept it and live with it. Check whether your software has a timer or counter somewhere that can count up to at most 10 days and then overflows. Is it 10 days exactly? Maybe 12.8 days? Or some power of 2? That would also give a hint about where to look.

My opinion is: I think the ESP is good for playing, learning and experimenting. But building a product on it is quite risky. Since the debug options are limited, it can become a disaster when you already have units in the field and a bug appears. Then you have nothing in your hands to find the problem and fix it. The firmware might also become a nightmare. I guess it is only a binary blob, so if there is a problem, there is no way to look inside and analyse it. You can maybe fix it at your own risk, or ask the vendor to fix the problem.
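
To make the overflow hunt concrete, the wrap period of a counter can be computed from its width and tick rate and compared with the observed uptime. This is only a rough sketch; `days_until_overflow` is a hypothetical helper, not something from the original code.

```cpp
#include <cmath>

// Days until an unsigned counter with `bits` bits wraps around when it
// is incremented `ticks_per_second` times per second.
double days_until_overflow(unsigned bits, double ticks_per_second) {
    const double counts = std::ldexp(1.0, static_cast<int>(bits));  // 2^bits
    const double seconds = counts / ticks_per_second;
    return seconds / 86400.0;  // 86400 seconds per day
}
```

For example, a 32-bit millisecond counter wraps after about 49.7 days, a value limited to 31 bits of milliseconds (such as a signed 32-bit integer) after about 24.9 days, and 30 bits of milliseconds after about 12.4 days. Measuring the real time to failure precisely is what makes such a comparison useful.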



Symonn

I agree with kyrk.5, the ESP32-POE/POE-ISO definitely does not fit very well with production projects.

I'm investigating this. Anyway, I think I'll switch to another board (wESP32) that costs twice as much but supports a JTAG debug interface and fully compliant IEEE 802.3at Type 1 Class 0 PoE, with 12 W of available power at 12 V.


olimex

Looks like a software memory leak issue.
I would not trust an Arduino IDE project for a reliable project, and the Espressif SDK is also full of hidden mines. You have to go through your code very carefully.
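
One way to test the memory-leak theory is to track the free-heap readings that the sketch already prints with `ESP.getFreeHeap()` and watch whether they keep falling over days. The following is a minimal sketch under that assumption; the `HeapTrend` struct and its member names are hypothetical.

```cpp
#include <cstdint>

// Tracks free-heap samples taken once per publish cycle.
struct HeapTrend {
    uint32_t first = 0;    // first sample after boot
    uint32_t lowest = 0;   // smallest sample seen so far
    uint32_t latest = 0;   // most recent sample
    uint32_t samples = 0;  // number of samples recorded

    // Record one free-heap reading in bytes.
    void add(uint32_t free_bytes) {
        if (samples == 0) {
            first = free_bytes;
            lowest = free_bytes;
        } else if (free_bytes < lowest) {
            lowest = free_bytes;
        }
        latest = free_bytes;
        ++samples;
    }

    // Bytes lost between the first and the latest sample (0 if none).
    uint32_t lost_since_first() const {
        return latest < first ? first - latest : 0;
    }

    // True when the latest reading is under the given safety floor.
    bool below(uint32_t floor_bytes) const {
        return samples > 0 && latest < floor_bytes;
    }
};
```

On the device, `add(ESP.getFreeHeap())` could be called in each publish cycle and `lost_since_first()` included in the JSON payload, so a steady decline shows up on the broker long before day 10.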

I know software guys always blame the hardware first, and I'm interested to see what the result with the other board will be.

kyrk.5

Maybe we should take a look at the complete source code to find the problem.