Watchdog Triggered and Project Lost
Hi Support,
We have a program running on a AXCF3152, running 2022.0.8. We are writing C++ libraries for the project, which could (obviously) have program bugs. The issue we are facing is that operation seems to be normal, but on several occasions, the PLC has spontaneously stopped working. When this occurs, all the lights on the PLC turn off, and it is unresponsive (i.e. SSH connection drops, not pingable). After approx. 4 minutes, the PLC came back online. There was no error message written to Output.log when the PLC died. Viewing the logs, the only noteworthy information was:
26.09.23 16:30:49.214 Arp.Services.Spm.SecurityProfileManager INFO - Security Profile Manager
26.09.23 16:30:49.274 Arp.System.Watchdog.Internal.SystemWatchdogRebootCounter INFO - last reboot caused by hardware watchdog
26.09.23 16:30:50.783 Arp.System.Lm.Internal.LicenseManager INFO - License runtime version: 2.50.1806
26.09.23 16:30:50.793 Arp.Io.ProfinetStack.Controller.Internal.Controller INFO - Controller instance added successfully
26.09.23 16:30:50.797 Arp.Plc.Domain.Internal.SystemWatchdogConditionProvider INFO - Watchdog occurred
26.09.23 16:30:50.798 Arp.Plc.Domain.Internal.PlcConditionManager INFO - Plc is not loadable because it is prevented by component 'Arp.System.Watchdog'
This points to something tripping the watchdog, but it appears to cause a full system reboot when this occurs. When the PLC starts back up again, FAIL and RUN flash at 2Hz (watchdog triggered), and the project does not load/appears to have been lost (we need to write the project from PLCnext Engineer again, and our configuration retain data is lost).
A few questions:
- Is there another log file/object which might have captured more information on why this happened?
- Is there something we can watch/look at if this happens again?
- What typically causes this behaviour on the PLC? As mentioned we can have bugs in our C++ programs (e.g. SEGFAULT), but the kernel is normally ok when this occurs (i.e. SSH does not drop like in this case).
- Is there any extra logging that we could add to our program (C++) to catch this?
Thanks for your support.
Kind Regards,
Joshua
Comments
Hi Joshua.
Those are all very good questions.
Unfortunately issues like this are notoriously difficult to diagnose.
There was a discussion last year that might be relevant:
On the questions about how to diagnose this problem - if the problem can be reproduced reliably, then that would be helpful, but in cases like this the problem usually happens seemingly at random. I guess that's also the case here.
The only diagnostics tool that we currently have is LTTng, which can be set up to generate a trace whenever a System Watchdog occurs:
Knowledge Base Link: How to activate the PLCnext Trace Controller (LTTtng)
(this should not be enabled permanently in a production application).
If a system watchdog occurs when LTTng is enabled, and if a trace is generated, then you can send the trace (and the corresponding Output.log file) to Declan in your local Phoenix Contact office and someone here will see if there are any clues to explain the behaviour.
the project does not load/appears to have been lost (we need to write the project from PLCnext Engineer again ...).
The project should not be lost - it should still be on the device. As the log message suggests, by default the PLCnext Engineer project is not loaded after the device is rebooted by the System Watchdog. If the device is rebooted for any other reason - e.g. a power cycle - then the PLCnext Engineer project should run. If that does not happen, then please let us know.
Hi Martin,
Thank you for such a prompt and helpful response. I will install LTTng and hopefully this will catch the problem next time so we have more information to work with.
Also based on your advice in the linked post, I will investigate whether we have any of the cases which violate the Execute method (i.e. dynamically allocating memory/calling methods from libraries).
I will reach out again when I know more.
Thanks again,
Joshua
Hi Martin,
We've had the problem occur about 3 times over the last day or so. Unfortunately, LTTng did not appear to capture any data when the watchdog was triggered (there is nothing in the trace folder). I have followed the instructions in the link provided and I believe it is running; do you have any ideas why nothing was captured?
Is it at all possible to get a log of what realtime program was running in the ESM over time? If so, this would allow us to dig deeper into a specific program which might be the culprit.
Also, do you believe it would be helpful to increase the debug level of the PLC?
Kind Regards,
Joshua
Hi Martin,
With LTTng now running successfully with your support, we've managed to capture multiple traces of the PLC failing. I've used Trace Compass to view the log files and have found that one of the PLC's task watchdog is triggering. This task is running at 100Hz (10ms), and has approx. 6 realtime programs, some of which spawn background threads. This is using the Arp::Thread class and Arp::Mutex.
Do you have any thoughts on why the PLC is not failing gracefully in this situation? Normally when a task exceeds its defined watchdog timeframe, we get an error which can be seen in the logs, it enters an obvious error state and it doesn't crash the PLC.
Kind Regards,
Joshua
I've used Trace Compass to view the log files and have found that one of the PLC's task watchdog is triggering.
How did you find that a PLC's task watchdog is triggering, just from the LTTng trace? Is there any other indication that this has happened, e.g. an entry in the Output.log file?
Normally when a task exceeds its defined watchdog timeframe, we get an error which can be seen in the logs, it enters an obvious error state and it doesn't crash the PLC.
Yes, if a task thread executes for longer than the Task Watchdog time, this would normally lead to a controlled stop of the PLC component and clear error messages. But if a task thread is blocked completely, then that's a different case.
How are the background threads created from the real-time C++ programs? Are those threads created with the same real-time priority as the Task thread, or are they executed with lower priority? Do real time threads always use the non-blocking mutex method "TryLock", instead of the blocking "Lock"? Is there anything else that could block a real-time thread, like a Task thread, like a call to a third-party library or dynamic memory allocation?