OPC UA Server - cannot recompute OPC UA authentication flags

Hi everyone,
We are periodically getting the following error on an AXCF3152 (Fw ver 2022.0.8) :
27.09.23 18:59:18.030 Arp.Services.OpcUAServer.Internal.Security.ClientAuthenticationCriteria CRITICAL - cannot recompute OPC UA authentication flags - unexpected Arp::Exception:
Exception of type ‘Arp::System::Commons::InvalidOperationException’ was thrown
Creation of the directory enumerator failed.
at Arp::System::Commons::Io::Directory::GetEnumerator(Arp::BasicString<char, std::allocator > const&, bool, bool, bool)
at Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationCriteria::IsEmpty(Arp::BasicString<char, std::allocator > const&)
at Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationCriteria::RecomputeFlags()
at Arp::Services::OpcUAServer::Internal::Security::SessionManager::IsClientAuthenticationChanged(Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationFlags&)
at Arp::Services::OpcUAServer::Internal::Security::SessionManager::Run(void*)
at Arp::System::Commons::Threading::thread::RunThread(Arp::System::Commons::Threading::ThreadBinaryCompatibilityExtensions*)
at Arp::System::Commons::Threading::thread::RunInternal(void*)
at Arp::System::Ve::Internal::Linux::ThreadService::RunInternal(void*)
at /lib/libpthread.so.0(+0x3000c08ea4) [0x7f6f4a7feea4]
at /lib/libc.so.6(clone+0x3f) [0x7f6f4a4cadcf]

Any ideas on what might be causing this problem ?
Any suggestions on how we can resolve or debug further ?

Cheers,

Lindsay

“Any ideas on what might be causing this problem?”
The developers are aware of this issue and have given the following explanation:
Part of the OPC UA Server component continuously checks if there are new certificates available in the trust store. This check is important because when the contents of the trust store change at runtime, the changes should be applied immediately (or as soon as possible).
The error message indicates that something is blocking access to the trust store, e.g. a parallel process that is reading certificates in the trust store or writing to the directory. This could be the security page on the Web-Based Management site, or a third-party component or process.
“Any suggestions on how we can resolve or debug further?”
Is there any user action that corresponds to this message appearing, e.g. activity on the Web-Based Management pages?
Are there any third-party or custom-written components or processes running on the device that access the Trust Store and might block the OPC UA component from accessing it?

Hi Martin,
Thanks again for the prompt and detailed response. We have not noticed any clear relationship between user actions and the issue. That being said, we have multiple things talking over OPC UA to the PLC, namely:

* Multiple users with UAExpert connected
* An external Python script running on another host, using asyncua to connect periodically and read the OPC UA node tree

Martin,
A few other things that we are trying.

* We have tried temporarily disabling the use trust store settings on the controller in the OPC UA section. This seemed to give a different error however, so we have re-enabled them.
* We will try migrating to 2023. Would you expect this to change anything? (i.e. have there been any OPC UA fixes related to this in the 2023 LTS release)
* Looking at the WBM for the PLC, we notice that currently our trust stores are empty. Is this expected?

Cheers,
Lindsay

One other difference we notice: on the PLC that has issues, no trust stores are visible via WBM (see opc_image.png), but on another PLC without issues the default “OPC UA-configurable” and “Empty” trust stores are present. Is this likely where we might be having problems then?
Apologies for the spam.
Cheers,
Lindsay

I have discussed this with the developers and here are some answers:
“We will try migrating to 2023. Would you expect this to change anything? (i.e. have there been any OPC UA fixes related to this in the 2023 LTS release)”
No, on its own this would not be expected to fix this problem.
“Looking at the WBM for the PLC, we notice that currently our trust stores are empty. Is this expected?”
No. The two default Trust Stores that you see in the other device should also be present on the device with the problem.
“Is this likely where we might be having problems then?”
Yes, this is likely to be causing the problem. According to the developers, the OPC UA Component expects the default trust store to be present on the device. I have put in a feature request for a future version of the OPC UA Component, asking that it gives a more user-friendly log message when an expected file or folder is missing.
The quickest way to restore the default firmware files is to do a Type 1 reset on the device. That will delete the overlay file system, including all project files and any changes to system files. That should restore the default Trust Store configuration. After the reset, you will need to reload your application files to the device.

Martin,
Thanks for the clear responses. We will try the type 1 reset as mentioned.
Cheers,
Lindsay

Ok, so a few more updates here.

* We have tried a Type 1 reset on the device, keeping firmware at 2022. Still the same issue. We notice that initially, before the issue arises, the trust stores in WBM are as expected, but following the error they are empty as described above.
* Also upgraded to 2023. As expected, no improvement.
* Have reduced the number of sources connected over OPC UA to the PLC to a single instance of UAExpert.
* May try swapping out the PLC tomorrow in case there is something there.

Re-attaching the logs :
05.10.23 15:03:43.596 Arp.Hardware.Modules.ResourceMonitor.Linux.Internal.CpuLoad ERROR - CalculateCpuLoadsPercent: Reading from file failed.
05.10.23 15:03:44.347 root WARN - Enumerator: Too many open files
05.10.23 15:03:44.352 Arp.Services.OpcUAServer.Internal.Security.ClientAuthenticationCriteria CRITICAL - cannot recompute OPC UA authentication flags - unexpected Arp::Exception :
Exception of type ‘Arp::System::Commons::InvalidOperationException’ was thrown
Creation of the directory enumerator failed.
at Arp::System::Commons::Io::Directory::GetEnumerator(Arp::BasicString <char, std::allocator > const&, bool, bool, bool)
at Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationCriteria::IsEmpty(Arp::BasicString <char, std::allocator > const&)
at Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationCriteria::RecomputeFlags()
at Arp::Services::OpcUAServer::Internal::Security::SessionManager::IsClientAuthenticationChanged(Arp::Services::OpcUAServer::Internal::Security::ClientAuthenticationFlags &)
at Arp::Services::OpcUAServer::Internal::Security::SessionManager::Run(void*)
at Arp::System::Commons::Threading::thread::RunThread(Arp::System::Commons::Threading::ThreadBinaryCompatibilityExtensions*)
at Arp::System::Commons::Threading::thread::RunInternal(void*)
at Arp::System::Ve::Internal::Linux::ThreadService::RunInternal(void*)
at /lib/libc.so.6(+0x8aa42) [0x7f5eda660a42]
at /lib/libc.so.6(+0x10c6f0) [0x7f5eda6e26f0]
05.10.23 15:03:44.352 Arp.Services.OpcUAServer.Internal.Security.SessionManager INFO - Applying configuration for client authentication to 1 endpoints: trustAll=false, checkCRLs=true, checkIssuerCRLs=true, ignoreValidity=false
05.10.23 15:03:45.130 root WARN - Enumerator: Too many open files

Cheers,

Lindsay

Hmm, it’s interesting and unexpected (for me) that the default Trust Store would disappear after this error occurs. I will give that information to the developers and see what they say.
One thing that is not clear:
It sounds like this behaviour only appears on one device. When exactly the same application is run on a different device, the problem does not appear. Is that correct?
For the device where the problem does occur - does the problem appear with the simplest possible application, i.e. a PLCnext Engineer project that is just generated from the template, sent straight to the PLC without any changes, and then connected to from UaExpert? If the problem does not appear in that case, then that suggests it’s something in the application that is causing the issue, and it would be interesting to identify that part of the application. If the problem does appear, then we would need to come up with another theory.

It has been pointed out to me that the WBM is probably having the same problem as the OPC UA server: it cannot access the Trust Store, and so cannot display it on the WBM page. That makes sense. Also, the error messages that appear around the “Critical” OPC UA Server message in the log file indicate a deeper problem with the file system, not only with the OPC UA Server component.
If you can try to reproduce the problem on the same device, but with a minimum project, the result would be interesting.

Ok, so we are reasonably confident that we have narrowed the problem down to our application (mentioned above as a likely cause). Specifically, a real-time program written in C++ that uses an external library to control a manipulator over a socket. We still need to debug exactly where the problem occurs, but the suspects are the library itself (most likely, and it has an OpenMP dependency) or potentially the retain data handling (less likely, but not ruled out yet). Here is what we tried and the results that led to this conclusion:

* Can confirm the comment above about this likely being a deeper file-system problem. Namely, when the issue occurs, OPC UA is not the only thing that fails (WBM is a bit funky, and I believe we also lose comms to PLCnext Engineer).
* Swapped the PLC (in case the fault was in the device itself), same project and setup as before (i.e. custom configuration written once over OPC UA to Input|Retain variables) --> Error still occurs reliably.
* PLC isolated from machine, no OPC UA client connections, same project, custom configuration written once over OPC UA to Input|Retain variables --> Error still occurs reliably.
* PLC isolated from machine, no OPC UA client connections, same project, custom configuration not written --> No Error
* PLC isolated from machine, no OPC UA client connections, same project, part of custom configuration written once over OPC UA to Input|Retain variables --> No Error
* PLC isolated from machine, no OPC UA client connections, same project, C++ real-time program mentioned above removed --> No Error

From here, we will debug the real-time program further or come up with an alternative implementation strategy that removes the need for it. Main remaining questions:

* Is there anything that might be good for us to try to nail down what in our real-time program might be causing issues?
* File system and OPC UA errors are a bit cryptic in this situation w.r.t. root cause. Can’t really think how it could be improved however, and it is very hard to programmatically trace back to the real-time program.
Cheers,
Lindsay

“Is there anything that might be good for us to try to nail down what in our real-time program might be causing issues?”
The warning messages around the Critical message from the OPC UA server indicate “too many open files”, so that’s likely to be the problem. Since you’re using sockets, and sockets are treated as files, there’s the chance that there are too many open sockets.
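One way to test the “too many open sockets” theory is to watch how many descriptors the process holds over time. The following is a minimal sketch, assuming a Linux target with /proc mounted and that the code runs inside the process under suspicion. The function names CountOpenDescriptors and SoftDescriptorLimit are made up for this example and are not part of the PLCnext SDK.

```cpp
#include <dirent.h>         // opendir, readdir, closedir
#include <sys/resource.h>   // getrlimit, RLIMIT_NOFILE
#include <limits>
#include <string>

// Hypothetical helper: number of descriptors currently open in the process
// with the given PID ("self" means the calling process). Linux-specific:
// every entry in /proc/<pid>/fd is one open descriptor (file, socket, pipe...).
// The count includes the descriptor that opendir() itself uses for the scan.
// Returns -1 if the directory cannot be opened.
long CountOpenDescriptors(const std::string& pid = "self")
{
    const std::string path = "/proc/" + pid + "/fd";
    DIR* dir = opendir(path.c_str());
    if (dir == nullptr)
    {
        return -1;
    }
    long count = 0;
    while (const dirent* entry = readdir(dir))
    {
        const std::string name = entry->d_name;
        if (name != "." && name != "..")
        {
            ++count;
        }
    }
    closedir(dir);
    return count;
}

// Hypothetical helper: soft limit on open descriptors for the calling
// process, i.e. the point at which open(), socket() and opendir() start to
// fail with "Too many open files". Returns -1 if the limit cannot be read.
long SoftDescriptorLimit()
{
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
    {
        return -1;
    }
    if (lim.rlim_cur == RLIM_INFINITY)
    {
        return std::numeric_limits<long>::max();
    }
    return static_cast<long>(lim.rlim_cur);
}
```

If the count climbs steadily while the manipulator library is active and approaches the limit shortly before the errors appear, the leak is probably in that code path. These helpers do file I/O, so they would have to be called from a non-real-time thread. Note that once the limit has actually been reached, the scan itself can no longer open the directory and returns -1.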
Just in case it’s relevant here, please remember that - as discussed with Joshua recently - the Execute method in a C++ program should never call methods from open-source libraries, because it is 100% guaranteed that those libraries are not designed to run in a deterministic real-time environment like the ESM in PLCnext Control devices.
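One common way to respect this constraint is to run the library calls on a separate, non-real-time thread and let the cyclic code exchange only plain values with it. The sketch below uses standard C++ only; the class name LibraryBridge and its methods are hypothetical, no PLCnext API is shown, and it assumes std::atomic<double> is lock-free on the target (this can be checked with is_lock_free()).

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical bridge between cyclic real-time code and a blocking library.
class LibraryBridge
{
public:
    // `call` stands in for the blocking library call, e.g. one socket
    // round trip to the manipulator.
    explicit LibraryBridge(std::function<double(double)> call)
        : call_(std::move(call))
    {
    }

    ~LibraryBridge() { Stop(); }

    // Starts the worker thread; does nothing if it is already running.
    void Start()
    {
        if (running_.exchange(true))
        {
            return;
        }
        worker_ = std::thread([this] { Run(); });
    }

    // Asks the worker to finish and waits for it.
    void Stop()
    {
        running_.store(false);
        if (worker_.joinable())
        {
            worker_.join();
        }
    }

    // The only two methods intended for the cyclic code: each copies a
    // single value and performs no I/O and no allocation.
    void SetCommand(double value) { command_.store(value); }
    double GetFeedback() const { return feedback_.load(); }

private:
    // Worker loop: all calls into the library happen here, off the
    // real-time path.
    void Run()
    {
        while (running_.load())
        {
            feedback_.store(call_(command_.load()));
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::function<double(double)> call_;
    std::atomic<bool> running_{false};
    std::atomic<double> command_{0.0};
    std::atomic<double> feedback_{0.0};
    std::thread worker_;
};
```

With this split, a slow or misbehaving library call delays only the worker thread. The cyclic code sees a feedback value that may be a few milliseconds old, which is the trade-off of this design.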
“File system and OPC UA errors are a bit cryptic in this situation w.r.t. root cause. Can’t really think how it could be improved however, and very hard to programmatically trace back to the real-time program.”
The OPC UA server developers have agreed to improve the error message in this case. Components like that can only report the problem they’re having, e.g. “unable to open a file”. Unfortunately, it’s impossible for those Components to diagnose what has caused that problem.
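To illustrate why a component can only report the symptom, here is a hypothetical stand-in for an “is this directory empty” check. It is not the actual Arp implementation; the names DirState and CheckDirectory are invented for this sketch. When the process has run out of descriptors, opendir fails with EMFILE, whose standard message text is “Too many open files”, and nothing in that error says which code consumed the descriptors.

```cpp
#include <dirent.h>   // opendir, readdir, closedir
#include <cerrno>
#include <string>

enum class DirState { Empty, NotEmpty, Error };

// Hypothetical check: reports whether `path` contains any entries.
// On failure it returns DirState::Error and stores the errno value from
// opendir() in `err` (e.g. ENOENT if the directory is missing, EMFILE if
// the process has no free descriptors left).
DirState CheckDirectory(const std::string& path, int& err)
{
    err = 0;
    DIR* dir = opendir(path.c_str());
    if (dir == nullptr)
    {
        err = errno;
        return DirState::Error;
    }
    DirState state = DirState::Empty;
    while (const dirent* entry = readdir(dir))
    {
        const std::string name = entry->d_name;
        if (name != "." && name != "..")
        {
            state = DirState::NotEmpty;
            break;
        }
    }
    closedir(dir);
    return state;
}
```

The caller can log the error code and the path, but the descriptors were used up elsewhere in the process, so the root cause has to be found by inspecting the process as a whole.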