How do you not know the root cause? Do you guys not have any observability within your architecture?
Our telemetry/observability costs are eye-watering. One issue is a poor signal-to-noise ratio, which we are working on.
We are a little closer to a root cause. A default value for locking behavior changed in the MySQL update and triggered the issue. We're rolling out a fix for that now.