Hey folks - here are some answers to the main questions in this thread, as well as some insight into our process and what we're working on to help us better debug the issue(s). Thanks for the questions and please keep on providing feedback!
What things do you look at when you get reports about in-game performance issues?
Performance issues and their related investigations are complex, and there are loads of things we look at. Just about anything could be a factor when it comes to performance issues. Generally, the big ones we look at are:
- The server(s) where the issue is occurring
- What time(s) the issue occurred
- Machine specs for impacted players
- What affected players were doing at the time the issue occurred
- What addons affected players have installed
All these things combined help us get a full picture of most issues. And from there, we expand our list of what we look at if we need more information, as we have for this particular issue. This includes requesting additional information from affected players and also implementing new logging when necessary.
What do the in-game performance charts look like now?
The ones that I personally look at the most (crashes, long load times, and server frames) are all pretty solid overall. We have a few crashes we’re addressing, but they are rarely occurring. Reports of long load times, especially on console, are very few, and server frames (how bogged down the server is) are also very stable, at least outside of prime time. During prime time, there are a few spikes that we’d like to figure out, and we are implementing more tooling (logging) in the next few incremental patches to try to get more info for our engineers.
What information is most helpful to get when investigating in-game performance issues?
Sharing with us the date and time the issue occurred (ideally to the minute) as well as what you were doing at the time is the most helpful. It’s also a huge bonus if you have a video/clip of it so we can see exactly what is being described. We have been watching the video clips you’ve been sharing with us for this issue – thank you!
What are you able to change when in-game performance issues occur?
It really depends on what the issue is. We are able to hotfix some things (hotfix = no server downtime) but most require an outage. For crashes, broken items/spawns and exploits, it is generally “easier” to find a fix and get something out quickly. “Input lag”, “high ping” or “low frame rate” type issues are much harder to pinpoint and often take a great deal of time to find and fix. However, in the case of input lag or high ping, if it’s a result of an attack on the servers, we are able to adjust mitigation whenever we need.
Why do investigations sometimes take a long time?
There are lots of different reasons. Some bugs are easy to repro, others are not. Load-related bugs (in general the input lag or high ping type issues) are the most difficult and time-consuming to track down. Remember ESO is a HUGE game, with millions of lines of code (over 25mil) to sift through. Then you have countless different machine types, drivers, internet types/locations in the world, addons... etc. to make bug hunting more complex.
On top of that, massive servers with hundreds of thousands of players on them at any one time are really, really hard to replicate internally. It’s tough!
How many different groups have been involved in investigating this issue?
Quite a few people are involved in investigations for ongoing game performance issues, including people from our BI (Business Intelligence) group, Community, Customer Support, Game Design, Engineering, QA and Live Services teams.
Has anyone at ZOS been able to replicate this issue?
We have not been able to consistently replicate this issue on the live servers, or at all on our internal servers due to some live server factors such as load being involved. While some of our developers have experienced it on the live servers alongside our players, there’s been no clear and consistent “smoking gun” root cause we’ve been able to identify so far.
Can someone confirm the role of the AI chat monitoring here?
We’ve extensively investigated if this could have an impact on game performance. The way the tool works, it sifts through chat logs that are exported from the game. So, it isn’t integrated into ESO, and never actually touches the game or the chat servers.
Regarding the tool itself, it helps us identify at risk chat and areas of concern faster. But our agents make the determination on whether an account is suspended or banned. We do have some auto kick rules for chat spam, but anything automated is temporary. Our agents review every automatic action against account history before taking any permanent account action.
We have also been using the tool for a couple of years and nothing has changed in how we log chat, so the timing doesn’t line up. The situation a couple of months ago with increased actioning related to chat logs was more a case of changes to our processes and training up new personnel than with the tool itself. Ultimately, though, yes, we did look at whether the tool was having any noticeable impact on game performance, and it’s not.
What is your process like when an in-game performance issue is reported?
The first step is to gather as much info as we can. Is this just one person running into it, or is it multiple? Is it region/location specific? Platform specific? Does this only happen during prime time? Once we have all the info available, we then start to work on internal repro. If we can find a way to repro it, we’re generally able to get a fix together quickly.
What work has been completed or is in progress so far?
Outside of the ongoing investigations based on the information all of you have provided, we’ve been working on lots of logging additions for the most part. None of these are going to solve any problems on their own, but they should give us better mechanisms for diagnosing the problem.
- Server logging - specifically collecting more metrics about a player's experience from within an instance/region (number of messages per frame, avg and max bytes sent & received, etc.).
- Client logging of alternate quit methods. Clients closed via Alt+F4 or the window 'X' button terminate instantly, so the server doesn't know the quit was intentional and just treats it as a client timeout. (This is to aid in false-positive detection.)
- HeartbeatDisconnector (basically a message that checks to see if your client is still connected) should accept any message as traffic, not just heartbeats. If you get a ton of messages, it's rare but possible your heartbeat doesn't get processed in time and you get disconnected.
- Present a more accurate ping in the game client. Our latency meter is not reflective of what the actual ping is. It's currently tied to framerate and load screens and what not.
- Investigate the Bandits UI addon (in the top 15 most used add-ons currently). There are lots of forum posts pointing to this add-on for potential latency issues. The suspicion is that the addon is using map pings as somewhat of a backdoor to send arbitrary data to clients, regardless of whether players have the add-on installed. The pings would get broadcast to everyone and cause game performance problems (latency & frame rate).
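For anyone curious what the HeartbeatDisconnector change above means in practice, here is a rough toy sketch in Python. It is not our actual server code: the class name `ConnectionWatchdog`, the 10 second timeout, and the `simulate` scenario are all made up for illustration. It only shows the general idea of letting any inbound message reset the idle timer instead of heartbeats alone.

```python
from dataclasses import dataclass


@dataclass
class ConnectionWatchdog:
    """Hypothetical stand-in for the HeartbeatDisconnector idea (illustrative only)."""
    timeout_s: float                 # assumed idle limit before disconnecting
    any_message_counts: bool = True  # proposed behavior: all traffic keeps the link alive
    last_seen_s: float = 0.0         # time of the last message that refreshed the timer

    def on_message(self, now_s: float, is_heartbeat: bool) -> None:
        # Heartbeat-only mode: only heartbeats refresh the timer.
        # Any-message mode: every inbound message refreshes it.
        if is_heartbeat or self.any_message_counts:
            self.last_seen_s = now_s

    def should_disconnect(self, now_s: float) -> bool:
        # Disconnect once nothing has refreshed the timer for longer than the limit.
        return now_s - self.last_seen_s > self.timeout_s


def simulate(any_message_counts: bool) -> bool:
    """Return True if a busy client gets disconnected in this toy scenario."""
    dog = ConnectionWatchdog(timeout_s=10.0, any_message_counts=any_message_counts)
    dog.on_message(now_s=0.0, is_heartbeat=True)
    # The client sends a gameplay message every second, but its heartbeats
    # are stuck behind that traffic and never get processed.
    for t in range(1, 16):
        dog.on_message(now_s=float(t), is_heartbeat=False)
    return dog.should_disconnect(now_s=15.0)
```

In this toy scenario, `simulate(False)` (heartbeats only) ends with the client being dropped even though it was clearly still connected, while `simulate(True)` (any message counts as traffic) keeps it connected.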