Relying on Reliability – How Polystream is Future Proofing Failure in the Technology Behind the Metaverse
Here at Polystream, we work with a range of tools to gain insights into our operations and give us a heads up when things might not be going according to plan.
My name is Theresa, and I’m a Site Reliability Engineer. It’s my job to oversee how we detect and handle failure across the various parts of our technology, so that we deliver a powerful and seamless experience in the Metaverse. In this post, I want to introduce you to our observability framework and show how we worked around some of the limitations that arise when getting different tools to work together.
What SRE means at Polystream
Pioneered by Google, Site Reliability Engineering (SRE) collectively describes the concepts, tools and practices for running software systems in production reliably at scale. Expectations for the availability of production systems usually derive directly from the user experience, which determines how much downtime we can tolerate from our systems.
As we often work with game studios and tech companies on products in a pre-release state, we don’t always have access to adequate datasets that allow us to make informed decisions about things like service levels and error budgets, the parameters that help evaluate service performance. You may therefore ask, why introduce SRE at a relatively early stage?
To me it boils down to two very important aspects:
- Avoidance of Technical Debt
- Improved product design from concept to delivery
Technical debt builds up over time. In the words of Ward Cunningham, who coined the term: “A little debt speeds development so long as it is paid promptly”. While we’re developing the Metaverse, we need visibility into the future impact that rapid changes or improvements to technology might have, so we can make the right decisions now and avoid what would later become technical debt.
Theresa von Laffert
Improved product design is a byproduct of reduced technical debt. By keeping track of weaknesses and trying to resolve them in a timely manner, we can avoid compounding debts whilst building more stability into processes and systems. In addition, building out the necessary tools to monitor systems helps us to understand them on the smallest component level. This is significant as we grow in complexity, and will help us better serve customers all over the world.
One important part of improved product design is putting in place, as early as possible, the tools that allow us to monitor existing systems and analyse what is happening below the surface. They help us draw conclusions about the health and capabilities of our systems, and therefore about the impact those systems will have on others who build on top of our platform. The quality of our response to failure depends on how well our monitoring tools capture the state of our systems.
Our observability framework
In the world of SRE, observability is an important concept that rests on three main parts: event logs, metrics, and traces. While none of them alone necessarily allows us to draw the right conclusions, combined they help us develop a pretty good understanding of our systems:
- We use an Elastic Stack-based logging platform to collect and manage event logs from our platform. Our primary use is to be informed about important events that occur during a session. (A session describes a stream between the Polystream client that you can run locally on your machine and one of our streaming servers running in the cloud.)
- We use infrastructure monitoring tools for metrics collection and tracing. Data is collected via lightweight agents that are deployed directly onto the streaming servers. An intelligent processing mechanism captures metrics and forwards them to a centralised monitoring hub, which allows for further investigation, tracing and alert definition.
- On top of these tools we use xMatters and Slack for managing event notifications.
Bringing it all together
When it comes to bringing all our data and metrics collection tools together, collecting the data is only one part of the puzzle. The other part is identifying alertable conditions in that data. To react quickly and accurately, we need to summarize the relevant event information and send it to the right person, or automate the reaction to these events. This is where xMatters comes into play.
xMatters is an automation platform with a highly customizable on-call rota management system.
The bread and butter of xMatters are its so-called workflows, each a combination of steps that refer to a specific action. Those actions can be events that take place, like an HTTP call or a status change of a service, and they can produce an output that is used in subsequent steps.
Most workflows start with a trigger, which is essentially the first step in the workflow. They often end with a step that raises an event or sends a notification to a group of people defined through an on-call schedule.
At Polystream, we use xMatters as the connective tissue for event management and alerting. We rely on its ability to integrate information from various external sources, collect it, and process it into a single event that is brought to our attention based on certain criteria. This allows us to reduce alert noise on the one hand, and on the other to define very specific conditions that must be met for an alert to be sent.
One example is a critical condition triggered on our infrastructure monitoring platform, such as a process on a streaming server being down for longer than we can tolerate. The alert is configured with a webhook to xMatters, which initiates the workflow by sending a JSON payload containing the important alert data. The xMatters event trigger receives the basics. Before the alert is forwarded to our dedicated Slack channel, however, we ask xMatters to do some extra work and find out in which of our environments the issue was reported. This is conveniently implemented with a “Switch” step that evaluates the value of the environment parameter received at the Trigger step.
From here, xMatters initiates the rest of the workflow based on the value received for the environment. We implement this separation because we have a separate Slack channel for each of our two environments (we don’t want any alerts for our development environment ending up in our production Slack channel).
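The routing logic of such a Switch step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not xMatters code: the channel names, the `environment` key and the other payload fields are assumptions made for the example.

```python
# Hypothetical channel names; the post only states that each
# environment has its own Slack channel.
CHANNEL_BY_ENVIRONMENT = {
    "production": "#alerts-production",
    "development": "#alerts-development",
}


def route_alert(payload: dict) -> str:
    """Pick the Slack channel for an alert, mimicking a Switch step."""
    # Normalise the value so "Development" and "development" match.
    environment = str(payload.get("environment", "")).strip().lower()
    try:
        return CHANNEL_BY_ENVIRONMENT[environment]
    except KeyError:
        # Fail loudly rather than guess a channel for an unknown value.
        raise ValueError(f"unknown environment: {environment!r}") from None


# Hypothetical webhook payload.
alert = {"environment": "Development", "process": "streamer", "status": "down"}
channel = route_alert(alert)
```

Raising on an unknown environment is a deliberate choice in this sketch: a misrouted alert is harder to notice than a failed workflow run.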
In the next step xMatters sends out a further request to get any data lacking from the original alert information.
Lastly, we convert the value for the timestamp into human readable format before finally sending it to our Slack channels for our attention. Automating some of our troubleshooting steps in advance and streamlining alerts into the right notification channels saves us a lot of time when dealing with failures on our platform.
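As a rough sketch of the timestamp conversion, assuming the alert carries its timestamp as epoch seconds (the post does not say which format the monitoring platform uses), the step could look like this in Python:

```python
from datetime import datetime, timezone


def humanize_timestamp(epoch_seconds: float) -> str:
    """Convert epoch seconds into a readable UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# One day after the Unix epoch.
readable = humanize_timestamp(86400)
```

Pinning the output to UTC keeps the message unambiguous for an on-call team spread across time zones.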
The challenge of combining different monitoring approaches
One of the difficulties I experienced in moderating the flow of information between the tools in our observability framework is rooted in the concept of time series monitoring. Elastic Stack is our source of truth when it comes to event logs, and we use a webhook to integrate alert information from Elastic Stack into xMatters, similar to the workflow with our infrastructure monitoring platform.
Alerting in Elastic Stack is based on the definition of rules which run periodically on the log platform to check for specific conditions. If the condition is fulfilled within a specified schedule, the alert is triggered and initiates the workflow in xMatters.
Currently, our configured time interval for evaluating the alert condition is five minutes, which means we can react at most 12 times per hour, and a LOT can happen in five minutes. If we were to shorten the interval, a fleet of thousands of servers could mean thousands of alertable conditions being met in a single second, which isn’t practical either!
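The trade-off can be made concrete with a little arithmetic. The sketch below assumes, for illustration only, that each rule check sends at most one aggregated alert and that 2,000 servers each log one alertable event in an hour; both numbers are hypothetical.

```python
def checks_per_hour(interval_seconds: int) -> int:
    """How often a rule with the given interval runs in one hour."""
    return 3600 // interval_seconds


def max_notifications_per_hour(events_per_hour: int, interval_seconds: int) -> int:
    """Upper bound on notifications if each check sends one aggregated alert."""
    return min(events_per_hour, checks_per_hour(interval_seconds))


five_minute_checks = checks_per_hour(300)                  # 12
aggregated = max_notifications_per_hour(2000, 300)         # 12
near_real_time = max_notifications_per_hour(2000, 1)       # 2000
```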
There are good intentions behind the aggregation of events over time, especially when considering what viable implementations might be needed when there are clear service level objectives. For real-time alerting, we have been experimenting with automating responses to single events.
For example, one of the conditions we alert on is when a streaming server closes a session unexpectedly. If more than one of these logs is collected within a period of five minutes, they get aggregated into an array of JSON objects.
We then let xMatters gather additional information about the affected servers. To extract the values of the event fields, we need to loop through every payload received through the trigger step in xMatters and read out each corresponding IP address. An array passed as a single input value is not enough on its own to automate server-specific recovery steps.
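To illustrate what that aggregation means, here is a simplified Python sketch that groups events into fixed five-minute windows and serialises each window as one JSON array. It is not how Elastic Stack evaluates rules internally; the field names and the fixed-window bucketing are assumptions made for the example.

```python
import json
from collections import defaultdict

WINDOW_SECONDS = 300  # the five-minute interval mentioned above


def group_by_window(events, window_seconds=WINDOW_SECONDS):
    """Group events by the start of the fixed window their timestamp falls in."""
    windows = defaultdict(list)
    for event in events:
        window_start = event["timestamp"] - event["timestamp"] % window_seconds
        windows[window_start].append(event)
    return dict(windows)


def to_payload(window_events):
    """Serialise all events of one window as a single JSON array."""
    return json.dumps(window_events)


# Hypothetical events with epoch-second timestamps and illustrative field names.
events = [
    {"timestamp": 1000, "ip_address": "10.0.0.1", "message": "session closed unexpectedly"},
    {"timestamp": 1100, "ip_address": "10.0.0.2", "message": "session closed unexpectedly"},
    {"timestamp": 1300, "ip_address": "10.0.0.3", "message": "session closed unexpectedly"},
]
windows = group_by_window(events)
payloads = {start: to_payload(batch) for start, batch in windows.items()}
```

With these sample timestamps, the first two events share a window and arrive downstream as one array, while the third lands in the next window on its own.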
For loops have been my best friend for parsing field values of multiple events (null checks omitted for brevity).
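A comparable loop can be sketched in Python. This is a stand-in for illustration rather than the xMatters step itself, and the `ip_address` field name is an assumption. Unlike the version described above, it includes a basic check for missing values.

```python
import json


def extract_ip_addresses(raw_payload: str) -> list:
    """Return the distinct IP addresses found in an aggregated alert payload."""
    events = json.loads(raw_payload)
    # A single event may arrive as one object instead of an array.
    if isinstance(events, dict):
        events = [events]

    addresses = []
    for event in events:
        ip = event.get("ip_address")  # hypothetical field name
        # Skip events without an address and avoid duplicates.
        if ip and ip not in addresses:
            addresses.append(ip)
    return addresses


raw = json.dumps([
    {"ip_address": "10.0.0.1", "message": "session closed unexpectedly"},
    {"ip_address": "10.0.0.2", "message": "session closed unexpectedly"},
    {"ip_address": "10.0.0.1", "message": "session closed unexpectedly"},
    {"message": "session closed unexpectedly"},
])
affected_servers = extract_ip_addresses(raw)
```

Each address in the resulting list could then feed a server-specific follow-up step.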
While parsing individual events in xMatters combined with a reduced time interval for event aggregation in Elastic Stack can help our goal for real-time alerting, there are still some difficulties with that approach.
- On the one hand, the code required to parse individual events is prone to break whenever the alert structure changes. In addition, each step in the workflow needs to reflect those changes to pass on the correct output.
- On the other hand, we need to tackle the problem of alert fatigue, for example by grouping events logically or by using xMatters’ ability to disable workflows after they have been triggered a certain number of times while still keeping a record of incoming events.
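The second idea, capping notifications while still recording every event, can be sketched as follows. This is plain Python to show the behaviour, not an xMatters feature or API; the class name and the limit are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ThrottledWorkflow:
    """Stops notifying after a limit but keeps a record of every event."""

    max_notifications: int
    notified: int = 0
    recorded: list = field(default_factory=list)

    def handle(self, event: dict) -> bool:
        """Record the event; return True if a notification should be sent."""
        self.recorded.append(event)
        if self.notified >= self.max_notifications:
            return False  # limit reached: stay silent, keep the record
        self.notified += 1
        return True


workflow = ThrottledWorkflow(max_notifications=2)
results = [workflow.handle({"id": i}) for i in range(4)]
```

In practice the counter would also need a reset, for example once the incident is resolved, which this sketch leaves out.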
Processing aggregated, time series based data from Elastic Stack in xMatters comes with some difficulties. Even so, our observability framework, and the way we use workflows to gather information and alert us to critical conditions in our systems, helps us plan how to react to events in real time. That experience will continue to drive our future decisions around monitoring tools.
As I mentioned earlier, one of the goals of early entry SRE is to enable people to build robust products before they come to market. Our vision is to build into our technology a monitoring system with capabilities to automatically react to specific events in real time and to notify our team about actionable alerts that are based on service level objectives. The more time we invest into the careful selection and configuration of our tools and infrastructure, the more efficiently we can deal with incidents in our live environment and implement preventative measures to recover from failures more quickly, which will be crucial to empowering Metaverse builders of the future.