Evolving Polystream’s Build – Our Journey So Far
No matter the size and ambition of a technology project, there are some basic principles that ring true: the need to maintain quality, and the need to do things quickly and efficiently. In other words, to do things well, at scale. Over the last six years at Polystream, we’ve gone on a journey with our build and continuous integration process: consistently analysing it, assessing its limitations, and making changes in pursuit of higher efficiency that doesn’t compromise our ability to provide quality.
For this blog, our software engineer Olivier Carrière shares insights into how our team continues to evolve Polystream’s build.
From the very beginning, we knew we had to automate our build & continuous integration process.
As a technology company, we take the quality of our releases very seriously, which means keeping an eye on regression. Manually implementing a complex release process would be an inefficient solution, bogging down iteration and lengthening our delivery times.
As the majority of applications ripe for command streaming, namely games and creative tools, are predominantly consumed on Windows computers, Polystream adopted Visual Studio’s source repository and build systems (originally known as ‘Visual Studio Team Services’ and later renamed ‘Azure DevOps Services’).
Given the familiarity the team had with Microsoft products at the time, it was an easy and low-cost decision to make. Very rapidly, an automated build process was rolled out, using the visual pipeline editor to define the various CI processes we needed.
In the early stages, the central streaming technology would run a single CI pipeline which would build, test and release, whereas the platform side of the business would run a collection of pipelines, each associated with individual services and components. The two sides would communicate through a release process involving dropping files into Azure blob storage.
Although getting started with Azure pipelines is easy and doesn’t require a large investment of working hours, we eventually reached a stage where operating it started to become a burden:
Some pipelines started becoming very long, resulting in lengthy build processes. We have dozens of identical tests to run on various data sets, and we need a build system that can run them concurrently as efficiently as possible. Ideally, we would like to be able to model a graph-like dependency topology.
Some tests, requiring a GPU, were being run on machines under people’s desks (!)
The number of pipelines to be maintained was starting to grow, meaning a lot of clicking around when updating the CI process across pipelines.
Having a pipeline definition that’s not attached to the sources made evolving the build process in line with the source a hacky and error-prone business.
Looking at improvements
With these technical limitations strong in our minds, we knew we had to reassess. Our first port of call was to examine the market, and find what was able to accommodate our build system.
Our aim was to solve these problems, but also retain some key features:
Builds and tests should be able to run on Windows and Linux as different components have different targets.
Some of our tests require a GPU to run so that we can cover end-to-end scenarios.
We need a decent UI to navigate reports and drive our CI manually.
We explored the CI landscape and identified a selection of candidates that looked like they might fit the requirements. Off the bat, a very large proportion of those were rejected as there was no possibility to run Windows agents on them.
This left us with a shorter list to explore:
Highly flexible pipeline configuration
Easily extensible. Extremely flexible in how it’s extended (as they say, it’s a “thing doer”, which means it can be adapted to hook in pretty much anything and do any process)
YAML config bundled with code
Everything is docker containers
Funky dashboard, probably too minimal to meet the team’s needs as it’s visualisation only.
Everything is Docker containers, which means GPU support on Windows agents is going to be problematic.
Lots of supported tools and plugins.
Only linear pipelines supported
No particular USP to stand out from Azure DevOps.
The concept of stages+jobs+tasks would allow the implementation of the build scenarios we need to implement.
Build configuration can be bundled with code.
Although the build configuration can be bundled with code, there are two specs, one in YAML and one in Java. The YAML spec isn’t fully featured.
When we attempted to deploy it, it blew up at startup unless we used a license-restricted Oracle JVM to run it.
Simple and clear config files
Some nice features like ad hoc supporting services
Simple setup (start a docker image)
Documentation is minimal
Pipelines are linear
Tons and tons of plugins
Custom pipelines are supported with explicitly parallel stages
Very difficult to maintain
Poor documentation in places.
The team is already familiar with the tool
The YAML pipeline definition unlocks a remarkable set of functionality:
– Complex job hierarchies can be expressed with dependencies.
– Parallel and matrix job scaling strategies are powerful tools for implementing large test matrices.
– They can be composed using templates.
– The pipeline definitions are entirely bundled with the code on GitHub, meaning radically different pipelines can be used in branches.
– Jobs in pipelines are executed in order of dependencies, meaning agent utilisation is maximised. (Most other CI frameworks use stages to enable parallelisation, which is more restrictive.)
The basic license for concurrent agents is affordable enough for us.
Docker-hosted agents are not natively supported. Although some of our jobs can’t run in containers, it’s always good to have the option for the jobs that can. (With some effort one can create Docker agents, but it’s a fairly manual process.)
The quality of the documentation can be variable.
Implementing a better solution
After careful consideration and some experimentation, we came to an unexpected conclusion: we were using the right tool already, just not in the right way.
As it turned out, Azure DevOps supported all the features we needed, but they were more or less “hidden” behind the YAML pipeline format.
Streaming technology pipeline
By using this pipeline definition format, we were now able to express a more complex job dependency graph for our core streaming tech.
In this approximate view, we can see that some jobs depend on multiple upstream jobs. Azure DevOps gradually consumes those jobs in dependency order.
Job parallelisation is implemented using the matrix concept. It allows us to spread our tests and to execute them simultaneously over all the build agents we have available.
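As a rough illustration, a job graph with a test matrix might be written along these lines in Azure Pipelines YAML. This is a minimal sketch, not Polystream’s actual pipeline: the job names, data set names and `echo` commands are all hypothetical placeholders.

```yaml
# Hypothetical job graph: names, scripts and data sets are illustrative only.
jobs:
- job: BuildClient
  steps:
  - script: echo "build the client"

- job: BuildServer
  steps:
  - script: echo "build the server"

# Becomes eligible as soon as both builds finish; no stage boundary is involved.
- job: StreamTests
  dependsOn:
  - BuildClient
  - BuildServer
  strategy:
    # One copy of the job is generated per entry, each with its own variable values.
    matrix:
      datasetA:
        dataSet: 'dataset-a'
      datasetB:
        dataSet: 'dataset-b'
    maxParallel: 2   # upper bound on how many copies run at the same time
  steps:
  - script: echo "run tests against $(dataSet)"

- job: Package
  dependsOn: StreamTests
  steps:
  - script: echo "package the release"
```

Because `BuildClient` and `BuildServer` declare no dependencies, they can be picked up by two agents at once, and each matrix entry of `StreamTests` is scheduled as its own job once both builds have succeeded.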
We can also address another issue: the pipeline definition duplication we experienced in our platform-side builds. In that case, we can use the template construct and compose those similar pipelines from the same central definition. This enables us to propagate changes more rapidly throughout the pipelines when needed and generally helps with maintenance.
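A minimal sketch of the template construct, assuming a hypothetical shared file `templates/service-build.yml` and a hypothetical service called `accounts` (neither name comes from the real repositories). The shared definition declares its parameters and the jobs every service needs:

```yaml
# templates/service-build.yml (hypothetical shared definition)
parameters:
- name: serviceName
  type: string

jobs:
- job: Build_${{ parameters.serviceName }}
  steps:
  - script: echo "build and test ${{ parameters.serviceName }}"
```

Each service pipeline then shrinks to a reference to that template:

```yaml
# azure-pipelines.yml for one service (hypothetical)
jobs:
- template: templates/service-build.yml
  parameters:
    serviceName: accounts
```

A change made once in the template is picked up by every pipeline that references it, which is what removes the need to click through each pipeline individually.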
Relying on having computers sitting under people’s desks to run CI tasks isn’t ideal from a reliability point of view. We also wanted to increase our parallel running capacity. So our ideal solution would consist of several CPUs, but also several GPUs.
We considered a rack-mounted host solution with integrated GPUs, but they are expensive and the space was too limited in our server room to justify the added cost.
That’s why we opted for a collection of low-power HTPC-style PCs with AMD 3200G CPUs, a solution that combines a reasonably low space footprint with an adequate amount of CPU and GPU power. Power usage is also kept in relative check at around 40W per unit.
Using VM resources
Before we built our build farm, we used a single rack VM host to run our single monolithic and linear build task. Due to the nature of our process, we used it more or less as a single really beefy VM agent with a truly enormous RAM disk in an attempt to reduce the overall execution time.
Once decommissioned for this purpose, we found ourselves with a reasonably decent amount of CPU power at our disposal, which was repurposed as a Linux build agent host. We then set up a Kubernetes cluster on this machine to run Azure agents. We use an internally developed Docker image, as Microsoft ceased supporting its own. It can be found at: https://hub.docker.com/r/polystream/azure-devops-agent.
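For readers who want to try something similar, a Kubernetes Deployment for such agents might look roughly like the sketch below. Everything except the image name is an assumption: the replica count, pool name, organisation URL and secret name are placeholders, and the `AZP_URL`, `AZP_POOL` and `AZP_TOKEN` variable names follow the convention used in Microsoft’s container agent documentation, which this particular image may or may not share. Check the image’s own documentation before relying on it.

```yaml
# Hypothetical Deployment; only the image name comes from the article.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-devops-agent
spec:
  replicas: 4                      # assumed number of concurrent Linux agents
  selector:
    matchLabels:
      app: azure-devops-agent
  template:
    metadata:
      labels:
        app: azure-devops-agent
    spec:
      containers:
      - name: agent
        image: polystream/azure-devops-agent
        env:
        # Assumed variable names; verify against the image documentation.
        - name: AZP_URL
          value: https://dev.azure.com/your-organisation
        - name: AZP_POOL
          value: linux-k8s
        - name: AZP_TOKEN
          valueFrom:
            secretKeyRef:
              name: azure-devops-agent   # hypothetical Secret holding the access token
              key: token
```

Keeping the token in a Secret rather than in the manifest means the Deployment itself can live in source control.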
Therefore, after the transition, we had at our disposal a pool of Windows agents with GPUs and a pool of Kubernetes hosted Linux agents.
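With two pools available, each job can state where it needs to run. The sketch below assumes hypothetical pool names (`windows-gpu`, `linux-k8s`) and a hypothetical user-defined agent capability called `gpu`; none of these names come from the actual setup.

```yaml
# Hypothetical pool names and capability; the real ones are not given in this post.
jobs:
- job: GpuEndToEndTests
  pool:
    name: windows-gpu        # self-hosted Windows agents
    demands:
    - gpu                    # only agents advertising this capability are eligible
  steps:
  - pwsh: Write-Host "run GPU end-to-end tests"

- job: LinuxBuild
  pool:
    name: linux-k8s          # agents hosted on the Kubernetes cluster
  steps:
  - bash: echo "build Linux components"
```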
Results after evolution
Once the transition to the new system was completed, our streaming technology build and test process saw its execution time reduced from a dozen minutes to half that value. Quite importantly, it paved the way for adding even more parallel build steps without a noticeable impact on the total execution time.
Similarly, platform-side builds have had their execution times slashed from a couple of minutes to a few seconds. Certainly, taking control of the execution agents improved the responsiveness significantly.
Generally, after the expected teething problems of the transition, the usability, reliability and maintainability of the CI system have noticeably improved.
The current solution ticks a lot of our CI requirement boxes. Although in some areas of the business we have moved on to using another solution, it’s the sole CI infrastructure we use for streaming technology and all our web and front end development. It appears to be coping relatively well with the ever increasing demands we place on it and should be able to scale up further as we grow our business.
Of course, the fact that it serves our needs currently doesn’t mean that it will be adequate forever. There are a lot of other solutions we had an early look at that we had to discard because they did not support the features we needed at the time. As we grow our business, there will come a time when we have to evaluate the market once again, and I look forward to writing the next update on that!
- We use the hosted version of Azure DevOps. Although there is an option to self-host the service, there are some benefits we get from the hosted version:
- We have a distributed workforce, so it is more practical to be able to access that service from outside our private network.
- We also use the public hosted version of GitHub. This means we can implement validation hooks, for instance, which is a great quality-of-life improvement when doing Pull Requests.
- However powerful the YAML format can be for our uses, it can grow to be quite messy. The reason is that the templating engine has some quirks that prevent composition in some cases. One way to make it more concise is to move as much of the CI process definition as possible out of the YAML pipeline file and into scripts. We commonly use PowerShell on Windows or bash on Linux. An added benefit is that those scripts can be called by developers outside of the CI to perform local checks. A drawback of this approach, however, is that the information displayed by the Azure DevOps UI is less granular, as it is only aware of running “a script”.
- Run time eventually crept back up. But so much more is being run now. Throwing more hardware at it would partly solve the problem, but it’s worth noting that some individual steps have also increased in execution time. There is some scope for optimisation and streamlining.
- Azure DevOps releases is a handy tool to manage your release process. It provides a convenient set of UI elements for your release managers and it’s fairly flexible.
- The backend side of the business eventually moved on to using Codefresh. Given the nature of its integration and regression testing, as well as its release process, it made sense to change the CI infrastructure there.
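To illustrate the note above about moving CI logic out of the YAML and into scripts, a pipeline written that way might be reduced to something like the following. This is a sketch under assumptions: the script paths and pool names are hypothetical, and the scripts themselves would hold the real build and test logic.

```yaml
# Hypothetical script paths and pool names; developers can run the same scripts locally.
jobs:
- job: WindowsChecks
  pool:
    name: windows-gpu
  steps:
  - pwsh: ./ci/run-checks.ps1
    displayName: Run checks (PowerShell)

- job: LinuxChecks
  pool:
    name: linux-k8s
  steps:
  - bash: ./ci/run-checks.sh
    displayName: Run checks (bash)
```

The trade-off described in that note is visible here: the Azure DevOps UI reports one step per job, so any finer-grained progress has to come from the script’s own log output.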