We talk to Director of Technical Operations Jeff Berube about powering Red 5's heavyweight MMO with cloud computing.
Over the last few years, cloud computing has become a marvel of the business world. By being able to order computing power on demand, IT departments have been able to eliminate the costs associated with commissioning servers and kitting out data centers. It also provides flexibility, supporting firms like Netflix, Spotify and Pinterest as their user bases grow.
Can cloud computing support a full-fledged MMO while still providing a stable, low latency experience to gamers? The developer behind sci-fi shooter MMO FireFall believed that it could, going on to convincingly prove it during an evolving, long-term beta that started earlier in 2012. You can even judge the fruits of their endeavour yourself this weekend, as Red 5 Studios are holding an open beta from February 22nd to 25th.
After hearing that FireFall was powered by Amazon’s AWS cloud platform, I was eager to discover why developer Red 5 Studios took this bold step. In an interview, Director of Technical Operations Jeff Berube explained that all the pieces seemed to fall into place, with cloud services like AWS maturing to the point where the concept became viable. He went on to describe how it’s made growing FireFall much easier, either when adding services or expanding into Europe.
Working with a small team that has backgrounds in systems, databases and security, Berube is responsible for infrastructure engineering and management at Red 5. Prior to joining the studio in January 2011, he was part of a team at Origin Systems growing Ultima Online. He also spent time at Blizzard Entertainment, designing the infrastructure behind World of Warcraft, which was later used as a platform for Starcraft II.
ZAM: What led to your decision to use Amazon as a hosting platform for all of FireFall, and not just your website?
As far back as my time at Origin, I talked about finding a way to manage infrastructure based upon just two metrics: available CPU and memory in the data center. That was well before technology had caught up with what I was hoping to do. As services such as Amazon’s AWS matured, it looked like the pieces needed were finally becoming available, but even I thought it would be difficult, if not impossible, to operate something as complex as an online game in a virtualized environment.
When I started at Red 5 Studios, I explained to Mark that I had this crazy idea to run everything in the cloud but I wasn’t ready to commit to it as a solution until I could prove it would work. As we built up the technology we would use to manage everything, we were surprised at just how well everything came together in the cloud. There are some really great tools available now, tools that had always needed to be created in-house, that made things both cleaner and faster to build.
We started using AWS (Amazon Web Services) for “production” the first month I was here, with the launch of some of our web infrastructure. I was able to get a single instance running our full game stack in AWS in late March and, working with the development team, create the first working cluster in May, if I remember correctly.
It wasn’t until we had a couple more months of experience running the game on EC2 (Elastic Compute Cloud) that I was comfortable letting Mark know that we would be able to launch in the cloud.
ZAM: What benefits does cloud hosting have, both for you as a developer and for us as gamers?
As a developer, there are a number of benefits. Probably the one we experience most frequently is that when we need to add new servers, or upgrade the ones we have, we have the ability to do so without a lengthy procurement process. This lets us prototype new architecture, test new software easily in parallel to existing systems, roll out a new feature as soon as it has passed QA, or add additional capacity to an existing service in response to player behavior. We are also able to scale services, both up and down, as the player population in a region changes.
I remember talking with our lead backend web developer right before we launched the store to the public for the first time. There were about 8 minutes before the feature went live and one of us mentioned it would probably be a good idea to have more than just a small cluster of servers backing that service. We tripled the size of the application server pool powering that feature by the time the first customers had the opportunity to access it.
An extremely important issue as it relates to the gamer’s experience and the developer’s ability to manage player expectations on launch day is adequate server capacity. With the traditional approach to building online game infrastructure, you trust that player forecasts are accurate and then make a decision concerning how much it is worth to the company to provide that level of capacity. It is very easy to either have inaccurate forecasting or purchase too much or too little hardware. With too little hardware, the initial experience is ruined for players and they may never get to see the real value of the experience you are trying to provide. In the case of too much hardware, players are spread too thin and a lot of money, money that could have gone into further development, is tied up in hardware and data center colocation costs.
Cloud hosting, like AWS, provides for near limitless and rapid scalability, within reason, of course. We pay for exactly what we need and only when we need it. If we find that we have under- or overestimated player interest, we can change the size of the server cluster that we use. Working with the development team, we have built a number of features that help us take advantage of the fluid nature of our “data center”.
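To make the idea of resizing a cluster as player interest changes more concrete, here is a minimal sketch in Python. It is not Red 5's tooling: the function names, the players-per-server figure and the 20 percent headroom are all hypothetical values chosen for illustration.

```python
def servers_needed(players: int, players_per_server: int,
                   headroom_pct: int = 20, minimum: int = 1) -> int:
    """Estimate how many servers to run for the current player count.

    All parameters are illustrative assumptions, not values from the interview.
    headroom_pct is spare capacity kept above the raw requirement.
    """
    if players < 0:
        raise ValueError("players must not be negative")
    if players_per_server <= 0:
        raise ValueError("players_per_server must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    numerator = players * (100 + headroom_pct)
    denominator = 100 * players_per_server
    required = -(-numerator // denominator)
    return max(minimum, required)


def scaling_delta(current_servers: int, players: int,
                  players_per_server: int, headroom_pct: int = 20) -> int:
    """Positive result: servers to add. Negative result: servers to release."""
    target = servers_needed(players, players_per_server, headroom_pct)
    return target - current_servers
```

With these assumed numbers, a region holding 1,000 players at 100 players per server would want 12 servers, and a 12-server cluster that drops to 300 players could release 8 of them.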
ZAM: How do you make sure that FireFall performs as well as we expect, being able to connect and play lag-free?
Being able to provide a gameplay experience at least as good as we could provide with dedicated hardware was our top priority while first building and testing the environment in EC2. We definitely had questions early in testing about whether “lag” that was seen was due to the shared nature of the virtualized servers in the cloud or something that we had introduced in game code. (It turned out to be something we were doing.)
There are also certain things we don’t have any visibility into, like network performance outside the actual server, that need to be thought about since we don’t have direct access to those parts of the service. We spent a long time building metrics collection and visualization into both our server infrastructure and the application itself. This is a process that will continue for as long as we are developing the app or infrastructure, of course.
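As a rough illustration of the kind of metric such collection might track, the Python sketch below keeps a rolling window of latency samples and reports a nearest-rank percentile. The class name, window size and choice of percentile are assumptions made for this example; the interview does not describe how Red 5's metrics are actually gathered.

```python
from collections import deque


class LatencyWindow:
    """Rolling window of recent latency samples (hypothetical helper)."""

    def __init__(self, size: int = 1000) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        # A bounded deque discards the oldest sample once the window is full.
        self.samples = deque(maxlen=size)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def percentile(self, p: int) -> float:
        """Nearest-rank percentile, with p given as an integer from 1 to 100."""
        if not 1 <= p <= 100:
            raise ValueError("p must be between 1 and 100")
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        # Integer ceiling of p * n / 100 gives the nearest-rank position.
        rank = -(-p * len(ordered) // 100)
        return ordered[rank - 1]
```

Watching a high percentile rather than the average is one common way to notice the occasional slow response that players experience as lag.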