Brewing Your Own Game Analytics Service
In this post, I describe how to implement a game analytics system to collect and store game data. At the end of the post you’ll find links to the source code for my sample implementation.
As a game developer, you must gather data about your game’s users, mine that data, and respond to it. The cloud being what it is today, there are multiple costs associated with collecting and using game data, including costs for transactions and storage.
There are of course plenty of services out there that offer exactly this kind of data collection. This is especially true in the mobile space, where it seems a new VC-funded game analytics company pops up every day. However, these services often come with lots of questions regarding data ownership, costs, reporting structures, and so forth. At bottom, they may or may not fit your needs.
As such, if you’re simply looking to understand more about a game analytics system, need more functionality, or just want to roll your own, let’s take a look at how to build your very own low-cost game analytics service from scratch.
To begin, we make the following assumptions:
- The user will play your client, and you’ll submit events to a server for cataloging.
- You have some client code that can make HTTP requests (we’ll use HTML5 in this article).
- You have some semi-resident server-side compute resource with direct access to a data store. We’ll use Google App Engine (GAE) in this article.
In a naive implementation, we assume that the client does the bulk of the work, and pushes data up to the cloud in a regular fashion. For example, we can push an event every time a rat is killed. This setup results in a simple dataflow between our components:
Let’s hypothetically say that our game has around 15,000 players a day, and each player kills about 2,000 rats – that’s a shocking 30 million events that we’ll be tracking. At this point, we need to take a hard look at what our cost structure is for storing and computing all that data. For instance, the latest pricing on Google App Engine charges about $0.10 per 100k writes to the Datastore, meaning you’d pay about $30 a day for 30 million writes. That’s a lot of money to throw around just to store in-game events for data-mining. I mean, if you have a dedicated data miner to track that information then the cost might be justified; otherwise, you’re going to need to find a more cost-effective solution.
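The arithmetic behind that estimate is worth writing down, since it's the number you'll be tuning against for the rest of this post:

```python
# Back-of-the-envelope cost of the naive one-write-per-event design,
# using the figures from the text: 15,000 daily players, ~2,000 rat
# kills each, and $0.10 per 100,000 Datastore writes.
def daily_write_cost(players, events_per_player, cost_per_100k_writes):
    """Dollars per day if every event becomes one Datastore write."""
    writes = players * events_per_player
    return writes / 100_000 * cost_per_100k_writes

cost = daily_write_cost(15_000, 2_000, 0.10)
print(round(cost, 2))  # 30.0 -- thirty dollars a day, every day
```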
In addition, your client may be sending huge amounts of data to the server (for instance if you’re tracking each mouse click in an RTS game). That’s lots of traffic and compute time that you’re churning just to collect some floating point numbers.
In the naive implementation above, the dominating cost factor is the sheer number of writes into the Datastore. To lower the cost, we have to reduce the number of writes.
The first thing we should do is determine the importance of the data that we’re collecting and how often we need to use that data. For instance, mouse clicks in an RTS game may be semi-important, but not so important that we need that data instantly. As such, we could batch those mouse clicks and submit them at the end of the game. This deferred batch-submit strategy is great for reducing the amount of transfers from client to server (since we only get the data at the end of the game), but doesn’t really help us reduce the number of writes into the Datastore (assuming that we’d write each entry in the batch to the Datastore after receipt).
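Here's a minimal sketch of that deferred batch-submit idea. I'm using Python for brevity (an HTML5 client would do the same in JavaScript), and `transport` is a stand-in for whatever HTTP POST call your client uses:

```python
import json

class EventBatcher:
    """Accumulates gameplay events locally and submits them in one
    request at the end of the game, instead of one request per event."""

    def __init__(self, transport):
        # `transport` is any callable that ships a payload to the
        # server; here it stands in for an HTTP POST.
        self.transport = transport
        self.events = []

    def record(self, name, **fields):
        self.events.append({"event": name, **fields})

    def end_of_game(self):
        """One deferred submission for the whole session."""
        self.transport(json.dumps(self.events))
        self.events = []

# Usage: one network round-trip for two thousand events.
sent = []
batcher = EventBatcher(sent.append)
for i in range(2000):
    batcher.record("rat_killed", rat_id=i)
batcher.end_of_game()
print(len(sent))                  # 1 request
print(len(json.loads(sent[0])))  # 2000 events
```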
To reduce the number of writes, we turn to another offering in App Engine. The Blobstore API allows our application to create data objects (called blobs) that are much larger than objects allowed in the Datastore service. The original Blobstore API only allowed clients to submit blobs via HTTP request, but the new experimental Blobstore API allows writing directly to the storage system from server-side code.
With this setup, we only submit a batch of events at the end of the game and then store them into the Blobstore, which drops our overall cost per day significantly.
There are, however, two issues with this setup. The first is client connectivity. Say a user plays for 20 minutes and is then disconnected, dropping data that we may really want. This issue is even more apparent on mobile platforms, where users are on unreliable network connections and you should expect random data loss during transmission.
Luckily clients can take advantage of persistent storage, which allows them to store batched data and attempt to resubmit the data at a later time. Having reliable network connections means submissions are more likely to succeed the first time, but any client-side batching system needs to have, at its core, the concept of cache-resubmit for any gathered data.
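Here's one way to sketch that cache-resubmit core. A file-backed queue stands in for the client's persistent storage (localStorage on an HTML5 client), and `transport` is a hypothetical upload call that raises on network failure:

```python
import json, os, tempfile

class ResubmitQueue:
    """Cache-resubmit: batches that fail to upload are kept in
    persistent storage and retried on the next opportunity."""

    def __init__(self, path, transport):
        self.path = path            # stand-in for client persistent storage
        self.transport = transport  # callable that raises IOError on failure

    def _load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return []

    def _save(self, batches):
        with open(self.path, "w") as f:
            json.dump(batches, f)

    def submit(self, batch):
        """Queue the batch, then try to flush everything queued."""
        pending = self._load() + [batch]
        remaining = []
        for b in pending:
            try:
                self.transport(b)
            except IOError:
                remaining.append(b)  # keep it for the next attempt
        self._save(remaining)
        return len(remaining)

# Usage: a transport that is offline on the first attempt.
calls, delivered = [], []
def flaky_transport(batch):
    calls.append(1)
    if len(calls) == 1:
        raise IOError("network down")
    delivered.append(batch)

queue_path = os.path.join(tempfile.mkdtemp(), "pending.json")
q = ResubmitQueue(queue_path, flaky_transport)
print(q.submit(["game-1 events"]))  # 1: cached for retry
print(q.submit(["game-2 events"]))  # 0: both batches delivered
```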
The second issue is that storing data in Blobstore limits our ability to do analysis on the data directly. Before we can mess with the data, we must read it from the Blobstore into a computational structure for usage. In other words, the query “give me all the users who’ve killed a rat today” requires us to read out the data from Blobstore into Datastore (or a similar container) before doing processing.
In an ideal world we’d allow clients to message-spam our server as much as they want, so that we wouldn’t have to worry about clients dropping out randomly and taking their precious data with them. We can accommodate such spammy submissions by batching events at the server.
For those of you who are new to cloud computing systems like Google App Engine, it’s worth emphasizing that GAE modules aren’t always running – rather, they are instances that are spun up depending on request volume. More importantly, they can get spun down as well, depending on request volume and infrastructure service scheduling. That means there’s really no way to keep a resident in-memory copy of data.
This is where App Engine Backends come in. Backends are pseudo-persistent, heavier-weight processes that can hang around for longer durations. With backends, we can allow clients and GAE instances to communicate as normal, and cache/batch the requests in a backend before submitting to the Blobstore.
The cost for this setup would be about $0.08/hour for the GAE backend, in addition to the size of the data that’s being stored in the Blobstore, as well as any additional front-end / back-end compute times.
With this setup, clients can be very spammy and intermittent, and the GAE instance effectively acts as a pass-through, simply handing data off to the backend.
You can also make the backend public-facing, allowing clients to submit data to it directly rather than going through the GAE instance. But be warned that this creates a point of vulnerability, as spammy or rogue clients could mount denial-of-service attacks against the backend. Combining the scalable GAE instance frontend with the longer-running backend eliminates this vulnerability while still allowing per-client scaling and throttling.
As the number of writes in our system increases, we’ll eventually hit an upper limit on the number of requests a backend can process. More specifically, depending on the size of our event structure, our 30 million events may exceed the capacity of the backend's available RAM.
One way to address this issue is to add an upper storage limit and start flushing data. For example, once the backend caches say 10MB of data, it can flush that data to the Blobstore for storage. This technique is particularly helpful given that GAE backends are not truly persistent – they are processes that can be rescheduled on different physical machines, which can die or become unavailable for various reasons. As such, we run into similar connectivity issues as we do with clients, although at much lower frequency. Adding regular flushes can ensure regular storage pulses to safeguard our system from losing data. A downside to these regular flushes is that they can take a chunk of time, and generally, we’ll need to flush during an event submission (if the cache gets full, we’ll need to flush before adding the new event). As such, the GAE instance can easily timeout waiting for the backend to flush its data.
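A sketch of that size-capped cache might look like the following. `store_blob` is my stand-in for the actual Blobstore write, and the threshold is tiny here purely for illustration (the text's example uses 10MB):

```python
class FlushingCache:
    """Backend-side cache that flushes buffered events to blob
    storage once they exceed a size threshold."""

    def __init__(self, store_blob, max_bytes):
        self.store_blob = store_blob  # stand-in for a Blobstore write
        self.max_bytes = max_bytes
        self.buffer = []
        self.size = 0

    def add(self, event_bytes):
        # Flush *before* adding when the new event would overflow the
        # cap -- this is the flush-during-submission the text warns
        # can stall the frontend while it waits on the backend.
        if self.size + len(event_bytes) > self.max_bytes:
            self.flush()
        self.buffer.append(event_bytes)
        self.size += len(event_bytes)

    def flush(self):
        if self.buffer:
            self.store_blob(b"".join(self.buffer))
            self.buffer = []
            self.size = 0

# Usage: with a 10-byte cap, the third 4-byte event triggers a flush.
blobs = []
cache = FlushingCache(blobs.append, max_bytes=10)
for _ in range(3):
    cache.add(b"aaaa")
cache.flush()                   # final safeguard flush
print([len(b) for b in blobs])  # [8, 4]
```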
A more scalable way to deal with the limited number of requests a backend can process is simply to increase the number of backends to the desired capacity, and use a hashing function in the GAE instance to distribute events evenly across them. In other words, once our traffic grows beyond what one backend can handle, we can add additional backends (which obviously increases the cost).
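One way to sketch that routing function (the md5-based hash and the per-player key are my own choices, not from any GAE API):

```python
import hashlib

def backend_for(player_id, num_backends):
    """Pick a backend for this player's events. A stable digest
    (rather than Python's per-process randomized hash()) keeps
    routing consistent across frontend instances, so each player's
    events always land on the same backend."""
    digest = hashlib.md5(player_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_backends

# Events spread roughly evenly across 4 backends:
counts = [0, 0, 0, 0]
for i in range(10_000):
    counts[backend_for("player-%d" % i, 4)] += 1
print(counts)
```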
Creating a balanced approach
One important point to note is that there are actually different types of data that you will want to collect for your game. Every game is different – there is no one best set of data for all games. Your game will have a range of statistics, each with its own collection and storage requirements. For instance:
- There may be some statistics that are fine to gather locally and push up at periodic intervals; others, you’ll want to store immediately because they are so critical.
- You may not need guaranteed delivery of every single event from every single game for every single player. You may just need “most” data or a representative amount of data. For example, you can log only data from a statistically relevant percentage of clients and then extrapolate results.
- Not all events can/should come from clients. For MMO games, most of the event calculation and game state reckoning takes place on a game server instance, and as such, that instance should have access to submit events as well.
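The sampling idea above can be sketched with a deterministic per-player check (the crc32 bucketing is my own choice for the sketch): hashing the player id means the same players stay in or out of the sample across sessions, which keeps the sampled data internally consistent.

```python
import zlib

def in_sample(player_id, sample_percent):
    """Decide, deterministically per player, whether this client
    should log events at all. Only `sample_percent` of players
    will log; extrapolate totals from that slice."""
    bucket = zlib.crc32(player_id.encode("utf-8")) % 100
    return bucket < sample_percent

# With a 10% sample, roughly one player in ten logs events:
sampled = sum(in_sample("player-%d" % i, 10) for i in range(10_000))
print(sampled)
```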
You should also adjust your data collection based on input loads. Tune your storage/flushing options based on where the bulk of your data is coming from. For any flushing point (client to server, or server to Blobstore) you should adjust when to flush based on duration and how much data is stored. Always remember that you're trading RAM for IO operations – keep data in RAM longer and you’ll need more RAM; flush data more often and you'll do more IO. Tweak the numbers constantly to find a good balance.
One final issue to consider is the cost associated with the size of your data in Blobstore. Since you’re charged per byte, it might be worth reducing the size of the stored data. Thankfully App Engine also has a solution for that through its ZIP API.
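To get a feel for the size win, here's a plain-Python round trip with the standard zlib module (not the App Engine API itself, but the same underlying idea): repetitive event records compress dramatically.

```python
import json, zlib

# A day's worth of repetitive event records compresses very well.
events = [{"event": "rat_killed", "player": i % 500, "x": 10.5, "y": 3.25}
          for i in range(10_000)]
raw = json.dumps(events).encode("utf-8")
packed = zlib.compress(raw, 9)

print(len(raw), len(packed))  # packed is a small fraction of raw

# The round trip is lossless:
assert json.loads(zlib.decompress(packed)) == events
```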
Our final implementation, shown in the figure below, includes data segmentation, multiple levels of batching, and data compression.
True cost gut-check
There are of course many additional costs involved with cloud processing. For example, there’s the computational cost (= instance hours) of processing/handling every individual HTTP request from clients, as well as the bandwidth cost associated with the per-HTTP-request overhead. To estimate your true cost, the best thing to do is to build a mock system and run some valid traffic through it to ballpark your numbers.
Reducing long-term data costs
When you implement your game analytics system, you should plan for success and consider what you want to do in the long haul. For example, imagine that on a good day, you’ll be storing some 50 million in-game events. That adds up to a significant amount of data to keep around long-term. After a year of production, the likelihood that a single day’s worth of data will be useful drops significantly, so keeping it in the cloud is going to cost you money for data that’s not used. In that case, you should consider moving the data into a form that reduces your cost over time.
One solution that may make sense is to move data regularly from the cloud to a local box, where you can access the data for a longer period of time at a lower cost. Before archiving, you should cache important data elements so that future analysis can reference the results of the data without having to pull it all back out from deep freeze.
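A sketch of that pre-archive caching step might look like this (the event shapes and aggregate names are illustrative, not from the post): compute the summaries you expect to query later, so the raw log can go to deep freeze without losing quick access to results.

```python
from collections import Counter

def daily_summary(events):
    """Precompute the aggregates worth keeping hot (here, rats
    killed per player and the day's total) before archiving the
    raw event log."""
    per_player = Counter()
    for e in events:
        if e["event"] == "rat_killed":
            per_player[e["player"]] += 1
    return {"total_kills": sum(per_player.values()),
            "kills_by_player": dict(per_player)}

events = [{"event": "rat_killed", "player": "alice"},
          {"event": "rat_killed", "player": "alice"},
          {"event": "rat_killed", "player": "bob"},
          {"event": "login", "player": "bob"}]
summary = daily_summary(events)
print(summary["total_kills"])  # 3
```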
You can find source code for each of the three tracking methods we’ve discussed here on my github page. With the App Engine SDK for Python, you can quickly upload and run the instances. Use the given HTML pages to test the system and see how things work.