These researchers are using Grand Theft Auto V to teach self-driving cars

Researchers at Intel Labs and the Technische Universität Darmstadt have developed a method that dramatically speeds up a critical step in how self-driving cars 'see.'

Driverless cars are still mainly a speculative technology, but the science behind how they work is undeniably fascinating. Chris Urmson -- who until recently headed up the self-driving car project at Google -- shared a great illustration of this in a 2015 TED talk, with side-by-side footage of how the human eye perceives objects on the road versus how a car's light and motion sensors interpret the information.

It's complex stuff, without a doubt. But perhaps the biggest hurdle self-driving cars must overcome is the human element. Specifically, how we teach computers what it is they're looking at.

One method involves hand-indexing large databases of images, like the Cambridge-driving Labeled Video Database, or CamVid. In these databases, researchers go through, image by image, and color-code every object: the road, other vehicles, pedestrians, sidewalks, fixtures, buildings, trees, and so on. It's a painstaking process, because it has to be done at pixel-perfect resolution, and every image in the CamVid dataset was checked by a second person for accuracy. In all, it took about an hour to color-code each image, and there are over 700 of them.
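
To make that concrete, here's a rough Python sketch of what one of those color-coded annotations boils down to. The palette below is illustrative rather than CamVid's exact specification, but the idea is the same: every pixel in the image has to match some class color.

```python
import numpy as np

# Illustrative class-to-color palette (CamVid defines a fixed RGB color
# per class; these particular values are just examples).
PALETTE = {
    (128, 64, 128): 0,  # road
    (0, 0, 192):    1,  # sidewalk
    (64, 0, 128):   2,  # car
    (64, 64, 0):    3,  # pedestrian
    (70, 70, 70):   4,  # building
}

def annotation_to_class_map(rgb_annotation: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) color-coded annotation into an (H, W) array
    of class indices. Every pixel must match a palette color exactly,
    which is why producing these images by hand takes about an hour each."""
    class_map = np.full(rgb_annotation.shape[:2], -1, dtype=np.int32)
    for color, class_idx in PALETTE.items():
        mask = np.all(rgb_annotation == np.array(color), axis=-1)
        class_map[mask] = class_idx
    return class_map
```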

GTAV's Los Santos at street level.

However, a team of five researchers collaborating between Intel Labs and the Technische Universität Darmstadt in Darmstadt, Germany, is presenting a paper at next month's European Conference on Computer Vision that could dramatically simplify this process, using high-end videogames like Grand Theft Auto V and Hitman. The team has released its complete paper in advance of the conference, in which it describes using off-the-shelf capture software, RenderDoc, to break down what a videogame "sees" when it sends information to your graphics processor.

It's like this: RenderDoc (which devs sometimes use to debug their builds) captures all the rendering information processed by the graphics card, such as which objects are loaded, their colors and textures, the amount of light coming off them, shadows, and so on. The researchers sifted through this captured data to work out which pieces of it corresponded to which objects on screen -- which chunk of data draws that bit of road, and so on -- and labeled them.
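
The paper has the specifics; as a loose illustration of the principle, you could fingerprint each on-screen object by hashing the raw mesh and texture bytes a capture records. The function below is a hypothetical sketch, not RenderDoc's actual API:

```python
import hashlib

def resource_signature(mesh_bytes: bytes, texture_bytes: bytes) -> str:
    """Hypothetical sketch: derive a stable fingerprint for an on-screen
    object from the raw mesh and texture data a capture tool records.
    Identical assets produce identical bytes, so identical fingerprints."""
    digest = hashlib.sha1()
    digest.update(mesh_bytes)
    digest.update(texture_bytes)
    return digest.hexdigest()
```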

A street in GTAV's Los Santos (left) and its reconstruction through capture data (right).

Why did the researchers do this? Because that information doesn't change, no matter how many times it's loaded or unloaded from memory. Every time the game brings up a certain car model or lamppost, it loads the same file, so the same data appears in the capture software's readout -- meaning the researchers only had to identify and tag it once.
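
Continuing the hypothetical sketch above, that turns labeling into a simple lookup: a fingerprint gets labeled by hand the first time it appears, and for free every time after.

```python
known_labels: dict[str, str] = {}  # fingerprint -> class name, built up over time

def label_draw_calls(draw_calls: list[dict]) -> list[dict]:
    """Assign labels to draw calls whose asset fingerprint has been seen
    before; return only the genuinely new ones for a human to identify."""
    needs_review = []
    for call in draw_calls:
        sig = call["signature"]
        if sig in known_labels:
            call["label"] = known_labels[sig]
        else:
            needs_review.append(call)
    return needs_review
```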

"Since RenderDoc is scriptable and its source code is available as well, we modified it to automatically transform recorded data into a format that is suitable for annotation," the team explains in their paper. "By intercepting all communication with the graphics hardware, we are able to monitor the creation, modification, and deletion of resources used to specify the scene and synthesize an image" comparable to what a player would see.

With the RenderDoc software, the team captured every 40th frame from its sessions of Grand Theft Auto V, effectively reconstructing the rendered 3D environment outside the game. If the same object was loaded and visible from one frame to the next, the team didn't need to re-identify it in every successive image; it was already tagged. In practical terms, the computer handled the painstaking, pixel-perfect painting the researchers would otherwise do by hand. Instead of spending 60 minutes on a frame, they spent a handful of seconds correcting the few fiddly bits the computer had missed. As a result, the team was able to produce 25,000 color-coded images, many times the size of the CamVid dataset. It appears to be pretty accurate in practice, too!
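
Put together, the payoff looks something like the sketch below: if the capture can tell you which fingerprinted object shaded each pixel, a fully color-coded frame falls out of a per-object lookup instead of an hour of hand-painting. As above, the buffer and dictionary here are assumptions for illustration.

```python
import numpy as np

def frame_to_label_map(pixel_owner: np.ndarray, call_labels: dict) -> np.ndarray:
    """pixel_owner is an (H, W) array recording which object fingerprint
    shaded each pixel; call_labels maps fingerprints to class indices.
    Per-pixel annotation collapses into a per-object lookup."""
    label_map = np.full(pixel_owner.shape, -1, dtype=np.int32)
    for sig, class_idx in call_labels.items():
        label_map[pixel_owner == sig] = class_idx
    return label_map

# Sampling every 40th frame of a session, as the team did:
total_frames = 100_000  # length of a recorded play session (made-up number)
frames_of_interest = range(0, total_frames, 40)
```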

You can see all this happening in the video above. Could this entire process have been made simpler with Grand Theft Auto V's source code? Sure. But commercial game studios aren't usually forthcoming with that kind of data, and the researchers found that open-source games didn't have the photorealistic visuals necessary for what they wanted to accomplish.

This isn't to say Grand Theft Auto V is a perfect replacement for real-world footage, mind you. But this isn't about completely supplanting what a driverless car sees with a computer simulation; it's about speeding up the process by which a car's computer learns to identify what it sees. So it's pretty neat that high-end games like GTAV can provide the level of detail necessary to augment that process!

If you're up for some extended reading, I recommend checking out the team's complete research findings. It's dense, like most academic papers, but I actually found it far more readable than I was expecting (and I have seen enough graduate students' papers to last me a lifetime).

(h/t Gamasutra.)