Andrej Karpathy, Tesla’s Senior Director of AI, presented Tesla’s recent work at CVPR 2021. CVPR is one of the foremost conferences for academic research into computer vision. Karpathy always does a great job explaining cutting-edge work in an intelligible format (he is an AI researcher with over 350,000 Twitter followers!).
Karpathy’s presentation is about 40 minutes, but it comes at the end of an 8.5 hour session recording. Hence the timestamps start at 7:51:26.
[7:52:46] As a way of emphasizing the importance of automated driving, Karpathy describes human drivers as “meat computers.” I saw some people take offense to this on the Twitter. I think the shortcomings of human drivers are widely acknowledged and this statements wasn’t necessary, but neither was I extremely offended. Human drivers kill a lot of people.
[7:55:14] Karpathy describes Autopilot’s Pedal Misapplication Mitigation (PMM) feature. I’d not heard of this, but I like it. Malcolm Gladwell released a podcast a few years ago hypothesizing that the Toyota recalls of the aughts and early 2010s were largely due to confused drivers flooring the accelerator pedal when they meant to (and thought they were) flooring the brake pedal. Although Consumer Reports disagrees.
[7:57:40] Karpathy notes that Waymo’s approach to self-driving relies on HD maps and lidar, whereas Tesla’s approach relies only on cameras. He claims this makes Tesla’s approach much more scalable, because of the effort required in building and maintaining the HD map. I’m not sure I agree with him about this – a lot of effort goes into automating the mapping process to make it scalable. And even if mapping does prove to be unscalable, lidar has a lot of uses besides localizing to an HD map.
[8:01:20] One reason that Tesla has removed radar from its sensor suite, according to Karpathy, is to liberate engineers to focus on vision. “We prefer to focus all of our infrastructure on this [cameras] and we’re not wasting people working on the radar stack and the sensor fusion stack.” I had not consider the organizational impact of removing the radar sensor.
[8:02:30] Radar signals are really accurate most of the time, but occasionally the radar signal goes haywire, because the radar wave bounces off a bridge or some other irrelevant object. Sorting the signal from the noise is a challenge.
[8:03:25] A good neural network training pipeline has data that is large, clean, and diverse. With that, “Success is guaranteed.”
[8:04:35] Karpathy explains that Tesla generates such a large dataset by using automated techniques that wouldn’t work for a realtime self-driving system. Because the system is labeling data, rather than processing the data in order to drive, the system can run much slower and use extra sensors, in order to get the labeling correct. Humans even help clean the data.
[8:07:10] Karpathy shares a sample of the 221 “triggers” Tesla uses to source interesting data scenarios from the customer fleet. “radar vision mismatch”, “bounding box jitter”, “detection flicker”, “driver enters/exits tunnel”, “objects on the roof (e.g. canoes)”, “brake lights are detected as on but acceleration is positive”, etc.
[8:08:40] Karpathy outlines the process of training a network, deploying it to customers in “shadow mode”, measuring how accurately the model predicts depth, identifying failure cases, and re-training. He says they’ve done 7 rounds of shadow mode. I’m a little surprised the process is that discrete. I would’ve guessed Tesla had a nearly continuous cycle of re-training and re-deploying models.
[8:10:00] Karpathy shows a very high-level schematic of the neural network architecture. There’s a RESnet-style “backbone” that identifies features and then fuses data across all the sensors on the vehicle and then across time. Then the network branches into heads, then “trunks”, then “terminals.” The combined network shares features but also allows engineers interested in specific features (e.g. velocity for vehicles in front of the car) to tune their branches in isolation.
[8:11:30] “You have a team of, I would say, 20 people who are tuning networks full-time, but they’re all cooperating. So, what is the architecture by which you do is an interesting question and I would say continues to be a challenge over time.” In a few different cases now, Karpathy has discussed organizational dynamics within the engineering team and a significant factor in development.
[8:11:50] Karpathy flashes an image and specs of Tesla’s new massive computer. That Karpathy knows enough about computer architecture to even describe what’s going on here is impressive. He also plugs recruiting for their super-computing team.
[8:14:20] In the vein of integration, Karpathy shares that the team gets to design everything from the super-computer, to the in-vehicle FSD chip, to the neural networks. Vertical integration!
[8:16:00] Karpathy shows an example of radar tracking a vehicle and reporting a lot of noise. He explains that maybe they could work on the radar to fix this, but kind of shrugs and says it’s not worth it, since radar isn’t that useful anyway.
[8:19:40] Karpathy references both the validation and simulation processes, but at such a high level I can’t really tell what they’re doing. He mentions unit tests, simulations, track tests, QA drives, and shadow modes.
[8:20:20] Tesla reports FSD has run about 1.7M Autopilot miles with no crashes. Karpathy warns that crashes are inevitable, at Tesla’s scale. He reports that the legacy stack has a crash “every 5M miles or so.” For context, in the US, human drivers experience fatal crashes about every 65M miles. (Do note the distinction between “fatal crashes”, which is the available data for human drivers, and “all crashes” which is the reference Karpathy provides. We would expect “all crashes” to occur much more frequently than “fatal crashes.”)
[8:22:40] Karpathy speculates that training for vision alone basically requires a fleet (and a super-computer), in order to gather sufficient data. He seems like such a nice guy that I wouldn’t even consider this a dig at lidar-reliant autonomous vehicle companies, but rather I chalk this up to a defense to all the criticism that Tesla’s vision-only approach has received.