A while back I went to a data science meetup, and the presentation was on how an analyst solved his problem with a PID controller. As an engineer I’m well familiar with PID controllers. I spent a lot of time studying them in school. The problem in this scenario was optimizing the amount and cost of ad impressions in a given day. My first thought when I saw it was that this isn’t data science. You’d think that, with the amount of data available, a regression or decision tree might work better. I’ve thought about this quite a bit since then, and my view has changed. In fact, I now think his approach is brilliant in its simplicity.
For those not familiar, a Proportional Integral Derivative (PID) controller is a control-feedback loop that automates the control of systems. That’s a fancy way of saying that it controls an input and monitors the output. If there’s a difference between the output and the desired output, the input is adjusted to account for this error. These aren’t new. The first designs, completed in the 1920s, were for steering a ship, automating the task of maintaining a heading when there are wind or currents to account for. Today they’re commonplace. In fact, you’ve used them.
Your home’s thermostat works on the same feedback principle (simpler models just switch on and off, while more sophisticated ones run a full PID loop). The input is whether or not to turn on your air conditioner or heater, the desired output is the temperature you want, and it measures the inside temperature. Most cars these days have cruise control, which is another example of a PID controller that we use on a daily basis. An aircraft autopilot is a more complex example, as it maintains heading, altitude and airspeed. Your computer’s hard drive uses a PID controller to position its head on the platter. And they’re used extensively in manufacturing and industrial applications.
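To make the feedback loop concrete, here is a minimal sketch of the cruise control case in Python. It is an illustration under stated assumptions, not how any real car does it: the `PID` class, the `cruise` function, the gains, and the toy physics (a drag coefficient and a constant uphill load) are all made up for the example.

```python
class PID:
    """Minimal discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd  # proportional, integral, derivative gains
        self.setpoint = setpoint                # desired output
        self.integral = 0.0                     # running sum of error * dt
        self.prev_error = None                  # no previous error before the first update

    def update(self, measurement, dt):
        """Return the control signal for one time step."""
        error = self.setpoint - measurement
        self.integral += error * dt
        # Skip the derivative term on the first call, when there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cruise(target=65.0, steps=600, dt=0.1):
    """Drive a made-up car model toward `target` and return the final speed."""
    pid = PID(kp=2.0, ki=0.5, kd=0.1, setpoint=target)  # hand-picked gains (assumption)
    speed = 0.0
    drag, hill = 0.1, 1.0  # hypothetical drag coefficient and constant uphill load
    for _ in range(steps):
        throttle = pid.update(speed, dt)
        # Toy physics: throttle speeds the car up, drag and the hill slow it down.
        speed += (throttle - drag * speed - hill) * dt
    return speed


if __name__ == "__main__":
    print(f"final speed: {cruise():.2f}")
```

The controller never sees the hill or the drag. It only sees the gap between the speed it wants and the speed it measures, and the integral term keeps pushing until that gap closes.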
Some of the more cool or exotic uses of PID controllers are in a car’s suspension system, to make the ride ultra smooth. NASA uses them on their launch platforms to aid in keeping rockets balanced upright. Recently SpaceX and Blue Origin have autonomously landed rockets through the use of advanced control systems. And Boston Dynamics has created a number of awesome walking robots that maintain balance across slippery or rugged terrain. Some of these applications are quite a bit more advanced than the simple PID controller, but they definitely contain PID controllers.
By now hopefully you can see why I’d question whether PID controllers would be considered a data science tool. Certainly they’re useful, and if you find an application where they’ll work then they’re a great choice because they’re well documented, relatively easy to implement, and once they’re tuned they require little maintenance. These are all great attributes, and yet I wouldn’t expect them to be covered in a statistics course. Or a computer science course. The math behind PID controllers is calculus. In my own education, I studied a lot of calculus and these were only covered in my engineering coursework.
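For reference, the calculus in question is the standard textbook form of the controller. The symbols below are the conventional ones and are my own labels for this post: $r(t)$ is the desired output, $y(t)$ is the measured output, $u(t)$ is the input being adjusted, and $K_p$, $K_i$, $K_d$ are the tuning gains.

$$
e(t) = r(t) - y(t), \qquad
u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{d e(t)}{dt}
$$

The three terms react to the current error, the accumulated error, and the rate at which the error is changing.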
In fact, if we look at common data science tools, PID controllers are absent. One of the great reasons for choosing R over another language is that there are packages for everything. A Johns Hopkins professor even created a popular video proclaiming There’s an R Package for That, playing on the popular cell phone map ads from a few years ago. And yet, if you do a search there is no PID controller package for R. There is a blog post showing how to create one, but no package. And what about Python’s data science library scikit-learn? Nope. Data science upstart Julia? No. Not even Apache’s Hadoop or Spark frameworks have pre-built algorithms for a PID controller.
These data science packages and libraries are all designed to make sense of thousands (or more) rows of hundreds of variables. A PID controller reads one thing and controls another. PID controllers don’t work with big data. And yet, they’re incredibly useful in applications where we work with data.
So now let’s pay closer attention to the data science side. A popular post over the past few years has been Drew Conway’s data science Venn diagram. One of the three primary components is hacking skills. Conway partly explains this as discovery and building knowledge via hypotheses and experimentation. That means figuring out how to get things done. You aren’t limited to what libraries have to offer. If you have domain experience and can simplify your problem to the point that a PID controller accomplishes what you need, that’s valuable no matter what you call it. I’d say that fits within data science.
It has also occurred to me that there are a few other data science applications not covered universally by data science tools. Collaborative filtering and A/B testing are both definitely in the realm of data science, and yet if you choose one of these methods for your project you’ll likely be coding up your own solution. Natural language processing is another example of a field within data science that calls on skills outside of a statistical model. There are definitely packages and libraries for NLP in the major languages, but this field has its own methods outside of categorization, classification or regression. And maybe that’s my point. This is data science. You do what you have to do.
Notice that I’m careful to call this data science. Another closely related moniker is machine learning. A PID controller is definitely not machine learning, nor knowledge discovery nor operations research nor data mining. This is not a method for classification or clustering. There is no model being created. This isn’t Frequentist or Bayesian. PID controllers can definitely be useful for solving data science problems, but they cannot predict anything.
What are some potential use cases where a data scientist might choose a PID controller? As a generic answer I’d say anywhere that you believe there’s a direct relationship between a variable that you control and an output that you monitor. The data stream should be continuous, or at least assumed continuous. The example that led to this thought is perfect: controlling the number of ad impressions by adjusting the bid price in an auction. In this example, your ads are up and running 24 hours a day, 7 days a week, and you know almost instantly how many impressions you earned.
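Here is a rough sketch of what the ad example might look like in Python. This is not the presenter’s implementation, which I don’t have. The `impressions_won` market curve, the target of 400 impressions per hour, the gains, and the bid floor are all hypothetical, and a real auction would be noisy where this toy model is not.

```python
class PID:
    """Minimal discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        # No derivative term on the first call, when there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def impressions_won(bid):
    """Hypothetical market: higher bids win more impressions, with diminishing returns."""
    return 1000.0 * bid / (bid + 2.0)


def pace_bids(target=400.0, hours=100, start_bid=0.50):
    """Adjust the bid once per hour; return the final bid and impressions won."""
    # Small hand-picked gains (assumption); the derivative term is left off.
    pid = PID(kp=0.0005, ki=0.001, kd=0.0, setpoint=target)
    bid = start_bid
    won = impressions_won(bid)
    for _ in range(hours):
        # The controller output is used directly as the next bid, with a small floor.
        bid = max(0.01, pid.update(won))
        won = impressions_won(bid)
    return bid, won


if __name__ == "__main__":
    bid, won = pace_bids()
    print(f"bid settles near {bid:.2f}, winning about {won:.0f} impressions per hour")
```

Notice that the controller knows nothing about the shape of the market curve. It only needs the relationship between bid and impressions to be direct and the feedback to arrive quickly, which is exactly the condition described above.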
I actually think PID controllers could work well for a lot of problems in the online realm. Buying online ads, budget pacing and load balancing come to mind. There are probably areas within SEO where they would help.
This would not work as well for optimizing football game attendance by adjusting ticket prices. In this scenario the data is not continuous, since there are days or months between events. There are also too many other variables at play. If temperatures are below freezing then ticket price probably isn’t a factor in game attendance, because people won’t be as willing to sit in the cold. If the game is post-season then temperature probably isn’t a factor, because people will attend regardless to see their team in the playoffs. These are guesses, but hopefully you see my point.
My own conclusion in this is that, yes, PID controllers are fair game within the realm of data science. The realm of controllers is stable and mature, and is still an area of active research. If you find a scenario where a PID controller gives you the results you’re looking for, then it’s absolutely a good choice.