Are There PID Controllers in Data Science?

A while back I went to a data science meetup, and the presentation was on how an analyst solved his problem with a PID controller. As an engineer I’m well familiar with PID controllers. I spent a lot of time studying them in school. The problem in this scenario was optimizing the amount and cost of ad impressions in a given day. My first thought when I saw it was that this isn’t data science. You’d think that with the amount of data available, a regression or decision tree might work better. I’ve thought about this quite a bit since then, and my view has changed. In fact, I now think his approach is brilliant in its simplicity.

PID Controller

By TravTigerEE of Wikipedia

For those not familiar, a Proportional Integral Derivative (PID) controller is a feedback control loop that automates the control of a system. That’s a fancy way of saying that it adjusts an input while monitoring the output. If there’s a difference between the measured output and the desired output, the input is adjusted to correct this error. The first designs were for steering a ship, automating the task of maintaining a heading when there are wind or currents to account for. These aren’t new: the first designs were completed in the 1920s, and today they’re commonplace. In fact, you’ve used them.
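To make the feedback loop concrete, here is a minimal sketch of a textbook discrete PID controller in Python, applied to a toy version of the ship-steering problem. Everything specific in it is an assumption for illustration: the gains, the constant "current" drift, and the one-line heading model are made up, not taken from any real autopilot.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative gains only)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured                  # desired output minus actual output
        self.integral += error * dt                  # accumulated error (I term)
        derivative = (error - self.prev_error) / dt  # rate of change of error (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_heading(steps: int = 2000, dt: float = 0.05) -> float:
    """Hold a heading of 90 degrees against a constant, hypothetical current."""
    pid = PID(kp=2.0, ki=0.5, kd=0.1)  # made-up gains for this toy model
    heading = 70.0                     # starting heading in degrees
    drift = -3.0                       # assumed current, degrees per second
    for _ in range(steps):
        rudder = pid.update(90.0, heading, dt)
        heading += (rudder + drift) * dt  # toy plant: turn rate = rudder + drift
    return heading


print(round(simulate_heading(), 2))
```

The integral term is what cancels the steady push of the current. With proportional control alone, the heading in this toy model would settle a little short of the target.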

Your home’s thermostat is a feedback controller in the same family, though many basic models use simple on/off switching rather than a full PID loop. The input is whether or not to turn on your air conditioner or heater, the desired output is the temperature you want, and it measures the inside temperature. Most cars these days have cruise control, which is another example of a PID controller that we use on a daily basis. An aircraft autopilot is a more complex example, as it maintains heading, altitude and airspeed. Your computer’s hard drive uses a PID controller to position its head on the platter. And they’re used extensively in manufacturing and industrial applications.

Some of the cooler or more exotic uses of PID controllers are in a car’s suspension system, to make the ride ultra smooth. NASA uses them on their launch platforms to aid in keeping rockets balanced upright. Recently SpaceX and Blue Origin have autonomously landed rockets through the use of advanced control systems. And Boston Dynamics has created a number of awesome walking robots that maintain balance across slippery or rugged terrain. Some of these applications are quite a bit more advanced than the simple PID controller, but they definitely contain PID controllers.

By now hopefully you can see why I’d question whether PID controllers would be considered a data science tool. Certainly they’re useful, and if you find an application where they’ll work then it’s a great choice because they’re well documented, relatively easy to implement, and once they’re tuned they require little maintenance. These are all great attributes, and yet I wouldn’t expect them to be covered in a statistics course. Or a computer science course. The math behind PID controllers is calculus. In my own education, I studied a lot of calculus and these were only covered in my engineering coursework.

In fact, if we look at common data science tools, PID controllers are absent. One of the great reasons for choosing R over another language is that there are packages for everything. A Johns Hopkins professor even created a popular video proclaiming There’s an R Package for That, playing on the popular cell phone map ads from a few years ago. And yet, if you do a search there is no PID controller package for R. There is a blog post showing how to create one, but no package. And what about Python’s data science library scikit-learn? Nope. Data science upstart Julia? No. Not even Apache’s Hadoop or Spark frameworks have pre-built algorithms for a PID controller.

These data science packages and libraries are all designed to make sense of thousands (or more) rows of hundreds of variables. A PID controller reads one thing and controls another. PID controllers don’t work with big data. And yet, they’re incredibly useful in applications where we work with data.

So now let’s pay closer attention to the data science side. A popular post over the past few years has been Drew Conway’s data science Venn diagram. One of the three primary components is hacking skills. Conway partly explains this as discovery and building knowledge via hypotheses and experimentation. That means figuring out how to get things done. You aren’t limited to what libraries have to offer. If you have domain experience and can simplify your problem to the point that a PID controller accomplishes what you need, that’s valuable no matter what you call it. I’d say that fits within data science.

It has also occurred to me that there are a few other data science applications not covered universally by data science tools. Collaborative filtering and A/B testing are both definitely in the realm of data science, and yet if you choose one of these methods for your project you’ll likely be coding up your own solution. Natural language processing is another example of a field within data science that calls on skills outside of a statistical model. There are definitely packages and libraries for NLP in the major languages, but this field has its own methods outside of categorization, classification or regression. And maybe that’s my point. This is data science. You do what you have to do.

Notice that I’m careful to call this data science. Another closely related moniker is machine learning. A PID controller is definitely not machine learning, nor knowledge discovery nor operations research nor data mining. This is not a method for classification or clustering. There is no model being created. This isn’t Frequentist or Bayesian. PID controllers can definitely be useful for solving data science problems, but they cannot predict anything.

What are some potential use cases where a data scientist might choose a PID controller? As a generic answer I’d say anywhere that you believe there’s a direct relationship between a variable that you control and an output that you monitor. The data stream should be continuous, or at least assumed continuous. The example that led to this thought is perfect: controlling the number of ad impressions by adjusting the bid price in an auction. In this example, your ads are up and running 24 hours a day, 7 days a week, and you know almost instantly how many impressions you earned.
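As a rough sketch of the ad example, the loop below nudges a bid price until the impressions won per interval reach a target. This is not the analyst’s actual method. The win_curve function is a hypothetical stand-in for the auction, and the gains are made up and hand-picked for that toy curve. The derivative gain is set to zero, which is a common simplification when measurements are noisy.

```python
def win_curve(bid: float) -> float:
    """Hypothetical market: impressions won per interval at a given bid."""
    return 1000.0 * bid / (bid + 2.0)


def pace_impressions(target: float, intervals: int = 200) -> tuple[float, float]:
    """Adjust the bid each interval based on the previous interval's impressions."""
    kp, ki, kd = 0.0005, 0.002, 0.0  # made-up gains for this toy curve only
    integral = 0.0
    prev_error = 0.0
    impressions = 0.0                # nothing served before the first interval
    bid = 0.0
    for _ in range(intervals):
        error = target - impressions     # how far the last interval missed the target
        integral += error
        derivative = error - prev_error
        prev_error = error
        # The controller output is the new bid; a bid cannot be negative.
        bid = max(0.0, kp * error + ki * integral + kd * derivative)
        impressions = win_curve(bid)     # the "market" responds to the new bid
    return bid, impressions


bid, impressions = pace_impressions(target=400.0)
print(f"bid={bid:.2f}, impressions={impressions:.0f}")
```

Notice that the controller never needs to know the shape of win_curve. It only needs the output to respond in a consistent direction when the bid moves, which is the "direct relationship" described above.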

I actually think PID controllers could work well for a lot of problems within the online realm. Buying online ads, budget pacing and load balancing come to mind. There are probably areas within SEO where they would help.

This would not work as well for optimizing football game attendance by adjusting ticket prices. In this scenario the data is not continuous; there are days or months between events. There are also too many other variables at play. If temperatures are below freezing then ticket price probably isn’t a factor in game attendance, because people won’t be as willing to sit in the cold. If the game is post-season then temperature probably isn’t a factor, because people will attend regardless to see their team in the playoffs. These are guesses, but hopefully you see my point.

My own conclusion is that, yes, PID controllers are fair game within the realm of data science. The realm of controllers is stable, mature, and is still an area of active research. If you find a scenario where a PID controller gives you the results you’re looking for, then it’s absolutely a good choice.

Aim Bigger

Over the past few years I’ve done quite a bit of research on A/B testing. Initially I wanted to understand how to apply these tests within my own work. I quickly became interested in designing tests and interpreting results in various scenarios. Then I wanted to understand the math behind them. Then I really geeked out on them and started to reverse engineer some of the online calculators that are available.

The math of these tests is fascinating to me because there’s a bit of complexity involved. How do you decide when to end a test? How much confidence can you have that the result is correct? However complex it gets, though, there is ultimately an intuition behind it.

The correct way to do an A/B test (using frequentist inference methods; Bayesian is another topic) is to start with a hypothesis, then calculate the required sample size. You start with a measure of your current metric, design a new version and form a hypothesis about how much your newer version will improve. From there you can determine how many times your split test needs to be repeated to reach a given significance level.

For example, suppose I have a web page with a 5% success rate (clicks to some other page). I want to improve that so I try a different message. If my hypothesis is that I’ll see a 5% increase in performance, that would improve my success rate to 5.25% and require roughly 240,000 visitors to determine whether or not I’ve achieved my goal with 95% confidence. That’s going to take a long time to complete, for any website. Long enough that it wouldn’t make sense to spend the time on it for so little gain. Plus, in this scenario I know that I’ll be showing the inferior version to at least 120k visitors.

If I raise my hypothesis to a 50% improvement, that puts my target success rate at 7.5%. I only need about 3,000 visitors to reach 95% confidence in this second test. That’s a more manageable number. Even on a site with meager traffic I can complete this test in a reasonable amount of time.
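For anyone who wants to check these numbers, here is a minimal sketch of the standard two-proportion sample size approximation, using only Python’s standard library. The text specifies 95% confidence but not statistical power, so the 80% power used here is my assumption, as are the function and parameter names.

```python
from math import ceil
from statistics import NormalDist


def visitors_needed(base_rate: float, relative_lift: float,
                    confidence: float = 0.95, power: float = 0.80) -> int:
    """Approximate total visitors for a two-variant test (normal approximation).

    power=0.80 is an assumed default; the original example does not state it.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)                       # hypothesized new rate
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_variant = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return 2 * ceil(per_variant)                               # both variants combined


for lift in (0.05, 0.50):
    print(f"{lift:.0%} lift: {visitors_needed(0.05, lift):,} visitors")
```

Under these assumptions the results come out close to the 240,000 and 3,000 quoted above. The key intuition is in the denominator: the required sample grows with the inverse square of the difference you are trying to detect.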

This doesn’t mean that I should just start hypothesizing that all of my tests will achieve a 50% improvement. In order to generate those types of results I need to start testing things that are very different from each other. In the online world, changing the color of a button isn’t going to make a 50% difference. You need to get creative and try different layouts, offers and imagery. Test small cartoons against large high resolution images. Test selling directly online versus having people go through a sales team. Test offering great sales support versus having a thorough knowledge base.

I’m also not suggesting that you shouldn’t test button colors. I do recommend testing everything that influences the behaviors you want to encourage. I just wouldn’t start with this test. Since it targets a small difference, I’d hold it until you’ve tested other more significant factors. I also wouldn’t settle for a 5% increase.

Notice that in the above example I can complete 80 versions of the second test within the same time frame as the first. If I set my goal to a 100% improvement, I can reach significance for more than 260 individual tests in the same time frame as my original test aiming for a 5% improvement. Now we’re getting somewhere. You absolutely should test 260 different ways to improve the results you’re seeing.
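The comparison above is simple division, and it can be sketched the same way. As before, this uses the normal approximation with an assumed 80% power, which the text does not specify, so treat the exact counts as approximate.

```python
from math import ceil
from statistics import NormalDist


def total_visitors(base_rate: float, lift: float) -> int:
    """Approximate visitors for a two-variant test (95% confidence, assumed 80% power)."""
    p1, p2 = base_rate, base_rate * (1 + lift)
    z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)
    per_variant = z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return 2 * ceil(per_variant)


budget = total_visitors(0.05, 0.05)  # traffic the 5% test would consume
for lift in (0.50, 1.00):
    tests = budget // total_visitors(0.05, lift)
    print(f"{lift:.0%} lift: about {tests} tests in the same traffic")
```

With these assumptions the counts land near the 80 and 260-plus figures in the text.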

Practically speaking, what does it mean if I aim for a 50% increase in performance and only achieve a 5% bump? Not much, really. Use your own judgment on whether or not to keep the new version. This isn’t the big difference that you’re looking for so make a choice (via whatever method) and move on to your next test.

As you iterate through test after test to improve results, an interesting thing will happen. You’ll end up with a product, or website, or marketing campaign that you didn’t envision at the start. This is because you haven’t been the designer; your customers have. You move forward in stages without a defined path. And you’re still making progress.

These principles aren’t new and are certainly being applied in marketing and business. I’d argue that they also apply outside of work, in less tangible areas of our lives. If you want to start saving more money, test not eating out versus giving up cable TV. See which one works the best for you. As you experiment within your personal life, you don’t need an Excel spreadsheet, but I do recommend quantifying the results somehow.

A/B testing is one of the coolest aspects of online marketing, in my estimation. You start with a few ideas and work your way forward to success through iterations, improving and gaining confidence as you go. It’s amazing to see your work morph into something that people engage with. I say test all your big ideas, and look for big successes.