Making Big Data Small Data. In an interview with Jonathan Lerner of Wharton’s Data & Analytics Club, WCAI’s Eric Bradlow discusses data compression, how it makes for a better (and more creative) approach to analysis, and how it may be the future given privacy concerns.

JL: The amount of data available and the ability to house data has been exploding. I think McKinsey did a study that said that Data Science was going to be the sexiest job of the next decade. So, for current students, how do you see things evolving over the next few years?

EB: It’s funny, I gave a plenary speech at an industry conference for about 500 people at Erasmus University in Rotterdam. And that was the topic, Big Data. And I started out my speech in a way I don’t think they expected—and there were a lot of big companies in the room, Heineken and other big local brands—I said, the first thing when somebody sends me a big data set, the first thing I say, and, excuse my French, is “oh shit.” What the hell am I gonna do now? I have this massive data set, I can’t run heavy-duty mathematical models on it because the data set is too large, or it’s even hard to read in, or it’s hard to clean. So now I need to think about making big data small data, but I don’t want to get rid of too much information. So what I gave, in this 60-minute talk, is: how do you make big data small data and not lose too much information?

So, in the statistics world—and I’m trained as a statistician—we talk about data sufficiency. Things like sufficient statistics. So can you take a large number of columns and collapse them down into a small number of columns? That’s kind of what I call horizontal compression. Can we take a large data set and do a sample—maybe not a random sample, but a sampling done in a smart way? I call that vertical compression. So what most firms do is, they think—erroneously—that they need to analyze data sets of 100 million rows by 1,000 columns. You don’t; statisticians have been dealing with problems of sampling and data compression for a long time. So I think what you’re going to see over the next three to five years in technology-enabled data is great. I care about new data sources, not bigger data, because at the end of the day I’m going to take that big data and compress it anyway. But what are smart ways to compress data? And to get the most information out without having to deal with 1,000 columns or 100 million rows? That’s going to be what I’m working on over the next three to five years.
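The two ideas can be sketched in a few lines of Python. This is an illustrative example, not from the interview: the transaction data, the normal model, and the activity-based strata are all assumptions made for the sketch. Horizontal compression replaces each customer's raw transaction rows with sufficient statistics (for a normal model, the count, sum, and sum of squares recover the mean and variance exactly); vertical compression keeps a stratified sample of customers rather than every row.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical raw data: (customer_id, transaction_amount) rows,
# with a variable number of transactions per customer.
random.seed(0)
data = [(cust, random.gauss(50, 10))
        for cust in range(1000)
        for _ in range(random.randint(1, 20))]

# Horizontal compression: collapse each customer's rows into the
# sufficient statistics for a normal model: (n, sum, sum of squares).
suffstats = defaultdict(lambda: [0, 0.0, 0.0])
for cust, amount in data:
    s = suffstats[cust]
    s[0] += 1
    s[1] += amount
    s[2] += amount * amount

# The per-customer mean is recoverable from the compressed table alone,
# without ever touching the raw rows again.
cust0_raw = [a for c, a in data if c == 0]
n, total, sq = suffstats[0]
mean_from_compressed = total / n

# Vertical compression: instead of keeping all customers, draw a
# stratified sample by activity level (strata chosen for illustration).
heavy = [c for c, s in suffstats.items() if s[0] >= 10]
light = [c for c, s in suffstats.items() if s[0] < 10]
sample = (random.sample(heavy, min(50, len(heavy))) +
          random.sample(light, min(50, len(light))))
```

The design point is the one made above: the compressed table of sufficient statistics is far smaller than the raw data, yet for the assumed model it loses no information about the quantities you would estimate.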

JL: Interesting. So that’s the cutting edge of what you think is the academic side of things?

EB: I think the cutting edge will be that. I think privacy is going to force us to do that, because, even though the data is potentially collectible, you’re not getting your hands on it. So you have to be able to do what I call customer analytics inference with limited information. I don’t mean because you can’t collect it; I mean because you won’t get that data. In fact, we may be seeing the tipping point now where the data that is available to us is going to become less, not more. And that’s fine—as a matter of fact it’s fine for me—because, number one, I think privacy concerns are legitimate. And secondly, it keeps me in my job! You know, if you had any data you wanted, you wouldn’t need a statistician; you could just get someone to run some regressions and you’d know the answer. It’s when the data gets limited, or you have to compress it, or you have to sample it in a clever way, that you get the medium-size-dollar people like me!

Read the full interview at the Wharton Journal site.