Behind the Data: Max Shron of OkCupid
OkCupid is a popular, free online dating site that "uses math to get you dates." And if you are as much of a geek about numbers as we are, you are probably also obsessed with the fascinating and witty research on OkCupid's data blog, OkTrends. In celebration of Valentine's Day, we sit down with Max Shron, a consultant and data scientist. As OkCupid's data scientist, or as we like to think of him, one of the cupids behind OkCupid, he provided big-data assistance for OkTrends. Now he helps companies figure out how to best understand and make use of their data.
Visualizing: At OkTrends, you combed through huge amounts of data to come up with findings that are remarkably simple (but surprising). How do you decide what to look for? What tools and techniques do you use to explore the data?
Max Shron: The trick, insofar as I've observed one, is to pick questions first, before you let the data be your guide. On OkTrends, I had the privilege of working with Christian Rudder, who started and remains the driving force behind the blog. He's got excellent ideas for places to look, things he'd like quantified, quantities he wants computed. Most importantly, he asks interesting questions. It really helps that he's coming at it from a perspective of, "how would I quantify this thing I'm interested in?" instead of, "how do I summarize the data I've got?"
For example, last year Christian floated the idea of figuring out what words or phrases best summarized the profile text of different groups on the site. He came to me and asked how I would calculate this, figuring out which words are representative. We both brainstormed and came back with something simple: for every phrase on the site, divide the frequency of that phrase within a group's profiles by the frequency of that phrase on the site as a whole, and importantly have a minimum threshold for frequency in the group. That threshold is a slider that controls how spurious or iron-clad the results are. It has some flaws, but it's not terrible, and what's nice is that it's simple enough to reason about how it should behave.
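The ratio-plus-threshold idea above is simple enough to sketch in a few lines. This is a hypothetical illustration, not OkCupid's actual code; the function name, the example counts, and the default threshold are all made up.

```python
from collections import Counter

def representative_phrases(group_counts, site_counts, min_group_count=50):
    """Rank phrases by how over-represented they are in one group's profiles.

    group_counts and site_counts map each phrase to how often it occurs.
    min_group_count is the "slider": raising it trades coverage for
    iron-clad results by dropping phrases too rare in the group to trust.
    """
    group_total = sum(group_counts.values())
    site_total = sum(site_counts.values())
    scores = {}
    for phrase, count in group_counts.items():
        if count < min_group_count:
            continue  # below the threshold: too spurious to report
        group_freq = count / group_total
        # Group profiles are part of the site, so a phrase seen in the
        # group always has a nonzero site-wide count.
        site_freq = site_counts[phrase] / site_total
        scores[phrase] = group_freq / site_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With toy counts, a phrase common in the group but rare site-wide (say "banjo") outranks one that is common everywhere, which is exactly the behavior you'd want to reason about before scaling up.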
At that point it was a matter of writing code to transform the profile text into stemmed phrases, to count the results, and to calculate the relevant values. There are lots of ways to do it, and I went through a handful of iterations of an architecture. It's not hard, but it's not trivial either. OkCupid has enough profiles that you can't fit them all in memory.
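The first step he mentions, turning raw profile text into phrases, might look something like this minimal sketch. The function name is hypothetical, and real stemming (e.g. a Porter stemmer) is deliberately elided; this version just lowercases and tokenizes before pairing words into two-word phrases.

```python
import re

def profile_bigrams(text):
    """Turn raw profile text into two-word phrases.

    A production pipeline would stem each word so "hike"/"hiking"/"hikes"
    collapse together; that step is skipped in this sketch.
    """
    words = re.findall(r"[a-z']+", text.lower())
    # Pair each word with its successor to form overlapping bigrams.
    return [f"{a} {b}" for a, b in zip(words, words[1:])]
```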
You want it to be flexible enough to work for new groupings of people-- we reused the same technique for race, sexual orientations, and the occasional request from journalists-- and fast enough that it feels like a tool you can quickly throw at a problem. A hash table in memory, with a key for every phrase, is probably a bad idea; with two-word phrases, it grows on the order of the number of words in all profiles squared. It's obviously less than that in practice, but it's still big.
If OkCupid was a larger organization, or the data science team was bigger, maybe we would have solved this with Hadoop. I ended up using a lightweight command line map-reduce pattern. Unix is a lovely thing, and a little awk, sort and Python can go a long way. I find one powerful machine to be easier to maintain than a cluster anyway.
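The command-line map-reduce pattern he describes can be sketched as a pipeline. Here a `printf` of toy phrases stands in for a script streaming one phrase per line out of the profiles (the "map" step); `sort` and `uniq -c` do the grouping and counting (the "reduce" step). Because `sort` spills to disk, nothing like a giant in-memory hash table is needed.

```shell
# Map: emit one phrase per line (stand-in data here; in practice a
# Python or awk script streaming over profile text).
# Reduce: sort brings duplicate phrases together, uniq -c counts each
# run, and a final numeric sort ranks phrases by frequency.
printf 'hiking trail\ncoffee shop\nhiking trail\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The same shape works at any scale: swap the `printf` for a streaming extractor and the pipeline never holds more than `sort`'s buffer in memory.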
Once the data is collapsed back down to something human-sized, you should poke at it in a spreadsheet or a text editor. You have to expect to generate five or ten results for every one that is worth pursuing, so developer speed is absolutely crucial.
V: What is a unique challenge to working with data from a dating site?
MS: Obviously user privacy is of paramount importance, but that's true of any data project where you have person-level information. I'm not sure there are unique challenges so much as unique advantages-- as a decently-functioning human being, you develop intuitions about how people will react socially, and so it's easier to gut-check data from a dating site than, say, the stock market.
V: Many of your analyses rely on messages sent or received by a user to measure something about them. How is social data like this transforming what we know about people?
MS: Transforming what we know about people is a tall order, and I don't think we're quite there yet. At OkCupid the most robust source of data is the questions that people answer, and for those it's easier to find interesting relationships.
Having said that, ten or fifteen years ago you could have imagined inferring social relationships from data, as a kind of shadow of the real reality. Now so much of our social life (with friends, lovers, enemies) is conducted online anyway that you're getting a much richer sample of people's minds. It almost doesn't make sense to talk about doing things "in real life" versus online any more, since online is now part of real life.
V: Your other work deals in data that is less personal (transportation, public health); tell us about one of your personal projects.
MS: I'm very proud of the DonorsChoose.org "hacking education" contest. DonorsChoose.org matches teachers looking for funds for classroom projects with people who want to give them money. They do great stuff. Mike Dewar and I wrote a piece of software that combs through the DonorsChoose.org site looking for recently funded proposals in rural areas, and alerts local journalists via a well-formatted email. There were all kinds of fun challenges-- how do you determine the coverage area of a newspaper? How do you make it easy for your friends to collect contact info for journalists (we couldn't find a good email database!)? How do you know when you're being too spammy? It won the Python category and he and I were both thrilled.
Another project is a modeling and visualization project from a few years back, when the Chicago Transit Authority had just cut service. Luke Joyner, Juan-Pablo Velez and I took publicly available schedules from the city and analyzed waiting times before and after at every bus and train stop in Chicago. We had originally been trying to scrape Google Maps to look at transit times, but it was too unwieldy and brittle an approach. We ended up writing a piece for the Chicago News Cooperative that made it into the Chicago edition of The New York Times, about the wait-time differences in different neighborhoods as a function of race and income. It turned out that the CTA seemed, for better or worse, not to have taken any of that into consideration. Making the maps and doing the regressions for that was a lot of fun.
V: For anyone without a date tonight: what's a data set that you'd love to see visualized?
MS: Great question. How about a map or some charts exploring how disability status interacts with population density, income and income inequality? I imagine that poverty and disability status are related, but is it worse to be in a poor census tract in a rich city or a poor census tract in a poor city? Are people more likely to be unable to work due to disability in the city or the countryside? The Census Bureau has all of the data you need for this, and any sub-slice would be interesting.