[infosthetics@strataconf 2011 by guest blogger Collin Sullivan]
While walking around between sessions at the 2011 Strata Conference, I managed to talk to a few of the speakers and presenters who seemed to be making a real impact on audiences. I thought a good way to round out the coverage of the conference would be to publish their responses to a few open-ended questions regarding so-called "Big Data" and our collective future, so readers can get a sense of the varied perspectives that the conference offered. I will offer a word of warning that this post is light on beautiful images and very heavy on text, but the substance is very much worthwhile. The post will cover Orwellian government, direct democracy, heatmaps of roach infestations and much more.
I spoke with Kim Rees of Periscopic, Naomi Robbins of NBR Graphs, Mark Madsen of Third Nature, Alistair Croll of Bitcurrent and Drew Conway of New York University's Department of Political Science. All were very kind to speak with me and gave me thoughtful responses to these admittedly vague questions, and I appreciate their help and cooperation. Their words are below.
[I should note that both Mark Madsen and Alistair Croll expressed dissatisfaction with the blanket phrase "Big Data," and for similar reasons. "'Big Data's' a bad name because the bigness isn't the problem for most people," Madsen told me. Still, both saw the phrase's utility. "We need a handle to call this something," Croll said. Neither was hung up on the phrase, but their objections seemed noteworthy.]
And now, the questions...
Q1: What is the future of Big Data?
There is so much data being generated. I mean, we've seen this sort of backlog for years, of so much being generated, there's constantly things that are updating, sensors, data feeds, you know, Twitter, Facebook and whatnot. And the "making sense of it" part of it is really the part that's lagging behind. I think that there aren't really good methods for either visualizing that, or even just the algorithms to make sense of it, to really deal with that slog of information. So I think that that's really the future, the sense-making aspect of it. Finding ways to sort of glean some knowledge from it.
I'm focused on visualization, and my position is that before one can visualize big data, they have to visualize simple, small data. I see so many graphical mistakes in small data that get carried over to big data, so my specialty is in the smaller data. I'm a one-person company that doesn't have the computer resources or whatever to work with big data. There are many techniques that I'm aware of but I don't consider myself to be a specialist in big data.
I think it's more a future of semi-structured analytics or something like that, where it's not the old school "here's a database with some fixed tables of information," it's chaotic, or semi-chaotic collections of data that are schema-less, that get put out there and stored in interesting ways for reconstruction and recontextualization. So, if you have this data and this data and this data, and you're using them as a commercial organization in way A, and somebody comes along with some new piece of data, most organizations have a real ingest problem bringing in the new data, mashing it up with their other data and then producing something out of it or even just exploring it visually. This whole new startup market and all the things that are happening here give us that. They give us the ability to have data storage and retrieval where we don't have to worry about those structure problems anymore, and model them in advance. And they give us fluidity and flexibility. So that's what I see as the real future.
I think it's a combination of massive joined open data sets that are accessible by everyone, meaning ubiquitous computing. Everyone can get to a device that it's displayable on or accessible on, and those devices are two ways, so they're collectors of data but also displayers of data. And finally this idea of new interfaces. Creve Maples did a great presentation just before this about how the bottleneck is actually the link between the data and the human brain, so the more we can make immersive environments that allow us to use motion and position and vibration and sound as well as vision to understand things, the better we will be at working the data. And so to me society really changes when you have big data, new interfaces and ubiquitous computing all kind of tied together.
I think the future of big data, for me, is going to be, where you can take data sets that are large, but are also meaningful, together, and find ways to put them together to find [new] meaning. As you're going through an analytical process, oftentimes you start with a dataset that's obvious. You're interested in studying crime, so you go and try to find crime data. But you want to be able to explain something about crime, not just count it. So the challenge is then finding the next dataset that's meaningful within that context and marrying them up and then doing some analysis on that, whether it's a statistical model where you're doing prediction, or a visualization which explains how maybe the weather and rainfall affect where crime happens. Those kinds of things I see as the most-- sort of the hardest part for an analyst to get at, where there is a large potential for services to come in and do something useful.
Q2: How will data affect what it means to be a citizen, if at all?
I think data visualization really touches on the citizen aspect because it really allows that conduit--it's the conduit to take that data and give it to the user in a way that's manageable and understandable. So finding ways of disseminating that, of consolidating that, I think is huge, so people don't have to run around and find a bunch of different data sets. No citizens are going to go to that level to get at that data. So, packaging it in smaller bites so that people can make sense of it, delivering it in ways that make sense like mobile devices and things like that... Data visualization I think is huge there.
But along with that, people are already skeptical of data because of the privacy concerns. So I think that that will continue to be-- people will start to be on high alert as far as, you know, what exactly is this data? Where is it coming from? Is it about me, specifically? Does it-- Is it tracking me? That sort of thing. So I think privacy is a sort of concern of that.
Data affects every aspect of our lives. Our medical treatment will improve with more use of data. It's already happening. We know what to eat, how drugs affect us and everything else based on data. Data is important in every department of the government: education, justice, health, security. And the access to data and ability to handle it well is of utmost importance.
I think there's a lot of effects. I think you actually could just go back and replay Scott Yara's presentation from this morning and it would capture my opinions exactly, which is that, it's like that Sterling [correction: William Gibson] quote of, "The future is here, it's just not evenly distributed," because it's in all sorts of places, in not highly visible ways.
Of course there's the secret Orwellian government piece of it which we know is going on but we have no visibility into. At the same time we've got open data and transparency and, you know the Guardian guys talking about getting government data and making it available to people who otherwise wouldn't really see or have access to it, and pull things out of that that would not normally be pulled out of it, making government more transparent at the same time.
And then there are the invariable privacy concerns and other things. And the tools right now are sort of asymmetric in that government has an enormous amount of data about us, in many different venues, and they're gathering and bringing together more. And yet we, until fairly recently, have not had the same tools and cost-effective means of exploring that same data or turning those tables around onto the government. And I think you see it with the new communications, the ability to connect people, to communicate information, analyze it to tell stories about it to explain things, has some very destabilizing effects.
So I see it in all these different places and it's in small-scale. Traffic monitoring turns into capturing video of you running a red light which turns into somebody being able, like his [Scott Yara's] wife, to email you saying "Good job running that light this morning." Transparency where maybe we don't want it.
And it's hard to ignore the commercial effects of it. I've worked with data syndicators for a long time, and even as early as the mid-90s when I was working with them, they already knew things like what magazines I subscribed to, where I lived, what my income was, what kind of car I owned, how long I had owned it, what the names of my pets were, what their genders were, they knew all of this stuff back then. You can imagine in 15 years of evolution what we've got now. It's just in some ways scary because the commercial world knows it, marketing companies know it, Pepsi knows it. They buy this data, and they use it for various purposes, good or ill. And we have no control over it.
So I think we're setting ourselves up for a privacy backlash in one realm or another--data breaches, privacy identity theft and things like that. So it's sort of all over the map on this stuff. But good and bad effects, distributed effects, who's being affected? Most people don't care at the end of the day. So it's up to the watchdogs to watch out for the rest of us.
I think if you want to talk specifically about government, it changes things a lot. And one of the reasons is, we live in at best a representative democracy. If you come from a town that wants to vote for Orange, and 70% of your townspeople want to vote for Orange, and you go to Washington to represent your townspeople, and the Purple lobby comes along and says "Well we have 30% that like purple, we're kind of happy about that," and you vote for Purple, you have literally misrepresented.
So representative government is a hack. It's the idea that we can't afford to send everyone to Washington so we send representatives. Someone was telling me the other day that the number of the people in the population [relative] to representatives has more than gone up tenfold. So it's-- we're less and less representative. But the lobbyists are actually, their job is misrepresentation. It's to convince you to vote a way that you wouldn't normally vote. And so, a direct democracy, where the citizens have access to the information and can make votes--Reddit style, voting up voting down, whatever else, surfacing the topics they care about--is a big change. Direct democracy eliminates representative democracy and therefore misrepresentation. Lobbyists won't like that.
It has other consequences. For example, if I'm constantly having to ask people what they want, it's hard to make a long-term change because I'm thrashing with opinion polls. People don't like to not have their cake and eat it too. And so I think that's something that's going to take 20 years for us to digest as a society, but being a citizen will be radically different.
I also think that, you look at altruism, right, Wikipedia was created in our spare time by altruistic people and it's an incredibly useful resource. So, while in the old days you used to have national service where people signed up for the government or for the Army or whatever, civil servants. What it means to be a citizen is going to be what it means to be a good digital citizen as well, and that may mean not infecting your friends, acting appropriately, helping curate something online, contributing something like that. And so I think being a citizen of the internet rather than just a citizen of a country or a city is also going to change.
Well, I think it actually will allow for a much greater transparency into both the government, which we sort of see as being obvious, but also just day-to-day stuff. I'm from New York City and Mayor Bloomberg has had a really good policy of opening up city data, and people have done awesome things with it. A friend of mine, John Myles White, who is a grad student at Princeton, just as a joke, sort of for fun, got online and got infestation data and created a website called Roachmap.com where you can actually go and see a heatmap of where in New York City they're having the worst roach infestation.
But it's little things like that where citizens are now able to get data and throw it up online. The sort of canonical example is that we have bus data, GPS for buses and taxis, and now I can stand on the corner with my phone and say, "I know where the taxi's gonna be, I know when the bus is coming." I think the very local microlevel is where big data's gonna have the biggest impact for citizens, because it will just allow them to have much more knowledge about how their day will begin and end in terms of those public services and stuff that they're able to get.
Q3: What have you enjoyed most about the conference, or what are you most looking forward to?
I am looking forward to all of the data journalism stuff, I think that is really compelling to me right now. And then just meeting the other people who are doing great data visualization work. There are people that I've known on Twitter for a long time that I'm finally meeting, so it's nice to finally actually talk to them in more than 140 characters.
Everything to do with visualization.
The things that I really am enjoying a lot or that I've gotten a lot out of, is the diversity of the audience that's here. Coming from the data industry, it tends to be sort of insular and corporatized, and focused on a fairly narrow problem set, whereas the people here are not about getting the data in and managing the data. They're about using and consuming the information. And there's a million more ways to use and consume information than there are to produce and store it. And so, hearing about people doing things with health care reform and social things, people doing environmental monitoring, people doing basic corporate stuff, people just providing the infrastructure like the Twitters and Facebooks of the world to manage and enable simple communication between people.
There's such a diverse set of uses, it's-- you can almost equate it to the natural ecosystem where plants form the backbone of the entire system of life. But there aren't really that many species of plant relative to say, insects. There's 100 times more insects, species of insects-- well there's actually a lot more than that, thousands, than there are plants. And so it's just like data, and the ways you can use the data. There's a thousand times more ways to use it than most people even envision when they have a single data set. And I think these data markets are showing some of that, too, I find that intriguing.
Selfishly, I enjoyed seeing this many people in one place, because I was one of the people helping to organize it.
I think what I enjoyed most was that the conversation didn't devolve into any one thing. Usually when you get a bunch of data people together, they wind up talking about math, or they wind up talking about ethics, or they wind up talking about computing or whatever. This was pretty even-keeled, and we intentionally structured the content that way. But it didn't devolve into any one thing, you kind of went down a path for a session or two and then came back up, and OK let's consider this other thing. So I think it was, what I enjoyed the most is it was really well-rounded.
There were a couple of sessions that really stood out. Loved the presentation by LinkedIn this morning, and actually by Scott Yara of Greenplum, really good. Some of the other presenters and keynotes were great. I loved Creve Maples' thing about immersion and visualization, that was fantastic. I was torn that I couldn't be in more than one place. The guys doing World of Warcraft analysis next door, while we're talking about predicting the future, I mean, this was a science fiction fan's wet dream in one building, and I'm just ecstatic that it happened.
Honestly, I think this conference for me has been the digital-to-analog converter. A lot of people that I only have ever interacted with online or on Twitter, through email, are all here, so it's like a meeting of the minds. And I got to meet people for the first time and actually hang out, have a beer with them and just talk face-to-face. And even though we're all here and we're technologists and we love to talk about big data, man, the face-to-face is really great. So for me it's been awesome just to meet people who I really respect and admire their work and see them in person and get to shake their hand and do that sort of stuff.
*Banner image courtesy of Alistair Croll, used with permission.
This post was written by Collin Sullivan. He is a research analyst for The Sentinel Project for Genocide Prevention, where data collection, analysis and visualization are being used to design an Early Warning System (EWS) to detect and prevent genocide. Collin lives in San Francisco. You can reach him at collin [at] thesentinelproject [dot] org and follow him on Twitter at @inciteinsight.
Note from the editor: I would like to thank Collin Sullivan for the enormous effort in writing down his impressions of the Strata conference! The aim was to provide all infosthetics readers who were not able to attend the conference with a truthful impression of the topics relating to data visualization. Collin really delivered and went above and beyond this expectation. Thanks, Collin!
. Strata 2011 [Day 1]: Making People Fall in Love with your Data
. Strata 2011 [Day 1]: Communicating Data Clearly
. Strata 2011 [Day 2]: Telling the Story with Data
. Strata 2011 [Day 2]: Data as Democracy
. Strata 2011 [Day 3]: Beauty, Journalism and the Human Mind