The myth of the aimless data explorer

There is a sentence I have heard or read multiple times in my journey into (academic) visualization: visualization is a tool people use when they don’t know what question to ask of their data.

I have always taken this sentence as a given and accepted it as it is. Good, I thought, we have a tool to help people come up with questions when they have no idea what to do with their data. Isn’t that great? It sounded right or at least cool.

But as soon as I started working on more applied projects, with real people, real problems, real data they care about, I discovered that all this excitement for data exploration is just not there. People working with data are not excited about “playing” with data; they are excited about solving problems. Real problems. And real problems have questions attached, not just curiosity. There’s simply nothing like undirected data exploration in the real world.

Digging a little deeper into the issue, I realized that after all this is natural and somewhat obvious: why should people explore data for the sake of it? Sure, some people like us (yes, the hopeless data geeks) do take pleasure in looking into a bunch of data, but we are a minority and I am not sure we should take ourselves as the model of reference for what we do.

The reason I decided to write about this is that I think this myth is somewhat pervasive and not limited to visualization. While I am not a Data Mining or Machine Learning expert, I know some people in the area, and I know some of them too promote “knowledge discovery” as the science of finding good questions.

But wait a moment, you might say … when we use knowledge discovery tools (yes, vis is a knowledge discovery tool) sometimes we do stumble into unanticipated questions and these questions may in fact be the real value of the whole process! I agree. And I have experienced this effect multiple times myself. Yet, I think this does not contradict my point: what I am arguing is not that we should not help people come up with new questions as a collateral effect of data analysis, or that coming up with new questions is not valuable. What I am arguing here is that we should be very careful in selling visualization as a tool for people who don’t know what question to ask. This is simply not true. Everyone has a question and actually I even believe everyone should start with a question.

There are a couple of words I like more when talking about visualization: hypothesis and explanation. These are great words! They describe much better what visualization is good for. You might actually have a good question to start with but not a good hypothesis or explanation for what is going on there (some patients develop unexpected complications after receiving a particular treatment and you don’t know why). And visualization can for sure help you come up with one. Visualization is a “hypothesis booster”. It’s actually so effective that it could even be dangerous in this respect (it may bias you toward some explanation)!

So next time you talk about visualization, restrain yourself from selling it as a tool to help people aimlessly explore some data. And when you hear someone saying that, please send him or her to this post. I’d be happy to defend my position :)

Am I missing something here? Am I totally wrong in some sense? I know there are some people out there who would strongly disagree with me; feel free to let me hear your voice!