The myth of the aimless data explorer

by Enrico on January 7, 2014

in Thoughts

There is a sentence I have heard or read multiple times in my journey into (academic) visualization: visualization is a tool people use when they don’t know what question to ask of their data.

I have always taken this sentence as a given and accepted it as it is. Good, I thought, we have a tool to help people come up with questions when they have no idea what to do with their data. Isn’t that great? It sounded right or at least cool.

But as soon as I started working on more applied projects, with real people, real problems, and real data they care about, I discovered that all this excitement for data exploration is just not there. People working with data are not excited about “playing” with data; they are excited about solving problems. Real problems. And real problems have questions attached, not just curiosity. There’s simply no such thing as undirected data exploration in the real world.

Digging a little deeper into the issue, I realized that this is, after all, natural and somewhat obvious: why should people explore data for the sake of it? Sure, some people like us (yes, the hopeless data geeks) do take pleasure in looking into a bunch of data, but we are a minority, and I am not sure we should take ourselves as the reference model for what we do.

The reason I decided to write about this is that I think this myth is somewhat pervasive and not limited to visualization. While I am not a Data Mining or Machine Learning expert, I know some people in the area, and I know some of them also promote “knowledge discovery” as the science of finding good questions.

But wait a moment, you might say … when we use knowledge discovery tools (yes, vis is a knowledge discovery tool), sometimes we do stumble into unanticipated questions, and these questions may in fact be the real value of the whole process! I agree. And I have experienced this effect multiple times myself. Yet I think this does not contradict my point: I am not arguing that we should not help people come up with new questions as a side effect of data analysis, or that coming up with new questions is not valuable. What I am arguing here is that we should be very careful in selling visualization as a tool for people who don’t know what question to ask. This is simply not true. Everyone has a question, and I even believe everyone should start with a question.

There are a couple of words I like more when talking about visualization: hypothesis and explanation. These are great words! They describe much better what visualization is good for. You might actually have a good question to start with but not a good hypothesis or explanation for what is going on (some patients develop unexpected complications after receiving a particular treatment and you don’t know why). And visualization can for sure help you come up with one. Visualization is a “hypothesis booster”. It’s actually so effective that it could even be dangerous in this respect (it may bias you toward some explanation)!

So next time you talk about visualization, restrain yourself from selling it as a tool to help people aimlessly explore some data. And when you hear someone saying that, please send him or her to this post. I’d be happy to defend my position :)

Am I missing something here? Am I totally wrong in some sense? I know there are some people out there who would strongly disagree with me. Feel free to let me hear your voice!


  • http://chezvoila.com/ Voilà Info Design

    It matches my experience with businesses. Managers are interested in making the right decisions and they have years of experience asking questions of their data. Clients wondering what’s hidden in their data were not looking for a visualization as much as plain analysis to surface it. This is more an observation of the current situation than of what visualization can truly achieve, though.

  • Santiago Ortiz

    I think there’s space for exploration without previous questions or identified problems: it’s possible just to play. This is somehow related to the data science family of unsupervised analysis methods http://en.wikipedia.org/wiki/Unsupervised_learning

    Reasons to just play:

    - familiarize yourself with structural aspects of the data you have (eventually discover holes, problems)

    - to discover outliers (later you can try to understand them… they might be extremely significant… or just sampling errors)

    - gives you a perspective of data without judgment (is a little bit like taking LSD and staring at a tree; you see a fractal, not a tree)

    - gives you a palette of visualization methods that can be used just taking into account data structures contained in the whole data (later it is up to you to identify which ones make sense)

    - it always makes sense to ask which variables are correlated even without knowing what they mean… it could be fun to discover surprises. In general you can contrast what you discovered in the abstract with your intuitions (and later use both unreliable-by-themselves insights to define directions of real exploration)

    - it’s fun

    That said, sooner or later, and probably the sooner the better, it’s important to identify questions and problems, and by playing alone you won’t get there.

  • http://www.excelcharts.com/blog/ Jorge Camoes

    It depends; if you are a journalist, a graphic designer or a datavis expert/teacher probably you want to play with the data. If you are a professional user, you want answers, because you already know the questions.

    That said, if I were a journalist, a graphic designer or a datavis expert I would try to stay focused and get a deeper understanding of a knowledge field. If I were a professional user, I would try to spend some time in unstructured and what-if exploration.

  • Marian Dörk

    I see where you’re coming from with this argument, Enrico, and I find it interesting to challenge the widespread aspiration to support open-ended data exploration. But I disagree. I don’t think this is a cliché with no application in the real world.

    First, ‘aimless’ is not ‘open-ended’. The term ‘aimless’ seems a bit problematic to me as it connotes a purposeless and therefore useless activity. However, approaching data without a specific problem or question can indeed have a purpose or aim: namely, to learn about the quality and structure of a dataset. Surely, there are many scenarios where you might want to start with a problem and question, but there are just as many where you don’t. Santiago lists several good reasons for just playing with data.

    Second, I don’t think that we ever do anything without an intent. We might want to visualize a data set because we have a vague inkling or interest, but this to me is still open-ended. If we didn’t have any intention behind our analyses or explorations, then why would we use visualization or anything at all?

    Third, it’s not about not knowing a question, but about purposefully avoiding a specific question. That is not always possible, but I think it can be beneficial to withhold judgement. It’s funny how you chose a photo with a blindfolded wanderer: one could argue that starting with a specific question or problem makes us blind to all the other aspects of a data set.

    In my work I’m less interested in open-ended analysis of data, and more in the value of open-ended exploration of rich collections. Especially for these kinds of data sets (photos, documents, books, etc.), it makes a lot of sense to wander around. In fact, especially for collections we lack the tools that allow us to approach data without a question, i.e., without a search query. For this, visualization has great potential, and I think it’s only mythical because we still lack the proper interfaces to support our inkling to stroll.

    • FILWD

      Please correct me if I am wrong: would it be correct to say visualization in these cases would be a tool to support people in search of an idea? And who are these people? I can think of various sets of artists and geeks. Who else?

      • Marian Dörk

        Sure, I think it has to do with this perpetual search for ideas and inspiration. I don’t think, though, that this search is a niche scenario. To me “artists and geeks” are not a small bunch of outliers. It’s a steadily growing crowd that some might call the creative economy…

  • FILWD

    Thank you all for your comments! I think we have to distinguish between what is possible to do and what people need to do. Sure it is possible to explore data aimlessly but this is not what most people need.

    Is there space for pure exploration? Sure. Should we explore this side of vis more? Sure! But I am just surprised to see how often we use this kind of terminology when in fact it does not seem to match with reality.

  • FILWD

    … I just forgot to add that Nick Diakopoulos’ comment on Twitter is a really good one. Level of domain expertise plays a major role here too.
