What do we talk about when we talk about “Data Exploration”?

There is an old adage in the InfoVis / Visual Analytics community that I have heard a zillion times: “visualization is needed/useful when people don’t have a specific question in mind”. For many years I took this as gospel. Then, as I grew more experienced, I started questioning the whole concept: why would someone look at a given data set with no specific goal or question in mind? It does not make sense.

This is an aspect of visualization that has puzzled me for a long time. It is an interesting conundrum that I believe is still largely unsolved: one of those things many people say, but nobody really seems to have grasped in full depth.

Here is my humble attempt at putting some order into this matter. Let’s start with definitions:

The definition introduces a couple of very important features: familiarity (“through an unfamiliar area”) and learning (“in order to learn about it”). If we take this definition as our main guide, we can say that data visualization is particularly helpful when we use it to look into some unfamiliar data to learn more about something.

I suspect there are (at least) three main situations in which this can happen.

  1. Needing to familiarize yourself with a new data set (“what does it look like?”). Anyone who dabbles with data goes through this: you receive or find a new data set and the first thing you need to do is figure out what information it contains. How many fields? What type of fields? What is their meaning? Are there any missing values? Is there anything I don’t actually understand? How are the values distributed? Is there any temporal or geographical information? Are we actually in the presence of some kind of network or relational structure? Etc. One crucial, and often overlooked, aspect of this activity is “data semantics”. I personally find that understanding the meaning of the various fields and the values they contain is such a crucial and hard activity at the beginning. It is an activity that often requires many back-and-forth discussions and clarifications with domain experts and data collectors.
  2. Hunting for “something” interesting (“is there anything interesting here?”). I suspect this is what people really mean most of the time when they talk about “data exploration”: the feeling that something interesting may be hidden there and that some exploratory work is needed to figure it out. But … when does this actually happen? What kind of real-world activities are characterized by this desire to find “something”? I am not sure I have an all-encompassing answer to that, but I am familiar with at least two examples: data journalism and quantified self. In data journalism it is very common to first get your hands on some potentially “juicy” data set and then try to figure out what interesting stories may hide there (Panama Papers, Clinton’s Emails, etc.). I have observed this in our collaboration with ProPublica when hunting for stories about how people review doctors on Yelp. In quantified self you often want to look at your data to see if you can detect anything unexpected. I have experienced the same when looking at personal data I have collected about my deep work habits (or lack thereof). Sometimes we know there must be something interesting in a given data set, and visualization guides us in the formulation of unexpressed questions. The interesting aspect of this activity is that the outcome is often more (and better) questions, not answers.
  3. Going off on a tangent (“oh … this may be interesting too!”). There is one last, subtler, kind of data exploration. You start with a specific question in mind but, as you go about it, you find something interesting that triggers an additional question you had not anticipated. This is the power of visual data analysis: it forces you to notice something new, and you have to follow the path. This happens to me all the time (and I hope it’s not just a sign of my ADD). Some of these are useless diversions. Some of them actually lead to some pretty unique gems!
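To make the first situation a bit more concrete, here is a minimal sketch of the kind of “first look” questions listed there (how many fields, what type, how many missing values, how the values are distributed). It uses only the Python standard library. The function names `profile_columns` and `guess_type` and the tiny sample data are hypothetical and invented purely for illustration; real data sets will need more careful type and missing-value handling.

```python
import csv
import io
from collections import Counter


def guess_type(values):
    """Guess 'number' if every non-missing value parses as a float, else 'text'."""
    if not values:
        return "empty"
    try:
        for v in values:
            float(v)
    except ValueError:
        return "text"
    return "number"


def profile_columns(rows):
    """Summarize each field of a list of dict rows."""
    fields = list(rows[0].keys()) if rows else []
    summary = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        # Assumption: empty strings and None both count as missing.
        present = [v for v in values if v not in (None, "")]
        summary[field] = {
            "missing": len(values) - len(present),
            "type": guess_type(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(3),
        }
    return summary


# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = """city,visits,rating
Boston,12,4.5
Austin,,3.0
Boston,7,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
for field, stats in profile.items():
    print(field, stats)
```

A summary like this only covers the mechanical part of familiarization. It says nothing about data semantics, which still requires talking to the people who collected the data.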

These three modalities can of course overlap a lot. I am also sure there are other situations we can describe as data exploration that I am not covering here (if you have suggestions, please let me know!).

I want to conclude by saying that this is an incredibly under-explored area of data visualization. More advances are needed in at least three directions.

  • First, we need to understand data exploration as a process much better and, if possible, create models able to describe it in useful abstract terms. In visualization research we often refer to Card and Pirolli’s “Sensemaking Loop” to describe this kind of open-ended and incremental activity, but for some reason every time I try to use it, it does not seem to describe what I actually observe in practice (this deserves its own post).
  • Second, we need to develop more methods, techniques and tools to support interactive data exploration. I bet there are lots of “latent needs” waiting to be discovered out there. This is another area where I believe we visualization researchers have made surprisingly little progress. We have built a lot of narrow solutions that work for 3-5 people but very few general-purpose methods and techniques. We need more of those (this also deserves its own post)!
  • Third, we need to find ways to teach exploratory data analysis to others systematically, in ways that make the process as effective as possible. I am appalled at how little guidance and material there is out there on teaching people how to do the actual analysis work. Statisticians are fixated on confirmatory analysis and regard exploration as a second-class citizen. Visualization researchers are too busy building stuff and have done too little to teach others how to do the actual groundwork. This is a problem we need to solve. It’s for this reason that next semester I will be teaching a new course with this specific purpose. Stay tuned.

That’s all I had to say about Data Exploration.

And you? What is your take? What is data exploration for you? And how can we improve it?

Take care.