How do you visualize too much data?

We live in the data deluge era. You can hear it everywhere: massive databases, thousands of organizations making decisions based on their data, millions of transactions executed every second. Fast and large. Massive and relentless. Do you think it’s hard to find examples of databases with a million items? No, it’s not. They are everywhere.

But wait a moment … how do you visualize a million items? … And billions? … And trillions?

Visualization is developing fast and I love the way this whole community keeps pushing forward to create more and more clever designs. But how many visualizations out there stand up to the challenges posed by very high data volumes? And how prepared are people to visualize these monsters? My feeling, guys, frankly, is that the large majority is not prepared. Plus, you might not be aware of how much research has been done on this particular topic in the past. I will try to explain where the problems are and to offer a little toolkit to start dealing with them.

When is data too much?

It’s not easy to define when data is too much. Does it even make sense to state that data is too much? In statistical data analysis there are reasonably clear ways to state when data is too little, but I am not sure there are ways to say when it is too much. Intuitively, the more you get the better, right? But in visualization this is a whole different story. The more we have, the harder the task becomes.

We can intuitively say that data is too much simply when it doesn’t fit the screen. And researchers have been studying this issue for ages. The physical limit of visualization is the number of pixels a screen has, and there’s no way to use more than that. The best you can hope for is to use a single pixel for every item (also known as pixel-oriented visualization); at that point you have reached the limit.
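Just to make the idea concrete, here is a minimal sketch of what pixel-oriented visualization boils down to: map every item to exactly one pixel of an image. The row-by-row square layout, the colormap and the synthetic data are my own illustrative choices, not a canonical recipe.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.rand(1_000_000)          # one value per data item
side = int(np.ceil(np.sqrt(values.size)))   # smallest square grid that holds them all
grid = np.full(side * side, np.nan)         # NaN cells simply stay empty
grid[:values.size] = values                 # one item -> one pixel
plt.imshow(grid.reshape(side, side), cmap="viridis", interpolation="nearest")
plt.axis("off")
plt.show()
```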

But if the limit is the number of pixels, then what if we just increase the number of pixels according to the size of the data we want to visualize? Sure, we can always buy a higher resolution monitor … or we can build a wall by aligning several lower resolution monitors next to one another … or you can buy a powerwall like the one we have at the University of Konstanz.

But then it turns out you reach a new limit. While in principle you can always add more screens and pixels here and there (provided you can afford it), our visual field of view is limited, and so is the resolution of our eyes. That is, we could also think of packing more pixels into the same area, but how tiny can a pixel be before we cannot distinguish one from another? Visual acuity has a limit that we just cannot surpass.

So, when is data too much in visualization? Simply when we don’t have enough resolution or space to make it visible.

What are the (visualization) problems with too much data?

There are many potential issues with big data; here we focus only on those related to visualization:

  • Clutter. Ever tried to visualize a million data points in a scatter plot? I did it several times and it’s not fun. I tried with parallel coordinates too. It’s not fun. What you normally get out of it is a big black block of pixels as dense as plutonium. The information you can extract out of it is zero. In visualization you can distinguish two main kinds of techniques: visualizations where the objects can overlap (e.g., scatter plots) and space-filling techniques (e.g., treemaps). In the first case, the more you plot the messier it gets. In the second case, the more you plot the smaller the objects get, to the point where they are too small to be drawn (see the sketch right after this list).
  • Performance. If you are building a single static pretty picture, you might be willing to wait for ages before something pops out of your screen. But if you want even minimal interaction, performance is an issue. And even if your final purpose is to create one single static vis, you will need to iterate over and over before you get what you want. And each iteration will drive you mad. Give a read to the process Paul Butler followed to create the famous Facebook friendship visualization. Waiting several “minutes” every time is not fun.
  • Information loss. Of course we can always decide to make things simpler and take a small sample of the data or apply other data reduction methods, but then information loss creeps in, and unless you are a skilled statistician or have similar knowledge you might be overwhelmed by the doubt that something interesting was lost along the way. As long as you use “only” visualization to visualize millions of items you might run into this problem. Not always, not necessarily, but it’s good to know that the problem exists.
  • Limited cognition. As I said above, one possibility would be to visualize data on huge screens. But not only does this not scale at an economic level (how many monitors can you add before reaching bankruptcy?), it also doesn’t scale at a human level: we just cannot work effectively beyond a certain size. Data aggregation is an option, but then you need more navigation, and navigation loads up our memory.
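If you want to see the clutter problem with your own eyes, here is a tiny sketch on synthetic data (the correlated noise is just something to plot): a million points drawn with default opacity collapse into exactly the black block described above.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 1_000_000
x = np.random.randn(n)
y = 0.5 * x + np.random.randn(n)        # some structure hidden in the noise

plt.scatter(x, y, s=1, color="black")   # fully opaque: the structure disappears
plt.title("One million points, default opacity")
plt.show()
```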

What can you do?

Ok, enough with the problems. Now let’s try to offer some solutions. But before that, let me be honest with you: I do think visualization is not always the best tool for data analysis. And the sooner you learn to understand when this is the case the better. This is the basic assumption you need, the working mindset. The question you have to ask first is: do I really need visualization here?

Here are three basic strategies to deal with too much data:

  1. Admit visualization is not the right tool
  2. Sweat your blood to create a visualization out of it
  3. Use the visual analytics mantra

I said enough about the first point, so I will skip it.

Sweat Your Blood

Don’t get me wrong, with this heading I am by no means saying this is a bad option. I just want to make sure you know it’s hard. Damn hard. And at times it is frustrating because you spend whole afternoons tweaking things here and there and you don’t get what you want. But beautiful results can come out of it!

I am sure you remember the Facebook Friendship Map, published a few weeks ago by Paul Butler. Beautiful, right? But did you read the process he followed? A struggle between you and your computer. You and your code.

“As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn’t handle enough shades of color for it to work the way I wanted.”

If you have any experience with it you know that it is exactly like that, all the time. A struggle. And you’d better be prepared if you want to win.

So, what are the available techniques? I will list a few, without pretending to be complete, with a quick sketch of each one after the list. I’d love to hear from you if you have more. Please do.

  • Sampling – People in visualization tend to have a natural aversion to sampling, probably due to the psychological fear of losing something. This is not too rational, though. Every data set is a sample of some real phenomenon anyway, so sampling is already there, embedded in your data. Do your sampling and don’t fear it.
  • Aggregation – I see aggregation as the alter ego of sampling. Ironically enough, people tend to see aggregation as lossless when in fact there might be a lot to lose. With aggregation your eyes simply don’t see many of the details the data contains. Plus, every time you aggregate you need some form of navigation/interaction to ask for the details. Anyway, aggregation is another great tool to deal with big data. You can use aggregate queries or run some kind of clustering algorithm and the result can be great. Just be prepared to use big machines and wait for a looooong time.
  • Tuning/Tweaking – This is where you sweat most of your blood. By tweaking and tuning I mean all the little tricks you can adopt to make the picture you have in mind stand out on the screen. It includes things like changing transparency, size, colors, positions, bending curves, and all the rest, as exemplified by the Facebook friendship example. Good luck.
  • Segmentation – This is something I learned from the Occam’s Razor blog a long time ago. Data analysis becomes interesting when you segment the data according to some parameter you have in your database. It turns out that much knowledge can be extracted when you compare organic segments of your data. Can you segment your data into layers like male/female, income levels, geographic areas, speed, taste, whatever? Do it and you will have less data for each segment. And analyzing one segment at a time will give you more options and probably a hook to see things from a new perspective.
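Here is a minimal sketch of the sampling idea, on the same kind of synthetic data as above; the 1% sample size is an arbitrary choice you would tune for your own data.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 1_000_000
x = np.random.randn(n)
y = 0.5 * x + np.random.randn(n)

idx = np.random.choice(n, size=10_000, replace=False)   # keep 1% of the points
plt.scatter(x[idx], y[idx], s=2, alpha=0.3)
plt.title("A 10,000-point sample of one million points")
plt.show()
```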
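And a sketch of the aggregation option: bin the plane and show counts per bin instead of raw points. matplotlib’s hexbin is one convenient way to do it; the grid size and the log color scale are guesses you would iterate on.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 1_000_000
x = np.random.randn(n)
y = 0.5 * x + np.random.randn(n)

plt.hexbin(x, y, gridsize=80, cmap="inferno", bins="log")   # density, not points
plt.colorbar(label="points per bin (log scale)")
plt.show()
```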
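For tuning/tweaking, here is the same million-point scatter plot with tiny markers, heavy transparency and rasterized output. The exact parameter values are pure guesswork, which is precisely the point: expect to iterate on them many times.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 1_000_000
x = np.random.randn(n)
y = 0.5 * x + np.random.randn(n)

plt.scatter(x, y, s=0.5, alpha=0.02, color="navy", rasterized=True)
plt.title("Same data, tuned size and transparency")
plt.show()
```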
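Finally, a segmentation sketch: split the data by a categorical field and look at one segment at a time as small multiples. The segment column and its categories are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n = 300_000
df = pd.DataFrame({
    "x": np.random.randn(n),
    "y": np.random.randn(n),
    "segment": np.random.choice(["low income", "mid income", "high income"], size=n),
})

# one small plot per segment, shared axes so the panels are comparable
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (name, group) in zip(axes, df.groupby("segment")):
    ax.hexbin(group["x"], group["y"], gridsize=50, cmap="viridis")
    ax.set_title(name)
plt.show()
```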

These are the techniques that always come to mind when I think about large data visualization. Maybe you have some more? If you do, let me know.

Use the visual analytics mantra

I must admit it: this is my favorite part. What is the visual analytics mantra? In order to understand it you first have to get acquainted with the information visualization mantra invented by Ben Shneiderman. Then you will be in the position to understand the visual analytics mantra that Daniel Keim coined as an extension of the original one.

Visual Information Seeking Mantra

Overview First, Zoom and Filter, Details-on-Demand

This is how Shneiderman described the typical interaction you have with an information visualization system. I already commented on it in my post about the 7 foundational infovis papers you have to read.

Based on the famous mantra, Daniel Keim coined the visual analytics mantra, which concisely captures the core message behind visual analytics.

Visual Analytics Mantra

Analyze First, Show the Important, Zoom, Filter and Analyze Further, Details-on-Demand

That is, guys, sometimes you need to use the power of the machine first and let it discover whether there is anything interesting to show before visualization is used. The whole field of knowledge discovery in databases, and statistics more in general, has so much to offer that you cannot avoid knowing at least part of it if you really want to deal with large data. Large data is a wild beast and you’d better treat it with the right tools. Visualization is a great tool to convey what automatic data analysis algorithms discover. And often that is a very challenging task! What the algorithms spit out is exciting new complex data that requires creativity and knowledge as well. So, keep it in mind: if you want to sweat your blood, fine, I am with you. But be sure to remember there are other, maybe better, options around the corner. Automatic computation before visualization is a great one.
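To make the mantra concrete, here is a hedged sketch of “analyze first, show the important”: an automatic step (k-means via scikit-learn’s MiniBatchKMeans, just one convenient choice) condenses a million points into a handful of cluster centers, and only those get drawn, sized by how many points they stand for. The number of clusters and the size scaling are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

n = 1_000_000
X = np.random.randn(n, 2)                                    # stand-in for the raw data

km = MiniBatchKMeans(n_clusters=20, random_state=0).fit(X)   # analyze first
sizes = np.bincount(km.labels_, minlength=20)                # points per cluster

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=sizes / 200, alpha=0.6)                        # show the important: centers, sized by population
plt.title("Cluster centers instead of one million raw points")
plt.show()
```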

Some Papers

Some of the things I’ve written in this post stem from years of reading research papers on the topic. This is in fact quite close to the stuff I do during the day in my own research work. It’s fascinating … at least for me. A number of research papers came to mind during the conceptualization of this post. I will list them below without any particular order or comment. Just give them a look if you are interested. It’s not a complete review, but I can vouch for the quality of their content.

Take care and have fun. Comments are always more than welcome.

If you like this post, please remember to talk about it on Twitter. I’m looking forward to hearing your opinions and questions. Thanks!

TreeJuxtaposer: Scalable Tree Comparison using Focus+Context with Guaranteed Visibility