How do you visualize too much data?

by Enrico on January 31, 2011

in Guides

We live in the data deluge era. You can hear it everywhere: massive databases, thousands of organizations making decisions based on their data, millions of transactions executed every second. Fast and large. Massive and relentless. Do you think it’s hard to find examples of databases with a million items? No, it’s not. They are everywhere.

But wait a moment … how do you visualize a million items? … And billions? … And trillions?

Visualization is developing fast and I love the way this whole community is pushing forward to create more and more clever designs. But how many visualizations out there stand up to the challenges posed by very high data volumes? And how prepared are people to visualize these monsters? My feeling, frankly, is that the large majority is not prepared. Plus, you might not be aware of how much research has already been done on this particular topic. I will try to explain where the problems are and to offer a little toolkit to start dealing with them.

When is data too much?

It’s not easy to define when data is too much. Does it even make sense to say that data is too much? In statistical data analysis there are somewhat clear ways to state when data is too little, but I am not sure whether there are ways to say when it is too much. Intuitively, the more you get, the better, right? But in visualization this is a whole different story: the more we have, the harder the task.

We can intuitively say that data is too much simply when it doesn’t fit the screen. And researchers have been studying this issue for ages. The physical limit of visualization is the number of pixels a screen has, and there’s no way to use more than that. The best you can hope for is to use a single pixel for every item (also known as pixel-oriented visualization); at that point you have reached the limit.
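To make the idea concrete, here is a minimal sketch of a pixel-oriented view: one data item per pixel, filled row by row into a square image. The random values and the use of numpy/matplotlib are my own illustrative assumptions, not something from the original post.

```python
# A minimal sketch of pixel-oriented visualization: one data item per pixel,
# filled row by row into a square image. The random values and the use of
# numpy/matplotlib are illustrative assumptions, not part of the original post.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.rand(1_000_000)          # stand-in for one million data items
side = int(np.ceil(np.sqrt(values.size)))   # smallest square that fits them all
grid = np.full(side * side, np.nan)         # pad any leftover pixels with NaN
grid[:values.size] = values                 # one value -> one pixel, row by row
plt.imshow(grid.reshape(side, side), cmap="viridis", interpolation="nearest")
plt.axis("off")
plt.title("One million items, one pixel each")
plt.show()
```

A full-HD screen has only about two million pixels, so this is roughly where the hard limit sits.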

But if the limit is the number of pixels, what if we just increase the number of pixels according to the size of the data we want to visualize? Sure, we can always buy a higher-resolution monitor … or we can build a wall by aligning several lower-resolution monitors next to one another … or you can buy a powerwall like the one we have at the University of Konstanz.

But then it turns out you reach a new limit. While in principle you can always add screens and pixels here and there (provided you can afford it), our visual field of view is limited, and so is the resolution of our eyes. That is, we could also think of packing more pixels into the same area, but how tiny can a pixel be before we cannot distinguish one from another? Visual acuity has a limit that we just cannot surpass.
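As a rough back-of-the-envelope check (my numbers, not the post’s): assuming normal visual acuity of about one arcminute and a desktop viewing distance of roughly 60 cm, the smallest pixel you can still resolve is on the order of 0.17 mm, or somewhere around 150 pixels per inch.

```python
# A back-of-the-envelope check on the acuity limit. The numbers (normal acuity
# of about 1 arcminute, a 60 cm desktop viewing distance) are my assumptions.
import math

acuity_rad = math.radians(1 / 60)            # ~1 arcminute of visual angle
distance_mm = 600                            # typical desktop viewing distance
pixel_mm = distance_mm * math.tan(acuity_rad)
print(f"smallest resolvable pixel: {pixel_mm:.2f} mm "
      f"(~{25.4 / pixel_mm:.0f} pixels per inch)")
# -> roughly 0.17 mm, i.e. around 150 ppi at desktop viewing distance
```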

So, when is data too much in visualization? Simply when we don’t have enough resolution or space to make it visible.

What are the (visualization) problems with too much data?

There are many potential issues with big data; here we focus only on those related to visualization:

  • Clutter. Ever tried to visualize a million data points in a scatter plot? I did it several times and it’s not fun. I tried with parallel coordinates too. It’s not fun. What you normally get out of it is a big black block of pixels as dense as plutonium. The information you can extract out of it is zero. In visualization you can distinguish two main kinds of techniques: visualizations where the objects can overlap (e.g., scatter plots) and space-filling techniques (e.g., treemaps). In the first case, the more you plot the messier it gets. In the second case, the more you plot the smaller the objects get (up to the point where they are too small to be plotted). A small sketch of the overlap problem follows this list.
  • Performance. If you are building a single static pretty picture, you might be willing to wait for ages before something pops out of your screen. But if you want to have some minimal interaction, performance is an issue. And even if your final purpose is to create one single static vis, you will need to iterate over and over before you get what you want. And each iteration will drive you mad. Give a read to the process followed by Paul Butler to create the famous Facebook friendship visualization. Waiting several “minutes” every time is not fun.
  • Information loss. Of course we can always decide to make things simpler and take a small sample of the data or apply other data reduction methods, but then information loss creeps in, and unless you are a skilled statistician or have similar knowledge you might be overwhelmed by the doubt that something interesting was lost on the way. As long as you use “only” visualization to visualize millions of items you might get into trouble with this problem. Not always, not necessarily, but it’s good to know that the problem exists.
  • Limited cognition. As I said above, one possibility would be to visualize data on huge screens. But this does not scale on an economic level (how many monitors will you have to add before reaching bankruptcy?), nor on a human level: we just cannot work effectively beyond a certain size. Data aggregation is an option, but then you need more navigation, and navigation loads up our memory.
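Here is the sketch mentioned in the Clutter item above: a minimal comparison, on synthetic data of my own making, between a raw scatter of one million points (which saturates into a solid block) and a 2D histogram of the same points (which still shows where the density is).

```python
# A minimal sketch of the clutter problem on synthetic data of my own making:
# a raw scatter of one million points saturates into a solid block, while a
# 2D histogram of the very same points still reveals where the density is.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = 0.5 * x + rng.standard_normal(1_000_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=1, color="black")        # overplotting: density is invisible
ax1.set_title("Raw scatter: a block of pixels")
ax2.hist2d(x, y, bins=200, cmap="viridis")   # binned counts: density is visible
ax2.set_title("2D histogram of the same points")
plt.show()
```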

What can you do?

Ok, enough with the problems. Now let’s try to offer some solutions. But before that, let me be honest with you: I do think visualization is not always the best tool for data analysis. And the sooner you learn to recognize when this is the case, the better. This is the basic assumption you need, the working mindset. The question you have to ask first is: do I really need visualization here?

Here are three basic strategies to deal with too much data:

  1. Admit visualization is not the right tool
  2. Sweat your blood to create a visualization out of it
  3. Use the visual analytics mantra

I said enough about the first point, so I will skip it.

Sweat Your Blood

Don’t get me wrong, with this heading I am by no means saying this is a bad option. I just want to make sure you know it’s hard. Damn hard. And at times it is frustrating because you spend whole afternoons tweaking things here and there and you don’t get what you want. But beautiful results can come out of it!

I am sure you remember the Facebook Friendship Map, published a few weeks ago by Paul Butler. Beautiful, right? But did you read the process he followed? A struggle between you and your computer. You and your code.

As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn’t handle enough shades of color for it to work the way I wanted.

If you have any experience with it you know that it is exactly like that, all the time. A struggle. And you’d better be prepared if you want to win.
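To give a flavor of the kind of tweaking the quote describes, here is a minimal sketch of the semi-transparency trick on made-up line endpoints (not the Facebook friendship data): with a very low alpha, regions where many lines overlap come out brighter than regions crossed by only a few.

```python
# A minimal sketch of the semi-transparency trick on made-up line endpoints
# (not the Facebook friendship data): with a very low alpha, regions where
# many lines overlap come out brighter than regions crossed by only a few.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

rng = np.random.default_rng(2)
n = 100_000
starts = rng.uniform(-1, 1, size=(n, 2))
ends = rng.uniform(-1, 1, size=(n, 2))
segments = np.stack([starts, ends], axis=1)  # shape (n, 2 points, 2 coords)

fig, ax = plt.subplots(figsize=(6, 6), facecolor="black")
ax.add_collection(LineCollection(segments, colors="white",
                                 linewidths=0.1, alpha=0.02))
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.axis("off")
plt.show()
```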

So, what are the available techniques? I will list a few without pretending to be complete. I’d love to hear from you if you have more. Please do.

  • Sampling – People in visualization tend to have a natural aversion to sampling, probably due to the psychological fear of losing something. This is not very rational, though. Every data set is already a sample of some real phenomenon, so sampling is already there, embedded in your data. Do your sampling and don’t fear it (a short sketch after this list shows sampling, aggregation, and segmentation together).
  • Aggregation – I see aggregation as the alter ego of sampling. Ironically enough, people tend to see aggregation as lossless when in fact there might be a lot to lose. With aggregation your eyes simply don’t see many of the details the data contains. Plus, every time you aggregate you need some form of navigation/interaction to ask for the details. Anyway, aggregation is another great tool to deal with big data. You can use aggregate queries or run some kind of clustering algorithm and the result can be great. Just be prepared to use big machines and wait for a looooong time.
  • Tuning/Tweaking – This is where you sweat most of your blood. By tweaking and tuning I mean all the little tricks you can adopt to make the picture you have in mind stand out from the screen. It covers things like changing transparency, size, colors, positions, bending curves, and all the rest, as exemplified by the Facebook friendship example. Good luck.
  • Segmentation – This is something I learned from the Occam’s Razor blog a long time ago. Data analysis becomes interesting when you segment your data according to some parameter you have in your database. It turns out that much knowledge can be extracted when you compare organic segments of your data. Can you segment your data into layers like male/female, income levels, geographic areas, speed, taste, whatever? Do it and you will have less data for each segment. And analyzing one segment at a time will give you more options and probably a hook to see things from a new perspective.
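Here is the sketch promised above: a minimal example of sampling, aggregation, and segmentation on a large table. The DataFrame and its columns (segment, value, lon, lat) are hypothetical stand-ins invented for illustration; only the pandas calls themselves are standard.

```python
# A minimal sketch of sampling, aggregation, and segmentation with pandas.
# The table and its columns (segment, value, lon, lat) are hypothetical
# stand-ins invented for illustration; only the pandas calls are standard.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "segment": rng.choice(["female", "male"], size=2_000_000),
    "value":   rng.exponential(scale=100.0, size=2_000_000),
    "lon":     rng.uniform(-180, 180, size=2_000_000),
    "lat":     rng.uniform(-90, 90, size=2_000_000),
})

# Sampling: plot 50,000 random rows instead of all two million.
sample = df.sample(n=50_000, random_state=1)

# Aggregation: bin positions into a coarse grid and count items per cell.
counts = df.groupby(
    [pd.cut(df["lon"], bins=360), pd.cut(df["lat"], bins=180)],
    observed=True,
).size()

# Segmentation: analyze one organic slice of the data at a time.
per_segment = df.groupby("segment")["value"].describe()
print(sample.shape, counts.shape, per_segment.shape)
```

Each of the three results is small enough to plot directly, which is the whole point.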

These are the techniques that always come to my mind when I think about large data visualization. Maybe you have some more? If so, let me know.

Use the visual analytics mantra

I must admit it: this is my favorite part. What is the visual analytics mantra? In order to understand it you first have to get acquainted with the information visualization mantra invented by Ben Shneiderman. Then you will be in a position to understand the visual analytics mantra that Daniel Keim coined as an extension of the original one.

Visual Information Seeking Mantra

Overview First, Zoom and Filter, Details-on-Demand

This is how Shneiderman described the typical interaction you have with an information visualization system. I already commented on it in my post about the 7 foundational infovis papers you have to read.

Based on the famous mantra, Daniel Keim coined the visual analytics mantra, which concisely captures the core message behind visual analytics.

Visual Analytics Mantra

Analyze First, Show the Important, Zoom, Filter and Analyze Further, Details-on-Demand

That is, guys, sometimes you need to use the power of the machine first and let it discover whether there is anything interesting to show before visualization is used. The whole field of knowledge discovery in databases, and statistics more in general, has so much to offer that you cannot avoid knowing at least part of it if you really want to deal with large data. Large data is a wild beast and you’d better treat it with the right tools. Visualization is a great tool to convey what automatic data analysis algorithms discover. And often that is a very challenging task! What the algorithms spit out is exciting new complex data that requires creativity and knowledge as well. So, keep it in mind: if you want to sweat your blood, fine, I am with you. But be sure to remember there are other, maybe better, options around the corner. Automatic computation before visualization is a great one.
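A minimal sketch of what “analyze first, show the important” can look like in practice, under my own assumptions (synthetic data, scikit-learn available, k-means as the analysis step): the machine clusters a million points first, and only the cluster summaries are drawn.

```python
# A minimal sketch of "analyze first, show the important", under my own
# assumptions (synthetic data, scikit-learn available, k-means as the analysis
# step): cluster a million points first, then draw only the cluster summaries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(3)
X = rng.standard_normal((1_000_000, 2))

km = MiniBatchKMeans(n_clusters=50, random_state=3).fit(X)  # analyze first
counts = np.bincount(km.labels_, minlength=50)

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=counts / counts.max() * 500,   # show the important: cluster sizes
            alpha=0.6)
plt.title("50 cluster summaries instead of 1,000,000 raw points")
plt.show()
```

Zooming, filtering, and details-on-demand would then operate on the clusters rather than on the raw points.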

Some Papers

Some of the things I’ve written in this post stem from years of reading research papers on the topic. This is in fact quite close to the stuff I do during the day in my own research work. It’s fascinating … at least for me. There are a number of research papers that came to my mind while conceptualizing this post. I will list them below in no particular order and without comment. Just give them a look if you are interested. It’s not complete and it’s not a full review, but I can vouch for the quality of the content.

Take care and have fun. Comments are always more than welcome.

If you like this post, please share it on Twitter. I’m looking forward to hearing your opinions and questions. Thanks!

TreeJuxtaposer: Scalable Tree Comparison using Focus+Context with Guaranteed Visibility

  • http://cscheid.net Carlos Scheidegger

    Hey Enrico, nice post!

    Have you seen Daniel Weiskopf’s work with his students on the continuous versions of scatterplots and parallel coordinates? Those are a really nice way to reduce the pixel clutter of these plots by defining a continuous measure and rendering that instead. As described, you need a continuous field to sample from (think temperature and pressure, like in scivis).

    But if you can model your data as a continuous distribution (which you probably should anyway in the case of massive data), those alternatives make for incredibly good-looking, clutter-free plots.

    • Enrico

      Hey Carlos, nice to see you here! No, I haven’t seen it. Do you have any pointer that I can integrate? That would be fantastic. Anyway, you reminded me that calculating some kind of density field out of discrete data points is always possible (with density estimation) and it’s quite scalable if data is processed first. I would count this as a kind of visual analytics approach though. What do you think?

      • http://cscheid.net Carlos Scheidegger

        Sven Bachthaler, Daniel Weiskopf. “Continuous Scatterplots”, Vis 2008
        Julian Heinrich, Daniel Weiskopf. “Continuous Parallel Coordinates”, Vis 2009.

        About your visual analytics comment… You’re going to force me into a public statement about visualization vs visual analytics? Fiiiiine :) confession time! I don’t think I really get the difference between the terms. So let me try to understand your idea of the difference.

        Is it fair to characterize (even if extremely roughly) your idea of the difference between visualization and visual analytics as follows: “visualization is traditionally concerned with basic issues such as how to render larger data faster and more effectively, while visual analytics tries to bring other techniques from data analysis into the picture”?

        If that’s sufficiently fair, then my answer is the following. I always thought of good visualization practice in the way I think you talk about Visual Analytics: algorithms for decent data analysis, together with algorithms for turning the result into a good figure. So, to me, that is still visualization, even if there’s kernel density estimation in the middle.

        Incidentally, you don’t need to process the data first: there are, for example, algorithms for doing density estimation on streaming data. There has been an enormous amount of work in data mining and learning on doing interesting computations in the streaming model. It’s something the vis community is sorely lagging behind on, imo. But that’s a different story!

  • http://www.excelcharts.com/blog Jorge Camoes

    A great post as usual. I’d like to add my $0.02. If you are planning to print small pictures you don’t need to buy an expensive 12MP digital camera. The same happens with data visualization. We must know what level of detail we need, and we must fight loss aversion.

    The more data we have, the less flat it becomes. We must add dimensions, prioritize it, know the difference between focus and context. We must not be afraid of our judgments. Open your eyes and you have a point of view. Turn around and you have a different point of view. We can do it in our physical world. We must be able to do it in the abstract world of data. That’s the future of information visualization: constantly changing our point of view to keep answering all those new questions that we are not even aware of.

    If we have too much data we are not asking the right question.

    • http://cscheid.net Carlos Scheidegger

      If we have too much data we are not asking the right question.

      But if we throw away some data, how do we know what is the right question to ask? :)

      • http://www.excelcharts.com/blog Jorge Camoes

        No, this is not a chicken and egg situation… I’m sure you can come up with very interesting questions without a single data point. Then find that single data point and rewrite your questions.

      • http://peltiertech.com/WordPress/ Jon Peltier

        And how do we know which is the right data to throw away, and the right data to keep?

  • http://www.excelcharts.com/blog/ Jorge Camoes

    Elementary, my dear Watson. You don’t throw away the data, you just shift focus. Just like your eyes do. You focus on data that answers your question, carries more information, explains more variation.

    Easier said than done, I know. We all have to learn how to do it.


  • http://bentrem.sycks.net Ben Tremblay

    We’re cognitive misers. Perceiving a million or billion pixels isn’t adaptive, so we see lines and curves and masses … shapes. The machinery of our cognitive scheme is heh rather fixed.

    We’re wired not only to detect patterns, but to detect dissonance, differences of a certain scale (google “pop-out effect”). But we’re likewise wired to smooth over absurdity (google “McGurk effect”).

    So @Jorge is on the right track: applying /this/ formula we get what looks like noise; applying /that/ formula we get to enjoy yet another glorious Julia set. Now the thing is, the information needs to be chaotic … information rich … rather than random. Problematic: how do we know which is which /before/ we visualize it? My solution: don’t go fishing. Reduce the question. Sort of like hypothesis testing.

    Sorry, I’m distracted and rushing … wanted to write something rather than nothing. Glad to have found your blog!

    • http://bentrem.sycks.net Ben Tremblay

      erratum: s / “the information needs to be” / “the data needs to be”

