It’s somewhat surprising to me to notice how little we discuss about the more technical side of data visualization. I use to say that visualization is something that “happens in your head” to emphasize the role of perception and cognition and to explain why it is so hard to evaluate visualization. Yet, visualization happens a lot in the computer also, and what happens there can be extremely fascinating too.
So, today I want to talk about algorithms in visualization. What’s the use of algorithms in visualization? When do we need them? Why do we need them? What are they for? Surprisingly, even in academic circles I noticed we tended to either avoid the question completely or to take the answer for granted. Heck, even the few books we have out there: how many of them teach the algorithmic side of visualization? None.
I have grouped algorithms in four broad classes. For each one I am going to give a brief description and a few examples.
Spatial Layout. The most important perceptual component of every visualization is how data objects are positioned on the screen, that is, the logic and mechanism by which a data element is uniquely positioned on the spatial substrate. Most visualization techniques have closed formulations, based on coordinate systems, that permits to uniquely, and somewhat trivially, find a place for each data object. Think scatter plots, bar charts, line charts, parallel coordinates, etc. Some other visualization techniques, however, have more complex logics which require algorithms to find the “right” position for each data element. A notable example is treemaps, which starting from the somewhat simple initial formulation called “slice-and-dice” evolved into more complex formulations like squarified treemaps and voronoi treempas. A treemap is not based on coordinates, it requires a logic. One of my favorite papers ever is “Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies” where alternative treemap algorithms are proposed and rigorously evaluated. Another example is the super wide class of graph layout algorithms called force-directed layouts, where nodes and edges take place according to some iterative optimization procedures. This class is so wide that some specific conferences and books exist only to study new graph layouts. Many other examples exist: multidimensional scaling, self-organizing maps, flowmap layouts, etc. A lot has been done in the past but a lot needs to be done yet too, especially in better understanding how scale them up to much higher number of elements.
(Interactive) Data Abstraction. There are many cases where data need to be processed and abstracted before they can be visualized. Above all the need to deal with very large data sets (see “Extreme visualization: squeezing a billion records into a million pixels” for a glimpse of the problem). It does not matter how big your screen is, at some point you are going to hit a limit. One class of data abstraction is binning and data cubes (Tableau is mostly based on that for instance), which summarize and reduce the data by grouping them into intervals. Every visualization based on density has some sort of binning or smoothing behind the lines and the mechanism can turn out to be complex enough to require some sort of clever algorithm. More interesting is the case of data visualizations that have to adapt to user interaction. Even the most trivial abstraction mechanism may require some complex algorithm to make sure the visualization is updated in less than one second when the user needs to navigate from one abstraction level to another. A recent great example of this kind of work is “imMens: Real-time Visual Querying of Big Data“. Of course binning is not the only data abstraction mechanism needed in visualization. For instance, all sorts of clustering algorithms have been used in visualization to reduce data size. Notably, graph clustering algorithms can (sometime) turn some huge “spaghetti mess” into some more intelligible picture. For an overview of aggregation techniques in visualization you can read “Hierarchical Aggregation for Information Visualization: Overview, Techniques, and Design Guidelines” a very useful survey on the topic.
Smart Encoding. Every single visualization can be seen as an encoding procedure where one has to decide how to map data features to visual features. To build a bubble chart you have to decide which variable you want to map to the x-axis, y-axis, color, size, etc … You get the picture. This kind of process may become tedious or too costly when the number of data dimensions increases. Also, some users may not have a clue on how to “correctly” visualize some data. Encoding algorithms can do some of the encoding for you or at least guide you into the process. This kind of approach never became too popular in reality but visualization researchers have spent quite some time developing smart encoding techniques. Notably, Jock Mackinlay‘s seminal work: “Automating the design of graphical presentations of relational information” and the later implementation of the “Show Me” function in Tableau (Show Me: Automatic Presentation for Visual Analysis). Other examples exist but they tend to be more on the academic speculation side. One thing I have never seen though is the use of smart encoding as an artistic tool. Why not let the computer explore a million different encodings and see what you get? That would be a fun project.
Quality Measures. Even if this may seem a bit silly at first, algorithms can be used to supplement or substitute humans in judging the quality of a visualization. If you go back to all the previous classes I have described above, you can realize that in everyone there might be some little mechanism of quality judgment. Layout algorithms (especially the nondeterministic ones) may need to routinely check the quality of the current layout. Same thing for sorting algorithms like those needed to fin meaningful orderings in matrices and heatmaps. Data abstraction algorithms may need to automatically find the right parameters for the abstraction. And smart encoding algorithms may need to separate the wheat from the chaff by suggesting only encodings with quality above a given threshold. A couple of years back I have written a paper on quality metrics titled “Quality metrics in high-dimensional data visualization: An overview and systematization” to create a systematic description of how they are used in visualization. The topic is arguably a little academic but I can assure you it’s a fascinating one with lots of potential for innovation.
These are the four classes of algorithms I have currently identified in visualization. Are there more out there? I am sure there are and that’s partly the reason why I have written this post. If there are other uses for algorithms which I did not list here please comment on this post and feel free to suggest more. That would help me build a better picture. There is much more to say on this topic.