We all know by now that visualization, thanks to its amazing communication powers, can be used to communicate effectively and persuasively messages that stick in people’s minds. This same power, however, can also be used to mislead and misinform people very effectively! When techniques like non-zero baselines, scaling by area (a quadratic change used to represent a linear change), bad color maps, etc., are used, it is very easy to communicate the wrong message to your readers (whether on purpose or for lack of better knowledge). But how easy is it?
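To get a feeling for the "scaling by area" trick, here is a minimal Python sketch. The function names and the two data values are mine, made up for illustration. It compares a circle whose radius grows with the value against a circle whose area grows with the value.

```python
import math

def radius_for_value(value, scale=1.0):
    """Radius of a circle whose AREA is proportional to value."""
    return math.sqrt(scale * value / math.pi)

def area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

# Hypothetical data: the second value is twice the first.
small, large = 1.0, 2.0

# Misleading: the radius is set to the value, so the area grows quadratically.
misleading_ratio = area(large) / area(small)

# Honest: the area itself is proportional to the value.
honest_ratio = area(radius_for_value(large)) / area(radius_for_value(small))

print(misleading_ratio, honest_ratio)
```

With the radius tied to the value, a doubling in the data shows up as roughly a fourfold increase in ink, while the area-proportional version keeps the ratio at about two.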
I had a fantastic visit at ProPublica yesterday (thanks Alberto for inviting me and Scott for having me, you have an awesome team!) and we discussed lots of interesting things at the intersection of data visualization, literacy, statistics, journalism, etc. But there is one thing that really caught my attention. Lena very patiently (thanks Lena!) showed me some of the nice visualizations she created and then asked:
How do you evaluate visualization?
How do you know if you have done things right?
Heck! This is the kind of question I should be able to answer. I did have some suggestions for her, yet I realized there are no established methodologies. This came as a bit of a surprise to me, as I have been organizing the BELIV Workshop on Visualization Evaluation for a long time and I have been running user studies myself for quite some time now.
It’s somewhat surprising to me to notice how little we discuss the more technical side of data visualization. I often say that visualization is something that “happens in your head” to emphasize the role of perception and cognition and to explain why it is so hard to evaluate visualization. Yet, visualization happens a lot in the computer too, and what happens there can be extremely fascinating as well.
So, today I want to talk about algorithms in visualization. What’s the use of algorithms in visualization? When do we need them? Why do we need them? What are they for? Surprisingly, even in academic circles I have noticed that we tend either to avoid the question completely or to take the answer for granted. Heck, even the few books we have out there: how many of them teach the algorithmic side of visualization? None.
I have grouped algorithms into four broad classes. For each one I am going to give a brief description and a few examples.
Spatial Layout. The most important perceptual component of every visualization is how data objects are positioned on the screen, that is, the logic and mechanism by which a data element is uniquely positioned on the spatial substrate. Most visualization techniques have closed formulations, based on coordinate systems, that make it possible to uniquely, and somewhat trivially, find a place for each data object. Think scatter plots, bar charts, line charts, parallel coordinates, etc. Some other visualization techniques, however, have more complex logic that requires algorithms to find the “right” position for each data element. A notable example is treemaps, which, starting from the somewhat simple initial formulation called “slice-and-dice”, evolved into more complex formulations like squarified treemaps and Voronoi treemaps. A treemap is not based on coordinates; it requires a logic. One of my favorite papers ever is “Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies”, where alternative treemap algorithms are proposed and rigorously evaluated. Another example is the very wide class of graph layout algorithms called force-directed layouts, where nodes and edges are placed according to some iterative optimization procedure. This class is so wide that specific conferences and books exist only to study new graph layouts. Many other examples exist: multidimensional scaling, self-organizing maps, flow map layouts, etc. A lot has been done in the past but a lot still needs to be done, especially in better understanding how to scale these techniques up to much larger numbers of elements.
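To show what "it requires a logic" means in practice, here is a minimal Python sketch of a slice-and-dice style layout. It is my own simplified version, not code from any of the papers: the tuple-based tree format, the function name and the toy hierarchy are assumptions made for illustration. Each level splits its rectangle among the children in proportion to their size, alternating between splitting the width and splitting the height.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a tree as nested rectangles, alternating split direction per level.

    node is a (label, size, children) tuple; a leaf has an empty children list.
    Returns a list of (label, x, y, width, height) tuples for the leaves.
    """
    if out is None:
        out = []
    label, size, children = node
    if not children:
        out.append((label, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:
            # Even levels: split the width of the parent rectangle.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd levels: split the height of the parent rectangle.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

# Hypothetical toy hierarchy: sizes of inner nodes equal the sum of their leaves.
tree = ("root", 10, [
    ("a", 6, [("a1", 4, []), ("a2", 2, [])]),
    ("b", 4, []),
])

for rect in slice_and_dice(tree, 0, 0, 100, 50):
    print(rect)
```

The leaf areas come out proportional to the leaf sizes, which is the whole point of a treemap. What this simple version does not control is the aspect ratio of the rectangles, and that is the kind of problem the later formulations try to address.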
(Interactive) Data Abstraction. There are many cases where data need to be processed and abstracted before they can be visualized. Above all, there is the need to deal with very large data sets (see “Extreme visualization: squeezing a billion records into a million pixels” for a glimpse of the problem). It does not matter how big your screen is, at some point you are going to hit a limit. One class of data abstraction is binning and data cubes (Tableau is mostly based on that, for instance), which summarize and reduce the data by grouping them into intervals. Every visualization based on density has some sort of binning or smoothing behind the scenes, and the mechanism can turn out to be complex enough to require some sort of clever algorithm. More interesting is the case of data visualizations that have to adapt to user interaction. Even the most trivial abstraction mechanism may require some complex algorithm to make sure the visualization is updated in less than one second when the user needs to navigate from one abstraction level to another. A recent great example of this kind of work is “imMens: Real-time Visual Querying of Big Data”. Of course binning is not the only data abstraction mechanism needed in visualization. For instance, all sorts of clustering algorithms have been used in visualization to reduce data size. Notably, graph clustering algorithms can (sometimes) turn a huge “spaghetti mess” into a more intelligible picture. For an overview of aggregation techniques in visualization you can read “Hierarchical Aggregation for Information Visualization: Overview, Techniques, and Design Guidelines”, a very useful survey on the topic.
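The simplest form of binning can be sketched in a few lines of Python. This is only an illustration under my own assumptions (a regular grid, a hypothetical function name, four made-up points); it says nothing about how Tableau or imMens actually implement it.

```python
from collections import Counter

def bin_points(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid covering x_range and y_range."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # ignore points that fall outside the grid
        # min() keeps points lying exactly on the upper edge in the last cell.
        i = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        j = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[(i, j)] += 1
    return counts

# Hypothetical data: four points summarized on a 2-by-2 grid.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (1.0, 1.0)]
density = bin_points(points, (0.0, 1.0), (0.0, 1.0), 2, 2)
print(dict(density))
```

However many records you start with, what reaches the screen is at most nx times ny numbers. The hard part, as said above, is recomputing those numbers fast enough while the user zooms and filters.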
Smart Encoding. Every single visualization can be seen as an encoding procedure where one has to decide how to map data features to visual features. To build a bubble chart you have to decide which variable you want to map to the x-axis, y-axis, color, size, etc. You get the picture. This kind of process may become tedious or too costly when the number of data dimensions increases. Also, some users may not have a clue about how to “correctly” visualize some data. Encoding algorithms can do some of the encoding for you or at least guide you through the process. This kind of approach never became very popular in practice, but visualization researchers have spent quite some time developing smart encoding techniques. Notable examples are Jock Mackinlay’s seminal work, “Automating the design of graphical presentations of relational information”, and the later implementation of the “Show Me” function in Tableau (Show Me: Automatic Presentation for Visual Analysis). Other examples exist but they tend to be more on the academic speculation side. One thing I have never seen, though, is the use of smart encoding as an artistic tool. Why not let the computer explore a million different encodings and see what you get? That would be a fun project.
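Just to make the "let the computer explore" idea concrete, here is a toy Python sketch that enumerates every way of assigning four variables to four visual channels and ranks the results. Everything in it is hypothetical: the variable names, the importance and accuracy scores, and the scoring rule are made up for illustration and are not taken from APT or Show Me.

```python
from itertools import permutations

# Hypothetical variables of a bubble chart and the channels available.
variables = ["income", "life_expectancy", "population", "continent"]
channels = ["x", "y", "size", "color"]

# Made-up scores: how much each variable matters to the story, and how
# accurately each channel is assumed to be read by a viewer.
importance = {"income": 4, "life_expectancy": 3, "population": 2, "continent": 1}
accuracy = {"x": 3, "y": 3, "size": 2, "color": 1}

def score(encoding):
    """Reward encodings that put important variables on accurate channels."""
    return sum(importance[v] * accuracy[c] for v, c in encoding.items())

# Every one-to-one assignment of variables to channels.
candidates = [dict(zip(variables, perm)) for perm in permutations(channels)]
best = max(candidates, key=score)

print(len(candidates), best, score(best))
```

With four variables and four channels there are only 24 candidates, but the number grows factorially with the number of dimensions, which is exactly why this becomes tedious to do by hand.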
Quality Measures. Even if this may seem a bit silly at first, algorithms can be used to supplement or substitute humans in judging the quality of a visualization. If you go back to the classes I have described above, you will realize that in every one of them there might be some little mechanism of quality judgment. Layout algorithms (especially the nondeterministic ones) may need to routinely check the quality of the current layout. The same goes for sorting algorithms like those needed to find meaningful orderings in matrices and heatmaps. Data abstraction algorithms may need to automatically find the right parameters for the abstraction. And smart encoding algorithms may need to separate the wheat from the chaff by suggesting only encodings with quality above a given threshold. A couple of years back I wrote a paper on quality metrics titled “Quality metrics in high-dimensional data visualization: An overview and systematization” to create a systematic description of how they are used in visualization. The topic is arguably a little academic but I can assure you it’s a fascinating one with lots of potential for innovation.
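As a tiny example of what a quality measure can look like, here is a Python sketch of an overplotting score for a scatter plot. It is one simple measure I picked for illustration, not a metric taken from the paper above; the function name, the pixel grid and the sample points are assumptions.

```python
from collections import Counter

def overplotting_ratio(points, width, height):
    """Fraction of points that land on a pixel already used by another point.

    points are (x, y) pairs with both coordinates in [0, 1).
    """
    if not points:
        return 0.0
    pixels = Counter((int(x * width), int(y * height)) for x, y in points)
    hidden = sum(count - 1 for count in pixels.values())
    return hidden / len(points)

# Hypothetical data: the first two points fall on the same pixel.
points = [(0.10, 0.10), (0.101, 0.101), (0.5, 0.5), (0.9, 0.2)]
print(overplotting_ratio(points, 100, 100))
```

A layout or abstraction algorithm could compute a number like this after each step and, for example, switch to binning once the ratio goes above some threshold.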
These are the four classes of algorithms I have currently identified in visualization. Are there more out there? I am sure there are and that’s partly the reason why I have written this post. If there are other uses for algorithms which I did not list here please comment on this post and feel free to suggest more. That would help me build a better picture. There is much more to say on this topic.
There are three research papers which have drawn my attention lately. They all deal with automatic annotation of data visualizations, that is, adding labels to the visualization automatically.
It seems to me that annotations, as an integral part of a visualization design, have received somewhat little attention in comparison to other components of a visual representation (shapes, layouts, colors, etc.). A quick check of the books I have on my bookshelf seems to support my hypothesis. The only exception I found is Colin Ware’s Information Visualization book, which has a whole section on “Linking Text with Graphical Elements”. This is weird because, think about it, text is the most powerful means we have to bridge the semantic gap between the visual representation and its interpretation. With text we can clarify, explain, give meaning, etc.
Smart annotation is an interesting area of research because not only can it reduce the burden of manually annotating a visualization, but it can also reveal interesting patterns and trends we might not know about or, worse, that may go unnoticed. Here are the three papers (click on the images to see a higher resolution version).
Paper #1: “Just-in-time annotation of clusters, outliers, and trends in point-based data visualizations.” Kandogan, Eser. Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on. IEEE, 2012.
This annotation technique works on point-based visualizations. The system detects trends automatically by analyzing the visual information displayed on the screen (that is, patterns are detected in the visual space, not the data space) and tries to find a description for the observed trends. Once a description is found, the system overlays labels that convey this information. So, for instance, in the image above the algorithm finds visual clusters (groupings) and annotates them with the data values that best explain the trend (data dimensions and values that have a unique distribution in the cluster). The paper does not focus only on clusters: it provides techniques to annotate trends and outliers as well, and it describes the whole framework in such a way that it is easy to imagine how it can be extended to other domains and visualizations.
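To give a rough idea of the "find the values that explain a cluster" step, here is a Python sketch. It is not Kandogan's method: the cluster is assumed to be given already (in the paper it is detected in the visual space), and the scoring rule, the function name and the sample data are my own simplifications. It picks the dimension whose average inside the cluster deviates most from the overall average, measured in overall standard deviations.

```python
from statistics import mean, pstdev

def most_distinctive_dimension(rows, cluster_indices, names):
    """Return (name, cluster_mean) for the dimension that best sets the cluster apart."""
    best_name, best_score, best_mean = None, -1.0, None
    for d, name in enumerate(names):
        column = [row[d] for row in rows]
        spread = pstdev(column)
        if spread == 0:
            continue  # a constant dimension cannot explain anything
        cluster_mean = mean(column[i] for i in cluster_indices)
        # Distance between cluster and global mean, in global standard deviations.
        score = abs(cluster_mean - mean(column)) / spread
        if score > best_score:
            best_name, best_score, best_mean = name, score, cluster_mean
    return best_name, best_mean

# Hypothetical data: (weight, mpg) for six cars; the cluster holds the light ones.
names = ["weight", "mpg"]
rows = [(1.0, 30), (1.1, 31), (1.0, 12), (3.0, 29), (3.1, 11), (2.9, 30)]
cluster = [0, 1, 2]

name, value = most_distinctive_dimension(rows, cluster, names)
print(f"{name} around {value:.1f}")
```

The resulting string is the kind of label that could be overlaid next to the cluster on the screen.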
Paper #2: “Contextifier: Automatic Generation of Annotated Stock Visualizations.” Hullman, Jessica, Nicholas Diakopoulos, and Eytan Adar. ACM Conference on Human Factors in Computing Systems (CHI). May, 2013.
Contextifier automatically annotates stock market timelines (like the one shown above) by automatically discovering salient trends in the charts (peaks and valleys) and corresponding news that might be relevant to explain each trend. The system is based on an input article and a news corpus. The input article is used as a query to find relevant news in the corpus and to match them against salient features in the graph. Articles and trends are matched to decide which time points should be annotated. These points are subsequently annotated with the most relevant news in the corresponding time frame. The paper also contains a very interesting analysis of how visualization designers annotate their visualizations. The outcome of this analysis is used to inform the design of the annotation engine.
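The "peaks and valleys" part is easy to picture with a small Python sketch. This is a deliberately naive detector of my own, not the saliency measure used in Contextifier; the function name, the threshold parameter and the prices are made up.

```python
def salient_extrema(series, min_change):
    """Return (index, kind) for local peaks and valleys that stand out.

    A point qualifies when it differs from both neighbours by at least min_change.
    """
    found = []
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]
        right = series[i] - series[i + 1]
        if left >= min_change and right >= min_change:
            found.append((i, "peak"))
        elif left <= -min_change and right <= -min_change:
            found.append((i, "valley"))
    return found

# Hypothetical daily closing prices.
prices = [10, 11, 15, 11, 10, 6, 10, 10.5]
print(salient_extrema(prices, min_change=3))
```

Each index returned here is a candidate time point to annotate. Matching those points with the right news article is the hard and interesting part, and it is what the paper is really about.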
Paper #3: “Graphical Overlays: Using Layered Elements to Aid Chart Reading.” Kong, Nicholas, and Maneesh Agrawala. Visualization and Computer Graphics, IEEE Transactions on 18.12 (2012): 2631-2638. [Sorry no free access to this one.]
Graphical Overlays actually does much more than annotate a chart with text: it’s a whole system to add information on top of existing charts to aid their reading. So, for instance, besides adding notes to a chart to identify potentially interesting trends, it also adds grids, highlights elements of a specific type (e.g., one set of bars in a bar chart), and adds summary statistics (like an average line in a time chart). The system works entirely on image data, which means it does not require direct access to the original data used to create the chart. In the authors’ words: “Our approach is based on the insight that generating most of these graphical overlays only requires knowing the properties of the visual marks and axes that encode the data, but does not require access to the underlying data values. Thus, our system analyzes the chart bitmap to extract only the properties necessary to generate the desired overlay.”
These three papers present very clever mechanisms to annotate visualizations in different contexts and with different purposes. I suggest you take a look at the papers because they provide numerous interesting technical details. Beyond the technical aspects, though, I believe it is interesting that several researchers are independently focusing on visualization annotation. Annotation is extremely important and I think we have not spent enough energy exploring its potential and challenges. I also think there is an educational gap we should cover, that is, how do we teach our students when, how and why a visualization should be annotated?
I am curious to hear from you what you think. What do you think about the papers I presented? And what do you think about annotation in general? How do you deal with annotations yourself?
(In my last post I introduced the idea of regularly posting research material in this blog as a way to bridge the gap between researchers and practitioners. Some people kindly replied to my call for feedback and the general feeling seems to be like: “cool go on! rock it! we need it!”. Ok, thanks guys your encouragement is very much needed. I love you all. So, here is a “researchy” post. It is not the same style I’ve used in my research posts in infosthetics but I think you will find it useful anyway.)
Even if I am definitely not a veteran of infovis research (far from it) I started reading my first papers around the year 2000 and since then I’ve never stopped. One thing I noticed is that some papers recur over and over and they really are (at least in part) the foundation of information visualization. Here is a list of those that:
- come from the very early days of infovis
- are foundational
- are cited over and over
- I like a lot
Of course this doesn’t mean these are the only ones you should read if you want to dig into this matter. Some other papers are foundational as well. For sure a side effect of the maturation of this field is that some newer papers are more solid and deep, and I had to restrain myself from including them in the list. But this is a collection of classics. A list of papers you just cannot avoid knowing unless you want to risk a bad impression at VisWeek (ok ok it’s a joke … but there’s a pinch of truth in it). A retrospective. Definitely a must read. Call me nostalgic.
Advice: in order to really appreciate them you have to keep in mind that they were all written during the ’90s (some even in the ’80s!).
Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. William S. Cleveland; Robert McGill (PDF)
Please don’t tell me you don’t know this one! This is the most classic of the classics. Cleveland is one of the fathers of statistical graphics and he wrote two groundbreaking books, The Elements of Graphing Data and Visualizing Data, based on the research carried out in this paper.
- What’s in it? The paper describes a series of experimental user studies to understand how basic visual primitives like length, size, color, etc., compare in terms of how well they convey quantitative information.
- Why is it important? With this paper Cleveland introduced (quite vigorously) the idea of visualization based on rigorous experimentation. People like Bertin had started ranking visual features many years before, but never before had this ranking been validated with a scientific method.
- What can you learn? The basics of data visualization. That visual encoding is hard stuff and you shouldn’t take it too lightly. And that visual primitives do have a ranking that you have to take into account if you want to design effective data visualizations.
The Structure of the Information Visualization Design Space. Stuart K. Card and Jock Mackinlay (PDF)
I suspect this is somewhat little known compared to the previous one. Card and Mackinlay are among the founders of information visualization and the content of this paper is repeated and reworked (maybe in a better shape) in the book Readings in Information Visualization.
- What’s in it? The paper describes the basic components that make up a visualization and how to put them together to build a new design.
- Why is it important? Because it is one of the first attempts to describe the visualization space in a systematic way.
- What can you learn? You learn that in order to design innovative visualizations you have to know what the building blocks are and how to connect them. In my experience this is one of the most important, and often neglected, skills.
Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays. Christopher Ahlberg and Ben Shneiderman (PDF)
Ok I am ready to accept you don’t know this paper yet, but please don’t tell me you’ve never heard of Ben Shneiderman and Christopher Ahlberg. If you haven’t, you’ve been living under a rock. Let me guess … you’ve heard of Ben Shneiderman but not of Christopher Ahlberg. Well, Christopher is the founder of Spotfire, probably the first commercial success of information visualization ever. And Spotfire was based on the research described in this paper.
- What’s in it? It describes one of the first attempts to make visualization dynamic and controlled by the user through interactive queries.
- Why is it important? The whole idea of dynamic filtering had a huge impact on the way data is visualized interactively. We can see the effect of this idea everywhere. Before that, there were queries in a database. After that, it was clear how powerful interactive visualization could be.
- What can you learn? You learn how powerful data visualization can be when interactive capabilities are added to static representations. After more than 10 years we are still learning this lesson.
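The core of a dynamic query is surprisingly small. Here is a minimal Python sketch, assuming each slider corresponds to a (low, high) range on one attribute. The function name and the film records are hypothetical and not taken from the paper.

```python
def dynamic_query(records, ranges):
    """Keep the records whose attributes fall inside every slider range.

    ranges maps an attribute name to a (low, high) pair, one per slider.
    """
    return [
        r for r in records
        if all(low <= r[attr] <= high for attr, (low, high) in ranges.items())
    ]

# Hypothetical records, as they might sit behind a starfield display.
films = [
    {"title": "A", "year": 1962, "length": 95},
    {"title": "B", "year": 1985, "length": 140},
    {"title": "C", "year": 1991, "length": 102},
]

# Moving a slider simply means calling the filter again with new ranges.
visible = dynamic_query(films, {"year": (1980, 1995), "length": (90, 120)})
print([f["title"] for f in visible])
```

What made the idea powerful is the tight coupling: the filter is re-run and the display redrawn on every slider movement, so the user sees the effect of a query immediately instead of typing it and waiting for a result.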
High-Speed Visual Estimation Using Preattentive Processing. C. G. Healey, K. S. Booth and J. T. Enns (PDF)
I fell in love with Chris Healey’s work very early in my journey into visualization. It always struck me how innovative and intriguing his research was. His specialty is what he calls “Perceptual Visualization”: the study of visualization based on core human vision principles. His page about perception in visualization is a real classic.
- What’s in it? It describes how the concept of preattentive processing can help in guiding the design of visualization and user interfaces. It contains several experimental studies.
- Why is it important? Nobody before the work of Healey (maybe Colin Ware?) pushed the limits of perception applied to visualization so far. I bet many of the results of his studies have yet to be exploited.
- What can you learn? You learn what preattentive processing is and how to apply it to the design of information visualizations. (As a byproduct you might also learn how tough this stuff is!)
Automating the Design of Graphical Presentations of Relational Information. Jock Mackinlay (PDF)
I mentioned Jock already in one of the papers above. Jock is not only behind some of the fundamental research in visualization and human-computer interaction but he is also one of the minds behind Tableau Software. This paper can be considered a very early draft of what, through several other steps, became Tableau today. In some sense it can still be considered visionary today, since the dream of a tool that automatically adapts to data is still very far off (if it ever comes).
- What’s in it? Jock presents a system called APT (A Presentation Tool) whose purpose is to automatically design effective visualizations by matching data features with visual features through the use of logic rules.
- Why is it important? It is not only important because it contains some visionary perspective in visualization but also because part of the work was focussed on the definition of visual primitives (starting from the work of Bertin) and on the way data features should match visual features.
- What can you learn? Knowing how to match data features to visual features is one of the most important skills of knowledgeable data visualization experts.
How NOT to Lie with Visualization. Bernice E. Rogowitz, Lloyd A. Treinish (PDF).
How can you not love a paper with a title like this? I’d give it a prize for best marketing in research paper design. This work is fully focussed on color use and perception but its implications extend beyond the scope of color mapping. This work was part of the development of one of the earlier data visualization systems, called OpenDX and developed by IBM, which included a module called PRAVDA for assisted color mapping.
- What’s in it? A detailed explanation of how the eye can be misled if the wrong (color) mapping is used. Plus a thorough discussion of how to build effective color scales that take into account data distribution.
- Why is it important? I still see a lot of people using color badly. By reading this I hope this number will get smaller and smaller. It is also an early example of how automatic computation and interaction can go happily together.
- What can you learn? This paper will give you solid arguments about why mapping color badly is bad. Plus you will learn how to build effective color scales. On a side note, the same is true for every other visual feature you want to use.
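To illustrate what "taking the data distribution into account" can mean, here is a small Python sketch that compares a linear color scale with a quantile-based one. This is only one simple strategy I chose for illustration; it is not the set of rules implemented in PRAVDA, and the function names and sample values are assumptions.

```python
from bisect import bisect_right

def linear_index(value, lo, hi, n_colors):
    """Color index from evenly spaced breakpoints between lo and hi."""
    return min(int((value - lo) / (hi - lo) * n_colors), n_colors - 1)

def quantile_breaks(values, n_colors):
    """Breakpoints that put roughly the same number of values in each color."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_colors] for k in range(1, n_colors)]

def quantile_index(value, breaks):
    """Color index given the quantile breakpoints."""
    return bisect_right(breaks, value)

# Hypothetical skewed data: one extreme value dominates the range.
values = [1, 2, 3, 4, 5, 6, 7, 100]

linear = [linear_index(v, min(values), max(values), 4) for v in values]
breaks = quantile_breaks(values, 4)
by_quantile = [quantile_index(v, breaks) for v in values]

print(linear)
print(by_quantile)
```

With the linear scale the single extreme value gets a color of its own and everything else collapses into the first color, so most of the structure in the data becomes invisible. The quantile version spreads the values across all four colors, at the price of hiding how extreme the outlier really is. Which one is "right" depends on the task, and that is exactly the kind of decision the paper helps you reason about.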
The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Ben Shneiderman (PDF).
This is a funny paper. Ben has the ability to write a paper with absolute nonchalance and make it massively popular. It’s neither too technical nor too elaborate. He straightforwardly proposed a classification and the famous infovis mantra and it became one of the biggest classics of infovis. If I remember correctly, I once heard him talk about this paper and explain how unexpected this success was.
- What’s in it? A classification of information visualization techniques according to data type. More importantly the explanation of the visual information seeking mantra: “overview first, zoom and filter, details on demand”.
- Why is it important? The visual information seeking mantra has been the reference model for interactive visualization for 15 years. Hundreds of systems have been developed under this paradigm.
- What can you learn? The classification by data type will help you mentally organize visual designs into classes (even if I must admit I am not a big fan of this classification). The visual information seeking mantra will guide you in designing and evaluating interactive visualizations: do you have an overview? zoom and filter capabilities? details on demand? tools to relate things? history facilities?
That’s all guys. Pufff … it’s been a marathon to write such a long and detailed list. That’s the best I could think of and I really really hope this will be tremendously helpful to you. Go on, read them, and feel free to expand the list. Please remember:
- Let me know if something is not clear. I’d really love to help you.
- Let me know if you don’t agree on something. I’d be happy to hear and learn from you.
- If you have other papers to suggest, please do!
Thanks a lot guys. Have fun with it.