Category Archives: Research

What do we talk about when we talk about “Data Exploration”?

There is an old adage in the InfoVis / Visual Analytics community I have heard a zillion times: “visualization is needed/useful when people don’t have a specific question in mind“. For many years I took this as a given. Then, over time, as I have grown more experienced, I have started questioning the whole concept: why would someone look at a given data set if he or she has no specific goal or question in mind? It does not make sense.

This is an aspect of visualization that has puzzled me for a long time. An interesting conundrum that I believe is still largely unsolved. One of those things many people say, but nobody really seems to have grasped in full depth.

Here is my humble attempt at putting some order into this matter. Let’s start with a definition. Exploration, roughly speaking, is the act of traveling through an unfamiliar area in order to learn about it.

This definition introduces a couple of very important features: familiarity (“through an unfamiliar area“) and learning (“in order to learn about it“). If we take this definition as our main guidance, we can say that data visualization is particularly helpful when we use it to look into unfamiliar data in order to learn something from it.

I suspect there are (at least) three main situations in which this can happen.

  1. Need to familiarize with a new data set (“what does it look like?“). Anyone who dabbles with data goes through this: you receive or find a new data set and the first thing you need to do is figure out what information it contains. How many fields? What type of fields? What is their meaning? Are there any missing values? Is there anything I don’t actually understand? How are the values distributed? Is there any temporal or geographical information? Are we actually in the presence of some kind of network or relational structure? Etc. One crucial, and often overlooked, aspect of this activity is “data semantics”. I personally find that understanding the meaning of the various fields and the values they contain is such a crucial and hard activity at the beginning; one that often requires many back-and-forth discussions and clarifications with domain experts and data collectors. (A minimal sketch of this first pass is shown right after this list.)
  2. Hunting for “something” interesting (“is there anything interesting here?“). I suspect this is what people really mean most of the time when they talk about “data exploration“: the feeling that something interesting may be hidden in the data and that some exploratory work is needed to figure it out. But when does this actually happen? What kind of real-world activities are characterized by this desire to find “something”? I am not sure I have an all-encompassing answer to that, but I am familiar with at least two examples: data journalism and quantified self. In data journalism it is very common to first get your hands on a potentially “juicy” data set and then try to figure out what interesting stories may hide in it (Panama Papers, Clinton’s emails, etc.). I have observed this in our collaboration with ProPublica when hunting for stories about how people review doctors on Yelp. In quantified self you often look at your data to see if you can detect anything unexpected. I have experienced the same when looking at personal data I have collected about my deep work habits (or lack thereof). Sometimes we know there must be something interesting in a given data set, and visualization guides us in the formulation of unexpressed questions. The interesting aspect of this activity is that the outcome is often more (and better) questions, not answers.
  3. Going off on a tangent (“oh … this may be interesting too!“). There is one last, subtler, kind of data exploration. You start with a specific question in mind but, as you go about it, you find something interesting that triggers an additional question you had not anticipated. This is the power of visual data analysis: it forces you to notice something new, and you have to follow the path. This happens to me all the time (and I hope it’s not just a sign of my ADD). Some of these are useless diversions. Some of them actually lead to some pretty unique gems!
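To make the first modality a bit more concrete, here is a minimal first-pass profiling sketch in pandas. It is only an illustration of the questions listed above: the file name and the keyword-based guess at temporal and geographical fields are made-up assumptions, not a prescribed workflow.

```python
# A minimal first-contact pass over a new data set (the file name is hypothetical):
# how many fields, what types, missing values, distributions, and candidate
# time/space columns guessed from column names alone.
import pandas as pd

df = pd.read_csv("new_dataset.csv")

print(df.shape)                         # how many rows and fields?
print(df.dtypes)                        # what type is each field?
print(df.isna().sum())                  # are there any missing values?
print(df.describe(include="all").T)     # how are the values distributed?

candidates = [c for c in df.columns
              if any(k in c.lower() for k in ("date", "time", "lat", "lon", "country", "city"))]
print("possible temporal/geographical fields:", candidates)
```

None of this replaces the conversations with domain experts about data semantics, of course; it only tells you where those conversations need to start.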

These three modalities can of course overlap a lot. I am also sure there are other situations we can describe as data exploration which I am not covering here (in case you have some suggestions please let me know!).

I want to conclude by saying that this is an incredibly under-explored area of data visualization. More advances are needed in at least three directions.

  • First, we need to understand data exploration much better as a process and, if possible, create models able to describe it in useful abstract terms. In visualization research we often refer to Card and Pirolli’s “Sensemaking Loop” to describe this kind of open-ended and incremental activity, but for some reason every time I try to use it, it does not seem to describe what I actually observe in practice (this deserves its own post).
  • Second, we need to develop more methods, techniques and tools to support interactive data exploration. I bet there are lots of “latent needs” waiting to be discovered out there. This is another area where I believe we, visualization researchers, have made surprisingly little progress. We have built a lot of narrow solutions that work for 3-5 people but very few general-purpose methods and techniques. We need more of that (this also deserves its own post)!
  • Third, we need to find ways to teach exploratory data analysis systematically to others, in ways that make the process as effective as possible. I am appalled at how little guidance and material there is out there on teaching people how to do the actual analysis work. Statisticians are fixated on confirmatory analysis and regard exploration as a second-class citizen. Visualization researchers are too busy building stuff and have done too little to teach others how to do the actual ground work. This is a problem we need to solve. It’s for this reason that next semester I will be teaching a new course with this specific purpose. Stay tuned.

That’s all I had to say about Data Exploration.

And you? What is your take? What is data exploration for you? And how can we improve it?

Take care.

11 (Papers + Talks) Highlights from IEEE VIS’16


Hey, it took me a while to create this list! But better late than never. Here is my personal list of 11 highlights from the IEEE VIS’16 Conference.

If you did not have a chance to attend the conference you can start from here and then look into the following links:

Papers

Surprise! Bayesian Weighting for De-Biasing Thematic Maps.
Michael Correll, Jeffrey Heer.
https://github.com/uwdata/bayesian-surprise

Did you ever stumble into one of those choropleth maps in which the distribution of a given quantity is shown (say, the number of cars from a given manufacturer) but the only signal you can actually see is population density? This is the kind of problem Surprise! addresses: situations in which the quantity one wants to depict is confounded by another variable. To solve this problem, Surprise! uses an underlying Bayesian model of how the quantity should be distributed and visualizes deviations from the model rather than the quantity itself (hence the name Surprise!).

I think this is a brilliant idea that addresses a super common problem. I have seen people stumble into this problem countless times and I am glad we finally have a paper that explains the phenomenon and proposes a solution. The only issue is that visualizing surprise is not as natural as visualizing the actual quantity, which is normally what people expect. One open challenge, then, is how to communicate both values at the same time.
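To give a rough sense of what ends up on the map, here is a simplified stand-in for the idea, not the paper’s actual Bayesian formulation: each region gets a signed score measuring how far its observed per-capita rate deviates from the rate you would expect if population alone explained the counts.

```python
# A simplified stand-in for the de-biasing idea (not the paper's exact model):
# score each region by a signed KL divergence between its observed per-capita rate
# and the global rate expected under a "population explains everything" base model.
import numpy as np

def signed_surprise(counts, populations):
    counts = np.asarray(counts, dtype=float)
    populations = np.asarray(populations, dtype=float)
    global_rate = counts.sum() / populations.sum()      # expected rate under the base model
    expected = global_rate * populations                # expected counts per region
    p = counts / populations                            # observed per-capita rate
    q = np.full_like(p, global_rate)                    # expected per-capita rate
    eps = 1e-12
    kl = (p * np.log((p + eps) / (q + eps))
          + (1 - p) * np.log((1 - p + eps) / (1 - q + eps)))
    return np.sign(counts - expected) * kl              # sign: above or below expectation

# Color the choropleth by this score instead of by raw counts,
# which would mostly re-draw population density.
print(signed_surprise([120, 30, 5], [100_000, 20_000, 1_000]))
```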

Vega-Lite: A Grammar of Interactive Graphics.
Arvind Satyanarayan, Dominik Moritz, Kanit “Ham” Wongsuphasawat, Jeffrey Heer.
https://github.com/vega/vega-lite

The IDL team has done an astounding job over the years at developing an ecosystem of frameworks and tools that make the development of advanced visualizations easier and faster. Vega-Lite builds on top of Vega, which they presented last year, and proposes a much simpler language with extremely powerful functions for generating interactive graphics (with linked views, selections, filters, etc.). Arvind and Dominik gave a live demo and I have to say I am really impressed. While most existing frameworks focus on the representation part of visualization, this one focuses on interaction, and as such it covers a really big gap. I am curious to see what people will manage to build using Vega-Lite. If you have built interactive visualizations in the past you certainly know that the interaction part is by far the hardest and messiest one. Vega-Lite seems to make it much simpler and more straightforward than it used to be. I am looking forward to trying it out!
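To show what declarative interaction looks like in practice, here is a toy example written with Altair, the Python front end that compiles to Vega-Lite specifications (my own example, not one from the paper; it assumes the altair and vega_datasets packages are installed): a brush on a scatter plot filters the bar chart below it.

```python
# A toy linked-views example (mine, not from the paper), written with Altair,
# which compiles to a Vega-Lite specification: an interval brush on the scatter
# plot filters the bar chart underneath.
import altair as alt
from vega_datasets import data

cars = data.cars()
brush = alt.selection_interval()                      # drag-to-select brush

points = alt.Chart(cars).mark_point().encode(
    x="Horsepower:Q",
    y="Miles_per_Gallon:Q",
    color=alt.condition(brush, "Origin:N", alt.value("lightgray")),
).add_params(brush)

bars = alt.Chart(cars).mark_bar().encode(
    x="count():Q",
    y="Origin:N",
).transform_filter(brush)

(points & bars).save("linked_views.html")             # open the HTML file in a browser
```

The entire linked interaction is a handful of declarative statements; there is no event-handling code at all, which is exactly the gap the paper addresses.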

PROACT: Iterative Design of a Patient-Centered Visualization for Effective Prostate Cancer Health Risk Communication.
Anzu Hakone, Lane Harrison, Alvitta Ottley, Nathan Winters, Caitlin Guthiel, Paul KJ Han, Remco Chang.
http://web.cs.wpi.edu/~ltharrison/files/hakone2016proact.pdf

PROACT is a simple visualization dashboard that helps patients with prostate cancer understand their disease and make informed decisions about choosing between a conservative treatment and surgery. The paper does a great job of describing the context and the challenges associated with such a delicate situation, and how visualization systems can be used by doctors and patients to enhance communication.

I consider this paper super relevant. If you look at the images you won’t be impressed by fancy colorful views and interactions, but the system has been demonstrated to be really effective in a very important and critical setting. It also raises awareness about issues we rarely discuss in visualization, especially how to deal with emotions and how to design systems that inform while being careful about the impact such knowledge may have on viewers.

TextTile: An Interactive Visualization Tool for Seamless Exploratory Analysis of Structured Data and Unstructured Text.
Cristian Felix, Anshul Pandey, Enrico Bertini.
http://texttile.io

This is the latest product coming out of my lab; I plan to write a separate blog post on it later on. TextTile stems from multiple interactions we had with journalists and data analysts who need to look into data sets containing textual data together with tabular data (e.g., product reviews and surveys). In TextTile we propose a model that systematically describes how one can interactively query the data starting from the text and reflect the results on the data table, and vice versa. The tool realizes this model in an interactive visual user interface with a mechanism similar to what is found in Tableau: the user creates queries and plots by dragging data fields onto a predefined set of operations. I suggest you try it yourself! You can find a demo here: http://texttile.io/.
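If you prefer to see the underlying idea rather than the interface, here is a tiny pandas illustration of moving between the two sides of such a data set (my own sketch with made-up columns, not TextTile’s actual engine):

```python
# A tiny illustration of cross-querying text and structured fields (my sketch,
# with made-up data; not TextTile's engine): filter from the text side and
# summarize the table, then filter from the table side and summarize the text.
import pandas as pd

reviews = pd.DataFrame({
    "product": ["A", "A", "B", "B", "C"],
    "rating":  [1, 2, 5, 4, 1],
    "review_text": [
        "battery died after a week", "battery drains fast", "great battery life",
        "solid build, decent battery", "screen cracked on arrival",
    ],
})

# Text -> table: reviews mentioning "battery", summarized by a structured field.
battery = reviews[reviews["review_text"].str.contains("battery", case=False)]
print(battery.groupby("product")["rating"].mean())

# Table -> text: low-rated reviews, summarized by the words they use.
low = reviews[reviews["rating"] <= 2]
print(low["review_text"].str.lower().str.split().explode().value_counts().head(10))
```

TextTile wraps this back-and-forth in a visual interface and a much richer set of text and table operations, but the basic loop is the same.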

Evaluating the Impact of Binning 2D Scalar Fields.
Lace Padilla, P. Samuel Quinan, Miriah Meyer, and Sarah H. Creem-Regehr.
https://www.cs.utah.edu/~miriah/publications/binning-study.pdf

I chose to include this paper because I found its message extremely inspiring. In visualization research we often cite a principle (proposed by Jock Mackinlay) called the “expressiveness principle“. The principle states that “a visual encoding should express all of the relationships in the data, and only the relationships in the data“. This paper shows that this principle may not always hold. The paper describes experiments in which performance improves when a continuous value is presented with discrete color steps rather than a continuous ramp; a solution that breaks the expressiveness principle. This may seem a minor detail but I believe it demonstrates a much bigger idea: there is lots of conventional wisdom ready to be debunked and it is up to us to hunt for this kind of research. Every single scientific endeavor is a loop of construction and destruction of past theories and ideas. This paper is a great example of the destruction part of the cycle. We need more papers like this one!
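If you want to see what the comparison looks like, here is a small matplotlib sketch (my own illustration, not the paper’s stimuli) that renders the same 2D scalar field once with a continuous colormap and once with the colormap binned into discrete steps:

```python
# My own illustration (not the paper's stimuli): the same 2D scalar field encoded
# with a continuous colormap and with the colormap binned into a few discrete steps.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
field = np.exp(-(x**2 + y**2) / 4) + 0.3 * np.sin(2 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.imshow(field, cmap="viridis")                      # continuous encoding
ax1.set_title("continuous")

levels = np.linspace(field.min(), field.max(), 7)      # 6 discrete bins
ax2.imshow(field, cmap="viridis", norm=BoundaryNorm(levels, ncolors=256))
ax2.set_title("binned (6 steps)")
plt.show()
```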

VizItCards: A Card-Based Toolkit for Infovis Design Education.
Shiqing He and Eytan Adar.
http://www.cond.org/vizitcards.pdf

What a lovely, lovely project! If you have ever tried to teach visualization you know how hard it is. Students just don’t get it if you give them lectures and lots of theory. Visualization needs to be learned by doing. But organizing a course around doing, in a systematic way, is hard. Damn hard! Shiqing and Eytan have done an amazing job at making this process systematic and easy to adopt. They developed a toolkit and a set of cards instructors can use to guide students through a series of design workshops. One aspect I like a lot, other than the cards idea, is that many exercises were created by starting from an existing data visualization project and “retrofitting” it to its original “amorphous” state of having a bunch of data and a vague goal. This is what the students are shown at the beginning, and at the end of the process they can compare their results with those developed in the original project. You can find the toolkit here: http://vizitcards.cond.org/supp/index.html. I am planning to adopt some of it myself the next time I teach my course (too late for this semester).

Colorgorical: Creating discriminable and preferable color palettes for information visualization.
Connor C. Gramazio, David H. Laidlaw, Karen B. Schloss.
http://vrl.cs.brown.edu/color

Creating categorical color palettes is a hard task, and doing it manually is even harder. Colorgorical is a new color selection tool that lets you build categorical color palettes by tuning a number of useful and interesting parameters, including perceptual difference, name difference, pair preference, and name uniqueness. An internal algorithm tries to optimize all the desired parameters and generates a new color palette for you. You can also add starting colors to make sure colors you want are actually present in the final palette. I strongly suggest you play with it! They have a nice web site explaining all the parameters and a simple interface to generate new palettes.
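To get a feel for the kind of optimization involved, here is a toy greedy version of the idea (mine, not Colorgorical’s algorithm): it optimizes only one criterion, color difference, and measures it crudely in RGB, whereas the real tool scores perceptual difference in CIELAB together with name difference, pair preference, and name uniqueness.

```python
# A toy, greedy palette builder (not Colorgorical's algorithm): repeatedly add the
# candidate color that is farthest from every color already in the palette.
# Only crude RGB distance is optimized here; the real tool combines perceptual
# difference in CIELAB with name difference, pair preference, and name uniqueness.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.random((2000, 3))                    # random RGB candidates in [0, 1]

def greedy_palette(candidates, k, seed_colors=None):
    palette = [np.asarray(c, dtype=float) for c in (seed_colors or [candidates[0]])]
    while len(palette) < k:
        dist = np.linalg.norm(candidates[:, None] - np.array(palette)[None], axis=2)
        farthest = np.argmax(dist.min(axis=1))        # candidate farthest from the palette
        palette.append(candidates[farthest])
    return np.array(palette)

# Usage: build a 6-color palette that must include a given starting color.
print(greedy_palette(candidates, k=6, seed_colors=[[0.9, 0.1, 0.1]]))
```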

Talks

An Empire Built On Sand: Reexamining What We Think We Know About Visualization.
Robert Kosara.
https://eagereyes.org/papers/an-empire-built-on-sand

Robert’s talk was more of a performance than a talk. I really, really enjoyed it. His talk at BELIV was all focused on the idea that we in vis regard some ideas as truth and keep repeating them even when the evidence for them is weak or nonexistent. Robert kept repeating, in a wonderfully coordinated sequence, “how do we know that?” … “how do we know that?” … “how do we know that?“. I loved it. Too bad the talk was not recorded, but you can find the accompanying paper here. Kudos to Robert for assuming the role of contrarian at vis. We really need people like him who do not hold back, speak with candor, and are ready to yell “the emperor has no clothes”.

We Should Never Stop BELIVing: Reflections on 10 Years of Workshops on the Esoteric Art of Evaluating Information Visualization.
Enrico Bertini.
http://bit.ly/beliv-keynote

Here is another one from yours truly. I started the BELIV workshop on evaluation in vis in 2006 with Giuseppe Santucci (my PhD advisor) and Catherine Plaisant, and the organizers kindly asked me to give a keynote for the 10-year anniversary. If you click the URL above you can watch the entire talk. I tried to be funny and also to give a sense of how much progress we have made and what may come next. Evaluation in visualization is a continuously evolving endeavor and there is much to learn and perfect. The vis community has been receptive to new ideas on how to conduct empirical research and I predict we will see a lot of innovation in the coming years. Let me know what you think if you watch the video!

Capstone Talk: The three laws of communication.
Jean-luc Doumont.
http://www.principiae.be/

Wow! I had absolutely no idea who Jean-Luc was before I entered the room and started listening to his talk. This is by far one of the best capstone talks I have ever attended at VIS, if not the best. Jean-Luc gave a talk on how to convey messages effectively and organized it around a number of principles he has developed over his years of training people in effective communication. This guy knows what he is talking about. His body language, the way he expresses his thoughts, the quality and density of information in what he says, the style of his slides: everything is great. His work can inform any professional who needs to communicate information better, be it visual or verbal. He has a fantastic book which looks very much like Tufte’s but is more about general communication. If you have never heard of him take a look at his work; he is amazing … and super fun!

Communicating Methods, Results, and Intentions in Empirical Research.
Jessica Hullman.
http://steveharoz.com/publications/vis2016-panel/improve-empirical-research.html

Jessica is doing some of the most interesting work in visualization. Her blend of core statistical concepts and visualization is very much needed and represents one of the most interesting recent trends in vis: how to use vis to communicate statistics better and, at the same time, how to use statistics to do better vis research. In her short talk Jessica raised a number of important points about how we communicate research, not only to others but also to ourselves, and how we can introduce practices that reduce the chances that we are fooling ourselves. The world of experimental research and statistics is changing very fast and we are witnessing a wave of healthy self-criticism and reform. While this is true for science in general, the world of visualization research is also very receptive to what is happening and Jessica is one of the few vis people helping us make sense of it.

That’s all folks! I hope you’ll find these projects inspiring!

Paper: How Deceptive Are Deceptive Visualizations?


We all know by now that visualization, thanks to its amazing communication powers, can be used to communicate messages effectively and persuasively so that they stick in people’s minds. This same power, however, can also be used to mislead and misinform people very effectively! When techniques like non-zero baselines, scaling by area (quadratic change to represent linear change), or bad color maps are used, it is very easy to communicate the wrong message to your readers (whether that is done on purpose or for lack of better knowledge). But how easy is it?
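As a quick illustration of the first of these techniques (my own toy example, not a figure from the paper), here is the same pair of values plotted with a zero baseline and with a truncated one:

```python
# My own toy example (not from the paper) of the non-zero-baseline trick: the same
# two values look roughly comparable with a zero baseline and dramatically different
# when the y-axis starts at 95.
import matplotlib.pyplot as plt

labels, values = ["A", "B"], [100, 103]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(labels, values)
ax1.set_ylim(0, 110)
ax1.set_title("zero baseline")

ax2.bar(labels, values)
ax2.set_ylim(95, 104)
ax2.set_title("truncated baseline")
plt.show()
```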

Continue reading

How do you evaluate communication-oriented visualization?

Had a fantastic visit at ProPublica yesterday (thanks Alberto for inviting me and Scott for having me, you have an awesome team!) and we discussed lots of interesting things at the intersection of data visualization, literacy, statistics, journalism, etc. But there is one thing that really caught my attention. Lena very patiently (thanks Lena!) showed me some of the nice visualizations she created and then asked:

How do you evaluate visualization?

How do you know if you have done things right?

Heck! This is the kind of question I should be able to answer. I did have some suggestions for her, yet I realize there are no established methodologies. This comes as a bit of a surprise to me, as I have been organizing the BELIV Workshop on Visualization Evaluation for a long time and I have been running user studies myself for quite some time now. Continue reading

The Role of Algorithms in Data Visualization

It’s somewhat surprising to me how little we discuss the more technical side of data visualization. I like to say that visualization is something that “happens in your head” to emphasize the role of perception and cognition and to explain why it is so hard to evaluate visualization. Yet visualization also happens, to a large extent, in the computer, and what happens there can be extremely fascinating too.

So, today I want to talk about algorithms in visualization. What’s the use of algorithms in visualization? When do we need them? Why do we need them? What are they for? Surprisingly, even in academic circles I have noticed we tend to either avoid the question completely or take the answer for granted. Heck, even the few books we have out there: how many of them teach the algorithmic side of visualization? None.

I have grouped algorithms into four broad classes. For each one I am going to give a brief description and a few examples.

Spatial Layout. The most important perceptual component of every visualization is how data objects are positioned on the screen, that is, the logic and mechanism by which a data element is uniquely positioned on the spatial substrate. Most visualization techniques have closed formulations, based on coordinate systems, that permit us to uniquely, and somewhat trivially, find a place for each data object. Think scatter plots, bar charts, line charts, parallel coordinates, etc. Some other visualization techniques, however, have more complex logics that require algorithms to find the “right” position for each data element. A notable example is treemaps, which evolved from the somewhat simple initial formulation called “slice-and-dice” into more complex formulations like squarified treemaps and Voronoi treemaps. A treemap is not based on coordinates; it requires a logic. One of my favorite papers ever is “Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies”, where alternative treemap algorithms are proposed and rigorously evaluated. Another example is the super wide class of graph layout algorithms called force-directed layouts, where nodes and edges are placed according to iterative optimization procedures. This class is so wide that dedicated conferences and books exist just to study new graph layouts. Many other examples exist: multidimensional scaling, self-organizing maps, flow map layouts, etc. A lot has been done in the past, but a lot still needs to be done, especially in better understanding how to scale these methods up to much larger numbers of elements.
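To make the “a treemap requires a logic, not coordinates” point concrete, here is a minimal slice-and-dice treemap layout, the simplest of the algorithms mentioned above (a sketch of mine; squarified and Voronoi variants are considerably more involved):

```python
# A minimal slice-and-dice treemap. A node is either a number (leaf weight) or a
# list of child nodes; each level splits its rectangle proportionally to the
# subtree weights, alternating the split direction between levels.
def weight(node):
    return sum(weight(c) for c in node) if isinstance(node, list) else node

def slice_and_dice(node, x, y, w, h, vertical=True):
    if not isinstance(node, list):
        return [(x, y, w, h)]                          # leaf: its own rectangle
    rects, offset, total = [], 0.0, weight(node)
    for child in node:
        frac = weight(child) / total
        if vertical:                                   # slice along the x axis
            rects += slice_and_dice(child, x + offset * w, y, frac * w, h, False)
        else:                                          # slice along the y axis
            rects += slice_and_dice(child, x, y + offset * h, w, frac * h, True)
        offset += frac
    return rects

# Usage: a small hierarchy laid out inside the unit square.
tree = [[3, 1], [2, [1, 1]], 4]
for rect in slice_and_dice(tree, 0, 0, 1, 1):
    print(tuple(round(v, 3) for v in rect))
```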

(Interactive) Data Abstraction. There are many cases where data need to be processed and abstracted before they can be visualized. Above all, there is the need to deal with very large data sets (see “Extreme visualization: squeezing a billion records into a million pixels” for a glimpse of the problem). It does not matter how big your screen is; at some point you are going to hit a limit. One class of data abstraction is binning and data cubes (Tableau is largely based on this, for instance), which summarize and reduce the data by grouping values into intervals. Every visualization based on density has some sort of binning or smoothing behind the scenes, and the mechanism can turn out to be complex enough to require a clever algorithm. More interesting is the case of data visualizations that have to adapt to user interaction. Even the most trivial abstraction mechanism may require a complex algorithm to make sure the visualization is updated in less than one second when the user needs to navigate from one abstraction level to another. A recent great example of this kind of work is “imMens: Real-time Visual Querying of Big Data“. Of course binning is not the only data abstraction mechanism needed in visualization. For instance, all sorts of clustering algorithms have been used in visualization to reduce data size. Notably, graph clustering algorithms can (sometimes) turn a huge “spaghetti mess” into a more intelligible picture. For an overview of aggregation techniques in visualization you can read “Hierarchical Aggregation for Information Visualization: Overview, Techniques, and Design Guidelines”, a very useful survey on the topic.
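As a tiny example of the binning idea (my own, not how imMens is implemented): a 2D bin count, essentially a small data cube, reduces a million points to a fixed grid that can be redrawn and re-filtered almost instantly.

```python
# A tiny binning example (mine, not imMens): reduce a million points to a fixed
# 100x100 grid of counts. The grid, not the raw points, is what gets drawn and
# re-queried interactively; drilling down is just re-binning a selected range.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(scale=0.8, size=1_000_000)

counts, xedges, yedges = np.histogram2d(x, y, bins=100)
print(counts.shape, int(counts.sum()))       # (100, 100) cells summarizing 1,000,000 points

selection = (x > 0) & (x < 1)                # user brushes a slab of the x axis
fine, _, _ = np.histogram2d(x[selection], y[selection], bins=100)
print(int(fine.sum()), "points in the selected range, re-binned at finer resolution")
```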

Smart Encoding. Every single visualization can be seen as the result of an encoding procedure in which one has to decide how to map data features to visual features. To build a bubble chart you have to decide which variable you want to map to the x-axis, y-axis, color, size, etc. You get the picture. This kind of process may become tedious or too costly when the number of data dimensions increases. Also, some users may not have a clue about how to “correctly” visualize some data. Encoding algorithms can do some of the encoding for you, or at least guide you through the process. This kind of approach never became very popular in practice, but visualization researchers have spent quite some time developing smart encoding techniques. Notably, Jock Mackinlay‘s seminal work “Automating the design of graphical presentations of relational information” and the later implementation of the “Show Me” function in Tableau (Show Me: Automatic Presentation for Visual Analysis). Other examples exist but they tend to be more on the academic speculation side. One thing I have never seen, though, is the use of smart encoding as an artistic tool. Why not let the computer explore a million different encodings and see what you get? That would be a fun project.
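Here is a toy version of the rule-based idea, loosely inspired by the effectiveness rankings behind APT and Show Me but in no way a reconstruction of either: each data type gets a ranked list of channels, and every field is assigned the most effective channel still available.

```python
# A toy rule-based encoder (loosely inspired by effectiveness rankings, not a
# reconstruction of APT or Show Me): each data type has a ranked list of channels,
# and every field gets the most effective channel that is still free.
CHANNEL_RANKING = {
    "quantitative": ["x", "y", "size", "color_luminance"],
    "ordinal":      ["x", "y", "color_luminance", "size"],
    "nominal":      ["x", "y", "color_hue", "shape"],
}

def auto_encode(fields):
    """fields: list of (name, type) pairs, most important field first."""
    used, encoding = set(), {}
    for name, ftype in fields:
        for channel in CHANNEL_RANKING[ftype]:
            if channel not in used:
                encoding[name] = channel
                used.add(channel)
                break
    return encoding

print(auto_encode([("price", "quantitative"),
                   ("weight", "quantitative"),
                   ("brand", "nominal")]))
# -> {'price': 'x', 'weight': 'y', 'brand': 'color_hue'}
```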

Quality Measures. Even if this may seem a bit silly at first, algorithms can be used to supplement or substitute for humans in judging the quality of a visualization. If you go back to the classes I described above, you will see that each one might include some little mechanism of quality judgment. Layout algorithms (especially the nondeterministic ones) may need to routinely check the quality of the current layout. The same goes for sorting algorithms like those needed to find meaningful orderings in matrices and heatmaps. Data abstraction algorithms may need to automatically find the right parameters for the abstraction. And smart encoding algorithms may need to separate the wheat from the chaff by suggesting only encodings whose quality is above a given threshold. A couple of years back I wrote a paper on quality metrics titled “Quality metrics in high-dimensional data visualization: An overview and systematization” to create a systematic description of how they are used in visualization. The topic is arguably a little academic but I can assure you it’s a fascinating one with lots of potential for innovation.
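As a concrete, if simplistic, example of such a measure (my own, not one from the survey): an overplotting score for a scatter plot, i.e., the fraction of points hidden because they fall into already-occupied cells of a pixel-sized grid. A layout or encoding routine could use a score like this to pick a point size or an opacity automatically.

```python
# A simplistic quality metric (mine, not from the survey): the share of scatter-plot
# points that land in an already-occupied cell of a pixel-sized grid, i.e. how much
# of the data is hidden by overplotting at a given resolution.
import numpy as np

def overplotting(x, y, width=300, height=300):
    ix = np.clip(((x - x.min()) / (np.ptp(x) + 1e-12) * (width - 1)).astype(int), 0, width - 1)
    iy = np.clip(((y - y.min()) / (np.ptp(y) + 1e-12) * (height - 1)).astype(int), 0, height - 1)
    occupied = len(set(zip(ix.tolist(), iy.tolist())))   # distinct cells actually used
    return 1 - occupied / len(x)                          # fraction of points that are hidden

rng = np.random.default_rng(2)
x, y = rng.normal(size=50_000), rng.normal(size=50_000)
print(round(overplotting(x, y), 3))
```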

These are the four classes of algorithms I have identified in visualization so far. Are there more out there? I am sure there are, and that’s partly the reason why I have written this post. If there are other uses for algorithms that I did not list here, please comment on this post and feel free to suggest more. That would help me build a better picture. There is much more to say on this topic.

Take care.

Smart Visualization Annotation

There are three research papers which have drawn my attention lately. They all deal with automatic annotation of data visualizations, that is, adding labels to the visualization automatically.

It seems to me that annotations, as an integral part of a visualization design, have received somewhat little attention in comparison to other components of a visual representation (shapes, layouts, colors, etc.). A quick check of the books on my bookshelf kind of supports my hypothesis. The only exception I found is Colin Ware’s Information Visualization book, which has a whole section on “Linking Text with Graphical Elements“. This is strange because, think about it, text is the most powerful means we have to bridge the semantic gap between a visual representation and its interpretation. With text we can clarify, explain, give meaning, etc.

Smart annotation is an interesting area of research because not only can it reduce the burden of manually annotating a visualization, it can also reveal interesting patterns and trends we might not know about or, worse, that might go unnoticed. Here are the three papers.

Paper #1: “Just-in-time annotation of clusters, outliers, and trends in point-based data visualizations.” Kandogan, Eser. Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on. IEEE, 2012.


This technique works on point-based visualizations. The system detects trends automatically by analyzing the visual information displayed on the screen (that is, patterns are detected in the visual space, not the data space) and tries to find a description for the observed trends. Once a description is found, the system overlays labels that convey this information. So, for instance, the algorithm finds visual clusters (groupings) and annotates them with the data values that best explain the trend (data dimensions and values that have a distinctive distribution within the cluster). The paper does not focus only on clusters; it provides techniques to annotate trends and outliers as well, and it describes the whole framework in a way that makes it easy to imagine how this can be extended to other domains and visualizations.
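A stripped-down sketch of the core loop might look like the following (my own approximation, not Kandogan’s actual method; it uses scikit-learn’s DBSCAN for the cluster detection step): find clusters in the 2D screen positions, then label each cluster with the data dimension that deviates most, inside the cluster, from its overall distribution.

```python
# A stripped-down approximation of the idea (not Kandogan's actual method): detect
# clusters in the projected 2D positions, then annotate each cluster with the data
# dimension whose values inside the cluster deviate most from the overall data.
import numpy as np
from sklearn.cluster import DBSCAN

def annotate_clusters(xy, data, dim_names, eps=0.8, min_samples=10):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)
    mu, sd = data.mean(axis=0), data.std(axis=0) + 1e-12
    notes = {}
    for c in set(labels) - {-1}:                         # -1 is DBSCAN's noise label
        z = (data[labels == c].mean(axis=0) - mu) / sd   # how unusual is each dimension here?
        d = int(np.argmax(np.abs(z)))
        notes[int(c)] = f"{dim_names[d]} {'high' if z[d] > 0 else 'low'} (z={z[d]:.1f})"
    return labels, notes

# Usage with synthetic data: two blobs that differ mostly on the "income" dimension.
rng = np.random.default_rng(3)
data = np.vstack([rng.normal([0, 0, 0], 1, (200, 3)), rng.normal([4, 0, 0], 1, (200, 3))])
xy = data[:, :2]                                         # pretend these are screen coordinates
_, notes = annotate_clusters(xy, data, ["income", "age", "height"])
print(notes)
```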

Paper #2: “Contextifier: Automatic Generation of Annotated Stock Visualizations.” Hullman, Jessica, Nicholas Diakopoulos, and Eytan Adar. ACM Conference on Human Factors in Computing Systems (CHI). May, 2013.

Contextifier automatically annotates stock market timelines by discovering salient trends in the chart (peaks and valleys) and finding news that might be relevant to explain them. The system takes as input an article and a news corpus. The input article is used as a query to find relevant news in the corpus and to match them against salient features in the graph. Articles and trends are matched to decide which time points should be annotated, and these points are then annotated with the most relevant news in the corresponding time frame. The paper also contains a very interesting analysis of how visualization designers annotate their visualizations. The outcome of this analysis is used to inform the design of the annotation engine.
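A much-simplified sketch of that matching step (my own, not Contextifier’s pipeline, which models visual salience and article similarity far more carefully): pick the most salient points in the series, then annotate each with the headline that is closest in time and shares the most words with the input article.

```python
# A much simplified sketch of the matching step (not Contextifier's pipeline):
# find the biggest moves in a price series, then annotate each with the news
# headline that is closest in time and overlaps most with the input article.
import numpy as np

def salient_points(prices, k=2):
    change = np.abs(np.diff(prices))                  # day-to-day movement
    return np.argsort(change)[-k:] + 1                # days with the k biggest moves

def best_headline(day, article_words, news):
    """news: list of (day, headline); score = word overlap / (1 + distance in days)."""
    def score(item):
        d, headline = item
        overlap = len(article_words & set(headline.lower().split()))
        return overlap / (1 + abs(d - day))
    return max(news, key=score)[1]

prices = [10, 10.2, 10.1, 13.5, 13.4, 13.6, 9.0, 9.1]
news = [(3, "Company beats earnings expectations"),
        (5, "Local sports team wins final"),
        (6, "Regulator opens probe into company accounting")]
article_words = set("company earnings accounting probe investors".split())

for day in salient_points(prices):
    print(f"day {day}: {prices[day]} -> '{best_headline(day, article_words, news)}'")
```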

Paper #3: “Graphical Overlays: Using Layered Elements to Aid Chart Reading.” Kong, Nicholas, and Maneesh Agrawala. Visualization and Computer Graphics, IEEE Transactions on 18.12 (2012): 2631-2638. [Sorry, no free access to this one.]

Graphical Overlays actually does much more than annotate a chart with text; it is a whole system for adding information on top of existing charts to aid their reading. For instance, besides adding notes to a chart to identify potentially interesting trends, it also adds grids, highlights elements of a specific type (e.g., one set of bars in a bar chart), and adds summary statistics (like an average line in a time chart). The system works entirely on image data, which means it does not require direct access to the original data used to create the chart. In the authors’ words: “Our approach is based on the insight that generating most of these graphical overlays only requires knowing the properties of the visual marks and axes that encode the data, but does not require access to the underlying data values. Thus, our system analyzes the chart bitmap to extract only the properties necessary to generate the desired overlay.”
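For a flavor of the overlay step itself, here is a small matplotlib sketch of mine. It assumes the bar values and positions have already been extracted from the bitmap, which is precisely the hard part the paper addresses, and simply layers a reference grid, an average line, and a highlight on top of the chart.

```python
# A small sketch of the overlay step only (mine, not the paper's system). The bar
# values are assumed to have been extracted from the chart bitmap already; the code
# just layers a reference grid, an average line, and a highlight on top of the bars.
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
values = [12, 18, 9, 15]                              # pretend these were extracted from the bitmap

fig, ax = plt.subplots()
bars = ax.bar(labels, values, color="steelblue")

ax.yaxis.grid(True, linestyle=":")                    # overlay 1: reference grid
avg = sum(values) / len(values)
ax.axhline(avg, color="black", linewidth=1)           # overlay 2: summary statistic (average)
ax.annotate(f"avg = {avg:.1f}", (2.5, avg), va="bottom")
bars[values.index(max(values))].set_color("darkorange")  # overlay 3: highlight the maximum
plt.show()
```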

These three papers present very clever mechanisms for annotating visualizations in different contexts and with different purposes. I suggest you take a look at the papers because they provide numerous interesting technical details. Beyond the technical aspects, though, I find it interesting that several researchers are independently focusing on visualization annotation. Annotation is extremely important and I don’t think we have spent enough energy exploring its potential and challenges. I also think there is an educational gap we should cover, that is, how do we teach our students when, how, and why a visualization should be annotated?

I am curious to hear what you think. What do you think about the papers I presented? And what do you think about annotation in general? How do you deal with annotations yourself?

Take care.

7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don’t Know

(In my last post I introduced the idea of regularly posting research material on this blog as a way to bridge the gap between researchers and practitioners. Some people kindly replied to my call for feedback and the general feeling seems to be: “cool, go on! rock it! we need it!”. Ok, thanks guys, your encouragement is very much needed. I love you all. So, here is a “researchy” post. It is not the same style I’ve used in my research posts on infosthetics, but I think you will find it useful anyway.)

Even if I am definitely not a veteran of infovis research (far from it), I started reading my first papers around the year 2000 and since then I’ve never stopped. One thing I noticed is that some papers recur over and over, and they really are (at least in part) the foundation of information visualization. Here is a list of those that:

  1. come from the very early days of infovis
  2. are foundational
  3. are cited over and over
  4. I like a lot

Of course this doesn’t mean these are the only ones you should read if you want to dig into this matter. Some other papers are foundational as well. For sure, a side effect of the maturation of this field is that some newer papers are more solid and deep, and I had to refrain from including them in the list. But this is a collection of classics. A list of papers you just cannot afford not to know, unless you want to risk a bad impression at VisWeek (ok, ok, it’s a joke … but there’s a pinch of truth in it). A retrospective. Definitely a must read. Call me nostalgic.

Advice: in order to really appreciate them, keep in mind that they were all written during the ’90s (some even in the ’80s!).

Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. William S. Cleveland; Robert McGill (PDF)

Please don’t tell me you don’t know this one! This is the most classic of the classics. Cleveland is one of the fathers of statistical graphics and he wrote two groundbreaking books, The Elements of Graphing Data and Visualizing Data, based on the research carried out in this paper.

  • What’s in it? The paper describes a series of experimental user studies to understand how basic visual primitives like length, size, color, etc., compare in terms of how accurately they convey quantitative information.
  • Why is it important? With this paper Cleveland introduced (quite vigorously) the idea of grounding visualization in rigorous experimentation. People like Bertin had started ranking visual features many years before, but never before had such a ranking been validated with a scientific method.
  • What can you learn? The basics of data visualization. That visual encoding is hard stuff and you shouldn’t take it too lightly. And that visual primitives do have a ranking that you have to take into account if you want to design effective data visualizations.

The Structure of the Information Visualization Design Space. Stuart K. Card and Jock Mackinlay (PDF)

I suspect this one is somewhat less known than the previous one. Card and Mackinlay are among the founders of information visualization and the content of this paper is repeated and reworked (maybe in a better shape) in the book Readings in Information Visualization.

  • What’s in it? The paper describes the basic components that make up a visualization and how to put them together to build a new design.
  • Why is it important? Because it is one of the first attempts to describe the visualization design space in a systematic way.
  • What can you learn? You learn that in order to design innovative visualizations you have to know what the building blocks are and how to connect them. In my experience this is one of the most important, and often neglected, skills.

Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays. Christopher Ahlberg and Ben Shneiderman (PDF)

Ok, I am ready to accept that you don’t know this paper yet, but please don’t tell me you’ve never heard of Ben Shneiderman and Christopher Ahlberg. If you haven’t, you’ve been living under a rock. Let me guess … you’ve heard of Ben Shneiderman but not of Christopher Ahlberg. Well, Christopher is the founder of Spotfire, probably the first commercial success of information visualization ever. And Spotfire was based on the research described in this paper.

  • What’s in it? It describes one of the first attempts to make visualizations dynamic and controlled by the user through interactive queries.
  • Why is it important? The whole idea of dynamic filtering had a huge impact on the way data is visualized interactively. We can see the effect of this idea everywhere. Before it, there were queries in a database. After it, it was clear how powerful interactive visualization could be.
  • What can you learn? You learn how powerful data visualization can be when interactive capabilities are added to static representations. After more than 10 years we are still learning this lesson.

High-Speed Visual Estimation Using Preattentive Processing. C. G. Healey, K. S. Booth and J. T. Enns (PDF)

I fell in love with Chris Healey‘s work very early in my journey into visualization. It always struck me how innovative and intriguing his research was. His specialty is what he calls “Perceptual Visualization”: the study of visualization based on core human vision principles. His page about perception in visualization is a real classic.

  • What’s in it? It describes how the concept of preattentive processing can help in guiding the design of visualization and user interfaces. It contains several experimental studies.
  • Why is it important? Nobody before the work of Healey (maybe Colin Ware?) pushed the limits of perception applied to visualization so far. I bet many of the results of his studies have yet to be exploited.
  • What can you learn? You learn what preattentive processing is and how to apply it to the design of information visualizations. (As a byproduct you might also learn how tough this stuff is!)

Automating the Design of Graphical Presentations of Relational Information. Jock Mackinlay (PDF)

I mentioned Jock already in one of the papers above. Jock is not only behind some of the fundamental research in visualization and human-computer interaction, he is also one of the minds behind Tableau Software. This paper can be considered a very early draft of what, through several other steps, became Tableau today. In some sense it can still be considered visionary, since the dream of a tool that automatically adapts to the data is still far from being realized (if it ever will be).

  • What’s in it? Jock presents a system called APT (A Presentation Tool) whose purpose is to automatically design effective visualizations by matching data features with visual features through the use of logic rules.
  • Why is it important? It is important not only because it contains a visionary perspective on visualization but also because part of the work focused on the definition of visual primitives (starting from the work of Bertin) and on the way data features should be matched to visual features.
  • What can you learn? Knowing how to match data features to visual features is one of the most important skills of knowledgeable data visualization experts.

How NOT to Lie with Visualization. Bernice E. Rogowitz, Lloyd A. Treinish (PDF).

How can you not love a paper with a title like this? I’d give it a prize for the best marketing in research paper titling. This work is fully focused on color use and perception, but its implications extend beyond the scope of color mapping. The work was part of the development of one of the earlier data visualization systems, OpenDX, developed by IBM, which included a module called PRAVDA for assisted color mapping.

  • What’s in it? A detailed explanation of how the eye can be misled if the wrong (color) mapping is used, plus a thorough discussion of how to build effective color scales that take into account the data distribution.
  • Why is it important? I still see a lot of people using color badly. By reading this I hope this number will get smaller and smaller. It is also an early example of how automatic computation and interaction can go happily together.
  • What can you learn? This paper will give you solid arguments about why mapping color badly is bad. Plus you will learn how to build effective color scales. On a side note, the same is true for every other visual feature you want to use.

The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Ben Shneiderman (PDF).

This is a funny paper. Ben has the ability to write a paper with absolute nonchalance and make it massively popular. It’s not too technical, nor too elaborate. He straightforwardly proposed a classification and the famous infovis mantra, and it became one of the biggest classics of infovis. If I remember well, I once heard him talk about this paper and explain how unexpected this success was.

  • What’s in it? A classification of information visualization techniques according to data type. More importantly the explanation of the visual information seeking mantra: “overview first, zoom and filter, details on demand”.
  • Why is it important? The visual information seeking mantra has been the reference model for interactive visualization for 15 years. Hundreds of systems have been developed under this paradigm.
  • What can you learn? The classification by data type will help you mentally organize visual designs into classes (even if I must admit I am not a big fan of this classification). The visual information seeking mantra will guide you in designing and evaluating interactive visualizations: do you have an overview? zoom and filter capabilities? details on demand? tools to relate things? history facilities?

That’s all guys. Phew … it’s been a marathon to write such a long and detailed list. That’s the best I could think of and I really, really hope this will be tremendously helpful to you. Go on, read them, and feel free to expand the list. Please remember:

  • Let me know if something is not clear. I’d really love to help you.
  • Let me know if you don’t agree on something. I’d be happy to hear and learn from you.
  • If you have other papers to suggest, please do it!

Thanks a lot guys. Have fun with it.