In the second week of my course we start directly with the “data abstraction” chapter taken from Tamara’s book, which is the official textbook in my course.

Data abstraction is mostly about describing data in a way that is instrumental to visualization design. That is, by detecting certain structures and information within them, you can think ahead about how to navigate the visualization design space. This is where you learn, among other things, the extremely valuable concept of *attribute type*: *categorical*, *ordinal* and *quantitative*.

Tamara did an excellent job at creating a consistent catalog of data abstractions, which covers an extremely wide set of cases and situations. Here is an excerpt from the data abstraction summary you can find in the book chapter (side note: I absolutely love the way the book provides diagrams summarizing the content at the beginning of each chapter):

That said, there are a few things my students and I are always struggling with. Here is an account of the problems and possible solutions.

**Some data abstractions are way too complex and cover rare cases.** It is clear that Tamara tried to be as complete as possible and, as such, her effort is highly laudable. The need for completeness, however, creates a tension with simplicity. As you can see in the diagram, some abstractions are very familiar: *table*, *network*, *tree*, etc. But then things get way muddier with *grids*, *spatial fields*, *geometry*, *clusters* and *sets*. I have been through this many times, and invariably when we get to this chapter my students are confused. I must confess I am confused too. Here are examples of questions I received this semester: “*I don’t really understand the concept of field and geometry, especially the difference between them*” or “*While looking at the continuous field example about sea temperature of locations on the planet, it somehow seems like to be geometry dataset type?*”.

What is the solution? I don’t know. As I said, I like the completeness and the consistent approach, but I am also concerned with how much students will be able to retain. Maybe we can create a “data abstraction light”? My data abstraction light would be something along these lines. There are two data set types: *tables* and *networks*. Each of these contains attributes of three possible types: *categorical*, *ordinal* and *quantitative*. Some of these may represent *time* and/or *geography*. Too simplistic?

**Is data abstraction a matter of “describing” or “designing” data structures?** Whenever I teach data abstraction, I feel like there is one important part I should be teaching that is not developed enough: the art of “sculpting” your data so that it has the “shape” it needs to solve the problem you want to solve. To be fair, Tamara does talk about this, but I believe this part is not developed/structured enough. While data abstraction helps you **describe** what kind of information your data contain (description), you also need to figure out how your data can and should be molded to get to the desired solution (design). This is a crucial vis design activity and it’s never acknowledged enough. Let me state it in a different way: your role as a visualization designer is not only to find **how** to represent the data you have, but also **what** to represent in the first place. This is a huge aspect of visualization design which is very often overlooked!

Now … In how many ways can data be manipulated? And which of these ways do we need to teach? Above all, I believe that anything resembling an SQL query (or Pivot Table in Excel-ese) is a fundamental step (this is why I believe Tableau is so successful: ultimately it’s a database query system). So, the fundamental operations on a table are: **selecting** attributes, **filtering** rows in a principled manner and **aggregating** them with aggregate operations.
In a way, every chart can be described as an SQL query over a data set. Other fundamental operations are those that transform a data set or attribute from one type to another: from table to network, from quantitative to categorical, from place names to their coordinates, etc. I believe this is so crucial because there is never a unique way of looking at a given data set. Rather than calling this step “*data abstraction*” I would even call it “*data interpretation*”. Or … “*data design*”? Or … “*data sculpting*”?

For instance, in class I often use the Aid Data data set. It’s a fantastic data set recording information about financial disbursements between countries for aid purposes over time. The data is stored as a table/spreadsheet and contains, among other fields: *origin*, *destination*, *time*, *amount*, *purpose*. Now, if I select *origin*, transform it into spatial coordinates, and aggregate over *amount*, I can create a nice bubble chart map of donors. But if I select *origin*, *destination* and *amount*, I can create a weighted node-link diagram. And if I select *purpose* and aggregate by *amount*, I can create a bar chart of how disbursements distribute across purposes. You see how crucial this is?

**Not enough emphasis on the relationship between analytical questions and “data shapes”.** This is somewhat related to my last observation. Data abstraction and transformation do not happen in a vacuum; they are instrumental to achieving a data analysis and presentation goal. But data analysis and presentation presuppose looking for answers to a series of questions. Ultimately, this is what happens in practice: finding the right “shape” for your data is guided by your desire to pursue some questions and goals. This is easier said than done. One caveat in visualization is that not all questions are perfectly laid out in front of us. Sometimes we “discover” new questions as we proceed in our analysis.
In any case, it is true that virtually every single visual representation has one or more possible questions attached, that is, questions that can be answered by looking at it (and, by the way, new questions that cannot be answered by looking at it). I now feel that this tight relationship between *questions*, *data sculpting* and *representation* needs to be highlighted and trained. I have done a bit of this in class already, through exercises, and my students seem to have learned a lot. One student at the end of the class told me: “*Prof. I loved this exercise today!*”. On a side note, this is why I am so happy I no longer need to give lectures in class. I can afford to teach students concepts through practice. They end up developing a sense of what these concepts really mean, not only intellectually, but also in practical (more internalized) ways.
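
To make the Aid Data example above a bit more concrete, here is a minimal sketch in Python using the standard `sqlite3` module. Everything in it is an assumption made for illustration: the rows are invented, the coordinates are rough, and the names (`build_db`, `donor_bubbles`, `weighted_edges`, `purpose_bars`) are hypothetical. The real Aid Data set is much larger and its column names may differ. The point is only to show how selecting, filtering and aggregating the same table gives three different “shapes”, one per chart.

```python
import sqlite3

# Invented rows in the spirit of the Aid Data table (not real figures).
# Columns: origin, destination, year, amount, purpose
ROWS = [
    ("USA", "Kenya", 2005, 120.0, "Health"),
    ("USA", "India", 2005, 80.0, "Education"),
    ("Japan", "India", 2006, 200.0, "Infrastructure"),
    ("Japan", "Kenya", 2006, 50.0, "Health"),
    ("Germany", "Kenya", 2007, 70.0, "Education"),
    ("USA", "Kenya", 2007, 30.0, "Health"),
]

# Hypothetical lookup from place name to approximate (lat, lon).
COORDS = {
    "USA": (38.9, -77.0),
    "Japan": (35.7, 139.7),
    "Germany": (52.5, 13.4),
}


def build_db():
    """Load the invented rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aid (origin TEXT, destination TEXT, "
        "year INTEGER, amount REAL, purpose TEXT)"
    )
    conn.executemany("INSERT INTO aid VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def donor_bubbles(conn):
    """Bubble map shape: select origin, aggregate amount, add coordinates."""
    rows = conn.execute(
        "SELECT origin, SUM(amount) FROM aid GROUP BY origin ORDER BY origin"
    ).fetchall()
    # Attribute transformation: place name -> spatial coordinates.
    return [(o, COORDS[o][0], COORDS[o][1], total) for o, total in rows]


def weighted_edges(conn):
    """Node-link shape: table -> network, one weighted edge per country pair."""
    return conn.execute(
        "SELECT origin, destination, SUM(amount) FROM aid "
        "GROUP BY origin, destination ORDER BY origin, destination"
    ).fetchall()


def purpose_bars(conn, min_year=None):
    """Bar chart shape: total amount per purpose, optionally filtered by year."""
    if min_year is None:
        query = "SELECT purpose, SUM(amount) FROM aid GROUP BY purpose ORDER BY purpose"
        return conn.execute(query).fetchall()
    query = (
        "SELECT purpose, SUM(amount) FROM aid WHERE year >= ? "
        "GROUP BY purpose ORDER BY purpose"
    )
    return conn.execute(query, (min_year,)).fetchall()


if __name__ == "__main__":
    conn = build_db()
    print(donor_bubbles(conn))       # who gives how much, and where they are
    print(weighted_edges(conn))      # who gives how much to whom
    print(purpose_bars(conn, 2006))  # what the money is for, from 2006 on
```

Each function answers a different question about the same table, and each result is ready to be handed to a different chart. That is the sense in which a chart can be read as a query: change the question and the data takes a different shape.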

That’s all for now. My next post will be on the third module I teach, which is “fundamental variations of charts”. Let me know what you think! This is very much a work in progress!

Thanks for reading.