Data with a Soul and a Few More Lessons I Have Learned About Data

by Enrico on January 15, 2014

in Thoughts

I don’t know if this is true for you but I certainly used to take data for granted. Data are data, who cares where they come from. Who cares how they are generated. Who cares what they really mean. I’ll take these bits of digital information and transform them into something else (a visualization) using my black magic and show it to the world.

I no longer see it this way. Not after attending a whole three-day event called the Aid Data Convening, a conference organized by the Aid Data Research Consortium (ARC) to talk exclusively about data. Not just data in general but a single data set: Aid Data, a curated database of more than a million records collecting information about foreign aid.

The database keeps track of financial disbursements made from donor countries (and international organizations) to recipient countries for development purposes: health and education, disasters and financial crises, climate change, etc. It spans from 1945 to the present and includes hundreds of countries and international organizations.

Aid Data users are political scientists, economists, social scientists of many sorts, all devoted to a single purpose: understanding aid. Is aid effective? Is aid allocated efficiently? Does aid go where it is most needed? Is aid influenced by politics (the answer is of course yes)? Does aid have undesired consequences? Etc.

Isn’t that incredibly fascinating? Here is what I have learned during these few days I have spent talking with these nice people.

Data are not always abundant or easy to get. In the big data era we are so used to data abundance that we end up forgetting how some data, data crucial for important human endeavors, may be very hard to get. It’s not just a matter of writing yet another Python script and scraping a million records in 24 hours. No, it’s a super-painful process. For instance, the Aid Data folks have a whole team of data gatherers and a multistep process which includes: setting up an agreement with a foreign country, having people fly to remote places and convince officials to make their information available, obtaining a whole bunch of documents and files, transforming these files into a common format and adding geographical coordinates (geocoding) where necessary, cross-checking data with multiple coders, etc. How far is this from writing a bunch of Python code?

Data granularity can be a game-changer. It took me a while to understand why Aid Data users are so excited by the new release of the database which features, for the first time, data at the sub-national rather than only national level. This means that financial disbursements are geocoded at a higher level of granularity, that is, instead of knowing only that a certain amount has flowed from one country to another you can now know which region it has gone to. To my eyes this seemed like a minor thing, but as I went through a few presentations of people doing real research with these data I suddenly realized it is a huge change! Picture this: you know aid is flowing from the US to Uganda but you have no idea where it goes once it lands there. All of a sudden researchers can ask a whole lot of new and more interesting questions. In turn, this makes me think about how this extends to many other data sets: small changes can have huge impacts. A little more detail may pave the way to much bigger questions. How can we make existing data systematically more valuable by adding crucial information? And what is this crucial information by the way?
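To make the difference concrete, here is a minimal Python sketch of the same handful of disbursements summed at the national and at the sub-national level. The records, amounts and the `total_by` helper are all made up for illustration; this is not the actual Aid Data schema.

```python
from collections import defaultdict

# Synthetic, made-up records for illustration only:
# (donor, recipient, region, amount in millions of USD)
records = [
    ("US", "Uganda", "Northern", 12.0),
    ("US", "Uganda", "Central", 30.0),
    ("US", "Uganda", "Northern", 8.0),
    ("US", "Uganda", "Western", 5.0),
]

def total_by(rows, key):
    """Sum the amount (last field) of each row, grouped by key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# National level: one number per donor-recipient pair.
national = total_by(records, key=lambda r: (r[0], r[1]))

# Sub-national level: the same money, now split by region.
subnational = total_by(records, key=lambda r: (r[0], r[1], r[2]))

print(national)
print(subnational)
```

With the national view there is a single total for the US-Uganda pair. With the sub-national view the same total breaks down by region, which is what makes it possible to ask where the money went.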

Questions are much more important than data. I did not need to attend this conference to realize how true this is. Yet, after attending it, I am even more convinced. One of the highest peaks of the event for me was listening to all the diverse and interesting questions researchers have on this single data set. There are all kinds of flavors: the effect of aid on health, democratic processes, recovery from disasters and violence or, vice versa, how specific events or conditions influence aid. Even if data are a critical asset to answer these questions, and to substantiate them with hard numbers, the real value comes from the questions, not from the data. And questions come from the most important asset we have: our brain. Data without brain is useless. Brain without data may still be somewhat useful I guess.

Interesting questions are causal. It’s stunning for me to see how most visualization projects are organized around the detection and depiction of trends, patterns, outliers, groupings, and so seldom around causation. Yet, in most scientific endeavors causal relationships are what matter the most. While detecting trends is still important, ultimately researchers want to see how A has an effect on B (well it may be much more complicated than that but you get the point): Does aid have an effect on child mortality? Does aid reduce conflicts? Does aid to region A displace resources from region B? It’s extremely surprising to me, after working in visualization for many years, to realize how agnostic visualization is to causation and causal models, when in fact virtually every scientific question involves a causal relationship. How can we make progress and systematically explore how visualization can help uncover or present causal relationships?

OMG data bias! It was sometime halfway through the conference, after hearing all sorts of praise for Aid Data, that one of the attendees bravely stood up and said something along the lines of: “Hey wait a moment folks … these data have a huge bias! If we include only countries that agree to provide their data, we have a big selection bias problem. How is this going to affect our research?” (kudos to Bruce for raising this question). This reminded me that data always comes with all sorts of intricacies and problems. It can be bias, it can be missing values, it can be errors, it can be a lot of other hidden things that may totally invalidate our findings. If there is one lesson to learn here, it is this: while it is easy to get super-excited about data and the endless opportunities they present, it is hard to acknowledge that data are limited and may even be useless in some circumstances. Rather than sweeping these problems under the carpet, we’d better develop some sort of “data braveness” or “data mindfulness” and admit that data, after all, may have all sorts of bugs.
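For readers who like to see the problem in numbers, here is a toy Python sketch of selection bias. Everything in it is an assumption made for illustration: the 100 hypothetical countries, the "transparency" score, the outcome that rises with it, and the rule that only the more transparent half shares its data. None of it comes from Aid Data.

```python
from statistics import mean

# Hypothetical countries with synthetic values: a transparency score
# in [0, 1] and an outcome that (by construction) rises with it.
countries = [
    {"transparency": i / 99, "outcome": 10 + 20 * (i / 99)}
    for i in range(100)
]

# Assumed selection rule: only the more transparent countries
# agree to provide their data.
reporting = [c for c in countries if c["transparency"] >= 0.5]

# Mean outcome over all countries versus over the reporting ones only.
true_mean = mean(c["outcome"] for c in countries)
observed_mean = mean(c["outcome"] for c in reporting)

print(f"countries reporting: {len(reporting)} of {len(countries)}")
print(f"mean outcome, all countries:       {true_mean:.2f}")
print(f"mean outcome, reporting countries: {observed_mean:.2f}")
```

Because willingness to report is tied to the outcome in this toy setup, the average computed from the reporting countries overstates the average across all of them. That is the kind of distortion the question was pointing at.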

Communities of practice and visualization as a cultural artifact. During the course of the conference I had the opportunity to see lots of charts, graphs, diagrams. Visualization is definitely part of this community: they love maps and enjoy presenting their ideas through colorful visual representations. Earlier last year I had the opportunity to work with a group of climate scientists on a different project and similarly I have seen them using lots of charts, diagrams, graphs. What I am starting to notice, after seeing so many people using visualization for their own purposes, is that visualization is a cultural artifact. Communities of practice go through an interesting evolutionary process where tools like data visualization are adopted, transformed and consolidated, forming numerous implicit and explicit defaults, conventions and expectations. With Aid Data for instance most people need to visually correlate two main variables: amount of aid and an outcome variable, both in geographical space. Most of them end up using a choropleth map with bubbles on top. Is that the best representation? I don’t know. But I know this is familiar to everyone and this is what most of them expect and are used to seeing. How much do we know about these communities of practice? How can research in visualization develop a better understanding of how people use visualization in real-world settings? What could we gain by doing that?

Behind data there might be a “soul”. Finally, the last thing I learned is the most important one. Data is just a signal, only a dry description of something that is much more important: real people, phenomena, events. It’s way too easy, when you are used to working with lots of different data and big piles of them, to forget what lies behind all these bits; what these bits really are. Aid Data and the stories I heard reminded me that behind data there can be profound desperation, joy, struggles, good and bad intentions, failures and successes. In a word, there can be real humans and their lives. I think it is really important for us not to lose this connection. Not to completely detach from what these data represent. Next time you start a project try to pause for a moment and think: behind data there might be a soul.

That’s all I had to say. This has been an extremely enriching experience for me and I hope these few thoughts will spark some new ideas and feelings in you. As usual, feel free to comment and react. I’d love to hear your voice!

Take care.

  • https://sites.google.com/site/mswinters1/ Matthew Winters

    Very cool to hear your thoughts of the conference, Enrico!

    • FILWD

      Yeah … lots of food for thought from the ARC Convening. I really enjoyed it. Thanks.
