I don’t know if you share a feeling I often have: for me, the dichotomy between automatic and interactive approaches to data analysis problems has always been a constant source of inspiration. I’ve always felt that the way to further progress in interactive data analysis lies right where the two disciplines touch each other.
I think it’s no mystery that, in one way or another, visualization and data mining have always been, and still are, somewhat in competition. The way I see it, on the one hand dataminers see visualization as too soft a discipline, lacking formalism and burdened with the original sin of having very poor evaluation methods in its toolbox. On the other hand, visualizers think data mining is too rigid and narrowly focused on a plethora of insignificant small deltas to algorithms that nobody will ever understand.
Both are partially right and partially wrong, but that is not the point. What we have been missing for at least the last 10 years is a fruitful exchange of ideas between these divided communities, one that takes the best of each and builds a long-desired communion.
From my point of view there’s no question that what one discipline lacks is filled, in a complementary fashion, by the other. Honestly, I don’t know why these two disciplines historically evolved separately and, sometimes, somewhat antagonistically. The truth is that today we just cannot afford to keep the two completely separate.
I notice with a pinch of satisfaction that a few (feeble) signs of a new marriage seem to be coming to light, but in a new dress. The whole new hype around the figure of the Data Scientist, though largely tailored around the main skills of a dataminer, shows real consideration for the world of data visualization. Similarly, in mirror fashion, what is Visual Analytics if not an attempt by visualizers to acknowledge that visualization without analytics is like tilting at windmills?
A bit of history
To tell the whole story it is necessary to remember how, during the late ’90s and early ’00s, Data Mining and Visualization seemed for a while to be converging toward a new marriage. Some people coined the term Visual Data Mining and for a while there was a spark of interest around the idea. A group of distinguished researchers from data mining and visualization even published a book called “Information Visualization in Data Mining and Knowledge Discovery” (not the best book I’ve ever read, I must say). Around the same period Daniel Keim published a very well cited survey paper titled “Information Visualization and Visual Data Mining”, trying to trace the lines of the new discipline. During the same period a couple of workshops appeared at top data mining conferences like KDD and ICDM, and a few visualization papers also found a place in their programs. And then what? Kind of oblivion … I don’t know … the whole idea lost its power, appeal and strength and researchers just turned their minds to something else. Or maybe dataminers and visualizers just discovered they really didn’t like each other? Who knows …
But you, my dear readers, surely want more than a story, right? You want to learn something from it. OK, the main point here is that I strongly believe there’s no way to tackle the data analysis challenges of the new millennium without integrating these two branches of knowledge. The problems we face today require at least the following two broad capabilities, which neither discipline can cover alone:
- Coping with monstrous data
- Harnessing the complexity of the machine
But let me explain these concepts in more detail.
Why Visualization cannot afford to ignore Data Mining
- Data is full of rubbish: I have repeated this several times on this blog. Data never comes for free; you have to manipulate it to fit the needs of your project. The classic things you will need to deal with are missing values, outlier detection, normalization, aggregation, sampling, etc., but every project comes with its own bag of necessary data wrangling. Each of these requires robust and solid techniques; it is not something you can improvise. And no matter how skilled a data visualization expert you are, you will need to borrow solid techniques from dataminers, otherwise you are an amateur.
- Humans don’t scale, machines do: There is no way to visualize a billion items. Really, believe me, there’s no way to do that effectively. If you assign every item to a single pixel (known as pixel-based visualization), which is the maximum scalability available, you will need either a huge screen or very tiny pixels. In both cases our body has limitations. With a huge screen your perception is limited by your field of view, that is, there’s no way to take in the whole screen with your eyes. With tiny pixels the human eye is limited by its maximum resolution. Machines, on the other hand, do scale and can crunch monstrous amounts of data. Add more machines to your cluster and you have more power.
- We need order, in order to thrive: No matter how clever your visualization is and how skilled you are as a designer, visualization just cannot answer some questions without some kind of automatic abstraction and order. Data visualization is very powerful when lots of details can be exposed about every single item, but this does not scale, and finding the right setup for any given question is hard and inefficient. Data mining offers clear scaffolds around which one can build clear questions and receive somewhat clear answers.
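To put rough numbers on the “humans don’t scale” point, here is a minimal Python sketch. It assumes a 1920×1080 display (my choice for illustration, nothing in the argument depends on it) and uses two hypothetical helpers, `screens_needed` and `bin_counts`, that I made up for this example. The first counts how many full screens a one-item-per-pixel layout would need; the second shows the simplest kind of automatic abstraction, equal-width binning, which shrinks any number of items down to a fixed number of counts you can actually draw.

```python
import math


def screens_needed(n_items, width=1920, height=1080):
    """Full screens needed if every item gets exactly one pixel.

    The default resolution is an assumption made for this example.
    """
    return math.ceil(n_items / (width * height))


def bin_counts(values, n_bins, lo, hi):
    """Aggregate values into n_bins equal-width bins over [lo, hi].

    Values outside [lo, hi] are ignored; hi itself falls in the last bin.
    """
    counts = [0] * n_bins
    bin_width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts


# A billion items at one pixel each on a 1920x1080 screen:
print(screens_needed(1_000_000_000))  # 483

# Eleven values reduced to five bins:
print(bin_counts(range(11), n_bins=5, lo=0, hi=10))  # [2, 2, 2, 2, 3]
```

The binning here is deliberately naive. Real projects would lean on the more robust aggregation and sampling techniques that dataminers have already worked out, which is exactly the point of this section.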
Why Data Mining cannot afford to ignore Visualization
- Parameter setting is voodoo science. Despite the all-encompassing goal of making things as automatic as possible, without human intervention, almost all data mining techniques require some kind of parameter setting. Pick an algorithm at random from the long list of classical mining tools and you’ll discover there’s no such thing as a magic black box spitting out the best answer like an oracle. What’s the consequence? The user has to go through a lengthy trial-and-error feedback loop: (1) set the parameters, (2) run the algorithm, (3) look at the results … satisfied? Not really … go back to point (1) and repeat. There is clearly a huge role for visualization here. Visualization can help to better understand the output, compare alternative results, and understand the relationship between the parameters and the output.
- You cannot trust black boxes. The issue of trust is very well known among dataminers: the models data mining algorithms build are often arcane, and even if something seems to work, it is often hard to really understand why and how it works. Visualization has the power to narrow this gap and help model builders gain more confidence in the babies they build.
- There’s no right answer. Data Mining has a long tradition of providing tools to build models that give clear-cut answers automatically: “should I give the loan to this customer or not?” This is fine and useful, and it’s been a very successful model for data mining so far. But many modern inquiries on data are not so clear-cut. Data analysis is often exploratory and there’s no right answer. When mining is used for this purpose it necessarily needs a certain level of flexibility: ask a question, produce some initial results, visualize them, understand the problem better, change the parameters, use another algorithm, compare alternative results, etc. … and how do you do that without visualization?
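The set-run-look loop from the first point can be sketched in a few lines of Python. Everything here is illustrative: I picked a tiny one-dimensional k-means as a stand-in for “an algorithm with a parameter” (the post does not name one), and the function names `kmeans_1d`, `within_cluster_error` and `sweep_k` are hypothetical. The sketch runs the algorithm for several values of k and collects one error number per run, which is the kind of table you would then want to plot and compare instead of reading it off a console.

```python
def kmeans_1d(points, k, n_iter=20):
    """Tiny 1-D k-means with deterministic starting centers.

    Starting centers are spread evenly over the sorted data, an
    arbitrary choice made so the example is reproducible.
    """
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


def within_cluster_error(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum(
        (p - center) ** 2
        for center, cluster in zip(centers, clusters)
        for p in cluster
    )


def sweep_k(points, ks):
    """Steps (1) set, (2) run, (3) record, repeated for every k."""
    results = {}
    for k in ks:
        centers, clusters = kmeans_1d(points, k)
        results[k] = within_cluster_error(centers, clusters)
    return results


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
for k, err in sweep_k(data, ks=[1, 2, 3, 4]).items():
    print(k, round(err, 2))
```

On this toy data the error drops sharply up to k = 3 and barely moves afterwards, which is easy to see in a chart and easy to miss in a column of numbers. With more parameters and messier data, that comparison is where visualization earns its place in the loop.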
How to start delving into DM from VIS
Let me add a couple of suggestions for my dear fellow visualizers. If you are not too familiar with the tools data mining provides, I suggest you take a little journey out there and have a look. If you like it and learn something, great! You have a new arsenal of tools your friends probably don’t have. If you don’t like it, fine … you can go back to your stuff. But where do you start? I have a couple of simple suggestions.
There is a long list of data mining books and some of them are great textbooks. I personally learned a lot from Data Mining: Concepts and Techniques, but I expect any other textbook you find in the top ten results when you search for “data mining” on Amazon to be equally good.
But there is one small, very practical book that really changed the way I see data mining: Data Mining for Business Intelligence. I love this book. First of all, it is short and very readable. Second, it is all centered around practical real-world examples. This is the only book that really made me understand what data mining is all about and how useful it can be in real settings. So, if you have to pick one, pick this one. It is highly recommended. It also comes with a little tool called XLMiner that you can install in Excel to run all the examples. The authors also provide seminars on the web and lots of teaching material.
And if you just cannot wait and want to start “doing”, there are two amazing free tools: Weka and KNIME. Weka is considered a sort of standard, thanks also to its Java libraries that can be reused in your projects. You can run it from a console or through the UI it provides. The UI is not the best, but it’s not too hard to use. But if you don’t want to mess around with commands and hidden menu items, go for KNIME. I talked about KNIME in my latest post and I can assure you it is a great, great investment. It has quite a smooth user interface organized around a workflow model. You just have to connect processing “nodes” that apply specific functions and transform input data into output data.
Have fun! And let me know how it goes.
[And of course I wish you all a great Year 2011!!! I hope I will be able to write some posts you really love. All my best.]