Friday, April 13, 2012

Jim Gray's last speech

Yesterday, while surfing the net, I found a free book called "The Fourth Paradigm". The book is a compilation of papers about how data analysis can change the way science is done in many fields. It is dedicated to the memory of Jim Gray, a technologist and ACM Turing Award-winning computer scientist who worked for many years at the research division of Microsoft. Jim Gray mysteriously disappeared at the end of January 2007 while sailing off the coast of San Francisco.

The book includes a foreword by Jim Gray, which is essentially a transcription of his last speech. Reading it, I found it visionary and amazingly accurate in predicting how data analysis would evolve. Jim Gray made very good points about what had been going wrong with data for many years. For Jim, the reason scientists and small research centres neglected their data analysis was the lack of good, affordable tools. Not many years ago, all that most scientists and universities had was Matlab and thousands of Excel spreadsheets to store their data. By contrast, big research centres like CERN or the NSA could afford much better tools, which were simply prohibitive for centres with more humble budgets.

That was the situation only five years ago. It has changed dramatically since then. The need of internet companies to process large data sets has sped up the development of more and much better tools, and because many of these tools were developed following an open-source model, their adoption has also been massive.

It's an old story: when you build something good and lower the barrier to entry, people adopt your product or technology in large numbers. I personally believe that free and open-source software, and the principles behind it, are one of the reasons why the world has changed so much in recent years and continues to change at an even faster pace.

Data is a game changer. Ten years ago I couldn't have believed there would ever be a translator as good as Google Translate, a translator based on statistical translation. The algorithms have not improved much; what has changed is our capacity to store more and more data and process it in meaningful ways. As John More amusingly shows in his talk, "What's a career in big data?", more data leads to better results.

Today, large and medium-sized companies are adopting open-source solutions for storing, processing and understanding their data, using tools that didn't exist five years ago. At the same time, the world is turning into a big playing field of interconnected people who both emit and consume information. We consume tons of information every day, but we also produce a big chunk that others consume.

Jim Gray cleverly points out how strange the way scientists publish their work still is. Years of research are summarized in an eight-page article published in a respected journal like Science or Nature. But what about all the data that backs up their conclusions? Why not provide that information too, and let others rethink, reuse and test their hypotheses? Turn science into a collaborative effort, very much like the way free and open-source software works, very much like the way lots of people take things from the internet, mix them and build new things out of them, very much like the way people collaborate to build something bigger than themselves, like Wikipedia. Give people easy-to-use, accessible tools and the ability to collaborate, and they will come up with surprising results.

This last speech of Jim Gray's is inspiring and visionary. It reminded me somehow of Richard Feynman's famous speech, "There's Plenty of Room at the Bottom". Jim's talk is to data analysis what Feynman's talk was to physics: a way of seeing an old field differently, something that eventually gave rise to a new field of its own and a new way of doing things.

Lastly, here's an excerpt from the speech that I liked very much:

"But the Internet can do more than just make available the full text of research papers. In principle, it can unify all the scientific data with all the literature to create a world in which the data and the literature interoperate with each other. You can be reading a paper by someone and then go off and look at their original data. You can even redo their analysis. Or you can be looking at some data and then go off and find out all the literature about this data. Such a capability will increase the “information velocity” of the sciences and will improve the scientific productivity of researchers. And I believe that this would be a very good development!"
