Data for Dummies

Posted on August 19, 2011. Filed under: Hyperpublic, venture capital

It has occurred to me lately that the world of data and big data is very opaque and confusing to normal people, and even to most internet people. Frankly, if I’m super honest, it was pretty opaque to me even a year ago. I remember going to a talk at Hunch last year, when Dixon was running his Tuesday night discussion groups, where Roger Ehrenberg and the IA team broke down the ins and outs of big data. I left thinking to myself, WTF are these guys talking about…since then, data has become my life…in this post I will attempt to outline a sort of Data for Dummies…and what you will realize, I hope, is that data is actually your life too:

There is an infinite amount of data in the world. The cup on your desk right now is a datapoint. Actually, embedded within that cup are multiple datapoints. There is a datapoint that says “this cup exists at this physical location,” another that says “this cup is 4.2 inches high,” a third that says “this cup is paper,” and so on and so forth. How I choose to organize this information, and which dimension of the cup I focus on when structuring, storing, or visualizing it, are choices that depend on the application or use I have for it.

When people say that there is more data today than there was 10 years ago, that is misleading. The big data explosion is not mainly that we are creating data that did not previously exist in this volume (although we are, insofar as population is growing, more activity is happening within a given unit of time on earth than before, etc.). In the context of big data as the term is commonly thrown around, the major difference is that a higher volume of data is being captured in digital format. So, the cup has always existed, but now there are 10 images of the cup that have been taken with smartphone cameras, and the distributor of this cup has tracked its movement through space because they’ve moved to a SaaS-based POS solution that keeps a record of various dimensions of its existence, and one person tweeted about almost spilling it, and that tweet had a geopoint attached to it, so I know it’s here. All of a sudden, there are 15 different representations of the dimensions of this cup that I am able to access, assuming I know where to look and how to query the environment that holds this information (or the environment in which it was captured).

So, for every piece of data that exists in the world, for every event, every object, every occurrence, really for everything that exists or happens in our universe, there are an increasing number of accounts and representations of it turning up, largely in siloed digital environments. The interesting thing is that all of these environments that house the record of what is happening, moving, and existing each day on earth choose to look at an occurrence or object through a different lens. Some look at the cup’s height, some look at its location, and some look at its material, and each environment stores the information about the cup along the axis that it cares about.

So, the amount that I can deduce, understand, or act upon with a single dimension of the cup is small relative to what I can achieve if I see every dimension of the cup. Sure, a perfect dataset would exist if a single person sat down and documented every single aspect and element of its existence, but that is not the way information is created and captured in real life. Enter crawling, ingestion, and normalization: the process by which we are able to find every representation of the cup that has been captured in any environment and determine that each representation, in fact, references this very same cup. Think of this layer in the big data stack as the “glue” in some sense. The output of this effort is a much more complete set of information about this cup than exists in any one siloed location, and now, when presented to a 3rd party who would like to make a decision about this cup, my aggregate representation of the cup is more valuable than any single element of the cup that has been captured previously.

**BTW, in that example, the cup could be the cup, the transaction, the event, the person, the moment, etc., etc., etc.
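
For the more technical reader, here is a rough sketch in Python of what that “glue” layer might look like. Everything in it is made up for illustration: the three sources, the field names, and the CUP-001 identifier are all assumptions. It also cheats by assuming every silo carries some version of the same ID, whereas in real life figuring out that two records describe the same cup is most of the hard work.

```python
from collections import defaultdict

# Hypothetical records about the same cup, each captured by a different silo.
# Source names, field names, and the CUP-001 identifier are invented.
RAW_RECORDS = [
    {"source": "pos_system", "sku": "CUP-001", "height_in": 4.2},
    {"source": "photo_app", "label": "cup-001 ", "material": "Paper"},
    {"source": "tweet", "mention": "Cup-001", "lat": 40.7409, "lon": -74.0060},
]

# Each silo stores the object's identifier under a different key.
ID_FIELDS = ("sku", "label", "mention")


def normalize_id(record):
    """Return a canonical ID so records about the same object can be matched."""
    for field in ID_FIELDS:
        if field in record:
            return record[field].strip().lower()
    raise ValueError("record has no known identifier field")


def merge_records(records):
    """Glue per-silo records into one aggregate representation per object."""
    merged = defaultdict(dict)
    for record in records:
        object_id = normalize_id(record)
        for key, value in record.items():
            if key in ID_FIELDS or key == "source":
                continue  # bookkeeping fields, not dimensions of the object
            if isinstance(value, str):
                value = value.strip().lower()  # normalize free-text values
            merged[object_id][key] = value
    return dict(merged)


cups = merge_records(RAW_RECORDS)
print(cups["cup-001"])
```

The merged record holds the height, the material, and the location together, which is more than any one of the three sources knew on its own.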

Ok, so there is interesting work being done around capturing every dimension of the cup and creating a better representation of the cup. But what about cases where an object exists but no representations of it have been captured in any format? Or, more realistically, no representations of it have been captured in a digital format that can improve my understanding of the aggregate object? In those cases, we can analyze existing bodies of data that are similar to the object that we lack a representation for and make very, very good guesses (think 90% accuracy) as to what the dimensions of this unrecorded object are. When you hear people reference “Machine Learning,” they are talking about training a machine on a known set of data to make smart guesses about unknown but related information. So, there is interesting work being done in the classification of information, and in the ascription of properties to objects or occurrences for which we have imperfect information.
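
To make “smart guesses” a little less abstract, here is a toy sketch in Python. It uses one very simple technique, nearest neighbors, which is just one of many approaches and not necessarily what any particular company uses. The heights, weights, and materials below are invented, and a handful of examples like this says nothing about real-world accuracy.

```python
from collections import Counter
import math

# Invented "known" data: (height in inches, weight in ounces) -> material.
KNOWN_OBJECTS = [
    ((4.2, 0.4), "paper"),
    ((4.0, 0.5), "paper"),
    ((4.5, 0.3), "paper"),
    ((3.8, 11.0), "ceramic"),
    ((4.1, 12.5), "ceramic"),
    ((3.9, 10.2), "ceramic"),
]


def guess_material(features, known=KNOWN_OBJECTS, k=3):
    """Guess an unrecorded object's material from its k most similar known objects."""
    # Rank known objects by straight-line distance to the new object's features.
    by_distance = sorted(known, key=lambda item: math.dist(item[0], features))
    # Let the k closest objects vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(guess_material((4.3, 0.6)))   # a light cup nobody has labeled
print(guess_material((4.0, 11.5)))  # a heavy cup nobody has labeled
```

The machine never saw either of these two cups, but because they sit close to cups it has seen, it can ascribe a material to each of them.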

Once we have done our best to capture every explicit and inferred dimension of the cup’s existence, we are able to build applications that call any specific dimension of the cup in order to enhance an end user’s interaction with that cup, cups in general, or even the space in which the cup exists. Because someone has organized all of the data elements of the cup into a single, complete, easy-to-understand representation of the cup, I am able to create an iPhone app that tells you where you can find cups in the Meatpacking District, or where you can find paper goods in the Meatpacking District, or where you can find 4.2 inch objects in the Meatpacking District, or which people have drunk water in the last 30 minutes, etc., etc., etc.

The applications of an infinite dataset are infinite as well. Much time and energy are spent determining which applications are most immediately valuable to users and therefore monetizable, but at a more fundamental level, the potential applications of big data are growing as fast as, or even faster than, the capture of big data. It is for this reason that efforts around plumbing and infrastructure in this ecosystem have the potential to be enormously valuable and important, albeit not tangibly monetizable without pushing up the stack to capture near-term value. In some cases the enterprise customer appears attractive, but personally, I think the sex appeal of a perfect representation of the cup lies in how its existence changes an individual’s interaction with space, the world, other people, and ultimately time. We are moving in a direction where an increasing number of the decisions I make, and ultimately every one of them, is augmented, enhanced, or at least influenced by my access to these increasingly complete digital representations of life, which go far beyond what I am able to understand and take in based solely on my physical senses.
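
As a small illustration of the Meatpacking examples above, here is a sketch in Python of how one complete representation can answer several different questions. The objects, field names, and neighborhoods are all invented, and a real app would sit on top of a proper datastore rather than a list in memory.

```python
# Hypothetical aggregate records; every field and value here is invented.
OBJECTS = [
    {"name": "cup", "material": "paper", "height_in": 4.2, "neighborhood": "meatpacking"},
    {"name": "napkin", "material": "paper", "height_in": 0.1, "neighborhood": "meatpacking"},
    {"name": "mug", "material": "ceramic", "height_in": 4.2, "neighborhood": "soho"},
]


def find(objects, **criteria):
    """Return the objects whose fields equal every given criterion."""
    return [
        obj
        for obj in objects
        if all(obj.get(field) == wanted for field, wanted in criteria.items())
    ]


# The same dataset, queried along three different dimensions.
cups_nearby = find(OBJECTS, name="cup", neighborhood="meatpacking")
paper_goods_nearby = find(OBJECTS, material="paper", neighborhood="meatpacking")
same_height_nearby = find(OBJECTS, height_in=4.2, neighborhood="meatpacking")

for label, results in [
    ("cups", cups_nearby),
    ("paper goods", paper_goods_nearby),
    ("4.2 inch objects", same_height_nearby),
]:
    print(label, [obj["name"] for obj in results])
```

Each query is a different “app” built on the same record of the cup, which is the sense in which one good dataset supports many applications.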
Basically, humans and machines are collectively TiVoing real life and space and time, but the “Play” button on that recording totally sucks, and there is a ton of opportunity in giving consumers easy access to the recording.