Your data is big and it's getting bigger. (No, I didn't just call you fat.) Companies that collect and process data are getting bigger, often through mergers and acquisitions, but also through organic growth. The number and types of devices producing and capturing data have exploded in recent years. The data we are accumulating has reached epic proportions, and it is growing by epic proportions every day.

The vaguely childish code name "big data" doesn't begin to cover it. That name almost seems to imply that the problem is limited to really big files. You know, like your sales history file after forty years on the same computer system. Yes, it's big. It is really big. But that is a really simple problem. It's a defined data set of a known structure, and presumably you have the tools to manipulate it. Even with the added complication of mergers it is still pretty simple. Now you have two sales history files and they are in different formats. That is something we have been dealing with for decades: mapping one to the other, or both to some middleware, often a "cube" for report processing. All simple, still manageable.

This concept of "Big Data" goes beyond all of that. It goes beyond do-able. That's probably the most succinct definition: "beyond do-able." The more formal definition rests on the three V's that define Big Data: Volume, Velocity, Variety. A fourth "V" comes in the effort that must be made to determine the Value of a stream of data. So we must deal with data that is large (volume), that is arriving quickly (velocity), that has varied structure (variety), and that has some weighted value that must be determined.
Each of these "V"s brings its own inherent challenges, that is sure. But the scary and impressive one is Velocity. We can deal with the variety of data, if we have some time to poke around with it. We can deal with the size of a large file. We can use indexes, reporting cubes, programs to reformat and to compress. We can do most anything, given time. But velocity comes along and eats our time for lunch. We don't have the luxury of poking around and building giant data cubes. The new data is coming at us so fast that it is outpacing our efforts to deal with it! We are sitting in a pool of water trying to bail with one-liter bottles, but our pool is at the base of a waterfall. That is the problem with "Big Data."

When you braid in the other two V's, it gets so much worse. It's not just a high volume of varied data arriving at velocity; the variety itself is increasing at velocity, too! It is the velocity of the variety, and the variety of the velocity. I'm sure you get the drift, but let's extend the analogy. Say we step out of the pool and onto dry land. We take the time we need to programmatically churn through five data streams coming over that waterfall: what we need, what we don't need, how we are going to consolidate each stream, and what we want to use from it. We lost some ground while we tinkered, but that's okay. We've decided to accept that loss because now we have data to play with. We have a way to sample the streams in the most intelligent fashion (predictive analytics). We've stratified and prioritized, we've stacked and compressed, we've powdered and perfumed. We're feeling pretty good about our data. But while we were looking away, three more streams of data dug new grooves over the dam. Data that we didn't know about, didn't expect, don't understand. By the time we figure out what we need from that data, we look up and... well, you can guess the pattern. That is the real issue behind big data.
People talk about those other "V"s, but it's that V for Velocity that is creaming us.
The image most often associated with big data is an elephant. Even Hadoop, the predominant tool for managing Big Data, is named for an elephant: Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. My own Dad has an adage about elephants. He got it from his Dad. And like all of my grandfather's wise sayings, the real impact comes in the corollary. You ask: "How do you eat an elephant?" The answer is easy: "One bite at a time." But the corollary? "That last bite can get pretty gamey." It was always a good aphorism, applying well to most big projects. If you spend too much time solving a problem, the nature of the problem itself will have changed, and not usually for the better. While this still applies in the case of Big Data, those slowly lumbering elephants have become giant rabbits. You know, hopping around quickly and multiplying like crazy.
To cope with all of those bunnies, a divide-and-conquer approach known as MapReduce has come into use. Unlike a centralized database system, where you have one disk connected to one or more CPUs with a limited amount of horsepower, MapReduce lets you distribute the data across clusters of servers that provide distributed storage and multiple processors. So while your program for indexing your data may not have changed much (yet), you can get results faster by sending your application and a chunk of the data to each of the servers in your cluster. Each server operates on its own chunk, and the results are then delivered back as a unified whole.
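The mechanics are easier to see in miniature. Here is a toy sketch of that divide-and-conquer idea in Python, using a local process pool to stand in for a cluster of servers; the sample data, the chunking scheme, and the word-count "indexing" task are all invented for illustration, not part of any real framework:

```python
from multiprocessing import Pool

def index_chunk(lines):
    """The work shipped to each 'server': count words in one chunk of the data."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def combine(partials):
    """Deliver the per-chunk results back as a unified whole."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    data = ["big data is big", "data moves fast", "big data is varied"]
    chunks = [data[0::2], data[1::2]]   # split the data set across two workers
    with Pool(2) as pool:               # two local processes stand in for two servers
        partials = pool.map(index_chunk, chunks)
    print(combine(partials))
```

Each worker touches only its own slice, so adding workers (or servers) speeds up the chunk-processing step almost linearly; the combine step is the serial bottleneck that real frameworks work hard to parallelize as well.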
The idea, and the name, for MapReduce came from Google, a company that has always been pretty good with data. Their breakthrough was realizing that a search engine could use input other than the text on the page. The joke is that they "thought outside of the (search) box." They needed to usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform in 2004. They named it simply enough with the two verbs that describe its action (no stuffed animals were injured): MapReduce. MapReduce allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors. The framework is divided into two parts, the two actions: mapping is the process of breaking a task and its data up across multiple nodes, and reducing is the function that collates the work and resolves the results into a single answer.
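Those two actions, plus the grouping step the framework performs between them, can be imitated in a few lines of Python. This is a single-process word-count sketch of the programming model, not Hadoop's actual API; the function names and sample records are mine:

```python
from collections import defaultdict

def mapper(record):
    # Map: break one input record into (key, value) pairs, here (word, 1).
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group every value under its key, as the framework
    # does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Reduce: resolve each key's values into a single result.
    return key, sum(values)

records = ["one fish two fish", "red fish blue fish"]
pairs = [pair for record in records for pair in mapper(record)]
results = dict(reducer(key, values) for key, values in shuffle(pairs))
print(results["fish"])  # 4
```

In a real cluster the mappers run on the nodes that hold each block of data and the reducers run elsewhere, but the contract is the same: mappers emit key/value pairs, the framework groups them by key, and reducers fold each group into a result.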
Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications. Hadoop is written in Java as part of the Apache project (sponsored by the Apache Software Foundation). Both Google's MapReduce and the open source Hadoop rely on distributed file systems. Hadoop uses a standard distributed file system, HDFS (the Hadoop Distributed File System), while Google MapReduce uses the proprietary GFS (Google File System). In both cases, the distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted if a node fails. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
Using this approach, a lot of interesting data products have emerged. Google used the technology to implement spell-checking (by building a dictionary of common misspellings and their context), to integrate voice search, and for useful functions such as tracking the progress of the swine flu epidemic of 2009. And Google isn't the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with frightening perspicacity.
No stranger to this sort of trickery from its very inception, Amazon saves your searches, correlates what you search for with what other users search for, and uses the data to create disturbingly accurate and budget-busting recommendations. These recommendations help to drive Amazon's more traditional retail business. Retailers understand that customers generate a trail of "data exhaust" that can be mined and put to use.
It is still hard to implement a Hadoop solution, and there are not that many experts. This is where Amazon has taken things a step further and "packaged" a cloud-based MapReduce offering that it calls Amazon Elastic MapReduce (Amazon EMR), a user-friendly pay-per-use service.
Developing code with MapReduce and Hadoop demands a completely different way of thinking from traditional programming paradigms. Most traditional programming shops will have to re-tool to take advantage of this new paradigm. Seriously? The paradigm shift again? Not only that. Not only might we want to "re-tool" and build some programs that we pass off to these clusters to mine our data, but we may also want to think about modifying our approach to our routine data processing applications. MapReduce on the fly, as it were. So there we are: retooling and refactoring. Again. It's something we might want to start thinking about now, even if we aren't ready to make any serious moves in that direction.
Everywhere you look you see discussions about Big Data, and because of the (let's face it, dumb) name it is easy to start thinking of the "problem" as how to manage a lot of data. That is a challenge, no question about it. But it is not the juicy part. The juicy parts are the new ways that we will use our giant data. Whether or not an organization is able to figure out innovative uses of its data is going to be critical to its survival in coming years. This is where the new field of data science comes in. According to the Harvard Business Review, "data scientist" is the sexiest job of the 21st century. Quoting from that Harvard Business Review article:
"… thousands of data scientists are already working at both start-ups and well-established companies. Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before. If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a "mashup" of several analytical efforts, you've got a big data opportunity."
A "data opportunity"! These opportunities will lead to "data products" that are developed using "data science" through "data conditioning". These are all interesting new ideas, with the real excitement of getting ahead of your data using "predictive analytics".
So what we've started with here is a simple definition of "Big Data" and an overview of the mechanical tools and methodologies that are coming into use for managing it and for mining it. What's next, and way more fun, is to take a look at how people are using all of this structured and unstructured data. It is fascinating to think about how your company might use its Big Data. What jewels are out there in your data, waiting to be mined?
We aren't just changing how we store and access data. We will change the way we think about data. We will change how we market and sell and will certainly change how we buy. This all necessarily leads to new views on privacy and to some ethical dilemmas. There is a line — and sometimes it is a very fine line — between opportunity and exploitation, between providing a service and committing an offense. This is where the Big meets the Data.