Big Data is hype. It’s also a buzzword. Maybe a trend? Down-to-earth people might say it’s just mass data called “big”. Although there are many very large data warehouses in the BI world, data science seems obsessed with handling big data – “when the size of the data itself becomes part of the problem.” For Gartner and Forrester even “big” is not enough anymore; they have started using the term “extreme”, and they are right – volume alone is not Big Data.
According to Gartner, Big Data is data at extreme scale in terms of Volume, Velocity, Variety and Variability. Since the word “big” overemphasizes Volume, “extreme” might be the more appropriate term. Anyway, “big” is here, is shorter and sounds better, so let’s stick to it. 😉 Big Data also pairs better with big money – extreme money does sound strange, right? A new study from Wikibon pegs Big Data revenues at $5B in 2012, surging to more than $50B by 2017.
So what’s really new about Big Data? In order to find an answer, we first have to ask ourselves: How did we get here? What led to this trend? Let’s have a look at some other important and interdependent trends:
“Software is eating the world” and the Internet Revolution
Two decades ago you needed special training in order to use software systems. Consumers used their Office suites, and the few websites out there were only a bunch of static HTML files. Enterprises had software supporting specific business functions, mostly backed by relational storage, and they were just starting to put this relational data to use.
The rise of the modern Internet started a new trend: all of the technology required to transform industries through software finally works and can be delivered at global scale. Consumers and businesses have moved online, where more than 2 billion people use broadband internet, and today’s internet is:
– easy to use and everywhere (pervasiveness)
– dynamic, complex and agile (variability)
– extremely large (volume)
– extremely quick (velocity)
– noisy (extracting the message is getting harder)
– vague and uncertain
– not well-structured and diverse (variety)
– not always consistent
while every single one of these attributes is getting more extreme.
The transformation of Web 1.0 static websites into Web 2.0 web applications is now continuing towards Web 3.0, or the Semantic Web, where data, its semantics, and the insights and actions derived from that data become the most important part of an internet service.
A Shift in Data
Is Big Data only about Web or Internet data? Not necessarily, but the WWW is still the main driver. Add to that a new awareness of an old fact: unlike people, not all data is equal – and the inequality keeps growing. Many new consumer and enterprise apps create data footprints that grow larger and faster, in more formats, and with more complexity. So why treat all data equally? Why would you want to store and process streams of RFID messages the same way as your business transaction data? Well, only if you have no choice.
Many people talk about unstructured data being Big Data. Thinking about the term “unstructured data” for longer than a few seconds opens up the following questions: What is data without structure? Where does structure end? How can such data be interpreted and analyzed?
The answers: there is no data without structure. If there is absolutely no structure or context, it’s just noise and you can forget about analyzing it. Even a piece of text has a certain structure and context, so one can mine it to extract its semantics. What most people mean by “unstructured” is data coming from a non-relational source with varying structure. After 40 years of dealing with nice and tidy relational data in analytical environments, the brave new world surely might seem a bit chaotic and unstructured. But it’s not – it’s just different.
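To make the point concrete, here is a minimal sketch (the sample records and the parser are illustrative assumptions, not from any real system): a JSON document and a plain log line are both “non-relational”, yet each carries structure that a few lines of code can recover.

```python
import json

# Two records from "non-relational" sources: not unstructured,
# just differently structured than a relational row.
tweet = json.loads('{"user": "alice", "text": "Big Data is extreme", "tags": ["bigdata"]}')
log_line = "2012-05-14 10:32:01 INFO user=alice action=login"

def parse_log(line):
    """Even a plain text log line yields a structured record."""
    date, time, level, *pairs = line.split()
    record = {"date": date, "time": time, "level": level}
    record.update(pair.split("=", 1) for pair in pairs)
    return record

print(parse_log(log_line)["action"])  # -> login
```

The structure was there all along; it just wasn’t declared up front in a relational schema.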
NoSQL – a new choice for Data Storage and Processing
In order to efficiently process this kind of data for generating insights and actions, a new set of data management and processing software has emerged. These software technologies are:
– mostly open-source and frequently JVM-based
– excellent at scaling through massive parallelism on commodity computing capacity
– capable of storing and processing many different data formats such as JSON, XML, binary, text, …
They represent the so-far-missing alternative for many use cases such as (complex) event processing, operational intelligence, machine learning, real-time analytics, genetic algorithms, sentiment analysis, etc.
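One primitive behind several of these use cases (event processing, operational intelligence, real-time analytics) is counting events over a sliding time window. A toy sketch, with timestamps and the 60-second window as illustrative assumptions:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events within the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def record(self, ts):
        self.events.append(ts)
        # evict events that have fallen out of the window
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()

    def count(self):
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=60)
for ts in [0, 10, 30, 65, 70]:
    counter.record(ts)
print(counter.count())  # events at 30, 65 and 70 are still in the window -> 3
```

Real stream-processing systems add distribution, fault tolerance and persistence on top, but the windowed-aggregation idea is the same.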
Traditional mass data storage and integration solutions in the domain of Data Warehousing and Business Intelligence are based on relational formats and batch processing, running for years on large, expensive and poorly scalable enterprise editions of RDBMS and even more expensive enterprise hardware. As history has shown many times, it is not always the idea or the use case that goes searching for the right technology (as one would expect); just as often, new technology inspires people to generate ideas and drives innovation.
Looking at the components of a data-driven or analytical application, the following technologies associated with the term “Big Data” have already taken a leading role:
MongoDB for Data Storage, Real-Time Processing and Operational Intelligence – a JSON-based, schema-less, document-oriented DBMS
Apache Hadoop for ETL/Batch Processing – implements the MapReduce algorithm for aggregation
R Project for Statistical Computing and Data Visualization
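The MapReduce model Hadoop implements is simple enough to sketch in a few lines: map emits (key, value) pairs, the framework groups them by key (“shuffle”), and reduce aggregates each group. This in-process word-count sketch only illustrates the model; a real Hadoop job distributes these phases across a cluster.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key."""
    return key, sum(values)

docs = ["big data is extreme", "big money"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # "big" occurs twice across the documents -> 2
```

The appeal of the model is that map and reduce are stateless per key, so the framework can parallelize them across commodity machines almost arbitrarily.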
Hardware and High Performance Cloud Computing
All of the above technologies enable High Performance Computing by scaling out across bunches of commodity hardware. As computing capacity keeps getting cheaper and seemingly limitless through various “Cloud” offerings, we no longer have to ask ourselves “Do we really need this data?” before storing it. Store first, analyze later is reality today – not only because of cheap hard disks, but also because we can add extra computing capacity for a limited time once we want to run our analyses.
It is the combination of the above-mentioned trends that adds up to a different way of looking at data today. These trends surely depend on and affect each other, but explaining this would lead off the subject. Being a practical person, I would like to go into more detail and describe an analytical platform based on the three leading technologies: MongoDB, Apache Hadoop and R. Not now and not here, so stay tuned…