Patterns in Big Data

Porsche was founded in 1931 in Stuttgart, Germany. While Porsche is often associated with sports cars, that has never been the sole focus for the company. The first project for Porsche was to design a car for the people, as requested by the German government. This led to the creation of the Volkswagen Beetle, one of the great successes in the history of the automotive industry. During World War II, Porsche designed three types of tanks, as the war obviously called for a more robust vehicle than the Beetle. It wasn’t until 1948 that Porsche introduced its first sports car, the 356, followed in 1964 by the Porsche 911. Porsche developed an entire line of professional racing cars and more casual sports cars for the rest of the century, until the world demanded a new vehicle: the Porsche Cayenne. The Cayenne was geared towards families that needed more space and room for more passengers than a typical sports car offers. The most recent chapter was the development of the Porsche Panamera, a sedan with the features of a sports car but not the bulk of a Cayenne. A key part of the engineering strategy has been to leverage common parts across the product lines to drive efficiencies. This enabled the company to meet many different client needs, at a value on par with the quality.

One philosophy has dominated Porsche engineering since the company’s formation: there is not one vehicle for all situations and people. Instead, each vehicle needs to perform a specific job for its user.

Patterns in Big Data

I first wrote about Next Generation Middleware in October of 2011. While a lot has changed since then, many of my views on how Big Data will evolve have not. That being said, they have certainly become more granular.

I've had a front-row seat to how Big Data is changing client environments for a few years now. Two things are quite evident to me:

1) This change is quite real, it’s accelerating, and it’s much more than Hadoop.
2) There is a set of emerging deployment patterns.

As we’ve moved through the experimental phase of Hadoop and Big Data, I’m seeing clients take a much more strategic approach to the topic. It’s less about trying out the flavor of the month (Cassandra, Mongo, Hadoop, etc.) and more about figuring out how to integrate many of these components into their existing environment.

A key tenet of developing a Big Data strategy is to take a page from Porsche's strategy and acknowledge that one size does not fit all. There are many technologies, most have a unique and special purpose, and the leaders in Big Data will leverage all or most of them in a complementary way. Hence, the pattern that I am seeing around building a Big Data strategy revolves around three cornerstone environments:

The Landing Zone
The Discovery Zone
The Guided Zone

This is what it looks like logically:

You will recognize that IT environments of the last 20 years have been largely focused on the ‘white areas’. These are traditional data repositories, providing data to business applications. This is how companies ran their business in the e-business era. Certainly, as data warehousing and analytics have risen to prominence, we have seen more investment in the ‘blue boxes’, or Big Data Zone. However, most of that investment to date has been an augmentation of the ‘white areas’ (i.e., providing analytics of structured data from transactional systems).

The Big Data Zone is where companies will separate themselves from others in the next 5-10 years. Those that can execute on this vision and get there faster will be more efficient, more information rich, and make better decisions.

The Landing Zone

This is the place where you 'land' your data in its native form. All data types, sizes, and levels of veracity are accepted and expected. It's the innovation 'manufacturing floor', and as you begin to harvest your data assets, you can send those refined assets to other zones. The Landing Zone must be cost-effective and differentiated by analytics and analysis (not just the run-time), as the effectiveness of your other zones may depend on the Landing Zone. I expect that we will see Hadoop and the plethora of NoSQL options take root in the Landing Zone.

The Discovery Zone

This is the place for discovery and deep analytics, primarily of structured data assets, but not limited to that. Have large, complex analytic queries? Do them here. Need high-performance analytics? Do them here. This becomes the core analysis and analytics hub for the organization. This will be the most efficient and cost-effective place for high-performance analytics. Obviously, this requires tight integration with the Landing Zone.

The Guided Zone

This is the place for mixed analytic workloads. It's not just deep analytics like the Discovery Zone; it encompasses thousands of concurrent users, operational workloads, analytic workloads, and all of them in combination. It's the best place for mixed workloads, but it's too expensive to use for just landing data or for data discovery. This zone will be more important in some companies (like credit card companies tracking fraudulent transactions in real time) than in others (a retailer analyzing last month's sales).
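
To make the division of labor between the three zones a bit more concrete, here is a rough sketch in Python of how a workload might be routed to a zone. This is only an illustration of the pattern described above, not a real product or API: the Workload fields, the choose_zone function, and the 1,000-user cutoff are all assumptions of mine.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    LANDING = "Landing Zone"
    DISCOVERY = "Discovery Zone"
    GUIDED = "Guided Zone"


@dataclass
class Workload:
    """Hypothetical workload description; every field is illustrative."""
    name: str
    raw_ingest: bool = False    # data arriving in its native form
    concurrent_users: int = 1   # rough number of simultaneous users
    operational: bool = False   # operational work mixed in with analytics


# Assumed cutoff for "many concurrent users"; not a figure from this post.
MIXED_WORKLOAD_USERS = 1000


def choose_zone(workload: Workload) -> Zone:
    """Pick a zone using the rules of thumb described above."""
    # Native-form data of any type, size, or veracity lands first.
    if workload.raw_ingest:
        return Zone.LANDING
    # Mixed operational/analytic work or high concurrency goes to Guided.
    if workload.operational or workload.concurrent_users >= MIXED_WORKLOAD_USERS:
        return Zone.GUIDED
    # Everything else is treated as discovery and deep analytics.
    return Zone.DISCOVERY
```

Under these assumptions, a real-time fraud workload with thousands of users would end up in the Guided Zone, a retailer's analysis of last month's sales in the Discovery Zone, and a raw log feed in the Landing Zone. Real environments will need more nuanced rules; the point is simply that each zone has a specific job.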

This pattern of Big Data Zones is gaining steam in forward-looking IT environments across the industry. As Porsche realized long ago, many companies know that there is not a single answer to every problem. Leaders in Big Data will embrace this notion of the Zones and start to build a plan to meet the analytic needs of the organization, leveraging all aspects of Next Generation Middleware.