Scale Effects, Machine Learning, and Spark

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integrated product offering for web services. His idea was to put together a coordinated package, which became WebSphere. The problem was that a key piece of the package, IBM’s web server software, was technically weak. It held less than 1 percent of a market..”

“Barry approached Brian Behlendorf [President of the Apache Software Foundation] and the two quickly discovered common ground on technology issues. Building a practical relationship that worked for both sides was a more complex problem. Behlendorf’s understandable concern was that IBM would somehow dominate Apache. IBM came back with concrete reassurances: It would become a regular playing in the Apache process, release its contributions to the Apache code base as open source, and earn a seat on the Apache Committee just the way any programmer would by submitting code and building a reputation on the basis of that code. At the same time, IBM would offer enterprise-level support for Apache and its related WebSphere product line, which would certainly help build the market for Apache.”

-Reference: The Success of Open Source, Steven Weber 2004

***

In the 20th century, scale effects in business were largely driven by breadth and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades.

The Internet changed all of that.

In the modern era, there are three predominant scale effects:

-Network: lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, etc.)
-Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
-Data: superior machine learning and insight, driven from a dynamic corpus of data

I profiled a few of the companies that are exploiting data effects in Big Data Revolution —CoStar, IMS Health, Monsanto, etc. But by and large, big data is an unexploited scale effect in institutions around the world.

Spark will change all of that.

***

Thirty days ago, we launched Hack Spark in IBM, and we saw a groundswell of innovation. We made Spark available across IBM’s development community. Teams formed based on interest areas, moonshots were imagined, and many became real. We gave the team ‘free time’ to work on Spark, but the interest was so great that it began to monopolize their nights and weekends. After ten days, we had over 100 submissions in our Hack Spark contest.

We saw things accomplished that we had not previously imagined. That is the power of Spark.

To give you a sampling of what we saw:

Genomics: A team built a powerful development environment of SQL/R/Scala for data scientists to analyze genomic data from the web or other sources. They provided a machine learning wizard for scientists to quickly dig into chromosome data (kmeans classifying genomes by population). This auto-scalable cloud system increased the speed of processing and analyzing massive genome data and put the power in the hands of the person that knows the data best. Exciting.

Traffic Planning: A team built an Internet of Things (IoT) application for urban traffic planning, providing real-time analytics with spatial and cellular data. Messaging queues could not handle the massive and continuous data inputs. Data lakes could not handle the large volume of cellular signaling data in real-time. Spark could. The team exploited Spark as the engine of the computing pool, Oozie, to build the controller module, and Kafka as the messaging module. The result is an application to processes massive cellular signal data and visualizes those analytics in real-time. Smarter Planet indeed.

Political Analysis: A team built a real-time analytics platform to measure public response to speeches and debates in real-time. The team built a Spark cluster on top of Mesos, used Kafka for data ingestion and Cloudant for data storage. Spark Streaming was deployed for processing. Political strategists, commentators, and advisors can isolate the specific portion of a speech that produces a shift in audience opinion. The voice of the public, in real-time.

Spark is changing the face of innovation in IBM. We want to bring the rest of the world along with us.

***

Apache Spark lowers the barrier to entry to build analytics applications, by reducing the time and complexity to develop analytic workflows. Simply put, it is an application framework for doing highly iterative analysis that scales to large volumes of data. Spark provides a platform to bring application developers, data scientists, and data engineers together in a unified environment that is not resource-intensive and is easy to use. This is what enterprises have been clamoring for.

An open-source, in-memory compute engine, Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Today, business professionals have analytics in their hands in the form of visual dashboards that inform them what is happening. Think of this as descriptive analytics. Now, with Apache Spark, these can be complemented with analytics smarts built into applications that learn from their surroundings and specifies actions in the moment. Think of it as prescriptive analytics. This means that, with Spark, enterprises can deploy insights into applications at the front lines of their business exponentially faster than ever before.

Spark is highly complementary to Hadoop. Hadoop makes managing large volumes of data possible for many organizations due to its distributed file system. It has grown to a broad ecosystem of capabilities that span data integration and data discovery. It changed the speed at which data could be collected, and fundamentally changed how we make data available to people. Spark complements Hadoop by providing an in-memory compute engine to perform non-linear analysis. Hadoop delivered mass quantities of data, fast. But the real value of data cannot always be exposed because there isn’t an engine to push it through. With Spark, there’s a way to understand which data is valuable and which is not. A client can leverage Spark to augment what they are doing with Hadoop or use Spark on a stand-alone basis. The approach is in the eye of the beholder.

***

While there are many dimensions to the Spark ecosystem, I am most excited by machine learning. Machine learning is better equipped to deal with the modern business environment than traditional statistical approaches, because it can adapt. IBM’s machine learning technology makes expressing algorithms at scale much faster and easier. Our data scientists, mathematicians, and engineers will work with the open source community to help push the boundaries of Spark technology with the goal of creating a new era of smart applications to fuel modern and evolving enterprises.

With machine learning at the core of applications, they can drive insight in the moment. Applications with machine learning at their core get smarter and more customized through interactions with data, devices and people—and as they learn, they provide previously untapped opportunity. We can take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed.

It is my view that over the next five years, machine learning applications will lead to new breakthroughs that will assist us in making good choices, look out for us, and help us navigate our world in ways never before dreamed possible.

***

I see Apache Spark as the analytics operating system of the future, and we are investing to grow Spark into a mature platform. We believe it is the best technology today for attacking the toughest problems of organizations of all sizes and delivering the benefits of intelligence-based, in-time action. Our goal is to be a leading committer and technology contributor in the community. But actions speak louder than words, which brings us to today’s announcements:

1)IBM is opening a Spark Technology Center in San Francisco. This center will be focused on working in the open source community and providing a scalable, secure, and usable platform for innovation. The Spark Technology Center is a significant investment, designed to grow to hundreds of people and to make substantial and ongoing contributions to the community.

2)IBM is contributing its industry leading System ML technology— a robust algorithm engine for large-scale analytics for any environment—to the Apache Spark movement. This contribution will serve to promote open source innovation and accelerate intelligence into every application. We are proud to be partnering with Databricks to put this innovation to work in the community.

3)IBM will host Spark on our developer cloud, IBM BlueMix, offering a hosted service and system architectures, as well as the tools that surround the core technology to make it easier to consume. Our approach is to accelerate Spark adoption.

4)IBM will deliver software offerings and solutions built on Spark, provide infrastructure to host Spark applications such as IBM Power and Z Systems, and offer consulting services to help clients build and deploy Spark applications.

IBM is already adopting Spark throughout our business: IBM BigInsights for Apache Hadoop, a Spark service, InfoSphere Streams, DataWorks, and a number of places in IBM Commerce. Too many to list. And IBM Research currently has over 30 active Spark projects that address technology underneath, inside, and on top of Apache Spark.

Our own analytics platform is designed with just this sort of environment in mind: it easily blends these new technologies and solutions into existing architectures for innovation and outcomes. The IBM Analytics platform is ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.

Our strategy is about building on top of and around a successful open platform, and adding something of our own that’s substantial and differentiated. Spark is that platform. We are just at the start of building many solutions that leverage Spark to the advantage of our clients, users, and the developer community.

***

IBM is now and has historically been a significant force supporting open source innovation and collaboration, including a more than $1 billion investment in Linux development. We collaborate in more than 120 projects contributed to the open source community, including Eclipse, Hadoop, Apache Spark Apache Derby, and Apache Geronimo. IBM is also contributing to Apache Tuscany and Apache Harmony. In terms code contributions, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux— 6.3 percent of total Linux contributions are from IBM. We’ve also contributed code to Geronimo and a wide variety of other open-source projects.

We see in Spark the opportunity to benefit data engineers, data scientists, and application developers by driving significant innovation into the community. As these data practitioners benefit from Spark, the innovation will make its way into business applications, as evidenced in the Genomic, Urban Traffic, and Political Analysis solutions mentioned above. Spark is about delivering the analytics operating system of the future—an analytics operating system on which new solutions will thrive, unlocking the big data scale effect. And Spark is about a community of Spark-savvy data scientists and data analysts who can quickly transform today's problems into tomorrow's solutions. Spark is one of the fastest-growing open source projects in history. We are pleased to be part of the movement.