Scale Effects, Machine Learning, and Spark

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integrated product offering for web services. His idea was to put together a coordinated package, which became WebSphere. The problem was that a key piece of the package, IBM’s web server software, was technically weak. It held less than 1 percent of a market.”

“Barry approached Brian Behlendorf [President of the Apache Software Foundation] and the two quickly discovered common ground on technology issues. Building a practical relationship that worked for both sides was a more complex problem. Behlendorf’s understandable concern was that IBM would somehow dominate Apache. IBM came back with concrete reassurances: It would become a regular player in the Apache process, release its contributions to the Apache code base as open source, and earn a seat on the Apache Committee just the way any programmer would by submitting code and building a reputation on the basis of that code. At the same time, IBM would offer enterprise-level support for Apache and its related WebSphere product line, which would certainly help build the market for Apache.”

- Reference: The Success of Open Source, Steven Weber, 2004

***

In the 20th century, scale effects in business were largely driven by breadth and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades.

The Internet changed all of that.

In the modern era, there are three predominant scale effects:

- Network: lock-in driven by a loyal user network (Facebook, Twitter, Etsy, etc.)
- Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
- Data: superior machine learning and insight, driven by a dynamic corpus of data

In Big Data Revolution, I profiled a few of the companies that are exploiting data effects (CoStar, IMS Health, Monsanto, among others). But by and large, the data scale effect remains unexploited in institutions around the world.

Spark will change all of that.

***

Thirty days ago, we launched Hack Spark at IBM, and we saw a groundswell of innovation. We made Spark available across IBM’s development community. Teams formed around interest areas, moonshots were imagined, and many became real. We gave the teams ‘free time’ to work on Spark, but the interest was so great that it began to monopolize their nights and weekends. After ten days, we had over 100 submissions in our Hack Spark contest.

We saw things accomplished that we had not previously imagined. That is the power of Spark.

To give you a sampling of what we saw:

Genomics: A team built a powerful development environment combining SQL, R, and Scala for data scientists to analyze genomic data from the web or other sources. They provided a machine learning wizard that lets scientists quickly dig into chromosome data (k-means clustering to classify genomes by population). This auto-scaling cloud system increased the speed of processing and analyzing massive genome datasets and put the power in the hands of the person who knows the data best. Exciting.
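
For a flavor of what that looks like in code, here is a minimal sketch of the clustering step using Spark MLlib’s k-means, assuming one sample per line of comma-separated numeric variant features; the file path, number of clusters, and iteration count are illustrative stand-ins rather than the team’s actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object GenomeClustering {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GenomeClustering"))

    // Hypothetical input: one genome sample per line, comma-separated numeric features.
    val samples = sc.textFile("hdfs:///genomes/variant-features.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
      .cache() // k-means is iterative, so keeping the features in memory pays off

    // Group samples into population clusters; k and the iteration count are illustrative.
    val model = KMeans.train(samples, 5, 20)
    val assignments = samples.map(model.predict)

    assignments.take(10).foreach(println)
    sc.stop()
  }
}
```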

Traffic Planning: A team built an Internet of Things (IoT) application for urban traffic planning, providing real-time analytics over spatial and cellular data. Messaging queues could not handle the massive, continuous data inputs. Data lakes could not handle the large volume of cellular signaling data in real time. Spark could. The team used Spark as the engine of the computing pool, Oozie to build the controller module, and Kafka as the messaging module. The result is an application that processes massive volumes of cellular signaling data and visualizes the analytics in real time. Smarter Planet indeed.
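
The ingestion pattern is easy to sketch. The snippet below assumes the Spark 1.x Kafka direct-stream integration (the spark-streaming-kafka artifact) and a made-up record format of "towerId,timestamp,signal"; the broker address and topic name are placeholders, not the team’s production configuration.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CellularSignalStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("CellularSignalStream"), Seconds(5))

    // Hypothetical Kafka broker and topic carrying raw cellular signaling records.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("cell-signals"))

    // Assume each record looks like "towerId,timestamp,signal"; count records per tower
    // over a sliding one-minute window to feed a real-time dashboard.
    val perTower = stream
      .map { case (_, record) => (record.split(",")(0), 1L) }
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))

    perTower.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The same Kafka-plus-Spark Streaming pattern underpins the political analysis application described next.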

Political Analysis: A team built a platform to measure public response to speeches and debates in real time. The team built a Spark cluster on top of Mesos, used Kafka for data ingestion and Cloudant for data storage, and deployed Spark Streaming for processing. Political strategists, commentators, and advisors can isolate the specific portion of a speech that produces a shift in audience opinion. The voice of the public, in real time.

Spark is changing the face of innovation in IBM. We want to bring the rest of the world along with us.

***

Apache Spark lowers the barrier to entry for building analytics applications by reducing the time and complexity of developing analytic workflows. Simply put, it is an application framework for doing highly iterative analysis that scales to large volumes of data. Spark provides a platform to bring application developers, data scientists, and data engineers together in a unified environment that is not resource-intensive and is easy to use. This is what enterprises have been clamoring for.

An open-source, in-memory compute engine, Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Today, business professionals have analytics in their hands in the form of visual dashboards that tell them what is happening. Think of this as descriptive analytics. Now, with Apache Spark, these can be complemented with analytic smarts built into applications that learn from their surroundings and specify actions in the moment. Think of that as prescriptive analytics. This means that, with Spark, enterprises can deploy insights into applications at the front lines of their business exponentially faster than ever before.
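
To make "combine these libraries seamlessly in the same application" concrete, here is a hedged sketch that filters rows with Spark SQL and then trains an MLlib regression model on the result, all in one program. The sales.json path, column names, and choice of model are assumptions made purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object SqlPlusMLlib {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlPlusMLlib"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical sales data; the path and schema are illustrative.
    sqlContext.read.json("hdfs:///data/sales.json").registerTempTable("sales")

    // Spark SQL step: select the rows and columns of interest.
    val recent = sqlContext.sql(
      "SELECT revenue, adSpend, visits FROM sales WHERE year >= 2014")

    // MLlib step, in the same application: learn revenue as a function of the other columns.
    // The three columns are assumed to hold double values.
    val points = recent.rdd.map(row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))).cache()
    val model = LinearRegressionWithSGD.train(points, 100)

    println(s"Learned weights: ${model.weights}")
    sc.stop()
  }
}
```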

Spark is highly complementary to Hadoop. Hadoop makes managing large volumes of data possible for many organizations due to its distributed file system. It has grown to a broad ecosystem of capabilities that span data integration and data discovery. It changed the speed at which data could be collected, and fundamentally changed how we make data available to people. Spark complements Hadoop by providing an in-memory compute engine to perform non-linear analysis. Hadoop delivered mass quantities of data, fast. But the real value of data cannot always be exposed because there isn’t an engine to push it through. With Spark, there’s a way to understand which data is valuable and which is not. A client can leverage Spark to augment what they are doing with Hadoop or use Spark on a stand-alone basis. The approach is in the eye of the beholder.
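
A small sketch of that complementary relationship: Spark reads data that Hadoop has already landed in HDFS, caches it in memory, and then makes several analytical passes without re-reading the file system. The path and log format below are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HadoopPlusSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HadoopPlusSpark"))

    // Data already collected and stored by a Hadoop cluster; the path is a hypothetical example.
    val events = sc.textFile("hdfs:///warehouse/clickstream/*.log").cache()

    // Because the dataset is cached in memory, each additional pass avoids re-reading HDFS.
    val total  = events.count()
    val errors = events.filter(_.contains("ERROR")).count()
    val byHost = events.map(line => (line.split(" ")(0), 1L)).reduceByKey(_ + _)

    println(s"$errors of $total events were errors, spread across ${byHost.count()} hosts")
    sc.stop()
  }
}
```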

***

While there are many dimensions to the Spark ecosystem, I am most excited by machine learning. Machine learning is better equipped to deal with the modern business environment than traditional statistical approaches, because it can adapt. IBM’s machine learning technology makes expressing algorithms at scale much faster and easier. Our data scientists, mathematicians, and engineers will work with the open source community to help push the boundaries of Spark technology with the goal of creating a new era of smart applications to fuel modern and evolving enterprises.

With machine learning at their core, applications can drive insight in the moment. They get smarter and more customized through interactions with data, devices, and people, and as they learn, they provide previously untapped opportunity. We can take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed.

It is my view that over the next five years, machine learning applications will lead to new breakthroughs that will assist us in making good choices, look out for us, and help us navigate our world in ways never before dreamed possible.

***

I see Apache Spark as the analytics operating system of the future, and we are investing to grow Spark into a mature platform. We believe it is the best technology today for attacking the toughest problems of organizations of all sizes and delivering the benefits of intelligence-based, in-time action. Our goal is to be a leading committer and technology contributor in the community. But actions speak louder than words, which brings us to today’s announcements:

1) IBM is opening a Spark Technology Center in San Francisco. This center will be focused on working in the open source community and providing a scalable, secure, and usable platform for innovation. The Spark Technology Center is a significant investment, designed to grow to hundreds of people and to make substantial and ongoing contributions to the community.

2) IBM is contributing its industry-leading SystemML technology—a robust algorithm engine for large-scale analytics in any environment—to the Apache Spark movement. This contribution will serve to promote open source innovation and accelerate intelligence into every application. We are proud to be partnering with Databricks to put this innovation to work in the community.

3) IBM will host Spark on our developer cloud, IBM Bluemix, offering a hosted service and system architectures, as well as the tools that surround the core technology to make it easier to consume. Our approach is to accelerate Spark adoption.

4) IBM will deliver software offerings and solutions built on Spark, provide infrastructure such as IBM Power and z Systems to host Spark applications, and offer consulting services to help clients build and deploy Spark applications.


IBM is already adopting Spark throughout our business: IBM BigInsights for Apache Hadoop, a Spark service, InfoSphere Streams, DataWorks, and a number of places in IBM Commerce. Too many to list. And IBM Research currently has over 30 active Spark projects that address technology underneath, inside, and on top of Apache Spark.

Our own analytics platform is designed with just this sort of environment in mind: it easily blends these new technologies and solutions into existing architectures for innovation and outcomes. The IBM Analytics platform is ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.

Our strategy is about building on top of and around a successful open platform, and adding something of our own that’s substantial and differentiated. Spark is that platform. We are just at the start of building many solutions that leverage Spark to the advantage of our clients, users, and the developer community.

***

IBM is now, and has historically been, a significant force supporting open source innovation and collaboration, including a more than $1 billion investment in Linux development. We collaborate on more than 120 projects contributed to the open source community, including Eclipse, Hadoop, Apache Spark, Apache Derby, and Apache Geronimo. IBM is also contributing to Apache Tuscany and Apache Harmony. In terms of code contributions, IBM has contributed 12.5 million lines of code to Eclipse alone, not to mention Linux, where 6.3 percent of total contributions come from IBM. We’ve also contributed code to Geronimo and a wide variety of other open-source projects.

We see in Spark the opportunity to benefit data engineers, data scientists, and application developers by driving significant innovation into the community. As these data practitioners benefit from Spark, the innovation will make its way into business applications, as evidenced in the Genomics, Traffic Planning, and Political Analysis solutions mentioned above. Spark is about delivering the analytics operating system of the future—an analytics operating system on which new solutions will thrive, unlocking the big data scale effect. And Spark is about a community of Spark-savvy data scientists and data analysts who can quickly transform today's problems into tomorrow's solutions. Spark is one of the fastest-growing open source projects in history. We are pleased to be part of the movement.

Technical Leadership

As companies grow and mature, it is difficult to maintain the pace of innovation that existed in the early days. This is why many companies, as they mature (think of the Fortune 500), sometimes lose their innovation edge. The edge is lost when technical leadership in the company either takes a backseat or evolves into a role different from the one it played in the early days. I see a number of companies where, over time, the technical managers give way to "personnel" or "process" managers, which tends to be a death knell for innovation.

Great technical leaders provide a) team support and motivation, b) technical excellence, and c) innovation. Said another way, they lead through their actions and thought leadership.

***

As I look at large organizations today, I believe that technical leaders fall into three types (this is just my framework for characterizing what I see).

The Ambassador
A technical leader of this type brings broad insight and knowledge and typically spends a lot of time with the clients of the company. They guide clients in broad directional discussions and will often be part of laying out logical architectures and approaches. They are typically not as involved where the rubber hits the road (i.e., implementing architectures or influencing specific product roadmaps). Most of the artifacts from The Ambassador are in email, PowerPoint, and discussion (internally and with clients).

The Developer
A technical leader who is very deep, typically in a particular area. They know their user base intimately and use that knowledge to drive changes to the product roadmap. They are heavily involved in critical client situations, as they have the depth of knowledge to solve the toughest problems, and they make the client comfortable due to their immense knowledge. Most of the artifacts from The Developer are code in a product and a long resume of client problems solved and new innovations delivered in a particular area.

The Ninja
A technical leader who is deep, but broad as appropriate. They integrate across capabilities and products to drive toward a market need. They have a 'build first' mentality, or what I call a 'hacker mentality'. They would rather hack up a functional prototype in 45 days than produce a single slide of PowerPoint. Their success is defined by their ability to introduce a new order of things. They thrive on user feedback and iterate quickly as they hear from users. Said another way, they build products like a start-up would. Brian, profiled here, is a great example of a Ninja. Think about the key attributes of Brian's approach:

1) Building a broad and varied network of relationships
2) Identifying 'strategy gaps'
3) Linking work to existing priorities
4) Working with an eye toward scale
5) Orchestrating milestones to build credibility

That's what Ninjas do.

***

Most large companies need Ambassadors, Developers, and Ninjas. They are all critical and they all have a role. But, the biggest gap tends to be in the Ninja category. A company cannot have too many, and typically does not have enough.

Why Actuaries Will Become Data Scientists

Insurance today takes many forms: life insurance, health insurance, property insurance, casualty insurance (liability insurance for negligent acts), marine insurance, and catastrophe insurance (covering perils such as earthquakes, floods, windstorms, and terrorism). You name it, you can probably insure it. Prior to the insurance innovation that led to the varieties just mentioned, a branch of management science was established that served as the root of insurance: actuarial science. Actuaries are in the business of assessing risk and uncertainty. Said a different way, they value and assess the financial impact of a variety of risks. But that is much easier said than done. A variety of inputs provide the information an actuary needs for better decision-making.

The Institute and Faculty of Actuaries (IFoA) is the only professional organization in the United Kingdom dedicated to educating, regulating, and generally advocating for actuaries worldwide. If Ben Franklin is the father of insurance, Chris Lewin, associated with IFoA, is probably the original actuarial historian. He has published regularly on the topic and is well known in the community for his significant contributions. Lewin states, “An actuary looks at historical data, and then makes appropriate adjustments (subjective, of course).” One of the primary skills of an actuary, therefore, is to make estimates based on the best information available. Even if an actuary uses data to develop an informed judgment, that type of estimate does not seem sufficient in today’s era of big data. There is something about modern-day insurance that has led the industry to believe that informed judgment is good enough. As the quantity and quality of data improves, it will be possible to calculate increasingly accurate estimates based directly on information, negating the need for human judgment and associated biases.

***

The data era has already begun to spark a new wave of innovation in insurance. Rich access to a variety of data assets, coupled with the ability to analyze and act, enables processes that were not previously possible. This will usher in the era of dynamic risk management and improved approaches for modeling catastrophe risk in the insurance industry. In the case of automobile insurance, the industry commonly refers to this approach as Usage-Based Insurance. There are two types of policies under this umbrella: Pay-As-You-Drive (PAYD) and Pay-How-You-Drive (PHYD). However, dynamic risk management can apply well beyond the scope of driving and automobile insurance.

Dynamic risk management is an accelerated form of actuarial science. Recall that actuarial science is about collecting all pertinent data, using models and expertise to factor risk, and then making a decision. Dynamic risk management entails real-time decision-making based on a stream of data. Let’s explore the two models with an example of car insurance for a 22-year-old woman:

Actuarial insurance: Collect all the data available for the 22-year-old: her driving history, vehicle type, location, criminal history, and so on. Merge that data with demographic data for her age, gender, location, and work status. Leverage methods like probability, mortality tables, and compound interest to estimate benefits and obligations. Then offer her a policy based on these factors.

Dynamic risk management: Install a sensor in her car and tell her to go about her normal life. Collect mileage, the times of day she drives, how far she drives, acceleration and deceleration, and the locations she drives to. While she is driving, monitor the motion of the vehicle. Said another way, this is an on-board monitor, constantly pricing her insurance policy based on her personal driving behavior. If she drives well, her next premium may be lower. The policy is tailor-made for her and is based on actual data, as opposed to estimates.
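
To make the contrast concrete, here is a deliberately simplified, hypothetical pricing rule expressed in code. The feature names, weights, and base premium are invented for illustration; a real insurer would fit these adjustments with statistical or machine learning models rather than hand-picked constants, but the shape of the computation, a premium recomputed each period from observed behavior, is the same.

```scala
object DynamicPremium {
  // Hypothetical telemetry summary for one billing period; field names are invented.
  case class DrivingSummary(miles: Double, nightMiles: Double,
                            hardBrakesPer100Mi: Double, mphOverLimitAvg: Double)

  // A made-up pricing rule: start from a base premium and adjust for usage and behavior.
  def monthlyPremium(base: Double, s: DrivingSummary): Double = {
    val usage    = 1.0 + (s.miles / 1000.0) * 0.05                     // more miles, more exposure
    val night    = 1.0 + (s.nightMiles / math.max(s.miles, 1.0)) * 0.2 // share of night driving
    val braking  = 1.0 + s.hardBrakesPer100Mi * 0.02                   // harsh braking events
    val speeding = 1.0 + s.mphOverLimitAvg * 0.01                      // average speed over the limit
    base * usage * night * braking * speeding
  }

  def main(args: Array[String]): Unit = {
    val careful = DrivingSummary(miles = 400, nightMiles = 20, hardBrakesPer100Mi = 1, mphOverLimitAvg = 0)
    val risky   = DrivingSummary(miles = 900, nightMiles = 300, hardBrakesPer100Mi = 8, mphOverLimitAvg = 6)
    println(f"Careful month: $$${monthlyPremium(80.0, careful)}%.2f") // lower premium
    println(f"Risky month:   $$${monthlyPremium(80.0, risky)}%.2f")   // higher premium
  }
}
```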

There is now increased momentum behind dynamic risk management. In March 2011, the European Court of Justice ruled that “taking the gender of the insured individual into account as a risk factor in insurance contracts constitutes discrimination.” Since December 2012, insurers operating in Europe have not been able to charge different premiums on the basis of an insured person’s gender. There are good reasons why insurers might want to use gender as a means of quantifying risk. Men under the age of 30 are almost twice as likely to be involved in a car accident as their female counterparts. Insurers also have empirical evidence showing that the claims they receive for young men are over three times as large as those for women.

Arguments around gender equality have rightly determined that it is unfair to blindly discriminate against young men. Furthermore, this debate has highlighted the need for more appropriate metrics for forecasting risk than the blunt use of gender. This gap in the market calls for better models and dynamic risk management based on the actual driving ability of the individual. Unfortunately, in the meantime, we are all paying the price, as car insurers have increased their premiums across the board.

Currently, many of the large insurance carriers offer some version of dynamic risk management, or pay-as-you-drive insurance, for automobiles: Progressive, Allstate, State Farm, Travelers, Esurance, The Hartford, Safeco, and GMAC, to name a few. Most of these insurers advertise that premiums will cost 20 to 50 percent less for consumers who adopt this approach. The National Association of Insurance Commissioners estimates that 20 percent of insurance plans will have a dynamic approach by 2018. For the moment, dynamic insurance for automobiles represents less than one percent of the market.

Dynamic risk management can apply to any data-centric insurance process, whether a company is leveraging telematics or data points about a consumer in a lending scenario. In the Big Data era, dynamic risk management will become routine.

Insurers could make themselves more popular by recognizing that dynamic risk management could become a means of encouraging behavior change. Rather than offering non-negotiable premiums based on coarse models, insurers could use big data to assess individual risk and urge customers to behave more responsibly. In this way, insurance could provide a price signal to nudge customers toward a lower-risk lifestyle. Insurers such as PruHealth have a healthy-living rewards program, known as Vitality, which gives points for healthy activities such as regular gym attendance and not smoking. Points earned from the rewards program can be redeemed for lifestyle rewards such as cinema tickets or gift certificates.

***

Big data has the potential to create sophisticated risk models that are focused on individuals, extremely accurate, and capable of being updated in real time. This is bad news for those hoping to use insurance as a means to justify excessive risk-taking, but it is good news for those who want to be rewarded for managing risk more effectively. As more and more individuals opt for dynamic risk management, society will benefit from safer roads and smaller healthcare bills.

This post is adapted from the book Big Data Revolution: What farmers, doctors, and insurance agents teach us about discovering big data patterns (Wiley, 2015).