A Practical Guide to Machine Learning: Understand, Differentiate, and Apply

Co-authored by Jean-Francois Puget (@JFPuget)

Machine Learning represents the new frontier in analytics, and is the answer to how companies can capitalize on the data opportunity. Machine Learning was first defined by Arthur Samuel in 1959 as a “Field of study that gives computers the ability to learn without being explicitly programmed.” Said another way, this is the automation of analytics, so that it can be applied at scale. What is highly manual today (think of an analyst combing through thousand-line spreadsheets) becomes automatic tomorrow (an easy button) through technology. If Machine Learning was first defined in 1959, why is now the time to seize the opportunity? It’s the economics.

A relative graphic to explain:


Since the time that Machine Learning was defined and through the last decade, the application of Machine Learning was limited by the cost of compute and data acquisition/preparation. In fact, compute and data consumed the entirety of any budget for analytics, which left zero investment for the real value driver: algorithms to drive actionable insights. In the last couple of years, with the cost of compute and data plummeting, machine learning is now available to anyone, for rapid application and exploitation.

***

It is well known that businesses must constantly adapt to changing conditions: competitors introduce new offerings, consumer habits evolve, and the economic and political environment changes. This is not new, but the velocity at which business conditions change is accelerating. This constantly accelerating pace of change places a new burden on technology solutions developed for a business.

Over the years, application developers moved from V-shaped projects, with multi-year turnaround, to agile development methodologies (turnaround in months, weeks, and often days). This has enabled businesses to adapt their applications and services much more rapidly. For example:

a) A sales forecasting system for a retailer: The forecast must take into account today's market trends, not just those from last month. And, for real-time personalization, it must account for what happened as recently as an hour ago.

b) A product recommendation system for a stock broker: It must leverage current interests, trends, and movements, not just last month's.

c) A personalized healthcare system: Offerings must be tailored to an individual and their unique circumstances. Healthcare devices, connected via the Internet of Things (IoT), can be used to collect data on human and machine behavior and interaction.

These scenarios, and others like them, create a unique opportunity for machine learning. Indeed, machine learning was designed to address the fluid nature of these problems.

Firstly, it moves application development from programming to training: instead of writing new code, the application developer trains the same application with new data. This is a fundamental shift in application development, because new, updated applications can be obtained automatically on a weekly, if not daily basis. This shift is at the core of the cognitive era in IT.

Secondly, machine learning enables the automated production of actionable insights where the data is (i.e. where business value is greatest). It is possible to build machine learning systems that learn from each user interaction, or from new data collected by an IoT device. These systems then produce output that takes into account the latest available data. This would not be possible with traditional IT development, even if agile methodologies were used.

***

While most companies get to the point of understanding machine learning, too few are turning this into action. They are either slowed down by concerns over their data assets or they attempt it once and then curtail efforts, claiming that the results were not interesting. These are common concerns and considerations, but they should be recognized as items that are easily surmounted with the right approach.

First, let’s take data. A common trap is to believe that data is all that is needed for a successful machine learning project. Data is essential, but machine learning requires more than data. Machine learning projects that start with a large amount of data, but lack a clear business goal or outcome, are likely to fail. Projects that start with little or no data, yet have a clear and measurable business goal, are more likely to succeed. The business goal should dictate the collection of relevant data and also guide the development of machine learning models. This approach provides a mechanism for assessing the effectiveness of machine learning models.

The second trap in machine learning projects is to view them as a one-time event. Machine learning, by definition, is a continuous process, and projects must be operated with that in mind.

Machine learning projects are often run as follows:

1) They start with data and a new business goal.

2) Data is prepared, because it wasn’t collected with the new business goal in mind.

3) Once prepared, machine learning algorithms are run on the data in order to produce a model.

4) The model is then evaluated on new, unseen data to see whether it captured something sensible from the data. If it did, then it is deployed in a production environment where it is used to make predictions on new data.
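To make the four steps above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the synthetic transaction data, the business goal (flag risky transactions by amount), and the one-feature threshold "model" are hypothetical stand-ins, not a real fraud system or any particular library's API.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Step 1: data plus a business goal (hypothetical: flag risky transactions).
# Each record is (amount, is_fraud); the numbers are synthetic.
def make_records(n):
    records = []
    for _ in range(n):
        is_fraud = rng.random() < 0.2
        amount = rng.gauss(900, 150) if is_fraud else rng.gauss(100, 50)
        records.append((amount, is_fraud))
    return records

records = make_records(1000)

# Step 2: prepare the data (here, simply drop rows with invalid amounts).
prepared = [(amount, label) for amount, label in records if amount > 0]

# Step 3: run a learning algorithm on part of the data to produce a model.
# The "model" is a single threshold halfway between the two class means.
split = int(0.8 * len(prepared))
train, holdout = prepared[:split], prepared[split:]

def train_threshold(rows):
    fraud = [amount for amount, label in rows if label]
    legit = [amount for amount, label in rows if not label]
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2

# Step 4: evaluate the model on data it has not seen before deploying it.
def accuracy(threshold, rows):
    correct = sum((amount > threshold) == label for amount, label in rows)
    return correct / len(rows)

threshold = train_threshold(train)
holdout_accuracy = accuracy(threshold, holdout)
print(f"threshold={threshold:.0f}, holdout accuracy={holdout_accuracy:.2%}")
```

The point of the sketch is the shape of the workflow, not the model: the holdout rows play the role of the new, unseen data in step 4.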

While this typical approach is valuable, it is limited by the fact that the models learn only once. While you may have developed a great model, changing business conditions may make it irrelevant. For instance, assume machine learning is used to detect anomalies in credit card transactions. The model is created using years of past transactions, and the anomalies are fraudulent transactions. With a good data science team and the right algorithms, it is possible to obtain a fairly accurate model. This model can then be deployed in a payment system where it flags anomalies when it detects them. Transactions with anomalies are then rejected. This is effective in the short term, but clever criminals will soon recognize that their scam is detected. They will adapt, and they will find new ways to use stolen credit card information. The model will not detect these new ways because they were not present in the data that was used to produce it. As a result, the model's effectiveness will drop.

The way to avoid this performance degradation is to monitor the effectiveness of model predictions by comparing them with actuals. For instance, after some delay, a bank will know which transactions were fraudulent and which were not. Then it is possible to compare the actual fraudulent transactions with the anomalies detected by the machine learning model. From this comparison one can compute the accuracy of the predictions. One can then monitor this accuracy over time and watch for drops. When a drop happens, it is time to refresh the machine learning model with more up-to-date data. This is what we call a feedback loop. See here:


With a feedback loop, the system learns continuously by monitoring the effectiveness of predictions and retraining when needed. Monitoring and using the resulting feedback are at the core of machine learning. This is no different from how humans perform a new task: we learn from our mistakes, adjust, and act.
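The monitoring half of a feedback loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production monitor: the weekly batches, the baseline accuracy of 0.95, and the allowed drop of 0.1 are all hypothetical values chosen for the example, and the helper names are made up here.

```python
def accuracy(predictions, actuals):
    """Share of predictions that matched what was later confirmed."""
    matches = sum(p == a for p, a in zip(predictions, actuals))
    return matches / len(actuals)

def needs_refresh(accuracy_history, baseline, max_drop=0.1):
    """True when the latest accuracy fell more than max_drop below baseline."""
    return bool(accuracy_history) and baseline - accuracy_history[-1] > max_drop

# Hypothetical weekly batches: (model flags, fraud confirmed later by the bank).
# 1 means fraudulent, 0 means legitimate.
weekly_batches = [
    ([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]),
    ([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]),
    ([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]),
]

baseline = 0.95  # accuracy measured when the model was deployed (assumed)
history = []
refresh_weeks = []
for week, (predicted, actual) in enumerate(weekly_batches, start=1):
    history.append(accuracy(predicted, actual))
    if needs_refresh(history, baseline):
        # In a real system this is where retraining on recent data would start.
        refresh_weeks.append(week)

print(history)        # weekly accuracy: [1.0, 0.9, 0.7]
print(refresh_weeks)  # weeks that triggered a refresh: [3]
```

In the third week the criminals in this toy data have changed tactics, the model misses three of four frauds, and the drop triggers a refresh. With rare events such as fraud, a team might well monitor a metric other than plain accuracy; the loop itself stays the same.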

***

Companies that are convinced that machine learning should be a core component of their analytics journey need a tested and repeatable model: a methodology. Our experience working with countless clients has led us to devise a methodology that we call DataFirst. It is a step-by-step approach for machine learning success.



Phase 1: The Data Assessment
The objective is to understand your data assets and verify that all the data needed to meet the business goal for machine learning is available. If not, you can take action at that point, to bring in new sources of data (internal or external), to align with the stated goal.


Phase 2: The Workshop
The purpose of the workshop is to ensure alignment on the definition and scope of the machine learning project. We usually cover these topics:
- Level set on what machine learning can and cannot do.
- Agree on which data to use.
- Agree on the metric to be used for evaluating results.
- Explore how the machine learning workflow, especially deployment and the feedback loop, would integrate with other IT systems and applications.


Phase 3: The Prototype
The prototype aims to show the value of machine learning with actual data. It is also used to assess the performance and resources needed to run and operate a production-ready machine learning system. When completed, the prototype is often key to securing a decision to develop a production-ready system.

***

Leaders in the Data era will leverage their assets to develop superior machine learning and insight, driven from a dynamic corpus of data. A differentiated approach requires a methodical process and a focus on differentiation with a feedback loop. In the modern business environment, data is no longer an aspect of competitive advantage; it is the basis of competitive advantage.


iPad Pro: Going All-in






Here is my tweet from a few weeks back:


I have given it a go, going all-in with the iPad Pro. In short, I believe I have discovered the future of personal computing. That being said, in order to do this, you truly have to change the way you work; how you spend your time, how you communicate, etc. But, it's worth it and will probably make you a better professional. I knew I was hooked, when I had to go back to my MacBook for something and I started touching the screen; the touch interface had been ingrained in my work.

Here are my quick observations:

1) The speed of the iPad Pro is unbelievable. While I didn't realize this in advance, this fact alone makes up for a lot of the reasons why I could never move to an iPad before.

2) You have to master multitasking in the iPad Pro in order to make the switch. There are a lot of shortcuts on the screen, keyboard shortcuts, and hand gestures. If you are not using them, you will not understand the advantage of this form factor.

3) Keyboard shortcuts are now available for my corporate mail. That's a big time saver.

4) I never have to worry about a power cable. The battery on this is great, but even if it gets low, nearly everyone I know has a compatible charger.

5) The integration of apps on the Pro is tremendous: Box/Office, Slack, etc.

6) It goes without saying that the Pro is super light and convenient for travel.

7) Here are some things I can't do on the iPad Pro:
- Renew Global Entry
- Corporate workflow (forms and expenses)
- Blogging (writing is easy, but posting to a corporate blog or even Blogger is very hard). I'm not sure why there is not a good app for this.

8) I got the smaller version of the iPad Pro. I thought the large one was just too big. It seems like the ideal size may be a size in between the two.

In short, after a few weeks, I highly recommend it. You can make the switch, but you'll likely need a laptop once a week or so, for some of the items mentioned above. I haven't really gotten into the Apple Pencil yet. I've used it a couple of times and may try it more over the next couple of weeks.

Data Science is a Team Sport



In 2013, Ron Howard directed and released the movie Rush, a film that captured the rivalry between James Hunt and Niki Lauda during the 1976 Formula One racing season. It’s a vivid portrait of the drivers and their personalities: a pretty typical, if captivating, focus on the drivers as heroes of the race. But it does something deeper and more interesting as well. The film looks into the essence of Formula One: a true team sport.

“Formula” in Formula One refers to the set of rules to which all participants' cars must conform. Formula One rules were agreed upon in 1946, on the heels of World War II. Modern Formula One cars are open cockpit, single-seat vehicles. The cornering speed of a car comes from “wings” mounted at the front and rear of the vehicle. The tires also play a major role in the cornering speed of a car. Carbon disc brakes are used to increase performance. Engines have evolved to turbocharged V6s. All these components are integrated to provide precision and performance, and to win the race. However, the precision and design of the vehicle are useless without the right team.

In Formula One, an “entrant” is the person who registers a car and driver for the race, and maintains the vehicle. The “constructor” is the person who builds the engine or chassis and owns the intellectual rights to the design. The “pit crew” is the team that prepares and maintains the vehicle before, during, and after the race. The cameras focus on the driver, with a couple of obligatory shots of the pit crew scrambling to change tires. But the real story is the collaboration of the complete team: experts working together to make the difference between success and failure.

***

Since the turn of the century, enterprises around the world have been on a journey to master data science and analytics. We have fewer camera crews, and no cool uniforms, but the goal is no less difficult to achieve. Said simply, we want the right information, at the right moment, to make better decisions. Despite years of effort, organizations have achieved inconsistent results. Some are building competitive moats with machine learning on a large corpus of data, but others are only reducing their costs by 3%, using some new tools. This is best viewed on an enterprise maturity curve:


Why are some organizations able to achieve differentiated results, while others struggle to set up a Hadoop cluster?

***

Spark is the Analytics Operating System for the modern enterprise. Anyone using data, starting right now, will be leveraging Spark. Spark enables universal access to data in an organization.

Today, we are announcing the Data Science Experience, the first enterprise app available for the Analytics Operating System. This is the first integrated development environment for real-time, high performance Analytics, designed to blend emerging data technologies and machine learning into existing architectures.

An IDE for data science is a collaborative environment; it brings data scientists together to make data science and machine learning available to everyone. Today, data science is an individual sport. If you are a data scientist at a retailer, for example, you have to choose your own tool or flavor, work on your own, and, with any luck, you produce a meaningful insight. Anything you learn stays with you; it’s self-contained, because it is built in your own lingua franca.

Now, with the Data Science Experience, you can use any language you want—R, Python, Scala, etc.—and share your models with other data scientists in your organization.

We have made data science a team sport.

In Formula One parlance, Spark is the chassis, holding everything together. The Data Science Experience (the IDE) is the integrated components, acting as one, to drive precision and performance. And the data science discipline now has a driver, a pit crew, a constructor, and a coach, that incredible vehicle whose sum is greater than its parts: a team.

The Data Science Experience is born on the cloud. It adapts to open source innovation. And the Data Science Experience grows stronger as more and more data scientists around the globe create solutions based on Spark. Further, the ecosystem for the Data Science Experience is open and available. We are proud to have partners like H2O, RStudio, Lightbend, and Galvanize, to name a few.

With Data Science Experience, the discipline of data science can now accomplish exponentially greater outcomes. It’s the difference between a shiny car sitting in a garage, and crossing the finish line at 230 miles per hour.

***

IBM is building the next generation analytics platform in the cloud.

1. It started with our investment in Apache Spark as the Analytics O/S, last year.
2. It continues today, as we launch the first IDE for this new way of thinking about data & analytics.
3. Over time, this will evolve as the platform for an enterprise in the data era.

All of this is enabled by Spark.

***

In June 2015, we announced IBM’s commitment to Apache Spark. In closing, I want to provide some context on our progress in the last year. If you missed it last year, here is why I believe Spark will be a critical force in technology, on the same scale as Linux.

So, what have we accomplished and where are we going?

1) We continue to expand the Spark Technology Center (STC). We opened an STC in India. We continue to hire aggressively. And, later this year, we will move into our new home on Howard St. in San Francisco.

2) Client traction and response has been phenomenal. We have 40+ client references already and more on the way.

3) We have open sourced SystemML as promised and we are working on it with the community, in the open. This contribution is over 100,000 lines of code. SystemML was accepted into Apache as an official Incubator project as of November 2015. Since it was open-sourced, 859 contributions have been made to the project (e.g. a build-out of the Spark backend, API improvements, usability with Scala Spark & PySpark notebooks for data science, and experimental work into deep learning).

4) For Spark 1.6.x, a total of 29 team members contributed to the release (26 of them from the STC), and each contributing engineer is a credited contributor in the release notes of Spark 1.6.x. For Spark 2.0, 31 STC developers have contributed thus far, and that work is still in progress.

5) Our Spark-specific JIRAs account for almost 25,000 lines of code. You can watch them in action here. Much of our focus has been on SQL, MLlib, and PySpark.

6) We launched the Open Source Analytics Ecosystem and are working closely with partners like Databricks, Lightbend, RStudio, H2O, and many others. We welcome all.

7) We have trained ~400,000 data scientists through a number of forums, including BigDataUniversity.com.

8) Adoption of the Spark Service on IBM Cloud continues to grow exponentially, as users seek access to the Analytics Operating System.

9) We have over 30 IBM products that are leveraging Spark and many more in the pipeline.

10) We launched a Spark newsletter. Anyone can subscribe here.

11) Lastly, we have launched a Spark Advisory Council. Over 25 leading enterprises and partners — Spark experts building new companies and established industry leaders building new platforms — participate in this regular dialogue about their experiences with Spark and the direction of the Spark project. We use this thinking to focus our efforts in the Spark Technology Center. All are welcome. Contact us here if you are interested.

***

Data Science is a team sport. Spark is the enabler. This is why I stated last year that anyone using data will be leveraging Spark in the future. That future is quickly arriving.

Winning in Formula One is about speed, performance, precision, and collaboration. Those that find the winner’s circle have found a way to integrate the components (human and material) to act as ONE. The same opportunity exists in Analytics and Data Science. Let’s make data science a team sport. Welcome to the first enterprise app for the Analytics Operating System: The Data Science Experience.

The Fortress Cloud


In 1066, William of Normandy assembled an army of over 7,000 men and a fleet of over 700 ships to defeat England's King Harold Godwinson and secure the English throne. King William, recognizing his susceptibility to attack, immediately constructed a network of castles to preserve his kingdom and improve his status among followers. 

The word 'castle' comes from the Latin word castellum, which means 'fortress.' While Medieval castles evolved in structure and function through the years, their core roles have not changed:

1. To protect, as a defensive measure.
2. To provide a platform to wage battle, as an offensive measure.
3. To ensure orderly governance.

Medieval castles were well planned in terms of their location and several key attributes. They were built near or on a water spring. They had direct access to key transportation routes and were built on high ground to make defending the stronghold a bit easier. 

I have written extensively about The Big Data Revolution, researching how digital technologies and data exploitation are impacting industries in the Data era. While every industry is different, there are clear patterns in how data is reinventing business processes and disrupting traditional business models. Most notable is that the Revolution cannot be effectively waged without the right protection, foundation for an offensive, and orderly governance. We need a modern day castle for The Big Data Revolution; a fortress cloud.

***

The first wave of big data has hit, creating great opportunities, but also exposing cracks in company security, raising worries about customer data privacy, and showing the limitations of current analytics. Perhaps the Big Data Maturity curve captures it best:


Most of the investments to date have been focused on cost reduction and extending existing IT capabilities. We are now entering an era that will be marked by business re-invention on the basis of data. Incumbents beware. This demands a thoughtful approach on security measures companies may have to take, how improved analytics can help all achieve stronger insights, and how consumers are demanding a new privacy contract. 

The traditional IT stack is giving way to a fluid data layer: a new set of composable cloud services, defined by next generation capabilities. With this new approach to analytics, we must re-imagine all aspects of data movement and governance for that world. I see 3 defining capabilities:

1. Ingest: The ability to lift data from wherever it resides and integrate it into a cloud-based fluid data layer. This must be done seamlessly and at incredibly high speeds, with little to no manual intervention.
2. Preparation: The ability to massage, filter, and select only the data most relevant to the task at hand.
3. Governance: The ability to catalog, describe (metadata), and manage access to sensitive data sets.
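As a rough illustration of how these three capabilities fit together, here is a toy Python sketch. It is not a description of any product: the in-memory `catalog`, the `CatalogEntry` fields, the role names, and the `ingest`, `prepare`, and `read` helpers are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Metadata describing one data set in the shared data layer."""
    name: str
    description: str
    sensitive: bool
    allowed_roles: set = field(default_factory=set)
    rows: list = field(default_factory=list)

catalog = {}  # stand-in for the cloud-based data layer

def ingest(name, description, rows, sensitive=False, allowed_roles=()):
    """Ingest: land rows from any source and register them in the catalog."""
    catalog[name] = CatalogEntry(
        name, description, sensitive, set(allowed_roles), list(rows)
    )

def prepare(name, keep):
    """Preparation: select only the rows relevant to the task at hand."""
    return [row for row in catalog[name].rows if keep(row)]

def read(name, role, keep=lambda row: True):
    """Governance: check access to sensitive data sets before returning rows."""
    entry = catalog[name]
    if entry.sensitive and role not in entry.allowed_roles:
        raise PermissionError(f"role {role!r} may not read {name!r}")
    return prepare(name, keep)

ingest(
    "payments",
    "Card payments, one row per transaction",
    [{"amount": 20, "country": "FR"}, {"amount": 950, "country": "US"}],
    sensitive=True,
    allowed_roles={"data_steward"},
)
large = read("payments", "data_steward", keep=lambda row: row["amount"] > 500)
print(large)  # [{'amount': 950, 'country': 'US'}]
```

A caller with a role outside `allowed_roles` gets a `PermissionError` instead of data, which is the governance check doing its job at the point of access rather than as an afterthought.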

Companies will require a new approach to data integration, data preparation, data governance, and data pipelining; a modern day fortress, on the cloud, ready for The Big Data Revolution.

***

The Basel Committee on Banking Supervision (BCBS) announced regulation 239 in January 2013. For many institutions, this immediately put them on the defensive. However, Sun Tzu reminds us, "Security against defeat implies defensive tactics; ability to defeat the enemy means taking the offensive." BCBS 239, for the data-era organizations, represents an opportunity for an offensive.

For those less familiar, the principles of BCBS 239 center on governance, data and IT architecture, accuracy, timeliness, and completeness in reporting, when it comes to an organization's data assets and processes. While these may appear to be defensive measures, the endgame is a platform from which to wage battle: a true governance offensive.

With the right data architecture, established on the cloud, a new set of opportunities emerges for an enterprise that embraces governance as an offensive measure. An enterprise will find itself with a castle for the Data era, armed with key offensive weapons:

a) Self-Service: designed to empower the citizen analyst, data engineer, and data steward to engage of their own accord. A user does not need to ask for access to data; they simply engage and discover.

b) Hybrid: taps into data everywhere...ground to cloud. Where the data resides does not matter to the consumer/user; it’s just data.

c) Intelligent: machine learning makes everyone superhuman and automates many manual processes.

d) All Data: works with both structured and unstructured data.

These are the principles that have guided the construction of our fortress destination on the cloud, both private and public.

***

When William of Normandy assembled his fortress many years ago, he adorned his castles with a number of attributes: towers, curtain walls, moats, drawbridges, portcullises, etc. All were a set of best practices designed for defensive protection, coupled with a base from which to wage an offensive. It was modern protection, for an unmodern time.

Our fortress cloud, like that of William of Normandy, is designed around governance as a strategic lever: offensive and defensive. It’s a unique destination, for our modern time.


------

Special thanks to @danhernandezATX for editing and guidance.

Machine Learning and The Big Data Revolution


I had the opportunity to speak at TDWI in Chicago today. It was a tremendous venue and a well organized event. Thanks to the TDWI team. I spoke on the topic of machine learning and the big data revolution. The slides are below, although they are not all self-explanatory.

3 key points from the talk:

Scale Effects
In the 20th century, scale effects in business were largely driven by breadth and distribution. A company with manufacturing operations around the world had an inherent cost and distribution advantage, leading to more competitive products. A retailer with a global base of stores had a distribution advantage that could not be matched by a smaller company. These scale effects drove competitive advantage for decades. The Internet changed all of that.

In the modern era, there are three predominant scale effects:

- Network: lock-in that is driven by a loyal network (Facebook, Twitter, Etsy, etc.)
- Economies of Scale: lower unit cost, driven by volume (Apple, TSMC, etc.)
- Data: superior machine learning and insight, driven from a dynamic corpus of data


The Big Data Maturity curve

This is the barometer for any enterprise seeking competitive advantage based on data. Many companies are beginning to utilize new techniques to reduce the cost of data infrastructure. But the competitive breakthrough comes when an enterprise moves to the right side of the curve: line-of-business analytics to transform operations, and new business imperatives and business models. I alluded to a number of companies that I admire for leading on this side of the curve: CoStar, StitchFix, and Monsanto.


AnalyticsFirst

AnalyticsFirst is a proven and repeatable methodology for applying the value of data science and machine learning in the context of an enterprise. With thousands of successful engagements, we have learned a lot about what works (and what does not). I've seen companies achieve major breakthroughs leveraging this methodology, often ending months or years of frustration. Any organization can lead the revolution with AnalyticsFirst. Let me know if you are interested!