How do you combine historical Big Data with machine learning for real-time analytics? An approach is outlined with different software vendors, business use cases, and best practices.
Big Data has gained a lot of momentum recently. Vast amounts of operational data are collected and stored in Handoop and other platforms on which historical analysis is conducted. Business intelligence tools and distributed statistical computing are used to find new patterns in this data and gain new insights and knowledge for a variety of use cases: promotions, up- and cross-sell campaigns, improved customer experience, or fraud detection.
An event processing approach known as "fast data" automates decisions and initiates actions in real-time, based on statistical insights from Big Data platforms. This article explains how to achieve a closed loop for real-time analytics with Big Data and machine learning and analytic models, and event-processing engines.
One of the key challenges in such environments is to quickly turn these newfound insights and patterns into action while processing operational business data in real time. This is necessary to keep customers happy, increase revenue, optimize margin, or prevent fraud when it matters most.
Use Cases for Real-Time Event Processing:
Before we take a look into technologies, let's think about some use cases where it is important to analyze and act on critical business moments while the data is still in motion. The window of opportunity might be a few seconds, minutes or even hours. The key is to be proactive as automatically or via human interaction based on predictions, which are calculated in real time by an analytic model
Optimized pricing: Offer a product, hotel or flight for the best price, i.e. to make most revenue. It should be lower than the competition, but also not too low. If it is sold for too low a price, you create less revenue.
Fraud detection: Decide if a payment is accepted or rejected, e.g. in a store or online sales, before the fraud has happened.
Cross selling: Send an offer (e.g. via push message to the mobile device) while the customer is still in the store. If the customer has left the store, he will spend his money somewhere else.
Rerouting transportation: Act on logistic events that are not foreseeable hours before - for example, traffic congestion due to weather or accidents.
Customer service: Recommend the best solution for a customer while he is in the line, solve a problem in the appropriate way based on the customer's history, knowledge and loyalty status.
Proactive maintenance: Stop and replace a part before the machine will (probably) break. This can save a lot of money and efforts.
Restock inventory: Ship inventory to specific shops based on expectations, external influences, and customer behaviour.
Analytics maturity model:
The closed loop is necessary to find insights and pattern in historical data, build analytic models for prediction, and execute the analytic models on new events automatically. The new events are also stored for future historical analysis and optimizations of the analytic model and re-deployment.
The following sections describe the three phases of the analytics maturity model in more detail:
Phase 1: Data Discovery: Using visualization by the business user.
Phase 2: Predictive And Prescriptive Analytics: Using machine learning by data scientist.
Phase 3: Operationalization And Automation: Using event processing by the developer.
1.Data Discovery Using Visualization:
"The greatest value of a picture is when it forces us to notice what we never expected to see" is what John W. Tukey said in 1977, long before the era of Big Data analytics. The questions businesses should ask in the first phase are: What happened? Why did it happen?
The business user has to follow these steps:
Data acquisition: Integrate relevant data sources. These can be CSV files, a relational database, a data warehouse, or Big Data storage such as a Hadoop cluster. Using these data sources should be very easy via graphical out-of-the-box connectors to let businesses focus on discovering business insights.
Data preparation: Usually, you have to merge different data sets to get valuable insights. The real insights are found by (often unexpected) correlations of different data. The output of data preparation is often just flat files with rows and columns (like a simple CSV file, but often with very large data sets), These files can be used easily for extensive analysis.
Exploratory data analysis: The business user uses the integrated and prepared data to spot problems and new insights. Ease-of-use of the tool is key for success (e.g. using a recommendation engine that suggests specific visualizations for selected columns, or interactive brush-linking between different tables and charts).
"A picture is worth more a thousand tables" describes how relevant visualization is for humans. A business user is not technical expert or developer. He needs an easy-to-use tool for data discovery (whether Tableau, Qlik Sense, or our product, TIBCO Spotfire).
A data discovery tool might be enough for a business user to find problems, new insights or patterns in historical data. But for analyzing Big Data, he often needs help from a data scientist who can use machine learning algorithms to create analytic models.
2. Predictive and Prescriptive Analytics Using Machine Learning:
The data scientist focuses on the question: What will happen? He can create predictive analytic models that can be leveraged by the business user to find insights in Big Data stores.
Let's explain machine learning a bit before discussing the role of a data scientist. If you use the Internet, chances are very high that one of the websites you visit leverages machine learning under the hood. Every Google search, Facebook timeline, Netflix movie recommendation or Amazon product recommendation is calculated based on analytic models to show the best context-specific, personalized information to you.
Machine learning is a method of data analysis that automates analytical model building. Think about spam detection, credit card fraud protection, product recommendation, or recognition of pictures or handwritten letters. In such cases, machine learning allows computers to find hidden insights without being explicitly programmed where to look. For that, it uses algorithms such as linear regression, random forest or k-means clustering to iteratively learn from data. The developer (data scientist) does not have to write everything in code. The code learns itself! This is necessary, as the developer cannot implement every rule or condition in the mentioned use cases.
All tech giants already use machine learning heavily. Other enterprises are getting started, and there is no way around using machine learning if an enterprise wants to be competitive in a few years. In the above analytics maturity model, the data scientist has to do the following steps in this second phase while working closely together with the business user:
Analytic model building: One size does not fit all! Many different machine learning algorithms are available, and the number is growing these days massively. A data scientist usually tries out different alternatives and repeats different approaches iteratively to find and create the best analytic model.
Analytic model validation: This is a key for success. Is the model really working well, also with new inputs? After training a machine learning algorithm initially with some historical data, you have to use another part of the historical data to validate the model. Afterwards, you can either improve the model by changing variables, formulas, or by changing the complete algorithm. Even if the model is good enough to deploy for real-time event processing, it is still revalidated later with new data to improve it further.
R Language: Well known as the most popular free and open-source programming language used by data scientists for modeling. It is developing very rapidly with a very active community. R comes from the academic world and was not build for enterprise production, scalability or high performance. Therefore, an enterprise product on top of R is often used instead of just deploying open-source R code for production with better performance. Another issue with R is the restrictive open-source GPL license, which might be a hindrance if you want to "resell some R code" within a product to other customers.
Python: The other widespread language for machine learning. It emphasizes ease-of-use, productivity and code reliability. Therefore, its syntax is "nicer" and easier to learn than R, but its primary focus - in contrary to R - is not statistics and predictive analytics. Rather, it is a multi-purpose language in which machine learning is just a small part.
Apache Spark: A general scalable data-processing framework, which includes machine learning, graph processing, SQL support and streaming features. However, the focus in most projects today is especially on analytics using its machine learning library, MLlib. It consists of common learning algorithms and utilities, as well as lower-level optimization primitives and higher-level pipeline APIs.
H2O: An extensible open-source platform for analytics. It leverages best of breed open source technology such as Hadoop, Spark, R, and others. Its goal is to offer massively scalable Big Data analytics with very high performance for real-time scoring.
Machine learning software:
Microsoft's Revolution Analytics:A statistical software company focused on developing open source and "open-core" versions (i.e. commercial add-ons and support) of the R language for enterprise, academic and analytics customers. Microsoft - who acquired it in 2015 - is rebranding the software and releasing new products and features.
TIBCO Spotfire + TERR (Enterprise Runtime for R): Rewritten R as a commercial compute engine to overcome the restrictive GPL license and to offer enterprise-grade scalability and performance. Can be used as single runtime, but is also integrated into TIBCOA's data discovery tool Spotfire.
KNIME: An open-source data analytics, reporting, and integration platform. It integrates various components for machine learning and data mining through its modular data-pipelining concept. A graphical user interface allows assembly of nodes for data preprocessing for modeling, data analysis and visualization. It is often combined with languages such as R or other tools such as TIBCO Spotfire.
SAS: Commercial vendor for building analytics models. One of the leaders in the market.
IBM SPSS: Another leading product in the analytics market, offered by the mega-vendor IBM.
Amazon Machine Learning: A managed Software-as-a-Service (SaaS) offering for building machine learning models and generating predictions. Integrated into the corresponding cloud ecosystem of Amazon Web Services. It's easy to use, but has a limited feature set and potential latency issues if combined with external data or applications.
Machine learning offerings are also available from many other cloud vendors, e.g. Microsoft Azure, IBM Watson Analytics, or HP Haven on Demand.
Business user + data scientist = smart!
The business user must be able to leverage the analytic models. Therefore, a data discovery tool has to be able to integrate different analytic models. The business user does not care which kind of algorithm or machine learning model is used under the hood. He just wants to use it to find new insights.
3: Operations and Automation with Event Processing:
Insights and patterns found by a business user (leveraging an analytic model created by the data scientist) is a good beginning. However, it only creates business value if you are able to execute the model on future events in real time. Therefore, the key question in the third phase of our analytics maturity model is: What should I do?
The developer has to integrate the analytic model into real-time event processing to automatically monitor real-time transactions and trigger actions (e.g. for cross-selling, fraud detection or predictive maintenance). This phase includes two parts to implement: machine-to-machine automation and enablement of human interactions.
Machine-to-machine automation: Automated action based on analytic models of history combined with live context and business rules. The Challenge: Create, understand, and deploy algorithms and rules that automate key business reactions.
Human interactions: Human decisions in real time informed by up-to-date information via pushed events. The challenge: Empower operations staff to see and seize key business moments.
Event processing is used to operationalize insights and patterns in real time without rebuilding the analytic models. Business users and data scientists have to work together closely to build and validate analytic models. These models have to be deployable into a real-time event processing engine without rebuilding the model. Manual or automated re-validation and re-deployment of an improved model closes the analytics loop. If you want to implement a closed loop of Big Data analytics to create analytic models with machine learning and deploy them into real-time event processing, then you should focus on ease-of-use, time-to-market, flexibility and extensibility of the tool chain.