Machine Learning (Beta)

Buddha once said, "To reach Enlightenment, you must turn data into insight and insight into action". Ok, he didn't say that, but Knowi can help you blend hindsight with foresight and drive actions from your data.

Overview

Currently, Knowi supports Classification, Regression and Time-Series Anomaly Detection type Machine Learning use cases, with clustering and deep learning coming soon. We also have a data preparation wizard that will guide you through the steps necessary to clean your data prior to any supervised modeling activities.

Anomaly detection is often used to identify unusual patterns that do not conform to expected behavior (called outliers). There are many applications in business, from intrusion detection to system health monitoring and from fraud detection in credit card transactions to fault detection in operating environments.

For supervised learning, algorithms are selected based on the type of prediction response:

  • if your response is continuous numbers, then you will be using regression algorithms.
  • if your response is categories or classes, then you will be using classification algorithms.

For example, if you are building a model to predict the dollar amount by which a person is likely to default on a credit card payment, then it's regression. However, if you just want to know whether they are likely to default or not, then it's classification.

To start the Machine Learning process, simply select the Machine Learning icon, create your workspace and let Knowi guide you through the steps required to create your Machine Learning models!

Trigger Notification and Actions

Triggers and actions can be applied to the results. For example, you can send an alert or a webhook into your application for the users with a high risk of default for the use case above. The process for setting up triggers and alerts on a query with machine learning remains the same as a normal dataset/query. For more details, see Alerts.

Workspaces

Creating Workspaces

The very first thing required when starting a Machine Learning project in Knowi is to create a workspace. A workspace can be thought of as a folder that will contain all your subsequent machine learning models for the particular use case in question.

Once the workspace is created and the required type of modeling determined, the user is then required to either select or upload their training dataset. This dataset will include historical data relating to the variable they wish to predict. The example flow below is for supervised learning (classification and regression).

The user can then apply Cloud9QL to the training dataset, select the variable they wish to predict, and analyze their data. In addition to seeing the columns present in the training dataset, they can view statistical information about each column by clicking on the icon in its column header.

Once the data is uploaded and the attribute to be predicted has been selected, the user then selects Prepare Data. This will then guide the user step by step through some tasks designed to help clean the data items ready for the machine learning algorithms to use.

Editing Workspaces

When entering the Machine Learning module, the user will automatically be taken to a list of their current workspaces and published models. To edit a previously created workspace, simply click on the edit icon next to the workspace name.

Data Preparation

Once your data is loaded and the attribute to be predicted has been selected, the next step is to ensure that your data is ready for the machine learning algorithms to run against successfully.

Knowi will lead you through a series of data preparation steps (some are mandatory and some are optional) prior to running the algorithms of your choice.

Note that the results of each step are saved. If a user leaves the data preparation area and returns later the system will direct them to the next step in the process automatically. The user also always has access to view their data by clicking in the top right hand corner of the box.

Data Types

Firstly, we need to ensure that all data types are correct. The user has the option to modify the data types, if necessary. The user simply selects the correct data type per column and then selects 'Next Step'.

Outliers

The next step is the identification of potential outliers in your data. Knowi will highlight these values and allow the user to either remove all of them, remove selected values or skip the step completely.

Note that the user also always has the ability to go back to the Cloud9QL processing area and inspect their data again.

Missing Values

It is important that the training dataset does not have any missing values (null values). Rows containing missing values will either need to be removed or imputed (calculated) using the mean of the associated column. The user has the option to remove or impute values. This step is mandatory.

The system will allow the user to:

  1. Enter a percentage threshold for rows: all rows with more than this percentage of missing values will be removed from the dataset (e.g., remove all rows where more than 25% of the numerical values are missing)
  2. Enter a percentage threshold for columns: all columns with more than this percentage of missing numerical values will be removed from the dataset (e.g., remove all columns where more than 9% of the values are missing)
  3. Impute the remaining missing values on a column-by-column basis
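
The three options above can be sketched in plain Python. This is only an illustration of the idea, not Knowi's implementation: the function name `clean_missing`, its parameters, and the use of `None` to mark a missing value are all assumptions made for this example.

```python
from statistics import mean

def clean_missing(rows, row_threshold=0.25, col_threshold=0.09):
    """Drop sparse rows, then sparse columns, then mean-impute what is left.

    `rows` is a list of dicts sharing the same keys; None marks a missing
    value (an assumption made for this sketch).
    """
    columns = list(rows[0].keys())

    # 1. Remove rows whose share of missing values exceeds row_threshold.
    kept = [r for r in rows
            if sum(r[c] is None for c in columns) / len(columns) <= row_threshold]
    if not kept:
        return []

    # 2. Remove columns whose share of missing values exceeds col_threshold.
    columns = [c for c in columns
               if sum(r[c] is None for r in kept) / len(kept) <= col_threshold]

    # 3. Impute the remaining gaps with the mean of the associated column.
    cleaned = [{c: r[c] for c in columns} for r in kept]
    for c in columns:
        present = [r[c] for r in cleaned if r[c] is not None]
        if not present:
            continue  # nothing to average; leave the column untouched
        col_mean = mean(present)
        for r in cleaned:
            if r[c] is None:
                r[c] = col_mean
    return cleaned

# Hypothetical sample data.
rows = [
    {"age": 30, "income": 100.0, "debt": 10.0, "limit": 5.0},
    {"age": None, "income": None, "debt": 20.0, "limit": 6.0},
    {"age": 40, "income": None, "debt": 30.0, "limit": 7.0},
    {"age": 50, "income": 300.0, "debt": 40.0, "limit": 8.0},
]
result = clean_missing(rows, row_threshold=0.25, col_threshold=0.5)
```

Here the second row is dropped because half of its values are missing, and the remaining gap in `income` is filled with the mean of the values that are present.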

Rescaling

If your numerical attributes are comprised of different scales (for example, weight, height, age, etc.) then you have the option of rescaling this data. This is not required, but may boost performance. Try creating different models for your non-rescaled, standardized and normalized data and see which ones achieve higher accuracy.

Two methods of rescaling are offered:

  • Normalization: use this when you do not know the distribution of your data or the distribution is not Gaussian. This will set all values across the board to be between 0 and 1.
  • Standardization: use this if your data is Gaussian. This will transform the data to have a mean of 0 and a standard deviation of 1.

Simply select the data items to rescale and the method, then choose 'Next Step'.

This step may be skipped entirely.
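
To make the difference between the two rescaling methods concrete, here is a minimal sketch in plain Python. The function names are illustrative, and the use of the population standard deviation is an assumption; the text does not say which variant Knowi applies.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

heights = [10, 20, 30]           # hypothetical column
scaled = normalize(heights)      # values now lie between 0 and 1
z_scores = standardize(heights)  # values now have mean 0, std dev 1
```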

Discrete Grouping

Some algorithms, such as Decision Trees, work better with discrete data. This means taking numerical data and converting it into logical, ordered groups or bins of data (ordinal attributes). It is most useful if you believe there are natural groupings within your column data or if your numerical data has a large range of values (for example, -infinity to 7,000,000,000).

This step is optional and can be skipped.
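
As an illustration of discrete grouping, the sketch below assigns each numerical value to one of several equal-width bins. Equal-width binning is only one possible strategy and is an assumption here; the text does not state how Knowi chooses its group boundaries, and the function name is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Return an ordinal bin index (0 .. n_bins - 1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # constant column: a single bin
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 62, 90]             # hypothetical column
age_groups = equal_width_bins(ages, 3)  # three ordered groups
```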

Dummy Variables

Some algorithms only work with numerical data and do not support nominal or ordinal data. It will therefore be necessary to convert these values into real values. Each category will be transformed into a column (or attribute) and 0 or 1 will be inserted as the value. This is called widening your dataset.

For example, a column called Gender typically has permissible string entries for 'Male', 'Female' and 'Not Specified'. If the value in a particular case is 'Male', then this would become three columns (one for each category): Gender:Male (with a value of 1), Gender:Female (with a value of 0) and Gender:Not Specified (with a value of 0).

The existing column below would become three columns:

Existing column   Value   New column      Value
Gender            Male    Male            1
                          Female          0
                          Not Specified   0
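
The widening described above is commonly known as one-hot encoding. A minimal sketch, assuming rows are represented as Python dicts and using the `Column:Category` naming from the example (the function name `one_hot` is illustrative):

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 column per category found in the data."""
    categories = sorted({r[column] for r in rows})
    widened = []
    for r in rows:
        # Copy every other attribute unchanged.
        new_row = {k: v for k, v in r.items() if k != column}
        # Add one indicator column per category.
        for cat in categories:
            new_row[f"{column}:{cat}"] = 1 if r[column] == cat else 0
        widened.append(new_row)
    return widened

rows = [{"Gender": "Male"}, {"Gender": "Female"}, {"Gender": "Not Specified"}]
widened = one_hot(rows, "Gender")
```

The first row becomes `Gender:Male` = 1, `Gender:Female` = 0 and `Gender:Not Specified` = 0, matching the example above.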

This concludes the data preparation activity. Any decisions made along the way have been saved and a user can jump back to any previous step and make changes, if they wish.

The next step in our machine learning journey is to now select the model features that will help predict the outcome.

Feature Selection

Once all data has been prepared, the user is now asked to select the features (data attributes) to feed into the model creation.

Feature selection is a crucial part of machine learning and a user will typically create many different models using many different combinations of features before finding the best fit.

The user has two options at this point, to either manually select their features or to let Knowi auto-select features for them based upon correlation and information gain algorithms that we run against the dataset.

It is highly recommended to run your model several times with different features selected.
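
To give a feel for correlation-based feature selection, the sketch below ranks candidate features by the absolute value of their Pearson correlation with the target. It is not a description of the auto-select logic in Knowi, whose details are not given here; the function names and sample data are assumptions for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def rank_features(features, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical dataset: one informative feature, one noisy one.
features = {"rooms": [1, 2, 3, 4], "noise": [5, 1, 4, 2]}
price = [10, 20, 30, 40]
ranking = rank_features(features, price)
```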

Once the features have been selected, the user then selects the algorithms to run and train their models.

Model Creation

After selecting the features, the user is then able to select the algorithms they wish to use to train their model(s).

The user can select one or more algorithms and can also repeat using different features and settings each time.

The algorithms displayed depend on whether the user specified Classification or Regression as the workspace type at workspace creation time.

Clicking on the settings cog will allow the user to enter algorithm specific parameters.

Once all required algorithms and their settings have been entered, the user then selects 'Train'.

The models and their corresponding results will then appear in the Results section.

Each model result has 3 icons associated with it. These allow you to inspect the results of the model and also publish the chosen model. Published models can then be used against live Knowi queries to predict against incoming data. The icons allow you to:

  • view the data results of the trained model and see the predicted output against the original predictor input
  • view the statistical results of each model
  • publish the chosen model and make it available for use against incoming data

To use the model against a live Knowi query, the user selects the 'Use Model' option corresponding to the model they wish to use. The system will then take them to the query list page where they can select the appropriate query and associate the model to be used at query run time.

Classification Machine Learning

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available.

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.

Knowi currently supports 4 different classifiers:

  • Decision Tree
  • Logistic Regression
  • KNN
  • Naive Bayes

As an example, to predict whether a client will default on their next payment period based on their prior payment behavior:

  1. Download and use the data from the UCI Machine Learning Repository. This dataset contains credit card data for 30,000 clients with 24 attributes, including:

    • Personal characteristics such as age, education, gender, and marital status
    • Credit line limit information
    • Billing/payment history for the 6-month period from April to September 2005
  2. Navigate over to our Workspaces page and you will be led through all the necessary steps to create your model.

Time-series Anomaly Detection

Time-series anomaly detection is a feature used to identify unusual patterns that do not conform to expected behavior, called outliers. There are many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.

Once an Anomaly Detection workspace has been created, the user will be presented with a number of configuration steps.

  1. Select Dataset - the user is able to select an existing time-series dataset or upload a new dataset to analyze (please note that anomaly detection algorithms work only with time-series data at this time)
  2. Cloud9QL data manipulation (optional) - this allows the user to post process the data by applying Cloud9QL transformation
  3. Select the Date/Time Dimension - this is the time series feature of the selected dataset that is going to be on the X chart axis
  4. Select the Numeric attribute - this is the numerical feature of the selected dataset that you'd like to monitor. This will be the Y chart axis
  5. Choose your Algorithm - here the user will select one of the many anomaly forecasting algorithms available (see below)

Anomaly forecasting algorithms

  • Olympic Model (Seasonal Naive) A naive seasonal model in which the prediction for the next point is a smoothed average over the previous n periods.

  • Double and Triple Exponential Smoothing Models Both are popular models used to produce smoothed time series. The exponential smoothing variants add trend and seasonality to the model. The ETS model automatically picks the best-fitting exponential smoothing model.

  • Moving Average Model Here, the forecast is based on an artificially constructed time series in which the value for a given time period is replaced by the mean of that value and the values for some number of the preceding and succeeding time periods.

  • Weighted Moving Average and Naive Forecasting Models The forecast for both of these models is based on an artificially constructed time series in which the value for a given time period is replaced by the mean of that value and the values for some number of the preceding and succeeding time periods. The Weighted Moving Average is a special case of the moving average model.

  • Regression Model Models the relationship between x and y using one or more variables.

  • ARIMA Model Uses the Autoregressive Integrated Moving Average method.
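
Two of the simpler models in the list above can be sketched in a few lines of Python. Both functions are illustrative only and make assumptions that the text does not confirm: the moving average here uses a trailing window (only preceding values), and the Olympic model is interpreted as averaging the values found at the same seasonal position in each of the previous n seasons. The function and parameter names are hypothetical.

```python
def moving_average_forecast(series, window):
    """Forecast the next point as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

def olympic_forecast(series, season_length, n_periods):
    """Forecast the next point as the average of the values observed at the
    same position in each of the previous `n_periods` seasons."""
    if len(series) < season_length * n_periods:
        raise ValueError("not enough history for the requested periods")
    samples = [series[-season_length * k] for k in range(1, n_periods + 1)]
    return sum(samples) / len(samples)

# Hypothetical metric with a season of 4 points and a slow upward trend.
series = [1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]
ma = moving_average_forecast(series, window=3)
seasonal = olympic_forecast(series, season_length=4, n_periods=3)
```

The trailing moving average follows the most recent values, while the seasonal forecast looks back at the start of each earlier season, so the two can differ considerably on strongly seasonal data.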

As soon as the above steps have been completed and the Run Analysis option has been selected, an anomaly detection model is trained and applied to the data. The precision of the model increases over time as more data is made available.

The anomaly detection visualization itself consists of a configurable blue band range of expected values (acceptable threshold limit) along with the actual metric data points. Any values outside of the blue band range are considered anomalies and will appear in red.

Configuring the Anomaly Detection Algorithm

The width of the blue band of the expected values can be configured by setting the threshold attribute explicitly on the settings modal dialog. This Anomaly detection threshold is the mean absolute percentage deviation from the expected value. The default threshold value set is 50% but this can be modified.
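
The effect of the threshold can be illustrated with a small sketch. It assumes that each actual value is compared with its expected value and flagged when the absolute percentage deviation exceeds the threshold; how Knowi computes the band internally is not specified here, and the function name and data are hypothetical.

```python
def find_anomalies(actual, expected, threshold_pct=50.0):
    """Return the indexes of points whose absolute percentage deviation
    from the expected value exceeds `threshold_pct`."""
    anomalies = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if e == 0:
            continue  # percentage deviation is undefined for a zero baseline
        deviation_pct = abs(a - e) / abs(e) * 100
        if deviation_pct > threshold_pct:
            anomalies.append(i)
    return anomalies

actual = [100, 160, 40, 105]      # hypothetical metric values
expected = [100, 100, 100, 100]   # hypothetical forecast
flagged = find_anomalies(actual, expected)               # default 50% band
flagged_wide = find_anomalies(actual, expected, 70.0)    # wider band
```

Raising the threshold widens the band, so fewer points are reported as anomalies.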

Saving the Anomaly detection visualization

Optionally, you can save the anomaly detection visualization results as a widget that can then be shared on one or more dashboards. To do this, simply select the Save Widget option and enter a widget name. The widget will now appear in the general widget list for subsequent use outside of the Machine Learning module.

However, the anomaly-related information shown in the widget settings bar cannot be edited there. All anomaly detection settings have to be changed directly in the anomaly workspace.

Setting an Anomaly Detection alert

One crucial feature around the anomaly detection is the ability to configure alerts that provide automatic notification when new anomalies are detected.

Channels such as email, webhook and Slack can be easily set up by selecting the alerts button from the control list.

By default, the look-back interval is set equal to the alert frequency, so only anomalies that fall within that interval will be reported. The system triggers the alert as soon as at least 1 anomaly is detected.

There are several fixed email placeholders that may be used in the email template to add additional information:

  • %DATASET_NAME% - represents the dataset name selected
  • %ANOMALY_SIZE% - represents the number of anomalies within the look back interval
  • %FREQUENCY% - represents the frequency of the alert chosen
  • %ANOMALY_RESULTS% - represents the detailed information about the anomalies, including expected range and actual metric value

Adding additional analyses

The workspace can contain one or more anomaly detection models. To add another into the workspace, simply choose the Add Analysis button.

Regression Machine Learning

In regression problems, we are trying to predict continuous values as the output. This differs from classification, where the output is a category or class. There are a number of different types of regression problems we support using the following algorithms:

  • Linear Regression (OLS)
  • Radial Base Functions
  • Regression Trees (e.g. Random Forest)
  • Support Vector Regression (SVR)

As an example, we will build a predictive model to predict house prices (the price is a number from some defined range, so this is a regression task). We will be using linear regression to predict the sale price based on multiple attributes.

You can download the house price dataset here.

Let's suppose you want to sell your house and you are wondering what you can get for it. You usually look for other homes similar to yours, in the same area and close to the same age as yours. We will do something similar, but with Linear Regression Machine Learning.

Attribute Information:

  1. CRIM per capita crime rate by town

  2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.

  3. INDUS proportion of non-retail business acres per town

  4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  5. NOX nitric oxides concentration (parts per 10 million)

  6. RM average number of rooms per dwelling

  7. AGE proportion of owner-occupied units built prior to 1940

  8. DIS weighted distances to five Boston employment centers

  9. RAD index of accessibility to radial highways

  10. TAX full-value property-tax rate per $10,000

  11. PTRATIO pupil-teacher ratio by town

  12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  13. LSTAT % lower status of the population

  14. PRICE True value of owner-occupied homes in $1000's

We will be training our model using PRICE.

Now, navigate over to our Workspaces page and you will be led through all the necessary steps to create your model.

Algorithms

Radial Base Function Network

A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. It is a linear combination of radial basis functions. They are used in function approximation, time series prediction, and control.

A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, so that φ(x) = φ(||x||); or alternatively on the distance from some other point c, called a center, so that φ(x, c) = φ(||x - c||). Any function φ that satisfies this property is a radial function. The norm is usually Euclidean distance, although other distance functions are also possible. For example, by using a probability metric it is possible for some radial functions to avoid problems with ill-conditioning of the matrix solved to determine the coefficients w_i, since ||x|| is always greater than zero.
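
As a concrete illustration, the sketch below uses a Gaussian radial basis function and evaluates a network output as a weighted sum of such functions, one per center. The Gaussian form, the `width` parameter and the function names are assumptions chosen for this example; the text does not state which radial function Knowi uses.

```python
from math import dist, exp

def gaussian_rbf(x, center, width=1.0):
    """Gaussian radial function: depends only on the distance ||x - center||."""
    r = dist(x, center)  # Euclidean distance
    return exp(-(r / width) ** 2)

def rbf_network(x, centers, weights, width=1.0):
    """Network output: a linear combination of radial basis functions."""
    return sum(w * gaussian_rbf(x, c, width) for w, c in zip(weights, centers))

# Hypothetical centers and coefficients w_i.
centers = [(0.0, 0.0), (3.0, 4.0)]
weights = [2.0, 1.0]
output = rbf_network((0.0, 0.0), centers, weights, width=5.0)
```

In a trained network the centers and weights are learned from the data; here they are fixed by hand to keep the example short.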

Ordinary Least Squares (OLS)

In linear regression, the model specification is that the dependent variable is a linear combination of the parameters. The residual is the difference between the value of the dependent variable predicted by the model, and the true value of the dependent variable. Ordinary least squares obtains parameter estimates that minimize the sum of squared residuals, SSE (also denoted RSS).
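
For a single predictor, the least squares estimates have a simple closed form, which the following sketch computes directly. The function names and the sample data are illustrative; a real model with multiple attributes would solve the equivalent matrix problem.

```python
def fit_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def sse(xs, ys, intercept, slope):
    """Sum of squared residuals (SSE) for a fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_ols(xs, ys)
residual_error = sse(xs, ys, intercept, slope)
```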

K-Nearest Neighbor

The k-nearest neighbor algorithm (k-NN) is a method for classifying objects by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is typically small). k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until classification.

The simplest k-NN method takes a data set of feature vectors and labels with Euclidean distance as the similarity measure.

The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, e.g. cross-validation. In binary problems, it is helpful to choose k to be an odd number as this avoids tied votes.

The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). k-NN is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points).

The user can also provide a customized distance function.

Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.

Alternatively, the user may provide a k-nearest neighbor search data structure. Besides the simple linear search, KD-Tree, Cover Tree, and LSH (Locality-Sensitive Hashing) for efficient k-nearest neighbor search are also available.

A KD-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. Cover tree is a data structure for generic nearest neighbor search (with a metric), which is especially efficient in spaces with small intrinsic dimension. The cover tree has a theoretical bound that is based on the dataset's doubling constant. LSH is an efficient algorithm for approximate nearest neighbor search in high dimensional spaces by performing probabilistic dimension reduction of data.

Nearest neighbor rules in effect compute the decision boundary in an implicit manner. In general, the larger k, the smoother the boundary.
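
The simplest form of k-NN described in this section (linear search, Euclidean distance, majority vote) fits in a few lines. This is a minimal sketch with hypothetical names and data, not the implementation used by Knowi.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Simple linear search: sort all training points by distance to the query.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
near_a = knn_classify(train, (2, 2), k=3)
near_b = knn_classify(train, (7, 7), k=3)
```

Using an odd k, as here, avoids tied votes in a two-class problem.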

Naive Bayes

The Naive Bayes Classifier technique is based on Bayes' theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

NaiveBayesIntro

To demonstrate the concept of Naive Bayes Classification, consider the example displayed in the illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.

Users can change the following settings:

  • Generation Model: Multinomial or Bernoulli. The multinomial model generates one term in each position of the document. The multivariate Bernoulli model (or Bernoulli model) generates an indicator for each term, indicating either the presence or the absence of the term in the document.
  • Add-k smoothing: By default, we use add-one (Laplace) smoothing, which simply adds one to each count to eliminate zeros.
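
The prior probabilities and the smoothing setting described above can be illustrated as follows. The object counts (40 GREEN, 20 RED) are hypothetical values chosen to match the "twice as many" ratio in the example, and the function names are assumptions made for this sketch.

```python
from collections import Counter

def class_priors(labels):
    """Prior probability of each class: its share of the observed objects."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

def smoothed_likelihood(count, class_total, vocabulary_size, k=1.0):
    """Add-k smoothed estimate of P(term | class).

    With k = 1 this is add-one (Laplace) smoothing, so a term that was
    never seen with a class still gets a small non-zero probability.
    """
    return (count + k) / (class_total + k * vocabulary_size)

labels = ["GREEN"] * 40 + ["RED"] * 20   # hypothetical counts
priors = class_priors(labels)
unseen = smoothed_likelihood(count=0, class_total=10, vocabulary_size=5)
```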

Support Vector Regression

Support vector machines can be used as a regression method, maintaining all the main features of the algorithm. In the case of regression, a margin of tolerance ε is set in approximation. The goal of SVR is to find a function that has at most ε deviation from the response variable for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this.
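
The idea of ignoring small errors is usually expressed through the ε-insensitive loss, sketched below. This shows only the loss for a single prediction, not a full SVR solver, and the function name and default value of `epsilon` are assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tolerance band, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05 is tolerated
outside = epsilon_insensitive_loss(1.0, 1.5)   # error 0.5 exceeds the band
```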

Regression Tree

A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning.

Classification and Regression Tree techniques have a number of advantages over many alternative techniques.

Simple to understand and interpret:
In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.

Able to handle both numerical and categorical data:
Other techniques are usually specialized in analyzing datasets that have only one type of variable.

Nonparametric and nonlinear:
The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.

One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause overfitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.

Logistic Regression

Logistic regression (logit model) is a generalized linear model used for binomial regression. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable. A logit is the natural log of the odds of the dependent equaling a certain value or not (usually 1 in binary logistic models, the highest value in multinomial models). In this way, logistic regression estimates the odds of a certain event (value) occurring.
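
The logit transformation and its inverse can be written out directly. The sketch below is a generic illustration of the two functions; the names are chosen for this example.

```python
from math import exp, log

def logit(p):
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1 / (1 + exp(-z))

even_odds = logit(0.5)             # odds of 1:1 give a logit of 0
round_trip = sigmoid(logit(0.8))   # converts back to the probability 0.8
```

A fitted logistic regression model produces a linear score on the logit scale; applying `sigmoid` to that score gives the estimated probability of the event.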

Logistic regression has many analogies to ordinary least squares (OLS) regression. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the raw values of the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements.

Compared with linear discriminant analysis, logistic regression has several advantages:

  • It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group
  • It does not assume a linear relationship between the independent variables and dependent variable.
  • It may handle nonlinear effects since one can add explicit interaction and power terms. However, it requires much more data to achieve stable, meaningful results.

Logistic regression also has strong connections with neural network and maximum entropy modeling. For example, binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. Similarly, multinomial logistic regression is equivalent to a one-layer, softmax-output neural network.

Logistic regression estimation also obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".

Decision Tree

A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.

The settings cog allows the user to enter options for the following:

  • Maximum number of leaf nodes
  • Minimum number of leaf nodes
  • Splitting Rule
    • Gini impurity: a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset
    • Entropy: Information gain is based on the concept of entropy used in information theory. For categorical variables with different numbers of levels, however, information gain is biased in favor of attributes with more levels. Instead, one may employ the information gain ratio, which addresses this drawback of information gain.

Decision tree techniques have a number of advantages over many alternative techniques.

Simple to understand and interpret:
In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.

Able to handle both numerical and categorical data:
Other techniques are usually specialized in analyzing datasets that have only one type of variable.

Nonparametric and nonlinear:
The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.

One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause overfitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.
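
The two splitting rules listed in the Decision Tree settings (Gini impurity and entropy) can be computed as shown in this minimal sketch. The function names and the tiny label sets are illustrative assumptions; a real tree learner evaluates these measures for every candidate split.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: chance of mislabeling a random element if it is
    labeled at random according to the label distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]             # evenly mixed node
perfect_split = [["a", "a"], ["b", "b"]]  # each child is pure
gain = information_gain(parent, perfect_split)
```

A split that separates the classes completely, as in this example, recovers all of the parent's entropy as information gain.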