Predictor
Integrates Deep Learning frameworks (CNTK) with your data.
Machine Learning is taking over the world. In recent years, Deep Neural Networks have solved problems that were considered out of reach for computers, like Image or Speech Recognition.
Still, most business applications are not able to make simple predictions like expected sales next month, product recommendations, house pricing, etc. We think these may be the main reasons:
- Most ML concepts require some level of math / statistics and familiarity with reading formulas.
- Typically new programming languages (like Python or R) and ML frameworks (like TensorFlow or Caffe) need to be learned. Often they do not integrate well with the technologies that applications are built with (like C#, Java or JavaScript), and need to be used through a REST service.
- Most ML algorithms require a flat array of numbers as an input, so some mundane table-flattening / data-conversion / CSV export needs to be done every time the inputs (or outputs) of the model change.
The Predictor module tries to simplify all this by providing an integrated solution that lets end-users with almost no knowledge of ML create models easily, similar to how the Chart module lets users create a D3.js chart from the database in a few clicks.
Let's see how it works:
A Predictor is an entity that is able to learn from data to predict some column(s). If the output is a number we call it regression; if it's a category (an entity, enum or a well-known string) we call it classification.
The predictor requires three pieces of information from the end-user:
- A query (or multiple ones) to fetch the data from the database. As we select the columns (using just a sequence of combo-boxes), we'll also need to select the codification, if necessary.
- A Machine Learning Algorithm, like CNTK Neural Network.
- Some parameters, like the number and size of the hidden layers, the learning algorithm or the percentage of data to use for verification.
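As a rough illustration of the three pieces of information listed above, the settings could be modeled as plain data. This is a hypothetical sketch in Python, not the module's actual entities: every class, field and label name below (`PredictorSettings`, `Column`, `"one_hot"`, the default values, etc.) is an assumption made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    token: str      # column picked from the query, e.g. "Customer.Country"
    usage: str      # "input" or "output"
    encoding: str   # hypothetical labels: "one_hot", "z_normalize" or "none"


@dataclass
class NeuralNetworkParameters:
    # All defaults are arbitrary example values, not the module's defaults.
    hidden_layers: List[int] = field(default_factory=lambda: [64, 32])
    learner: str = "sgd"
    verification_fraction: float = 0.2


@dataclass
class PredictorSettings:
    query: str
    columns: List[Column]
    algorithm: str = "CNTK Neural Network"
    parameters: NeuralNetworkParameters = field(default_factory=NeuralNetworkParameters)

    def problem_kind(self) -> str:
        """Numeric output -> regression; categorical output -> classification."""
        outputs = [c for c in self.columns if c.usage == "output"]
        is_category = any(c.encoding == "one_hot" for c in outputs)
        return "classification" if is_category else "regression"


settings = PredictorSettings(
    query="Order",
    columns=[
        Column("Customer.Country", "input", "one_hot"),
        Column("TotalPrice", "output", "z_normalize"),
    ],
)
print(settings.problem_kind())  # regression
```

In this sketch the kind of problem is derived from the encoding of the output column, mirroring the regression / classification distinction described earlier.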
Once this information is provided the Predictor can be trained, executing the following steps:
- Query: First the queries are made to the database, using the same technology that the SearchControl, Chart module or Word/Email templates use, so it can seamlessly explore
- Codification: The codifications are created. This is the most important part. Most ML algorithms, like Neural Networks or Bayes inference algorithms, expect flat arrays of the same data type. Typically this mapping creates friction when trying different models, but fortunately Predictor saves it automatically after training so it can be used when evaluating the model.
- Data Conversion: It's then necessary to convert every input and output to the required data type, for example codifying each category to a discrete number (Bayes), codifying each category to the neuron that will be active (one-hot encoding) or normalizing the numeric values (Z-normalization).
- Array flattening: If the predictor has sub-queries to get information from sub-collections (like the order lines in an order), the information has to be flattened, so that each column contains the value for one stable grouping key (like the product).
- Splitting: The data is split into Training / Verification sets.
- Training: The inputs and outputs are converted, using the codifications, and the selected algorithm is used for training. Currently only CNTK Neural Networks are available, because CNTK offers excellent performance and great integration with C#. The network is trained in mini-batches, sets of data that are small enough to fit into memory.
- As the training makes progress, the results of the loss function and error function for each mini-batch are displayed in a chart. Also, every few mini-batches the model is evaluated against the verification data, to detect overfitting as soon as possible. This information is also saved in the database to allow comparing different predictors.
- Final Results: The trained model file is saved, and the last results of the loss and error functions are stored in the predictor itself for easy comparison. Other well-known statistics are also stored, depending on the problem:
  - Classification:
    - Total Count / Miss Count
    - Confusion matrix (using charting)
  - Regression:
    - Mean Error / Mean Squared Error / Mean Absolute Error / Root Mean Square Error / Percentage Error / Mean Percentage Error
    - Confusion scatterplot (using charting)
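The codification, data conversion and array flattening steps above can be sketched in a few lines of plain Python. This is only an illustration of the general techniques (one-hot encoding, Z-normalization, flattening a sub-collection by a grouping key), not the module's code: the function names, the shape of the `orders` data and the `codification` dictionary are all assumptions invented for the example.

```python
import math
from typing import Dict, List, Sequence


def fit_one_hot(values: Sequence[str]) -> List[str]:
    """Codification of a category column: the sorted list of known categories."""
    return sorted(set(values))


def one_hot(value: str, categories: List[str]) -> List[float]:
    """One slot (neuron) per category; only the matching one is active."""
    return [1.0 if value == c else 0.0 for c in categories]


def fit_z_norm(values: Sequence[float]) -> Dict[str, float]:
    """Codification of a numeric column: its mean and standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": math.sqrt(variance) or 1.0}  # avoid dividing by 0


def z_norm(value: float, stats: Dict[str, float]) -> float:
    return (value - stats["mean"]) / stats["std"]


def flatten_lines(lines: List[dict], products: List[str]) -> List[float]:
    """Sub-collection -> fixed-width columns, one per stable grouping key."""
    totals = {p: 0.0 for p in products}
    for line in lines:
        if line["product"] in totals:
            totals[line["product"]] += line["quantity"]
    return [totals[p] for p in products]


# Hypothetical result of a main query (orders) with a sub-query (order lines).
orders = [
    {"country": "ES", "weight": 10.0,
     "lines": [{"product": "apple", "quantity": 2}, {"product": "pear", "quantity": 1}]},
    {"country": "FR", "weight": 20.0, "lines": [{"product": "apple", "quantity": 5}]},
    {"country": "ES", "weight": 30.0, "lines": []},
]

# The codification is computed once and would be saved along with the model,
# so that exactly the same mapping is applied when evaluating it later.
codification = {
    "country": fit_one_hot([o["country"] for o in orders]),
    "weight": fit_z_norm([o["weight"] for o in orders]),
    "products": sorted({l["product"] for o in orders for l in o["lines"]}),
}


def encode(order: dict, cod: dict) -> List[float]:
    return (one_hot(order["country"], cod["country"])
            + [z_norm(order["weight"], cod["weight"])]
            + flatten_lines(order["lines"], cod["products"]))


rows = [encode(o, codification) for o in orders]
print(rows[1])  # [0.0, 1.0, 0.0, 5.0, 0.0]
```

Every order becomes a flat array of the same length (two country slots, one normalized weight, one column per product), which is the shape most ML algorithms expect.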
This training process takes longer the more complicated the model is, but it can be accelerated using most NVIDIA GPUs. Evaluating any model is typically instantaneous.
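To make the splitting, mini-batch training and verification steps more concrete, here is a minimal sketch in Python. It deliberately swaps the CNTK Neural Network for a tiny linear model trained with gradient descent, so that the example stays self-contained. All names (`split`, `mini_batches`, `train`, `statistics`), the hyper-parameter values and the toy data are assumptions, and only a subset of the regression statistics is computed.

```python
import math
import random
from typing import List, Tuple

Row = Tuple[List[float], float]  # (already-encoded inputs, expected output)


def split(rows: List[Row], verification_fraction: float, seed: int = 0):
    """Shuffle once, then hold back a fraction of the rows for verification."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    held_back = round(len(shuffled) * verification_fraction)
    cut = len(shuffled) - held_back
    return shuffled[:cut], shuffled[cut:]


def mini_batches(rows: List[Row], size: int):
    """Yield consecutive slices small enough to be processed at once."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def predict(weights: List[float], inputs: List[float]) -> float:
    return sum(w * x for w, x in zip(weights, inputs))


def statistics(weights: List[float], rows: List[Row]) -> dict:
    errors = [predict(weights, x) - y for x, y in rows]
    mse = sum(e * e for e in errors) / len(errors)
    return {
        "mean_error": sum(errors) / len(errors),
        "mean_absolute_error": sum(abs(e) for e in errors) / len(errors),
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
    }


def train(rows, verification_fraction=0.2, batch_size=4,
          epochs=300, rate=0.05, evaluate_every=10):
    training, verification = split(rows, verification_fraction)
    weights = [0.0] * len(rows[0][0])
    history = []  # (step, verification MSE), useful to spot overfitting
    step = 0
    for _ in range(epochs):
        for batch in mini_batches(training, batch_size):
            # Gradient of the mean squared error over this mini-batch.
            gradient = [
                sum(2 * (predict(weights, x) - y) * x[i] for x, y in batch) / len(batch)
                for i in range(len(weights))
            ]
            weights = [w - rate * g for w, g in zip(weights, gradient)]
            step += 1
            if step % evaluate_every == 0:
                mse = statistics(weights, verification)["mean_squared_error"]
                history.append((step, mse))
    return weights, history, statistics(weights, verification)


# Toy data: y = 1 + 2 * x, with a constant 1.0 input acting as the bias.
rows = [([1.0, x / 10], 1 + 2 * (x / 10)) for x in range(20)]
weights, history, final = train(rows)
print(final["root_mean_squared_error"] < 0.01)
```

The `history` list plays the role of the chart described above: if the verification error starts growing while the training error keeps shrinking, the model is overfitting.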
The predictor itself automatically generates a UI for testing the model, creating fields for simple inputs/outputs and tables for inputs and outputs in the sub-queries.
In order to integrate the predictions inside a UI, or expose them in a web-service, a very simple API is provided.
No excuses! It's time to make your app smarter :)