Examples¶
Note that all the IPython notebooks used in the examples are available here: https://github.com/ritabratamaiti/RapidML/tree/master/Examples
ASD (Autism Spectrum Disorder) Detection¶
GitHub: https://github.com/ritabratamaiti/Autism-Detection-API
This project uses RapidML to detect ASD cases in adults. The training data consists of patients' responses to the AQ-10 questionnaire. RapidML is used to select, train, serialize and package a high-accuracy classifier. The directory generated by RapidML, which contains the packaged model, is then uploaded to a WSGI server (see the various deployment options here: http://flask.pocoo.org/docs/1.0/deploying/).
PythonAnywhere (https://www.pythonanywhere.com) was used in this project. The builder_script.py script, shown below, uses RapidML.
import RapidML
import os
import pandas as pd
# This Autism Screening Adult Data Set is from UCI Machine Learning Repository and is available here: https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult
df = pd.read_csv('out.csv')
df = df.drop(columns = ['Unnamed: 0'])
df.head()
ml_model = RapidML.rapid_classifier(df,name='ASDapi')
Note: The training data is the Autism Screening Adult Data Set from the UCI Machine Learning Repository and is available here: https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult
The code generates the following output.
RapidML, Version: 0.1, Author: Ritabrata Maiti
.---. .-----------
/ \ __ / ------
/ / \( )/ -----
////// ' \/ ` ---
//// / // : : ---
// / / /` '--
// //..\
====UU====UU====
'//||\\`
''``
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Using the RapidML Classifier; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com
Label Encoding is being done....
Training....
Generation 1 - Current best internal CV score: 1.0
Generation 2 - Current best internal CV score: 1.0
Generation 3 - Current best internal CV score: 1.0
Generation 4 - Current best internal CV score: 1.0
Generation 5 - Current best internal CV score: 1.0
Best pipeline: DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=2, min_samples_leaf=4, min_samples_split=6)
Sample Output from input dataframe:
1,1,0,1,0,0,1,1,0,1,6,35.0,f,White-European,no,yes,United States,no,Self,NO
The generated model, scripts and serialized files are stored in the directory ASDapi. This directory is uploaded to a WSGI server to make cloud predictions.
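The log above reports that label encoding is performed before training. As a rough illustration of what that step means (not RapidML's actual implementation), the sketch below maps each distinct string in a categorical column to an integer, following the same sorted-order convention as sklearn.preprocessing.LabelEncoder. The helper name label_encode and the toy column values are made up for this example.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order.

    Hypothetical helper for illustration only; it follows the convention of
    sklearn.preprocessing.LabelEncoder, where codes follow the sorted
    unique values of the column.
    """
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


# Toy column resembling the gender field in the sample row above.
gender = ["f", "m", "m", "f"]
codes, classes = label_encode(gender)
print(codes)    # [0, 1, 1, 0]
print(classes)  # ['f', 'm']

# The list of classes is what allows a code to be decoded again.
print(classes[codes[1]])  # m
```

Keeping the class list alongside the model matters for a deployed API, since incoming string values have to be encoded with the same mapping that was used during training.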
Note: This is a complete project, and some parts (such as the creation of the Android application) are outside the scope of the RapidML documentation. Please visit the project on GitHub for more details.
Boston House Prices¶
Let’s say we are building a machine learning model to run on the cloud and predict housing prices in an area, using parameters such as crime rates, business development and pollution metrics. We will be using the Boston House Prices dataset because it is widely available and widely used in machine learning teaching and research. A description of the dataset is available here: https://www.kaggle.com/c/boston-housing
Note: We will be using sklearn.datasets for easy loading of the Boston housing dataset within Python.
Since we are predicting prices, this is a regression problem. We will be using RapidML.rapid_regressor_arr for this task.
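# coding: utf-8
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example requires an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import RapidML
housing = load_boston()
# Hold out 25% of the samples for scoring the trained model.
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)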
model = RapidML.rapid_regressor_arr(X_train, y_train)
print(model.m_tpot.score(X_test, y_test))
The following output is generated.
RapidML, Version: 0.1, Author: Ritabrata Maiti
.---. .-----------
/ \ __ / ------
/ / \( )/ -----
////// ' \/ ` ---
//// / // : : ---
// / / /` '--
// //..\\
====UU====UU====
'//||\\\`
''``
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
Using RapidML Regressor with arrays, Inputs will not be label encoded.; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com
Training....
Generation 1 - Current best internal CV score: -11.913707598413463
Generation 2 - Current best internal CV score: -11.913707598413463
Generation 3 - Current best internal CV score: -11.913707598413463
Generation 4 - Current best internal CV score: -11.913707598413463
Generation 5 - Current best internal CV score: -11.404014702360742
Best pipeline: GradientBoostingRegressor(input_matrix, alpha=0.75, learning_rate=0.1, loss=huber, max_depth=3, max_features=1.0, min_samples_leaf=5, min_samples_split=4, n_estimators=100, subsample=0.6000000000000001)
-10.908425630183695
As we can see in this example, a score of -10.908425630183695 has been achieved. Note that a different pipeline may be generated on each program run, so the score may fluctuate by a small margin (approximately 1% or so).
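To put the score in context: TPOT's default scoring function for regression is negative mean squared error, so the score can be converted into a root mean squared error in the units of the target. This assumes RapidML leaves that default unchanged, which this page does not state. The Boston target is the median home value in thousands of dollars. A minimal sketch of the conversion:

```python
import math

# Score printed by the example run above; assumed to be negative MSE.
neg_mse = -10.908425630183695

mse = -neg_mse          # flip the sign to recover the mean squared error
rmse = math.sqrt(mse)   # same units as the target (thousands of dollars)

print(f"MSE:  {mse:.3f}")   # MSE:  10.908
print(f"RMSE: {rmse:.3f}")  # RMSE: 3.303
```

Under that assumption, the predictions on the held-out quarter of the data are off by roughly 3.3 thousand dollars in the root-mean-square sense, and scores closer to zero indicate a better model.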
In the directory RapidML_files, the model file and the API.py script have been generated. These can be uploaded to a WSGI server (with Flask support) to perform cloud predictions.
Using RapidML to build a neural network (For recognizing hand-written digits)¶
This example demonstrates the versatility of RapidML by using UDMs (User Defined Models). Note that we will be using matplotlib to visualise the digits’ images. In this example, we use RapidML.rapid_udm_arr to supply a neural network classifier (sklearn.neural_network.MLPClassifier) as the machine learning model. We use the digits dataset from sklearn.datasets and train the neural network on half of the data. The other half is used for testing and visualization.
The following are Jupyter Notebook cells and their corresponding output.
import RapidML
from sklearn import datasets
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
digits = datasets.load_digits()
# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# matplotlib.pyplot.imread. Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)
# To apply a classifier to this data, we need to flatten each image, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
clf = MLPClassifier(alpha=1)
mclf = RapidML.rapid_udm_arr(data[:n_samples // 2], digits.target[:n_samples // 2], clf)
Using RapidML with User Defined Models and Arrays, Inputs will not be label encoded; note that the model provided by the user should be a Scikit_learn model and not a TPOT object.; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com
Training....
C:\Users\Ritabrata Maiti\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
expected = digits.target[n_samples // 2:]
predicted = mclf.model.predict(data[n_samples // 2:])
from sklearn import metrics
print("Classification report for classifier %s:\n%s\n"
      % (mclf.model, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))
Classification report for classifier MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False):
             precision    recall  f1-score   support

          0       0.99      0.97      0.98        88
          1       0.95      0.92      0.94        91
          2       0.99      0.98      0.98        86
          3       0.96      0.85      0.90        91
          4       0.99      0.89      0.94        92
          5       0.93      0.96      0.94        91
          6       0.91      0.99      0.95        91
          7       0.95      0.99      0.97        89
          8       0.93      0.94      0.94        88
          9       0.86      0.96      0.91        92

avg / total       0.95      0.94      0.94       899
Confusion matrix:
[[85 0 0 0 1 0 2 0 0 0]
[ 0 84 0 1 0 1 0 0 0 5]
[ 1 0 84 1 0 0 0 0 0 0]
[ 0 0 1 77 0 3 0 4 6 0]
[ 0 0 0 0 82 0 6 0 0 4]
[ 0 0 0 0 0 87 1 0 0 3]
[ 0 1 0 0 0 0 90 0 0 0]
[ 0 0 0 0 0 1 0 88 0 0]
[ 0 3 0 0 0 0 0 0 83 2]
[ 0 0 0 1 0 2 0 1 0 88]]
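As a cross-check on how the report and the confusion matrix relate, the per-class figures can be recomputed from the matrix alone. The sketch below copies the matrix printed above (rows are true digits, columns are predicted digits) and derives recall and precision for the digit 3, plus the overall accuracy. The helper names recall and precision are defined here for illustration and are not part of RapidML or scikit-learn.

```python
# Confusion matrix copied from the output above: rows are the true digits,
# columns are the predicted digits.
confusion = [
    [85, 0, 0, 0, 1, 0, 2, 0, 0, 0],
    [0, 84, 0, 1, 0, 1, 0, 0, 0, 5],
    [1, 0, 84, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 77, 0, 3, 0, 4, 6, 0],
    [0, 0, 0, 0, 82, 0, 6, 0, 0, 4],
    [0, 0, 0, 0, 0, 87, 1, 0, 0, 3],
    [0, 1, 0, 0, 0, 0, 90, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 88, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 83, 2],
    [0, 0, 0, 1, 0, 2, 0, 1, 0, 88],
]


def recall(matrix, digit):
    """Fraction of true `digit` samples that were predicted as `digit`."""
    return matrix[digit][digit] / sum(matrix[digit])


def precision(matrix, digit):
    """Fraction of `digit` predictions that were actually `digit`."""
    return matrix[digit][digit] / sum(row[digit] for row in matrix)


total = sum(sum(row) for row in confusion)
correct = sum(confusion[d][d] for d in range(len(confusion)))
accuracy = correct / total

print(f"recall(3)    = {recall(confusion, 3):.2f}")     # 0.85
print(f"precision(3) = {precision(confusion, 3):.2f}")  # 0.96
print(f"accuracy     = {accuracy:.3f}")                 # 0.943
```

The values for the digit 3 match the corresponding row of the classification report, and the total of 899 matches its support column.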
images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)
plt.show()
Note: If you wish to use the model as a Flask API, remember to flatten the image, turning the data into a (samples, features) matrix, and then convert it to a URL argument. This method has not been fully tested and is not guaranteed to work. Alternatively, it is possible to modify the API.py file to, say, accept an image and then flatten it within the script itself.
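A rough sketch of the flatten-then-encode idea, using only the standard library. The endpoint URL, the query parameter name features, and both helper functions are hypothetical; the generated API.py defines the real argument format, which should be checked before relying on anything like this. A tiny 2x2 nested list stands in for an 8x8 digit image.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def flatten_image(image):
    """Flatten a 2-D image (a sequence of rows) into a 1-D list of pixels."""
    return [pixel for row in image for pixel in row]


def build_prediction_url(base_url, features, param="features"):
    """Encode the feature values as one comma-separated URL argument.

    `param` is a hypothetical argument name, not taken from API.py.
    """
    query = urlencode({param: ",".join(str(value) for value in features)})
    return f"{base_url}?{query}"


# Stand-in for one 8x8 image from the digits dataset.
image = [[0, 1], [2, 3]]
url = build_prediction_url("http://localhost:5000/predict", flatten_image(image))
print(url)

# Decoding on the receiving side recovers the flattened feature vector.
raw = parse_qs(urlsplit(url).query)["features"][0]
decoded = [float(value) for value in raw.split(",")]
print(decoded)  # [0.0, 1.0, 2.0, 3.0]
```

With a NumPy array such as an entry of digits.images, calling reshape(-1) on the array produces the same flattened vector.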