.. _Example:

============
Examples
============

All the IPython notebooks used in these examples are available here: https://github.com/ritabratamaiti/RapidML/tree/master/Examples

****************************************
ASD (Autism Spectrum Disorder) Detection
****************************************

.. _Github: https://github.com/ritabratamaiti/Autism-Detection-API

GitHub: https://github.com/ritabratamaiti/Autism-Detection-API

This project uses RapidML to detect ASD cases in adults. The training data consists of responses provided by patients to the AQ-10 questionnaire. RapidML is used for selecting, training, serializing and packaging a high-accuracy classifier. The files directory generated by RapidML, which contains the packaged model, is then uploaded to a WSGI server (see the available deployment options here: http://flask.pocoo.org/docs/1.0/deploying/).

PythonAnywhere (https://www.pythonanywhere.com) was used in this project. The ``builder_script.py`` script utilizes RapidML.

.. code-block:: python

       import RapidML
       import os
       import pandas as pd
       
       
       # This Autism Screening Adult Data Set is from UCI Machine Learning Repository and is available here: https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult
       
       df = pd.read_csv('out.csv')
       df = df.drop(columns = ['Unnamed: 0'])
       df.head()
       
       ml_model = RapidML.rapid_classifier(df,name='ASDapi')

*Note: The training data is the Autism Screening Adult Data Set from the UCI Machine Learning Repository and is available here:* https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult

The code generates the following output.    

.. code-block:: text

	    
    RapidML, Version: 0.1, Author: Ritabrata Maiti
    
    
           .---.        .-----------
          /     \  __  /    ------
         / /     \(  )/    -----
        //////   ' \/ `   ---
       //// / // :    : ---
      // /   /  /`    '--
     //          //..\
            ====UU====UU====
                '//||\\`
                  ''``
    
    Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
    Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
    Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
    Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
    
    Using the RapidML Classifier; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com
    Label Encoding is being done....
    
    Training....
    
    Generation 1 - Current best internal CV score: 1.0                            
    Generation 2 - Current best internal CV score: 1.0                            
    Generation 3 - Current best internal CV score: 1.0                            
    Generation 4 - Current best internal CV score: 1.0                            
    Generation 5 - Current best internal CV score: 1.0                            
                                                                                  
    Best pipeline: DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=2, min_samples_leaf=4, min_samples_split=6)
    
    Sample Output from input dataframe: 
    1,1,0,1,0,0,1,1,0,1,6,35.0,f,White-European,no,yes,United States,no,Self,NO


The generated model, scripts and serialized files are stored in the ``ASDapi`` directory. This directory is uploaded to a WSGI server to make cloud predictions.
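
The ``Label Encoding is being done....`` message in the output above refers to converting categorical string columns into integer codes before training. The following is a minimal sketch of that general technique using ``sklearn.preprocessing.LabelEncoder``. It is not RapidML's internal code, and the toy DataFrame and its column names are made up for illustration.

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Hypothetical toy data; the column names are illustrative only.
    df = pd.DataFrame({
        "age": [35.0, 24.0, 41.0],
        "gender": ["f", "m", "f"],
        "label": ["NO", "YES", "NO"],
    })

    encoders = {}  # keep one fitted encoder per column to decode later
    encoded = df.copy()
    for column in encoded.columns:
        # Only non-numeric (categorical) columns need encoding.
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            encoder = LabelEncoder()
            encoded[column] = encoder.fit_transform(encoded[column])
            encoders[column] = encoder

    print(encoded)

    # Integer predictions can be mapped back to the original labels.
    print(encoders["label"].inverse_transform([1]))

Keeping the fitted encoders around matters for a deployed model, since incoming requests must be encoded with the same mapping that was used during training.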


**Note**: This is a complete project, and some parts (such as the creation of the Android application) are outside the scope of the RapidML documentation. Please visit the project on Github_ for more details.

*******************
Boston House Prices 
*******************

Suppose we are building a machine learning model that runs on the cloud and predicts housing prices in an area from parameters such as crime rates, business development and pollution metrics. We will use the Boston House Prices dataset because it is widely available and widely used in machine learning teaching and research. A description of the dataset is available here: https://www.kaggle.com/c/boston-housing

**Note**: We will be using ``sklearn.datasets`` for easy loading of the Boston housing dataset within Python. ``load_boston`` was removed in scikit-learn 1.2, so this example requires an earlier release.

Since we are predicting prices, it is clearly a regression problem. We will be using ``RapidML.rapid_regressor_arr`` for this task.

.. code-block:: python

       # coding: utf-8
       
       from sklearn.datasets import load_boston
       from sklearn.model_selection import train_test_split
       import RapidML
       
       housing = load_boston()
       X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                           train_size=0.75, test_size=0.25)
       
       model = RapidML.rapid_regressor_arr(X_train, y_train)
       
       print(model.m_tpot.score(X_test, y_test))
       
The following output is generated.

.. code-block:: text
       
       RapidML, Version: 0.1, Author: Ritabrata Maiti
    
    
              .---.        .-----------
             /     \  __  /    ------
            / /     \(  )/    -----
           //////   ' \/ `   ---
          //// / // :    : ---
         // /   /  /`    '--
        //          //..\\
               ====UU====UU====
                   '//||\\\`
                     ''``
    
       Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
       Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
       Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
       Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
     
       Using RapidML Regressor with arrays, Inputs will not be label encoded.; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com

       Training....
       
       Generation 1 - Current best internal CV score: -11.913707598413463            
       Generation 2 - Current best internal CV score: -11.913707598413463            
       Generation 3 - Current best internal CV score: -11.913707598413463            
       Generation 4 - Current best internal CV score: -11.913707598413463            
       Generation 5 - Current best internal CV score: -11.404014702360742            
                                                                                     
       Best pipeline: GradientBoostingRegressor(input_matrix, alpha=0.75, learning_rate=0.1, loss=huber, max_depth=3, max_features=1.0, min_samples_leaf=5, min_samples_split=4, n_estimators=100, subsample=0.6000000000000001)
       -10.908425630183695
            

As we can see in this example, a score of ``-10.908425630183695`` has been achieved. Note that a separate program run may generate a different model, so the score may fluctuate by a small margin (approximately 1% or so).
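
The score is negative because, assuming TPOT's default regression scoring (negative mean squared error) is in effect, values closer to zero are better. Under that assumption, the short sketch below converts the score into a root-mean-squared error, which is expressed in the units of the target (thousands of dollars for this dataset).

.. code-block:: python

    import math

    # Score printed in the example run above (assumed to be negative MSE).
    neg_mse = -10.908425630183695

    mse = -neg_mse         # flip the sign to recover the mean squared error
    rmse = math.sqrt(mse)  # same units as the predicted house prices

    print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")

If a different scoring function was passed to the underlying TPOT object, this conversion does not apply.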

In the ``RapidML_files`` directory, the model file and the ``API.py`` script have been generated; these can be uploaded to a WSGI server (with ``Flask`` support) to perform cloud predictions.


********************************************************************************
Using RapidML to build a neural network (For recognizing hand-written digits)
********************************************************************************

This example demonstrates the versatility of RapidML through user-defined models (UDMs). We use ``RapidML.rapid_udm_arr`` to supply a neural network classifier (``sklearn.neural_network.MLPClassifier``) as the machine learning model. We use the ``digits`` dataset from ``sklearn.datasets`` and train the neural network on half of the data; the other half is used for testing and visualization. Matplotlib is used to visualize the digit images.

The following are Jupyter Notebook cells and their corresponding output.


.. code:: ipython3

    import RapidML
    from sklearn import datasets
    from sklearn.neural_network import MLPClassifier
    import matplotlib.pyplot as plt

.. code:: ipython3

    digits = datasets.load_digits()
    # The data that we are interested in is made of 8x8 images of digits, let's
    # have a look at the first 4 images, stored in the `images` attribute of the
    # dataset.  If we were working from image files, we could load them using
    # matplotlib.pyplot.imread.  Note that each image must have the same size. For these
    # images, we know which digit they represent: it is given in the 'target' of
    # the dataset.
    images_and_labels = list(zip(digits.images, digits.target))
    for index, (image, label) in enumerate(images_and_labels[:4]):
        plt.subplot(2, 4, index + 1)
        plt.axis('off')
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Training: %i' % label)
    
    # To apply a classifier to this data, we need to flatten each image,
    # turning the data into a (samples, features) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))

.. code:: ipython3

    clf = MLPClassifier(alpha=1)
    mclf = RapidML.rapid_udm_arr(data[:n_samples // 2], digits.target[:n_samples // 2], clf)


.. parsed-literal::

    
    Using RapidML with User Defined Models and Arrays, Inputs will not be label encoded; note that the model provided by the user should be a Scikit_learn model and not a TPOT object.; Experimental, For Issues Contact Author: ritabratamaiti@hiretrex.com
    
    Training....
    
    
.. parsed-literal::

    C:\Users\Ritabrata Maiti\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
      % self.max_iter, ConvergenceWarning)
    

.. code:: ipython3

    expected = digits.target[n_samples // 2:]
    predicted = mclf.model.predict(data[n_samples // 2:])

.. code:: ipython3

    from sklearn import metrics
    print("Classification report for classifier %s:\n%s\n"
          % (mclf.model, metrics.classification_report(expected, predicted)))
    print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))


.. parsed-literal::

    Classification report for classifier MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
           beta_2=0.999, early_stopping=False, epsilon=1e-08,
           hidden_layer_sizes=(100,), learning_rate='constant',
           learning_rate_init=0.001, max_iter=200, momentum=0.9,
           nesterovs_momentum=True, power_t=0.5, random_state=None,
           shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
           verbose=False, warm_start=False):
                 precision    recall  f1-score   support
    
              0       0.99      0.97      0.98        88
              1       0.95      0.92      0.94        91
              2       0.99      0.98      0.98        86
              3       0.96      0.85      0.90        91
              4       0.99      0.89      0.94        92
              5       0.93      0.96      0.94        91
              6       0.91      0.99      0.95        91
              7       0.95      0.99      0.97        89
              8       0.93      0.94      0.94        88
              9       0.86      0.96      0.91        92
    
    avg / total       0.95      0.94      0.94       899
    
    
    Confusion matrix:
    [[85  0  0  0  1  0  2  0  0  0]
     [ 0 84  0  1  0  1  0  0  0  5]
     [ 1  0 84  1  0  0  0  0  0  0]
     [ 0  0  1 77  0  3  0  4  6  0]
     [ 0  0  0  0 82  0  6  0  0  4]
     [ 0  0  0  0  0 87  1  0  0  3]
     [ 0  1  0  0  0  0 90  0  0  0]
     [ 0  0  0  0  0  1  0 88  0  0]
     [ 0  3  0  0  0  0  0  0 83  2]
     [ 0  0  0  1  0  2  0  1  0 88]]
    

.. code:: ipython3

    images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
    for index, (image, prediction) in enumerate(images_and_predictions[:4]):
        plt.subplot(2, 4, index + 5)
        plt.axis('off')
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Prediction: %i' % prediction)
    
    plt.show()


.. image:: output_5_0.png

**Note**: If you wish to use the model as a Flask API, remember to flatten the image, turning the data into a (samples, features) matrix, and then convert it to a URL argument. This method has not been fully tested and is not guaranteed to work. Alternatively, the ``API.py`` file can be modified to, say, accept an image and flatten it within the script itself.
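
As a rough sketch of the flattening step described in the note, the snippet below turns one 8x8 image into a single-row ``(samples, features)`` matrix and encodes it as a URL query string. The base URL and the ``data`` parameter name are hypothetical; check the generated ``API.py`` for the argument format it actually expects.

.. code-block:: python

    from urllib.parse import urlencode

    import numpy as np

    # Stand-in for one 8x8 digit image such as digits.images[0].
    image = np.arange(64, dtype=float).reshape(8, 8)

    # Flatten to shape (1, 64): one sample with 64 features.
    sample = image.reshape(1, -1)

    # Hypothetical endpoint and parameter name, for illustration only.
    base_url = "http://example.com/predict"
    query = urlencode({"data": ",".join(str(value) for value in sample[0])})
    url = f"{base_url}?{query}"

    print(sample.shape)
    print(len(url))

A flattened 8x8 image already produces a fairly long URL, which is one reason to prefer accepting the image in the request body and flattening it inside the script.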