Verified by Star Health Data Science Team
Introduction to Machine Learning:
We live in a world of massive amounts of raw data, which we wish to transform into a form that yields useful insights and makes our lives easier. Machine Learning is one of the most exciting technologies in today’s world: it provides the ability to learn from data and imitate human intelligence to solve problems.
Two types of machine learning algorithms are used extensively: supervised and unsupervised. As the name suggests, supervised machine learning models need supervision, in the form of labelled data, to train the model, whereas unsupervised models do not need any supervision. Unsupervised learning is a method for finding hidden patterns and useful insights in unlabelled input data. Clustering is one example of unsupervised learning.
Supervised machine learning trains a model so that it can predict the output for new data. Regression and classification are the two types of supervised machine learning algorithms. Classification is the task of predicting discrete class labels, whereas regression is used when the output is continuous.
Health insurance service providers must ensure that every customer is covered under the policy and gets the maximum benefit. It is very important for them to retain existing customers along with onboarding new customers year on year. If policyholders do not renew their policies before the due date, the policies enter a grace period during which the benefits cannot be availed. Hence customers need to be informed well in advance to renew the policy on or before the due date.
The common practices for notifying customers are calling them directly, sending reminder messages through SMS, WhatsApp and other channels, and doing multiple follow-ups. However, the volume is so huge that it is next to impossible to reach every customer before the policy expiry date so that the renewal happens on time. This is where Data Science comes in: it leads to the idea of building a predictive model that gives us the set of customers who are not going to renew their policies before the expiry date. As a result, it helps us prioritize the calling base and makes the calling activity more effective.
Since this is a classification task with a binary outcome (1 or 0), the following classification algorithms can be used to solve the problem:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- XGBoost Classifier
- Gradient Boosting Classifier
Logistic Regression is a classification algorithm generally used for binary classification, though it can also be extended to multiclass classification. It applies the sigmoid function to a linear combination of the features, which maps the output to a value between 0 and 1. This value is treated as the probability of the positive class, and the class label is predicted by comparing it with a threshold. It performs well when the classes can be separated reasonably well by a linear boundary.
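To make the sigmoid step concrete, here is a minimal sketch in plain Python. The feature names, weights and bias are made-up values for illustration only; they are not taken from any trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 class label using a threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical features: [scaled days to expiry, number of past late renewals]
weights = [-0.8, 1.2]   # assumed values, not learned from data
bias = -0.5             # assumed value
customer = [0.5, 2.0]

probability = predict_proba(customer, weights, bias)
label = predict_label(customer, weights, bias)
```

In a real project the weights and bias would be learned from labelled data, and the 0.5 threshold could be tuned to the business need.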
Decision Tree Classifier:
Decision trees are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
A decision tree is a binary tree-based structure that chooses the most significant feature to split on at each node, based on metrics such as entropy, information gain and Gini impurity. Decision trees are simple to understand and interpret, and they can be visualized or plotted easily.
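The split metrics mentioned above can be computed in a few lines. The following is a rough sketch in plain Python; the labels and the candidate split are invented for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 3 lapsed (1) and 5 renewed (0) policies,
# and one candidate split of that node into two children.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [0, 0, 0, 0]

parent_gini = gini(parent)
gain = information_gain(parent, left, right)
```

A tree-building algorithm would evaluate many candidate splits in this way and keep the one with the highest gain (or the lowest impurity).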
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Random forest is a supervised learning algorithm that can be used for both classification and regression. It is flexible and easy to use. A random forest is an ensemble of trees: multiple decision trees are trained independently, so they can run in parallel, each on a bootstrap sample of the data, and their predictions are combined. This technique is called bootstrap aggregation (bagging). Random forest has a variety of applications, such as recommendation engines, image classification and feature selection.
Random forest is considered a highly accurate and robust method because of the number of decision trees participating in the process.
Random forest is slow in generating predictions because it has multiple decision trees. Whenever it makes a prediction, all the trees in the forest have to make a prediction for the same given input and then perform voting on it. This whole process is time-consuming.
The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.
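The two ingredients described above, bootstrap sampling and voting, can be sketched as follows. This is only an illustration of the idea, not a full random forest: the tree votes are hypothetical values rather than outputs of trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree trains on one such sample."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)        # fixed seed so the sketch is repeatable
rows = list(range(10))         # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions of five trees for one customer (1 = will lapse)
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
```

Because every tree has to vote before the forest can answer, prediction time grows with the number of trees, which is the slowness noted above.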
Gradient Boosting is an ensemble technique in which new models are added sequentially to correct the errors made by the existing models. Instead of reweighting the observations, Gradient Boosting fits each new model to the gradient of the loss function, so that every addition moves the ensemble towards a lower loss. It is prone to overfitting, but this can be tackled using L1 and L2 regularization. It is often more accurate than other models. Modern implementations train fast, especially on larger datasets; most of them support categorical features, and some handle missing values natively.
XGBoost (extreme gradient boosting) is built on top of gradient boosting. It is also an ensemble technique, and it handles missing values efficiently. Its computational speed is very high because of parallelism and distributed computing. Training can still be slow, however, because it needs to create one tree at a time, each correcting the errors of all previous trees.
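The boosting idea, where each new model corrects the errors of the ones before it, can be illustrated with a toy example. The sketch below assumes a squared-error loss and one-split trees (stumps) on a single feature to keep it short; it is not how XGBoost or any other library is implemented, and the data is invented.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
```

The loop is inherently sequential, since each stump needs the residuals left by the previous ones. That is why training proceeds one tree at a time even when the work inside each tree is parallelized.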
Model Performance Check:
The real challenge comes while evaluating which model is performing well on the data set and whether it is able to meet the stakeholders’ expectations. Below are some of the metrics used to evaluate models’ efficacy in a typical classification problem.
Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.
A confusion matrix is a table used to describe the performance of a classification model on test data for which the true values are known. For a binary problem, it counts the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
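As a small illustration, the confusion matrix counts and accuracy for a binary problem can be computed like this. The true and predicted labels below are invented for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }

def accuracy(counts):
    """Correct predictions (TP + TN) divided by all predictions."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Hypothetical test set: 1 = policy lapsed, 0 = renewed on time
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(counts)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75.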
Precision explains how many of the cases predicted as positive actually turned out to be positive. Precision is useful in cases where False Positives (FP) are a higher concern than False Negatives (FN). Precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and harm the business.
Recall explains how many of the actual positive cases we were able to predict correctly with our model. It is a useful metric in cases where False Negatives are a higher concern than False Positives. It is important in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive cases should not go undetected!
The F1 score gives a combined idea about the Precision and Recall metrics: it is their harmonic mean. For a given average of the two, it is highest when Precision is equal to Recall, because the harmonic mean punishes extreme values more. The F1 score could be an effective evaluation metric when FP and FN are equally costly.
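The three metrics follow directly from the TP, FP and FN counts. The sketch below uses made-up counts purely for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts from a test set
tp, fp, fn = 30, 10, 20

p = precision(tp, fp)   # 30 / 40
r = recall(tp, fn)      # 30 / 50
f1 = f1_score(p, r)

# The harmonic mean punishes imbalance: both pairs below have the same
# arithmetic mean (0.55), but the lopsided pair gets a much lower F1.
balanced_f1 = f1_score(0.55, 0.55)
lopsided_f1 = f1_score(1.0, 0.1)
```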
We can segregate the model predictions into 3 major buckets as follows:
- Customers who need reminder calls and follow ups
- Customers who need only reminders through SMS or WhatsApp or Emails
- Customers who do not need any reminders
Customers who are highly likely not to renew their policy on time can be reached early and handled effectively.
It will also help us in identifying the customers who will renew on time so that we need not send them multiple reminders or follow ups.
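One simple way to form the three buckets is to apply cut-offs to the model's predicted probability of non-renewal. The sketch below is only an illustration: the threshold values 0.7 and 0.3, the bucket names and the customer IDs are all assumptions, and in practice the cut-offs would be chosen from the data and the capacity of the calling team.

```python
def assign_bucket(lapse_probability, high=0.7, low=0.3):
    """Map a predicted probability of non-renewal to an outreach bucket.

    The `high` and `low` cut-offs are assumed values for illustration.
    """
    if lapse_probability >= high:
        return "call_and_follow_up"
    if lapse_probability >= low:
        return "message_reminder"
    return "no_reminder"

# Hypothetical model scores for three customers
scores = {"C001": 0.85, "C002": 0.50, "C003": 0.10}
buckets = {customer: assign_bucket(prob) for customer, prob in scores.items()}
```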
Benefits of Lapsation Model:
- Efficient use of resources
- Improves Retention Rate
- Upselling and Cross-Selling
- Building a strong perception of the organization among the public
- Improves Customer Acquisition Rate and Reduces Customer Churn Rate
By renewing your policy on time, you make sure that your risk stays covered and save yourself from a claim situation during the grace period. It is equally important for keeping the continuity benefits: if the insured has any pre-existing diseases, the waiting period already served will be lost if he/she fails to renew the policy. In that case, he/she has to buy a new policy and go through the entire waiting period all over again.
DISCLAIMER: The information contained in this blog is provided based on the author’s views, opinion and understanding.