Final Project: Customer Churn Prediction
Goal
In this project, I aimed to predict customer churn for a telecommunications company using a dataset from Kaggle. My goal was to understand why customers leave and to create a machine learning model that can predict which customers are at risk of churning. This project showcases how machine learning can address real-world challenges.
Original Kaggle Competition: Link
Dataset
Feature | Description
---|---
customerID | Customer ID
gender | Male/Female
Step 1: Importing Libraries
The first step in this project is to import the necessary libraries. These include the scikit-learn machine learning library, along with pandas for handling data and NumPy for basic math operations.
As the dataset has a high number of categorical features (features with categories rather than numbers), I will be using the CatBoost algorithm, which handles such features natively, for my machine learning prediction.
# Data handling and numerical operations
import numpy as np
import pandas as pd
# scikit-learn utilities for splitting the data and scoring the model
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.metrics import (
    accuracy_score, classification_report, recall_score, confusion_matrix,
    roc_auc_score, precision_score, f1_score, roc_curve, auc
)
from sklearn.preprocessing import OrdinalEncoder
# CatBoost: gradient boosting with native categorical-feature support
from catboost import CatBoostClassifier, Pool
Step 2: Loading and Preprocessing Data
Next, I will read in the dataset, which is currently stored as a CSV file in my main projects folder.
Data pre-processing is one of the most important aspects of a machine learning project, if not the most important. In reality, most datasets are not clean and ready for modelling straight away, and therefore have to go through multiple rounds of analysis, pre-processing, and cleaning before they are ready for use.
data_path = "churn_data.csv"
df = pd.read_csv(data_path)
# Convert TotalCharges to numeric; non-numeric entries (blank strings) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Impute the missing totals as tenure multiplied by the monthly charge
df['TotalCharges'] = df['TotalCharges'].fillna(df['tenure'] * df['MonthlyCharges'])
# Convert SeniorCitizen to object
df['SeniorCitizen'] = df['SeniorCitizen'].astype(object)
# Replace 'No phone service' and 'No internet service' with 'No' for certain columns
df['MultipleLines'] = df['MultipleLines'].replace('No phone service', 'No')
columns_to_replace = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for column in columns_to_replace:
df[column] = df[column].replace('No internet service', 'No')
# Convert 'Churn' categorical variable to numeric
df['Churn'] = df['Churn'].replace({'No': 0, 'Yes': 1})
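Before splitting the data, it is worth confirming that the cleaning worked. Here is a quick sanity check I find useful (a sketch, assuming the standard Telco churn column layout):
# No missing values should remain after the TotalCharges imputation
assert df['TotalCharges'].isna().sum() == 0
# The target should now be numeric, and the churn classes are imbalanced
print(df['Churn'].value_counts(normalize=True))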
Step 3: Creating Training and Testing Datasets
In machine learning, it is common practice to split your dataset into a training set and a testing set. The training set, as the name implies, is used to fit the model. The testing set is then used to evaluate how well the model performs on data it has not seen. Because churners are a minority of customers, I use a stratified split so that both sets preserve the original churn proportion.
# Create the StratifiedShuffleSplit object
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=64)
train_index, test_index = next(strat_split.split(df, df["Churn"]))
# Create train and test sets
strat_train_set = df.loc[train_index]
strat_test_set = df.loc[test_index]
X_train = strat_train_set.drop("Churn", axis=1)
y_train = strat_train_set["Churn"].copy()
X_test = strat_test_set.drop("Churn", axis=1)
y_test = strat_test_set["Churn"].copy()
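As a quick check (illustrative, not part of the original pipeline), the churn rate in both splits should be almost identical if the stratification worked:
# Both splits should show roughly the same proportion of churners
print(f"Train churn rate: {y_train.mean():.4f}")
print(f"Test churn rate:  {y_test.mean():.4f}")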
Step 4: Making the Model
Now that we have our datasets, we can create the model using the training set. We first need to tell the CatBoost algorithm which features are categorical so that it can encode them internally. For a detailed explanation of encoding, see the Encoding section.
# Identify categorical columns (every remaining object-dtype column, including customerID)
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
# Initialize and fit CatBoostClassifier;
# scale_pos_weight=3 up-weights the minority churn class to counter the class imbalance
cat_model = CatBoostClassifier(verbose=False, random_state=0, scale_pos_weight=3)
cat_model.fit(X_train, y_train, cat_features=categorical_columns, eval_set=(X_test, y_test))
# Predict on the test set
y_pred = cat_model.predict(X_test)
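Before evaluating, it can also be useful to see which features the model relies on. A minimal sketch using CatBoost's built-in importance scores and the Pool wrapper imported earlier (the top-5 cut-off is arbitrary, and the exact scores will vary between runs):
# Wrap the test data in a Pool so CatBoost knows the categorical columns
test_pool = Pool(X_test, y_test, cat_features=categorical_columns)
# Print the five most important features by CatBoost's default importance measure
importances = cat_model.get_feature_importance(test_pool)
for name, score in sorted(zip(X_test.columns, importances), key=lambda t: -t[1])[:5]:
    print(f"{name}: {score:.2f}")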
Step 5: Evaluating the Model
The last step in this process is to evaluate how well the model performs. The scikit-learn library provides functions that calculate standard classification metrics from the predictions. An accuracy of 0.7764 was achieved with this model.
# Calculate evaluation metrics on the held-out test set
accuracy = round(accuracy_score(y_test, y_pred), 4)
recall = round(recall_score(y_test, y_pred), 4)
roc_auc = round(roc_auc_score(y_test, y_pred), 4)
precision = round(precision_score(y_test, y_pred), 4)
# Print results
print(f"Accuracy: {accuracy}, Recall: {recall}, ROC AUC: {roc_auc}, Precision: {precision}")
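Beyond these single numbers, the classification_report and confusion_matrix helpers imported earlier give a fuller per-class picture (a quick sketch; the class names are my own labels for 0 and 1):
# Per-class precision, recall and F1, plus the raw confusion matrix
print(classification_report(y_test, y_pred, target_names=['No churn', 'Churn']))
print(confusion_matrix(y_test, y_pred))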
Yay, we are done! We can now save this model for use in any future evaluations of churn. For example, it could be used to identify customers at risk of deactivating their accounts, and specific marketing strategies could then be tailored towards these individuals so the business does not lose its existing customer base.
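Saving and reloading the model takes one line each with CatBoost's native serialisation. A minimal sketch (the filename churn_model.cbm is just an example):
# Persist the trained model to disk in CatBoost's native format
cat_model.save_model("churn_model.cbm")
# Later, e.g. in a scoring service, restore it and predict as before
loaded_model = CatBoostClassifier()
loaded_model.load_model("churn_model.cbm")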