This is my third project at Istanbul Data Science Academy. The project was about classification.

Customer Satisfied or Not?

In this project I used data from Kaggle. The data set belongs to an e-commerce company in Brazil and includes almost every table (orders, reviews, products, etc.). Using the customers' review scores, I tried to determine whether a customer is satisfied or not. I first loaded the data into a database, then connected to the database from a Jupyter notebook and pulled the data into Python. I finished the data visualization part by preparing an interactive dashboard in Tableau. I tried to improve my model with feature engineering, and I tried to choose the best model by comparing about 10 classification algorithms. I then searched for the best model parameters with model tuning and tested the success of my model. Here is the GitHub repo, and below I explain everything step by step.

First step: Loading data into the database

In this project I used PostgreSQL. First you need to create a database, and then load the data into it. I used the PostgreSQL shell (psql) to create the database and the tables.

CREATE DATABASE ecommerce;
\connect ecommerce
CREATE TABLE name_freq(
    seller_id TEXT,
    seller_zip_code_prefix INT,
    seller_city TEXT,  -- city names are text, not integers
    seller_state TEXT
);

Second step: Connecting to the database from Jupyter Notebook

There are two common ways to connect to the database from a notebook: psycopg2 and SQLAlchemy. In this project I used psycopg2.

from psycopg2 import connect
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

# Connection settings; replace the masked values with your own
params = {
    'host': '***',
    'user': '***',
    'port': 5432,  # default PostgreSQL port
    'password': '*****'
}
connection = connect(**params, dbname='****')
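Once the connection exists, pulling a table into Python is a matter of running a query through a cursor. The sketch below is only an illustration of that pattern, not the code from my notebook: fetch_rows is a hypothetical helper, and the demo uses an in-memory SQLite database with a made-up sellers table as a stand-in, because both sqlite3 and psycopg2 follow the same DB-API cursor interface.

```python
import sqlite3


def fetch_rows(connection, query, params=None):
    """Run a query on a DB-API 2.0 connection and return the rows as dicts."""
    cursor = connection.cursor()
    try:
        if params is None:
            cursor.execute(query)
        else:
            cursor.execute(query, params)
        # cursor.description holds one entry per column; the name comes first
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in demo: in-memory SQLite instead of the PostgreSQL database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sellers (seller_id TEXT, seller_state TEXT)")
demo.execute("INSERT INTO sellers VALUES ('a1', 'SP'), ('b2', 'RJ')")
rows = fetch_rows(
    demo, "SELECT seller_id, seller_state FROM sellers ORDER BY seller_id"
)
```

With the real database, the same helper would be called with the psycopg2 connection instead of demo.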

Third step: EDA (exploratory data analysis)

This is my first project using Tableau. I didn’t expect to love Tableau this much, but I couldn’t stop myself from making an interactive dashboard.

As we can see on the dashboard, many of the customers are from São Paulo state. Filtering by year shows that most of the shopping was done in 2018. Credit cards were the most used payment method every year. The most purchased category was furniture_decor in 2016, bed_bath_table in 2017 and health_beauty in 2018. Customers' review scores for each year are generally around 90%.

Fourth step: Feature engineering

Feature engineering is an unavoidable step for a data scientist who wants a better model. Depending on the situation, combining, multiplying or dividing two or more columns can open the door to a better model.

We all know that sometimes we run into problems when we reach the payment step while shopping on a website. One of these problems is that the website takes a long time to approve the payment after we make it. Thinking that this might affect customer satisfaction, I created a column called different_between_purcase. After the column was created, I realized that it contained durations of up to 1650 hours. I grouped these into 8 groups as follows:

less than 1 hour = 1

1 to under 5 hours = 2

5 to 10 hours = 3, etc.
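A grouping like this can be sketched with a sorted list of bin edges. This is a minimal illustration, not the code I used: only the edges 1, 5 and 10 come from the groups listed here, the remaining edges are placeholders I made up to reach 8 groups, and I am assuming each bin includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right

# Bin edges in hours. Only 1, 5 and 10 come from the article;
# 24, 48, 72 and 168 are placeholder values chosen to give 8 groups.
EDGES_HOURS = [1, 5, 10, 24, 48, 72, 168]


def approval_delay_group(hours, edges=EDGES_HOURS):
    """Map a delay in hours to a group number starting at 1.

    bisect_right counts how many edges are <= hours, so each bin
    includes its lower edge and excludes its upper edge (an assumption).
    """
    return bisect_right(edges, hours) + 1


groups = [approval_delay_group(h) for h in (0.2, 1, 4.9, 5, 1650)]
```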

Next, I tried to find out whether the cargo arrived on time. I have columns for when the order was handed to the courier, when the order reached the customer, and when the order was estimated to be delivered. Using these columns, I created a column showing whether the order reached the customer on time. If the cargo arrived on time I assigned 0, and if not, 1.
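The on-time flag can be sketched as a simple date comparison. This is an illustration under assumptions: late_delivery_flag is a hypothetical helper, the example timestamps are made up, and I am assuming that a delivery on the estimated day itself still counts as on time.

```python
from datetime import datetime


def late_delivery_flag(delivered_at, estimated_at):
    """Return 0 if the order arrived on time and 1 if it arrived late.

    Only the calendar dates are compared, so a delivery at any hour of
    the estimated day is treated as on time (an assumption).
    """
    return int(delivered_at.date() > estimated_at.date())


# Made-up example timestamps
on_time = late_delivery_flag(
    datetime(2018, 3, 10, 18, 45), datetime(2018, 3, 12)
)
late = late_delivery_flag(
    datetime(2018, 3, 15, 9, 5), datetime(2018, 3, 12)
)
```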

I took the column showing the exact date and time when the order was delivered and created a separate column from the time part. I then grouped deliveries by the part of the day in which the cargo arrived (morning, evening, etc.).

Again using the same date column, I found the season in which the cargo was delivered and grouped by it.
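Both groupings can be derived from a single timestamp. The sketch below is only one possible version: the hour boundaries for each part of the day are my own assumption, and the seasons follow the Southern Hemisphere calendar by month because the data is from Brazil, which may differ from the grouping in the notebook.

```python
from datetime import datetime


def part_of_day(ts):
    """Group the delivery hour into four parts of the day (assumed cut-offs)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def season_southern(ts):
    """Season by month for the Southern Hemisphere (assumed convention)."""
    month = ts.month
    if month in (12, 1, 2):
        return "summer"
    if month in (3, 4, 5):
        return "autumn"
    if month in (6, 7, 8):
        return "winter"
    return "spring"


# Made-up delivery timestamps
first = datetime(2018, 1, 15, 9, 30)
second = datetime(2017, 7, 2, 21, 0)
features = [(part_of_day(t), season_southern(t)) for t in (first, second)]
```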

Sometimes the seller can affect the review score we give. Since the dataset did not give me scores for the sellers, I found the total number of products sold by each seller and grouped it, as seen in the code above (x <= 50 = 1, x > 50 and x <= 100 = 2, etc.).

Finally, I grouped the review_score column, which is my target column, into two classes: those who gave 4 or 5 stars in one group and everyone else in the other.
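The seller grouping and the target column can be sketched in a few lines. This is an illustration, not the original code: the edge 500 is a placeholder I made up (only 50 and 100 are given), the seller ids are invented, and labeling satisfied customers as 1 rather than 0 is my own assumption.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges for items sold; 50 and 100 are from the article, 500 is a placeholder.
SELLER_EDGES = [50, 100, 500]


def seller_volume_group(items_sold, edges=SELLER_EDGES):
    """x <= 50 -> 1, 50 < x <= 100 -> 2, and so on (upper edge inclusive)."""
    return bisect_left(edges, items_sold) + 1


def is_satisfied(review_score):
    """4 and 5 stars -> 1 (satisfied), everything else -> 0 (assumed labels)."""
    return int(review_score >= 4)


# One seller id per sold item (made-up data)
order_items = ["s1"] * 50 + ["s2"] * 51 + ["s3"] * 3
sold_per_seller = Counter(order_items)
seller_groups = {s: seller_volume_group(n) for s, n in sold_per_seller.items()}
targets = [is_satisfied(score) for score in (1, 2, 3, 4, 5)]
```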

Fifth step: Machine learning

My main goal in this project was to use classification algorithms and make comparisons. I compared about 10 algorithms and tried to determine the best algorithm for the dataset.

As can be seen in the table above, the train and test accuracy scores of the models are very good. But is this what we need? As seen in the EDA section, the distribution of our target column, review_score, is quite imbalanced, so looking only at accuracy would be misleading. We need to look more at the precision, recall and F1 scores. But as can be seen in the table, all of these scores are quite low. I chose the random forest algorithm, which I thought had a better score on average than the others, and decided to move forward.

When we examine the random forest model's confusion matrix, we see that two-thirds of the more than 5 thousand not-satisfied customers are predicted incorrectly.
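To see why accuracy alone can mislead on an imbalanced target, it helps to compute the scores from confusion-matrix counts. The numbers below are made up for illustration and are not my model's results; the not-satisfied class is treated as the positive class.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 1000 reviews, 150 of them not satisfied (positive class).
# The model finds only 50 of the 150 and raises 20 false alarms.
scores = classification_scores(tp=50, fp=20, fn=100, tn=830)
```

In this made-up case the accuracy is 0.88 even though only one third of the not-satisfied customers are found, which is the kind of gap that precision, recall and F1 reveal.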

There are two options for such imbalanced data sets: undersampling and oversampling. In both methods, the basic principle is to balance the target column. I preferred oversampling in this project.

After oversampling, our precision, recall and F1 scores are better; not the best, but good.

When we examine the confusion matrix, we can say that we now predicted about two-thirds of the not-satisfied customers correctly. But it should not be overlooked that our predictions for satisfied customers got worse. This is one of the drawbacks of oversampling and undersampling. The business should make the decision here. Since I wanted to catch the dissatisfied customers, I decided to continue on this path.

When we examine the feature importance table, we see the effect of the columns I created, delivery and seller_count, on the model. On the other hand, we also see columns that do not affect the model. These columns should be removed when fitting the model.
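The simplest form of oversampling is to duplicate randomly chosen rows of the smaller class until the classes are the same size. The sketch below shows that idea only; it is plain random oversampling written with the standard library, which is an assumption on my part and may not be the exact method or library used in the notebook. It should be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class has equal size."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target_size = max(len(group) for group in by_label.values())

    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra rows with replacement from the same class
        extra = [rng.choice(group) for _ in range(target_size - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up training data: 8 satisfied (1) and 2 not satisfied (0)
train_rows = [[i] for i in range(10)]
train_labels = [1] * 8 + [0] * 2
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
class_counts = Counter(balanced_labels)
```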

Sixth step: Model tuning

Model tuning is one of the most important steps, because you need to learn which model parameters work best for you before production. There are several ways to do it. One is GridSearchCV, which can sometimes take days; another is RandomizedSearchCV, which took almost 3 or 4 hours, so you can watch Lord of the Rings :). I used RandomizedSearchCV.

'n_estimators': 1146,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'auto',
'max_depth': None,
'bootstrap': False

These are the best parameters found for my model.
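A randomized search of this kind can be sketched with scikit-learn as follows. This is a small illustration and not my actual search: the data is synthetic, the candidate values are placeholders I chose so that it runs in seconds, and scoring by F1 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the real training data
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Candidate values to sample from (placeholders, much smaller than a real search)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of random parameter combinations to try
    scoring="f1",    # assumed metric, chosen because the target is imbalanced
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Unlike GridSearchCV, which tries every combination, RandomizedSearchCV samples only n_iter combinations, which is why it finishes so much sooner.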

Confusion matrix after using model tuning

Final step: Deployment

In the final step of the project, I went into production with the final version of my machine learning model. You can check it out on Streamlit.

For any suggestions and questions: dgnapo@hotmail.com

See you next time. Until then, happy analyzing.

Data Scientist Jr.

Abdullah Doğan