Project 2 — Predicting Car Price

Abdullah Doğan
İstanbul Data Science Academy
5 min read · Aug 5, 2021


1. Web Scraping

This is my first project at Istanbul Data Science Academy, and it is about web scraping. I needed several Python libraries to complete this task: “BeautifulSoup”, “requests”, and Selenium. In this project, I built a price estimator for second-hand cars using data from used car adverts on a car sales website.

On this website every vehicle has its own URL, so I had to scrape those URLs first. The homepage lists the vehicle URLs, so I looped over them to collect the links.

After scraping the URLs, I scraped the car’s features from each page and collected them in a list.
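The two-step flow above (collect the advert URLs, then read the features on each advert page) might look roughly like the sketch below. The HTML snippets, the `ad-link`, `name` and `value` class names, and the `example.com` base URL are all invented for illustration; the real site's markup will differ, and the pages would be fetched with `requests` or Selenium rather than read from strings.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: the real site's tags and class names will differ.
LISTING_HTML = """
<ul>
  <li><a class="ad-link" href="/ad/101">BMW 320i</a></li>
  <li><a class="ad-link" href="/ad/102">Fiat Egea</a></li>
</ul>
"""
AD_HTML = """
<table>
  <tr><td class="name">Make</td><td class="value">BMW</td></tr>
  <tr><td class="name">Price</td><td class="value">250.000 TL</td></tr>
</table>
"""


def extract_ad_urls(listing_html, base_url):
    """Return the absolute URL of every advert linked from a listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select("a.ad-link")]


def extract_features(ad_html):
    """Return the name/value pairs found on one advert page as a dict."""
    soup = BeautifulSoup(ad_html, "html.parser")
    features = {}
    for row in soup.select("tr"):
        name = row.select_one("td.name")
        value = row.select_one("td.value")
        if name is not None and value is not None:
            features[name.get_text(strip=True)] = value.get_text(strip=True)
    return features


urls = extract_ad_urls(LISTING_HTML, "https://example.com")

# In a real run, each URL in `urls` would be downloaded (for example with
# requests.get(url).text) and parsed; here one sample page stands in for that.
cars = [extract_features(AD_HTML)]
```

Keeping the parsing in small functions that take HTML text makes it possible to check them on saved pages before running the full crawl.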

This process yielded data for approximately 45625 cars. I built my dataset by extracting the attributes of each car from the distinguishing fields on its advert page. The dataset has ten different features.

2. Data Cleaning

The data cleaning in this project consisted of the following steps:

  • Replacing some characters (like TL, HP, km)
  • Splitting columns
  • Deleting null columns
  • Deleting some columns
  • Setting index price columns
  • Converting all numeric columns to integer
  • Translating data to English

After removing the missing values, there are 44821 data points left.
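A few of the steps above (dropping missing values, stripping unit suffixes, converting to integers) can be sketched with pandas. The rows, the column names `price`, `power` and `mileage`, and the use of "." as a thousands separator are assumptions made for this example, not details taken from the project.

```python
import pandas as pd

# Invented sample rows; the real columns and formats may differ.
raw = pd.DataFrame({
    "price": ["250.000 TL", "95.500 TL", None],
    "power": ["170 HP", "95 HP", "110 HP"],
    "mileage": ["120.000 km", "60.500 km", "80.000 km"],
})


def to_int(series, unit):
    """Strip a unit suffix and '.' thousands separators, then cast to int."""
    cleaned = (
        series.str.replace(unit, "", regex=False)
        .str.replace(".", "", regex=False)
        .str.strip()
    )
    return cleaned.astype(int)


# Drop rows with missing values first so the integer cast cannot fail on them.
clean = raw.dropna().copy()
clean["price"] = to_int(clean["price"], "TL")
clean["power"] = to_int(clean["power"], "HP")
clean["mileage"] = to_int(clean["mileage"], "km")
```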

3. Exploratory Data Analysis

Let’s take a deeper look at the data we’ve gathered.

This is my price column. Extreme prices could distort my model, so I deleted the rows with price > 500000.

After this elimination, the price data appears to be normal.
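The price cut-off can be expressed as a single pandas filter. This is a minimal sketch with invented rows; the column names `make` and `price` are assumptions.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "make": ["Ferrari", "BMW", "Fiat"],
    "price": [2_500_000, 180_000, 95_000],
})

PRICE_CAP = 500_000  # the threshold mentioned in the text

# Keep only the rows at or below the cap.
filtered = df[df["price"] <= PRICE_CAP]

# Average price per make, most expensive first.
mean_price_by_make = (
    filtered.groupby("make")["price"].mean().sort_values(ascending=False)
)
```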

  • This is the initial result, before the elimination. The most expensive cars are clearly Ferrari, Rolls Royce, and Lamborghini.
  • After eliminating the most expensive vehicles (price > 500000), we can observe that Infiniti, Porsche, and BMW are the most expensive, showing that the vehicle’s brand has an impact on its price.
  • Cars marketed by their owners are listed at lower prices.

The majority of vehicles developed in recent years have automatic or semi-automatic transmissions.

The engines in Cadillac, Chrysler, and Mercedes-Benz are larger.

The most expensive cars are in Sinop, Eskişehir, and Düzce, while the cheapest cars are in Kars, Çankırı, and Kütahya. I didn’t have any information on Bayburt, Tunceli, or Karaman.

4. Model

In this project I used regression models.

To begin, I split my data frame into X and y for a baseline model. X includes all numeric columns (4 features) and y is my target column (Price). I then created an OLS model.

According to my baseline model, the R-squared value is 0.671, which is not good, and the model has multicollinearity problems (the Cond. No. is very high).
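To make the two diagnostics concrete, here is a small sketch that fits ordinary least squares on synthetic data and computes R-squared and the condition number of the design matrix by hand. It uses plain NumPy rather than the library that produced the summary quoted in this post, and the four feature names and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for four numeric features (hypothetical names).
year = rng.integers(2000, 2021, n).astype(float)
mileage = rng.uniform(0, 300_000, n)
power = rng.uniform(60, 300, n)
engine = rng.uniform(1000, 3000, n)
price = (
    5000 * (year - 2000) - 0.2 * mileage + 400 * power
    + rng.normal(0, 10_000, n)
)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), year, mileage, power, engine])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
pred = X @ beta

# R-squared: share of the variance in price explained by the fit.
ss_res = np.sum((price - pred) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Condition number of the design matrix; large values indicate
# near-collinear or badly scaled columns.
cond_no = np.linalg.cond(X)
```

In this synthetic example the condition number is large mainly because the columns live on very different scales, so a high value is a warning to investigate rather than proof that two features carry the same information.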

I needed to increase the R-squared value. As a first step, I took the logarithm of the target column. As a second step, I transformed the string columns (like make and transmission) into dummy variables, because, as we can see in the EDA, the price is affected by variables such as gear type, brand, and who sells the vehicle.
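Both transformations are short in pandas and NumPy. The sketch below uses invented rows and assumed column names. Passing `drop_first=True` is my own choice to avoid perfectly collinear dummy columns; the post does not say whether it was used.

```python
import numpy as np
import pandas as pd

# Invented rows; column names are assumptions.
df = pd.DataFrame({
    "price": [250_000, 95_000, 180_000],
    "make": ["BMW", "Fiat", "BMW"],
    "transmission": ["Automatic", "Manual", "Manual"],
    "year": [2018, 2012, 2015],
})

# Log-transform the target; predictions are mapped back with np.exp.
y = np.log(df["price"])

# One-hot encode the string columns, dropping one level per column.
X = pd.get_dummies(
    df.drop(columns="price"),
    columns=["make", "transmission"],
    drop_first=True,
)
```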

Then I created a model with the selected features and checked some parameters. The OLS regression results look great: with more features (63) the model explains the data better, and the R-squared value is high (0.933). The Cond. No. (1.00e+16) shows that there is still multicollinearity; however, this score is quite low when compared to the previous model.

Next, I have to split my data into train, validation, and test sets, run the model again, and finally compare the Ridge, Lasso, and Polynomial regression results.

The best result comes from Polynomial regression, and Linear Regression is second. Now I am going to run cross validation.
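A comparison of this kind could be set up with scikit-learn as sketched below. The data is synthetic, and the 60/20/20 split, the `alpha` values, and the polynomial degree are assumptions chosen for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a mild non-linearity in the third feature.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(500, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.1, 500)

# 60% train, 20% validation, 20% test (ratios are an assumption).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# Fit on the training set and score (R-squared) on the validation set.
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in models.items()
}

# Only the model chosen on validation data is scored on the test set.
best_name = max(val_scores, key=val_scores.get)
test_score = models[best_name].score(X_test, y_test)
```

Choosing the model on the validation set and touching the test set only once keeps the final score an honest estimate.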

The best results after cross validation are Linear Regression and Polynomial Regression.
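For reference, k-fold cross validation of two such models might be run as follows with scikit-learn. The data is synthetic, and the five folds and the degree-2 polynomial are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data standing in for the car dataset.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(300, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.1, 300)

# Shuffled 5-fold split with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# One R-squared score per fold for each model, then the mean per model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in candidates.items()
}
mean_scores = {name: scores.mean() for name, scores in cv_scores.items()}
```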

5. Conclusion

I compared the actual values with the predicted values.

GitHub: https://github.com/doganapo/Project-2--Web_scraping
