Recently, a group of Data Science enthusiasts that I lead discussed the Kaggle competition Bike Sharing Demand. I presented a possible solution, and during the discussion several questions came up that I could not easily answer. Under the cut I state those questions; I will try to answer them in follow-up articles.
Choosing Algorithm
Before tuning an algorithm, I tried several of them in Azure Machine Learning Studio. Below are the evaluation results for each algorithm on the UCI data set, with 70% of the data (a 0.7 split) used for training.
Algorithm | Parameters | MAE | RMSE | R2 |
---|---|---|---|---|
Linear Regression | Default | 106.49 | 144.92 | 0.36 |
Boosted Decision Tree Regression | Default | 25.67 | 41.85 | 0.947 |
Boosted Decision Tree Regression | Learning rate changed to 0.1 (was 0.2) | 27.36 | 44.18 | 0.940 |
Boosted Decision Tree Regression | Total number of trees changed to 200 (was 100) | 24.29 | 39.94 | 0.951 |
Decision Forest Regression | Default | 27.86 | 46.71 | 0.934 |
Decision Forest Regression | Number of decision trees changed to 16 (was 8) | 26.57 | 44.99 | 0.939 |
Decision Forest Regression | Number of decision trees changed to 32 (was 8) | 26.02 | 44.05 | 0.941 |
The results show that linear regression loses heavily to both the boosted decision tree and the decision forest. Comparing those two, the latter performs slightly worse in these runs. At the same time, increasing the number of trees improves both. In the end I chose Boosted Decision Tree Regression with 200 trees.
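
For readers who want to reproduce the three columns outside Azure Machine Learning Studio, here is a minimal sketch of MAE, RMSE and R2, assuming their standard definitions (R2 as the coefficient of determination). The function name `regression_metrics` and the sample numbers are made up for illustration; they are not taken from the experiment.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average size of the miss, in rentals.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: penalizes large misses more strongly.
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Coefficient of determination: 1 minus the ratio of the residual
    # sum of squares to the total sum of squares around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return mae, rmse, r2


# Made-up hourly rental counts and predictions, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

An R2 close to 1 means the model explains most of the variance in the counts, which is why the jump from 0.36 to about 0.95 in the table is so large.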
Next I needed to choose the set of features to train the model on. I started with a small set of 6 features and on each following iteration added one or two more, until the set contained all 12 available features.
Number of features | MAE | RMSE | R2 | Features |
---|---|---|---|---|
6 | 105.13 | 145.10 | 0.361 | mnth,workingday,weekday,temp,hum,windspeed |
8 | 97.76 | 135.96 | 0.439 | + dteday,holiday |
9 | 97.61 | 136.17 | 0.438 | + season |
10 | 96.53 | 135.20 | 0.446 | + weathersit |
11 | 96.79 | 134.81 | 0.449 | + atemp |
12 | 24.29 | 39.94 | 0.951 | + hr |
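
The procedure behind this table can be sketched as a simple loop. This is only an outline under stated assumptions: `evaluate_feature_sets` and `dummy_score` are hypothetical names, and the scorer is a placeholder so the sketch runs on its own. A real scorer would train the chosen model on the 0.7 split and return a metric such as R2.

```python
def evaluate_feature_sets(base_features, additions, score):
    """Grow the feature list step by step and record the score at each step.

    `score` is a hypothetical callable: it receives a list of feature
    names, trains and evaluates a model on them, and returns a number.
    """
    features = list(base_features)
    results = [(list(features), score(features))]
    for group in additions:
        features.extend(group)
        # Store a copy so later steps do not mutate earlier results.
        results.append((list(features), score(features)))
    return results


# Feature names and the order of addition are taken from the table above.
base = ["mnth", "workingday", "weekday", "temp", "hum", "windspeed"]
steps = [["dteday", "holiday"], ["season"], ["weathersit"], ["atemp"], ["hr"]]


def dummy_score(features):
    """Placeholder scorer (assumption): returns the feature count only."""
    return len(features)


results = evaluate_feature_sets(base, steps, dummy_score)
for features, value in results:
    print(len(features), features[-1], value)
```

Keeping the scorer as a parameter makes it easy to swap in a different model or metric without touching the loop.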
Questions
I found it strange that adding the hour (hr) had such a large impact on performance. The whole exercise raised several questions.
- Why is linear regression bad for this data set?
- What is the difference between decision tree and decision forest algorithms?
- When a feature's correlation with the target measure is 0.6, is it worth choosing it as a feature?
- Can we use a measure that has a negative correlation, for example -0.7?
- Why was the Random Forest Classifier algorithm used to clean the wind speed data (as per this article)?
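
As a starting point for the two correlation questions, here is a minimal sketch of the Pearson correlation coefficient on made-up numbers (none of them come from the bike sharing data). It only illustrates two well-known properties: the sign shows the direction of a linear relationship, not its strength, and a strong non-linear dependence can still give a coefficient of zero. The helper name `pearson` is an assumption for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up data: a perfectly increasing and a perfectly decreasing relation.
x = [1, 2, 3, 4, 5]
y_up = [2 * v for v in x]
y_down = [-2 * v for v in x]

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)

# Made-up data: y depends entirely on x, but not linearly.
centered = [-2, -1, 0, 1, 2]
squared = [v * v for v in centered]
r_nonlinear = pearson(centered, squared)

print(r_up, r_down, r_nonlinear)
```

In this toy case the negatively correlated variable is exactly as informative as the positively correlated one, and the squared variable is fully determined by its input despite the zero coefficient.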
Also
The Jupyter Notebook can be found here.