Bike Sharing Demand Questions

Recently, a group of Data Science enthusiasts that I lead discussed the Kaggle competition Bike Sharing Demand. I presented a possible solution, and during the discussion several questions were raised that I could not easily answer. Below I state those questions; I will try to answer them in follow-up articles.

Choosing Algorithm

Before tuning an algorithm, I compared several of them in Azure Machine Learning Studio. Below are the evaluation results on the UCI data set, with 0.7 of the data used for training.

Algorithm | Parameters | MAE | RMSE | R2
Linear Regression | Default | 106.49 | 144.92 | 0.36
Boosted Decision Tree Regression | Default | 25.67 | 41.85 | 0.947
Boosted Decision Tree Regression | Learning rate changed to 0.1 (was 0.2) | 27.36 | 44.18 | 0.940
Boosted Decision Tree Regression | Total number of trees changed to 200 (was 100) | 24.29 | 39.94 | 0.951
Decision Forest Regression | Default | 27.86 | 46.71 | 0.934
Decision Forest Regression | Number of decision trees changed to 16 (was 8) | 26.57 | 44.99 | 0.939
Decision Forest Regression | Number of decision trees changed to 32 (was 8) | 26.02 | 44.05 | 0.941

The results show that linear regression loses heavily to both the boosted decision tree and the decision forest. Comparing the two, the latter performs slightly worse. At the same time, increasing the number of trees improves both. In the end I chose Boosted Decision Tree Regression with 200 trees.
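For readers less familiar with the three metric columns, here is a minimal sketch of how MAE, RMSE and R2 are commonly computed. It assumes the standard textbook definitions, which may differ in detail from what Azure Machine Learning Studio reports; the function name and the sample numbers are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)                     # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return mae, rmse, r2

# Hypothetical rental counts and predictions, not taken from the data set.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(mae, rmse, r2)
```

Lower MAE and RMSE are better, while R2 closer to 1 is better, which is why the boosted tree row with 200 trees stands out in the table.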

After that I needed to choose an optimal number of features to train the model. I started from a minimal set of six features and on each following iteration added more (two at the first step, then one at a time) until the set contained all 12 available features.

Number of features | MAE | RMSE | R2 | Features
6 | 105.13 | 145.10 | 0.361 | mnth, workingday, weekday, temp, hum, windspeed
8 | 97.76 | 135.96 | 0.439 | + dteday, holiday
9 | 97.61 | 136.17 | 0.438 | + season
10 | 96.53 | 135.20 | 0.446 | + weathersit
11 | 96.79 | 134.81 | 0.449 | + atemp
12 | 24.29 | 39.94 | 0.951 | + hr
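The step-by-step procedure behind this table could be organized roughly as follows. This is only a sketch of the loop, not the actual experiment: `evaluate_feature_sets` and `dummy_score` are hypothetical names, and the placeholder scorer would have to be replaced by code that trains a model on the given columns and returns a metric such as RMSE on held-out data.

```python
def evaluate_feature_sets(base, additions, score):
    """Grow the feature list step by step and score each version.

    base      -- list of starting feature names
    additions -- list of lists; each inner list is added in one step
    score     -- callable taking a list of feature names, returning a number
    """
    current = list(base)
    results = [(len(current), list(current), score(current))]
    for group in additions:
        current = current + list(group)
        results.append((len(current), list(current), score(current)))
    return results

# Feature names and the order of addition follow the table above.
base = ["mnth", "workingday", "weekday", "temp", "hum", "windspeed"]
additions = [["dteday", "holiday"], ["season"], ["weathersit"], ["atemp"], ["hr"]]

def dummy_score(features):
    # Placeholder only: returns the feature count instead of a real metric.
    return float(len(features))

results = evaluate_feature_sets(base, additions, dummy_score)
for n_features, features, value in results:
    print(n_features, features[-1], value)
```

Keeping the scorer as a parameter makes it easy to rerun the same sequence with a different algorithm or metric.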

Questions

I was surprised that adding the hour feature (hr) had such a large impact on performance: R2 jumped from 0.449 to 0.951. The whole study raised several questions.

  1. Why is linear regression bad for this data set?
  2. What is the difference between the decision tree and decision forest algorithms?
  3. When a feature's correlation with the target is 0.6, is it worth choosing it as a feature?
  4. Can we use a feature that has a negative correlation, for example -0.7?
  5. Why was the Random Forest Classifier algorithm used to clean the wind speed data (as per this article)?
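As a starting point for questions 3 and 4, the small sketch below computes the Pearson correlation on made-up numbers. It does not answer the questions for the bike data; it only illustrates two general facts: flipping the sign of a relationship changes the sign of the correlation but not its strength, and a feature can have zero linear correlation with the target while still determining it completely. The helper name `pearson` and all values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y_pos = [2, 4, 5, 4, 5]              # tends to rise with x
y_neg = [-v for v in y_pos]          # same relationship with the sign flipped
y_u = [(v - 3) ** 2 for v in x]      # U-shaped: fully determined by x

r_pos = pearson(x, y_pos)
r_neg = pearson(x, y_neg)
r_u = pearson(x, y_u)

print(r_pos, r_neg)   # same magnitude, opposite sign
print(r_u)            # no linear correlation despite a deterministic link
```

The U-shaped case hints at why a feature with a non-linear pattern can be of little use to a linear model and still be valuable to a tree-based one.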

Also

The Jupyter Notebook can be found here.



