Kaggle Learning Path
Started the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers on 8/28
The five elements of a Kaggle competition
- Data
- Model
- Submission
- Evaluation
- Leaderboard
The test data is split into two parts, public and private. The private part is used for the final evaluation, while the public part lets you check your model's accuracy and is what the leaderboard reflects during the competition.
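The public/private mechanics can be sketched in a few lines. This is a toy illustration, not Kaggle's actual implementation: the labels, predictions, 30/70 split ratio, and accuracy metric are all assumptions for the example.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test labels and model predictions (illustrative data).
y_test = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# The organizer secretly splits the test set, say 30% public / 70% private.
random.seed(0)
indices = list(range(len(y_test)))
random.shuffle(indices)
cut = int(0.3 * len(indices))
public_idx, private_idx = indices[:cut], indices[cut:]

# During the competition only the public score is shown on the leaderboard.
public_score = accuracy([y_test[i] for i in public_idx],
                        [y_pred[i] for i in public_idx])
# The private score decides the final standings after the deadline.
private_score = accuracy([y_test[i] for i in private_idx],
                         [y_pred[i] for i in private_idx])
print(public_score, private_score)
```

Because the two scores come from different subsets, a model tuned hard against the public leaderboard can drop on the private one ("leaderboard shakeup").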
A complete competition proceeds as follows:
Common data science competition platforms:
- Kaggle
- DrivenData
- CrowdANALYTIX
- CodaLab
- DataScienceChallenge.net
- Datascience.net
- Single-competition sites (like KDD Cup, VizDoom)
Kaggle competitions vs. real-world applications
- Things to think about:
Overview of several common models
- Linear models (logistic regression, SVM)
- Tree-based models
- KNN
- Neural networks
Disadvantages of Random Forests:
- Random forests perform poorly on small training sets.
- A random forest is a prediction tool, not an interpretation tool: you cannot easily inspect or understand the relationship between the independent and dependent variables.
- Compared with a single decision tree, random forests are more expensive to train.
- In regression, the output range of decision trees and random forests is bounded by the target values seen in the training set; they cannot extrapolate beyond the training data.
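The extrapolation limit in the last point follows directly from how tree leaves predict: each leaf returns the mean of the training targets that fall into it, so no prediction can exceed the largest (or undercut the smallest) target seen in training. A one-split regression stump (a toy stand-in for a full tree, with illustrative data) makes this concrete:

```python
# A leaf predicts the mean of the training targets routed to it, so the
# model's output range is bounded by the training targets.
def fit_stump(xs, ys, threshold):
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x <= threshold else right_mean

# Training data follows y = 2x on [0, 4]; split point chosen by hand.
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 8]
stump = fit_stump(xs, ys, threshold=2)

# Querying far outside the training range still returns a leaf mean
# (here 7.0), nowhere near the "true" y = 200.
print(stump(100))  # → 7.0
```

A linear model, by contrast, would happily extrapolate the trend, for better or worse.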
Advantages of Random Forests:
- Since we are using multiple decision trees, the bias remains the same as that of a single decision tree, but the variance decreases, and thus we decrease the chances of overfitting. (Bias and variance are explained intuitively in "The curse of bias and variance".)
- When all you care about is the predictions and want a quick and dirty way out, random forests come to the rescue. You don't have to worry much about the assumptions of the model or linearity in the dataset.
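The "same bias, lower variance" claim is just the statistics of averaging. The sketch below shows the effect with simple noisy estimators rather than actual trees (the sample sizes and ensemble size are arbitrary choices for the demo); averaging many of them leaves the expected value unchanged but shrinks the spread, which is the mechanism random forests exploit:

```python
import random

random.seed(42)

def noisy_estimate():
    """One 'model': the mean of a small noisy sample (illustrative)."""
    return sum(random.gauss(0, 1) for _ in range(10)) / 10

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Spread of a single estimator vs. an "ensemble" averaging 20 of them.
single = [noisy_estimate() for _ in range(500)]
ensemble = [sum(noisy_estimate() for _ in range(20)) / 20
            for _ in range(500)]
print(variance(single), variance(ensemble))
```

In a real forest the trees are correlated (they see overlapping bootstrap samples), so the variance reduction is smaller than for independent estimators, which is why random feature subsampling is used to decorrelate them.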
Conclusion:
- There is no “silver bullet” algorithm
- Linear models split the space into 2 subspaces
- Tree-based methods split the space into boxes
- k-NN methods rely heavily on how point "closeness" is measured
- Feed-forward NNs produce smooth non-linear decision boundaries
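The k-NN point is easy to demonstrate: the same query can receive different labels depending on the distance metric. A minimal 1-nearest-neighbour classifier (toy points and labels chosen for the example):

```python
# 1-NN predictions depend entirely on the distance metric.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def nn_predict(query, points, labels, dist):
    """Return the label of the training point closest to the query."""
    best = min(range(len(points)), key=lambda i: dist(query, points[i]))
    return labels[best]

points = [(0, 3), (2, 2)]
labels = ["A", "B"]
query = (0, 0)

# Euclidean: d to A = 3.0, d to B = sqrt(8) ≈ 2.83 → "B"
print(nn_predict(query, points, labels, euclidean))  # → B
# Manhattan: d to A = 3, d to B = 4 → "A"
print(nn_predict(query, points, labels, manhattan))  # → A
```

This is why feature scaling matters so much for k-NN: rescaling a feature changes the distances, and therefore the neighbours.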
Feature preprocessing and generation
- Numeric features
- Categorical and ordinal features
- Datetime and coordinates
- Handling missing values
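One tiny sketch per bullet above, with hand-rolled helpers and made-up toy data (real work would use sklearn/pandas):

```python
from datetime import date

# Numeric features: min-max scaling to [0, 1].
ages = [20, 30, 50]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

# Categorical features: one-hot encoding over the observed vocabulary.
colors = ["red", "blue", "red"]
vocab = sorted(set(colors))
one_hot = [[1 if c == v else 0 for v in vocab] for c in colors]

# Ordinal features: map ranked categories to integers preserving order.
sizes = ["S", "L", "M"]
rank = {"S": 0, "M": 1, "L": 2}
ordinal = [rank[s] for s in sizes]

# Datetime: derive parts such as day-of-week from a date.
dow = date(2017, 8, 28).weekday()  # 0 = Monday

# Missing values: fill with the mean of the observed values.
heights = [170, None, 180]
observed = [h for h in heights if h is not None]
filled = [sum(observed) / len(observed) if h is None else h
          for h in heights]

print(scaled, one_hot, ordinal, dow, filled)
```

Which encoding is appropriate depends on the model: tree-based methods handle raw ordinal codes well, while linear models and k-NN usually need scaling and one-hot encoding.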
Natural language processing
Bag of words
TF-IDF
N-grams
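The three representations above can be computed by hand on a toy corpus. The TF-IDF variant below (term frequency times log of inverse document frequency, no smoothing) is one common formulation; library implementations such as sklearn's differ in smoothing and normalization details:

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]

# Bag of words: raw term counts per document, order discarded.
vocab = sorted({w for doc in tokenized for w in doc})
bow = [[doc.count(w) for w in vocab] for doc in tokenized]

# TF-IDF: down-weight terms that appear in many documents.
N = len(docs)
def idf(w):
    df = sum(1 for doc in tokenized if w in doc)
    return math.log(N / df)
tfidf = [[doc.count(w) / len(doc) * idf(w) for w in vocab]
         for doc in tokenized]

# N-grams: contiguous word pairs (bigrams) recover some word order.
bigrams = [list(zip(doc, doc[1:])) for doc in tokenized]
print(bigrams[0])  # → [('the', 'cat'), ('cat', 'sat')]
```

Note that "the" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF weight vanishes, which is exactly the intended effect.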
Text preprocessing
- Lowercase
- Lemmatization and stemming
- Stop words
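The three steps above chain into a simple pipeline. This is a deliberately crude sketch: the stop-word list is a tiny illustrative subset, and the suffix-stripping "stemmer" is a naive stand-in for a real one such as NLTK's PorterStemmer:

```python
STOP_WORDS = {"the", "a", "an", "is", "are"}  # tiny illustrative list

def naive_stem(word):
    """Strip a few common suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # Lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # Stop words
    return [naive_stem(t) for t in tokens]               # Stemming

print(preprocess("The cats are Running"))  # → ['cat', 'runn']
```

The "runn" output shows why naive suffix stripping is not enough in practice; lemmatization maps words to dictionary forms ("running" → "run") at the cost of needing vocabulary knowledge.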
Conclusion