Kaggle Learning Path
Started the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers on 8/28
The five elements of a Kaggle competition
- Data
- Model
- Submission
- Evaluation
- Leaderboard
The test data is split into two parts, public and private. The private part is used for the final evaluation, while the public part lets you check your model's accuracy and is what the leaderboard reflects during the competition.
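The public/private mechanics can be sketched in a few lines. This is a toy illustration, not Kaggle's actual implementation: the labels, predictions, 30/70 split ratio, and accuracy metric are all assumptions for the example.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test labels and model predictions (illustrative data).
y_test = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# The organizer secretly splits the test set, say 30% public / 70% private.
random.seed(0)
indices = list(range(len(y_test)))
random.shuffle(indices)
cut = int(0.3 * len(indices))
public_idx, private_idx = indices[:cut], indices[cut:]

# During the competition only the public score is shown on the leaderboard.
public_score = accuracy([y_test[i] for i in public_idx],
                        [y_pred[i] for i in public_idx])
# The private score decides the final standings after the deadline.
private_score = accuracy([y_test[i] for i in private_idx],
                         [y_pred[i] for i in private_idx])
print(public_score, private_score)
```

Because the two scores come from different subsets, a model tuned hard against the public leaderboard can drop on the private one ("leaderboard shakeup").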
A complete competition proceeds as follows:
Common data science competition platforms:
- Kaggle
- DrivenData
- CrowdANALYTIX
- CodaLab
- DataScienceChallenge.net
- Datascience.net
- Single-competition sites (like KDD Cup, VizDoom)
Kaggle competitions vs. real-world applications
- Things to think about:
Overview of several common models
- Linear models (logistic regression, SVM)
- Tree-based models
- KNN
- Neural networks
Disadvantages of Random Forests:
- Random forests perform poorly on small training sets.
- A random forest is a prediction tool, not an interpretation tool: you cannot easily inspect or understand the relationship between the independent and dependent variables.
- Compared with a single decision tree, random forests are more expensive to train.
- In regression, the output range of decision trees and random forests is bounded by the target values seen in the training set; they cannot extrapolate beyond the training data.
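The extrapolation limit in the last point follows directly from how tree leaves predict: each leaf returns the mean of the training targets that fall into it, so no prediction can exceed the largest (or undercut the smallest) target seen in training. A one-split regression stump (a toy stand-in for a full tree, with illustrative data) makes this concrete:

```python
# A leaf predicts the mean of the training targets routed to it, so the
# model's output range is bounded by the training targets.
def fit_stump(xs, ys, threshold):
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x <= threshold else right_mean

# Training data follows y = 2x on [0, 4]; split point chosen by hand.
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 8]
stump = fit_stump(xs, ys, threshold=2)

# Querying far outside the training range still returns a leaf mean
# (here 7.0), nowhere near the "true" y = 200.
print(stump(100))  # → 7.0
```

A linear model, by contrast, would happily extrapolate the trend, for better or worse.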
Advantages of Random Forests:
- Since we are using multiple decision trees, the bias remains the same as that of a single decision tree, but the variance decreases, and thus we decrease the chances of overfitting. (Bias and variance are explained intuitively in "The curse of bias and variance".)
- When all you care about is the predictions and want a quick and dirty way out, random forests come to the rescue. You don't have to worry much about the assumptions of the model or linearity in the dataset.
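The "same bias, lower variance" claim is just the statistics of averaging. The sketch below shows the effect with simple noisy estimators rather than actual trees (the sample sizes and ensemble size are arbitrary choices for the demo); averaging many of them leaves the expected value unchanged but shrinks the spread, which is the mechanism random forests exploit:

```python
import random

random.seed(42)

def noisy_estimate():
    """One 'model': the mean of a small noisy sample (illustrative)."""
    return sum(random.gauss(0, 1) for _ in range(10)) / 10

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Spread of a single estimator vs. an "ensemble" averaging 20 of them.
single = [noisy_estimate() for _ in range(500)]
ensemble = [sum(noisy_estimate() for _ in range(20)) / 20
            for _ in range(500)]
print(variance(single), variance(ensemble))
```

In a real forest the trees are correlated (they see overlapping bootstrap samples), so the variance reduction is smaller than for independent estimators, which is why random feature subsampling is used to decorrelate them.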
Conclusion:
- There is no “silver bullet” algorithm
- Linear models split the space into 2 subspaces
- Tree-based methods split the space into boxes
- k-NN methods rely heavily on how point "closeness" is measured
- Feed-forward NNs produce smooth non-linear decision boundaries
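The k-NN point is easy to demonstrate: the same query can receive different labels depending on the distance metric. A minimal 1-nearest-neighbour classifier (toy points and labels chosen for the example):

```python
# 1-NN predictions depend entirely on the distance metric.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def nn_predict(query, points, labels, dist):
    """Return the label of the training point closest to the query."""
    best = min(range(len(points)), key=lambda i: dist(query, points[i]))
    return labels[best]

points = [(0, 3), (2, 2)]
labels = ["A", "B"]
query = (0, 0)

# Euclidean: d to A = 3.0, d to B = sqrt(8) ≈ 2.83 → "B"
print(nn_predict(query, points, labels, euclidean))  # → B
# Manhattan: d to A = 3, d to B = 4 → "A"
print(nn_predict(query, points, labels, manhattan))  # → A
```

This is why feature scaling matters so much for k-NN: rescaling a feature changes the distances, and therefore the neighbours.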
Feature preprocessing and generation
- Numeric features
- Categorical and ordinal features
- Datetime and coordinates
- Handling missing values
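One tiny sketch per bullet above, with hand-rolled helpers and made-up toy data (real work would use sklearn/pandas):

```python
from datetime import date

# Numeric features: min-max scaling to [0, 1].
ages = [20, 30, 50]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

# Categorical features: one-hot encoding over the observed vocabulary.
colors = ["red", "blue", "red"]
vocab = sorted(set(colors))
one_hot = [[1 if c == v else 0 for v in vocab] for c in colors]

# Ordinal features: map ranked categories to integers preserving order.
sizes = ["S", "L", "M"]
rank = {"S": 0, "M": 1, "L": 2}
ordinal = [rank[s] for s in sizes]

# Datetime: derive parts such as day-of-week from a date.
dow = date(2017, 8, 28).weekday()  # 0 = Monday

# Missing values: fill with the mean of the observed values.
heights = [170, None, 180]
observed = [h for h in heights if h is not None]
filled = [sum(observed) / len(observed) if h is None else h
          for h in heights]

print(scaled, one_hot, ordinal, dow, filled)
```

Which encoding is appropriate depends on the model: tree-based methods handle raw ordinal codes well, while linear models and k-NN usually need scaling and one-hot encoding.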
Natural language processing
Bag of words
TF-IDF
N-grams
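The three representations above can be computed by hand on a toy corpus. The TF-IDF variant below (term frequency times log of inverse document frequency, no smoothing) is one common formulation; library implementations such as sklearn's differ in smoothing and normalization details:

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]

# Bag of words: raw term counts per document, order discarded.
vocab = sorted({w for doc in tokenized for w in doc})
bow = [[doc.count(w) for w in vocab] for doc in tokenized]

# TF-IDF: down-weight terms that appear in many documents.
N = len(docs)
def idf(w):
    df = sum(1 for doc in tokenized if w in doc)
    return math.log(N / df)
tfidf = [[doc.count(w) / len(doc) * idf(w) for w in vocab]
         for doc in tokenized]

# N-grams: contiguous word pairs (bigrams) recover some word order.
bigrams = [list(zip(doc, doc[1:])) for doc in tokenized]
print(bigrams[0])  # → [('the', 'cat'), ('cat', 'sat')]
```

Note that "the" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF weight vanishes, which is exactly the intended effect.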
Text preprocessing
- Lowercase
- Lemmatization and stemming
- Stop words
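The three steps above chain into a simple pipeline. This is a deliberately crude sketch: the stop-word list is a tiny illustrative subset, and the suffix-stripping "stemmer" is a naive stand-in for a real one such as NLTK's PorterStemmer:

```python
STOP_WORDS = {"the", "a", "an", "is", "are"}  # tiny illustrative list

def naive_stem(word):
    """Strip a few common suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # Lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # Stop words
    return [naive_stem(t) for t in tokens]               # Stemming

print(preprocess("The cats are Running"))  # → ['cat', 'runn']
```

The "runn" output shows why naive suffix stripping is not enough in practice; lemmatization maps words to dictionary forms ("running" → "run") at the cost of needing vocabulary knowledge.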
Conclusion