Travis's Tech Blog

Noting Travis's daily life


Kaggle Learning Path

On August 28 I started the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

The five components of a Kaggle competition

  • Data
  • Model
  • Submission
  • Evaluation
  • Leaderboard

The test data is split into two parts, public and private. The private part is used for the final evaluation, while the public part lets you check your model's accuracy and is what the leaderboard reflects.
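
A minimal pure-Python sketch of the idea (on Kaggle the split is fixed by the organizers and hidden from participants; the fraction and seed here are made up for illustration):

```python
import random

def split_public_private(test_ids, public_fraction=0.3, seed=42):
    """Randomly assign each test row to the public or private set.
    The real split is fixed and hidden; this only illustrates the idea."""
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])

public, private = split_public_private(range(100), public_fraction=0.3)
# The leaderboard score you see during the competition is computed only
# on `public`; the final ranking uses only `private`.
```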

A complete competition cycle looks like this:

Common data science competition platforms:

  • Kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • Datascience.net
  • Single-competition sites (e.g. KDD Cup, VizDoom)

Kaggle competitions vs. real-world applications

  • Things to think about:

A brief overview of common models

  • Linear models (logistic regression, SVM)
  • Tree-based models
  • k-NN
  • Neural networks
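
As a quick intuition for the k-NN entry above, here is a minimal pure-Python nearest-neighbour classifier (a sketch, not a production implementation; the toy points are made up):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). Shows why k-NN hinges on the distance metric."""
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
knn_predict(X, y, (0.5, 0.5))  # falls in the "a" cluster
knn_predict(X, y, (5.5, 5.5))  # falls in the "b" cluster
```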

Disadvantages of Random Forest:

  1. Random forests perform poorly on small training sets.
  2. Random forests are a predictive tool, not an interpretive one: it is hard to inspect or explain the relationship between the independent and dependent variables.
  3. Compared with a single decision tree, random forests are more expensive to train.
  4. For regression, the predictions of decision trees and random forests are bounded by the dependent-variable values seen in the training set; they cannot extrapolate beyond the training data.

Advantages of Random Forests:

  1. Since we are using multiple decision trees, the bias remains the same as that of a single decision tree, while the variance decreases, reducing the chances of overfitting. I have explained bias and variance intuitively at The curse of bias and variance.

  2. When all you care about is the predictions and want a quick and dirty way out, random forest comes to the rescue. You don’t have to worry much about the assumptions of the model or linearity in the dataset.
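
Point 1 can be illustrated numerically. The sketch below models each tree as an unbiased but noisy predictor and shows that averaging shrinks the spread. It assumes the models' errors are independent, which real trees only approximate, since bootstrap samples are correlated:

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0

def noisy_model():
    """Stand-in for one overfit tree: unbiased but high-variance."""
    return TRUE_VALUE + random.gauss(0, 3)

def ensemble(n_models=50):
    """Average the predictions of n_models independent noisy models."""
    return statistics.mean(noisy_model() for _ in range(n_models))

single_preds = [noisy_model() for _ in range(500)]
ensemble_preds = [ensemble() for _ in range(500)]
# Both estimators are centred on TRUE_VALUE, but the ensemble's standard
# deviation shrinks roughly by sqrt(n_models) relative to a single model.
```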

Conclusion:

  • There is no “silver bullet” algorithm
  • Linear models split the space into two subspaces
  • Tree-based methods split the space into boxes
  • k-NN methods rely heavily on how "closeness" between points is measured
  • Feed-forward NNs produce smooth non-linear decision boundaries

Feature preprocessing and generation

  • Numeric features

  • Categorical and ordinal features

  • Datetime and coordinates

  • Handling missing values
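
A pure-Python sketch covering the numeric, categorical, and missing-value items above on a toy table (the column names and values are made up):

```python
import statistics

rows = [
    {"age": 24, "city": "paris"},
    {"age": None, "city": "tokyo"},  # missing numeric value
    {"age": 36, "city": "paris"},
]

# 1. Missing values: impute with the median of the observed values.
observed = [r["age"] for r in rows if r["age"] is not None]
median_age = statistics.median(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = median_age

# 2. Numeric feature: min-max scale to [0, 1]
#    (linear models and k-NN are sensitive to feature scale).
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

# 3. Categorical feature: one-hot encode
#    (plain label encoding would suit tree-based models).
cities = sorted({r["city"] for r in rows})
for r in rows:
    for c in cities:
        r[f"city={c}"] = int(r["city"] == c)
```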

More resources on feature engineering (1)

More resources on feature engineering (2)

Natural language processing

Bag of words
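
Bag of words represents each document as a vector of word counts over a shared vocabulary, discarding word order. A minimal sketch (scikit-learn's CountVectorizer does the same job with many more options):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Build a shared vocabulary, then one count vector per document.
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [
    [Counter(doc.split())[word] for word in vocab]
    for doc in docs
]
# vocab   -> ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vectors -> [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```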

TF-IDF
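
TF-IDF down-weights words that occur in many documents. Below is one common variant as a sketch; exact formulas differ between libraries (scikit-learn, for instance, adds smoothing terms):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf_idf(term, doc, docs):
    """Raw term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)        # documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in every document, so its idf (and tf-idf) is 0;
# "cat" is rare, so it gets a positive weight.
```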

N-grams
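
N-grams extend bag of words to contiguous word sequences, recovering some of the local word order that plain counts throw away:

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is big".split()
ngrams(tokens, 2)  # [('new', 'york'), ('york', 'is'), ('is', 'big')]
```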

Text preprocessing

  • Lowercase

  • Lemmatization and stemming

  • Stop words
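
The three steps above can be sketched as a tiny pipeline. The stop-word list and the trailing-"s" "stemmer" here are toy stand-ins; real pipelines use curated stop-word lists and a proper stemmer (e.g. Porter's) or a lemmatizer:

```python
import re

STOP_WORDS = {"the", "a", "is", "on"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, tokenize, drop stop words, crudely strip plural 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

preprocess("The cats sat on the mats")  # ['cat', 'sat', 'mat']
```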

Conclusion