Why you should be Spot-Checking Algorithms on your Machine Learning Problems
Spot-checking algorithms is about getting a quick assessment of a bunch of different algorithms on your machine learning problem so that you know what algorithms to focus on and what to discard. # Spot-checking means quickly evaluating a batch of algorithms so you can decide which ones are worth digging into further and which to drop.
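As a concrete illustration of that quick assessment, here is a minimal sketch using scikit-learn. The synthetic dataset and the four estimators are placeholders rather than recommendations; the points below refine this basic loop.

```python
# Minimal spot-check: score a handful of diverse algorithms with the
# same cross-validation setup and compare the mean scores.
# The dataset and estimator choices are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

models = {
    "kNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=1),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```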
Here are five things to keep in mind when spot-checking algorithms:
- Algorithm Diversity: You want a good mix of algorithm types. I like to include instance based methods (like LVQ and kNN), functions and kernels (like neural nets, regression and SVM), rule systems (like Decision Table and RIPPER) and decision trees (like CART, ID3 and C4.5). # Make sure the chosen algorithms differ substantially from one another; for instance, a linear-kernel SVM and regularized logistic regression fit very similar models, so having tried one there is little point in also trying the other. A reasonable mix would be instance-based kNN, tree-based CART, kernel-based SVM and generative Naive Bayes. The article also lists 10 commonly used algorithms further down, all in their basic, un-tuned form.
- Best Foot Forward: Each algorithm needs to be given a chance to put its best foot forward. This does not mean performing a sensitivity analysis on the parameters of each algorithm, but using experiments and heuristics to give each algorithm a fair chance. For example, if kNN is in the mix, give it 3 chances with k values of 1, 5 and 7. # Evaluate each algorithm as fairly as possible by running it under its most favourable conditions.
- Formal Experiment: Don't play. There is a huge temptation to try lots of different things in an informal manner, to play around with algorithms on your problem. The idea of spot-checking is to get to the methods that do well on the problem, fast. Design the experiment, run it, then analyze the results. Be methodical. I like to rank algorithms by their statistically significant wins (in pairwise comparisons) and take the top 3-5 as a basis for tuning (a sketch of such an experiment follows this list). # Likewise, be systematic rather than ad hoc, and take the top 3-5 performers as the basis for tuning.
- Jumping-off Point: The best performing algorithms are a starting point, not the solution to the problem. The algorithms that are shown to be effective may not be the best algorithms for the job. They are most likely to be useful pointers to types of algorithms that perform well on the problem. For example, if kNN does well, consider follow-up experiments on all the instance based methods and variations of kNN you can think of. # The selected algorithms are only a beginning; they hint at what the final solution might look like, and we can dig deeper from there.
- Build Your Short-list: As you learn and try many different algorithms you can add new algorithms to the suite of algorithms that you use in a spot-check experiment. When I discover a particularly powerful configuration of an algorithm, I like to generalize it and include it in my suite, making my suite more robust for the next problem. # Build up your own suite of candidate algorithms.
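The points above can be combined into one concrete experiment. The sketch below assumes a synthetic classification dataset and an illustrative candidate suite: each algorithm gets a few "best foot forward" configurations (e.g. kNN with k = 1, 5, 7), all candidates are scored on the same cross-validation folds, and methods are ranked by statistically significant pairwise wins using a paired t-test on per-fold accuracy, which is only approximate but good enough for a quick ranking.

```python
# Formal spot-check sketch: shared CV folds, a diverse candidate suite
# with a few configurations per algorithm, and a ranking by significant
# pairwise wins. Dataset and parameter choices are illustrative.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)  # shared folds => paired scores

candidates = {
    "kNN k=1": KNeighborsClassifier(n_neighbors=1),
    "kNN k=5": KNeighborsClassifier(n_neighbors=5),
    "kNN k=7": KNeighborsClassifier(n_neighbors=7),
    "SVM rbf": SVC(kernel="rbf"),
    "SVM linear": SVC(kernel="linear"),
    "CART": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {name: cross_val_score(m, X, y, cv=cv, scoring="accuracy")
          for name, m in candidates.items()}

# Count statistically significant pairwise wins (alpha = 0.05) using a
# paired t-test over the per-fold accuracies.
wins = {name: 0 for name in candidates}
names = list(candidates)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        t, p = ttest_rel(scores[a], scores[b])
        if p < 0.05:
            wins[a if scores[a].mean() > scores[b].mean() else b] += 1

# Rank by wins, then by mean accuracy; the top 3-5 become the basis for tuning.
for name in sorted(wins, key=lambda n: (-wins[n], -scores[n].mean())):
    print(f"{name:20s} mean={scores[name].mean():.3f} wins={wins[name]}")
```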
Here are the 10 candidate algorithms the article lists (a rough mapping to Python implementations follows the list):
- C4.5. A decision tree algorithm; ID3 is its predecessor and the famous C5.0 its successor.
- k-means. The go-to clustering algorithm.
- Support Vector Machines. This is really a huge field of study.
- Apriori. This is the go-to algorithm for rule extraction.
- EM. Along with k-means, a go-to clustering algorithm.
- PageRank. I rarely touch graph-based problems.
- AdaBoost. This is really the family of boosting ensemble methods.
- kNN (k-nearest neighbor). A simple and effective instance-based method.
- Naive Bayes. Simple and robust use of Bayes theorem on data.
- CART (classification and regression trees). Another tree-based method.
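For reference, here is a rough mapping of that list to common Python implementations. Scikit-learn covers most of it; the packages mentioned for Apriori and PageRank are suggestions of mine, not part of the original article.

```python
# Rough mapping of the list above to Python implementations (a sketch;
# the choices outside scikit-learn are suggestions, not requirements).
from sklearn.tree import DecisionTreeClassifier     # CART; C4.5/C5.0 have no scikit-learn port
from sklearn.cluster import KMeans                  # k-means
from sklearn.svm import SVC                         # Support Vector Machines
from sklearn.mixture import GaussianMixture         # EM, as Gaussian mixture fitting
from sklearn.ensemble import AdaBoostClassifier     # AdaBoost
from sklearn.neighbors import KNeighborsClassifier  # kNN
from sklearn.naive_bayes import GaussianNB          # Naive Bayes
# Apriori: e.g. mlxtend.frequent_patterns.apriori (separate package)
# PageRank: e.g. networkx.pagerank (separate package)
```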