7 Tips for Improving MapReduce Performance
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
- Configure your cluster correctly
- 文件系统取消 noatime 属性。
- 使用JBOD而不是用RAID或者是LVM,尤其是在TT和DN上。
- mapred.local.dir and dfs.data.dir
- If you find that a particular TaskTracker becomes blacklisted on many job invocations, it may have a failing drive.
- If you see swap being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts.
- Use LZO Compression
- Tune the number of map and reduce tasks appropriately
- distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks 可以修改block size.
- Write a Combiner
- A job performs aggregation of some sort, and the Reduce input groups counter is significantly smaller than the Reduce input records counter. 这个场景非常适合使用combiner. input records非常多但是groups非常少。
- The number of spilled records is many times larger than the number of map output records as seen in the Job counters.
- Use the most appropriate and compact Writable type for your data
- Reuse Writables 重用序列化对象
- Add -verbose:gc -XX:+PrintGCDetails to mapred.child.java.opts. Then inspect the logs for some tasks. If garbage collection is frequent and represents a lot of time, you may be allocating unnecessary objects. 观察GC情况
- it may not bring you a gain for every job, but if you’re low on memory it can make a huge difference. 重用序列化对象可以减少内存分配次数,显著改善GC带来的影响
- Use “Poor Man’s Profiling” to see what your tasks are doing