The economics of getting hired as a data scientist

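The economics of becoming a data scientist. The gist of the whole piece is that in anything you do, including becoming a data scientist, you shouldn't follow the crowd. Following the crowd means taking the same path and direction as most people, so you will most likely get an average result, not an outstanding one. Learning itself is a painful and boring process, so you have to learn to accept that. At the same time, step out of your comfort zone and do things most people are unlikely to do. For a data scientist, that means trying some web scraping, doing some deployment work, replicating paper results, and solving some real problems.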

This isn’t a new observation, of course. Everyone agrees that when it comes to investing, if you’re doing what everyone else is doing, you’re unlikely to see any returns. What’s weird though, is that people fail to apply this same reasoning when it comes to investing in themselves.

The problem is, most people don’t think this way when they embark on their data science journeys. I’ve spoken to literally hundreds of aspiring data scientists through my work at SharpestMinds, and about 80% of them have roughly the same story to tell:

  1. First, they learn the ropes (Python + sklearn + Pandas + maybe some SQL or something)
  2. Then, they take a cookie-cutter MOOC of some sort
  3. They read a few job descriptions, get worried that they don’t have what it takes
  4. Maybe take another MOOC, maybe start applying to jobs through a jobs board
  5. Hear nothing back (or at best, bomb a few interviews)
  6. Get frustrated, think about doing a Master’s, apply to some more jobs
  7. Come to a decision point: do I repeat #2 through #6 until something different happens?

If this ever happens to you, odds are you’re in a self-improvement bubble too: you’re doing what everyone else is doing, but expecting a different outcome. The very first thing you need to do is stop.

Overall, the rule is: if something seems like an obvious next step because everyone else is doing it, that’s a great thing to not do. And conversely, you need to find the things that no one else is doing, and do those things as soon as possible.

What are those things? Based on what I’ve seen, about 5 come to mind:

  1. Replicate papers. This is especially true if you’re a deep learning buff. People don’t do this because it’s harder than grabbing a dataset and using a simple ANN or XGBoost to do cookie-cutter classification. Find the most interesting paper (ideally a relatively recent one) relevant to your field on the arXiv, and read it. Understand it. Then, replicate it, potentially on a new dataset. Write a blog post about it.
  2. Don’t get comfortable in your comfort zone. If you start a new project, it had better be to learn some new frameworks/libraries/tools. If you’re building your 6th Jupyter notebook that starts with df = pd.read_csv(filename) and ends with f1 = f1_score(y_true, y_pred), it’s time to change your strategy.
  3. Learn boring things. Other people aren’t doing this because no one likes boring things. But learning a proper Git flow, how to use Docker, how to build an app using Flask, and how to deploy models on AWS or Google Cloud, are skills that companies desperately want applicants to have, but that are under-appreciated by a solid majority of applicants.
  4. Do annoying things. 1) Offer to present a paper at a local data science meetup. Or, at the very least, attend the local data science meetup. 2) Send cold messages to people on LinkedIn. Try to offer value upfront (“I just noticed a typo on your website”). DO NOT ASK THEM FOR A JOB RIGHT AWAY. Make your ask as specific as possible (“I’d love to get your feedback on my blog post”). You’re trying to build a relationship and expand your network, and that takes patience. 3) Attend conferences and network. 4) Start a study group.
  5. Do things that seem crazy. Everyone goes to the UCI repository, or uses some stock dataset (yawn) to build their project. Don’t do that. Learn how to use a web scraping library, or some under-appreciated API to build your own, custom dataset. Data is hard to come by, and companies often need to rely on their engineers to get it for them. Your goal should be to come across as the kind of data science-obsessed lunatic who will build your own goddamn dataset if that’s what it takes to get the job done.
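To make item 5 a bit more concrete, here is a minimal sketch of the "build your own dataset" idea: pulling the rows out of an HTML table and writing them as CSV. It uses only the Python standard library. The names `TableRowParser`, `html_table_to_csv`, and `SAMPLE_HTML` are made up for illustration, and the sample page is a placeholder rather than a real site; in practice you would fetch the HTML from a page you are allowed to scrape, and probably reach for a dedicated scraping library.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that had no cells
                self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical page content standing in for a scraped page.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

The resulting CSV text can be saved to a file and loaded with pd.read_csv like any stock dataset, except that this one is yours.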