My CV

Open your mind, and open your eyes. (放眼未来, 自由想象)

Email: dirtysalt1987 AT gmail DOT com

GitHub: https://github.com/dirtysalt/

LinkedIn: https://www.linkedin.com/in/dirtysalt

Talks:

友盟数据分析平台架构 Umeng Analytical Architecture
中国系统架构师大会(SACC 2014) 如何在一天之内收集3.6亿移动设备的数据

1. 英文介绍

1.1. Summary

Extensive experience in:

large-scale distributed system design and implementation.
storage system design and implementation.
performance optimization and tuning for systems and applications.
big data and machine learning platform.

Specialties:

proficient in C/C++, Python, Java.
solid knowledge of data structure and algorithm.
extremely familiar with system development on Linux.

1.2. Experience

1.2.1. Software Engineer, Celerdata.com(StarRocks), 2021.4 - now

StarRocks Committer

Query Team:

Implement the function of reading ORC format files and optimize ORC data scanning with information such as zone maps and dictionaries. StarRocks has 3-5X the performance of Presto/Trino on the TPCH-100G ORC data set.
Implement Global Runtime Filter (shuffle awareness runtime filter) function. Unlike other traditional implementations, it can be used in a wider range of scenarios without tuning parameters. On the TPCH-100 data set, some queries can achieve a 3-7X performance improvement.
Optimize the performance of "group by" and "count distinct" in the high cardinality case. By inserting software prefetch instructions and reordering some instructions at the right place, the performance can be improved by 50%~100% without modifying the hashmap implementation.
Optimize the performance of "group by" for multi-column short strings. By encoding multi-column short strings as int64/int128 to improve the efficiency of hashmap query and insertion, we finally improve the query performance by 30%~100%.
Optimize the query performance of predicates containing complex expressions. By simplifying complex expressions and analyzing the monotonicity of complex expressions, we can make better use of zone maps and dictionaries, and the predicate can play a greater role in data pruning, with performance improvement in the range of 10% to 300%.

Data Lake Team:

Implement the lazy materialization on reading ORC/Parquet file. Reading the columns in predicates first and then the rest columns can effectively reduce the amount of data read. The implementation of lazy materialization does not have any performance degradation problem, and the amount of data read can be reduced to 10%~30%.
Implement IO coalescence on reading ORC/Parquet file. By merging some consecutive or spatially separated but very close columns for reading, the number of IO requests initiated to the storage system can be effectively reduced, which can avoid triggering the IOPS limit and being throttled. In some production scenarios, IOPS can be reduced to 10%~20%, and performance can be improved by 3~7X.
Implement the Block Cache feature, which can speed up query analysis by caching files locally while reading them on external storage systems (e.g. S3). This feature is the basis of StarRocks Lake Warehouse Fusion. With Block Cache enabled on TPCH/SSB data collections, the data lake can achieve query performance close to the data warehouses.

1.2.2. Senior Software Engineer, Microsoft.com, 2020.12 - 2021.4

Microsoft.com Bing IndexServe Team.

Improve and Optimize the Quality Platform for Bing IndexServe.

1.2.3. Senior Software Engineer, Amazon.com , 2020.6 - 2020.12

Amazon.com Search Experience Team.

Develop UX Guardrails tool. It ensures the rendered web elements on the search page are compliant with internal UX design guidelines.

1.2.4. Software Engineer, Head of Backend Development, CastBox.FM, 2016.4 - 2020.6

Design and implement a crawler system. This crawler system constantly synchronizes all publicly available podcasts on the Internet and pushes the latest episodes to users in a timely manner. The technologies used include MongoDB, Python, Nginx, AWS, etc. By leveraging RSS submitted by users and user search terms, we have increased the number of podcasts included in the platform from 200K to 600K, and the number of episodes from 20M to 40M, far exceeding the completeness of our competitors. We use machine learning algorithms to predict the future release of episodes based on the historical release time of the episodes, so that we can discover the latest episodes of popular podcasts within 5 minutes, and users can receive podcast update push and listen to these episodes at the very first time. Meanwhile, we compress and crop the images of podcasts and episodes from 3MB down to less than 300KB, which greatly saves image download traffic and reduces image loading time.
Design and implementation of a podcast search system. This search system is based on ElasticSearch and we improve the search system in three aspects: indexing timeliness, search speed, and relevance ranking. In terms of indexing timeliness, the crawler system checks for changes in the podcast or episode data and notifies the search system of indexing through the message queue once there are updates. The pipeline runs within 10s, and more than 200K documents are re-indexed every day. In terms of search speed, the keyword search latency is reduced to less than 200ms through caching and tuning of ElasticSearch; in terms of sorting relevance, in addition to using the document relevance scores returned by ElasticSearch, many factors like the total number of subscriptions, the number of plays in the last 1 and 7 days, and other indicators are considered, which are combined as the relevance scores to give users a better search experience.

1.2.5. Software Engineer, Logzilla, 2015.4 - 2015.8 (Remote, as Consultant)

A real-time event analytical platform.

Implement a new event storage engine to support 1M events per second on HDD disk, and 3M events per second on SSD disk.

1.2.6. Software Engineer, Galera, 2014.4 - 2014.11 (Remote, as Consultant)

A drop-in plugin of MySQL multi-master.

Optimize cluster recovery process regarding data center outage case, and reduce recovery time from the 30s to less than 3s.

1.2.7. Software Architect, Data Platform, Umeng, 2012.6 - 2016.4

Apply optimizations on Hadoop cluster usage includes: a) Using LZMA compression algorithm to compress cold data which saves more than 60% of storage space. b) Using MapReduce tasks to generate HBase files directly and then Bulk-Loading those files, which leads to good cluster stability compared with direct use of HBase API to write data. c) Redesign the HBase key prefix so that there is no overheated Region Server.
Design and implement FastHBaseRest. It is a high-performance service for accessing HBase based on HTTP/protobuf, using asynchbase for asynchronous access to HBase to improve throughput, using google guava written at the application level LRU cache to reduce access latency. The service is modular and easily extensible, and supports rewrite request/response functionality. Compared to HBase/rest, access latency is reduced by 20%-70%, and traffic data is reduced by 40%-60%.

1.2.8. Senior Software Engineer, Baidu, 2008.7 - 2012.6

Design and implement the internal build system comake2. In principle this system is very similar to maven in the Java ecosystem: you can use it to build projects and use it to manage internal dependencies. comake2 is written in Python and is very extensible, with dozens of plugins contributed by several departments, and is widely used within the company.
Develop and maintain various generic components including generic data structures, lock-free B-Trees, HTTP client, URL encoding and decoding, character encoding and conversion, regular expressions, signature algorithms, memory allocators, data exchange formats, IDL compilers, etc.

1.3. Education

MS. Computer Science. Shandong University
BE. Electronic Engineering. Shandong University

2. 中文介绍

2.1. 技术技能

熟悉C++, Python, Java等语言
熟悉数据结构和算法
精通大规模分布式系统设计和实现
熟悉存储系统/数据库系统的设计和实现

2.2. 工作经历

2.2.1. 软件工程师，Celerdata.com(StarRocks), 2021.4 - now

StarRocks Committer

Query Team:

实现读取ORC格式文件功能，并且利用zonemap和dict等信息优化ORC数据扫描。在TPCH-100G ORC数据集合上，StarRocks在性能上是Presto/Trino的3-5X.
实现Global Runtime Filter(shuffle awareness runtime filter)功能. 有别于其他传统实现，可以不用调整参数就在更广泛的场景下面使用。在TPCH-100数据集合上，部分查询可以达到3-7X的性能提升。
优化高基数情况下的group by和count distinct性能。通过在合适的位置增加prefetch, 可以在不修改hashmap实现的情况下，性能提升50%~100%.
优化多列短字符串的group by性能。通过将多列短字符串编码成为int64/int128，来提升hashmap查询和插入效率，最终将查询性能提升30%~100%.
优化包含复杂表达式的谓词的查询性能。通过简化复杂表达式，以及分析复杂表达式的单调性，可以更好地利用zonemap和dict数据，谓词在数据裁剪上可以发挥更大的能力，性能提升在10%~300%.

Data Lake Team:

实现ORC/Parquet文件读取上的延迟物化功能。通过先读取谓词列然后读取非谓词列的方式，可以有效地减少数据读取量。延迟物化不存在任何性能退化问题，而数据读取量节省到原来的10%~30%.
实现ORC/Parquet文件读取上的IO合并功能。通过将一些连续或者是空间上相距不远的列合并起来进行读取，可以有效地降低对存储系统发起的IO请求次数，避免触发IOPS限制而被限流。在生产场景下，IOPS可以降低到之前的10%~20%，性能提升3~7X.
实现Block Cache功能。这个功能可以在读取外部存储系统上（比如S3）的文件时在在本地也进行缓存，从而加速查询分析。这个功能是StarRocks湖仓融合的基础，在TPCH/SSB数据集合上，开启Block Cache功能之后，数据湖可以到达接近数据仓库的查询性能。

2.2.2. 高级软件工程师, Microsoft.com, 2020.12 - 2021.4

改进和优化Bing IndexServe的质量平台。

2.2.3. 高级软件工程师，Amazon.com, 2020.6 - 2020.12

参与开发UX Guardrails工具，确保电商搜索页面中网页元素符合内部UX设计准则。

2.2.4. 后端服务技术负责人, CastBox.fm, 2016.4 - 2020.6

设计和实现爬虫系统。这个爬虫系统不断同步互联网上所有公开的播客，并且及时地将最新单集推送给用户。使用技术包括 MongoDB, Python, Nginx, AWS等. 通过收集用户提交的RSS和用户搜索词，将平台收录的播客数量从20w提高到60w，单集数量从2000w提高到4000w，收录完整性上远超竞品。我们使用机器学习算法，根据播客单集历史发布时间预测未来单集的发布时间，可以5分钟以内发现热门播客的最新单集，用户可以在第一时间收到播客更新推送并且收听这些单集。同时我们对播客和单集的图片进行压缩和裁剪优化，将图片尺寸从3MB压缩至300KB以内，极大地节省用户图片下载流量和减少图片加载时间。
设计和实现播客搜索系统。这个搜索系统基于ElasticSearch开发，用户可以搜索到平台收录的播客和单集，目前支持的语言多达12种，包括英语，葡语，西语，德语，中日韩等。数据显示有超过1/3的用户订阅来自于搜索，因此我们从索引及时性，检索速度和相关性排序三个方面改进搜索系统。索引及时性方面，爬虫系统一旦检查到播客或者是单集数据发生变化，通过消息队列通知检索系统进行索引，整个pipeline延迟在10s以内，平均每天有超过2w个文档被重新索引；检索速度方面，通过缓存和对ElasticSearch的调优，将关键词检索延迟减低到200ms以内；在排序相关性上，除了使用ElasticSearch返回的文档相关性分数外，还使用了播客和单集的总订阅量和播放量，最近1天和7天的订阅量和播放量等特征，综合起来作为相关性分数，给用户更好的搜索体验。

2.2.5. 高级软件架构师, 友盟, 2012.6 - 2016.4

中国系统架构师大会(SACC 2014) 如何在一天之内收集3.6亿移动设备的数据
优化Hadoop集群使用包括：a) 使用lzma压缩算法来压缩不冷数据，节省60%以上的存储空间. b) 使用MapReduce任务直接生成HBase file然后进行Bulk Loading，相比直接使用HBase API来写入数据有很好的稳定性。 c) 重新设计HBase key prefix使得不会存在过热的Region Server
设计和实现FastHBaseRest. 它是一个用于访问HBase基于HTTP/Protobuf高性能服务，使用asynchbase对hbase进行异步访问来提高吞吐，使用guava编写的应用层级别LRU cache减少访问延迟。服务模块化易于扩展，支持rewrite request/response功能。相比hbase/rest, 传输延迟减少20%-70%, 传输数据减少40%-60%.

2.2.6. 软件工程师, Remote, 2014.4 - 2015.8

Logzilla, 2015.4 - 2015.8. 重写原有消息存储引擎，在写入事件数量指标上，SSD上从500K提升到3M, HDD上从100K提升到1.2M.
Galera, 2014.4 - 2014.11. 针对DC断电这种情况改进集群恢复机制，将集群恢复时间从30s降低到3s以内。

2.2.7. 高级软件工程师, 百度, 2008.7 - 2012.6

设计和实现内部构建系统comake2. 从原理上说这个系统非常类似于Java生态系统中的maven：可以使用它来构建项目，同时可以使用它来管理内部依赖。comake2使用Python语言编写，具有非常强的可扩展性，由多个部门贡献了超过数十种插件，在公司内部被广泛使用。
开发和维护各种通用组件，包括通用数据结构，lock-free B-Trees, HTTP客户端，URL处理，字符编码识别和转换，正则表达式，签名算法，内存分配器，数据交换格式，IDL编译器等等。

2.3. 教育经历

本科电子科学与技术专业山东大学
硕士计算机科学与技术专业山东大学