Web articles @ 2024-10


Databricks vs Snowflake: A Complete 2024 Comparison | by Sync Computing | Sync Computing | Medium

For the time being, most would agree that Snowflake tends to be the dominant name for easy-to-use cloud data warehouse solutions, and Databricks is the winner for cloud-based machine learning and data science workflows.

Another big difference between the two services is that Snowflake runs and charges for the entire compute stack (virtual warehouses and cloud instances), whereas Databricks only runs and charges for the management of compute, so users still have to pay a separate cloud provider bill. It is worth noting that Databricks’ new serverless product mimics the Snowflake operating model. Databricks bills in compute/time units called Databricks Units (DBUs), metered per second, while Snowflake uses a credit system.

On the whole though, the Databricks ecosystem is typically more “open” than Snowflake’s, since Databricks still runs in the user’s cloud VPC. This means users can still install custom libraries, or even introspect low-level cluster data. Such access is not possible in Snowflake, so integrating with your favorite tools may be harder. For this exact reason, Databricks also tends to be more developer and integration friendly than Snowflake.


Top 9 Lessons Learned about Databricks Jobs Serverless | by Sync Computing | Sync Computing | Medium

Now for the good stuff: here are the top 9 lessons learned from evaluating Databricks Jobs Serverless.

  1. Serverless compute is not cost optimized
  2. Ideal for short or ad-hoc jobs
  3. Eliminating spin up time is the biggest value add
  4. Serverless has zero knobs, which makes life easy but at the price of control
  5. You have no control over the runtime of your jobs
  6. Migrating to serverless is not easy
  7. Costs are completely determined by Databricks
  8. What happens if there’s an error?
  9. You can’t leverage your cloud contracts

This test result might not translate to your internal jobs. You may have a job that demonstrates that serverless massively outperforms an optimized cluster in terms of costs — it all depends on your workload. With that said, this data point does prove that serverless is not GLOBALLY optimal. Serverless does not guarantee cost savings.

We couldn’t love this aspect enough. Cluster spin-up time is such a pain to deal with when you’re just trying to run something in real time. So many times, users waiting for a cluster to spin up get sidetracked by another task and don’t come back to the cluster until an hour later.

The big downside of jobs serverless is that there’s no way to tune the cluster to adjust cost or runtime. You basically have to live with whatever Databricks decides. This means that if you want faster runtime, you can’t just throw a bigger cluster at it and call it a day. You can’t do anything really, except change your code. You’re stuck.

The list of restrictions goes on and on, to more than 100 limitations. We, in fact, had a hard time getting ANY job to run on serverless. We ran into issues even with simple test jobs. It wasn’t until we manually changed the code and moved data around that we finally got it to work.

One thing we found troubling was the pricing. On the Databricks website they say that the cost is $0.35/DBU. But where is the DBU/hr metric? Normally, one would take the runtime of the job and calculate a $/hr rate; for example, with made-up numbers, a job that bills 20 DBUs costs 20 × $0.35 = $7, and whether that works out to $7/hr or $70/hr depends entirely on a runtime you cannot influence. Then a user could tune the cluster size and, with it, the rate of cost. But with zero knobs, we have no control over the rate of cost.

At the end of the day, benchmarks presented by external parties can be totally irrelevant to your use case. Fancy benchmarks like TPC-DS, or even the one we shared in this post do not look like your jobs. There’s only one thing that really matters: YOUR WORKLOADS.


https://weibo.com/1401527553/Idpdy9wfZ?pagetype=fav

A collection of resources for learning network security:

If you don’t find an answer in the collection above, take a look at this: (Weibo post)


A Love for Legacy – Signal v. Noise

One of the things that’s most interesting to me at Basecamp is that we wear our legacy applications as a badge of honor. The very first Rails application ever built still exists as Basecamp Classic. That is the application Ruby on Rails was both created for and born from. It’s easy to forget the weight of that sometimes. (They wear maintaining legacy applications as a badge of honor.)

There is something very special about getting to work on the first Rails application. You can see how the Rails framework was built out of necessity and how it has since evolved over time. The Rails framework will continue to be guided by the real-world needs of our applications and now scores of others. (Through the legacy application you can see how the Rails framework was built out of necessity to solve real problems, and how it has evolved since.)

In programming, there is often an obsession with using the latest and greatest technology. Programmers view the use of edge technology as its own badge of honor, and are quick to throw away legacy applications. We don’t do that at Basecamp. We move forward while not forgetting and discarding our past. Programmers don’t put enough weight on the importance of legacy applications and systems. They wonder “why would someone write code this way,” without understanding the history of how the code base evolved. A legacy application can actually teach you more because it has lived in a way a new application has not. (In programming, people are obsessed with the latest and greatest technology, take pride in using it, and are quick to discard legacy code. Legacy code records history and can keep us from repeating past mistakes.)


Power of Small Optimizations

Generally, you need to have good introspection for your application and always profile your application, both in production and during development. You also need to be curious to explore every possible performance optimization opportunity. Even highly optimized places in your application can be optimized even further.

Examples of optimizations in the choice of algorithms and data structures:

  1. Use hybrid algorithms. Some algorithms or data structures can work well when the amount of data is small, but when the amount of data grows, the underlying algorithm or data structure needs to be changed.
  2. Use statistics for run-time optimizations. Every algorithm’s performance is affected by data distribution. For example, if you know the cardinality of your data in advance, it is possible to choose a faster algorithm or data structure.
  3. Use specializations. You can specialize your algorithms and data structures for specific data types. For example, if you know that you need to sort integers, it makes sense to use radix sort instead of general comparison-based sort algorithms like pdqsort. Conversely, if you need to sort integers that are always sorted or almost sorted, radix sort will be much slower than pdqsort (a sketch combining these ideas follows this list).
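
To make the three ideas above concrete, here is a minimal sketch of my own (not from the article; the function names and thresholds are made up): a sorter for 32-bit unsigned integers that applies a cheap run-time check, falls back to a comparison sort for small inputs, and otherwise switches to a specialized radix sort.

#include <algorithm>
#include <cstdint>
#include <vector>

/// Specialization: LSD radix sort for unsigned 32-bit integers, one byte per pass.
static void radixSortU32(std::vector<uint32_t> & data)
{
    std::vector<uint32_t> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        size_t counts[257] = {};
        for (uint32_t value : data)
            ++counts[((value >> shift) & 0xFF) + 1];
        for (size_t i = 1; i <= 256; ++i)
            counts[i] += counts[i - 1];
        for (uint32_t value : data)
            buffer[counts[(value >> shift) & 0xFF]++] = value;
        data.swap(buffer);
    }
}

/// Hybrid entry point: cheap run-time checks pick the underlying algorithm.
void sortU32(std::vector<uint32_t> & data)
{
    /// Statistics about the data: if it is already sorted there is nothing to do,
    /// whereas radix sort would still run four full passes.
    if (std::is_sorted(data.begin(), data.end()))
        return;

    /// Small input: constant factors dominate, so a comparison sort
    /// (std::sort is an introsort, pdqsort-like in spirit) wins.
    if (data.size() < 256)
    {
        std::sort(data.begin(), data.end());
        return;
    }

    /// Large unsorted integer input: the specialized radix sort wins.
    radixSortU32(data);
}

Which threshold is right, and whether std::sort or the radix sort wins at a given size, is exactly the kind of question you can only answer by profiling your own application.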

You always need to start optimization with the places that take most of the time during application execution. But usually, after you optimize every such place, it is hard to understand which places to optimize next. For example, if a place you investigate is already known to take only 3-5% of the whole application’s execution time, it is still important to improve it, even if only by a small amount. If you optimize many such places, the compound result of those optimizations will be visible across the whole application.


JIT in ClickHouse (a well-written article)


CPU Dispatch in ClickHouse

#include <cstddef>
#include <cstdint>

/// Tell the compiler it may assume these instruction sets when compiling this function,
/// so the loop below can be auto-vectorized (e.g. with AVX2).
__attribute__((target("sse,sse2,sse3,ssse3,sse4,avx,avx2")))
void plusAVX2(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

The attribute tells the compiler to use these instruction sets when optimizing the function it annotates.

With call-site CPU dispatch, every call still has to check the CPU feature set. The upside is that short functions can be inlined directly, but there is still a branch on each call, so this style is well suited to header files.
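
For example, a minimal sketch of the call-site style (my own code, not ClickHouse’s; plusGeneric is a hypothetical scalar fallback with the same signature as plusAVX2 above), using the GCC/Clang builtin __builtin_cpu_supports for the feature check:

#include <cstddef>
#include <cstdint>

void plusAVX2(int64_t * a, int64_t * b, int64_t * c, size_t size);  /// the AVX2 version defined above

/// Hypothetical scalar fallback, compiled without the extra instruction sets.
void plusGeneric(int64_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        c[i] = a[i] + b[i];
}

/// Call-site dispatch: small enough to live in a header and be inlined,
/// but every call re-checks the CPU feature set (one branch per call).
inline void plus(int64_t * a, int64_t * b, int64_t * c, size_t size)
{
    if (__builtin_cpu_supports("avx2"))
        plusAVX2(a, b, c, size);
    else
        plusGeneric(a, b, c, size);
}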

The other implementation is to run CPU dispatch once at initialization and install a function pointer. Then there is no branch on each call, but every call goes through a function pointer.
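
And a sketch of this second approach under the same assumptions: an alternative implementation of the same plus wrapper that resolves the best implementation once and then calls through the installed pointer. GCC’s documentation recommends calling __builtin_cpu_init() before using __builtin_cpu_supports in static initializers, so the resolver does that first.

#include <cstddef>
#include <cstdint>

void plusAVX2(int64_t * a, int64_t * b, int64_t * c, size_t size);     /// defined above
void plusGeneric(int64_t * a, int64_t * b, int64_t * c, size_t size);  /// scalar fallback from the previous sketch

using PlusFunc = void (*)(int64_t *, int64_t *, int64_t *, size_t);

/// Resolve the implementation once, before main() runs.
static PlusFunc resolvePlus()
{
    __builtin_cpu_init();  /// needed before __builtin_cpu_supports in static initialization (GCC)
    return __builtin_cpu_supports("avx2") ? plusAVX2 : plusGeneric;
}

static const PlusFunc plusImpl = resolvePlus();

/// No per-call feature check remains, but each call is an indirect call through the pointer.
void plus(int64_t * a, int64_t * b, int64_t * c, size_t size)
{
    plusImpl(a, b, c, size);
}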


A letter of discouragement to the “Destined Ones” (天命人)

So the “graduate student career” has now become a trade-off and compromise that several parties are forced to make among competing interests. The smart advisors have already turned themselves into impersonal garrisons: they provide machines for experiments, a few pieces of related work, a rough idea, and a reasonably reliable senior student; the rest is up to you, facing your destiny like any other wage worker. Look on the bright side: you did work in your own field (moved bricks for the boss) in exchange for the boss’s guidance and a degree. Whether you did real research, did engineering and wrote code, or merely wrote a related-work section on the paper assembly line, it is still of some benefit to your professional growth. As for the trade-offs against everything else, namely (your and your advisor’s) careers, (your and your advisor’s) personal interests, and the various chores that have to get done (someone always has to do them): the Destined Ones each have their own destiny.

The Destined Ones should realize that the state is playing a game with the universities, the universities with the departments, the departments with the faculty, and the faculty with the monkeys at the bottom of the pyramid. For example, during the economic upswing the baton was SCI papers, which produced today’s plight of biology, chemistry, environmental science and materials. I watched a spirited young man in the chemistry school get ground down by his advisor until he barely looked human, then join industry after his PhD and instantly come back to life at full health. In 2010 the CCF published its ranked list of venues, and computer science departments immediately responded with “graduation packages.” In 2011 Director Ma (马所长) published the first class-A software engineering conference paper (FSE) from the NJU software institute (南大软件所). During my PhD I published ICSE'14, ASE'15 and two FSE'16 papers; back then there were hardly any Chinese researchers on the Program Committees, and a PC member I had never met even annotated my paper in the PDF from beginning to end. That output was quite respectable among my peers. Look at today: the four major software engineering conferences have almost turned into Chinasoft, papers have rapidly depreciated (without decent papers you are basically out of the game), and you have to compete on other things as well: system and engineering implementation, influence, industrial transfer of your research, which in the end turn into “talent” titles... The baton always starts with good intentions, but wherever there are people there is a jianghu, and how you package your own results is a case of the Eight Immortals crossing the sea, each showing their own powers, and a matter of personal judgment.

So the essential conflict is this: what most Destined Ones want is to get guidance, improve themselves, and land a job. But any type of advisor, whether one who wants to do big work, one who wants to run projects, or one who wants to lie flat, is very likely to have interests that clash with yours. What you want most is to graduate this very second and not waste time on useless projects, but the people who got rich first have raised the bar, and the boss still wants you to deliver. The monkeys have just come out of four undergraduate years in which “classes got in the way of studying,” only to be pushed onto the stage; anxiously grinding GPA and padding their résumés to gain some edge in the competition, they end up with shaky foundations and half-baked coding skills, and all sorts of karmic consequences follow. Take a look at this interesting Zhihu question.

Under the Zhihu question “How do PhD students fall apart?” there is a very interesting answer:

High-efficiency groups: assembly line + focus only on the crux of the problem + chase hot topics. Results come fast and positive feedback is plentiful. Because the research pressure is high, there is little infighting. In such a group, as long as you follow the right senior student you will graduate smoothly. Low-efficiency groups: no operating procedures at all; everyone has to cross the river by feeling for the stones themselves; people love discussing problems, and a trivial detail can be debated for months without a conclusion; a paper gets revised until every sentence was written by the advisor; results come extremely slowly, and the endless re-analysis and round after round of revision wear away your passion for research. Because efficiency is low, everyone slacks off and there is a lot of infighting. In such a group it is very easy to fall apart.

Of the two, high-efficiency groups are common in China and low-efficiency groups are common in Europe. See which one you are walking into and choose carefully.

My current view is that the European “low-efficiency groups” are really a filter. I have met a few people with real potential to become big-name professors; already during the PhD they showed serious strength: able to distill research topics on their own, extremely focused and efficient, good at collaborating with the people around them, charismatic and socially adept. Positions in Europe are very scarce, after all, and such rising academic stars are what academia wants to see. Most people of ordinary talent just coast to graduation and then run off to industry.

Low-efficiency groups build real ability: they make you able to walk the whole path yourself, from designing experiments to writing the paper, instead of being an assembly-line operator.