Thoughts on Systems for Large Datasets: Problems and Opportunities
感觉从2012年之后,jeff工作开始从构建分布式数据库系统,转向构建机器学习和深度学习的分布式系统。这篇presentation后半部分谈到了如何从大量数据,尤其是大量的非结构化的互联网数据里面,找到有价值的信息(通过deep learning)
Areas I Wish New Grads Knew More About
- Ability to do back-of-the-envelope calculations and quickly evaluate many alternative designs
- Understanding the importance of locality at all levels (caches & memory systems, disk I/O, cross-machine, geographic regions, etc.)
- Low-level encoding and compression schemes and their tradeoffs
- More math and statistics knowledge. e.g. use of randomized, probabilistic algorithms in distributed systems
Roughly in two main areas: – issues that arise in building systems that store and manipulate large datasets – automatically extracting higher-level information from raw data
关于存储和使用大量数据集的分布式系统的几个问题:
- Programming Models.
Distributed Systems Abstractions.
- High-level tools/languages/abstractions for building distributed systems
- Are there unifying abstractions for other kinds of distributed systems problems?
– systems for handling interactive requests & dealing with intra-operation parallelism
- load balancing, fault-tolerance, service location & request distribution, …
– systems that seamlessly divide, expand, and contract processing subsystems?
- Building Applications on top of Weakly Consistent Storage Systems – consistent operations (e.g. use Paxos). often imposes additional latency for common case – inconsistent operations. better performance/availability, but apps harder to write and reason about in this model
- Building Applications on top of Weakly Consistent Storage Systems
- Challenge: General model of consistency choices, explained and codified – ideally would have one or more “knobs” controlling performance vs. consistency – “knob” would provide easy-to-understand tradeoffs
- Challenge: Easy-to-use abstractions for resolving conflicting updates to multiple versions of a piece of state – Useful for reconciling client state with servers after disconnected operation – Also useful for reconciling replicated state in different data centers after repairing a network partition
- Adaptivity and Self-Tuning in World-Wide Systems
- Challenge: automatic, dynamic world-wide placement of data & computation to minimize latency and/or cost, given constraints on: bandwidth, packet loss, power, resource usage – failure modes, …
- Users specify high-level desires:
- “99%ile latency for accessing this data should be <50ms” “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
- ACLs in Information Retrieval Systems # 大规模系统下的权限控制问题.
- Retrieval systems with mix of private, semi-private, widely shared and public documents – e.g. e-mail vs. shared doc among 10 people vs. messages in group with 100,000 members vs. public web pages
- Challenge: building retrieval systems that efficiently deal with ACLs that vary widely in size – best solution for doc shared with 10 people is different than for doc shared with the world – sharing patterns of a document might change over time
- Automatic Construction of Efficient IR Systems.
- Challenge: can we have a single parameterizable system that automatically constructs efficient retrieval system based on these parameters?
- Information Extraction from Semi-structured Data. # 处理半结构化数据
- Data with clearly labelled semantic meaning is a tiny fraction of all the data in the world
- But there’s lots semi-structured data – books & web pages with tables, data behind forms, …
- Challenge: algorithms/techniques for improved extraction of structured information from unstructured/ semi-structured sources – noisy data, but lots of redundancy – want to be able to correlate/combine/aggregate info from different sources
处理半结构化数据.
Plenty of Data
- Text: trillions of words of English + other languages Visual: billions of images and videos
- Audio: spoken queries, audio portion of video data, …
- User activity: queries, result page clicks, map requests, etc.
- Knowledge graph: billions of labelled relation triples
- Biology and Health: genetic data, health care records, …
- Physical sciences: physics, astronomy, …
- …