What Does 'Big Data' Mean?
- http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
- http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-data-mean-part-2/fulltext
- http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-data-mean-part-3/fulltext
- http://cacm.acm.org/blogs/blog-cacm/162095-what-does-big-data-mean-part-4/fulltext
Big data can mean one of four things:
- Big volumes of data, but "small analytics." Here the idea is to support SQL on very large data sets. Nobody runs "SELECT *" on something big, as this would overwhelm the recipient with terabytes of data. Instead, the focus is on running SQL analytics (count, sum, max, min, and avg, with an optional group_by) on large amounts of data. I term this "small analytics" to distinguish this use case from the one that follows. (A minimal SQL sketch appears after this list.) # Large volumes, simple analytics, e.g., generating data reports.
- Big analytics on big volumes of data. By big analytics, I mean data clustering, regressions, machine learning, and other much more complex analytics on very large amounts of data. At present, users tend to run big analytics using statistical packages such as R, SPSS, and SAS. Alternately, they use linear algebra packages such as ScaLAPACK or ARPACK. Lastly, a fair amount of custom ("roll your own") code is used here. # Complex analytics on large volumes, e.g., machine learning.
- Big velocity. By this I mean being able to absorb and process a fire hose of incoming data for applications like electronic trading, real-time ad placement on Web pages, real-time customer targeting, and mobile social networking. This use case is most prevalent in large Web properties and on Wall Street, both of whom tend to roll their own. # Fast, real-time processing.
- Big variety. Many enterprises are faced with integrating a larger and larger number of data sources with diverse data (spreadsheets, Web sources, XML, traditional DBMSs). Many enterprises view this as their number one headache. Historically, the extract, transform, and load (ETL) vendors serviced this market on modest numbers of data sources. # Integrating diverse data sources.
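To make the first use case concrete, here is a minimal sketch of "small analytics" using Python's built-in sqlite3. The `sales` table and its columns are hypothetical; the point is the shape of the workload: simple aggregates with a GROUP BY, never a bare SELECT * over a huge table.

```python
import sqlite3

# Hypothetical sales table, standing in for a very large data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
)

# The typical "small analytics" workload: simple aggregates with an
# optional GROUP BY -- never "SELECT *" over the whole data set.
query = """
    SELECT region, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount)
    FROM sales
    GROUP BY region
"""
for row in conn.execute(query):
    print(row)
```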
Big Volume, Small Analytics
- All of these deployments run on "shared nothing" server farms with north of 100 usually "beefy" nodes, survive hardware node failures through failover to a backup replica, and perform a workload consisting of SQL analytics as defined above. All report operational challenges in keeping a large configuration running, and would like new DBMS features. # High-availability shared-nothing clusters of ~100 "beefy" nodes support this need; upgrades must be smooth while staying highly available.
- Number one on everybody's list is resource elasticity (i.e., add 50 more servers to a system of 100 servers, automatically repartitioning the data to include the extra servers, all without downtime and without interrupting query processing). In addition, better resource management is a common request: multiple cost centers share a common resource, and everybody wants their fair share. # Elasticity and resource management.
- There have been quite a few papers in the recent literature documenting the inefficiency of Hadoop compared to parallel DBMSs. In general, you should expect at least an order of magnitude performance difference. This translates into an order of magnitude worse response time on the same hardware, or an order of magnitude more hardware to achieve the same performance. If the latter course is chosen, this is a decision to buy a lot of iron and use a lot of power. # The view is that Hadoop/Hive architectures are an order of magnitude slower than parallel DBMSs.
- Off into the future, I see the main challenge in this world to be 100% uptime (i.e., never go down, no matter what). This is, of course, a challenging "ops" problem. It requires installing new hardware, patches, and the next iteration of a vendor's software, all without ever taking downtime. Harder still is schema migration without incurring downtime. # Support "hot" operations.
- In addition, I predict that the SQL vendors will all move to column stores, because they are wildly faster than row stores. In effect, all row store vendors will have to transition their products to column stores over time to be competitive. This will likely be a migration challenge for some of the legacy vendors. (See the sketch after this list.) # Column stores.
- Lastly, there is a major opportunity in this space for advanced storage ideas, including compression and encryption. Sampling to cut down query costs is also of interest. # Compression, encryption, and sampling-based queries.
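To illustrate why column stores beat row stores on this workload, here is a toy sketch (not any vendor's implementation; the employee data is invented). An aggregate over one attribute touches only that attribute's column, instead of dragging every field of every row through memory.

```python
# Toy comparison of row-oriented vs. column-oriented layouts.
# Real column stores add compression, vectorized execution, and more;
# this only shows the core layout idea.
rows = [("alice", 30, 52000.0), ("bob", 41, 61000.0), ("carol", 35, 58000.0)]

# Row store: computing AVG(salary) walks every full tuple.
avg_row = sum(r[2] for r in rows) / len(rows)

# Column store: the same data, stored one list per attribute.
columns = {
    "name":   [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}
# The aggregate reads only the salary column -- far less I/O at scale.
avg_col = sum(columns["salary"]) / len(columns["salary"])

assert avg_row == avg_col
print(avg_col)
```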
Big Volume, Big Analytics
- First, I think complex analytics will increase dramatically in importance as data mining and other complex tasks grow in importance. This shift will be driven by complex analytics problems with dramatic economic value, such as recommendation engines, ad placement, and targeted customer segmentation for various purposes. Hence, the world will shift away from the simple analytics of traditional business intelligence systems toward more complex analytics. (A minimal regression sketch follows this list.)
- Second, enterprises will have to upgrade the skill set of their business analysts. Instead of merely running current business intelligence tools, they will need to become facile in statistical operations. This will be a non-trivial talent upgrade.
- Third, I think the complex analytics market is in its infancy, and which kind of system will ultimately win is an open question. In the meantime, expect a huge amount of marketing FUD from the larger DBMS players.
- Lastly, notice that I have not included Hadoop as an option. Since complex analytics are not "embarrassingly parallel," Hadoop will suffer significant performance problems. Hence, I do not consider this a reasonable Hadoop use case.
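As a concrete (if tiny) instance of the statistical operations analysts will need, here is an ordinary least-squares fit in pure Python. The data points are invented; at scale this kind of computation runs in R, SAS, ScaLAPACK, or a parallel engine rather than a single-node script.

```python
# Minimal ordinary-least-squares regression, one of the primitive
# operations that "big analytics" (regression, clustering, ML) builds on.
# Data is hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = cov(x, y) / var(x); the fitted line passes through the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

print(f"y ~ {slope:.3f}x + {intercept:.3f}")
```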
Big Velocity
- "Big velocity" means "drinking from a firehose," i.e. coping with data arriving at very high speed. # 对着消防栓喝水:)
- Immediately, we can divide the high velocity space into two components. The first requires real-time attention to the feed, while the second allows one to collect large groups of messages for batch processing. The batch processing case can be handled by many technologies, so this posting deals only with the case where one must take action in real time. # Only real-time applications are considered here.
- Real-time applications split into two classes. The first maintains essentially no state and is focused on detecting complex patterns, while the second deals primarily with maintaining state and performs the same action on every message. Hence, the first is "big pattern – little state" while the second is "little pattern – big state." These two use cases are served by very different software systems. (A sketch of the second case follows this list.) # Real-time applications fall into two classes: the first, CEP-like, has complex processing logic but little state; the second, counting-like, has simple logic but depends on state, placing heavy demands on the database.
- Database selection keywords: RDBMS, NoSQL, NewSQL, ACID, CAP
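A minimal sketch of the "little pattern – big state" case, assuming a hypothetical ad-impression feed: every message receives the same trivial processing, but the per-key state must stay correct at fire-hose rates, which is what stresses the database.

```python
from collections import defaultdict

# "Little pattern, big state": the same trivial action is applied to
# every message, but per-key state must survive a very high-rate feed.
# The event format and field names are invented for illustration.
impressions = defaultdict(int)

def on_event(event: dict) -> None:
    """Apply the same action to every message: bump one counter."""
    impressions[event["ad_id"]] += 1

# A simulated slice of the incoming fire hose.
stream = [{"ad_id": "a1"}, {"ad_id": "a2"}, {"ad_id": "a1"}]
for event in stream:
    on_event(event)

print(dict(impressions))  # {'a1': 2, 'a2': 1}
```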
Big Variety
The ETL methodology that has emerged is the following. For each data source to be integrated (a skeletal sketch follows this list): # For each new data source
- Assign a programmer to understand the data source – by talking to people, reading documentation, etc.
- The programmer writes a program (in a scripting language) to extract data from the source
- The programmer figures out how to map the local data source to a pre-defined global schema
- The programmer writes any needed data cleaning and/or transformation routines # Data cleaning and transformation
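A skeletal version of steps 2–4 for one hypothetical source: extract records, map the local column names onto an assumed pre-defined global schema, and apply a cleaning rule. Every name here (the CSV layout, COLUMN_MAP, the global schema) is invented for illustration.

```python
import csv
import io

# Hypothetical local source: a CSV export with its own column names.
local_csv = io.StringIO("emp,sal_eur\nalice,52000\nbob,61000\n")

# Step 3: a hand-written mapping from the local schema onto an assumed
# pre-defined global schema (name, salary).
COLUMN_MAP = {"emp": "name", "sal_eur": "salary"}

def transform(row: dict) -> dict:
    """Steps 3-4: rename columns, then clean/convert values."""
    out = {COLUMN_MAP[k]: v for k, v in row.items()}
    out["salary"] = float(out["salary"])  # cleaning: enforce a numeric type
    return out

records = [transform(row) for row in csv.DictReader(local_csv)]
print(records)  # [{'name': 'alice', 'salary': 52000.0}, ...]
```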
This (very widely used) methodology has a number of disadvantages:
- The global schema may not exist a priori. In one use case I am familiar with, a Big Pharma company wants to integrate the electronic lab notebooks of several thousand bench chemists and biologists. The global schema must be pieced together as a composite of the local schemas; hence, it is a byproduct of the integration, not something known up front. # Requires the global schema to be defined up front, which is itself difficult.
- It may be difficult (or impossible) to write the transformations. For example, a "salary" attribute in a local source in Paris might well be a person's wages in Euros, including a lunch allowance, but after taxes. These semantics may not be in any data dictionary, so the programmer may have a very difficult time understanding the meaning of attributes in local data sources.
- Data cleaning routines may be very difficult to write. For example, it is very difficult to decide whether two different restaurants at the same address are both valid (a food court, for example) or one went out of business and was replaced by the other. (See the sketch after this list.)
- Data integration is not a one-shot effort. All schemas, transformations, and mappings may be time-varying and may change with the local updates performed on the data sources. It is not unusual to see monthly changes to these elements.
- The methodology does not scale. Since many of the steps are human-centric, this process will scale to a few tens of local sources before the cost becomes exorbitant. # This approach does not scale easily.
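To see why cleaning rules like the restaurant case are hard to automate, consider a string-similarity check (Python's difflib here, with an arbitrary 0.8 threshold). It can flag the pair as probable duplicates, but it cannot decide whether a food court legitimately hosts both or one replaced the other; that judgment needs a human or a domain rule.

```python
from difflib import SequenceMatcher

# Two restaurant records at the same address. Similarity scoring can
# flag a candidate duplicate, but it cannot tell a food court (both
# valid) from a replacement (one defunct). Names are invented.
a = "Joe's Pizza, 100 Main St"
b = "Joes Pizza Inc, 100 Main St"

similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
if similarity > 0.8:  # threshold is an arbitrary illustration
    print(f"possible duplicate (similarity={similarity:.2f}) -- needs review")
```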
#note: Defining the global schema up front is probably the biggest technical problem.