Cassandra - A Decentralized Structured Storage System

Table of Contents

1. ABSTRACT

2. INTRODUCTION

3. RELATED WORK

4. DATA MODEL

  • A table in Cassandra is a distributed multi dimensional map indexed by a key.
    • The value is an object which is highly structured. The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long.
    • Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.
  • Columns are grouped together into sets called column families, very much similar to what happens in the Bigtable system.
    • Cassandra exposes two kinds of column families, Simple and Super column families.
    • Super column families can be visualized as a column family within a column family. (Note: super column families have since been deprecated; the data model is now essentially the same as Bigtable's.)
  • Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family. The system allows columns to be sorted either by time or by name. Time sorting of columns is exploited by applications like Inbox Search where the results are always displayed in time sorted order. (Columns can be sorted either by time or by name.)
  • Any column within a column family is accessed using the convention column family : column and any column within a column family that is of type super is accessed using the convention column family : super column : column.
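The two addressing conventions can be captured in a small holder type. A minimal sketch in Java, using a hypothetical ColumnPath class (not Cassandra's actual client types):

```java
// Illustrative only: a hypothetical ColumnPath modelling the two addressing
// conventions described above (not Cassandra's actual classes).
public final class ColumnPath {
    final String columnFamily;
    final String superColumn;   // null for a simple column family
    final String column;

    ColumnPath(String columnFamily, String superColumn, String column) {
        this.columnFamily = columnFamily;
        this.superColumn = superColumn;
        this.column = column;
    }

    @Override
    public String toString() {
        // simple column family:  column_family : column
        // super column family:   column_family : super_column : column
        return superColumn == null
                ? columnFamily + " : " + column
                : columnFamily + " : " + superColumn + " : " + column;
    }
}
```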

5. API

The Cassandra API consists of the following three simple methods.

  • insert(table, key, rowMutation)
  • get(table, key, columnName)
  • delete(table, key, columnName)

columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
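A sketch of what these three calls could look like from a client's point of view; the method names follow the paper, but the client interface and the placeholder types are hypothetical (the real API is exposed over Thrift):

```java
// Hypothetical client interface mirroring the three calls in the paper.
// RowMutation and Column are placeholders, not Cassandra's actual types.
interface RowMutation {}   // a batch of column insertions/updates for one row
interface Column {}        // a returned column (name, value, timestamp)

public interface CassandraClient {
    void insert(String table, String key, RowMutation rowMutation);
    Column get(String table, String key, String columnName);
    void delete(String table, String key, String columnName);
}

// Hypothetical usage following the naming convention above:
//   client.get("UserTable", "user42", "MessageIndex : conv17 : msg3");
```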

6. SYSTEM ARCHITECTURE

6.1. Partitioning

Cassandra partitions data across the cluster using consistent hashing but uses an order preserving hash function to do so.
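A minimal Java sketch of the ring lookup: nodes sit at tokens on the ring, and a key is routed to the first node whose token is greater than or equal to the key's token, wrapping around at the top. The class and method names are illustrative, and the key-to-token hashing (order preserving in Cassandra) is left out:

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hashing ring: a key is served by the first node whose
// token is >= the key's token, wrapping around past the highest token.
public final class TokenRing {
    private final TreeMap<BigInteger, String> tokenToNode = new TreeMap<>();

    public void addNode(String node, BigInteger token) {
        tokenToNode.put(token, node);
    }

    public String coordinatorFor(BigInteger keyToken) {
        Map.Entry<BigInteger, String> e = tokenToNode.ceilingEntry(keyToken);
        return e != null ? e.getValue() : tokenToNode.firstEntry().getValue();
    }
}
```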

The basic consistent hashing algorithm presents some challenges.

  • First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
  • Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes.

Typically there exist two ways to address this issue:

  • One is for nodes to get assigned to multiple positions in the circle (like in Dynamo),
  • and the second is to analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
  • Cassandra opts for the latter as it makes the design and implementation very tractable and helps to make very deterministic choices about load balancing.

6.2. Replication

Cassandra provides the client with various options for how data needs to be replicated. Cassandra provides various replication policies such as “Rack Unaware”, “Rack Aware” (within a datacenter) and “Datacenter Aware”. Replicas are chosen based on the replication policy chosen by the application. If an application chooses the “Rack Unaware” replication strategy then the non-coordinator replicas are chosen by picking N-1 successors of the coordinator on the ring. For “Rack Aware” and “Datacenter Aware” strategies the algorithm is slightly more involved. (This serves cross-datacenter deployments well.)
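A sketch of the “Rack Unaware” case: walk clockwise from the coordinator's token and collect N-1 distinct successors. The class name and the plain TreeMap ring are assumptions for illustration, not Cassandra's actual replication strategy code:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// "Rack Unaware" sketch: the replica set for a key is the coordinator plus
// its N-1 successors walking clockwise around the token ring.
public final class RackUnawareStrategy {
    public static List<String> replicasFor(TreeMap<BigInteger, String> ring,
                                           BigInteger keyToken, int n) {
        List<String> replicas = new ArrayList<>();
        BigInteger token = ring.ceilingKey(keyToken);
        if (token == null) token = ring.firstKey();          // wrap around
        while (replicas.size() < Math.min(n, ring.size())) {
            replicas.add(ring.get(token));
            BigInteger next = ring.higherKey(token);
            token = (next != null) ? next : ring.firstKey(); // wrap around
        }
        return replicas;
    }
}
```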

The Cassandra system elects a leader amongst its nodes using a system called Zookeeper. All nodes on joining the cluster contact the leader, who tells them what ranges they are replicas for, and the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring. The metadata about the ranges a node is responsible for is cached locally at each node and in a fault-tolerant manner inside Zookeeper - this way a node that crashes and comes back up knows what ranges it was responsible for. (Membership and range metadata are persisted via Zookeeper.)

6.3. Membership

Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy Gossip based mechanism. The salient feature of Scuttlebutt is that it has very efficient CPU utilization and very efficient utilization of the gossip channel. Within the Cassandra system Gossip is not only used for membership but also to disseminate other system related control state.

Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down. In Cassandra failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations.

Cassandra uses a modified version of the Φ Accrual Failure Detector. The idea of Accrual Failure Detection is that the failure detection module doesn’t emit a Boolean value stating a node is up or down. Instead the failure detection module emits a value which represents a suspicion level for each of the monitored nodes. This value is defined as Φ. The basic idea is to express the value of Φ on a scale that is dynamically adjusted to reflect network and load conditions at the monitored nodes. (Failure is judged via a graded suspicion level, confidence-interval style, rather than a boolean up/down.)

Φ has the following meaning: Given some threshold Φ, and assuming that we decide to suspect a node A when Φ = 1, then the likelihood that we will make a mistake (i.e., the decision will be contradicted in the future by the reception of a late heartbeat) is about 10%. The likelihood is about 1% with Φ = 2, 0.1% with Φ = 3, and so on. Every node in the system maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster. The distribution of these inter-arrival times is determined and Φ is calculated.

Although the original paper suggests that the distribution is approximated by the Gaussian distribution we found the Exponential Distribution to be a better approximation, because of the nature of the gossip channel and its impact on latency. To our knowledge our implementation of the Accrual Failure Detection in a Gossip based setting is the first of its kind. Accrual Failure Detectors are very good in both their accuracy and their speed and they also adjust well to network conditions and server load conditions.
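A sketch of the Φ computation under the exponential approximation: with mean inter-arrival time m taken from the sliding window, the probability that a heartbeat is merely late after t milliseconds is e^(-t/m), and Φ = -log10 of that probability. The class below is an illustration, not Cassandra's actual detector:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of an accrual failure detector using the exponential approximation:
// phi = -log10( P(heartbeat is merely late) ) = (t / mean) * log10(e).
public final class AccrualFailureDetector {
    private static final int WINDOW = 1000;             // sliding-window size
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    public synchronized void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    public synchronized double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = Math.max(1.0,
                intervals.stream().mapToLong(Long::longValue).average().orElse(1.0));
        double t = nowMillis - lastHeartbeatMillis;
        return (t / mean) * Math.log10(Math.E);
    }
}
```

A node is then suspected once Φ crosses a configured threshold (Section 7 mentions a fairly conservative value of 5 in production).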

6.4. Bootstrapping

When a node starts for the first time, it chooses a random token for its position in the ring. For fault tolerance, the mapping is persisted to disk locally and also in Zookeeper. The token information is then gossiped around the cluster. This is how we know about all nodes and their respective positions in the ring. This enables any node to route a request for a key to the correct node in the cluster.
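A tiny sketch of the initial token choice; the 127-bit token space is purely an assumption for illustration:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Sketch: a new node picks a uniformly random position in the token space.
// The 127-bit space is assumed here only for illustration; the chosen token
// would then be persisted locally and in Zookeeper, and gossiped around.
public final class TokenAllocator {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static BigInteger randomToken() {
        return new BigInteger(127, RANDOM);   // uniform in [0, 2^127)
    }
}
```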

In the bootstrap case, when a node needs to join a cluster, it reads its configuration file which contains a list of a few contact points within the cluster. We call these initial contact points, seeds of the cluster. Seeds can also come from a configuration service like Zookeeper.

6.5. Scaling the Cluster

When a new node is added to the system, it gets assigned a token such that it can alleviate a heavily loaded node. The node giving up the data streams the data over to the new node using kernel-kernel copy techniques. Operational experience has shown that data can be transferred at the rate of 40 MB/sec from a single node. We are working on improving this by having multiple replicas take part in the bootstrap transfer thereby parallelizing the effort, similar to BitTorrent. (Data transfer uses kernel-to-kernel copy under the hood; a BitTorrent-like scheme is being considered to speed this up.)

6.6. Local Persistence

We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. (A dedicated disk is used for the commit log.)

In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads. (Columns are indexed.)
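A sketch of index generation during row serialization: record one index entry (first column name, byte offset) each time another 256K of column data has been written, so a read can seek directly to the relevant chunk. The names and layout are illustrative, not Cassandra's on-disk format:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: while a row's columns are written out in sorted order, emit an
// index entry roughly every CHUNK_SIZE bytes of serialized column data.
public final class ColumnIndexBuilder {
    static final long CHUNK_SIZE = 256 * 1024;          // configurable boundary

    public record IndexEntry(String firstColumn, long offset) {}

    public static List<IndexEntry> build(List<String> sortedColumnNames,
                                         List<Integer> serializedSizes) {
        List<IndexEntry> index = new ArrayList<>();
        long offset = 0, bytesSinceLastEntry = CHUNK_SIZE; // force entry at offset 0
        for (int i = 0; i < sortedColumnNames.size(); i++) {
            if (bytesSinceLastEntry >= CHUNK_SIZE) {
                index.add(new IndexEntry(sortedColumnNames.get(i), offset));
                bytesSinceLastEntry = 0;
            }
            long size = serializedSizes.get(i);
            offset += size;
            bytesSinceLastEntry += size;
        }
        return index;
    }
}
```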

6.7. Implementation Details

All system control messages rely on UDP based messaging while the application related messages for replication and request routing rely on TCP. (System control messages go over UDP.)
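A minimal illustration of that transport split using standard Java sockets; this is not Cassandra's actual messaging service:

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustration of the transport split: small, loss-tolerant control messages
// (e.g. gossip) go over UDP, while replication and request routing use TCP.
public final class Transports {
    public static void sendControl(byte[] payload, InetSocketAddress peer) throws IOException {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, peer));
        }
    }

    public static void sendApplication(byte[] payload, InetSocketAddress peer) throws IOException {
        try (Socket socket = new Socket(peer.getAddress(), peer.getPort())) {
            socket.getOutputStream().write(payload);
            socket.getOutputStream().flush();
        }
    }
}
```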

7. PRACTICAL EXPERIENCES

In the process of designing, implementing and maintaining Cassandra we gained a lot of useful experience and learned numerous lessons. One very fundamental lesson learned was not to add any new feature without understanding the effects of its usage by applications. Most problematic scenarios do not stem from just node crashes and network partitions. We share just a few interesting scenarios here.

  • We exposed some background channels for the M/R process to aggregate the reverse index per user and send the serialized data over to the Cassandra instance, to avoid the serialization/deserialization overhead. This way the Cassandra instance is only bottlenecked by network bandwidth. (A bulk-load style approach.)
  • Most applications only require atomic operation per key per replica. However there have been some applications that have asked for transactional semantics, mainly for the purpose of maintaining secondary indices. (Transactions are needed mainly for secondary indices.)
  • We experimented with various implementations of Failure Detectors. Our experience had been that the time to detect failures increased beyond an acceptable limit as the size of the cluster grew. In one particular experiment, in a cluster of 100 nodes, the time taken to detect a failed node was on the order of two minutes. This is practically unworkable in our environments. With the accrual failure detector with a slightly conservative value of PHI, set to 5, the average time to detect failures in the above experiment was about 15 seconds. (This reduces failure-detection latency.)
  • Monitoring is not to be taken for granted. The Cassandra system is well integrated with Ganglia, a distributed performance monitoring tool. We expose various system level metrics to Ganglia and this has helped us understand the behavior of the system when subject to our production workload. Disks fail for no apparent reasons. The bootstrap algorithm has some hooks to repair nodes when disks fail. This is however an administrative operation. (Monitoring metrics are wired directly into Ganglia.)
  • Although Cassandra is a completely decentralized system we have learned that having some amount of coordination is essential to making the implementation of some distributed features tractable. For example Cassandra is integrated with Zookeeper, which can be used for various coordination tasks in large scale distributed systems. We intend to use the Zookeeper abstraction for some key features which actually do not come in the way of applications that use Cassandra as the storage engine. (Zookeeper handles the coordination work, keeping the implementation tractable.)

8. CONCLUSION