Cassandra

February 21, 2018 by Byron ZHU

Data Model

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
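
Roughly, every row is one entry of that outer map and its columns live in the sorted inner map; a purely illustrative sketch in Python (plain dicts standing in for the sorted map, not driver code):

# Illustrative only: the logical storage view, not driver code.
# Outer key = row (partition) key
# Inner map = column key -> column value, kept sorted by column key on disk
table = {
    "row_key_1": {
        "col_a": "value_1",
        "col_b": "value_2",
    },
    "row_key_2": {
        "col_a": "value_3",
    },
}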

Row

Example

CREATE KEYSPACE sports
    WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2};

# Replication is fixed per keyspace; the read/write consistency level is chosen per query, trading availability against consistency.
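
A minimal sketch of choosing a per-query consistency level with the DataStax Python driver, assuming the sports keyspace above and the crossfit_gyms_by_city table defined below:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster()                  # defaults to 127.0.0.1:9042
session = cluster.connect('sports')

# With RF = 2, QUORUM means both replicas must acknowledge the write;
# ConsistencyLevel.ONE would favour availability and latency instead.
insert = SimpleStatement(
    "INSERT INTO crossfit_gyms_by_city "
    "(country_code, state_province, city, opening_date, gym_name) "
    "VALUES (%s, %s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, ('US', 'CA', 'San Francisco', '2016-07-15', 'CrossFit SoMa'))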

CREATE TABLE crossfit_gyms_by_city (  
 country_code text,  
 state_province text,  
 city text,  
 gym_name text,  
 opening_date timestamp,  
 PRIMARY KEY ((country_code, state_province, city), opening_date, gym_name)  
) WITH CLUSTERING ORDER BY (opening_date ASC, gym_name ASC);
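
With this key layout, a query must supply the full composite partition key (country_code, state_province, city); the clustering columns can then be range-filtered and re-ordered. A small sketch against the same table (session as in the Connect section below):

# The whole partition key is mandatory; clustering columns may be
# range-restricted and the clustering order reversed at query time.
rows = session.execute(
    """
    SELECT gym_name, opening_date
    FROM crossfit_gyms_by_city
    WHERE country_code = %s AND state_province = %s AND city = %s
      AND opening_date >= %s
    ORDER BY opening_date DESC
    """,
    ('US', 'CA', 'San Francisco', '2016-01-01'))
for row in rows:
    print(row.gym_name, row.opening_date)
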
Cassandra        RDBMS
Keyspace         Database
Column Family    Table
Partition Key    Primary Key

Partitioner

A partitioner determines how data is distributed across the nodes in the cluster (including replicas).

  • Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
  • RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
  • ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes
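
The token a partition key hashes to can be inspected directly with CQL's token() function; a quick sketch (default Murmur3Partitioner assumed, table and session reused from above):

# token() exposes the partitioner's hash; all rows sharing a partition key
# get the same token and therefore live on the same replicas.
rows = session.execute(
    "SELECT DISTINCT country_code, state_province, city, "
    "token(country_code, state_province, city) AS t "
    "FROM crossfit_gyms_by_city")
for r in rows:
    print(r.t, r.country_code, r.state_province, r.city)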

Hash

R/W

Read

To satisfy a read, Cassandra must combine results from the active memtable and potentially multiple SSTables. Cassandra processes data at several stages on the read path to discover where the data is stored, starting with the data in the memtable and finishing with SSTables:

  • Checks the memtable
  • Checks the row cache, if enabled
  • Checks the Bloom filter
  • Checks the partition key cache, if enabled
  • Goes directly to the compression offset map if the partition key is found in the partition key cache; otherwise checks the partition summary and then the partition index
  • Locates the data on disk using the compression offset map
  • Fetches the data from the SSTable on disk
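
These stages can be watched from the client by tracing a query; a minimal sketch with the Python driver (same table as above):

# trace=True asks the coordinator to record the read path in system_traces;
# the events typically mention memtable reads, Bloom filter checks and
# SSTable lookups.
result = session.execute(
    "SELECT * FROM crossfit_gyms_by_city "
    "WHERE country_code = %s AND state_province = %s AND city = %s",
    ('US', 'CA', 'San Francisco'),
    trace=True)
for event in result.get_query_trace().events:
    print(event.source_elapsed, event.description)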

Write

  • Logging data in the commit log
  • Writing data to the memtable
  • Flushing data from the memtable
  • Storing data on disk in SSTables

Limits

  • No joins or subqueries, and only limited support for aggregation. According to Cassandra’s documentation, this is by design: data is denormalized into partitions that can be queried efficiently from a single node, rather than gathered from across the entire cluster.
  • Ordering is set at table creation time on a per-partition basis. This avoids clients attempting to sort billions of rows at run time.
  • All data for a single partition must fit on disk in a single node in the cluster.
  • It’s recommended to keep the number of rows within a partition below 100,000 and the partition size on disk under 100 MB; one common way to stay within these bounds is shown in the sketch after this list.
  • A single column value is limited to 2 GB (1 MB is recommended).
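
One common way to respect those per-partition limits is to fold a synthetic bucket into the partition key; the table and bucket column below are hypothetical, not part of the original schema:

# Hypothetical bucketed variant: no single city partition can grow without
# bound, because rows are spread over (city, bucket) partitions.
session.execute("""
    CREATE TABLE IF NOT EXISTS crossfit_gyms_by_city_bucketed (
        country_code   text,
        state_province text,
        city           text,
        bucket         int,         -- e.g. opening year, or hash(gym_name) % N
        gym_name       text,
        opening_date   timestamp,
        PRIMARY KEY ((country_code, state_province, city, bucket),
                     opening_date, gym_name)
    ) WITH CLUSTERING ORDER BY (opening_date ASC, gym_name ASC)
""")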

Connect

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Credentials for PasswordAuthenticator (the shipped defaults are cassandra/cassandra).
auth_provider = PlainTextAuthProvider(
        username='cassandra', password='cassandra')

# With no contact points given, the driver connects to 127.0.0.1:9042.
cluster = Cluster(auth_provider=auth_provider)
session = cluster.connect()
session.execute("Your CQL")

Docker Cluster

docker-compose.yaml

# Please note we are using Docker Compose version 3
version: '3'
services:
    # Configuration for our seed cassandra node. The node is called DC1N1,
    # i.e. Node 1 in Data center 1.
    DC1N1:
        # Cassandra image for Cassandra version 3.11. This is pulled
        # from the docker store.
        image: cassandra:3.11
        # In case this is the first time starting up cassandra we need to ensure
        # that all nodes do not start up at the same time. Cassandra has a
        # 2 minute rule i.e. 2 minutes between each node boot up. Booting up
        # nodes simultaneously is a mistake. This only needs to happen the first
        # time we boot up. The configuration below assumes that if the Cassandra data
        # directory is empty it means that we are starting up for the first
        # time.
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 0; fi && /docker-entrypoint.sh cassandra -f'
        # Network for the nodes to communicate
        networks:
            - dc1ring
        ports:
            - "9041:9042"                        
        # Maps cassandra data to a local folder. This preserves data across
        # container restarts. Note a folder n1data get created locally
        volumes:
            - ./n1data:/var/lib/cassandra
        # Docker container environment variables. We are using
        # CASSANDRA_CLUSTER_NAME to name the cluster. This needs to be the same
        # across all nodes. We are also declaring that DC1N1 is a seed node.
        environment:
            - CASSANDRA_CLUSTER_NAME=dev_cluster
            - CASSANDRA_SEEDS=DC1N1
        # Exposing ports for intra-cluster communication
        expose:
            - 7000
            - 7001
            - 7199
            - 9042
            - 9160
        # Cassandra recommended ulimit settings
        ulimits:
            memlock: -1
            nproc: 32768
            nofile: 100000
    # This is the configuration for our non-seed cassandra node. The node is called
    # DC1N2, i.e. Node 2 in Data center 1.
    DC1N2:
        # Cassandra image for Cassandra version 3.11. This is pulled
        # from the docker store.
        image: cassandra:3.11
        # In case this is the first time starting up cassandra we need to ensure
        # that all nodes do not start up at the same time. Cassandra has a
        # 2 minute rule i.e. 2 minutes between each node boot up. Booting up
        # nodes simultaneously is a mistake. This only needs to happen the first
        # time we boot up. The configuration below assumes that if the Cassandra data
        # directory is empty it means that we are starting up for the first
        # time.
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 60; fi && /docker-entrypoint.sh cassandra -f'
        # Network for the nodes to communicate
        networks:
            - dc1ring
        ports:
            - "9042:9042"                        
        # Maps cassandra data to a local folder. This preserves data across
        # container restarts. Note a folder n2data gets created locally
        volumes:
            - ./n2data:/var/lib/cassandra
        # Docker container environment variables. We are using
        # CASSANDRA_CLUSTER_NAME to name the cluster. This needs to be the same
        # across all nodes. We are also declaring that DC1N1 is a seed node.
        environment:
            - CASSANDRA_CLUSTER_NAME=dev_cluster
            - CASSANDRA_SEEDS=DC1N1
        # Since DC1N1 is the seed node
        depends_on:
              - DC1N1
        # Exposing ports for intra-cluster communication. Note this is already
        # done by the Dockerfile; we are just being explicit about it.
        expose:
            # Intra-node communication
            - 7000
            # TLS intra-node communication
            - 7001
            # JMX
            - 7199
            # CQL
            - 9042
            # Thrift service
            - 9160
        # Cassandra recommended ulimit settings
        ulimits:
            memlock: -1
            nproc: 32768
            nofile: 100000

    # This is the configuration for our non-seed cassandra node. The node is called
    # DC1N3, i.e. Node 3 in Data center 1.
    DC1N3:
        image: cassandra:3.11
        # In case this is the first time starting up cassandra we need to ensure
        # that all nodes do not start up at the same time. Cassandra has a
        # 2 minute rule i.e. 2 minutes between each node boot up. Booting up
        # nodes simultaneously is a mistake. This only needs to happen the first
        # time we boot up. The configuration below assumes that if the Cassandra data
        # directory is empty it means that we are starting up for the first
        # time.
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 120; fi && /docker-entrypoint.sh cassandra -f'
        # Network for the nodes to communicate.
        ports:
            - "9043:9042"            
        networks:
            - dc1ring
        # Maps cassandra data to a local folder. This preserves data across
        # container restarts. Note a folder n3data gets created locally
        volumes:
            - ./n3data:/var/lib/cassandra
        # Docker container environment variables. We are using
        # CASSANDRA_CLUSTER_NAME to name the cluster. This needs to be the same
        # across all nodes. We are also declaring that DC1N1 is a seed node.
        environment:
            - CASSANDRA_CLUSTER_NAME=dev_cluster
            - CASSANDRA_SEEDS=DC1N1
        # Since DC1N1 is the seed node
        depends_on:
              - DC1N1
        # Exposing ports for intra-cluster communication. Note this is already
        # done by the Dockerfile; we are just being explicit about it.
        expose:
            # Intra-node communication
            - 7000
            # TLS intra-node communication
            - 7001
            # JMX
            - 7199
            # CQL
            - 9042
            # Thrift service
            - 9160
        # Cassandra recommended ulimit settings
        ulimits:
            memlock: -1
            nproc: 32768
            nofile: 100000
    # A web based interface for managing your docker containers.
    portainer:
        image: portainer/portainer
        command: --templates http://templates/templates.json
        networks:
            - dc1ring
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock
            - ./portainer-data:/data
        # Enables you to access Portainer's web interface from your host machine
        # using http://localhost:10001
        ports:
            - "10001:9000"
networks:
    dc1ring:
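
Once the three containers are up (docker-compose up -d), the Connect snippet above works against the mapped host ports; a small sketch, assuming the stock cassandra image with authentication left at its defaults:

from cassandra.cluster import Cluster

# DC1N2 publishes container port 9042 on host port 9042, so localhost works;
# 9041 and 9043 reach DC1N1 and DC1N3 respectively.
cluster = Cluster(contact_points=['127.0.0.1'], port=9042)
session = cluster.connect()
row = session.execute("SELECT cluster_name, release_version FROM system.local").one()
print(row.cluster_name, row.release_version)   # expect dev_cluster
cluster.shutdown()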


Recordings

  • Cluster logs and folder structure (asciicast)
  • Connecting to Cassandra using the Python driver (asciicast)

