Spark Cassandra integration is a wonderful combination for processing at many levels. This quick start is all about Spark Cassandra connectivity.
1.Prerequisites for Spark Cassandra connectivity
apache-cassandra-2.2.3
spark-1.5.1-bin-hadoop2.6
jdk1.7.0_45
cassandra-driver-core-2.1.5.jar
spark-cassandra-connector_2.10-1.5.0-M1.jar
1.1 Setting Environment
Note: You can also set these in .bash_profile or .bashrc.
2.Apache Spark standalone quick start
Please refer to chennaihug.org for Spark installation:
http://chennaihug.org/knowledgebase/spark-master-and-slaves-single-node-installation/
Use the version listed above rather than the one in that guide.
3.Apache Cassandra standalone quick start
Please refer to chennaihug.org for Cassandra installation:
http://chennaihug.org/knowledgebase/cassandra-single-node-installation/
Use the version listed above rather than the one in that guide.
4.Configuration
a.Copy all the jars from apache-cassandra-2.2.3/lib to spark-1.5.1-bin-hadoop2.6/lib, together with cassandra-driver-core-2.1.5.jar (this jar has to be downloaded separately, as listed above).
b.Open spark-1.5.1-bin-hadoop2.6/conf/
Rename spark-env.sh.template to spark-env.sh and include the following two ENV variables and path
c.Start Cassandra and Spark, then check the daemons with jps (the JVM Process Status tool)
5.Keyspace and Table in Cassandra
Create the keyspace and table needed for this quick start in Cassandra.
Insert some records. This quick start uses a patient dataset as input.
Bulk load these records using the following COPY command in Cassandra
6.Start the spark-shell
Move the downloaded “spark-cassandra-connector_2.10-1.5.0-M1.jar” to spark-1.5.1-bin-hadoop2.6
bin/spark-shell --jars spark-cassandra-connector_2.10-1.5.0-M1.jar
Then run the following steps in the shell.
6.1 Configure a new sc
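A minimal sketch of this step, assuming Cassandra is listening on 127.0.0.1 (the single-node setup from step 3); change the host if yours differs. spark-shell already provides a context named sc, so it is stopped first. Enter the lines one at a time rather than with :paste, because the last line rebinds the name sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// spark-shell starts with a ready-made context; stop it before creating another.
sc.stop()

// Assumption: a single Cassandra node on the local machine.
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")

// New context that carries the Cassandra connection setting.
val sc = new SparkContext(conf)
```

Passing true to SparkConf keeps the settings that spark-shell already supplied (master URL, application name, the --jars list), so only the Cassandra host needs to be added.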
6.2 Access to Cassandra
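A minimal sketch of reading a table through the connector. The keyspace name hospital and table name patient are hypothetical placeholders; substitute the keyspace and table you created in step 5. It assumes sc is the Cassandra-aware context from step 6.1.

```scala
import com.datastax.spark.connector._

// Hypothetical names: replace "hospital" and "patient" with your own keyspace and table.
val patients = sc.cassandraTable("hospital", "patient")

// Number of rows in the table.
println(patients.count())

// Print a few rows; each element is a CassandraRow.
patients.take(5).foreach(println)
```

cassandraTable returns an RDD, so the usual Spark transformations (map, filter and so on) can be applied to it afterwards.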
6.3 Insert data in Cassandra
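A minimal sketch of writing rows back to Cassandra. It assumes a hypothetical table hospital.patient with columns id (int, primary key), name (text) and age (int); the sample rows are made up for illustration only. Adjust the names, column list and tuple shape to match your own table from step 5.

```scala
import com.datastax.spark.connector._

// Illustrative rows only; tuple fields map by position to the columns listed below.
val newPatients = sc.parallelize(Seq((101, "Arun", 42), (102, "Meena", 35)))

// Hypothetical keyspace, table and column names.
newPatients.saveToCassandra("hospital", "patient", SomeColumns("id", "name", "age"))
```

Because Cassandra writes are upserts, saving a row whose primary key already exists overwrites that row instead of failing.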
Reference images
a. Creating spark context for Cassandra
b.Insert data in Cassandra
c.Now check in cqlsh for the newly inserted record
P Saravana kumar, Data Engineer @ DataDotz.
SB Gowtham, Data Engineer @ DataDotz.
DataDotz is a Chennai-based Big Data team primarily focused on consulting and training on technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search and Cloud Computing.
Saravana kumar can be reached via his LinkedIn profile (https://in.linkedin.com/in/saravanasaro).
Gowtham can be reached via his LinkedIn profile (https://in.linkedin.com/in/sbgowtham).