How to connect PredictionIO 0.9.3 with Cloudera CDH 5.4.x HBase


  1. For some reason, the Spark that ships with CDH doesn’t work with PredictionIO 0.9.3,
    so I use the Spark 1.3.1 binary built for Hadoop 2.6, extracted to: SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
  2. From CDH I only use HBase, as the event server storage.
  3. I use Elasticsearch as metadata storage.
  4. I use LocalFS as model storage.
  5. I installed a Spark standalone server manually (not from CDH), using the same Spark 1.3.1 build for Hadoop 2.6.
    – For this test case I’m using a Spark master with 4 worker nodes; let’s say it runs at spark://my.remote.sparkhost:7077
    – If you don’t know how to install a standalone Spark server, please read the Spark manual.

My config file is below:

# pio-env.sh
#################################################################################################

#!/usr/bin/env bash
# spark 1.3.1-bin-hadoop2.6 binary downloaded from spark website
SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
# cloudera hadoop (CDH 5.4.x)
HADOOP_CONF_DIR=/etc/hadoop/conf
# cloudera hbase (CDH 5.4.x)
HBASE_CONF_DIR=/etc/hbase/conf
#######
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
#######
# Elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.4.4
# LocalFS
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
# HBase CDH 5.4.x
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hbase

#################################################################################################
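
Several values in pio-env.sh are directory paths, and a typo in any of them is easy to miss. Here is a minimal sketch of a check you could run before “pio status”, assuming bash. The function name check_dirs is my own and is not part of PredictionIO.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PredictionIO): report missing directories.
# Prints one "MISSING: <path>" line per absent directory and returns
# non-zero if at least one directory was missing.
check_dirs() {
  local missing=0
  local dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "MISSING: $dir"
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the variables from pio-env.sh
# (only meaningful after pio-env.sh has been sourced):
# check_dirs "$SPARK_HOME" "$HADOOP_CONF_DIR" "$HBASE_CONF_DIR" "$PIO_STORAGE_SOURCES_HBASE_HOME"
```

The check only confirms that the directories exist; it says nothing about whether HBase or Elasticsearch is actually reachable, which is what “pio status” verifies.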

pio@nn01:~$ pio status
[INFO] [Console$] Inspecting PredictionIO…
[INFO] [Console$] PredictionIO 0.9.3 is installed at /pio/PredictionIO-0.9.3
[INFO] [Console$] Inspecting Apache Spark…
[INFO] [Console$] Apache Spark is installed at /pio/PredictionIO-0.9.3/vendors/spark-1.3.1-bin-hadoop2.6
[INFO] [Console$] Apache Spark 1.3.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Console$] Inspecting storage backend connections…
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)…
[INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)…
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)…
[INFO] [Storage$] Test writing to Event Store (App Id 0)…
[INFO] [HBLEvents] The table pio_event:events_0 doesn’t exist yet. Creating now…
[INFO] [HBLEvents] Removing table pio_event:events_0…
[INFO] [Console$] (sleeping 5 seconds for all messages to show up…)
[INFO] [Console$] Your system is all ready to go.
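
If you want to script this check, one option is to look for the final success line in the output above. This is only a sketch, assuming bash and grep. The function name pio_status_ok is hypothetical, and I am matching on the message text because I have not verified what exit code “pio status” returns on failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the text on stdin contains the
# final success message that "pio status" printed in the log above.
pio_status_ok() {
  grep -q "Your system is all ready to go"
}

# Usage:
# pio status | pio_status_ok && echo "storage backends verified"
```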

After you run “pio build”, pass these parameters to “pio train” and “pio deploy” so that PredictionIO uses the remote Spark server set up in point 5. Everything after the bare “--” is passed through to Spark:

$ pio train -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

$ pio deploy -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

If you have a large dataset, you might want to add --conf spark.akka.frameSize=1024 to “pio train”.
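
Since train and deploy share the same Spark options, they can be kept in one place. A minimal sketch, assuming bash: the variable SPARK_MASTER and the function pio_spark_args are names I made up for illustration, and the only flags used are the ones shown in this post.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; the master URL and memory sizes are the ones used in this post.
SPARK_MASTER="${SPARK_MASTER:-spark://my.remote.sparkhost:7077}"

# Print the arguments that follow "pio train" or "pio deploy".
# Optional $1: a value for spark.akka.frameSize (for large datasets).
pio_spark_args() {
  local args="-- --master $SPARK_MASTER --driver-memory 4G --executor-memory 1G"
  if [ -n "${1:-}" ]; then
    args="$args --conf spark.akka.frameSize=$1"
  fi
  echo "$args"
}

# Usage (the unquoted substitution is intentional, so the shell
# splits the string into separate arguments):
# pio train $(pio_spark_args 1024)
# pio deploy $(pio_spark_args)
```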

With this setup, PredictionIO uses the remote Spark server and writes event data to a Cloudera HBase cluster.

This post answers Yanbo’s question on the PredictionIO user group about how to integrate PredictionIO with Cloudera Hadoop.
