
Cloudera Enterprise vs. Cloudera Express


| Cloudera Version Comparison | Express Version | Enterprise Version |
| --- | --- | --- |
| Cluster Management | | |
| Number of hosts supported | Unlimited | Unlimited |
| Host inspector for determining CDH readiness | Yes | Yes |
| Multi-cluster management | Yes | Yes |
| Centralized view of all running commands | Yes | Yes |
| Resource management | Yes | Yes |
| Global time control for historical diagnosis | Yes | Yes |
| Cluster-wide configuration | Yes | Yes |
| Cluster-wide event management | Yes | Yes |
| Cluster-wide log search | Yes | Yes |
| Aggregate UI | No | Yes |
| Deployment | | |
| Support for CDH 4 and CDH 5 | Yes | Yes |
| Automated deployment and readiness checks | Yes | Yes |
| Installation from local repositories | Yes | Yes |
| Rolling upgrade of CDH | No | Yes |
| Service and Configuration Management | | |
| Manage Accumulo, Flume, HBase, HDFS, Hive, Hue, Impala, Isilon, Kafka, Kudu, MapReduce, Oozie, Sentry, Solr, Spark, Sqoop, YARN, and ZooKeeper services | Yes | Yes |
| Manage Key Trustee and Cloudera Navigator | No | Yes |
| Manage add-on services | Yes | Yes |
| Rolling restart of services | No | Yes |
| High availability (HA) support | Yes | Yes |
| – CDH 4: HDFS and MapReduce JobTracker (CDH 4.2) | Yes | Yes |
| – CDH 5: HDFS, Hive Metastore, Hue, Impala Llama ApplicationMaster, MapReduce JobTracker, Oozie, YARN ResourceManager | Yes | Yes |
| HBase co-processor support | Yes | Yes |
| Configuration audit trails | Yes | Yes |
| Client configuration management | Yes | Yes |
| Workflows (add, start, stop, restart, delete, and decommission services, hosts, and role instances) | Yes | Yes |
| Role groups | Yes | Yes |
| Host templates | Yes | Yes |
| Configuration versioning and history | No | Yes |
| Restoring a configuration using the API | No | Yes |
| Security | | |
| Kerberos authentication | Yes | Yes |
| LDAP authentication for CDH | Yes | Yes |
| LDAP authentication for Cloudera Manager | No | Yes |
| SAML authentication | No | Yes |
| Encrypted communication between Server and host Agents (TLS) | Yes | Yes |
| Sentry role-based access control | Yes | Yes |
| Password redaction | Yes | Yes |
| Data encryption with KMS | No | Yes |
| Cloudera Manager user roles | No | Yes |
| Monitoring and Diagnostics | | |
| Service, host, and activity monitoring | Yes | Yes |
| Proactive health tests | Yes | Yes |
| Health history | Yes | Yes |
| Advanced filtering and charting of metrics | Yes | Yes |
| Job monitoring for MapReduce jobs, YARN applications, and Impala queries | Yes | Yes |
| Similar activity performance for MapReduce jobs | Yes | Yes |
| Support for terminating activities | Yes | Yes |
| Alert Management | | |
| Alert by email | Yes | Yes |
| Alert by SNMP | Yes | Yes |
| User-defined triggers | Yes | Yes |
| Custom alert publish scripts | Yes | Yes |
| Advanced Management Features | | |
| Automated backup and disaster recovery | No | Yes |
| File browsing, searching, and disk quota management | No | Yes |
| HBase, MapReduce, Impala, and YARN usage reports | No | Yes |
| Support integration | No | Yes |
| Operational reports | No | Yes |
| Cloudera Navigator Data Management | | |
| Metadata management and augmentation | No | Yes |
| Ingest policies | No | Yes |
| Analytics | No | Yes |
| Auditing | No | Yes |
| Lineage | No | Yes |

Don’t Upgrade Your Cloudera Manager When HDFS Rebalancer is Active


This is just a short post, and it means exactly what the title says.

For Cloudera Hadoop Users:

Don’t Upgrade Your Cloudera Manager When HDFS Rebalancer is Active

If you accidentally do it, the Cloudera Manager upgrade process will fail when it tries to start the new version, and you will need to roll back to the previous working version.
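
Before starting the upgrade, it is worth verifying that no rebalance is in progress. A minimal sketch, assuming shell access to the host that runs the HDFS Balancer (the rebalancer runs as a Java process whose main class is org.apache.hadoop.hdfs.server.balancer.Balancer):

# Hedged pre-upgrade check: is an HDFS Balancer process still running on this host?
if pgrep -f org.apache.hadoop.hdfs.server.balancer.Balancer > /dev/null; then
  echo "HDFS Balancer is still active; let it finish (or stop it) before upgrading."
fi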

How to connect PredictionIO 0.9.3 with Cloudera CDH 5.4.x HBase



  1. For some reason, the Spark that ships with CDH does not work with PredictionIO 0.9.3,
    so I use the Spark 1.3.1 binary with Hadoop 2.6 support, extracted to: SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
  2. From CDH I only use the HBase part, as the event server storage.
  3. I use Elasticsearch as the metadata storage.
  4. I use LocalFS as the model storage.
  5. I installed a standalone Spark server manually (not from CDH): Spark 1.3.1 with Hadoop 2.6 support.
    – For this test case I am using one Spark master with 4 worker nodes; let's say I installed it at spark://my.remote.sparkhost:7077
    – If you don't know how to install a standalone Spark server, please read the Spark manual (a minimal start-up sketch follows this list).
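
For point #5, a minimal start-up sketch, assuming the Spark 1.3.1 binary is unpacked at $SPARK_HOME on every node and the worker host names are listed in $SPARK_HOME/conf/slaves (my.remote.sparkhost is just the example master host used throughout this post):

# On the master host: starts the master at spark://my.remote.sparkhost:7077
$SPARK_HOME/sbin/start-master.sh
# Starts a worker on every host listed in $SPARK_HOME/conf/slaves
$SPARK_HOME/sbin/start-slaves.sh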

My config file is below:

#!/usr/bin/env bash
# pio-env.sh
#################################################################################################

# Spark 1.3.1-bin-hadoop2.6 binary downloaded from the Spark website
SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
# Cloudera Hadoop client configuration (CDH 5.4.x)
HADOOP_CONF_DIR=/etc/hadoop/conf
# Cloudera HBase client configuration (CDH 5.4.x)
HBASE_CONF_DIR=/etc/hbase/conf
#######
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
#######
# Elasticsearch (metadata storage)
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.4.4
# LocalFS (model storage)
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
# HBase from the CDH 5.4.x parcel (event data storage)
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hbase

#################################################################################################

pio@nn01:~$ pio status
[INFO] [Console$] Inspecting PredictionIO...
[INFO] [Console$] PredictionIO 0.9.3 is installed at /pio/PredictionIO-0.9.3
[INFO] [Console$] Inspecting Apache Spark...
[INFO] [Console$] Apache Spark is installed at /pio/PredictionIO-0.9.3/vendors/spark-1.3.1-bin-hadoop2.6
[INFO] [Console$] Apache Spark 1.3.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Console$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[INFO] [HBLEvents] The table pio_event:events_0 doesn't exist yet. Creating now...
[INFO] [HBLEvents] Removing table pio_event:events_0...
[INFO] [Console$] (sleeping 5 seconds for all messages to show up...)
[INFO] [Console$] Your system is all ready to go.

After you run "pio build", to make PredictionIO use the remote Spark server created in point #5, I pass these parameters to "pio train" and "pio deploy":

$ pio train -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

$ pio deploy -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

If you have a large dataset, you might want to add --conf spark.akka.frameSize=1024 to "pio train".
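
For example, a large-dataset training run against the same remote master could look like this (the memory sizes are just the ones used above; tune them for your workload):

$ pio train -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G --conf spark.akka.frameSize=1024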

With this setup, PredictionIO uses the remote Spark cluster for computation and writes event data to the Cloudera HBase cluster.

This post answers Yanbo's question on the PredictionIO user group about how to integrate PredictionIO with Cloudera Hadoop.

What Is Hadoop, Anyway?


Updated – I have a presentation introducing big data technology, Hadoop in particular, which I published on Slideshare.

I actually read a few articles about Hadoop quite a while back, several years ago, but I never really tried it. Well, today I finally got my hands on it and started genuinely learning this thing called Hadoop. Hopefully I can really understand this technology, because what it promises is very compelling 😉

So what is Hadoop?

(According to Wikipedia)

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (default 64MB or 128MB) and distributes the blocks amongst the nodes in the cluster. For processing the data, the Hadoop Map/Reduce ships code (specifically Jar files) to the nodes that have the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality,[3] in contrast to conventional HPC architecture which usually relies on a parallel file system (compute and data separated, but connected with high-speed networking).[4]
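
To see that block splitting for yourself, here is a small example using the HDFS file system checker (the file path is hypothetical); it prints each block of the file and the nodes holding its replicas:

$ hdfs fsck /user/demo/bigfile.csv -files -blocks -locations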

Also, as far as I currently know, Hadoop has 'custom' distributions, much like MySQL: there is vanilla Hadoop, and there are versions branded by various companies. For example, two fairly well-known free Hadoop vendors are Cloudera and Hortonworks.

So if you have really huge data and your database server can no longer process it, it is probably time for you to start learning Hadoop too, bro. Once I have some hands-on experience with Hadoop, I will post again!