
HDFS Balancer and HBase Data Locality


HBase block-level data locality and the HDFS Balancer can pose some problems when combined 😦

Here are three facts that I learned from this Stack Overflow post.

  1. The Hadoop (HDFS) balancer moves blocks from one datanode to another to try to give each datanode the same amount of data (within a configurable threshold). This breaks HBase's data locality, meaning that a particular region may be serving a file that is no longer on its local host.
  2. HBase's balance_switch balances the cluster so that each regionserver hosts the same number of regions (or close to it). This is separate from Hadoop's (HDFS) balancer; example commands for both follow this list.
  3. If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, as it will cause certain regions to lose their data locality. Any request to such a region then has to go over the network to one of the datanodes serving its HFile.
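
For reference, the two balancers are invoked separately. A minimal sketch (the 10% threshold and the shell invocations below are just examples):

# HDFS balancer, using the configurable utilization threshold from point 1 (10% here):
hdfs balancer -threshold 10

# HBase's region balancer (point 2) is controlled from the HBase shell:
echo "balance_switch true" | hbase shell   # enable the region balancer
echo "balancer" | hbase shell              # trigger a balancing run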

HBase's data locality is recovered, though. Whenever compaction occurs, all the blocks are copied locally to the regionserver serving that region and merged. At that point, data locality is recovered for that region. So all you really need to do to add new nodes to the cluster is add them: HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
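
If you do not want to wait for compactions to happen on their own, a major compaction can also be triggered by hand. A minimal sketch, assuming a table named my_table:

# Trigger a major compaction from the HBase shell; the region's HFiles are rewritten
# on the regionserver hosting it, which restores locality for that region.
echo "major_compact 'my_table'" | hbase shell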


Don’t Upgrade Your Cloudera Manager When HDFS Rebalancer is Active


This is just a short post, and it means exactly what the title says.

For Cloudera Hadoop Users:

Don’t Upgrade Your Cloudera Manager When HDFS Rebalancer is Active

If you accidentally do it, the Cloudera Manager upgrade process will fail when it tries to start the new version, and you will need to revert to the previous working version.
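
One way to check that no rebalance is in progress before you upgrade is to look for the balancer's lock file in HDFS (a minimal sketch; the balancer keeps /system/balancer.id around for the duration of a run):

# If this file exists, a balancer run is still active - wait before upgrading.
hdfs dfs -ls /system/balancer.id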

How to connect PredictionIO 0.9.3 with Cloudera CDH 5.4.x HBase



  1. For some reason the Spark that ships with CDH does not work with PredictionIO 0.9.3,
    so I use the Spark 1.3.1 binary with Hadoop 2.6 support, extracted to: SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
  2. From CDH I only use the HBase part, as the event server storage.
  3. I use Elasticsearch as the metadata storage.
  4. I use LocalFS as the model storage.
  5. I installed a standalone Spark server manually (not from CDH): Spark 1.3.1 with Hadoop 2.6 support.
    – For this test case I am using one Spark master with 4 worker nodes; let's say it is installed at spark://my.remote.sparkhost:7077 (a quick sketch of starting it follows this list).
    – If you don't know how to install a standalone Spark server, please read the Spark manual.
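
A rough sketch of bringing up that standalone cluster, assuming Spark is extracted to the same path on every host (see the Spark standalone documentation for the authoritative steps):

# On the master host (it will listen on spark://my.remote.sparkhost:7077):
$SPARK_HOME/sbin/start-master.sh
# On each of the 4 worker hosts, point a Worker process at that master:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://my.remote.sparkhost:7077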

My config file is as below:

# pio-env.sh
#################################################################################################

#!/usr/bin/env bash
# spark 1.3.1-bin-hadoop2.6 binary downloaded from spark website
SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
# cloudera hadoop (CDH 5.4.x)
HADOOP_CONF_DIR=/etc/hadoop/conf
# cloudera hbase (CDH 5.4.x)
HBASE_CONF_DIR=/etc/hbase/conf
#######
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
#######
# Elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.4.4
# LocalFS
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
# HBase CDH 5.4.x
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/hbase

#################################################################################################

pio@nn01:~$ pio status
[INFO] [Console$] Inspecting PredictionIO…
[INFO] [Console$] PredictionIO 0.9.3 is installed at /pio/PredictionIO-0.9.3
[INFO] [Console$] Inspecting Apache Spark…
[INFO] [Console$] Apache Spark is installed at /pio/PredictionIO-0.9.3/vendors/spark-1.3.1-bin-hadoop2.6
[INFO] [Console$] Apache Spark 1.3.1 detected (meets minimum requirement of 1.3.0)
[INFO] [Console$] Inspecting storage backend connections…
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)…
[INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)…
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)…
[INFO] [Storage$] Test writing to Event Store (App Id 0)…
[INFO] [HBLEvents] The table pio_event:events_0 doesn’t exist yet. Creating now…
[INFO] [HBLEvents] Removing table pio_event:events_0…
[INFO] [Console$] (sleeping 5 seconds for all messages to show up…)
[INFO] [Console$] Your system is all ready to go.

So after you run pio build, to make PredictionIO use the remote Spark server I set up in point #5 above, I pass these parameters when I want to train and deploy:

$ pio train -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

$ pio deploy -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G

If you have a large dataset, you might also want to add --conf spark.akka.frameSize=1024 to "pio train".
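
For example, a train run on a large dataset might look like this (a sketch combining the flags above; the memory sizes and frame size are just examples):

$ pio train -- --master spark://my.remote.sparkhost:7077 --driver-memory 4G --executor-memory 1G --conf spark.akka.frameSize=1024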

With this setup, PredictionIO will use the remote Spark server for computation and write event data to the Cloudera HBase cluster.

This post is to answer Yanbo's question on the PredictionIO user group about how to integrate PredictionIO with Cloudera Hadoop.

The Simplest Way to Generate CSV Output From Hive in Linux Shell


If you are wondering what the easiest way is (at least IMHO) to generate Hive query output in an Excel-compatible CSV format, without modifying any table or using a third-party Java plugin, it is:

hive -e "SELECT col1, col2, … FROM table_name" | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' > output_file.csv
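
A quick breakdown of what each substitution does, plus a hypothetical run against a table named web_logs (the table and columns are just examples):

# s/"/\\"/g    escape any double quotes already present in the data
# s/^|$/"/g    wrap each output line in double quotes (start and end)
# s/\t/","/g   turn Hive's tab delimiters into ","
hive -e "SELECT host, status, bytes FROM web_logs" | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' > web_logs.csv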

I know you can also use awk or other shell commands, but a Perl regex is very POWERFUL and FAST.
I picked up this Perl regex trick some time ago from a Stack Overflow answer (I will add the link once I remember it), and this method worked for me to convert the standard tab-separated output into a CSV-compatible format 😉

(Hadoop) Make Sure Your Datanode File Systems Have the Correct Permissions!


If you have ever hit an error like the one below while trying to run a MapReduce task, the problem is most likely a file/directory permission issue on one of the datanodes where the MapReduce job runs (via YARN).

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1420709500935_0492, Tracking URL = http://**********:8088/proxy/application_1420709500935_0492/
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job  -kill job_1420709500935_0492
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-02-03 01:05:26,016 Stage-1 map = 0%,  reduce = 0%
2015-02-03 01:05:36,674 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.15 sec
2015-02-03 01:05:55,424 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 11.77 sec
2015-02-03 01:06:13,084 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.43 sec
MapReduce Total cumulative CPU time: 15 seconds 430 msec
Ended Job = job_1420709500935_0492
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1420709500935_0493, Tracking URL = http://**********:8088/proxy/application_1420709500935_0493/
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job  -kill job_1420709500935_0493
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-02-03 01:06:29,383 Stage-2 map = 0%,  reduce = 0%
Ended Job = job_1420709500935_0493 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 15.43 sec   HDFS Read: 75107359 HDFS Write: 48814 SUCCESS
Stage-Stage-2:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 15 seconds 430 msec

When we look further into the error message at:

 http://**********:8088/proxy/application_1420709500935_0493/
Application application_1420709500935_0493 failed 2 times due to AM Container for appattempt_1420709500935_0493_000002 exited with exitCode: -1000 due to: Not able to initialize distributed-cache directories in any of the configured local directories for user USERNAME
.Failing this attempt.. Failing the application.

In my case, the datanode root directories on the filesystem are:

/disk1/, /disk2/, /disk3/

So, to get rid of the error message above, I had to make sure these directories have the right ownership (yarn:yarn), because my MapReduce jobs are managed by YARN and the default user needs to be able to create its cache files there:

/disk1/yarn/nm/usercache
/disk2/yarn/nm/usercache
/disk3/yarn/nm/usercache

How? Like this:

chown -R yarn:yarn /disk1/yarn/nm/usercache
chown -R yarn:yarn /disk2/yarn/nm/usercache
chown -R yarn:yarn /disk3/yarn/nm/usercache
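
To double-check afterwards, a quick look at the ownership on each data disk (a sketch using my disk layout above) should show yarn:yarn:

ls -ld /disk1/yarn/nm/usercache /disk2/yarn/nm/usercache /disk3/yarn/nm/usercache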