
How to Mount HBase Table as Hive External Table


HBase table: “h_test_table”
Hive table: “test_table”

Notes:
In a mapping entry like "attribute:column1", "attribute" is the HBase COLUMN FAMILY and "column1" is the column qualifier.

Example:

CREATE EXTERNAL TABLE test_table (
 raw_key STRING,
 column1 STRING,
 column2 STRING,
 value STRING,
 updated_at bigint)
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 WITH SERDEPROPERTIES (
 'hbase.columns.mapping'=':key,attribute:column1,attribute:column2,attribute:value,attribute:updated_at'
 )
 TBLPROPERTIES("hbase.table.name" = "h_test_table");
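For this to work, the HBase table and its column family must already exist before Hive mounts it. A minimal sketch of the HBase side in the HBase shell (the row key and cell values here are illustrative, not from the post):

```
# Create the table with the "attribute" column family,
# then insert one example row that the Hive table above can read.
create 'h_test_table', 'attribute'
put 'h_test_table', 'row1', 'attribute:column1', 'foo'
put 'h_test_table', 'row1', 'attribute:value', 'bar'
```

After that, a `SELECT * FROM test_table;` in Hive should return the row, with the HBase row key in `raw_key`.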

The Simplest Way to Generate CSV Output From Hive in Linux Shell


If you are wondering, the easiest way (at least IMHO) to generate Hive query output in an Excel-compatible CSV format, without modifying any table or using a 3rd-party Java plugin, is:

hive -e "SELECT col1, col2, ... FROM table_name" | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' > output_file.csv

I know you can also use awk or other shell commands, but perl regexes are very POWERFUL and FAST.
I got this perl regex tip from a stackoverflow answer some time ago (I will add the link once I remember it), and this method worked for me to convert the standard tab-separated output into CSV-compatible format 😉

(Hadoop) Make Sure Your Datanode File System Have the Correct Permission!


If you have ever hit an error like the one below when running a MapReduce task, the problem is most likely a file/directory permission issue on one of the datanodes where the MapReduce job runs (via YARN).

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1420709500935_0492, Tracking URL = http://**********:8088/proxy/application_1420709500935_0492/
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job  -kill job_1420709500935_0492
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-02-03 01:05:26,016 Stage-1 map = 0%,  reduce = 0%
2015-02-03 01:05:36,674 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.15 sec
2015-02-03 01:05:55,424 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 11.77 sec
2015-02-03 01:06:13,084 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.43 sec
MapReduce Total cumulative CPU time: 15 seconds 430 msec
Ended Job = job_1420709500935_0492
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1420709500935_0493, Tracking URL = http://**********:8088/proxy/application_1420709500935_0493/
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job  -kill job_1420709500935_0493
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-02-03 01:06:29,383 Stage-2 map = 0%,  reduce = 0%
Ended Job = job_1420709500935_0493 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 15.43 sec   HDFS Read: 75107359 HDFS Write: 48814 SUCCESS
Stage-Stage-2:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 15 seconds 430 msec

When we look more closely at the error message at:

 http://**********:8088/proxy/application_1420709500935_0493/
Application application_1420709500935_0493 failed 2 times due to AM Container for appattempt_1420709500935_0493_000002 exited with exitCode: -1000 due to: Not able to initialize distributed-cache directories in any of the configured local directories for user USERNAME
.Failing this attempt.. Failing the application.

In my case, the datanode root directories on the filesystem are:

/disk1/, /disk2/, /disk3/

So, to get rid of the error message above, I had to make sure these directories have the correct ownership (yarn:yarn), since my MapReduce jobs are managed by YARN and the default user must be able to create its cache files there:

/disk1/yarn/nm/usercache
/disk2/yarn/nm/usercache
/disk3/yarn/nm/usercache

How? Like this:

chown -R yarn:yarn /disk1/yarn/nm/usercache
chown -R yarn:yarn /disk2/yarn/nm/usercache
chown -R yarn:yarn /disk3/yarn/nm/usercache
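The three chown commands above can be wrapped in a small helper that first checks the current owner and only reports what needs fixing. The function name is mine and the /disk1..3 layout is from this post; adjust both to your datanodes (and run the actual chown as root):

```shell
# Sketch: report whether a directory is owned by the expected user:group,
# and print the chown needed when it is not.
check_owner() {
  dir="$1"; want="$2"
  have=$(stat -c '%U:%G' "$dir") || return 1
  if [ "$have" = "$want" ]; then
    echo "OK   $dir ($have)"
  else
    echo "FIX  $dir owned by $have -> run: chown -R $want $dir"
  fi
}

# Assumed disk layout from the post; edit to match your datanode.
for d in /disk1 /disk2 /disk3; do
  if [ -d "$d/yarn/nm/usercache" ]; then
    check_owner "$d/yarn/nm/usercache" yarn:yarn
  fi
done
```

Running this on every datanode quickly shows which usercache directories still have the wrong owner, instead of chown-ing blindly.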