Notes from things I've tried
How to Convert Compressed Text from bzip2 to gzip, Multithreaded, Without Temporary Files
The tools you will need are lbzip2, pigz, and split.
lbzip2 => http://lbzip2.org/
pigz => http://www.zlib.net/pigz/
If you are using Ubuntu (I'm using 14.04 LTS), you can easily install lbzip2 and pigz with apt-get (or aptitude):
$ sudo apt-get install lbzip2 pigz
Or you can download the source code from their websites and compile it manually.
Let's say you have a 500 GB bzip2-compressed text file called file01.txt.bz2, and you want to split it into multiple gzipped files of 1,500,000 lines each so your Hadoop cluster can process them faster:
$ lbzcat file01.txt.bz2 | split -d -a 10 -l 1500000 --filter='pigz > $FILE.gz' - newfile01-
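As a quick sanity check of the split/--filter pattern, here is the same idea on a small generated sample, with stock gzip standing in for pigz (the sample and part names are made up for the demo). The pieces should concatenate back to the original:

```shell
#!/bin/sh
# Demo of the split --filter pattern on a small sample,
# with plain gzip standing in for pigz.
set -e
seq 1 100000 > sample.txt                             # 100,000-line sample
# split into 4 pieces of 25,000 lines, gzipping each on the fly;
# $FILE is set by split to the output name (part-000, part-001, ...)
split -d -a 3 -l 25000 --filter='gzip > $FILE.gz' sample.txt part-
ls part-*.gz
# round trip: concatenated pieces must match the original file
zcat part-*.gz | cmp - sample.txt && echo "round trip OK"
```

Because gzip streams concatenate cleanly, `zcat part-*.gz` restores the original line-for-line, which is exactly why the per-piece files can also be consumed independently by Hadoop.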
How to Export/Import an HBase Table
– EXPORT –
e.g., HBase table: hbase_test_table
and today's date is 20160820
1. Create a temporary folder in HDFS for the exported files:
$ hadoop fs -mkdir /tmp/hbasedump/20160820
2. Execute this shell command on any Hadoop node that has an HBase gateway:
$ hbase org.apache.hadoop.hbase.mapreduce.Export hbase_test_table /tmp/hbasedump/20160820/hbase_test_table
3. Don't forget to record the table structure, so you will be able to import the data back later if needed:
$ hbase shell
hbase-shell> describe 'hbase_test_table'
Table hbase_test_table is ENABLED
hbase_test_table
COLUMN FAMILIES DESCRIPTION
{NAME => 'test_cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}
1 row(s) in 0.1290 seconds
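The export steps above can be wrapped in a small helper. This is only a sketch: the function name and the optional `runner` dry-run hook are my own additions, not part of HBase or Hadoop (pass `echo` as the third argument to print the commands instead of executing them):

```shell
#!/bin/sh
# Sketch: wrap the two export steps in one function.
# "runner" is a hypothetical dry-run hook: pass "echo" to print
# the commands instead of running them on a real cluster.
export_hbase_table() {
  table="$1"; date="$2"; runner="${3:-}"
  dest="/tmp/hbasedump/${date}/${table}"
  # step 1: temporary folder in HDFS (-p also creates parent dirs)
  $runner hadoop fs -mkdir -p "/tmp/hbasedump/${date}"
  # step 2: run the MapReduce export job
  $runner hbase org.apache.hadoop.hbase.mapreduce.Export "$table" "$dest"
}

# dry run, printing the commands from the example above:
export_hbase_table hbase_test_table 20160820 echo
```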
– IMPORT –
e.g., HBase table: test_import_hbase_test_table
1. Let's say you have the exported dump for that table in HDFS at /tmp/hbasedump/20160820/hbase_test_table, and you want to import it into a new table, "test_import_hbase_test_table".
2. Create the table "test_import_hbase_test_table" if it is not yet created, with the same column family name (see the table description from export step 3 above):
$ hbase shell
hbase-shell> create 'test_import_hbase_test_table', 'test_cf'
3. Start the import process:
$ hbase org.apache.hadoop.hbase.mapreduce.Import "test_import_hbase_test_table" "/tmp/hbasedump/20160820/hbase_test_table"
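The matching import helper, in the same sketchy style (function name and `runner` dry-run hook are my own, mirroring the export sketch above):

```shell
#!/bin/sh
# Sketch: wrap the import step; pass "echo" as the third
# argument for a dry run that only prints the command.
import_hbase_table() {
  table="$1"; dump="$2"; runner="${3:-}"
  $runner hbase org.apache.hadoop.hbase.mapreduce.Import "$table" "$dump"
}

# dry run with the values from the example above:
import_hbase_table test_import_hbase_test_table \
  /tmp/hbasedump/20160820/hbase_test_table echo
```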
How to Mount an HBase Table as a Hive External Table
HBase table: “h_test_table”
Hive table: “test_table”
notes:
In "attribute:column1", "attribute" is the COLUMN FAMILY.
Example:
CREATE EXTERNAL TABLE test_table (
  raw_key STRING,
  column1 STRING,
  column2 STRING,
  value STRING,
  updated_at BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,attribute:column1,attribute:column2,attribute:value,attribute:updated_at')
TBLPROPERTIES ("hbase.table.name" = "h_test_table");
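The mapping string must line up, position by position, with the Hive column list (`:key` pairs with the first Hive column, then one `family:qualifier` per remaining column). A tiny hypothetical helper, not part of Hive, makes that alignment explicit:

```shell
#!/bin/sh
# Hypothetical helper: build the hbase.columns.mapping value from
# a column family and an ordered list of qualifiers, so it stays
# in sync with the Hive column list in the DDL above.
build_mapping() {
  cf="$1"; shift
  mapping=":key"                 # first Hive column maps to the row key
  for col in "$@"; do
    mapping="${mapping},${cf}:${col}"
  done
  printf '%s\n' "$mapping"
}

build_mapping attribute column1 column2 value updated_at
# prints :key,attribute:column1,attribute:column2,attribute:value,attribute:updated_at
```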
HDFS Balancer and HBase Data Locality
HBase block-file locality and the HDFS Balancer can pose some problems 😦
There are three facts that I've learned from this Stack Overflow post.
- The Hadoop (HDFS) balancer moves blocks around from one node to another to try to give each datanode the same amount of data (within a configurable threshold). This messes up HBase's data locality, meaning that a particular region may be serving a file that is no longer on its local host.
- HBase's balance_switch balances the cluster so that each regionserver hosts the same number of regions (or close to it). This is separate from Hadoop's (HDFS) balancer.
- If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, as it will cause certain regions to lose their data locality. Any request to such a region then has to go over the network to one of the datanodes serving its HFile.
HBase's data locality is recovered, though. Whenever a compaction occurs, all the blocks are copied to the regionserver serving that region and merged; at that point, data locality is recovered for that region. So all you really need to do to add new nodes to the cluster is add them: HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
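If you don't want to wait for compactions to happen on their own, you can trigger a major compaction from the HBase shell. A sketch follows; the RUN=echo dry-run guard is my own addition so the snippet is safe to run outside a cluster:

```shell
#!/bin/sh
# Sketch: force a major compaction to restore data locality after
# the HDFS balancer has moved blocks around.
# RUN defaults to "echo" (dry run, just prints the command);
# set RUN= (empty) on a real cluster to actually execute it.
RUN="${RUN:-echo}"
$RUN hbase shell <<'EOF'
major_compact 'hbase_test_table'
EOF
```

Note that a major compaction rewrites every HFile of the table, so on a large table this is heavy I/O; running it off-peak is usually the safer choice.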