Month: September 2016
The tools you will need are lbzip2, pigz and split.
lbzip2 => http://lbzip2.org/
pigz => http://www.zlib.net/pigz/
If you are using Ubuntu (I'm using 14.04 LTS), you can easily install lbzip2 and pigz with apt-get (or aptitude). split is part of GNU coreutils, so it is already installed.
$ sudo apt-get install lbzip2 pigz
Or you could download the source code from their websites and compile them manually.
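Whichever way you install them, it is worth checking that all three tools are on your PATH before starting a long job. Here is a small sketch of such a check; the helper name missing_tools is my own and not part of any of the tools.

```shell
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper name used for this sketch.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools lbzip2 pigz split)
if [ -n "$missing" ]; then
    echo "not installed:" $missing
else
    echo "all tools found"
fi
```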
Let's say you have a 500 GB bzip2-compressed text file called file01.txt.bz2 and you want to split it into multiple gzipped files of 1500000 lines each, so that it can be processed faster in your Hadoop cluster.
$ lbzcat file01.txt.bz2 | split -d -a 10 -l1500000 --filter='pigz > newfile01-$FILE.gz'
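Before running this on 500 GB, you may want to try the same pipeline on a tiny sample and check that no lines get lost. Below is a rough sketch of such a dry run. The sample size (10 lines), the chunk size (4 lines), the sample-* file names, and the fallback to plain bzip2/gzip when lbzip2/pigz are missing are all my own assumptions for the test. It relies on GNU split, because --filter is a GNU extension.

```shell
#!/bin/sh
# Scaled-down dry run of the split pipeline on a throwaway sample.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Assumption: fall back to the single-threaded tools if the parallel
# ones are not installed, so the dry run still works.
if command -v lbzip2 >/dev/null 2>&1; then BZ=lbzip2; else BZ=bzip2; fi
if command -v pigz >/dev/null 2>&1; then GZ=pigz; else GZ=gzip; fi
export GZ   # the --filter command runs in a child shell and needs this

# Build a 10-line bzip2-compressed sample file.
seq 1 10 | "$BZ" -c > sample.txt.bz2

# Same idea as the real command: decompress to stdout, cut into chunks
# of 4 lines, and compress each chunk. split sets $FILE to the name it
# would have written (default prefix "x" plus a numeric suffix).
"$BZ" -dc sample.txt.bz2 |
    split -d -a 3 -l 4 --filter='"$GZ" > sample-$FILE.gz'

ls sample-*.gz

# Decompress all chunks again and count the lines.
total=$(cat sample-*.gz | "$GZ" -dc | wc -l)
echo "lines after split: $total"
```

If the last line reports the same number of lines that went in, the pipeline behaves as expected and you can move on to the real file.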