
The Simplest Way to Generate CSV Output From Hive in Linux Shell


If you are wondering, the easiest way (at least IMHO) to generate Hive query output in an Excel-compatible CSV format, without modifying any table or using a 3rd-party Java plugin, is:

hive -e "SELECT col1, col2, … FROM table_name" | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' > output_file.csv

I know you can also use awk or other shell commands, but perl regex is very POWERFUL and FAST.
I got this perl regex tip from a Stack Overflow link some time ago (I will put the link here once I remember it), and this method worked for me to convert the standard tab-separated output into CSV-compatible format 😉
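
To see what the one-liner actually does, here is a quick sanity check on a made-up tab-separated row (the sample data is purely illustrative): the first substitution escapes any embedded double quotes, the second wraps the line in quotes, and the third turns each tab into ",".

printf 'john\t42\tNew "York"\n' | perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g'

That prints "john","42","New \"York\"", which is exactly the transformation applied to every row coming out of hive -e.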


Export Data From Cassandra to CSV


Because I needed to move Cassandra data from an old cluster to a new one, I built a tool to help myself export Cassandra data to CSV.

Updated: The code is pushed to GitHub.

Why did I have to build this tool?

Because the free version of Cassandra does not ship a tool that "actually works" for backing up the data inside Cassandra.
Well, there are a few tools out there, but if your data is large they choke, so I had to build my own by feeling my way through the DataStax connection library for Cassandra.

Since my Java skills are still shallow, sorry if the code looks a bit clumsy. But I have tested it pulling tables with tens of GB of data and millions of rows without any problem, so if anyone needs it, just compile it yourself against the DataStax Java driver library for Cassandra.

package lemonade.dumpCassandra;

import java.text.SimpleDateFormat;
import java.util.Iterator;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ColumnDefinitions.Definition;
import com.datastax.driver.core.DataType;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;


/**
 * Dump Data from Cassandra to CSV
 * 2015/01/19
 * by sphinxid <firman.gautama@gmail.com>
 *
 */
public class CassExport
{
    public static void main( String[] args )
    {
        String keyspace = "YourKeyspace";
        String table = "TableName";
        String username = "username";
        String password = "password";
        String host = "127.0.0.1";

        Cluster.Builder clusterBuilder = Cluster.builder()
                .addContactPoints(host)
                .withCredentials(username, password);
        Cluster cluster = clusterBuilder.build();
        Session session = cluster.connect(keyspace);

        // Page through the table 1000 rows at a time so large tables
        // do not blow up the heap.
        Statement stmt = new SimpleStatement("SELECT * FROM " + table);
        stmt.setFetchSize(1000);
        ResultSet rs = session.execute(stmt);
        Iterator<Row> iter = rs.iterator();

        while (iter.hasNext())
        {
            // Pre-fetch the next page in the background when the
            // current one is almost drained.
            if (rs.getAvailableWithoutFetching() <= 100 && !rs.isFullyFetched())
            {
                rs.fetchMoreResults();
            }

            Row row = iter.next();
            StringBuilder line = new StringBuilder();
            for (Definition key : row.getColumnDefinitions().asList())
            {
                String val = myGetValue(key, row);
                line.append("\"");
                line.append(val);
                line.append("\"");
                line.append(",");
            }
            line.deleteCharAt(line.length() - 1); // drop the trailing comma
            System.out.println(line.toString());
        }

        session.close();
        cluster.close();
    }

    // Convert a single column value to its string representation,
    // based on the column's CQL type.
    public static String myGetValue(Definition key, Row row)
    {
        String str = "";

        if (key != null)
        {
            String col = key.getName();

            try
            {
                if (key.getType() == DataType.cdouble())
                {
                    str = Double.toString(row.getDouble(col));
                }
                else if (key.getType() == DataType.cint())
                {
                    str = Integer.toString(row.getInt(col));
                }
                else if (key.getType() == DataType.uuid())
                {
                    str = row.getUUID(col).toString();
                }
                else if (key.getType() == DataType.cfloat())
                {
                    str = Float.toString(row.getFloat(col));
                }
                else if (key.getType() == DataType.timestamp())
                {
                    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ");
                    str = fmt.format(row.getDate(col));
                }
                else
                {
                    str = row.getString(col);
                }
            }
            catch (Exception e)
            {
                // Null or unreadable values become an empty string.
                str = "";
            }
        }

        return str;
    }

}

Or grab it from the pastebin.
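
If you want to run it yourself, a rough compile-and-run sketch against the DataStax Java driver 2.x could look like the following (the jar names and versions here are assumptions, adjust them to whatever you actually downloaded; the driver also needs its guava, netty, metrics and slf4j dependencies on the classpath). The rows go to stdout, so just redirect them into a .csv file:

javac -cp cassandra-driver-core-2.1.9.jar -d . CassExport.java
java -cp .:cassandra-driver-core-2.1.9.jar:guava-16.0.1.jar:netty-3.9.0.Final.jar:metrics-core-3.0.2.jar:slf4j-api-1.7.5.jar lemonade.dumpCassandra.CassExport > table_dump.csv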

#Update

#Benchmark Speed
22 GB of data, ~122 million rows.
Extracted in 444m38.061s.
- single Cassandra host (4 cores, 8 GB RAM, SATA HDD).
- avg ~4.5k rows / second.