Import / Export CSV data to/from a SparkR DataFrame


Reading and writing CSVs is a very common activity in any data or big data project, so today we will see how to read a CSV and dump its data into a SparkR DataFrame, as well as how to write data from a SparkR DataFrame back to CSV. There are two ways to do it: Spark version >= 2.0.x has built-in support for reading and writing CSV through the read.df() and write.df() functions respectively, while the other method requires downloading a JAR file called spark-csv.

Method I:
Direct read / write CSV to/from a SparkR DataFrame (only Spark >= 2.0.x supports this method):

The code below shows how to read a CSV and write it back:
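Here is a minimal sketch of Method I. The Spark home path (F:/spark-2.1.0), the input file data.csv, and the column names name and age are assumptions for illustration; adjust them to your setup.

# Point R at the SparkR library bundled with the Spark installation
Sys.setenv(SPARK_HOME = "F:/spark-2.1.0")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Initialize the SparkR session
sparkR.session(master = "local[*]")

# Read the CSV into a SparkR DataFrame; header = "true" treats the first row as column names
df <- read.df(path = "F:/data/data.csv", source = "csv", header = "true")

# Select a subset of the columns (column names are hypothetical)
selected <- select(df, "name", "age")

# Write the selected columns back out as CSV; mode = "overwrite" replaces existing output
write.df(selected, path = "F:/data/output.csv", source = "csv", mode = "overwrite")

# Stop the session when done
sparkR.session.stop()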

In the above code,
1. We first set the SparkR path, then load the SparkR library and initialize a SparkR session.

2. We read the data from the CSV and dump it into a SparkR DataFrame using the read.df() function. In read.df(), path is the location of the CSV file to read, source is "csv" because that is the format we are reading, and header = "true" because we want the first row to be read as column names.

3. We select some of the columns from the SparkR DataFrame using the select() function.

4. We write the newly selected columns to a new CSV file using the write.df() function. The first parameter of write.df() is the SparkR DataFrame to write, followed by the path of the new CSV; mode = "overwrite" means the file will be overwritten if it already exists. Note that the output lands in a new folder with the same name as the filename/path given to write.df(). A quick way to sanity-check the DataFrames is shown after this list.
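As a quick check, assuming the df and selected DataFrames from the sketch above, you can inspect what was read:

# Show the column names and types Spark picked up from the header
printSchema(df)

# Pull the first few rows of the selected columns back as a local R data.frame
head(selected)

# Count the rows in the SparkR DataFrame
count(selected)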

Method II:
Using the spark-csv JAR file:

1. Download the spark-csv JAR from mvnrepository.com.
2. Put the JAR file (spark-csv_2.10-0.1.jar, in my case) in your <spark-home>/jars/ folder (which is F:\spark-2.1.0\jars, in my case).

3. Now we use "com.databricks.spark.csv", which is a class inside the spark-csv JAR, as the source in our read.df() and write.df() functions. The new code will look like this:
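A minimal sketch of Method II, assuming the same paths and hypothetical column names as in Method I, and that the spark-csv JAR is already in the jars folder:

Sys.setenv(SPARK_HOME = "F:/spark-2.1.0")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]")

# Same flow as Method I, but with the spark-csv class as the source
df <- read.df(path = "F:/data/data.csv", source = "com.databricks.spark.csv", header = "true")
selected <- select(df, "name", "age")
write.df(selected, path = "F:/data/output.csv", source = "com.databricks.spark.csv", mode = "overwrite")

sparkR.session.stop()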

Another benefit of Method II, i.e. using spark-csv, is that you get a lot of flexibility while reading CSVs, because it accepts several more parameters, such as inferSchema = "true" when you want the schema of the CSV to be inferred dynamically while reading the actual file, along with header, charset, dateFormat, escape, quote, delimiter, and many more. If you want to explore further, look at the spark-csv GitHub repository.
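As a sketch of those options, the snippet below passes a few of them to read.df(); the option names come from the spark-csv documentation, while the values (semicolon delimiter, date pattern) are made up for illustration:

df <- read.df(path = "F:/data/data.csv",
              source = "com.databricks.spark.csv",
              header = "true",           # first line holds column names
              inferSchema = "true",      # detect column types while reading
              delimiter = ";",           # field separator other than the default comma
              dateFormat = "yyyy-MM-dd", # pattern used to parse date columns
              quote = "\"",              # character that quotes fields
              escape = "\\")             # character that escapes quotes inside fields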
