Dump Huge CSV (or Delimited) Files into Elasticsearch


Today we will see how to dump huge CSV files, or any delimited files or logs, into Elasticsearch for better search capabilities or for further analysis.
For this tutorial we will be using Python and one of its powerful packages, esimport.
As Python is a platform-independent language, you can use Windows or Linux; for production, people generally use some flavour of Linux server,
for example CentOS or Ubuntu Server. But here, for this tutorial, I am using Windows, so the high-level stack is:

- Operating system: Windows 10 64-bit
- Python 2.7
- pip 8.1
- Elasticsearch 2.3.5 (the latest one at the time of writing this post)
- esimport 0.1.9 (again the latest one at the time of writing this post)

Prepare the Infrastructure:

So let’s start and prepare the infrastructure by installing all the required tools and packages we will be using.

So here are the steps:

1. Check that your Elasticsearch instance is up and running. If you don’t have one, please install it from elastic.co and run it. Verify this as shown in the screenshot below:

elasticsearch-running
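If you prefer the command line to a browser, the same check can be done with a request to port 9200, Elasticsearch's default HTTP port (curl shown here; on Windows you can simply open http://localhost:9200 in a browser):

```shell
# Elasticsearch listens on port 9200 by default; a running node answers
# with a small JSON banner containing the cluster name and version
# (if nothing is listening yet, the fallback message is printed instead)
curl -s http://localhost:9200 || echo "no Elasticsearch on port 9200"
```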

2. Install a copy of the elasticsearch-head plugin. If you don’t have one, check our post and install it, then verify as shown in the screenshot below:

elasticsearch-head-running

3. Install Python: install Python from Python.org

NOTE: Make sure your Python environment variables and PATH are set.

4. Install pip:
pip is basically a Python package installer; install it on your machine from here: https://pip.pypa.io/en/stable/installing/

The command to install pip on Windows:
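The original command listing is missing here, but the standard route (documented on the pip page linked above) is to download the official get-pip.py bootstrap script and run it with the Python you just installed:

```shell
# first download get-pip.py from https://bootstrap.pypa.io/get-pip.py,
# then run it; this installs pip into your Python installation
python get-pip.py
```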

Verify that you have Python and pip installed on your machine, as shown in the screenshot below:

check-python-pip-version
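The version checks themselves are just two commands:

```shell
# each prints a version line if the corresponding install succeeded
python --version
pip --version
```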

5. Install the esimport Python package using the command below:
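The install is a single pip command:

```shell
# installs the esimport package (and its dependencies) from PyPI
pip install esimport
```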

To verify that esimport is installed on your server, use the command below:
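Assuming the conventional --help flag (most Python command-line tools expose it; try -h if not):

```shell
# prints esimport's usage text; if this runs, the package is on your PATH
esimport --help
```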

The above is basically the help command for esimport; if esimport was installed successfully you will see something like the screenshot below:

esimport-help-check

Prepare your Data:

So here, infrastructure-wise, you have everything up and running for the Elasticsearch import. Now get your data, which will be a huge CSV or other delimited file, and put it in a folder. In my case
it is on my C: drive under the esimportdata folder, so the path is c:\esimportdata\ , and I have a decent-sized CSV in it, as shown in the screenshots below:

raw-data-csv

csv-preview
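For reference, the kind of file esimport expects is a plain header-plus-rows delimited file, where the header row supplies the field names. The file and column names below are made up purely for illustration:

```shell
# create a small sample in the same shape as the real data
# (hypothetical columns -- your CSV's header row defines the field names)
cat > sample.csv <<'EOF'
order_id,customer,amount,order_date
1001,alice,250.75,2016-08-01
1002,bob,99.99,2016-08-02
EOF

# sanity check: every row should have the same field count as the header
awk -F',' 'NR==1 {n=NF} NF!=n {exit 1}' sample.csv && echo "field counts ok"
# prints "field counts ok"
```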

Let’s Dump the Data into Elasticsearch:

The syntax for the esimport command is:
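The command listing itself did not survive in this copy, so the sketch below is an assumption based on common esimport usage: -s naming the source file and -n naming the target as host:port/index/type. Confirm the exact flags for your version with esimport --help or the project README.

```shell
# illustrative syntax only -- the flag names are assumptions, so confirm
# them with `esimport --help` before relying on this
esimport -s <path-to-your-file> -n <host>:<port>/<index>/<type>
```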

Start your command prompt, type the following command into it and hit Enter, as shown in the screenshot below:
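As a concrete illustration (the file name data.csv, the index/type names, and the flags are all assumptions for this sketch; the screenshot shows the real command), importing a file from c:\esimportdata\ might look like:

```shell
# hypothetical example: substitute your actual file, index, and type
# names, and verify the flags with `esimport --help`
esimport -s c:\esimportdata\data.csv -n localhost:9200/myindex/mytype
```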

esimport-command

So, we don’t need to create the index and type explicitly; esimport will create them for us automatically. After a successful import you will see something like the below on the command prompt:

esimport-command-triggered

Now, go to your browser where you have your elasticsearch-head plugin; you will be able to see all the data, with the newly created index and type.


After a successful import you can see the data in Elasticsearch, as shown in the screenshots below:

esimport-successful-import
Command prompt success message

elasticsearch-data-preview

elasticsearch data

Using a Custom Delimiter:

Of course, you can define your own delimiter to match your data; the esimport command will then look like the below:
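A sketch of what that can look like for a pipe-delimited file, assuming a -d delimiter option (the flag name, file name, and index/type are assumptions; verify the real options with esimport --help):

```shell
# -d is the assumed delimiter flag; here the file uses "|" instead of ","
esimport -s c:\esimportdata\data.txt -n localhost:9200/myindex/mytype -d "|"
```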

Data Refresh:
As I have mentioned, esimport is a very powerful package; using it you can refresh your data, deleting the old data and dumping in the new, with a single command. The syntax for clearing old data and
dumping new data is as follows:
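As an assumption based on typical import tools of this kind (verify the real option with esimport --help), the clear-and-reload run adds a single flag, shown as -c here, that deletes the existing documents before importing the file fresh:

```shell
# -c (assumed flag) clears the existing documents in the target
# index/type, then the file is imported fresh in the same run;
# file, index, and type names are hypothetical
esimport -s c:\esimportdata\data.csv -n localhost:9200/myindex/mytype -c
```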

Using Mapping for Data

You can use a custom mapping for your data, where you define the format and data type of each field in Elasticsearch. This is a very useful feature of esimport when it comes to date formats, and to defining other
data types while dumping data into Elasticsearch.

The general syntax for using a mapping file is given below:
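A mapping file is plain JSON describing each field. The field names below are hypothetical, and the -m flag in the final comment is an assumption (confirm with esimport --help); the property definitions themselves follow Elasticsearch 2.x mapping syntax:

```shell
# write a mapping file (hypothetical field names; Elasticsearch 2.x syntax,
# where strings use "index": "not_analyzed" for exact-match fields)
cat > mapping.json <<'EOF'
{
  "properties": {
    "order_date": { "type": "date",   "format": "yyyy-MM-dd" },
    "amount":     { "type": "double" },
    "customer":   { "type": "string", "index": "not_analyzed" }
  }
}
EOF

# then point esimport at it (assumed -m flag), e.g.:
# esimport -s c:\esimportdata\data.csv -n localhost:9200/myindex/mytype -m mapping.json
```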

Refer to the elastic.co reference for more on Elasticsearch-supported data types and mapping.

Worried about mapping for huge CSV column counts?

It seems like a tedious job, so let’s make it a little simpler. I will show you what I usually do when I have lots of columns to be mapped.

Use the Elasticsearch plugin called kopf to check the mapping.
Install it in your Elasticsearch instance from https://github.com/lmenezes/elasticsearch-kopf

elasticsearch-kopf-plugin-install

After successful installation, access it using the following URL: http://localhost:9200/_plugin/kopf

kopf-mapping

You will see something like the screenshot shown. Click the small “down arrow” icon right next to your index name, then click “show mapping”, and you will be able to see the mapping for your current data set.

kopf-preview-mapping
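If you'd rather not install a plugin, the same information is available from Elasticsearch's standard REST API (this endpoint is part of Elasticsearch itself, not an assumption):

```shell
# replace "myindex" with your index name; ?pretty formats the JSON response
curl "http://localhost:9200/myindex/_mapping?pretty"
```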

Now you can use it for your next data import command.

Reference links for esimport:

https://pypi.python.org/pypi/esimport
https://github.com/playnetwork/esimport
