Understanding Elasticsearch

What is Elasticsearch?

Elasticsearch is an open source analytics and full-text search engine built on Apache Lucene. It performs powerful text searches using a distributed inverted index and can also query structured data for analytics platforms. It is written in Java and exposes a REST API for indexing and querying documents.
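
To make the idea of an inverted index more concrete, here is a minimal sketch in Python. It is a toy illustration of the data structure only, not how Lucene implements it; the function names and the whitespace tokenizer are assumptions made for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "After Hours",
    2: "Beauty Behind the Madness",
    3: "House of Balloons",
}
index = build_inverted_index(docs)
print(sorted(search(index, "house balloons")))  # prints [3]
```

Because the index maps terms to documents rather than documents to terms, a lookup touches only the entries for the query terms instead of scanning every document.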

Why Elasticsearch?

With Elasticsearch we can query many kinds of data, including unstructured text, geographical locations and metrics.
Thanks to its distributed architecture, Elasticsearch can search very large indices (billions of records) quickly, scales out across nodes, and can run many searches at the same time.
Multilingual support is available through the ICU plugin, which builds on Lucene's implementation of the Unicode text segmentation standard.
Its powerful full-text search helps in building search features such as autocompletion, typo correction, match highlighting, relevance ranking, synonym matching and many more.
Elasticsearch is not really a business intelligence solution, but you can still get a lot of valuable information out of the data stored in it, which is useful for analytics platforms that analyse large amounts of data.
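
As an illustration of two of the features above, typo tolerance and match highlighting, the following sketch builds a search request with Python's standard library. The "fuzziness" option of the match query and the "highlight" section are part of Elasticsearch's query DSL; the host http://localhost:9200, the index name "artists" and the field name "Albums" are assumptions made for this example. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build a POST /<index>/_search request with a fuzzy match and highlighting."""
    body = {
        "query": {
            "match": {
                # "AUTO" lets Elasticsearch pick the allowed edit distance
                # based on the length of each term.
                field: {"query": text, "fuzziness": "AUTO"}
            }
        },
        # Ask for the matching fragments of this field to be returned.
        "highlight": {"fields": {field: {}}},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index and field; "Strboy" is a deliberate typo.
request = build_search_request("http://localhost:9200", "artists", "Albums", "Strboy")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Against a running cluster this could be sent with urllib.request.urlopen(request).
```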

How does Elasticsearch work?

To understand the architecture of Elasticsearch we need to know about nodes and clusters, the building blocks of a deployment. A cluster is a collection of nodes, where each node is a single server that stores part of the data added to the cluster as a whole. The nodes work together, forwarding requests from one node to another over the transport layer. The cluster and each node in it are identified by unique names.

In Elasticsearch, data is stored as documents, where a document is simply a unit of information. A document corresponds to a row in a relational database, and the fields it contains correspond to columns. A document is essentially a JSON object. An example document:

{
  "First_name": "Abel",
  "Last_name": "Tesfaye",
  "Albums": ["After Hours", "Starboy", "Beauty Behind the Madness", "House of Balloons"]
}
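
A document like this is stored by sending it to the REST API. The sketch below builds such a request with Python's standard library, using the PUT /<index>/_doc/<id> endpoint. The host http://localhost:9200, the index name "artists" and the document ID 1 are assumptions made for this example, and the request is only constructed, not sent.

```python
import json
import urllib.request

document = {
    "First_name": "Abel",
    "Last_name": "Tesfaye",
    "Albums": ["After Hours", "Starboy", "Beauty Behind the Madness", "House of Balloons"],
}


def build_index_request(base_url, index, doc_id, doc):
    """Build a PUT /<index>/_doc/<id> request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, index name and document ID.
request = build_index_request("http://localhost:9200", "artists", 1, document)
print(request.get_method(), request.full_url)
# Against a running cluster this could be sent with urllib.request.urlopen(request).
```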

A collection of logically related documents is called an index. Just as nodes are grouped into a cluster, documents are organized into indices. An index in Elasticsearch corresponds to a table in a relational database. Just as nodes and clusters are identified by unique names, each index has a unique name and each document has a unique ID within its index.

As noted earlier, Elasticsearch is very scalable thanks to its distributed architecture. Let's see how sharding helps it handle large indices. Say we have an index of size 1TB and two nodes with 512GB of storage each. The index cannot be stored on a single node and needs to be distributed between the two. So when an index is too large for the hardware limits of one node, its data is broken into pieces called shards; this process is called sharding.
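
One way to picture how documents end up in different shards is to hash each document ID and take the result modulo the number of shards. The sketch below is a simplified illustration of that idea, not Elasticsearch's actual routing code: the CRC32 hash, the function name shard_for and the document IDs are stand-ins chosen for this example.

```python
import zlib
from collections import Counter


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document ID (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


num_shards = 4  # hypothetical number of shards for the index
placement = Counter(shard_for(f"doc-{i}", num_shards) for i in range(10_000))

# Show how many of the 10,000 hypothetical documents land on each shard.
for shard, count in sorted(placement.items()):
    print(f"shard {shard}: {count} documents")
```

The same ID always maps to the same shard, so a document can be found again without searching every shard. It also hints at why the number of shards is normally chosen when the index is created: changing the modulus would send existing IDs to different shards.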

The shards are distributed across the nodes, and when we want to add more data to the cluster we can do so by adjusting the distribution of shards between the nodes; this is what makes Elasticsearch highly scalable. Another advantage of the distributed architecture is that both nodes can be searched in parallel, which improves search performance.
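
The parallel search described above can be sketched as a scatter-gather step: the same query is sent to every shard at once and the partial results are merged. This is a toy model in Python with in-memory dictionaries standing in for shards; the function names and the data layout are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_docs, term):
    """Return the IDs of documents in one shard whose text contains the term."""
    term = term.lower()
    return [doc_id for doc_id, text in shard_docs.items() if term in text.lower().split()]


def search_all_shards(shards, term):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, term), shards)
        merged = [doc_id for hits in partial_results for doc_id in hits]
    return sorted(merged)


# Two hypothetical shards, each holding part of the index.
shards = [
    {1: "After Hours", 2: "Starboy"},
    {3: "Beauty Behind the Madness", 4: "House of Balloons"},
]
print(search_all_shards(shards, "the"))  # prints [3]
```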

End Note:

Elasticsearch is used by large companies and projects such as Quora, Adobe, Facebook, Firefox, Netflix, SoundCloud and Stack Exchange. There are many other users, and there is a vibrant community around it. As a developer, you can use Elasticsearch just by knowing how to write search queries, without knowing any of the above; still, it helps shed some light on how Elasticsearch works.
