Learning OpenSearch® from scratch: Part 1

I recently got an opportunity I’ve been wanting for a while now: to learn a lot more about OpenSearch®.

This blog is a record of my research of sorts, and hopefully a guide for those looking to learn more about OpenSearch. I’ll walk through my learning path: what OpenSearch is, the lingo, the basics of indexing and sharding, dashboards and analytics, and finally what makes good OpenSearch data.

But the first stop on my journey was clear: What even is OpenSearch? What does it do? What problems does it solve?

What is OpenSearch?

I started at the OpenSearch documentation, where it became immediately clear that OpenSearch is a search and analytics platform with a dashboard suite. These are the basic facts I dug up in a few minutes:

It runs in a distributed manner, using multiple machines to store and process data
It is highly tuneable and configurable for different processing needs
It’s open source under the Apache 2.0 license and it’s built on top of Apache Lucene™ (a search engine)
OpenSearch is commonly used for document search

I also found that researching OpenSearch led me to a lot of results for Elasticsearch, which made me curious. It turns out that the history of OpenSearch is tied to Elasticsearch, and OpenSearch wouldn’t exist without it.

History of OpenSearch

OpenSearch began in 2021 when Elasticsearch changed licenses from the open source Apache 2.0 license to the Server-Side Public License. The team at AWS, along with several other organizations, responded by forking Elasticsearch and naming this new fork OpenSearch.

OpenSearch remains under the Apache 2.0 software license. The projects have remained separate, and at the time of my research I got varying answers about their compatibility. My conclusion is that earlier versions of OpenSearch are compatible with the corresponding versions of Elasticsearch, but users should expect compatibility to diminish over time.

The OpenSearch project joined the Linux Foundation in September 2024, and before that was stewarded by AWS and its employees while still being open to community contributions.

What OpenSearch does and the problems it solves

Depending on the data set you’re feeding it, OpenSearch can do a few different things. At its core, however, OpenSearch is a data search and analytics engine built on the Apache Lucene search engine. It lets you store data in a way that makes it easier and more efficient to search, and provides frameworks for writing queries against that data.

One of the most common use cases for OpenSearch is document search: you load your documentation, your support pages, etc., into OpenSearch, which then allows users to search with a search bar that triggers a query in OpenSearch for the requested data.
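
To make the search-bar flow a little more concrete, here is a minimal sketch of the request body such a search might send. The index name `support-pages` and the `title` and `body` fields are hypothetical; `multi_match` is part of the OpenSearch query DSL, and a body like this would be sent to the index's `_search` endpoint.

```python
import json

def build_search_request(user_text, size=10):
    """Build a query DSL body for a full-text search across two fields."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "multi_match": {
                "query": user_text,
                # Hypothetical field names; "^2" boosts matches in the title.
                "fields": ["title^2", "body"],
            }
        },
    }

request_body = build_search_request("reset my password")
# This JSON would be POSTed to /support-pages/_search (hypothetical index name).
print(json.dumps(request_body, indent=2))
```

Nothing here talks to a cluster; it only shows the shape of the query that the search bar would trigger.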

However, there are a few other use cases that prove very useful.

A great example of this is anomaly detection: OpenSearch can look for and alert you to anomalies in time-series data. This can be very useful in situations where, for instance, you need to aggregate logs from several different observers and make sure they are all running normally. The analytics platform also makes analyzing metrics a breeze, and helps you create visualizations that communicate your data effectively.
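
To give a feel for what flagging an anomaly in time-series data means, here is a deliberately simple stand-in: a trailing-window z-score check in plain Python. This is not how OpenSearch does it (its anomaly detection plugin is based on the Random Cut Forest algorithm); the function name, window size, threshold, and sample data are all assumptions for illustration.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical errors-per-minute counts with one obvious spike at index 7.
errors_per_minute = [4, 5, 4, 6, 5, 5, 4, 42, 5, 4]
spikes = find_anomalies(errors_per_minute)
```

With this sample data, `spikes` contains only index 7, the minute with 42 errors.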

Before I could dig deeper into OpenSearch’s inner workings, there was some lingo I first had to learn, and I think you should, too. Without knowing these terms, OpenSearch can appear a bit opaque at first. It’s a small vocabulary, but well worth explaining.

Learning the lingo

You might think of OpenSearch as a data store; that’s definitely a part of it, and this terminology helps you understand how your data is stored and searched.

Cluster: A cluster is a set of machines working together on the same OpenSearch data/tasks. You connect to an OpenSearch cluster, which manages its data and tasks. A single machine in a cluster is called a node.
Index: An index is a group of data that (ideally) all relates to each other. If you’re still thinking in database terms, think of this as akin to, but not exactly like, a table.
Document: A document is the smallest atomic stored unit of OpenSearch data. It’s a JSON document that lives in an index. In database terms, this would be a row. A document contains fields.
Field: A field is a named value in a document; think of it like a column in database terms.
Mappings: Mappings are a list of all of the data fields in all the documents in an index. You can set mappings in advance for document creation, do dynamic mappings, and more. Think of this as a definition table of every field you can find in that index.
- Keep in mind that mappings are not necessarily consistent across all documents in an index; one document might have a field that another does not.
Shard: A shard is a piece of an index, holding some (but not all) of its data. Shards are the units of replication in OpenSearch; they are replicated and stored across different machines in your cluster to ensure redundancy should a node go down.
- Primary: A primary shard is a shard containing original data.
- Replica: A replica shard contains a copy of the data from a primary shard.
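
These terms map fairly directly onto the JSON you would send to a cluster. As a rough sketch: the index name, field names, and shard counts below are made up, while `number_of_shards`, `number_of_replicas`, and the `text`, `keyword`, and `date` field types are standard OpenSearch settings and mapping types.

```python
import json

# Body for creating a hypothetical "support-tickets" index
# (it would be sent as PUT /support-tickets).
create_index_body = {
    "settings": {
        "index": {
            "number_of_shards": 2,    # primary shards the index is split into
            "number_of_replicas": 1,  # replica copies of each primary shard
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},       # analyzed for full-text search
            "status": {"type": "keyword"},   # exact-match value
            "created_at": {"type": "date"},
        }
    },
}

# A document is a JSON object whose keys are fields.
document = {
    "title": "Cannot log in after password reset",
    "status": "open",
    "created_at": "2024-09-16T10:15:00Z",
    "priority": 2,  # not declared in the mapping above
}

# Fields present in the document but absent from the explicit mapping
# are the ones dynamic mapping would have to pick up.
unmapped = set(document) - set(create_index_body["mappings"]["properties"])
print(json.dumps(create_index_body, indent=2))
```

Here `unmapped` ends up holding just `priority`, which illustrates how one document can carry a field that the declared mappings (and other documents) do not.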

Now that you speak the OpenSearch language, I’ll cover one of the most important topics in OpenSearch: indexing.

Indexing 101

The first rule of OpenSearch: documents must be indexed to be searched.

Indices are extremely important to OpenSearch documents, because the index is how documents are retrieved when your application calls the OpenSearch API. A common analogy is a book index: each term gets its own entry listing where that term can be found.
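
The book-index analogy can be sketched in a few lines of Python. This is a toy inverted index under simplifying assumptions (lowercasing and whitespace splitting stand in for a real analyzer, and the sample documents are invented); Lucene's actual data structures are far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Reset your password",
    2: "Password rules and requirements",
    3: "Reset the router",
}
index = build_inverted_index(docs)
matches = search(index, "reset password")
```

Searching for "reset password" only has to look up two terms and intersect their document sets, which leaves document 1; no document text is scanned at query time.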

While I learned a lot about why indexing is important, I also learned that some thought needs to go into how you store your data:

What size will the index be in terms of number of documents?
- Large indices take longer to search, and it’s harder for OpenSearch to narrow down searches if your data isn’t very granular.
What size documents will be stored and searched for in the index?
- Same problem as before, but amplified: text search takes much longer on larger pieces of data.
Do you de-normalize data across multiple tables, or do multiple queries?
- You need to weigh the cost of de-normalizing versus the cost of multiple queries or, even worse, multiple API calls.
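
The de-normalization trade-off is easier to see with data in hand. In this sketch the author and article records and their field names are invented; it only contrasts the two document shapes and says nothing about real query performance.

```python
# Normalized: two separate collections, joined by the application.
authors = {"a1": {"name": "Sam", "team": "Docs"}}
articles = [{"id": "p1", "title": "Sharding basics", "author_id": "a1"}]

def fetch_with_two_lookups(article_id):
    """Simulate two round trips: one for the article, one for its author."""
    article = next(a for a in articles if a["id"] == article_id)
    author = authors[article["author_id"]]  # second lookup
    return {"title": article["title"], "author_name": author["name"]}

def denormalize(article):
    """Copy author fields into the article so one query returns everything."""
    author = authors[article["author_id"]]
    return {
        "id": article["id"],
        "title": article["title"],
        "author_name": author["name"],
        "author_team": author["team"],
    }

denormalized_docs = [denormalize(a) for a in articles]
```

The de-normalized document answers the same question in a single lookup, at the cost of duplicated author data that has to be re-indexed whenever an author changes.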

Creating the right indexing for your data is not a one-size-fits-all task, and that is part of why OpenSearch works for so many. There are a lot of tools and analytics to help you measure and re-index to get your searches just right.

Once you’ve indexed your data, you’ll need to think about the replication of that data to prevent missing data if a node goes down.

Replicating data with sharding

After your data has been stored, OpenSearch will begin sharding and replicating it.

As I mentioned in the lingo section above, this means the data is split into chunks called shards, which are copied across your cluster. The copies are arranged so that if a node goes down, the primary and replica shards on the remaining nodes can recover with no lost data.

Here’s a diagram to help visualize this: the stars are shards and their color is their index (Blue, Purple, Green, Orange). The letter is P for primary or R for replica, and the number is the shard ID.

In this diagram, we have four indices, each with 2 primary and 2 replica shards, spread over 4 nodes:

This arrangement makes it so if one node falls off the network, has a disk failure, etc., the cluster will be able to use the shards on the remaining nodes to piece together the data on the lost node.

For instance, let’s say Node 1 went offline. What’s missing is:

Index Blue
- Shard 1 Replica: Shard 1 Primary is on Node 4 and could be replicated to replace Shard 1 Replica
Index Purple
- Shard 1 Replica: Shard 1 Primary is on Node 2 and could be replicated to replace Shard 1 Replica
Index Orange
- Shard 2 Replica: Shard 2 Primary is on Node 2 and could be replicated to replace Shard 2 Replica
Index Green
- Shard 1 Primary: Shard 1 Replica is on Node 3 and could be replicated to replace Shard 1 Primary
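
The same reasoning can be simulated in a few lines of Python. This is a toy model, not OpenSearch's actual allocator: the round-robin placement, the node names, and both function names are assumptions, so the layout will not match the diagram shard for shard. It only demonstrates the invariant that matters, which is that a primary and its replica never share a node.

```python
def allocate(indices, shards_per_index, nodes):
    """Place one primary and one replica per shard, never on the same node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement = []
    slot = 0
    for index in indices:
        for shard in range(1, shards_per_index + 1):
            primary_node = nodes[slot % len(nodes)]
            replica_node = nodes[(slot + 1) % len(nodes)]
            placement.append({"index": index, "shard": shard, "role": "P", "node": primary_node})
            placement.append({"index": index, "shard": shard, "role": "R", "node": replica_node})
            slot += 1
    return placement

def fail_node(placement, dead_node):
    """Drop copies on the dead node and promote replicas whose primary was lost."""
    survivors = [dict(c) for c in placement if c["node"] != dead_node]
    lost = [c for c in placement if c["node"] == dead_node]
    for copy in lost:
        if copy["role"] == "P":
            for s in survivors:
                if (s["index"], s["shard"]) == (copy["index"], copy["shard"]):
                    s["role"] = "P"  # surviving replica takes over as primary
    return survivors, lost

nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = allocate(["blue", "purple", "green", "orange"], 2, nodes)
survivors, lost = fail_node(placement, "node-1")

# Every (index, shard) pair should still have a primary somewhere.
primaries = {(c["index"], c["shard"]) for c in survivors if c["role"] == "P"}
```

After losing `node-1`, four shard copies are gone, but all eight shards still have a primary on the remaining nodes, so no data is lost. What remains is to rebuild the missing copies elsewhere.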

Luckily, you don’t have to worry about sharding too much, unless you want to; there is endless tweaking to be done on sharding and storage.

Once your data is indexed and replicated, you might want to build some shiny new graphs and dashboards to show it off. You’re in luck: OpenSearch has you covered.

Dashboards and analytics

The OpenSearch platform contains a dashboards suite that you can use to craft graphs and dashboards for your OpenSearch data, including cluster analytics so you can see how your OpenSearch cluster is running.

You can create very sophisticated setups using very powerful tools, including OpenSearch’s own query language, Piped Processing Language (PPL)!
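
As a small taste of PPL, the sketch below assembles a query as a pipeline of commands in Python. The `web_logs` index and its `status` and `host` fields are hypothetical, as is the `ppl` helper; `source`, `where`, `stats`, and the `|` pipe are PPL syntax, and such queries are submitted as a JSON body with a `query` key to the `_plugins/_ppl` endpoint.

```python
import json

def ppl(source, *commands):
    """Join a source and a sequence of PPL commands with pipes."""
    return " | ".join([f"source={source}", *commands])

query = ppl(
    "web_logs",                  # hypothetical index name
    "where status >= 500",       # keep only server errors
    "stats count() by host",     # count errors per host
)

# Body that would be POSTed to the _plugins/_ppl endpoint.
request_body = json.dumps({"query": query})
```

Each stage feeds its output to the next, which is what makes PPL read like a shell pipeline over your index.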

What makes for good OpenSearch data?

Let’s take a step back. You may be wondering “what kind of data would I put into OpenSearch?” Here are a few use cases:

Logs and time-series data make excellent OpenSearch data, as you can leverage tools like anomaly detection. One very interesting use case is aggregating logs and data from different systems and analyzing them as a group: this 10,000-foot view can show some very interesting trends.
Analytics data allows you to create beautiful dashboards and reports that help everyone understand what your data is saying.
And of course, document stores such as documentation for code bases and support pages for services are the bread and butter of the OpenSearch world.

Conclusion

Today I covered what OpenSearch is, the vocabulary used to talk about OpenSearch, how indexing works, the dashboard and analytics suite, and what kinds of data go well with OpenSearch. In my next post, I’ll go further into indexing and searching data.

If you’d like to learn more about OpenSearch yourself, you can check out Learn Enough to Chat About OpenSearch or this video about Getting Started with OpenSearch on NetApp Instaclustr. If you’d like to try it for yourself, you can start up an Instaclustr free trial and provision your own OpenSearch cluster.
