ElasticSearch的存储结构


ElasticSearch的存储结构

Elasticsearch的存储结构

Node Data

Simply starting Elasticsearch from a empty data directory yields the following directory tree:

$ tree data
data
└── elasticsearch
    └── nodes
        └── 0
            ├── _state
            │   └── global-0.st
            └── node.lock

The node.lock file is there to ensure that only one Elasticsearch installation is reading/writing from a single data directory at a time.

More interesting is the global-0.st-file. The global-prefix indicates that this is a global state file while the .st extension indicates that this is a state file that contains metadata. As you might have guessed, this binary file contains global metadata about your cluster and the number after the prefix indicates the cluster metadata version, a strictly increasing versioning scheme that follows your cluster.

Index Data

Let’s create a single shard index and look at the files changed by Elasticsearch:

$ curl localhost:9200/foo -XPOST -d '{"settings":{"index.number_of_shards": 1}}'
{"acknowledged":true}

$ tree -h data
data
└── [ 102]  elasticsearch
    └── [ 102]  nodes
        └── [ 170]  0
            ├── [ 102]  _state
            │   └── [ 109]  global-0.st
            ├── [ 102]  indices
            │   └── [ 136]  foo
            │       ├── [ 170]  0
            │       │   ├── .....
            │       └── [ 102]  _state
            │           └── [ 256]  state-0.st
            └── [   0]  node.lock

We see that a new directory has been created corresponding to the index name. This directory has two sub-folders: _state and 0. The former contains what’s called a index state file (indices/{index-name}/_state/state-{version}.st), which contains metadata about the index, such as its creation timestamp. It also contains a unique identifier as well as the settings and the mappings for the index. The latter contains data relevant for the first (and only) shard of the index (shard 0). Next up, we’ll have a closer look at this.

Shard Data

The shard data directory contains a state file for the shard that includes versioning as well as information about whether the shard is considered a primary shard or a replica.

$ tree -h data/elasticsearch/nodes/0/indices/foo/0
data/elasticsearch/nodes/0/indices/foo/0
├── [ 102]  _state
│   └── [  81]  state-0.st
├── [ 170]  index
│   ├── [  36]  segments.gen
│   ├── [  79]  segments_1
│   └── [   0]  write.lock
└── [ 102]  translog
    └── [  17]  translog-1429697028120

The {shard_id}/index directory contains files owned by Lucene. Elasticsearch generally does not write directly to this folder. The files in these directories constitute the bulk of the size of any Elasticsearch data directory.

Before we enter the world of Lucene, we’ll have a look at the Elasticsearch transaction log, which is unsurprisingly found in the per-shard translog directory with the prefix translog-. The transaction log is very important for the functionality and performance of Elasticsearch, so we’ll explain its use a bit closer in the next section.

Per-Shard Transaction Log

The Elasticsearch transaction log makes sure that data can safely be indexed into Elasticsearch without having to perform elasticsearch flush which actually triggers a low-level Lucene commit for every document. Committing a Lucene index creates a new segment on the Lucene level which is fsync()-ed and results in a significant amount of disk I/O which affects performance.

In order to accept a document for indexing and make it searchable without requiring a full Lucene commit, Elasticsearch adds it to the Lucene IndexWriter and appends it to the transaction log. After each refresh_interval it will call elasticsearch refresh which actually triggers reopen() on the Lucene indexes and will make the data searchable without requiring a commit. This is part of the Lucene Near Real Time API. When the IndexWriter eventually commits due to either an automatic flush of the transaction log or due to an explicit flush operation, the previous transaction log is discarded and a new one takes its place.

Should recovery be required, the segments written to disk in Lucene will be recovered first, then the transaction log will be replayed in order to prevent the loss of operations not yet fully committed to disk.

Lucene Index Files

Lucene has done a good job at documenting the files in the Lucene index directory, reproduced here for your convenience:

Name Extension Brief Description
Segments File segments_N Stores information about a commit point
Lock File write.lock The Write lock prevents multiple IndexWriters from writing to the same file.
Segment Info .si Stores metadata about a segment
Compound File .cfs, .cfe An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles.
Fields .fnm Stores information about the fields
Field Index .fdx Contains pointers to field data
Field Data .fdt The stored fields for documents
Term Dictionary .tim The term dictionary, stores term info
Term Index .tip The index into the Term Dictionary
Frequencies .doc Contains the list of docs which contain each term along with frequency
Positions .pos Stores position information about where a term occurs in the index
Payloads .pay Stores additional per-position metadata information such as character offsets and user payloads
Norms .nvd, .nvm Encodes length and boost factors for docs and fields
Per-Document Values .dvd, .dvm Encodes additional scoring factors or other per-document information.
Term Vector Index .tvx Stores offset into the document data file
Term Vector Documents .tvd Contains information about each document that has term vectors
Term Vector Fields .tvf The field level info about term vectors
Live Documents .liv Info about what files are live

When using the Compound File format (default in 1.4 and greater) all these files (except for the Segment info file, the Lock file, and Deleted documents file) are collapsed into a single .cfs file.