Elasticsearch

게임이 더 좋아 2023. 4. 1. 10:41

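Definition

A highly available, scalable, document-oriented data store.

It was originally used mainly as a full-text search engine.

More recently it is also used as an analytics engine that collects logs and metrics from web/app servers and aggregates and analyzes them.

Documents are represented as JSON, the format used in every example in this post.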

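Structure

Elasticsearch is a distributed data store made up of multiple nodes (servers).

Nodes are divided by role into master nodes and data nodes. In production, a cluster is deployed with at least three master-eligible nodes plus as many data nodes as the data volume requires.

User data is stored in an index, and each index is split into several shards that are spread across multiple nodes.

This is how data is distributed and how replica copies are managed.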

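Use cases

Search engine

When an Elasticsearch cluster is used as a search engine, the most important factor is response time.

So when requesting a search-engine cluster, check the data size, the search queries that will run, the upper bound on acceptable response time, the search load and peak hours, and whether results can be cached.

Analytics engine

Recently Elasticsearch is widely used as an analytics engine.

The most typical case is using it as a log store.

Logs produced by applications are stored, and aggregations are run to query the data in the logs or to understand relationships within the data.

An analytics-engine cluster usually holds more data than a search-engine cluster, so check the daily indexing volume and retention period (total stored data), the indexing rate per second (throughput), the search load and peak hours, and the queries that will be used for aggregation.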
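
The sizing inputs listed above (daily indexing volume, retention period, indexing rate) can be turned into rough numbers. The following is a minimal sketch, assuming raw data size only: it ignores indexing overhead, compression, and free-disk headroom, and the function names are hypothetical helpers, not part of any Elasticsearch API.

```python
SECONDS_PER_DAY = 86_400


def estimate_storage_gb(daily_index_gb: float, retention_days: int, replicas: int = 1) -> float:
    """Rough total size: each primary byte is stored once, plus once per replica."""
    if daily_index_gb < 0 or retention_days < 0 or replicas < 0:
        raise ValueError("inputs must be non-negative")
    return daily_index_gb * retention_days * (1 + replicas)


def average_docs_per_second(docs_per_day: int) -> float:
    """Average indexing rate; peak-hour traffic will be higher than this."""
    return docs_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # 50 GB/day kept for 30 days with one replica per primary shard.
    print(estimate_storage_gb(50, 30, replicas=1))
    print(average_docs_per_second(86_400_000))
```

These figures are only a starting point; real sizing should be checked against measurements from a test cluster.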

How it works

Data comes in through a log shipper, HTTP requests, or a data pipeline
Elasticsearch indexes the data
The indexed data is stored in shards
Users search the indexed data through the API
Search results are ordered by relevance
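
The index-then-search flow can be illustrated with a toy inverted index. This is a sketch for intuition only: the `ToyIndex` class is hypothetical, and the score is a plain term-frequency sum rather than the relevance scoring Elasticsearch actually uses.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ToyIndex:
    def __init__(self):
        # term -> {doc_id -> how often the term occurs in that document}
        self.postings = defaultdict(dict)

    def index(self, doc_id, text):
        for term, freq in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = freq

    def search(self, query):
        scores = Counter()
        for term in tokenize(query):
            for doc_id, freq in self.postings.get(term, {}).items():
                scores[doc_id] += freq
        # Highest score first; ties are broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("1", "The Book Thief")
    idx.index("2", "The Alchemist")
    idx.index("3", "A book about a book")
    print(idx.search("book"))
```

The lookup touches only the postings for the query terms, which is why search stays fast as the number of documents grows.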

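HA (High Availability)

Replication
- Create replicas of each shard
Sharding
- Sharding spreads an index across nodes, improving search performance and availability
Cluster setup
- Can be run on distributed servers or in a cloud environment
- Configure node failover
Monitoring
- An Elasticsearch cluster can be monitored through its APIs, the X-Pack monitoring plugin, Elastic APM, Logstash, and similar tools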
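
For API-based monitoring, the cluster health endpoint (`GET /_cluster/health`) reports a `status` of green, yellow, or red. Below is a minimal sketch of reading it from a script, assuming an unsecured cluster at `http://localhost:9200`; the helper names are hypothetical.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """Call the cluster health API and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health") as resp:
        return json.load(resp)


def summarize_health(health):
    """Turn a cluster health response into a one-line summary."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "ok: all primary and replica shards are allocated"
    if status == "yellow":
        return f"degraded: primaries allocated, {unassigned} shard copies unassigned"
    if status == "red":
        return f"critical: at least one primary shard is unassigned ({unassigned} unassigned)"
    return "unknown status"


if __name__ == "__main__":
    print(summarize_health(fetch_health()))
```

A yellow status is common on a single-node cluster, because replicas cannot be placed on the same node as their primary.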

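Installation

# Download the archive into the directory where you want it, then extract it
# (for 7.x and later the archive name also carries a platform suffix,
#  e.g. elasticsearch-<version>-linux-x86_64.tar.gz)
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-<version>.tar.gz
tar -xvzf elasticsearch-<version>.tar.gz


# Or use a package manager (the Elastic APT repository must be added first)
sudo apt-get install elasticsearch

Configuration file settings - https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html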

/etc/elasticsearch/elasticsearch.yml

# Elasticsearch configuration file

# Set the path to the logs directory
path.logs: /var/log/elasticsearch

# Set the cluster name
cluster.name: my-cluster

# Set the node name
node.name: node-1

# Per-index settings (index.number_of_shards, index.refresh_interval, ...) are
# not accepted in elasticsearch.yml since Elasticsearch 5.0; a node that finds
# them here refuses to start. Set them when creating an index, through the
# update index settings API, or in an index template. The values used in this
# post are kept below, commented out, for reference:
#
# index.number_of_shards: 3
# index.number_of_replicas: 1
# index.refresh_interval: 60s
# index.mapping.total_fields.limit: 1000
# index.max_result_window: 10000
# index.routing.allocation.enable: all
# index.routing.rebalance.enable: all
# index.translog.durability: async
# index.merge.scheduler.max_thread_count: 2

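# Set the seed hosts used for discovery
# ("unicast" is not a recognized discovery.type; use single-node for a
#  one-node development cluster, or leave the setting out)
discovery.seed_hosts: ["127.0.0.1", "192.168.0.1"]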

# Set the network host and port
network.host: 0.0.0.0
http.port: 9200

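# Disable optional X-Pack features
xpack.ml.enabled: false
# xpack.monitoring.enabled was deprecated in 7.x and removed in 8.0;
# leave it out on recent versions
# xpack.monitoring.enabled: false

# Heap size is not an elasticsearch.yml setting. Set it in config/jvm.options
# (or a file under jvm.options.d/), or through the ES_JAVA_OPTS environment variable:
#   ES_JAVA_OPTS="-Xms2g -Xmx2g" ./bin/elasticsearch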
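
Settings whose names start with `index.` describe individual indices rather than the node, so they belong in an index creation request or an index template rather than in elasticsearch.yml. A small sketch of separating the two groups is shown below; the prefix check is a simple heuristic and `split_settings` is a hypothetical helper, not an Elasticsearch function.

```python
def split_settings(flat_settings):
    """Separate node-level keys from per-index keys using the "index." prefix."""
    node_level, index_level = {}, {}
    for key, value in flat_settings.items():
        target = index_level if key.startswith("index.") else node_level
        target[key] = value
    return node_level, index_level


if __name__ == "__main__":
    settings = {
        "cluster.name": "my-cluster",
        "node.name": "node-1",
        "http.port": 9200,
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
    }
    node_level, index_level = split_settings(settings)
    print("elasticsearch.yml:", node_level)
    print("index settings:", index_level)
```

The second dictionary is what would go under `settings` in the body of a create index request.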

Additional settings can also be added:

path.data: The path to the data directory, where Elasticsearch stores the shard data.
path.conf: The path to the configuration directory, where Elasticsearch looks for the configuration files. Removed as a setting in 6.0; use the ES_PATH_CONF environment variable instead.
path.home: The path to the Elasticsearch home directory, which contains the installation files. It is normally determined by the startup script rather than set in elasticsearch.yml.
node.master: Whether the node is master-eligible. Master-eligible nodes can be elected to manage the cluster. Deprecated in 7.9 in favor of node.roles.
node.data: Whether the node stores and indexes data. A node with every role disabled acts only as a coordinating node. Deprecated in 7.9 in favor of node.roles.

Running

Test Request & Response

# Once Elasticsearch is installed and running, send a test request
curl http://localhost:9200/

# A test response like the following comes back
{
  "name" : "your_machine_name",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "abcdefgh-ijkl-mnop-qrst-uvwxyz123456",
  "version" : {
    "number" : "7.x",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "abcdefghijklmnopqrstuvwxyz123456",
    "build_date" : "2022-12-01T00:00:00.000000Z",
    "build_snapshot" : false,
    "lucene_version" : "8.x",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
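
A script can read the same response to confirm which node and version it is talking to. This is a minimal sketch that only parses a response body already obtained (for example from the curl call above); `parse_root_info` is a hypothetical helper.

```python
import json


def parse_root_info(body):
    """Extract the node name, cluster name, and version from the root endpoint response."""
    info = json.loads(body)
    return {
        "node": info["name"],
        "cluster": info["cluster_name"],
        "version": info["version"]["number"],
    }


if __name__ == "__main__":
    sample = """
    {
      "name": "your_machine_name",
      "cluster_name": "elasticsearch",
      "version": {"number": "7.x", "lucene_version": "8.x"},
      "tagline": "You Know, for Search"
    }
    """
    print(parse_root_info(sample))
```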

Index

Once Elasticsearch is installed correctly, indices can be set up.

Define the Index
1. An index consists of a name and a mapping definition
2. The mapping definition consists of fields and their types
Index a document
1. Elasticsearch parses and analyzes each document, then stores it in the index
Retrieve the indexed document
1. Indexed documents can be searched through the index
2. Searches can be narrowed with keywords, filters, facets, and other conditions

Index: a structure that makes storing and searching document data efficient

Fast Searching: fast search even over large volumes of data
Flexible Searching: search with many kinds of conditions, such as keywords, filters, facets, and aggregations
Scalability (Horizontal): runs across multiple servers or the cloud, using shards and replicas to raise search performance on large data sets
Real-time Indexing: indexes data as it arrives, so it becomes searchable in near real time

Index example

Using the create index API - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html

It uses HTTP PUT.

endpoint : /{index}

Put the name of the index you want to create in place of {index}.

The mapping for the index goes in the request body.

In the example below, books is the index name.

curl -XPUT "http://localhost:9200/books" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "author": {
        "type": "keyword"
      },
      "year": {
        "type": "integer"
      }
    }
  }
}'

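This index defines three fields.

The three fields sit under properties, which sits under mappings.

Each field must declare its type.

Shards and replicas can also be configured at the moment the index is created.

To do that, include a settings block in the request body (creating an index that already exists fails, so delete the earlier books index first).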

curl -XPUT "http://localhost:9200/books" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "author": {
        "type": "keyword"
      },
      "year": {
        "type": "integer"
      }
    }
  }
}'
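
The same create index request can be assembled from a script. The sketch below only builds the request object and does not send it; the base URL and the `build_create_index_request` helper are assumptions for illustration.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, properties, shards=1, replicas=1):
    """Build (but do not send) a PUT /{index} request with settings and mappings."""
    body = {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        },
        "mappings": {"properties": properties},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_create_index_request(
        "http://localhost:9200",
        "books",
        {
            "title": {"type": "text"},
            "author": {"type": "keyword"},
            "year": {"type": "integer"},
        },
        shards=3,
        replicas=1,
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running cluster.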

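Many settings are available:

number_of_shards: The number of primary shards for the index. It is fixed when the index is created.
number_of_replicas: The number of replica copies kept for each primary shard.
refresh_interval: How often newly indexed documents become visible to search, given as a time value such as 1s or 60s.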
index.mapping.total_fields.limit: The maximum number of fields that can be included in the mapping for the index.
index.max_result_window: The maximum number of search results that can be returned for a query.
index.codec: The codec to use for the index.
index.routing.allocation.enable: Enables or disables shard allocation for the index.
index.routing.rebalance.enable: Enables or disables rebalancing of shards for the index.
index.translog.durability: The durability of the translog for the index.
index.merge.scheduler.max_thread_count: The maximum number of threads to use for the index merge scheduler.

Analyzers, filters, and tokenizers can also be defined inside settings.

curl -XPUT "http://localhost:9200/books" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "author": {
        "type": "keyword"
      },
      "year": {
        "type": "integer"
      }
    }
  }
}'
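
To see what such an analyzer does to a title, here is a rough simulation of the tokenize, lowercase, and stop-word steps. It is an approximation under stated assumptions: a regular expression stands in for the standard tokenizer, only a subset of the stop words is listed, and the stemming step is left out entirely.

```python
import re

# A subset of the stop words from the index definition above.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def analyze(text, stopwords=STOPWORDS):
    """Approximate a custom analyzer: split into words, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stopwords]


if __name__ == "__main__":
    print(analyze("The Book Thief"))
    print(analyze("Pride and Prejudice"))
```

On a real cluster, the analyze API (`POST /books/_analyze`) shows the actual tokens an analyzer produces.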

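shard vs replica

Sharding splits an index into multiple shards so that the index itself scales out, which raises search performance.

Each shard holds part of the index, and the shards are distributed across multiple servers or cloud instances.

In other words, when a document is indexed, Elasticsearch decides which shard it belongs to from the document ID (or a custom routing value).

A replica is a copy of a primary shard; number_of_replicas is the number of copies kept for each primary shard.

In short, shards are for search performance and replicas are for failover.

The number of shards and replicas should be chosen by finding a reasonable balance between search performance and resource usage.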
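
In its simplest form, shard selection is a hash of the routing value (the document ID by default) taken modulo the number of primary shards. The sketch below illustrates the idea with `zlib.crc32` as a stand-in hash, which is an assumption for illustration and will not reproduce the shard numbers a real cluster picks; both function names are hypothetical.

```python
import zlib


def route_to_shard(routing, number_of_primary_shards):
    """Pick a primary shard for a routing value (stand-in hash, not the real one)."""
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def total_shard_copies(primaries, replicas):
    """Each primary shard exists once, plus once per configured replica."""
    return primaries * (1 + replicas)


if __name__ == "__main__":
    for doc_id in ["1", "2", "3", "4", "5"]:
        print(doc_id, "-> shard", route_to_shard(doc_id, 3))
    print("shard copies in the cluster:", total_shard_copies(3, 1))
```

Because the shard number depends on the primary shard count, that count cannot be changed after index creation without reindexing or using the split/shrink APIs.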

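Search

Requests follow this form:

curl -XGET "http://localhost:9200/index_name/_search?q=<query>"

For example, to search the title field in the books index:

curl -XGET "http://localhost:9200/books/_search?q=title:book"

The response has the following shape. The values are illustrative of the format only: an actual title:book search returns only documents whose title contains that term.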

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "The Book Thief",
          "author": "Markus Zusak",
          "year": 2005
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "title": "The Alchemist",
          "author": "Paulo Coelho",
          "year": 1988
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "3",
        "_score": 1.0,
        "_source": {
          "title": "Pride and Prejudice",
          "author": "Jane Austen",
          "year": 1813
        }
      },
      // More hits here
    ]
  }
}
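
Client code usually cares about two parts of this response: the total under `hits.total.value` and the documents under `hits.hits[]._source`. Here is a minimal sketch of pulling them out of an already parsed response; the helper names are hypothetical.

```python
def total_hits(response):
    """Number of matching documents reported by the search."""
    return response["hits"]["total"]["value"]


def extract_hits(response):
    """Flatten each hit into one dict holding its id, score, and source fields."""
    return [
        {"id": hit["_id"], "score": hit.get("_score"), **hit["_source"]}
        for hit in response["hits"]["hits"]
    ]


if __name__ == "__main__":
    response = {
        "hits": {
            "total": {"value": 1, "relation": "eq"},
            "hits": [
                {
                    "_index": "books",
                    "_id": "1",
                    "_score": 1.0,
                    "_source": {"title": "The Book Thief", "author": "Markus Zusak", "year": 2005},
                }
            ],
        }
    }
    print(total_hits(response))
    print(extract_hits(response))
```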

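The Elasticsearch Query DSL makes more complex queries possible.

The query below uses the books index and looks for books

whose title matches "book" and that were published from 1900 up to (but not including) 1950,

whose author is not Jane Austen,

sorted by year in descending order.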

curl -XGET "http://localhost:9200/books/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "book" } },
        { "range": { "year": { "gte": 1900, "lt": 1950 } } }
      ],
      "must_not": [
        { "match": { "author": "Jane Austen" } }
      ]
    }
  },
  "sort": [
    { "year": { "order": "desc" } }
  ]
}
'

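Response (illustrative of the format only: a real cluster returns only hits that satisfy every clause of the query, in descending year order, and because the results are sorted by a field rather than by relevance, each hit's _score is null unless track_scores is enabled)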

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "5",
        "_score": 1.0,
        "_source": {
          "title": "The Grapes of Wrath",
          "author": "John Steinbeck",
          "year": 1939
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "4",
        "_score": 1.0,
        "_source": {
          "title": "The Great Gatsby",
          "author": "F. Scott Fitzgerald",
          "year": 1925
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "3",
        "_score": 1.0,
        "_source": {
          "title": "To Kill a Mockingbird",
          "author": "Harper Lee",
          "year": 1960
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "The Book Thief",
          "author": "Markus Zusak",
          "year": 2005
        }
      }
    ]
  }
}
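
The logic of a bool query with must, must_not, and a sort can be checked locally. The sketch below applies the same conditions to the sample books that appear in this post; it is a plain in-memory filter with hypothetical function names, and its title matching (whitespace split, lowercase) only approximates text analysis.

```python
BOOKS = [
    {"title": "The Book Thief", "author": "Markus Zusak", "year": 2005},
    {"title": "The Alchemist", "author": "Paulo Coelho", "year": 1988},
    {"title": "Pride and Prejudice", "author": "Jane Austen", "year": 1813},
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "year": 1925},
    {"title": "The Grapes of Wrath", "author": "John Steinbeck", "year": 1939},
    {"title": "To Kill a Mockingbird", "author": "Harper Lee", "year": 1960},
]


def matches(doc, title_term, year_gte, year_lt, excluded_author):
    """must: title term (optional) and year range; must_not: the excluded author."""
    if title_term is not None and title_term.lower() not in doc["title"].lower().split():
        return False
    if not (year_gte <= doc["year"] < year_lt):
        return False
    return doc["author"] != excluded_author


def run_query(docs, title_term="book", year_gte=1900, year_lt=1950,
              excluded_author="Jane Austen"):
    """Filter the documents, then sort by year in descending order."""
    selected = [d for d in docs if matches(d, title_term, year_gte, year_lt, excluded_author)]
    return sorted(selected, key=lambda d: d["year"], reverse=True)


if __name__ == "__main__":
    print(run_query(BOOKS))
    print([d["title"] for d in run_query(BOOKS, title_term=None)])
```

With the title clause, none of these sample documents qualify, because only "The Book Thief" contains the term and its year falls outside the range. Dropping the title clause leaves the two books published between 1900 and 1950.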

Reference links

https://esbook.kimjmin.net/

https://linuxhint.com/elasticsearch-create-index/

https://www.elastic.co/webinars/getting-started-elasticsearch

https://www.youtube.com/watch?v=HhjHY6iD3Qg

https://logz.io/blog/elasticsearch-tutorial/

https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

License: Attribution, NonCommercial, NoDerivatives