Prometheus: Up & Running


Prometheus: Up & Running - ch.1

게임이 더 좋아 2022. 11. 7. 08:21


Chapter 1. What is Prometheus?

Let's cover just the key points.

I'll quote the English original alongside my notes.

Let's get started.

Prometheus is an open source, metrics-based monitoring system.

It has a simple yet powerful data model and a query language that lets you analyse how your applications and infrastructure are performing.

Prometheus is, first of all, an open source, metrics-based monitoring system.

Its data model is simple yet powerful, and its query language (PromQL) lets you analyse how the applications and infrastructure you actually run are performing.

For instrumenting your own code, there are client libraries in all the popular languages and runtimes, including Go, Java/JVM, C#/.Net, Python, Ruby, Node.js, Haskell, Erlang and Rust.

There are many client libraries,

so instrumentation is easy to add to the code we actually write.

=> In other words, it is easy to add Prometheus metrics to an application.

Software like K8s and Docker are already instrumented with Prometheus client libraries.

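Kubernetes and Docker are likewise already instrumented with the client libraries, so their metrics can be collected directly.

For third-party software that exposes metrics in a non-Prometheus format, there are hundreds of integrations available.

These are called exporters (HAProxy, PostgreSQL, Redis, JMX, SNMP, Consul, Kafka).

Third-party software can also have its metrics collected through an exporter.

=> As a result, Prometheus works with many languages, with a wide range of software, and, through exporters, with existing programs as well.

When it comes to collecting metrics, Prometheus looks really good.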

A simple text format makes it easy to expose metrics to Prometheus.

This is one of Prometheus's strengths. Many other tools have come to support the Prometheus text format, which is really convenient in practice.
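To make the "simple text format" concrete, here is a minimal Python sketch that renders one counter in the Prometheus text exposition format. The metric name `http_requests_total`, its labels and the helper `render_metric` are made up for illustration, and the sketch does not escape special characters in label values. A real application would normally use a client library rather than formatting by hand.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Note: label values are not escaped in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two label combinations.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027), ({"method": "POST", "code": "400"}, 3)],
)
print(text)
```

Each sample is one line of plain text, which is why the format is easy to produce from almost any language.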

The data model identifies each time series not just with a name, but also with an unordered set of key-value pairs called labels.

Each time series is identified not only by a name but also by labels in key-value form.

The advantage this brings is:

The PromQL query language allows aggregation across any of these labels, so you can analyse not just per process but also per datacenter and per service or by any other labels that you have defined.

We can monitor per process, per datacenter (IDC), per service, or by whatever labels we have defined.
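The idea of aggregating across labels can be sketched in plain Python. This is not PromQL and not how Prometheus implements it. It only mimics the spirit of an expression such as `sum by (datacenter)`. The sample data and the helper `sum_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (labels, value) for one metric name.
samples = [
    ({"datacenter": "dc1", "service": "api", "instance": "a:9100"}, 10.0),
    ({"datacenter": "dc1", "service": "api", "instance": "b:9100"}, 5.0),
    ({"datacenter": "dc2", "service": "api", "instance": "c:9100"}, 7.0),
    ({"datacenter": "dc2", "service": "web", "instance": "d:9100"}, 2.0),
]


def sum_by(samples, *label_names):
    """Sum values grouped by the given labels, dropping all other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


per_datacenter = sum_by(samples, "datacenter")
per_service = sum_by(samples, "service")
print(per_datacenter)
print(per_service)
```

The same raw samples answer both questions. Only the choice of grouping labels changes.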

These can be graphed in dashboard systems such as Grafana.

Prometheus and Grafana are practically inseparable.

Alerts can be defined using the exact same PromQL query language that you use for graphing.

If you can graph it, you can alert on it.

Alerting is also easy.

If we use PromQL to build a graph,

we can reuse the same expression for alerting as-is.

In other words, anything we can express in PromQL

can be alerted on.
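"If you can graph it, you can alert on it" can be illustrated with a small Python sketch in which one function plays the role of the shared expression. The readings, the function `error_ratio` and the 5% threshold are all assumptions made for the example. Real alerts would be written as PromQL rules.

```python
def error_ratio(errors, requests):
    """The shared 'expression': fraction of requests that failed."""
    return errors / requests if requests else 0.0


# Hypothetical per-minute (errors, requests) readings.
readings = [(1, 200), (3, 210), (30, 190), (2, 205)]

# Graphing: evaluate the expression at every point in time.
graph_points = [error_ratio(e, r) for e, r in readings]

# Alerting: the same expression plus a threshold (5% is an arbitrary choice).
THRESHOLD = 0.05
firing = [i for i, value in enumerate(graph_points) if value > THRESHOLD]

print(graph_points)
print(firing)
```

Only the third reading crosses the threshold, so that is the only point at which the alert condition holds.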

Relatedly, service discovery can automatically determine what applications and machines should be scraped from sources such as K8s, EC2, Azure, GCE, OpenStack.

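Service discovery is also supported: Prometheus works out which applications it should scrape and then scrapes them.

It helps to think of it as loosely analogous to how a Service object in K8s keeps track of a changing set of Pods, although it is not the same mechanism.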

A single Prometheus server can ingest millions of samples per second.

All components of Prometheus can be run in containers, and they avoid doing anything fancy that would get in the way of configuration management tools.

Prometheus is lightweight yet performant, so it fits into almost any environment.

So what does "monitoring" mean here?

Prometheus was built to aid software developers and administrators in their operation of production computer systems, such as applications, tools, databases, and networks backing popular websites.

Monitoring can be divided into four parts:

1. Alerting

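=> Notifying people (developers, administrators, operators and so on) when something needs attention.

2. Debugging

=> Collecting metrics, logs and so on helps you find the root cause.

3. Trending

Trending can feed into design decisions and processes such as capacity planning.

=> Just as averages let you predict the cluster size you need, the collected data indirectly tells you how much capacity to plan for.

4. Plumbing

=> This refers to data processing pipelines, which are said to come up a lot in MLOps. In the book's sense, it means reusing part of the monitoring system for another purpose instead of building a bespoke solution. I don't fully understand it myself.

I don't yet have a feel for which parts call for plumbing.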

So what is it that we mostly monitor? Events, for example:

Receiving an HTTP request
Sending an HTTP 400 response
Entering a function
Reaching the else of an if statement
Leaving a function
A user logging in
Writing data to disk
Reading data from the network
Requesting more memory from the kernel

There are many more besides these.

In particular, entering and leaving a function together give the execution time indirectly, so I have used that pair a lot.
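Pairing the "entering a function" and "leaving a function" events to get execution time can be sketched with a Python decorator. The in-memory dictionaries and the function `handle_request` are hypothetical stand-ins. A client library would normally provide a ready-made timer for this.

```python
import functools
import time

# Accumulated call count and total duration per function name
# (a hypothetical in-memory store).
call_count = {}
call_seconds = {}


def timed(func):
    """Record how many times `func` ran and how long it took in total."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # event: entering the function
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start  # event: leaving the function
            name = func.__name__
            call_count[name] = call_count.get(name, 0) + 1
            call_seconds[name] = call_seconds.get(name, 0.0) + elapsed
    return wrapper


@timed
def handle_request(n):
    return sum(range(n))


handle_request(1000)
handle_request(2000)
average = call_seconds["handle_request"] / call_count["handle_request"]
print(call_count["handle_request"], average)
```

Keeping only a count and a running total, rather than every individual timing, is what keeps the data volume small.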

All events also have context.

Every event has context.

An HTTP request will have the IP address it is coming from and going to, the URL being requested, the cookies that are set, and the user who made the request.

As shown, even a single request carries context. The same goes for responses, of course.

However:

Having all the context for all the events would be great for debugging and understanding how your systems are performing in both technical and business terms, but that amount of data is not practical to process and store.

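Taking every piece of context into account may be great for debugging and for measuring system performance,

but processing and storing that much data is not practical.

In other words, the full context is only needed in specific situations.

Thus there are what I would see as roughly four ways to approach reducing that volume of data to something workable, namely profiling, tracing, logging and metrics.

In other words, it helps to split monitoring into these four approaches.

Let's start with profiling.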

1. Profiling

Profiling takes the approach that you can't have all the context for all of the events all of the time, but you can have some of the context for limited periods of time.

Profiling can be seen as collecting part of the context, for a limited period of time.

In other words, you look closely at the window in which a problem occurred.

Tcpdump is one example of a profiling tool.

It allows you to record network traffic based on a specified filter.

It's an essential debugging tool, but you can't really turn it on all the time as you will run out of disk space.

tcpdump is also a form of profiling: it looks at specific network traffic.

Even so, it only sees part of the picture.

Debug builds of binaries that track profiling data are another example.

They provide a plethora of useful information, but the performance impact of gathering all that information, such as timings of every function call, means that it is not generally practical to run it in production on an ongoing basis.

The same applies here. Gathering all of that information is not common practice, because of the performance cost.

Profiling is largely for tactical debugging.

If it is being used on a longer term basis, then the data volume must be cut down in order to fit into one of the other categories of monitoring.

Profiling is not meant for long-term monitoring.

The more detailed the data, the larger it gets.

2. Tracing

Tracing doesn't look at all events, rather it takes some proportion of events such as one in a hundred that pass through some functions of interest.

Tracing will note the functions in the stack trace of the points of interest, and often also how long each of these functions took to execute.

From this you can get an idea of where your program is spending time and which code paths are most contributing to latency.

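Tracing is, literally, tracing.

However, it does not trace everything. It traces only a sample of the events that pass through the functions you care about.

So a trace lets you follow a flow, see how long each step took, and see how the stack trace built up.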

Distributed tracing takes this a step further. It makes tracing work across processes by attaching unique IDs to requests that are passed from one process to another in remote procedure calls (RPCs) in addition to whether this request is one that should be traced.

Even when a request crosses into another process through an inter-process call, it can still be traced using its unique ID.

This is a vital tool for debugging distributed microservice architectures.

For tracing, it is the sampling that keeps the data volumes and instrumentation performance impact within reason.

This is especially necessary in an MSA (microservice architecture).

Because an MSA is a collection of independent services, a lot of work happens across processes.

When tracing, it is important to keep the data volume and the performance impact on production within reason, and sampling is what makes that possible.
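The two ideas above, sampling roughly one request in a hundred and passing a unique ID plus the sampling decision between processes, can be sketched in Python. The two "services" are ordinary functions here, the header names `trace-id` and `sampled` are invented for the example, and `collected_spans` stands in for a tracing backend. Real systems use a tracing library and a standard header format.

```python
import uuid

collected_spans = []  # stands in for a tracing backend


def start_trace(request_number, sample_every=100):
    """Create the trace context at the edge: a unique ID plus a sampling decision."""
    return {
        "trace_id": uuid.uuid4().hex,
        "sampled": request_number % sample_every == 0,
    }


def record_span(context, service, operation):
    """Record a span only for requests that were chosen for tracing."""
    if context["sampled"]:
        collected_spans.append((context["trace_id"], service, operation))


def backend_service(headers):
    # The callee rebuilds the context from what the caller passed along.
    context = {"trace_id": headers["trace-id"], "sampled": headers["sampled"] == "1"}
    record_span(context, "backend", "query_db")


def frontend_service(request_number):
    context = start_trace(request_number)
    record_span(context, "frontend", "handle_request")
    # Pass both the ID and the sampling decision in the (simulated) RPC headers.
    headers = {
        "trace-id": context["trace_id"],
        "sampled": "1" if context["sampled"] else "0",
    }
    backend_service(headers)


for n in range(1, 301):
    frontend_service(n)

print(len(collected_spans))
```

Out of 300 requests only three are traced, and each traced request produces one span per service sharing the same ID, so the backend stores six spans instead of six hundred.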

3. Logging

Logging looks at a limited set of events and records some of the context for each of these events.

To avoid consuming too many resources, as a rule of thumb you are limited to somewhere around a hundred fields per log entry.

A big benefit of logging is that there is no sampling of events, so even though there is a limit on the number of fields, it is practical to determine how slow requests are affecting one particular user talking to one particular API endpoint.

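A big advantage of logging is that events are not sampled.

So even though the number of fields is limited,

it is practical to work out how slow requests are affecting one particular user talking to one particular API endpoint.

Logs come in several kinds.

There are four.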

Transaction Logs

These are the critical business records that you must keep safe at all costs, likely forever.

These are logs that must be kept no matter what. Information that matters to users and financial records are typical examples.

Request Logs

They may be processed in order to implement user facing features, or just for internal optimisations.

These are logs you collect routinely, but losing some of them is not a big problem and they do not need to be kept forever.

Application Logs

Not all logs are about requests; some are about the process itself.

These are logs collected to understand the state of the application itself.

They are used a lot.

Debug Logs

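As the name suggests, these are used when debugging.

Debugging uses many fields, so these are also the logs with the largest data volume.

That is why they are only turned on while debugging.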

4. Metrics

Metrics include the system metrics we already know and user-defined (custom) metrics.

Metrics largely ignore context, instead tracking aggregations over time of different types of events.

That is not to say, though, that context is always ignored.

You can use metrics to track the latency and data volumes handled by each of the subsystems in your applications, making it easier to determine what exactly is causing a slowdown.

Metrics allow you to collect information about events from all over your process, but with generally no more than one or two fields of context with bounded cardinality.

Logs allow you to collect information about all of one type of event, but can only track a hundred fields of context with unbounded cardinality.

Metrics and logs are clearly different and serve different purposes. Metrics look at the state of the whole process through one or two fields,

whereas logs give information about a specific type of event through many fields. Of course, this is a simplification, not an absolute rule.
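The contrast between logs and metrics can be sketched over the same set of events in Python. The events and field names are hypothetical. The point is only which fields each approach keeps.

```python
import json
from collections import Counter

# Hypothetical events; each carries far more context than a metric would keep.
events = [
    {"path": "/login", "status": 200, "user": "alice", "duration_ms": 12},
    {"path": "/login", "status": 500, "user": "bob", "duration_ms": 950},
    {"path": "/cart", "status": 200, "user": "alice", "duration_ms": 30},
    {"path": "/login", "status": 200, "user": "carol", "duration_ms": 15},
]

# Logging: one record per event, keeping fields such as `user`
# (unbounded cardinality).
log_lines = [json.dumps(event, sort_keys=True) for event in events]

# Metrics: aggregate over all events, keeping only bounded fields
# (path and status).
requests_total = Counter((event["path"], event["status"]) for event in events)

print(len(log_lines))
print(dict(requests_total))
```

The log output grows with every event, while the metric only grows with the number of distinct (path, status) combinations.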

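Let's look at the Prometheus architecture.

Knowing the architecture tells us where each interaction happens

and where customisation belongs.

Looking at the architecture also gives a rough idea of the order in which things happen.

Let's go through it component by component.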

Client Libraries

Metrics do not typically magically spring forth from applications; someone has to add the instrumentation that produces them.

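Nothing yet makes an application's metrics flow into Prometheus fully automatically.

We have to do some work in the application for the integration to happen.

Client libraries are what make that work easy.

They exist for the language the application is written in, and other software can be integrated through them as well.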

Exporters

An exporter is a piece of software that you deploy right beside the application you want to obtain metrics from.

An exporter can be thought of as the piece that exposes an application's metrics.

It takes in requests from Prometheus, gathers the required data from the application, transforms them into the correct format, and finally returns them in a response to Prometheus.

It receives a request from Prometheus, collects data from the application, and hands it back to Prometheus in the expected format.
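The gather, transform and return steps of an exporter can be sketched in Python without any networking. Here `fetch_app_stats` is a stand-in for a third-party application with its own stats format, `handle_scrape` represents what the exporter would return when scraped, and the `myapp_` prefix is a made-up naming choice. A real exporter would serve this over HTTP, usually with a client library.

```python
# Stand-in for a third-party application that reports stats in its own format.
def fetch_app_stats():
    return "connections=42;queue_depth=7;uptime_s=3600"


def collect():
    """Gather the application's native stats and parse them into name/value pairs."""
    pairs = (item.split("=") for item in fetch_app_stats().split(";"))
    return {name: float(value) for name, value in pairs}


def handle_scrape():
    """What the exporter would return in the body of its response to a scrape."""
    lines = []
    for name, value in sorted(collect().items()):
        metric = f"myapp_{name}"  # the `myapp_` prefix is a made-up naming choice
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


print(handle_scrape())
```

The application itself is untouched. All of the translation into the Prometheus format lives in the exporter.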

Service Discovery

This is a feature that almost every pull-based system has.

Prometheus needs to know where applications are.

This is so Prometheus will know what it is meant to monitor, and be able to notice if something it is meant to be monitoring is not responding.

With dynamic environments you cannot simply provide a list of applications and exporters once, as it will get out of date.

This is where service discovery comes in.

In short, service discovery is how Prometheus finds the targets it should monitor.

This means monitoring has to keep working even when the set of targets changes or something goes wrong.

Prometheus allows you to configure how metadata from service discovery is mapped to monitoring targets and their labels using relabelling.

Through relabelling, Prometheus maps the metadata from service discovery onto monitoring targets and their labels.
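The mapping from discovery metadata to targets and labels can be sketched in Python. This is only the idea, not Prometheus's actual relabelling configuration syntax. The metadata keys (`meta_team`, `meta_env`), the addresses and the "keep only prod" rule are all invented for the example.

```python
# Hypothetical metadata as a service discovery mechanism might report it.
discovered = [
    {"address": "10.0.0.5:8080", "meta_team": "payments", "meta_env": "prod"},
    {"address": "10.0.0.6:8080", "meta_team": "search", "meta_env": "dev"},
    {"address": "10.0.0.7:8080", "meta_team": "search", "meta_env": "prod"},
]


def relabel(target):
    """Map discovery metadata to target labels; return None to drop the target."""
    if target["meta_env"] != "prod":  # keep only production targets
        return None
    return {
        "instance": target["address"],  # where to scrape
        "team": target["meta_team"],    # copied from metadata to a label
    }


targets = [labels for labels in map(relabel, discovered) if labels is not None]
print(targets)
```

The discovery source stays generic, and the decision about which targets to keep and what to call their labels is made in one place.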

Scraping

This is the step that actually fetches the data.

Prometheus uses a pull model here.

Pull vs push is explained further below.

Storage

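Data is mostly stored locally.

Prometheus itself does not cluster its storage. For more durable storage, the data can be sent on to a separate system (see Long-Term Storage below).

Compression is also supported, which makes storing the data more efficient.

SSDs are said to be preferable to HDDs for the actual hardware, though I'm not sure why.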

Dashboards

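Grafana is the most common tool for visualising Prometheus data,

and it also works with data sources other than Prometheus.

Dashboards built this way let operators grasp the state of the system easily.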

Recording Rules and Alerts

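PromQL expressions drive both. A recording rule evaluates an expression regularly and stores the result, and an alerting rule evaluates an expression and turns any results into alerts.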

Alert Management

The Alertmanager receives alerts from Prometheus and delivers them as notifications through other applications.

Slack is a typical destination.

Long-Term Storage

Prometheus was not really built to store data over long periods,

but it supports remote storage, so data can be written to and read from another storage system remotely.

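Pull vs Push

Pull means I fetch the data from the target. Push means the target sends the data to me.

Both have pros and cons.

A useful way to understand Prometheus's choice is that pulling does not get in the way of the application's performance.

Push is intuitive because a target only needs to know where the central server is to start sending.

I should write up the rest separately.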
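One consequence of pulling, mentioned in the service discovery quote earlier, is that the monitoring side can notice when a target is not responding. A minimal Python sketch follows, with targets modelled as plain functions instead of HTTP endpoints. The target names and the `up` field are assumptions made for the example.

```python
def healthy_target():
    return "requests_total 10\n"


def broken_target():
    raise ConnectionError("target not responding")


# Hypothetical target list; in practice it would come from service discovery.
targets = {"app-1": healthy_target, "app-2": broken_target}


def scrape_all(targets):
    """Pull from every known target and record whether each scrape succeeded."""
    results = {}
    for name, fetch in targets.items():
        try:
            results[name] = {"up": 1, "body": fetch()}
        except ConnectionError:
            # Because the monitoring side initiates the request,
            # a missing response is itself a signal.
            results[name] = {"up": 0, "body": ""}
    return results


results = scrape_all(targets)
print({name: r["up"] for name, r in results.items()})
```

With push, a silent target and a target with nothing to report can look the same. With pull, the failed request makes the difference visible.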

When is Prometheus a good fit?

Prometheus is not suitable for storing event logs or individual events. Nor is it the best choice for high cardinality data, such as email addresses or usernames.

Prometheus should not be used for data with many unique values, nor for storing event logs.

The reason is that this goes against what Prometheus was designed for.

For those use cases, it is better to use a different tool.
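Why high-cardinality labels are a problem can be shown with simple arithmetic: in the worst case, the number of time series for one metric is the product of the number of distinct values of each label. The label names and counts below are hypothetical.

```python
from math import prod

# Hypothetical numbers of distinct values per label.
bounded_labels = {"method": 5, "status": 6, "endpoint": 20}
with_user_label = {**bounded_labels, "username": 100_000}


def max_series(label_value_counts):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_value_counts.values())


print(max_series(bounded_labels))
print(max_series(with_user_label))
```

With bounded labels the upper bound is 600 series. Adding a `username` label with 100,000 values raises it to 60,000,000, which is the kind of data better kept in logs.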


License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)
