
Prometheus deployment and troubleshooting


The role of Prometheus:

Prometheus is an open-source system monitoring and alerting toolkit. It was originally developed and released by SoundCloud in 2012 and joined the Cloud Native Computing Foundation (CNCF) in 2016. Prometheus is designed to collect, store, and query all kinds of metrics data, helping users monitor the performance and operational status of their applications and systems.


Deployment Process:

This article uses Prometheus to monitor the resource status of a k8s cluster and walks through resolving a connection refused error on Alertmanager port 9093.

1. Download the kube-prometheus release that matches your k8s cluster version, according to the compatibility matrix
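
A quick way to see which release you need is to look at the node versions the cluster reports (plain kubectl, nothing else assumed), then compare against the compatibility matrix in the kube-prometheus README:

kubectl get nodes        # the VERSION column shows the kubelet version on each node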

# My k8s cluster version is 1.26.9, so I download version 0.13

wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.zip


# Unzip the download and use it!

unzip v0.13.0.zip


2. Go into the extracted directory and customize the alert rules and email push configuration (as needed)

cd kube-prometheus-0.13.0/manifests/

# Alert rules are defined in the *-prometheusRule.yaml files in this directory, for example:

vim kubePrometheus-prometheusRule.yaml

# Alert push (the Alertmanager receivers, e.g. email) is configured in the Alertmanager secret:

vim alertmanager-secret.yaml
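
For illustration only, an email receiver in the Alertmanager configuration looks roughly like this; the SMTP host, credentials, and addresses below are placeholders, not values from this cluster:

global:
  smtp_smarthost: 'smtp.example.com:465'    # placeholder SMTP server
  smtp_from: 'alert@example.com'            # placeholder sender address
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'your-app-password'   # placeholder credential
  smtp_require_tls: false
route:
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'ops@example.com'                   # placeholder recipient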

3. Deploy (or remove) the Prometheus monitoring stack

kubectl apply --server-side -f manifests/setup -f manifests

# To remove the Prometheus stack:

kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
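
After applying, you can watch the stack come up; the namespace and workload names below are the kube-prometheus defaults:

kubectl -n monitoring get pods -w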


# After deployment completes, all resources in the monitoring namespace should be running normally

# If you have not deployed an ingress, change the type of the following Services to NodePort so they can be accessed from outside the cluster

kubectl -n monitoring edit svc alertmanager-main

kubectl -n monitoring edit svc prometheus-k8s

kubectl -n monitoring edit svc grafana
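
If you prefer a non-interactive change, a patch along these lines should also work (a sketch; grafana here stands for any of the three Services above):

kubectl -n monitoring patch svc grafana -p '{"spec":{"type":"NodePort"}}'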


# Also remove the default network policies. They restrict both ingress and egress traffic, so the UIs cannot be reached directly even through a NodePort Service or an ingress.

kubectl -n monitoring delete networkpolicy --all
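
To see what is being removed first (the policy names are whatever kube-prometheus shipped):

kubectl -n monitoring get networkpolicy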


4. Next, a note about the problem I ran into

# While I was deploying the Prometheus monitoring stack, Alertmanager kept failing to start. Describing the pod showed the error message below.

kubectl -n monitoring describe pod alertmanager-main-1

# dial tcp 10.244.135.151:9093 connection refused
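
Besides describing the pod, the container logs are worth checking; alertmanager is the default container name in these pods, and --previous only helps if the container has already restarted:

kubectl -n monitoring logs alertmanager-main-1 -c alertmanager

kubectl -n monitoring logs alertmanager-main-1 -c alertmanager --previous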


# When I first searched the issues on GitHub, I found others had hit the same problem, and a solution had been suggested: edit the StatefulSet directly. I tried it and it did not work. If you try, you will find that no matter how you change the StatefulSet it never takes effect, and you cannot delete it either, because the StatefulSet is controlled by the Alertmanager custom resource (CRD) named main. The only way to change or stop the StatefulSet is to modify or delete that resource.

kubectl -n monitoring edit alertmanager main

kubectl -n monitoring delete alertmanager main
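
You can confirm that ownership directly; a quick check (jsonpath formatting may differ slightly between kubectl versions):

kubectl -n monitoring get sts alertmanager-main -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'

# this should print Alertmanager if the operator-managed custom resource owns the StatefulSet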
# At first I thought the probe timeout was too short, causing the probes to keep failing, so I edited the alertmanager main resource and raised the timeout to 300s, but the problem remained. Then I commented the probes out entirely so nothing was probed at all; still broken. Finally I commented out the container's ports and pointed the probe at a domain name instead, and the resulting DNS lookup exposed the real problem.

# Dump the resource to a file (the file name here is arbitrary), edit it, then re-apply it
kubectl -n monitoring get alertmanager main -o yaml > alertmanager-main.yaml



vim alertmanager-main.yaml

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  creationTimestamp: "2024-08-19T08:12:24Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.26.0
  name: main
  namespace: monitoring
  resourceVersion: "510527"
  uid: ee407f56-bffa-4191-baa7-e458e7a1b9ff
spec:
  image: quay.io/prometheus/alertmanager:v0.26.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/instance: main
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 0.26.0
  portName: web
  replicas: 3
  logLevel: debug
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 4m
      memory: 100Mi
  retention: 120h
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: alertmanager-main
  version: 0.26.0
  containers:
  - args:
    - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --web.listen-address=:9093
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated:9094
    - --cluster.reconnect-timeout=5m
    - --web.config.file=/etc/alertmanager/web_config/web-config.yaml
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    image: quay.io/prometheus/alertmanager:v0.26.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 10
      httpGet:
        path: /
        port: 443
        scheme: HTTPS
        host:
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    name: alertmanager
    # ports:
    # - containerPort: 9093
    # name: web
    # protocol: TCP
    # - containerPort: 9094
    # name: mesh-tcp
    # protocol: TCP
    # - containerPort: 9094
    # name: mesh-udp
    # protocol: UDP
    readinessProbe:
      failureThreshold: 10
      httpGet:
        path: /
        port: 443
        scheme: HTTPS
        host:
      initialDelaySeconds: 3
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 4m
        memory: 100Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
    - mountPath: /etc/alertmanager/config_out
      name: config-out
      readOnly: true
    - mountPath: /etc/alertmanager/certs
      name: tls-assets
      readOnly: true
    - mountPath: /alertmanager
      name: alertmanager-main-db
    - mountPath: /etc/alertmanager/web_config/web-config.yaml
      name: web-config
      readOnly: true
      subPath: web-config.yaml
# status:
# availableReplicas: 0
# conditions:
# - lastTransitionTime: "2024-08-19T08:12:28Z"
# message: |-
# pod alertmanager-main-1: containers with incomplete status: [init-config-reloader]
# pod alertmanager-main-2: containers with incomplete status: [init-config-reloader]
# observedGeneration: 1
# reason: NoPodReady
# status: "False"
# type: Available
# - lastTransitionTime: "2024-08-19T08:12:28Z"
# observedGeneration: 1
# status: "True"
# type: Reconciled
# paused: false
# replicas: 3
# unavailableReplicas: 3
# updatedReplicas: 3


# Delete the existing main resource

kubectl -n monitoring delete alertmanager main

# Recreate the main resource

kubectl -n monitoring apply -f alertmanager-main.yaml
# Viewing the StatefulSet logs showed DNS resolution errors, so I went to check the CoreDNS component and found the problem. My k8s cluster is a highly available deployment using Calico as the network plugin; the node addresses are 10.10.40.100-105, the Service CIDR is 10.96.0.0/16, and the pod CIDR is 10.244.0.0/16, yet the CoreDNS pods had addresses in the 10.88.0.0/16 range.

kubectl -n monitoring logs sts/alertmanager-main

kubectl get pod -A -o wide
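
To confirm whether in-cluster DNS works at all, a throwaway test pod helps (a sketch; the busybox image tag is just an example):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local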


# So I checked the CNI configuration and saw that every node had a cni0 NIC. That NIC is created by the flannel CNI, and that was the problem: Calico names its interfaces like calic741a2df36d@if2. After removing the leftover flannel configuration and the original CoreDNS pods, the network returned to normal.

# At that point I deleted the entire Prometheus stack, redeployed it, and everything came back up normally


ls -l /etc/cni/net.d/
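
You can also look at the node's interfaces directly; on a Calico node you expect cali* interfaces, not a flannel cni0 bridge (run this on the node itself):

ip link | grep -E 'cni0|cali'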


# Delete both CoreDNS pods so they are recreated with correct pod-network addresses (only one delete is shown here)

kubectl -n kube-system delete pod coredns-5bbd96d687-gtl9r


kubectl -n kube-system get pod -o wide
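
Once the CoreDNS pods are recreated, their IPs should fall inside the pod CIDR (10.244.0.0/16 here), and Alertmanager should report healthy again. A quick check, assuming the default plain-HTTP web config (the /-/healthy endpoint and the k8s-app=kube-dns label are standard defaults):

kubectl -n kube-system get pod -l k8s-app=kube-dns -o wide

kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093 &

curl -s http://127.0.0.1:9093/-/healthy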


Summary

# No matter which service you are deploying, if the pods run and pods on different nodes can reach each other, it is usually not a network problem. When an individual pod misbehaves like this, read the error carefully; if it turns out to be a port connection refused or similar, checking the CoreDNS component first often works wonders, though of course you still have to judge by the actual situation.

# If a clustered deployment has problems, switching to a single-node setup for testing is also a good way to narrow them down.