Prometheus role:
Prometheus is an open-source system monitoring and alerting toolkit. It was originally developed and released by SoundCloud in 2012 and joined the Cloud Native Computing Foundation (CNCF) in 2016. Prometheus is designed to collect, store, and query a wide variety of metrics, helping users monitor the performance and operational status of their applications and systems.
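As a small illustration of the query side: once the web UI/API is reachable (for example via the NodePort Services set up later in this article), you can run a PromQL query against the HTTP API. The address below is a placeholder, not a value from this deployment.
# Ask Prometheus which scrape targets are up, via its HTTP API
curl -s 'http://<node-ip>:<prometheus-nodeport>/api/v1/query?query=up'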
Deployment Process:
This article walks through using Prometheus to monitor the resource status of a k8s cluster and resolving an Alertmanager "port 9093 connection refused" issue.
1. Download the kube-prometheus release that matches your k8s cluster version (see the compatibility matrix in the kube-prometheus README)
# My k8s cluster version is 1.26.9, so I download release v0.13.0
wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.zip
# Unzip the downloaded archive
unzip v0.13.0.zip
2. Enter the extracted directory and customize the alert rules and email push configuration (optional, depending on your requirements)
cd kube-prometheus-0.13.0/manifests/
# Alert rules are defined in the *-prometheusRule.yaml files in this directory, e.g.
vim kubePrometheus-prometheusRule.yaml
# Alert push (e.g. email) is configured in the Alertmanager config secret
vim alertmanager-secret.yaml
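As a sketch of what such customization can look like (the rule below, including its name, expression, and file name, is an illustrative placeholder rather than stock kube-prometheus content), an extra alert can be added as a PrometheusRule object; for email push, the relevant Alertmanager config keys are the smtp_* settings under global and an email_configs entry in a receiver, placed in alertmanager-secret.yaml.
# Hypothetical extra alert rule; the labels follow the convention used by the bundled rules
cat <<'EOF' > my-extra-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-extra-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: custom.rules
    rules:
    - alert: HighNodeCPU
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node CPU usage has been above 90% for 10 minutes"
EOF
kubectl apply -f my-extra-rules.yaml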
3. Deploy the Prometheus monitoring stack (run from the kube-prometheus-0.13.0 directory); removal is shown below as well
kubectl apply --server-side -f manifests/setup -f manifests
# To remove the Prometheus stack:
kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
# After deployment completes, check that all resources in the monitoring namespace are Running
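# Exact pod names vary per cluster; a quick way to confirm everything came up:
kubectl -n monitoring get pods,svc -o wide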
# If you have not deployed an ingress, change the following Services to type NodePort for external access (edit them as below, or patch them as shown after these commands)
kubectl -n monitoring edit svc alertmanager-main
kubectl -n monitoring edit svc prometheus-k8s
kubectl -n monitoring edit svc grafana
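# If you prefer a non-interactive change, a merge patch does the same thing (shown for grafana as an example; repeat for the other two Services)
kubectl -n monitoring patch svc grafana -p '{"spec":{"type":"NodePort"}}'
# Note the NodePort that gets allocated
kubectl -n monitoring get svc grafana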
# Also remove the default NetworkPolicies: they restrict ingress and egress traffic, so even a NodePort Service or an ingress cannot reach the UIs directly.
kubectl -n monitoring delete networkpolicy --all
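# Optionally list the policies first to see what is being removed (kube-prometheus ships several by default)
kubectl -n monitoring get networkpolicy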
4. Next, the problem I ran into
# While deploying the monitoring stack, my Alertmanager kept failing to start properly; describing the pod revealed the error message below
kubectl -n monitoring describe pod alertmanager-main-1
# dial tcp 10.244.135.151:9093 connection refused
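# In my case the pods were stuck because the init-config-reloader container never completed (see the status block further down); its logs are a useful next check. The container name comes from the pod status, the pod name from the describe output above.
kubectl -n monitoring logs alertmanager-main-1 -c init-config-reloader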
# I first searched the issues on the kube-prometheus GitHub repository and found that others had hit the same problem, along with a suggested fix: edit the file content inside the StatefulSet. If you try that, you will find that no matter what you change it never takes effect, and you cannot delete the StatefulSet either. The StatefulSet is controlled by the (CRD) custom resource alertmanager/main, so you can only modify or delete that resource to affect it.
kubectl -n monitoring edit alertmanager main
kubectl -n monitoring delete alertmanager main
# At first I thought the probe timeout was too short, causing the checks to keep failing, so I edited the alertmanager main resource and raised the timeout to 300s, but the problem remained. Then I commented the probes out entirely so nothing was being checked, and it still failed. Finally I commented out the container's ports and pointed the probes at an external domain name instead, which exposed the real problem.
# Export the Alertmanager CR to a file (alertmanager-main.yaml here), edit it, then re-apply it
kubectl -n monitoring get alertmanager main -o yaml > alertmanager-main.yaml
vim alertmanager-main.yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  creationTimestamp: "2024-08-19T08:12:24Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.26.0
  name: main
  namespace: monitoring
  resourceVersion: "510527"
  uid: ee407f56-bffa-4191-baa7-e458e7a1b9ff
spec:
  image: quay.io/prometheus/alertmanager:v0.26.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/instance: main
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 0.26.0
  portName: web
  replicas: 3
  logLevel: debug
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 4m
      memory: 100Mi
  retention: 120h
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: alertmanager-main
  version: 0.26.0
  containers:
  - args:
    - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --web.listen-address=:9093
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated:9094
    - --cluster.reconnect-timeout=5m
    - --web.config.file=/etc/alertmanager/web_config/web-config.yaml
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    image: quay.io/prometheus/alertmanager:v0.26.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 10
      httpGet:
        path: /
        port: 443
        scheme: HTTPS
        host:
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    name: alertmanager
#    ports:
#    - containerPort: 9093
#      name: web
#      protocol: TCP
#    - containerPort: 9094
#      name: mesh-tcp
#      protocol: TCP
#    - containerPort: 9094
#      name: mesh-udp
#      protocol: UDP
    readinessProbe:
      failureThreshold: 10
      httpGet:
        path: /
        port: 443
        scheme: HTTPS
        host:
      initialDelaySeconds: 3
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 4m
        memory: 100Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
    - mountPath: /etc/alertmanager/config_out
      name: config-out
      readOnly: true
    - mountPath: /etc/alertmanager/certs
      name: tls-assets
      readOnly: true
    - mountPath: /alertmanager
      name: alertmanager-main-db
    - mountPath: /etc/alertmanager/web_config/web-config.yaml
      name: web-config
      readOnly: true
      subPath: web-config.yaml
# status:
#   availableReplicas: 0
#   conditions:
#   - lastTransitionTime: "2024-08-19T08:12:28Z"
#     message: |-
#       pod alertmanager-main-1: containers with incomplete status: [init-config-reloader]
#       pod alertmanager-main-2: containers with incomplete status: [init-config-reloader]
#     observedGeneration: 1
#     reason: NoPodReady
#     status: "False"
#     type: Available
#   - lastTransitionTime: "2024-08-19T08:12:28Z"
#     observedGeneration: 1
#     status: "True"
#     type: Reconciled
#   paused: false
#   replicas: 3
#   unavailableReplicas: 3
#   updatedReplicas: 3
# Delete the existing main resource
kubectl -n monitoring delete alertmanager main
# Recreate the main resource
kubectl -n monitoring apply -f alertmanager-main.yaml
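# Then watch the replacement pods come up (the label below comes from the podMetadata in the resource above)
kubectl -n monitoring get pods -l app.kubernetes.io/name=alertmanager -w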
# Viewing the StatefulSet pod logs showed errors about DNS resolution failures, so I checked the k8s CoreDNS component and found the problem: my cluster is a highly available deployment using Calico as the network plugin, with node addresses 10.10.40.100-105, a Service CIDR of 10.96.0.0/16, and a pod CIDR of 10.244.0.0/16, yet the CoreDNS pods had addresses in the 10.88.0.0/16 range.
kubectl -n monitoring logs sts/alertmanager-main
kubectl get pod -A -o wide
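# To confirm the DNS suspicion independently of Alertmanager, resolve a Service name from a throwaway pod; the busybox image is just an example, and the alertmanager-operated Service name comes from the cluster.peer args above
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup alertmanager-operated.monitoring.svc.cluster.local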
# So I looked at the CNI network configuration and saw that every node had a cni0 NIC. That NIC is created by the flannel network plugin, and that was the problem: Calico creates interfaces with names like calic741a2df36d@if2. After the old CoreDNS pods were removed, the network returned to normal.
# After that, deleting the whole Prometheus stack and redeploying it brought everything back to normal as well.
ls -l /etc/cni/net.d/
# Delete both CoreDNS pods so they are recreated with the correct network configuration
kubectl -n kube-system delete pod coredns-5bbd96d687-gtl9r
kubectl -n kube-system get pod -o wide
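# The recreated CoreDNS pods should now get addresses from the Calico-managed pod CIDR (10.244.0.0/16 here) instead of 10.88.0.0/16; k8s-app=kube-dns is the standard CoreDNS label
kubectl -n kube-system get pod -l k8s-app=kube-dns -o wide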
Summary:
# Whatever set of services you deploy, if the pods run and pods on different nodes can reach each other, it is generally not a network problem. When an individual pod misbehaves like this, read the error carefully: if it turns out to be a port connection refused or similar, check the k8s CoreDNS component first; this often works wonders, though of course you still have to judge against the actual situation.
# If the cluster deployment itself is in question, switching to a single-node test cluster is also a good way to troubleshoot.