
OpenTelemetry Collector usage


Contents
  • OpenTelemetry Collector usage
    • Service
      • Extensions
      • healthcheckextension
      • Pipelines
    • Receiver
      • OTLP Receiver
      • prometheus receiver
      • filelog receiver
    • Processor
      • Data attribution
      • Important
      • memory limiter processor
      • batch processor
      • attributes processor && Resource Processor
      • filter processor
      • k8s attributes processor
      • Tail Sampling Processor
      • transform processor
      • routing processor
    • Exporter
      • debug exporter
      • otlp exporter
      • otlp http exporter
      • prometheus exporter
      • prometheus remote write exporter
      • loadbalancing exporter
    • Connector
      • roundrobin connector
      • span metrics connector
    • troubleshooting
    • Scaling
      • When to scale
      • How to scale
        • scrapers
        • stateful


The OpenTelemetry Collector contains the following components:

  • receiver
  • processor
  • exporter
  • connector
  • Service

Note that the components are only defined here; for them to actually take effect, they must be referenced in the service section.

The official opentelemetry-collector and opentelemetry-collector-contrib repositories provide a large number of Collector component implementations. The former is the core of the Collector and provides vendor-independent configuration; the latter contains components contributed by different vendors, such as AWS, Azure, Kafka, and so on. You can combine components from both to meet business needs. It is also worth noting that each component directory in the two repositories contains its own documentation, such as otlpreceiver and prometheusremotewriteexporter.

Service

The service field organizes and enables the receiver, processor, exporter, and extension components. A service contains the following subfields:

  • Extensions
  • Pipelines
  • Telemetry: supports configuring the Collector's own metrics and logs
    • By default, OpenTelemetry exposes its own metrics at http://127.0.0.1:8888/metrics. The address field specifies where the metrics are exposed, and the level field controls how many metrics are exposed (the metrics emitted at each level are listed in the documentation):

      • none: no telemetry data is collected
      • basic: collects basic telemetry data
      • normal: the default level; adds standard telemetry data on top of basic
      • detailed: the most detailed level, including dimensions and views
    • The default log level is INFO; DEBUG, WARN, and ERROR are also supported

Extensions

Collector authentication, health monitoring, service discovery or data forwarding can be implemented using extensions. Most extensions have default configurations.

service:
  extensions: [health_check, pprof, zpages]
  telemetry:
    metrics:
      address: 0.0.0.0:8888
      level: normal

healthcheckextension

It can provide a health check endpoint for the pod's probes:

    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133

Pipelines

A pipeline contains the set of receivers, processors and exporters, and the same receivers, processors and exporters can be put into multiple pipelines.

Pipelines can be configured with the following types:

  • traces: collection and processing of trace data
  • metrics: collection and processing of metric data
  • logs: collection and processing of log data

Note that the order in which processors are listed determines the order in which data is processed.

service:
  pipelines:
    metrics:
      receivers: [opencensus, prometheus]
      processors: [batch]
      exporters: [opencensus, prometheus]
    traces:
      receivers: [opencensus, jaeger]
      processors: [batch, memory_limiter]
      exporters: [opencensus, zipkin]

The following focuses on a few common component configurations.

Receiver

Used to receive telemetry data.

Multiple receivers of the same type can be configured using the <receiver type>/<name> syntax, as long as each receiver name is unique. At least one receiver must be configured in the Collector.

receivers:
  # Receiver 1.
  # <receiver type>:
  examplereceiver:
    # <setting one>: <value one>
    endpoint: 1.2.3.4:8080
    # ...
  # Receiver 2.
  # <receiver type>/<name>:
  examplereceiver/settings:
    # <setting two>: <value two>
    endpoint: 0.0.0.0:9211

OTLP Receiver

Receives gRPC or HTTP traffic in the OTLP format. This is push mode: clients push telemetry data to the Collector:

receivers:
  otlp:
    protocols:
      grpc:
      http:

The otlp receiver can be defined under k8s in the following way:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317 # server that receives data in the gRPC format
      http:
        endpoint: ${env:MY_POD_IP}:4318 # server that receives data in the HTTP format

A receiver may work in push or pull mode; the haproxyreceiver, for example, uses pull mode:

receivers:
  haproxy:
    endpoint: http://127.0.0.1:8080/stats
    collection_interval: 1m
    metrics:
      haproxy.connection_rate:
        enabled: false
      :
        enabled: true

prometheus receiver

The prometheusreceiver can pull metrics data the way Prometheus does, but note that this receiver is still under active development; see the official caveats and unsupported features.

receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 5s
            static_configs:
              - targets: ['0.0.0.0:8888']
          - job_name: k8s
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              regex: "true"
              action: keep
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: "(request_duration_seconds.*|response_duration_seconds.*)"
              action: keep

filelog receiver

The filelog-receiver is used to collect logs from files.
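
A minimal sketch of a filelog receiver configuration; the log path and the parsing operator here are illustrative, not from the original text:

receivers:
  filelog:
    include: [ /var/log/myapp/*.log ] # hypothetical log path
    start_at: beginning
    operators:
      # parse lines such as "2024-08-12T11:00:17Z INFO message..."
      - type: regex_parser
        regex: '^(?P<time>\S+) (?P<sev>[A-Z]+) (?P<msg>.*)$'
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%SZ'
        severity:
          parse_from: attributes.sev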

Processor

Processors modify or transform the data collected by receivers according to each processor's rules or configuration, e.g. filtering, dropping, or renaming. The execution order of processors follows the order in which they are defined in the pipeline. The recommended order is as follows (a combined sketch follows the list):

  • memory_limiter
  • sampling processors or initial filtering processors
  • processors that depend on the context of the data source, e.g. k8sattributes
  • batch
  • other processors
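
Putting the recommendation together, a pipeline honoring this order might look like the sketch below; the filter entry stands in for whatever sampling/filtering processor is actually used:

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory protection first, cheap filtering early,
      # context enrichment before batching
      processors: [memory_limiter, filter, k8sattributes, batch]
      exporters: [otlp]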

Data attribution


Since a receiver may be attached to multiple pipelines, multiple processors may be handling data from the same receiver at the same time, which raises the question of data ownership. From the pipelines' point of view there are two ownership modes:

  • Exclusive data: the pipeline copies the data received from the receiver, and the individual pipelines do not affect each other.
  • Shared data: the pipeline does not copy the data received from the receiver; multiple pipelines share the same data, which is read-only and cannot be modified. A processor declares that it does not modify data by setting MutatesData=false, which avoids the data copying of the exclusive mode.

Note: the official documentation warns that when multiple pipelines reference the same receiver, the pipelines are only guaranteed to be independent at the data level; since the entire process uses synchronous calls, if one pipeline blocks, the other pipelines using the same receiver will block as well.

Important

When the same receiver is referenced in more than one pipeline, the Collector creates only one receiver instance at runtime that sends the data to a fan-out consumer. The fan-out consumer in turn sends the data to the first processor of each pipeline. The data propagation from receiver to the fan-out consumer and then to processors is completed using a synchronous function call. This means that if one processor blocks the call, the other pipelines attached to this receiver are blocked from receiving the same data, and the receiver itself stops processing and forwarding newly received data.
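For illustration, a sketch of the situation this warning describes: one otlp receiver referenced by two pipelines, so a single receiver instance fans out to both (the exporter names are hypothetical):

service:
  pipelines:
    traces:                     # pipeline 1
      receivers: [otlp]         # same receiver instance as below
      processors: [batch]
      exporters: [otlp/backend1]
    traces/sampled:             # pipeline 2
      receivers: [otlp]         # shared; a blocking processor here stalls pipeline 1 too
      processors: [tail_sampling, batch]
      exporters: [otlp/backend2]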

memory limiter processor

Used to prevent Collector OOM. This processor periodically checks memory usage; when thresholds are exceeded it rejects data and forces GC to reclaim memory. The memorylimiterprocessor has two thresholds: a soft limit and a hard limit. When memory usage exceeds the soft limit, the processor rejects data and returns an error (so the data source must be able to retry sending, otherwise data will be lost) until usage falls back below the soft limit. If memory usage exceeds the hard limit, a GC is forced.

It is recommended to set the memorylimiterprocessor as the first processor. The configuration parameters are:

  • check_interval (default 0s): memory check interval; the recommended value is 1s. If the Collector's memory usage is spiky, lower check_interval or increase spike_limit_mib to avoid exceeding the hard limit.
  • limit_mib (default 0): defines the hard limit, the maximum amount of memory in MiB that the process heap may allocate. Note that total memory usage is usually about 50 MiB higher than this value.
  • spike_limit_mib (default: 20% of limit_mib): the maximum expected spike between memory measurements; must be less than limit_mib. The soft limit equals (limit_mib - spike_limit_mib); the recommended value for spike_limit_mib is 20% of limit_mib.
  • limit_percentage (default 0): defines the hard limit as a percentage of total available memory; lower priority than limit_mib.
  • spike_limit_percentage (default 0): the maximum expected spike expressed as a percentage; can only be used together with limit_percentage.

Use it as follows:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800
Or, using percentage-based limits:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 50
    spike_limit_percentage: 30

batch processor

The batch processor accepts spans, metrics, or logs and batches them, which helps compress the data and reduces the number of outgoing connections needed to transmit it.

It is recommended to configure the batch processor on every Collector, placed after the memory_limiter and any sampling processors. Batches are sent based on either size or a time interval.

The configuration parameters are as follows:

  • send_batch_size (default 8192): the number of spans, metric data points, or log records after which a batch is sent.
  • timeout (default 200ms): the time after which a batch is sent regardless of size. If set to 0, send_batch_size is ignored and only send_batch_max_size determines when data is sent.
  • send_batch_max_size (default 0): upper limit of a batch size; must be greater than or equal to send_batch_size. 0 means no upper limit.
  • metadata_keys (default empty): if set, the processor creates a separate batcher instance for each distinct combination of values of these metadata keys. Note that batching by metadata increases the memory needed for batching.
  • metadata_cardinality_limit (default 1000): when metadata_keys is non-empty, limits the number of distinct metadata key/value combinations processed.

The following defines a default batch processor and a customized batch processor. Note that this is only a declaration; to take effect, they must be referenced in the service section.

processors:
  batch:
  batch/2:
    send_batch_size: 10000
    timeout: 10s

attributes processor && Resource Processor

The Resource Processor can be seen as a subset of the attributes processor; both are used to modify resource (span, log, metric) attributes.

The attributes processor has two main functions: modifying resource attributes and filtering data. It is typically used to modify resource attributes; for data filtering, consider the filterprocessor instead.

Below are some common ways of modifying resource attributes, similar to how Prometheus rewrites labels; see the official examples:

processors:
  attributes/example:
    actions:
      - key: db.table
        action: delete
      - key: redacted_span
        value: true
        action: upsert
      - key: copy_key
        from_attribute: key_original
        action: update
      - key: account_id
        value: 2245
        action: insert
      - key: account_password
        action: delete
      - key: account_email
        action: hash
      - key: http.status_code
        action: convert
        converted_type: int
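
For completeness, a sketch of the Resource Processor, which uses the same action syntax; this mirrors the example in the upstream resourceprocessor README:

processors:
  resource:
    attributes:
      - key: cloud.zone
        value: zone-1
        action: upsert
      - key: k8s.cluster.name
        from_attribute: k8s-cluster
        action: insert
      - key: redundant-attribute
        action: delete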

filter processor

Used to drop spans, span events, metrics, data points, and logs collected by the Collector. The filterprocessor uses OTTL syntax to define the conditions under which telemetry data should be dropped; data matching any condition is dropped. The supported configuration sections and their OTTL contexts are:

traces.span        Span
traces.spanevent   SpanEvent
metrics.metric     Metric
metrics.datapoint  DataPoint
logs.log_record    Log

The following drops all HTTP spans:

processors:
  filter:
    error_mode: ignore
    traces:
      span:
        - attributes["http.request.method"] != nil

In addition, the filter processor supports OTTL converter functions, for example:

# Drops metrics containing the '' attribute key
filter/keep_good_metrics:
  error_mode: ignore
  metrics:
    metric:
      - 'HasAttrKeyOnDatapoint("")'

k8s attributes processor

This processor automatically discovers k8s resources and injects the required metadata into spans, metrics, and logs as resource attributes.

When the k8sattributesprocessor receives data (logs, traces, or metrics), it tries to associate the data with a pod; if the association succeeds, it injects the pod's metadata into that data. By default, the k8sattributesprocessor uses the inbound connection IP and the Pod IP for the association, but the association can also be customized via resource_attribute rules:

Each rule contains a pair of from (the rule type) and name (the attribute name, when from is resource_attribute).

from has two types:

  • connection: matches the data using the IP attribute from the connection context. When using this type, the processor must be placed before any batching or tail sampling processors.
  • resource_attribute: specifies an attribute from the received resource to match on. Only metadata attributes can be used.
pod_association:
  # below association takes a look at the datapoint's  resource attribute and tries to match it with
  # the pod having the same attribute.
  - sources:
      - from: resource_attribute
        name: 
  # below association matches for pair `` and ``
  - sources:
      - from: resource_attribute
        name: 
      - from: resource_attribute
        name: 

By default the following attributes are extracted and added; the set can be modified via the metadata option:

  • k8s.namespace.name
  • k8s.pod.name
  • k8s.pod.uid
  • k8s.pod.start_time
  • k8s.deployment.name
  • k8s.node.name

The k8sattributesprocessor also supports extracting resource attributes from the labels and annotations of pods, namespaces, and nodes:

extract:
  annotations:
    - tag_name: a1 # extracts value of annotation from pods with key `annotation-one` and inserts it as a tag with key `a1`
      key: annotation-one
      from: pod
    - tag_name: a2 # extracts value of annotation from namespaces with key `annotation-two` with regexp and inserts it as a tag with key `a2`
      key: annotation-two
      regex: field=(?P<value>.+)
      from: namespace
    - tag_name: a3 # extracts value of annotation from nodes with key `annotation-three` with regexp and inserts it as a tag with key `a3`
      key: annotation-three
      regex: field=(?P<value>.+)
      from: node
  labels:
    - tag_name: l1 # extracts value of label from namespaces with key `label1` and inserts it as a tag with key `l1`
      key: label1
      from: namespace
    - tag_name: l2 # extracts value of label from pods with key `label2` with regexp and inserts it as a tag with key `l2`
      key: label2
      regex: field=(?P<value>.+)
      from: pod
    - tag_name: l3 # extracts value of label from nodes with key `label3` and inserts it as a tag with key `l3`
      key: label3
      from: node

A full example follows. Since the k8sattributesprocessor is itself a Kubernetes controller, the filter option should be used to limit the scope of its list/watch:

k8sattributes:
k8sattributes/2:
  auth_type: "serviceAccount"
  passthrough: false
  filter:
    node_from_env_var: KUBE_NODE_NAME
  extract:
    metadata:
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.deployment.name
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.start_time
    labels:
      - tag_name: app.label.component
        key: app.kubernetes.io/component
        from: pod
  pod_association:
    - sources:
        - from: resource_attribute
          name: 
    - sources:
        - from: resource_attribute
          name: 
    - sources:
        - from: connection

Tail Sampling Processor

Samples traces according to predefined policies. Note that for the sampling policies to work correctly, all spans of a trace must be processed in the same Collector instance. This processor must be placed after context-dependent processors (such as k8sattributes); otherwise the reassembly will lose the original context. Before sampling, spans are grouped by trace_id, so the tail sampling processor can be used directly without a groupbytraceprocessor.

Among the tailsamplingprocessor policies, and is a special policy that strings multiple policies together with AND logic. In the following example, and combines several policies to:

  1. filter data whose service name is in [service-1, service-2, service-3]
  2. from those services' data, filter data whose route is in [/live, /ready]
  3. finally set the sampling percentage of the [/live, /ready] data from [service-1, service-2, service-3] to 0.1
        and:
          {
            and_sub_policy: # set of policies combined with AND logic
              [
                {
                  # filter by service name
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: ,
                      values: [service-1, service-2, service-3],
                    },
                },
                {
                  # filter by route
                  name: route-live-ready-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: ,
                      values: [/live, /ready],
                      enabled_regex_matching: true, #Enabling Regular Expressions
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 0.1 },
                },
              ],
          },
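
For comparison, a sketch of the tail_sampling processor with standalone (non-and) policies; the values here are illustrative:

processors:
  tail_sampling:
    decision_wait: 10s # wait this long for a trace's spans before evaluating policies
    policies:
      [
        {
          # keep all traces that contain an error
          name: errors-policy,
          type: status_code,
          status_code: { status_codes: [ERROR] },
        },
        {
          # randomly sample 25% of traces
          name: probabilistic-policy,
          type: probabilistic,
          probabilistic: { sampling_percentage: 25 },
        },
      ]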

See more in the official examples.

transform processor

The processor consists of conditions and statements associated with a context type, and it executes them against the received telemetry data in the configured order. It uses an SQL-like syntax called the OpenTelemetry Transformation Language (OTTL).

The transform processor can be configured with multiple context statements for traces, metrics, and logs; the context field specifies which OTTL context the statements use:

Telemetry OTTL Context
Resource Resource
Instrumentation Scope Instrumentation Scope
Span Span
Span Event SpanEvent
Metric Metric
Datapoint DataPoint
Log Log

The Contexts supported by trace, metric and log are as follows:

Signal Context Values
trace_statements resource, scope, span, and spanevent
metric_statements resource, scope, metric, and datapoint
log_statements resource, scope, and log

Each statement can contain a where clause that determines whether the statement is executed.

The transform processor also supports an optional error_mode field, which determines how the processor reacts to errors raised by statements:

error_mode  description
ignore      the processor ignores the error, logs it, and continues with the next statement; recommended mode.
silent      the processor ignores the error, does not log it, and continues with the next statement.
propagate   the processor returns the error up the pipeline, causing the Collector to drop the payload; default option.

In addition, the transform processor supports OTTL functions that can add, delete, and modify telemetry data.

In the following example, if the attribute test does not exist, it is set to "pass":

transform:
  error_mode: ignore
  trace_statements:
    - context: span
      statements:
        # accessing a map with a key that does not exist will return nil. 
        - set(attributes["test"], "pass") where attributes["test"] == nil

debug

Problems can be located by enabling debug logging in the Collector:

receivers:
  filelog:
    start_at: beginning
    include: [  ]

processors:
  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["test"], "pass")
          - set(instrumentation_scope.attributes["test"], ["pass"])
          - set(attributes["test"], true)

exporters:
  debug:

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - transform
      exporters:
        - debug

routing processor

Routes logs, metrics, or traces to specific exporters based on inbound HTTP request (gRPC) headers or resource attribute values.

Attention:

  • This processor terminates the pipeline's subsequent processors and issues a warning if other processors are defined after it.
  • If an exporter is added to the pipeline, it must also be added to this processor, otherwise it will not take effect.
  • Since this processor relies on HTTP headers or resource attributes, be careful when using aggregation processors (batch or groupbytrace) in the pipeline.

The mandatory configuration parameters are:

  • from_attribute: the HTTP header name or resource attribute name from which to read the routing value.
  • table: the processor's routing table.
  • table.value: a possible value of the from_attribute field.
  • table.exporters: the exporters used when the value of the from_attribute field matches value.

The optional fields are listed below:

  • attribute_source: defines the source of the from_attribute attribute:
    • context (default): query the context (which includes HTTP headers); data can be injected manually or by a third-party service (such as a gateway).
    • resource: query the resource attributes.
  • drop_resource_routing_attribute: whether to remove the resource attribute used for routing.
  • default_exporters: exporters for data that does not match any routing table entry.

Examples are given below:

processors:
  routing:
    from_attribute: X-Tenant
    default_exporters:
    - jaeger
    table:
    - value: acme
      exporters: [jaeger/acme]
exporters:
  jaeger:
    endpoint: localhost:14250
  jaeger/acme:
    endpoint: localhost:24250

Exporter

Note that most OpenTelemetry exporters work in push mode, sending data to a backend.

debug exporter

For debugging: outputs telemetry data to the terminal. The configuration parameters are:

  • verbosity (default basic): one of basic (summary output), normal (actual data), or detailed (detailed output).
  • sampling_initial (default 2): the number of messages output per second initially.
  • sampling_thereafter (default 1): the sampling rate after sampling_initial; 1 disables sampling. That is, each second the first sampling_initial messages are output, then only every sampling_thereafter-th message is output and the rest are discarded.

exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

otlp exporter

Sends data in the OTLP format via gRPC. Note that this is push mode and TLS is required by default; retry and queue behavior can optionally be configured:

exporters:
  otlp:
    endpoint: otelcol2:4317
    tls:
      cert_file: 
      key_file: 
  otlp/2:
    endpoint: otelcol2:4317
    tls:
      insecure: true

otlp http exporter

Sends data in the OTLP format via HTTP:

endpoint: "https://1.2.3.4:1234"
tls:
  ca_file: /var/lib/
  cert_file: certfile
  key_file: keyfile
  insecure: true
timeout: 10s
read_buffer_size: 123
write_buffer_size: 345
sending_queue:
  enabled: true
  num_consumers: 2
  queue_size: 10
retry_on_failure:
  enabled: true
  initial_interval: 10s
  randomization_factor: 0.7
  multiplier: 1.3
  max_interval: 60s
  max_elapsed_time: 10m
headers:
  "can you have a . here?": "F0000000-0000-0000-0000-000000000000"
  header1: "234"
  another: "somevalue"
compression: gzip

prometheus exporter

Exposes metrics in the Prometheus format (pull mode). The parameters are:

  • endpoint: the address on which metrics are exposed, under the path /metrics.
  • const_labels: key/value pairs appended to every metric.
  • namespace: if set, metrics are exposed as <namespace>_<metrics>.
  • send_timestamps (default false): whether to send the collection timestamps in the response.
  • metric_expiration (default 5m): how long exposed metrics remain available without updates.
  • resource_to_telemetry_conversion (default false): if enabled, all resource attributes are converted to metric labels.
  • enable_open_metrics (default false): if enabled, metrics are exposed in the OpenMetrics format, which supports exemplars.
  • add_metric_suffixes (default true): if false, type and unit suffixes are not added.

exporters:
  prometheus:
    endpoint: "1.2.3.4:1234" # The exposed address is:https://1.2.3.4:1234/metrics
    tls:
      ca_file: "/path/to/"
      cert_file: "/path/to/"
      key_file: "/path/to/"
    namespace: test-space
    const_labels:
      label1: value1
      "another label": spaced value
    send_timestamps: true
    metric_expiration: 180m
    enable_open_metrics: true
    add_metric_suffixes: false
    resource_to_telemetry_conversion:
      enabled: true

It is recommended to use the transform processor to set the most common resource attributes to metric labels.

processor:
  transform:
    metric_statements:
      - context: datapoint
        statements:
        - set(attributes["namespace"], [""])
        - set(attributes["container"], [""])
        - set(attributes["pod"], [""])

prometheus remote write exporter

Supports HTTP settings as well as retry and timeout settings.

Used to send OpenTelemetry metrics to Prometheus remote-write-compatible backends such as Cortex, Mimir, and Thanos.

The configuration parameters are as follows:

  • endpoint: the remote write URL.
  • tls: by default TLS must be configured:
    • insecure (default false): to enable TLS, cert_file and key_file must be configured.
  • external_labels: additional label names and values attached to each metric.
  • headers: additional headers attached to each HTTP request.
  • add_metric_suffixes (default true): if false, type and unit suffixes are not added.
  • send_metadata (default false): if true, Prometheus metadata is generated and sent.
  • remote_write_queue: queue and sending parameters for remote write:
    • enabled (default true): enables the sending queue.
    • queue_size (default 10000): number of OTLP metrics that can be queued.
    • num_consumers (default 5): minimum number of workers used to send requests.
  • resource_to_telemetry_conversion (default false): if true, all resource attributes are converted to metric labels.
  • target_info (default false): if true, a target_info metric is generated for each resource metric.
  • max_batch_size_bytes (default 3000000, ~2.861 MiB): maximum batch size sent to the remote end; a batch larger than this value is split into multiple batches.

exporters:
  prometheusremotewrite:
    endpoint: "https://my-cortex:7900/api/v1/push"
    external_labels:
      label_name1: label_value1
      label_name2: label_value2
    resource_to_telemetry_conversion:
      enabled: true # Convert resource attributes to metric labels

It is recommended to use the transform processor to set the most common resource attributes to metric labels.

processor:
  transform:
    metric_statements:
      - context: datapoint
        statements:
        - set(attributes["namespace"], [""])
        - set(attributes["container"], [""])
        - set(attributes["pod"], [""])

loadbalancing exporter

Load-balances spans, metrics, and logs based on a routing_key. If routing_key is not configured, the default for traces is traceID and the default for metrics is service; that is, spans with the same traceID (or the same service, when service is the routing_key) are sent to the same backend. This is especially suitable for tail-based samplers or RED-metrics collectors, which need to see the full trace on one backend.

Note that the load balancing is based only on the trace ID or service name; it does not take the actual backend load into account, nor does it perform round-robin load balancing.

The optional routing_key values are:

routing_key can be used for
service logs, spans, metrics
traceID logs, spans
resource metrics
metric metrics
streamID metrics

Backends can be configured statically or via DNS. When a backend is updated, routes are redistributed based on R/N (total number of routes / total number of backends). If backends change frequently, consider using the groupbytrace processor.

Note that if a backend fails, the loadbalancingexporter does not attempt to resend the data, so data may be lost; queue and retry mechanisms therefore need to be configured on the exporter.

  • When the resolver is static, if one backend becomes unavailable, data export fails for all backends until that backend recovers or is removed from the static list. The dns resolver follows the same principle.
  • When using the k8s or dns resolvers, topology changes are eventually reflected in the loadbalancingexporter. (A sketch of the static and dns resolvers follows the parameter list below.)

The main configuration parameters are listed below:

  • otlp: used to configure the OTLP exporter. Note that endpoint should not be configured here; that field is overwritten by the resolver's backends.
  • resolver: configure exactly one of static, dns, k8s, or aws_cloud_map; multiple resolvers cannot be specified at the same time.
    • In dns, hostname is used to obtain the list of IP addresses; port is the port used to export traces, default 4317; interval sets the resolution interval (e.g. 5s, 1d, 30m), default 5s; timeout sets the resolution timeout (e.g. 5s, 1d, 30m), default 1s.
    • In k8s, service refers to the Kubernetes service domain name; ports are the ports used to export traces, default 4317; if multiple ports are specified, the corresponding backends are added to the load balancer just like different pods; timeout sets the resolution timeout, default 1s.
  • routing_key: routes data (spans or metrics). Currently only the traces and metrics pipeline types are supported. The following values are supported:
    • service: route by service name. Ideal for span metrics, since all spans of a service are then sent to the same metrics Collector; otherwise metrics for the same service may be sent to different Collectors, making aggregation inaccurate.
    • traceID: route spans by traceID; has no effect on metrics.
    • metric: route metrics by metric name; has no effect on spans.
    • streamID: route metrics by the streamID of the data; the streamID is a unique value generated by hashing the attributes together with the resource, scope, and metric data.
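
As a sketch, the static and dns resolvers are configured like this (the hostnames are illustrative):

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames: # fixed backend list; an unavailable entry affects the whole group
          - backend-1:4317
          - backend-2:4317
  loadbalancing/dns:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelcol-headless.observability.svc.cluster.local # hypothetical headless service
        port: 4317
        interval: 30s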

The following example ensures that spans with the same traceID are sent to the same backend (Pod):

    receivers:
      otlp/external:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            endpoint: ${env:MY_POD_IP}:4318
      otlp/internal:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:14317
          http:
            endpoint: ${env:MY_POD_IP}:14318
            
    exporters:
      loadbalancing/internal:
        protocol:
          otlp:
            sending_queue:
              queue_size: 50000
            timeout: 1s
            tls:
              insecure: true
        resolver:
          k8s:
            ports:
            - 14317
            service: -opentelemetry
            timeout: 10s
      otlphttp/tempo:
        endpoint: :14252/otlp
        sending_queue:
          queue_size: 50000
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          exporters:
          - loadbalancing/internal
          processors:
          - memory_limiter
          - resource/metadata
          receivers:
          - otlp/external
        traces/loadbalancing:
          exporters:
          - otlphttp/tempo
          processors:
          - memory_limiter
          - resource/metadata
          - tail_sampling
          receivers:
          - otlp/internal

Connector

A connector joins two pipelines: it acts as the exporter at the end of one pipeline and as the receiver at the beginning of another, consuming data from the end of the first pipeline and emitting it into the start of the second. A connector can be used to consume, duplicate, or route data.

The following uses the count connector to feed trace data into a metrics pipeline:

receivers:
  foo/traces:
  foo/metrics:
exporters:
  bar:
connectors:
  count:
service:
  pipelines:
    traces:
      receivers: [foo/traces]
      exporters: [count]
    metrics:
      receivers: [foo/metrics, count]
      exporters: [bar]

roundrobin connector

Implements round-robin load balancing for exporters that do not scale well, such as prometheusremotewrite. The following distributes the received metrics in round-robin fashion to different prometheusremotewrite exporters (via the metrics/1 and metrics/2 pipelines):

receivers:
  otlp:
processors:
  resourcedetection:
  batch:
exporters:
  prometheusremotewrite/1:
  prometheusremotewrite/2:
connectors:
  roundrobin:
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [roundrobin]
    metrics/1:
      receivers: [roundrobin]
      exporters: [prometheusremotewrite/1]
    metrics/2:
      receivers: [roundrobin]
      exporters: [prometheusremotewrite/2]

span metrics connector

Used to aggregate Request, Error, and Duration (R.E.D) metrics from span data:

  • Request:

    calls{service.name="shipping",span.name="get_shipping/{shippingId}",span.kind="SERVER",status.code="Ok"}
    
  • Error:

    calls{service.name="shipping",span.name="get_shipping/{shippingId}",span.kind="SERVER",status.code="Error"}
    
  • Duration:

    duration{service.name="shipping",span.name="get_shipping/{shippingId}",span.kind="SERVER",status.code="Ok"}
    

Each metric contains at least the following dimensions (these exist on all spans): service.name, span.name, span.kind, and status.code.

Common parameters are listed below:

  • histogram (default explicit): configures the histograms; only explicit or exponential can be selected.
    • disable (default false): disables all histogram metrics.
    • unit (default ms): ms or s.
    • explicit: specifies the histogram's bucket boundaries. Default [2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s].
    • exponential: maximum number of buckets in the positive and negative ranges.
  • dimensions: dimensions to add beyond the defaults, each defined by a name that is looked up in the span's attributes and then in the resource attributes, e.g. ip or region. If the name attribute is not found in the span, the value given by default is used; if default is not defined either, the dimension is omitted.
  • exclude_dimensions: a list of dimensions to exclude from the default dimensions; used to keep unwanted data out of the metrics.
  • dimensions_cache_size (default 1000): size of the dimensions cache.
  • metrics_flush_interval (default 60s): interval at which generated metrics are flushed.
  • metrics_expiration (default 0): if no new spans are received within this time, the metrics are no longer exported; 0 means no expiration.
  • metric_timestamp_cache_size (default 1000).
  • events: configures event metrics:
    • enable (default false).
    • dimensions: required if enabled; the additional dimensions of the events metric.
  • resource_metrics_key_attributes: filters the resource attributes used to compute the resource metrics key hash, which prevents changes in the resource attributes from affecting counter metrics.

receivers:
  nop:

exporters:
  nop:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
    exemplars:
      enabled: true
    exclude_dimensions: ['']
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"    
    metrics_flush_interval: 15s
    metrics_expiration: 5m
    events:
      enabled: true
      dimensions:
        - name: 
        - name: 
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name

service:
  pipelines:
    traces:
      receivers: [nop]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [nop]

troubleshooting

  • Use the debug exporter.
  • Use the pprof extension, which listens on port 1777, to collect pprof data.
  • Use the zPages extension, which listens on port 55679 and serves /debug/tracez, to locate (a configuration sketch for these extensions follows the list):
    • latency issues
    • deadlocks and instrumentation issues
    • errors
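
A sketch enabling both extensions; the ports are the defaults mentioned above:

extensions:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [pprof, zpages]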

Scaling

When to scale

  • When using the memory_limiter processor, otelcol_processor_refused_spans can be checked to see whether memory is sufficient.
  • The Collector uses a queue to hold data waiting to be sent; if otelcol_exporter_queue_size > otelcol_exporter_queue_capacity, data is rejected (otelcol_exporter_enqueue_failed_spans). See the alert sketch after this list.
  • In addition, specific components expose their own metrics, such as otelcol_loadbalancer_backend_latency.
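
As an example of acting on these signals, a Prometheus alerting rule on the queue metrics might look like this sketch (the threshold and duration are illustrative):

groups:
  - name: otel-collector
    rules:
      - alert: CollectorSendQueueNearFull
        # a filling queue suggests the exporter cannot keep pace; consider scaling out
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector send queue is over 80% full"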

How to scale

For scaling purposes, components fall into three types: stateless, scrapers, and stateful. For stateless components, simply increase the number of replicas.

scrapers

For receivers such as hostmetricsreceiver and prometheusreceiver, the number of instances cannot simply be increased; otherwise every Collector would scrape the same endpoints. Instead, the Target Allocator can be used to shard the endpoints across instances, as sketched below.
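
With the OpenTelemetry Operator, the Target Allocator is enabled on the OpenTelemetryCollector custom resource; a sketch, assuming the operator is installed (names and intervals are illustrative):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: collector-with-ta
spec:
  mode: statefulset # the Target Allocator requires statefulset mode
  replicas: 3
  targetAllocator:
    enabled: true # scrape targets are sharded across the 3 replicas
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector
              scrape_interval: 30s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]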

stateful

For some components that keep data in memory, scaling out may change the results. The tail-sampling processor, for example, keeps span data in memory for a period of time and evaluates the sampling decision only once it considers the trace complete. If such Collectors are scaled by increasing the number of replicas, different Collectors may receive spans of the same trace, causing each Collector to evaluate the sampling decision independently and possibly reach different results (the trace loses spans).

Similarly with the span-to-metrics processor: aggregation based on service name becomes imprecise when different Collectors receive data for the same service.

To avoid this problem, place a load-balancing exporter in front of the Collectors that run tail-sampling or span-to-metrics. The load-balancing exporter hashes the trace ID or service name so that each backend Collector consistently receives the data belonging to it.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:

exporters:
  loadbalancing:
    protocol:
      otlp:
    resolver:
      dns:
        hostname: 

service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors: []
      exporters:
        - loadbalancing