
Linux debugging


Linux debugging, profiling and tracing training

This article is based on Bootlin's public training materials.

Debugging, Profiling, Tracing

Debugging

▶ Finding and fixing problems in the software/system

▶ Different tools and methods may be used:

  • Interactive debugging (e.g. GDB)
  • Post-mortem analysis (e.g. coredump)
  • Control flow analysis (using tracing tools)
  • Testing (integration testing)

▶ Most debugging is done in the development environment

▶ Usually intrusive, allowing suspension and resumption of program operation

Profiling

▶ Helps optimize performance by analyzing the runtime of a program

▶ Usually collects counters while the program is running

▶ Measure performance using specific tools, libraries, and operating system features such as perf or OProfile

▶ Data such as the number of function calls, memory usage, CPU load, and cache misses is first aggregated while the program runs; meaningful information is then extracted from this data and used to optimize the program

Tracing

▶ Understanding bottlenecks and problems by tracing the execution flow of an application

▶ Instrumentation code is executed at compile time or at run time. Dedicated tracers such as LTTng, trace-cmd, or SystemTap can be used to observe function calls from user space into kernel space.

▶ Allows you to view the functions and values used in the execution of the application

▶ Trace data is usually recorded at runtime and displayed at the end of the run

  • A tracing session generates a large amount of trace data.
  • Usually much larger than the profiling data

▶ Since data can be extracted via tracepoints, it can also be used for debugging purposes

Linux Application Stack

User/Kernel mode

▶ User mode and kernel mode usually refer to the privilege level of execution

▶ This mode actually refers to the processor execution mode, i.e., the hardware mode

▶ The kernel can control the complete processor state (exception handling, MMU, etc.), while user space can only do some basic control and execution under kernel supervision.

Processes and Threads

▶ A process is a group of resources, such as memory, threads, and file descriptors, allocated for the execution of a program.

▶ A PID identifies a process, and all of its information is exposed in the /proc/<pid>/ directory

  • /proc/self always shows information about the process that is accessing the directory

▶ When a process is started, it initializes a struct task_struct structure that represents a schedulable thread of execution

  • A process is embodied in the kernel as a thread associated with multiple resources.

▶ A thread is an independent execution unit that shares resources within the process, such as address space, file descriptors, etc.

▶ The fork() system call creates a new process, while pthread_create() creates a new thread (see the sketch after this list)

▶ Only one task can execute on a given CPU core at any time (the get_current() function returns the task currently executing), and a task can only execute on one CPU core at a time

▶ Different CPU cores can perform different tasks
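
A minimal user-space sketch of both creation paths (illustrative only, error handling omitted); compile with -pthread:

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void *thread_fn(void *arg) {
	/* Same PID as main(): threads share the process resources */
	printf("thread running in pid %d\n", getpid());
	return NULL;
}

int main(void) {
	pthread_t t;
	pid_t pid = fork();              /* new process: new PID, copied address space */
	if (pid == 0) {
		printf("child process pid %d\n", getpid());
		return 0;
	}
	pthread_create(&t, NULL, thread_fn, NULL); /* new thread inside the same process */
	pthread_join(t, NULL);
	wait(NULL);                      /* reap the child process */
	return 0;
}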

MMU and memory management

▶ In the Linux kernel (when configured with CONFIG_MMU=y), all addresses accessed by the CPU are virtual addresses

▶ The Memory Management Unit (MMU) can map this virtual memory to physical memory (RAM or IO)

▶ The basic mapping unit of the MMU is called a page; the page size is fixed (depending on the architecture/kernel configuration).

▶ Address mapping information is inserted into the page table of the MMU hardware and is used to translate virtual addresses accessed by the CPU into physical addresses

▶ The MMU can restrict page map access through certain attributes such as No Execute, Writable, Readable bits, Privileged/User bit, cacheability, etc.

Userspace/Kernel memory layout

▶ Each process has its own virtual address space (the mm field in struct task_struct) and its own page table (but all processes share the same kernel mapping)

▶ By default, all user mapping addresses (base of heap, stack, text, data, etc.) are randomized to make attacks harder. This can be disabled with the norandmaps kernel parameter

image

Different processes have different user memory spaces:

image
Kernel memory map

▶ The kernel has its own specific memory map

▶ Linear mapping is configured at kernel startup by inserting all elements from the kernel's initial page table

▶ Different memory areas are placed at different locations

▶ Kernel address space layout randomization (KASLR) is supported and can be disabled with the nokaslr command-line parameter

image
Userspace memory segments

▶ When a process is started, the kernel sets up several Virtual Memory Areas (VMAs, managed through struct vm_area_struct) with different properties.

▶ VMAs map memory areas with specific attributes (R/W/X)

▶ A segmentation fault occurs when a program accesses an unmapped memory area, or accesses a mapped area in a way that is not allowed, for example:

  • Write data to a read-only memory segment
  • Attempt to execute a non-executable memory segment

▶ New memory areas can be created with mmap() (see the sketch after the example below)

▶ The mappings of an application can be viewed in /proc/<pid>/maps:

7f1855b2a000-7f1855b2c000 rw-p 00030000 103:01 3408650 ld-2.
7ffc01625000-7ffc01646000 rw-p 00000000 00:00 0 [stack]
7ffc016e5000-7ffc016e9000 r--p 00000000 00:00 0 [vvar]
7ffc016e9000-7ffc016eb000 r-xp 00000000 00:00 0 [vdso]
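
A minimal sketch (illustrative) showing that an mmap() call adds a new anonymous VMA, which then shows up in /proc/<pid>/maps:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
	/* Create a new read/write anonymous mapping: a new VMA for this process */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	printf("new area at %p, now listed in /proc/%d/maps\n", p, getpid());
	getchar();	/* pause here and inspect the maps file from another terminal */
	munmap(p, 4096);
	return 0;
}
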
Userspace memory types
image
Terms for memory in Linux tools

▶ When using Linux tools, the following four terms are used to describe memory:

  • VSS/VSZ: Virtual Set Size (virtual memory size, including shared libraries)
  • RSS: Resident Set Size (total physical memory used, including shared libraries)
  • PSS: Proportional Set Size (private memory plus a proportional share of shared memory: if a process has 10MB of private memory and shares another 10MB with one other process, its PSS is 15MB).
  • USS: Unique Set Size (physical memory occupied by the process, excluding shared mapped memory)

▶ VSS >= RSS >= PSS >= USS.

Process context

▶ The process context can be thought of as the contents of the CPU registers associated with a process: execution registers, stack pointer, etc.

▶ The process context also carries an execution state and allows sleeping in kernel mode

▶ Processes executing in process context can be preempted

▶ Code running in process context can access the current struct task_struct via get_current()

image

Scheduling

▶ There are several reasons to wake up the scheduler

  • Periodic ticks (timer interrupts) generated at HZ frequency
  • Programmed interrupts on tickless systems (CONFIG_NO_HZ=y)
  • Explicit calls to schedule() in the code
  • Implicit calls to functions that can sleep (e.g. kmalloc(), wait_event(), and other blocking operations)

▶ When the scheduling function is entered, the scheduler selects a new struct task_struct to run and finally calls the switch_to() macro

switch_to() saves the process context of the current task and restores the process context of the next task, which becomes the new current task

The Linux Kernel Scheduler

▶ The Linux kernel scheduler is a key component in enabling real-time behavior

▶ It is responsible for deciding which runnable tasks to perform

▶ It is also responsible for selecting the CPU on which the task runs and is tightly coupled with CPUidle and CPUFreq.

▶ Responsible for task scheduling in both kernel space and user space

▶ Each task is assigned a scheduling class or policy.

▶ The scheduling algorithm selects the task to execute based on its class

▶ Tasks of different scheduling types can exist in the system

Non-Realtime Scheduling Classes

There are 3 non-real-time scheduling classes as follows:

SCHED_OTHER: Default policy, using the time slice algorithm

SCHED_BATCH: similar to SCHED_OTHER, but intended for CPU-intensive batch tasks

SCHED_IDLE: Very low priority.

SCHED_OTHER and SCHED_BATCH tasks can use the nice value to increase or decrease their scheduling frequency

  • A higher nice value implies a lower scheduling frequency
Realtime Scheduling Classes

There are 3 types of real-time scheduling as follows:

▶ Runnable tasks preempt other low-priority tasks

SCHED_FIFO: tasks with the same priority are scheduled on a first-in, first-out basis

SCHED_RR: similar to SCHED_FIFO, but uses time-slice round-robin between tasks of the same priority

SCHED_FIFO and SCHED_RR tasks can be assigned a priority from 1 to 99

SCHED_DEADLINE: used to execute repetitive jobs, with additional attributes attached to the task:

  • computation time: the time needed to complete a job
  • deadline: the maximum time allowed for a job to complete
  • period: the length of the interval during which a single job runs (at most one job per period)

▶ Defining task types alone is not sufficient for real-time behavior

Changing the Scheduling Class

▶ Each task has a scheduling class (Scheduling Class), which is SCHED_OTHER by default

The sched_setscheduler() system call (man 2 sched_setscheduler) can change the scheduling class of a task

The chrt tool:

  • Modifies the scheduling class of a running task: chrt -f/-b/-o/-r/-d -p PRIO PID

  • Can also start a program with a given scheduling class: chrt -f/-b/-o/-r/-d PRIO CMD

  • Displays the scheduling class and priority of a process: chrt -p PID

▶ A new process inherits the scheduling class of its parent, unless the SCHED_RESET_ON_FORK flag was set with sched_setscheduler()
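
A minimal sketch (illustrative; requires root/CAP_SYS_NICE, priority 42 chosen arbitrarily) switching the calling task to SCHED_FIFO with sched_setscheduler():

#include <sched.h>
#include <stdio.h>

int main(void) {
	struct sched_param sp = { .sched_priority = 42 };

	/* Move the calling task to the SCHED_FIFO real-time class */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running with policy %d\n", sched_getscheduler(0));
	return 0;
}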

Context switching

▶ Context switching is a behavior that changes the execution mode of the processor (Kernel ↔ User):

  • Explicit execution of a system call instruction (a synchronous request from user mode to the kernel)
  • Implicitly received exceptions (MMU exceptions, interrupts, breakpoints, etc.)

▶ This state change will eventually be reflected in a kernel entry (usually a call vector) that will execute the necessary code and set the correct state for kernel mode execution.

▶ The kernel handles behaviors such as register saving and switching to the kernel stack:

  • The kernel stack size is fixed for security purposes
Exceptions

▶ Exceptions are events that cause the CPU to enter exception mode (to handle the exception).

▶ There are two main types of exceptions: synchronous and asynchronous

  • Asynchronous exceptions are typically generated by MMU faults, bus aborts, or interrupts received from hardware or software
  • Synchronous exceptions are triggered by specific instructions, such as breakpoints or system calls

▶ When such an exception is triggered, the processor jumps to the exception vector and executes the exception code

Interrupts

▶ Interrupts are asynchronous signals generated by hardware peripherals

  • They can also be generated by specific instructions (e.g. Inter-Processor Interrupts)

▶ When an interrupt is received, the CPU changes its execution mode, jumps to a specific vector and switches to kernel mode to handle the interrupt

▶ When there are multiple CPUs (cores), interrupts are usually directed to a certain core

▶ The interrupt load of each CPU can be controlled through "IRQ affinity"

  • See core-api/irq/irq-affinity and man 1 irqbalance

▶ When handling an interrupt, the kernel executes in a special context called interrupt context

▶ This context has no access to user space and must not use get_current()

▶ Depending on the architecture, an IRQ stack may be used

▶ Interrupts are disabled (nested interrupts are not supported)!

image
System Calls

▶ System calls allow user space to request services from the kernel by executing a specific instruction (man 2 syscall)

  • A system call is usually executed when calling functions provided by the libc (e.g. read(), write(), etc.)

▶ The different system calls are identified by a numeric identifier passed in a register:

  • The kernel defines the system call identifiers with __NR_<syscall>, for example:

    #define __NR_read 63
    #define __NR_write 64
    

▶ The kernel holds a table of function pointers to these identifiers, which are used to call the correct handler function after validation of the system call has been completed.

▶ Passing system call parameters via registers (max. 6 parameters)

▶ When a system call is executed, the CPU changes its execution state and switches to kernel mode

▶ Each architecture has a specific hardware mechanism (man 2 syscall)

mov w8, #__NR_getpid
svc #0
tstne x0, x1
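
The assembly above issues the getpid system call directly; the same can be done from C through the generic syscall() wrapper (man 2 syscall). A minimal sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
	/* Invoke the getpid system call by number, bypassing the libc getpid() wrapper */
	long pid = syscall(SYS_getpid);
	printf("pid from raw syscall: %ld\n", pid);
	return 0;
}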

Kernel execution contexts

▶ The kernel executes code in different contexts depending on the event being processed

▶ May include disabling interrupts (by disabling interrupts, you can ensure that a particular interrupt handler does not preempt the current code), specific stacks, etc.

Kernel threads

▶ Kernel threads (kthreads) are a special type of struct task_struct not associated with any user resources (mm == NULL)

▶ Kernel threads are cloned from the kthreadd process and can be created with kthread_create() (see the sketch after the ps output below)

▶ Like user processes, kernel threads are scheduled and can sleep, since they execute in process context

▶ The ps command shows kernel thread names (in square brackets):

$ ps --ppid 2 -p 2 -o uname,pid,ppid,cmd,cls
USER PID PPID CMD                         CLS
root 2     0 [kthreadd]                    TS
root 3     2 [rcu_gp]                      TS
root 4     2 [rcu_par_gp]                  TS
root 5     2 [netns]                       TS
root 7     2 [kworker/0:0H-events_highpr   TS
root 10    2 [mm_percpu_wq]                TS
root 11    2 [rcu_tasks_kthread]           TS
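
A minimal kernel-module sketch (names are illustrative) creating a kernel thread with kthread_run(), which combines kthread_create() and wake_up_process():

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/module.h>

static struct task_struct *my_task;

static int my_thread_fn(void *data)
{
	/* Runs in process context: sleeping is allowed */
	while (!kthread_should_stop())
		msleep(1000);
	return 0;
}

static int __init my_init(void)
{
	my_task = kthread_run(my_thread_fn, NULL, "my_kthread");
	return PTR_ERR_OR_ZERO(my_task);
}

static void __exit my_exit(void)
{
	kthread_stop(my_task);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
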
Workqueues

▶ Workqueues allow scheduling the execution of work at a future point in time.

▶ Workqueues execute work functions in kernel threads:

  • Sleeping is allowed while executing the deferred work.
  • Interrupts can be enabled during execution

▶ Work can be executed in a dedicated workqueue or in a global workqueue shared by multiple users (see the sketch below).
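
A minimal kernel sketch (illustrative) deferring work to the global workqueue with INIT_WORK() and schedule_work(); the work function runs in a kernel thread and may sleep:

#include <linux/module.h>
#include <linux/workqueue.h>

static struct work_struct my_work;

static void my_work_fn(struct work_struct *work)
{
	pr_info("deferred work running; sleeping is allowed here\n");
}

static int __init my_init(void)
{
	INIT_WORK(&my_work, my_work_fn);
	schedule_work(&my_work);	/* queue the work on the global workqueue */
	return 0;
}

static void __exit my_exit(void)
{
	flush_work(&my_work);		/* wait for the work to complete */
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");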

softirq

▶ SoftIRQs are a kernel mechanism that runs in a software interrupt context

▶ They are used for code that must run after interrupt handling and requires low latency. Execution timing:

  • Executed after the interrupt context has processed a hard interrupt
  • Executed in the same context as the execution of interrupt handling, so sleep is not allowed.

▶ Code that needs to run in softirq context should use existing softirq-based mechanisms such as tasklets or BH workqueues (which replace tasklets since Linux 6.9) rather than implementing new softirqs:

image
Threaded interrupts

▶ Threaded interrupts are a mechanism that allows interrupts to be handled using a hard interrupt handler (IRQ handler) and a threaded interrupt handler

▶ The threaded interrupt handler runs in a kthread and can execute work that may sleep

▶ The kernel creates a kthread for each interrupt line requesting a thread interrupt

  • The kthread is named irq/<irq>-<name> and can be seen with the ps command (see the sketch below)
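
A minimal driver sketch (the IRQ number and names are placeholders) using request_threaded_irq(): the hard handler returns IRQ_WAKE_THREAD to defer the heavy work to the irq/<irq>-my_dev kthread:

#include <linux/interrupt.h>

static irqreturn_t my_hard_handler(int irq, void *dev_id)
{
	/* Minimal work in hard interrupt context, then wake the handler thread */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t my_thread_handler(int irq, void *dev_id)
{
	/* Runs in the irq/<irq>-my_dev kthread: sleeping is allowed */
	return IRQ_HANDLED;
}

static int my_setup_irq(int irq, void *priv)
{
	return request_threaded_irq(irq, my_hard_handler, my_thread_handler,
				    IRQF_ONESHOT, "my_dev", priv);
}
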
Allocations and context

▶ Memory can be allocated in the kernel with the following functions:

void *kmalloc(size_t size, gfp_t gfp_mask);
void *kzalloc(size_t size, gfp_t gfp_mask);
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

▶ All allocation functions take a gfp_mask parameter specifying the type of allocation:

  • GFP_KERNEL: normal allocation, may sleep while allocating (cannot be used in interrupt context)
  • GFP_ATOMIC: atomic allocation, never sleeps
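
A short kernel sketch (illustrative) contrasting the two flags:

#include <linux/slab.h>

void *alloc_in_process_context(void)
{
	/* GFP_KERNEL: may sleep to reclaim memory, not usable in interrupt context */
	return kzalloc(4096, GFP_KERNEL);
}

void *alloc_in_atomic_context(void)
{
	/* GFP_ATOMIC: never sleeps, but fails more easily under memory pressure */
	return kzalloc(4096, GFP_ATOMIC);
}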

Linux Common Analysis & Observability Tools

Pseudo Filesystems

▶ The kernel exposes some virtual file systems to provide system information

procfs Contains process and system information

  • The mount location is/proc
  • Often parsed by the tool to present the raw data in a more user-friendly way

sysfs: provides information about hardware/logical devices and their drivers; mounted at /sys

debugfs: exposes debugging-related information

  • Normally mounted in the/sys/kernel/debug/directory
  • mount -t debugfs none /sys/kernel/debug

procfs

procfs exposes process- and system-related information (man 5 proc)

  • /proc/cpuinfo: CPU information
  • /proc/meminfo: memory information (used, free, total, etc.)
  • /proc/sys/: tunable system parameters; the list of tunables is described in admin-guide/sysctl/index
  • /proc/interrupts: interrupt counts for each CPU
    • /proc/irq/: settings for each interrupt line
  • /proc/<pid>/: process-related information
    • /proc/<pid>/status: basic information about the process
    • /proc/<pid>/maps: memory mappings
    • /proc/<pid>/fd: file descriptors of the process
    • /proc/<pid>/task: descriptors of the threads belonging to the process
  • /proc/self/: information about the process accessing the file

▶ See filesystems/proc and man 5 proc for the documentation of what is available in procfs

sysfs

sysfs exposes information about the various kernel subsystems, hardware devices, and their drivers (man 5 sysfs)

▶ The file hierarchy represents the kernel's internal device tree and shows how devices and drivers are connected

/sys/kernel contains kernel debugging files:

  • irq: interrupt-related information (mapping, counts, etc.)
  • tracing: tracing control

▶ See admin-guide/abi-stable

debugfs

debugfs is a simple RAM-based filesystem that exposes debugging information

▶ Some subsystems (clk, block, dma, gpio, etc.) use it to expose internal debugging information

▶ Usually mounted at /sys/kernel/debug

  • /sys/kernel/debug/dynamic_debug: enables dynamic debugging
  • /sys/kernel/debug/clk/clk_summary: exposes the clock tree
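
For example, dynamic debug messages of a given source file can be enabled through the control file (a sketch; the file name is a placeholder):

# echo 'file my_driver.c +p' > /sys/kernel/debug/dynamic_debug/control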

ELF files analysis

ELF files

ELF stands for Executable and Linkable Format

▶ The file starts with a header that describes the overall structure of the binary

▶ A series of segments and sections containing data:

  • .text section: Code
  • .data section: Data
  • .rodata section: read-only data
  • .debug_info section: contains debugging information

▶ Sections are part of a segment and can be loaded into memory

▶ The same format is used on all architectures supported by the kernel; vmlinux itself is an ELF file

  • Many other operating systems also use ELF as a standard executable file format
image

binutils for ELF analysis

▶ The binutils are used to work with binary files (object files or executables)

  • They include ld, as, and other useful tools

readelf displays information about an ELF file (header, sections, segments, etc.)

objdump displays and disassembles ELF files

objcopy converts ELF files or extracts/translates parts of them

nm displays the list of symbols embedded in an ELF file

addr2line finds the source file and line corresponding to an address in an ELF file

binutils example

▶ Use nm to find the address of the ksys_read() kernel function

$ nm vmlinux | grep ksys_read
c02c7040 T ksys_read

▶ Use addr2line to find the source line corresponding to a kernel OOPS address or symbol:

$ addr2line -s -f -e vmlinux ffffffff8145a8b0
queue_wc_show
:516

▶ Use readelf to display an ELF header:

$ readelf -h binary
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
...

▶ Use objcopy to convert an ELF file to a flat binary:

$ objcopy -O binary  

ldd

▶ ldd displays the shared libraries used by an ELF file (man 1 ldd)

▶ ldd lists all libraries used at link time

  • It does not show libraries loaded with dlopen()
$ ldd /usr/bin/bash
.1 (0x00007ffdf3fc6000)
.8 => /usr/lib/.8 (0x00007fa2d2aef000)
.6 => /usr/lib/.6 (0x00007fa2d2905000)
.6 => /usr/lib/.6 (0x00007fa2d288e000)
/lib64/.2 => /usr/lib64/.2 (0x00007fa2d2c88000)

Processor and CPU monitoring Tools

▶ Many tools are available to monitor various parts of the system

▶ Most tools are CLI-interactive programs

  • Processes: ps, top, htop, etc.
  • Memory: free, vmstat, etc.
  • Network

▶ Most tools rely on the sysfs or procfs filesystems to get process, memory, and system information

  • Networking tools use the netlink interface of the kernel networking subsystem

ps & top (omitted)

mpstat

▶ Displaying multiprocessor information (man 1 mpstat)

▶ For detecting unbalanced CPU load, incorrect IRQ affinity, etc.

$ mpstat -P ALL
Linux 6.0.0-1-amd64 (fixe) 19/10/2022 _x86_64_ (4 CPU)
17:02:50 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
17:02:50 all 6,77 0,00 2,09 11,67 0,00 0,06 0,00 0,00 0,00 79,40
17:02:50 0 6,88 0,00 1,93 8,22 0,00 0,13 0,00 0,00 0,00 82,84
17:02:50 1 4,91 0,00 1,50 8,91 0,00 0,03 0,00 0,00 0,00 84,64
17:02:50 2 6,96 0,00 1,74 7,23 0,00 0,01 0,00 0,00 0,00 84,06
17:02:50 3 9,32 0,00 2,80 54,67 0,00 0,00 0,00 0,00 0,00 33,20
17:02:50 4 5,40 0,00 1,29 4,92 0,00 0,00 0,00 0,00 0,00 88,40

Memory monitoring tools

free

free is a simple program that shows the amount of used and remaining memory on the system (man 1 free)

  • Used to check whether the system is running out of memory
  • Uses /proc/meminfo to get its information
$ free -h
total used free shared buff/cache available
Mem: 15Gi 7.5Gi 1.4Gi 192Mi 6.6Gi 7.5Gi
Swap: 14Gi 20Mi 14Gi

A small value in the free column does not mean that memory is exhausted: Linux uses otherwise-unused memory for caches to optimize performance. See drop_caches in man 5 proc to observe the impact of buffers/cache on the free/available values

vmstat

vmstat shows virtual memory usage information for the system

▶ It can also show process, memory, paging, block IO, traps, disk, and CPU activity (man 8 vmstat)

▶ Data can be collected periodically: vmstat <interval> <count>

$ vmstat 1 6
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 253440 1237236 194936 9286980 3 6 186 540 134 157 3 5 82 10 0

▶ Note: vmstat counts kernel blocks as 1024 bytes

pmap

pmap displays the content of /proc/<pid>/maps in a more convenient way (man 1 pmap)

# pmap 2002
2002: /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
...
00007f3f958bb000  56K   r---- .3.32.1
00007f3f958c9000 192K   r-x-- .3.32.1
00007f3f958f9000  84K   r---- .3.32.1
00007f3f9590e000   8K   r---- .3.32.1
00007f3f95910000   4K   rw--- .3.32.1
00007f3f95937000   8K   rw---   [ anon ]
00007f3f95939000   8K   r---- .2
00007f3f9593b000 152K   r-x-- .2
00007f3f95961000  44K   r---- .2
00007f3f9596c000   8K   r---- .2
00007f3f9596e000   8K   rw--- .2
00007ffe13857000 132K   rw---   [ stack ]
00007ffe13934000  16K   r----   [ anon ]
00007ffe13938000   8K   r-x--   [ anon ]
total          11088K

I/O monitoring tools

iostat

iostat shows the IOs of each device on the system

▶ Used to see if a device is overloaded with IOs

$ iostat
Linux 5.19.0-2-amd64 (fixe) 11/10/2022 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
          8,43  0,00  1,52    8,77    0,00   81,28
          
Device    tps  kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme0n1 55,89  1096,88   149,33    0,00      5117334 696668  0
sda      0,03  0,92      0,00      0,00      4308    0       0
sdb    104,42  274,55    2126,64   0,00      1280853 9921488 0

iotop

iotop shows IO information for each process

▶ Used to see which process generates heavy IO traffic

  • Requires the kernel options CONFIG_TASKSTATS=y, CONFIG_TASK_DELAY_ACCT=y, and CONFIG_TASK_IO_ACCOUNTING=y
  • Also needs to be enabled at runtime: sysctl -w kernel.task_delayacct=1
# iotop
Total DISK READ:    20.61 K/s | Total DISK WRITE:   51.52 K/s
Current DISK READ:  20.61 K/s | Current DISK WRITE: 24.04 K/s
	TID  PRIO USER   DISK READ DISK WRITE> COMMAND
  2629 be/4 cleger 20.61 K/s 44.65 K/s firefox-esr [Cache2 I/O]
   322 be/3 root   0.00 B/s  3.43 K/s  [jbd2/nvme0n1p1-8]
 39055 be/4 cleger 0.00 B/s  3.43 K/s  firefox-esr [DOMCacheThread]
     1 be/4 root   0.00 B/s  0.00 B/s  init
     2 be/4 root   0.00 B/s  0.00 B/s  [kthreadd]
     3 be/0 root   0.00 B/s  0.00 B/s  [rcu_gp]
     4 be/0 root   0.00 B/s  0.00 B/s  [rcu_par_gp]

Networking Observability tools

ss

ss shows the state of network sockets

  • IPv4, IPv6, UDP, TCP, ICMP, and UNIX domain sockets

▶ Supersedes netstat

▶ Gets its information from /proc/net

▶ Usage:

  • ss: by default, show connected sockets
  • ss -l: show listening sockets
  • ss -a: show both listening and connected sockets
  • ss -4/-6/-x: show only IPv4, IPv6, or UNIX sockets
  • ss -t/-u: show only TCP or UDP sockets
  • ss -p: show the process using each socket
  • ss -n: show numeric addresses
  • ss -s: show a summary of existing sockets

▶ See also the ss manpage

# ss
Netid State  Recv-Q Send-Q						 Local Address:Port Peer Address:Port Process
u_dgr ESTAB  0      0 													 * 304840 				* 26673
u_str ESTAB  0      0   /run/dbus/system_bus_socket 42871         * 26100
icmp6 UNCONN 0      0 											  *:ipv6-icmp 			  *:*
udp   ESTAB  0      0 		192.168.10.115%wlp0s20f3:bootpc  192.168.10.88:bootps
tcp   ESTAB  0      136 								 172.16.0.1:41376    172.16.11.42:ssh
tcp   ESTAB  0      273 							 192.168.1.77:55494    87.98.181.233:https
tcp   ESTAB  0      0 							[2a02:...:dbdc]:38466    [2001:...:9]:imap2
...
#

iftop

iftop shows bandwidth usage per remote host

▶ Displaying bandwidth using a histogram

iftop -i eth0:

  • image

▶ You can customize the output

▶ See also the iftop manpage

tcpdump

tcpdump can capture network traffic and decode many protocols

▶ Captures packets using the libpcap library

▶ Captured packets can be saved to a file and read back later

  • Saved in the pcap format or the newer pcapng format
  • tcpdump -i eth0 -w
  • tcpdump -r

▶ Filters can be used to avoid capturing irrelevant packets

  • tcpdump -i eth0 tcp and not port 22


Wireshark (omitted)

Application Debugging

Good practices

▶ Modern compilers can detect many errors at compile time and report them as warnings

  • To catch errors as early as possible, it is recommended to build with -Werror -Wall -Wextra

▶ Compilers also provide static analysis capabilities

  • GCC provides this with the -fanalyzer flag
  • LLVM provides dedicated tools that can be plugged into the build process

▶ Component-specific helpers/hardening can also be used

  • For example, with the GNU C library, the _FORTIFY_SOURCE macro adds runtime checks of input sizes (see the example command below).
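
For instance (a sketch, file names are placeholders), fortification is typically enabled together with optimization at compile time:

$ gcc -O2 -D_FORTIFY_SOURCE=2 -o app app.c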

Building with debug information

Debugging with ELF files

▶ GDB debugs ELF files, which can embed debugging information

▶ Debugging information uses the DWARF format

▶ It allows the debugger to match addresses with symbol names, call sites, etc.

▶ Debugging information is generated into the ELF file by passing -g to the compiler

  • -g1: minimal debugging information (enough for backtraces)
  • -g2: the default level when -g is passed
  • -g3: includes extra information (such as macro definitions)

▶ See the GCC documentation for more details

Debugging with compiler optimizations

▶ Compiler optimizations (-O<level>) can optimize away certain variables and function calls

▶ When GDB tries to display such optimized-out values, it shows:

  • $1 = <value optimized out>

▶ To inspect variables and functions, it is best to compile with -O0 (no optimization)

  • Note: the kernel can only be compiled with -O2 or -Os

▶ Individual functions can also be annotated with a compiler attribute:

  • __attribute__((optimize("O0")))

▶ Removing the static qualifier from a function helps avoid inlining it

  • Note: LTO (Link Time Optimization) can still inline such functions!

▶ Marking a specific variable as volatile prevents the compiler from optimizing it away

Instrumenting code crashes

▶ The GNU extension function backtrace() (man 3 backtrace) can display the application call stack:

char **backtrace_symbols(void *const *buffer, int size);

▶ Hooks can be added on specific signals with signal() (man 3 signal) to print the call stack:

  • For example, catching the SIGSEGV signal allows dumping the current call stack (see the sketch below)
void (*signal(int sig, void (*func)(int)))(int);
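
A minimal sketch (illustrative) combining both: a SIGSEGV handler that dumps the current call stack with backtrace(); building with -g and -rdynamic makes the symbol names readable:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
	void *buf[32];
	int n = backtrace(buf, 32);

	/* Write the raw backtrace to stderr without calling malloc() */
	backtrace_symbols_fd(buf, n, STDERR_FILENO);
	_exit(1);
}

int main(void)
{
	signal(SIGSEGV, segv_handler);
	*(volatile int *)0 = 1;		/* deliberate crash to trigger the handler */
	return 0;
}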

The ptrace system call

ptrace

ptrace can trace a process by giving access to the tracee's memory and registers

▶ A tracer can observe and control the execution state of another process

▶ Tracing is implemented by attaching to a tracee process with the ptrace() system call (man 2 ptrace)

▶ ptrace() can be called directly, but it is usually used indirectly through tools:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

▶ Debugging tools such as GDB, strace, and others access the tracee's state this way (a minimal attach sketch follows)
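
A minimal tracer sketch (illustrative, x86-64 registers, error handling omitted) attaching to a PID, reading its registers, and detaching:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	pid_t pid;
	struct user_regs_struct regs;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);

	ptrace(PTRACE_ATTACH, pid, NULL, NULL);		/* stop the tracee */
	waitpid(pid, NULL, 0);				/* wait until it is stopped */
	ptrace(PTRACE_GETREGS, pid, NULL, &regs);	/* read its registers */
	printf("rip = 0x%llx\n", regs.rip);
	ptrace(PTRACE_DETACH, pid, NULL, NULL);		/* resume the tracee */
	return 0;
}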

GDB

Here's a brief overview of gdb's general commands:

  • gdb <program>: start gdb to debug a program
  • gdb -p <pid>: attach gdb to a running program, identified by its PID
  • (gdb) run [prog_arg1 [prog_arg2] ...]: run the program with the given arguments
  • break foobar (b): set a breakpoint on the foobar() function
  • break <file>:42: set a breakpoint at line 42 of the given file
  • print var, print $reg or print task->files[0].fd (p): print the variable var, the register $reg, or a more complex reference
  • info registers: display register information
  • continue (c): continue execution after a breakpoint
  • next (n): continue to the next line, stepping over function calls
  • step (s): continue to the next line, entering called functions
  • stepi (si): continue to the next instruction
  • finish: execute until the current function returns
  • backtrace (bt): show the program call stack
  • info threads (i threads): show the list of available threads
  • info breakpoints (i b): show the list of breakpoints/watchpoints
  • delete (d): delete a breakpoint
  • thread (t): select a thread
  • frame (f): select a specific frame of the call stack, n being its number
  • watch <variable> or watch \*<address>: add a watchpoint on a specific variable/address
  • print variable = value (p variable = value): modify the content of a specific variable
  • break <file>:42 if condition == value: break only if the given condition is true
  • watch <variable> if condition == value: trigger the watchpoint only if the given condition is true
  • display <expr>: automatically print the expression each time the program stops
  • x/<n><u> <address>: display memory at the given address; n is the number of units to display and u is the unit format (b/h/w/g); the i format displays instructions
  • list <expr>: show the source code around the current program counter location
  • disassemble <location,start_offset,end_offset> (disas): show the assembly code being executed
  • p function(arguments): execute a function from GDB; beware of the possible side effects of the function
  • p $newvar = value: declare a new gdb variable, usable locally or in command sequences
  • define <command_name>: define a new command sequence, which can then be invoked directly from GDB

remote debugging

▶ In a non-embedded environment, gdb can be used directly as the debugging front-end

gdb has direct access to the binaries and libraries compiled with debug symbols

▶ However, in an embedded context, the target platform is usually too limited to run gdb directly

▶ Remote debugging is used instead

  • ARCH-linux-gdb runs on the development workstation and provides the user interface
  • gdbserver runs on the target system (only about 400 KB on arm)
image

Remote debugging: architecture

image

Remote debugging: target configuration

▶ On the target, start the program through gdbserver; the program does not start executing immediately:

gdbserver :<port> <executable> <args>
gdbserver /dev/ttyS0 <executable> <args>

▶ Alternatively, gdbserver can attach to an already running program:

gdbserver --attach :<port> <pid>

▶ Or start gdbserver without specifying the program to execute (it is then set up on the host side):

gdbserver --multi :<port>

Remote debugging: host configuration

▶ On the host side, start ARCH-linux-gdb <executable> and use the following gdb commands:

  • Tell gdb where the shared libraries are: gdb> set sysroot <library-path>

  • Connect to the target:

    gdb> target remote <ip-addr>:<port> (networking)
    gdb> target remote /dev/ttyUSB0 (serial link)
    

    If gdbserver was started with the --multi option, use target extended-remote instead of target remote

  • If the program to debug was not specified on the gdbserver command line, run:

    gdb> set remote exec-file <path_to_program_on_target>
    

Coredumps

▶ When a program crashes with a segmentation fault and was not running under a debugger, no live debugging is possible

▶ Fortunately, Linux can generate a core file: an ELF image containing the memory of the program at the time of the crash. gdb can use the core file to analyze the state of the crashed program

▶ On the target

  • Run ulimit -c unlimited before starting the application so that a core file is generated on crash
  • The name of the output coredump file can be changed via /proc/sys/kernel/core_pattern (man 5 core)
  • On systems using systemd, the default coredump handling differs; plain core files can be temporarily enabled with echo core > /proc/sys/kernel/core_pattern

▶ On the host side

  • After the crash, transfer the core file from the target to the host, then run ARCH-linux-gdb -c core-file application-binary

minicoredumper

▶ For complex programs, coredumps may be large

minicoredumper is a user-space tool based on the standard core dump feature

▶ It relies on the ability to pipe the core dump output to a user-space program

▶ It is configured through a JSON file:

  • Only the relevant sections (stack, heap, selected ELF sections) are saved

  • Output files can be compressed

  • Additional information from /proc can be saved

/diamon/minicoredumper

▶ "Efficient and Practical Crash Data Acquisition for Embedded Systems"

  • Video:/watch?v=q2zmwrgLJGs
  • Slides:/images/8/81/Eoss2023_ogness_minicoredumper.pdf

GDB: going further

▶ Tutorial: Debugging Embedded Devices using GDB - Chris Simmonds, 2020

  • Slides: /images/0/01/
  • Video: /watch?v=JGhAgd2a_Ck

GDB Python Extension

▶ GDB provides a Python integration feature, which allows scripting some debugging operations

▶ When Python is executed from within GDB, a module named gdb is available, containing all the GDB-related classes

▶ New commands, breakpoints, and printers can be added

▶ You can fully control and observe the debugged program through the GDB capability in Python scripts.

  • Control execution, add breakpoints, watchpoints, etc.
  • Access to program memory, frames, symbols, etc.

GDB Python Extension

class PrintOpenFD(gdb.FinishBreakpoint):
  def __init__(self, file):
    self.file = file
    super(PrintOpenFD, self).__init__()

  def stop(self):
    print("---> File " + self.file + " opened with fd " + str(self.return_value))
    return False

class PrintOpen(gdb.Breakpoint):
  def stop(self):
    PrintOpenFD(gdb.parse_and_eval("file").string())
    return False

class TraceFDs(gdb.Command):
  def __init__(self):
    super(TraceFDs, self).__init__("tracefds", gdb.COMMAND_USER)

  def invoke(self, arg, from_tty):
    print("Hooking open() with custom breakpoint")
    PrintOpen("open")

TraceFDs()

▶ Python scripts are loaded with the gdb source command

  • If the script is named <program>-gdb.py, it is loaded automatically by GDB:
(gdb) source trace_fds.py
(gdb) tracefds
Hooking open() with custom breakpoint
Breakpoint 1 at 0x33e0
(gdb) run
Starting program: /usr/bin/touch foo bar
Temporary breakpoint 2 at 0x5555555587da
---> File foo opened with fd 3
Temporary breakpoint 3 at 0x5555555587da
---> File bar opened with fd 0

Common debugging issues

▶ Several problems may be encountered while debugging: bad address-to-symbol conversion, "optimized out" values or functions, empty call stacks, etc.

▶ Here is a checklist to help solve such issues:

  • Make sure the binary is built with debug symbols: with gcc, use -g; with gdb, use the non-stripped version of the binary
  • If possible, disable optimizations in the final binary or use a less intrusive level (-Og)
    • For example, depending on the optimization level, static functions can be folded into their caller and disappear from the call stack
  • Prevent the compiler from reusing the frame pointer register: with GCC, use -fno-omit-frame-pointer
    • Not just for debugging: many profiling/tracing tools rely on the call stack as well!

▶ Your application may use many libraries: these settings need to be applied to all the components used.

Application Tracing

strace

The system call tracer

▶ Available for all GNU/Linux systems, the tool can be built with a cross-compilation tool chain or build system

▶ It shows what a program is doing: accessing files, allocating memory, etc.; often enough to find simple problems

▶ Usage:

  • strace <command>: start a new process and trace it
  • strace -f <command>: also trace child processes
  • strace -p <pid>: trace an already running process
  • strace -c <command>: show statistics for each system call
  • strace -e <expr> <command>: use an advanced filtering expression

For more information, check the strace manual

strace example output

> strace cat Makefile
[...]
fstat64(3, {st_mode=S_IFREG|0644, st_size=111585, ...}) = 0
mmap2(NULL, 111585, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f69000
close(3) = 0
access("/etc/", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/cmov/.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320h\1\0004\0\0\0\344"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1442180, ...}) = 0
mmap2(NULL, 1451632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e06000
mprotect(0xb7f62000, 4096, PROT_NONE) = 0
mmap2(0xb7f66000, 9840, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7f66000
close(3) = 0
[...]
openat(AT_FDCWD, "Makefile", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=173, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7290d28000
read(3, "ifneq ($(KERNELRELEASE),)\nobj-m "..., 131072) = 173
write(1, "ifneq ($(KERNELRELEASE),)\nobj-m "..., 173ifneq ($(KERNELRELEASE),)

strace -c example output

> strace -c cheese
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
 36.24 0.523807 19 27017 poll
 28.63 0.413833 5 75287 115 ioctl
 25.83 0.373267 6 63092 57321 recvmsg
 3.03 0.043807 8 5527 writev
 2.69 0.038865 10 3712 read
 2.14 0.030927 3 10807 getpid
 0.28 0.003977 1 3341 34 futex
 0.21 0.002991 3 1030 269 openat
 0.20 0.002889 2 1619 975 stat
 0.18 0.002534 4 568 mmap
 0.13 0.001851 5 356 mprotect
 0.10 0.001512 2 784 close
 0.08 0.001171 3 461 315 access
 0.07 0.001036 2 538 fstat
...

ltrace

ltrace traces the shared library calls made by a program and all the signals it receives

▶ A good complement to strace, which only shows system calls

▶ Works even without the sources of the libraries

▶ Library calls can be filtered with a regular expression or a list of function names

▶ The -S option also displays system calls

▶ The -c option displays a summary

▶ See the ltrace manual

▶ More effective when used with glibc

For more information, check out /wiki/Ltrace

ltrace example output

# ltrace ffmpeg -f video4linux2 -video_size 544x288 -input_format mjpeg -i /dev
/video0 -pix_fmt rgb565le -f fbdev /dev/fb0
__libc_start_main([ "ffmpeg", "-f", "video4linux2", "-video_size"... ] <unfinished ...>
setvbuf(0xb6a0ec80, nil, 2, 0) = 0
av_log_set_flags(1, 0, 1, 0) = 1
strchr("f", ':') = nil
strlen("f") = 1
strncmp("f", "L", 1) = 26
strncmp("f", "h", 1) = -2
strncmp("f", "?", 1) = 39
strncmp("f", "help", 1) = -2
strncmp("f", "-help", 1) = 57
strncmp("f", "version", 1) = -16
strncmp("f", "buildconf", 1) = 4
strncmp("f", "formats", 1) = 0
strlen("formats") = 7
strncmp("f", "muxers", 1) = -7
strncmp("f", "demuxers", 1) = 2
strncmp("f", "devices", 1) = 2
strncmp("f", "codecs", 1) = 3
...

ltrace summary

Using the -c option:

% time seconds usecs/call calls function
------ ----------- ----------- --------- --------------------
52.64 5.958660 5958660 1 __libc_start_main
20.64 2.336331 2336331 1 avformat_find_stream_info
14.87 1.682895 421 3995 strncmp
7.17 0.811210 811210 1 avformat_open_input
0.75 0.085290 584 146 av_freep
0.49 0.055150 434 127 strlen
0.29 0.033008 660 50 av_log
0.22 0.025090 464 54 strcmp
0.20 0.022836 22836 1 avformat_close_input
0.16 0.017788 635 28 av_dict_free
0.15 0.016819 646 26 av_dict_get
0.15 0.016753 440 38 strchr
0.13 0.014536 581 25 memset
...
------ ----------- ----------- --------- --------------------
100.00 11.318773 4762 total

LD_PRELOAD

Shared libraries

▶ Most shared libraries are ELF files whose name ends in .so

  • They are loaded at startup by the dynamic loader
  • or loaded at runtime with dlopen()

▶ When a program (ELF file) is started, the kernel parses the file and loads the corresponding interpreter

  • In most cases, the PT_INTERP program header of the ELF file is set to the dynamic loader

▶ During loading, the dynamic loader resolves all the symbols referenced from the dynamic libraries

▶ Dynamic libraries are loaded only once by the OS and then mapped into all the applications that use them

  • This reduces the memory used by libraries

Hooking Library Calls

▶ More complex library call hooks can be performed with the LD_PRELOAD environment variable

LD_PRELOAD specifies a shared library to be loaded by the dynamic loader before all the other libraries

▶ Library calls can be intercepted by preloading another library

  • Library symbols are overridden by symbols with the same name
  • Only the few symbols of interest need to be redefined
  • The "real" symbols can still be loaded with dlsym (man 3 dlsym) (see the sketch at the end of this section)

▶ Debugging/tracing libraries (libSegFault, libefence) use this mechanism

▶ Works for both C and C++

LD_PRELOAD example

▶ Define a library that overrides the desired symbol (here read()):

#include <string.h>
#include <unistd.h>

ssize_t read(int fd, void *data, size_t size) {
	memset(data, 0x42, size);
	return size;
}

▶ Compile the library:

$ gcc -shared -fPIC -o my_lib.so my_lib.c

▶ Preload the new library with LD_PRELOAD:

$ LD_PRELOAD=./my_lib.so ./exe
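
To keep calling the original function from such a wrapper, the "real" symbol can be looked up with dlsym(RTLD_NEXT, ...). A sketch (trace_read.c is a hypothetical name; build and preload it like the library above):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

typedef ssize_t (*read_fn_t)(int, void *, size_t);

ssize_t read(int fd, void *data, size_t size)
{
	/* Look up the next "read" in the library search order (the libc one) */
	read_fn_t real_read = (read_fn_t)dlsym(RTLD_NEXT, "read");

	fprintf(stderr, "read(fd=%d, size=%zu)\n", fd, size);
	return real_read(fd, data, size);
}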

uprobes and perf

uprobes

uprobe is a mechanism provided by the kernel to trace user-space code

▶ Tracepoints can be added dynamically to any user-space symbol

  • The kernel tracing system places breakpoints in the .text section

▶ Tracepoints are exposed via /sys/kernel/debug/tracing/uprobe_events

▶ It is usually used through higher-level tools such as perf, bcc, etc.

▶ See trace/uprobetracer

The perf tool

The perf tool can profile an application using performance counters (man 1 perf)

▶ It can also manage tracepoints, kprobes, and uprobes

perf can profile both user space and kernel space

perf is based on the perf_event interface exposed by the kernel

▶ It provides a set of operations, each with specific parameters

  • stat, record, report, top, annotate, ftrace, list, probe, etc.

Using perf record

perf record can profile per thread, per process, or per CPU

▶ It only requires the CONFIG_PERF_EVENTS=y kernel configuration option

▶ Data is collected while the program executes and written to an output file

▶ The file can then be analyzed with perf annotate and perf report

  • Data from an embedded system can be analyzed on another computer

Probing userspace functions

▶ List functions that can be probed in a specific executable file

$ perf probe --source=<source_dir> -x my_app -F

▶ List the lines that can be probed in a specific executable/function

$ perf probe --source=<source_dir> -x my_app -L my_func

▶ Create uprobes on functions of user-space libraries/executables

$ perf probe -x /lib/.6 printf
$ perf probe -x app my_func:3 my_var
$ perf probe -x app my_func%return ret=%r0

▶ Record the execution of these tracepoints:

$ perf record -e probe_app:my_func -e probe_libc:printf

Memory issues

Usual Memory Issues

▶ Programs almost always need to access memory

▶ A lot of errors can occur when memory is not handled properly:

  • Segmentation faults when accessing invalid memory (NULL pointers or already-freed memory)
  • Buffer overflows when accessing memory beyond the end of a buffer
  • Memory leaks when allocated memory is never freed

Segmentation Faults

▶ The kernel raises a segmentation fault when a program tries to access a memory area it is not allowed to access, or accesses it in an incorrect way:

  • If you write to a read-only memory area
  • Trying to execute a piece of memory that can't be executed
int *ptr = NULL;
*ptr = 1;

▶ When a segmentation fault occurs, the terminal displays Segmentation fault

$ ./program
Segmentation fault

Buffer Overflows

▶ A buffer overflow occurs when accessing a buffer outside its bounds

▶ Depending on the access, it may or may not crash the program:

  • Writing past the end of a malloc()'ed buffer usually overwrites malloc's internal data structures and leads to a crash
  • Writing past the end of an array allocated on the stack corrupts the stack
  • Reading past the end of a buffer does not always trigger a segmentation fault, depending on the memory area that is accessed
uint32_t *array = malloc(10 * sizeof(*array));
array[10] = 0xDEADBEEF;

Memory Leaks

▶ A memory leak does not crash the program immediately (though it eventually may), but it consumes system memory over time

▶ It happens when a program allocates memory and forgets to free it

▶ It may run for a long time in production before the problem is noticed

  • It is best to identify such issues early in the development phase
void func1(void) {
	uint32_t *array = malloc(10 * sizeof(*array));
	do_something_with_array(array);
}

Valgrind memcheck

Valgrind

Valgrind is a framework for building dynamic analysis tools

Valgrind also provides tools built on this framework for memory error detection, heap profiling, and other kinds of profiling

▶ Supports all popular platforms: Linux on x86, x86_64, arm (armv7 only), arm64, mips32, s390, ppc32 and ppc64

▶ The program runs on Valgrind's virtual CPU core, which significantly slows down execution; this is acceptable for debugging and analysis

Memcheck is the default valgrind tool; it detects memory management errors:

  • Access to invalid memory areas, use of uninitialized values, memory leaks, incorrect freeing of heap blocks, etc.
  • It can run on any application without recompilation
$ valgrind --tool=memcheck --leak-check=full <program>

Valgrind Memcheck usage and report

$ valgrind ./mem_leak
==202104== Memcheck, a memory error detector
==202104== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==202104== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==202104== Command: ./mem_leak
==202104==
==202104== Conditional jump or move depends on uninitialised value(s)
==202104== at 0x109161: do_actual_jump (in /home/user/mem_leak)
==202104== by 0x109187: compute_address (in /home/user/mem_leak)
==202104== by 0x1091A2: do_jump (in /home/user/mem_leak)
==202104== by 0x1091D7: main (in /home/user/mem_leak)
==202104==
==202104== HEAP SUMMARY:
==202104== in use at exit: 120 bytes in 1 blocks
==202104== total heap usage: 1 allocs, 0 frees, 120 bytes allocated
==202104==
==202104== LEAK SUMMARY:
==202104== definitely lost: 120 bytes in 1 blocks
==202104== indirectly lost: 0 bytes in 0 blocks
==202104== possibly lost: 0 bytes in 0 blocks
==202104== still reachable: 0 bytes in 0 blocks
==202104== suppressed: 0 bytes in 0 blocks
==202104== Rerun with --leak-check=full to see details of leaked memory

Valgrind and VGDB

▶ Valgrind can also act as a GDB server that accepts commands. A gdb client interacts with the valgrind gdb server through vgdb. vgdb can be used in several ways:

  • As a standalone CLI program that sends "monitor" commands to valgrind
  • As a relay between a gdb client and an existing valgrind session
  • As a server for multiple valgrind sessions accessed by remote gdb clients

▶ See man 1 vgdb for more details

Using GDB with Memcheck

valgrind allows attaching GDB to the process being analyzed

$ valgrind --tool=memcheck --leak-check=full --vgdb=yes --vgdb-error=0 ./mem_leak

▶ Then attach gdb to the valgrind gdbserver through vgdb

$ gdb ./mem_leak
(gdb) target remote | vgdb

▶ When valgrind detects an error, it stops execution and hands control to GDB

(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000109161 in do_actual_jump (p=0x4a52040) at mem_leak.c:5
5 if (p[1])
(gdb) bt
#0 0x0000000000109161 in do_actual_jump (p=0x4a52040) at mem_leak.c:5
#1 0x0000000000109188 in compute_address (p=0x4a52040) at mem_leak.c:11
#2 0x00000000001091a3 in do_jump (p=0x4a52040) at mem_leak.c:16
#3 0x00000000001091d8 in main () at mem_leak.c:27

Electric Fence

libefence

libefence is more lightweight than valgrind, but also less precise

▶ It catches two common types of memory errors:

  • Buffer overflows and use of freed memory

libefence triggers a segmentation fault as soon as the first error is encountered, generating a coredump

▶ The libefence shared library can be linked statically or preloaded with LD_PRELOAD

$ gcc -g  -o program
$ LD_PRELOAD=.0.0 ./program
Electric Fence 2.2 Copyright (C) 1987-1999 Bruce Perens <bruce@>
Segmentation fault (core dumped)

▶ A coredump is then generated in the current directory (provided coredumps are enabled)

▶ GDB can then open this coredump and locate the faulty access

$ gdb ./program core-program-3485
Reading symbols from ./libefence...
[New LWP 57462]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./libefence'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 main () at :8
8 data[99] = 1;
(gdb)

Application Profiling

Profiling

▶ Profiling consists of collecting data while a program runs in order to analyze it, optimize it, or fix problems

▶ Profiling can be done by manually instrumenting the code or by using kernel/userspace mechanisms

  • Profile function calls and call counts to optimize performance
  • Profile processor usage to optimize performance and reduce power consumption
  • Profile memory usage to optimize the amount of memory used

▶ The profiling data then needs to be analyzed to identify potential improvements

Performance issues

▶ profiling is often used to identify and fix performance problems

▶ Performance can be affected by memory usage, IOs load, or CPU usage, etc.

▶ It is desirable to capture profiling data before fixing performance issues

▶ A first coarse-grained profiling pass is usually done with standard tools

▶ Once the problem type has been identified, fine-grained profiling can be performed

Profiling metrics

▶ Profile metrics can be collected with a variety of tools.

▶ Use Massif, heaptrack, or memusage to profile memory usage

▶ Use perf and callgrind to profile function calls

▶ Use perf to profile CPU hardware usage (caches, MMU, etc.)

▶ The profiling data can contain both user-space application and kernel data

Visualizing data with flamegraphs

▶ Stack-based visualization

▶ You can quickly find performance bottlenecks and navigate call stacks

▶ Brendan Gregg's tools, which made flamegraphs popular, can generate them from perf output

  • Scripts to generate flamegraphs: /brendangregg/FlameGraph
image

Going further with Flamegraphs

▶ For more details, see the following technical presentation by Brendan Gregg on using various metrics in flamegraphs:

  • Video
  • Slides

Memory profiling

▶ profiling the application's memory usage (heap/stack) helps optimize performance

▶ Allocating too much memory may cause the system to run out of memory

▶ Frequent allocations/frees can cause the kernel to spend significant time in clear_page()

  • The kernel must zero memory pages before handing them to a process, to avoid leaking data

▶ Reducing an application's memory footprint can improve cache usage and reduce page faults

Massif usage

Massif is a valgrind tool that profiles heap usage during program execution (userspace only)

▶ It works by taking snapshots of the allocations:

$ valgrind --tool=massif --time-unit=B program

▶ Once the program exits, it generates an output file suffixed with the PID

▶ The ms_print tool then displays a graph of the heap allocations:

$ ms_print .275099

▶ #: snapshot with peak memory usage

▶ @: detailed snapshot (the frequency can be tuned with --detailed-freq)

Massif report
image
massif-visualizer - Visualizing massif profiling data
image

heaptrack usage

heaptrack is a heap memory profiler

  • It relies on an LD_PRELOAD'ed library

▶ It has better tracking and visualization capabilities than Massif

  • Each allocation is associated with a stack trace
  • Memory leaks, allocation hotspots, and temporary allocations can be detected

▶ Results can be viewed with a GUI (heaptrack_gui) or a CLI tool (heaptrack_print)

/KDE/heaptrack

$ heaptrack program

▶ It generates a heaptrack.<process_name>.<pid>.zst file, which can be analyzed with heaptrack_gui on another computer

heaptrack_gui - Visualizing heaptrack profiling data

image

heaptrack_gui - Flamegraph view

image

memusage

memusage is a program that profiles the memory usage of another program (man 1 memusage) (userspace only)

▶ It can profile heap, stack, and mmap memory usage

▶ Profiling information can be displayed in the terminal or output to a file as a PNG image

▶ Compared to valgrind's Massif, it is more lightweight (it relies on the LD_PRELOAD mechanism)

image

memusage usage

$ memusage convert  
Memory usage summary: heap total: 2635857, heap peak: 2250856, stack peak: 83696
         total calls total memory failed calls
 malloc|       1496      2623648 0
realloc|          6         3744 0 (nomove:0, dec:0, free:0)
 calloc|         16         8465 0
   free|       1480      2521334
Histogram for block sizes:
     0-15           329 21% ==================================================
     16-31          239 15% ====================================
     32-47          287 18% ===========================================
     48-63          321 21% ================================================
     64-79           43  2% ======
     80-95          141  9% =====================
...
21424-21439 1 <1%
32768-32783 1 <1%
32816-32831 1 <1%
large       3 <1%

Execution profiling

▶ In order to optimize a program, it is necessary to understand which hardware resources are used by the program

▶ Many hardware elements may affect program operation

  • If the application does not take memory locality into account, CPU cache performance degrades and cache misses increase
  • Alignment faults occur when performing unaligned accesses

Using perf stat

perf statAn application can be profiled by capturing performance counters

  • Using performance counters may requirerootpermissions, which can be accessed through the# echo -1 > /proc/sys/kernel/perf_event_paranoidmodifications

▶ The number of performance counters on the hardware is usually limited

▶ Requesting more events than available counters leads to multiplexing, and perf then scales the results

▶ Counters collected under multiplexing are estimated rather than exact:

  • To obtain more accurate values, reduce the number of events and run perf several times, changing the set of observed events on each run (see the sketch after this list)
  • See the perf wiki for more details
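
As a rough illustration of that advice (the event names and target program are placeholders), restrict the event set with -e and repeat the run with -r so perf averages the results:

$ perf stat -e cycles,instructions -r 5 ./my_program
$ perf stat -e cache-references,cache-misses -r 5 ./my_program
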
perf stat example
$ perf stat convert  
Performance counter stats for 'convert  ':

        45,52  msec  task-clock               # 1,333 CPUs utilized
            4        context-switches         # 87,874 /sec
            0        cpu-migrations           # 0,000 /sec
        1 672        page-faults              # 36,731 K/sec
  146 154 800        cycles                   # 3,211 GHz                     (81,16%)
    6 984 741        stalled-cycles-frontend  # 4,78% frontend cycles idle    (91,21%)
   81 002 469        stalled-cycles-backend   # 55,42% backend cycles idle    (91,36%)
  222 687 505        instructions             # 1,52 insn per cycle
                                              # 0,36 stalled cycles per insn  (91,21%)
   37 776 174        branches                 # 829,884 M/sec                 (74,51%)
      567 408        branch-misses            # 1,50% of all branches         (70,62%)
      
  0,034156819   seconds time elapsed
  0,041509000   seconds user
  0,004612000   seconds sys

▶ Note: The percentage at the end is the kernel's calculation of the duration of the event in the case of multiplexing

▶ List all events:

$ perf list
List of pre-defined events (to be used in -e):

branch-instructions OR branches           [Hardware event]
branch-misses                             [Hardware event]
cache-misses                              [Hardware event]
cache-references                          [Hardware event]
...

▶ Count the L1-dcache-load-misses and branch-load-misses events for a specific command:

$ perf stat -e L1-dcache-load-misses,branch-load-misses cat /etc/fstab
...
Performance counter stats for 'cat /etc/fstab':

23 418         L1-dcache-load-misses
 7 192         branch-load-misses
...

Cachegrind

Cachegrind is a tool provided by valgrind that profiles how an application uses the instruction and data cache hierarchy

  • Cachegrind can also profile branch prediction success

▶ It simulates a machine with independent first-level I$ and D$ caches backed by a unified L2 cache

▶ Very useful for detecting cache usage problems (too many misses, etc.)

$ valgrind --tool=cachegrind --cache-sim=yes ./my_program

▶ It generates a cachegrind.out.<pid> file containing the measurements

cg_annotate is a CLI tool that presents Cachegrind simulation results

▶ Its --diff option compares two result files.

cachegrind's cache simulation has some accuracy limitations; see Cachegrind accuracy

Kcachegrind - Visualizing Cachegrind profiling data
image

Callgrind

Callgrind is a tool provided by valgrind that profiles call graphs (userspace only)

▶ It records the number of instructions executed and associates this data with the corresponding source lines.

▶ It also records the call relationships between functions and the number of calls:

$ valgrind --tool=callgrind ./my_program

callgrind_annotate is a CLI tool that displays callgrind results

Kcachegrind can also display callgrind results

Kcachegrind - Visualizing Callgrind profiling data
image

System-wide Profiling & Tracing

▶ The root cause of a problem is not always in the application itself; it may involve multiple layers (driver, application, kernel)

▶ In this case, the entire stack needs to be analyzed

▶ The kernel provides a large number of tracepoints that can be logged by specific tools.

▶ New tracepoints can be created statically or dynamically by various mechanisms (e.g., kprobes)

Kprobes

▶ Kprobes can dynamically insert breakpoints at virtually any kernel address and extract debugging and performance information

▶ It works by patching the kernel text code to insert calls to specific handlers

  • kprobes run a specific handler when the hooked instruction (the one being debugged) is executed
  • kretprobes trigger when the probed function returns, giving access to its return value as well as the call parameters

▶ The kernel option CONFIG_KPROBES=y must be enabled

▶ Since probes are usually inserted from a kernel module, CONFIG_MODULES=y and CONFIG_MODULE_UNLOAD=y must also be enabled to register/unregister them

▶ Hooking a probe via the symbol_name field requires the CONFIG_KALLSYMS_ALL=y option

▶ See trace/kprobes for more details

Registering a Kprobe

▶ A kprobe can be registered dynamically from a module by filling a struct kprobe and calling register_kprobe()

▶ The probe must be unregistered with unregister_kprobe() when the module exits (a complete module sketch follows the snippet below):

struct kprobe probe = {
  .symbol_name = "do_exit",
  .pre_handler = probe_pre,
  .post_handler = probe_post,
};

register_kprobe(&probe);
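
For illustration only (not part of the original material), a minimal module wrapping this registration could look like the sketch below; probe_pre/probe_post follow the standard kprobe handler prototypes:

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>

/* Called right before the probed instruction is executed */
static int probe_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("do_exit hit, PC = %lx\n", instruction_pointer(regs));
    return 0;
}

/* Called after the probed instruction has been single-stepped */
static void probe_post(struct kprobe *p, struct pt_regs *regs,
                       unsigned long flags)
{
    pr_info("do_exit post handler\n");
}

static struct kprobe probe = {
    .symbol_name = "do_exit",
    .pre_handler = probe_pre,
    .post_handler = probe_post,
};

static int __init probe_module_init(void)
{
    return register_kprobe(&probe);
}

static void __exit probe_module_exit(void)
{
    unregister_kprobe(&probe);
}

module_init(probe_module_init);
module_exit(probe_module_exit);
MODULE_LICENSE("GPL");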

Registering a kretprobe

▶ A kretprobe is registered like a regular probe, except that a struct kretprobe is registered with register_kretprobe()

  • The provided handlers are called on function entry and exit
  • The probe must be unregistered with unregister_kretprobe() on module exit
int (*kretprobe_handler_t) (struct kretprobe_instance *, struct pt_regs *);
struct kretprobe probe = {
  .kp.symbol_name = "do_fork",
  .entry_handler = probe_entry,
  .handler = probe_exit,
};

register_kretprobe(&probe);

perf

▶ perf can do much more than counting events: it can also record traces

▶ The kernel already provides many events and tracepoints, which can be listed with perf list

▶ Syscall tracepoints require CONFIG_FTRACE_SYSCALLS to be enabled

▶ Even without debug information, new tracepoints can be created dynamically on any symbol and register

▶ Tracing variables and function parameters by name requires the kernel to be built with CONFIG_DEBUG_INFO

▶ If perf cannot find vmlinux, provide it with the -k <vmlinux> option

perf example

▶ Display all events matching syscalls:*:

$ perf list syscalls:*
List of pre-defined events (to be used in -e):

  syscalls:sys_enter_accept [Tracepoint event]
  syscalls:sys_enter_accept4 [Tracepoint event]
  syscalls:sys_enter_access [Tracepoint event]
  syscalls:sys_enter_adjtimex_time32 [Tracepoint event]
  syscalls:sys_enter_bind [Tracepoint event]
...

▶ Record the syscalls:sys_enter_read events generated by the sha256sum command:

$ perf record -e syscalls:sys_enter_read sha256sum /bin/busybox
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB  (215 samples) ]

perf report example

▶ Display the collected samples according to the time spent:

$ perf report
Samples: 591 of event 'cycles', Event count (approx.): 393877062
Overhead Command       Shared Object             Symbol
 22,88%  firefox-esr   [nvidia]                  [k] _nv031568rm
  3,21%  firefox-esr   .2      [.] __minimal_realloc
  2,00%  firefox-esr   .6                 [.] __stpncpy_ssse3
  1,86%  firefox-esr   libglib-2..0.7400.0   [.] g_hash_table_lookup
  1,62%  firefox-esr   .2      [.] _dl_strtoul
  1,56%  firefox-esr   []         [k] clear_page_rep
  1,52%  firefox-esr   .6                 [.] __strncpy_sse2_unaligned
  1,37%  firefox-esr   .2      [.] strncmp
  1,30%  firefox-esr   firefox-esr               [.] malloc
  1,27%  firefox-esr   .6                 [.] __GI___strcasecmp_l_ssse3
  1,23%  firefox-esr   [nvidia]                  [k] _nv013165rm
  1,09%  firefox-esr   [nvidia]                  [k] _nv007298rm
  1,03%  firefox-esr   []         [k] unmap_page_range
  0,91%  firefox-esr   .2      [.] __minimal_free

perf probe

▶ perf probe can create dynamic tracepoints in both kernel and userspace functions

▶ Inserting probes requires a kernel with CONFIG_KPROBES enabled

  • Note: perf must be built with libelf support to create probes

▶ Once created, a new dynamic probe can be used with perf record

▶ vmlinux is often not available on embedded platforms; in that case only symbols and registers can be used

perf probe examples

▶ Lists all kernel symbols that can be probed:

$ perf probe --funcs

▶ Create a new probe on do_sys_openat2, capturing its filename argument as a string:

$ perf probe --vmlinux=vmlinux_file do_sys_openat2 filename:string
Added new event:
	probe:do_sys_openat2 (on do_sys_openat2 with filename:string)

▶ Run tail and capture the probe event created above:

$ perf record -e probe:do_sys_openat2 tail /var/log/messages
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB  (19 samples) ]

▶ Display the recorded tracepoints with perf script:

$ perf script
tail 164 [000] 3552.956573: probe:do_sys_openat2: (c02c3750) filename_string="/etc/"
tail 164 [000] 3552.956642: probe:do_sys_openat2: (c02c3750) filename_string="/lib/tls/v7l/neon/vfp/.2"
...

▶ Create a new probe on the return of ksys_read, capturing the r0 register (ARM) as the return value (named ret):

$ perf probe ksys_read%return ret=%r0

▶ Run sha256sum and capture the probe events created above:

$ perf record -e probe:ksys_read__return sha256sum /etc/fstab

▶ Shows all probes created:

$ perf probe -l
probe:ksys_read__return (on ksys_read%return with ret)

▶ Remove an existing tracepoint:

$ perf probe -d probe:ksys_read__return

perf record example

▶ Recording all CPU events (system mode)

$ perf record -a
^C

▶ Display the recorded events using perf script:

$ perf script
...
klogd   85 [000]  208.609712:  116584  cycles:  b6dd551c   memset+0x2c (/lib/.6)
klogd   85 [000]  208.609898:  121267  cycles:  c0a44c84   _raw_spin_unlock_irq+0x34 (vmlinux)
klogd   85 [000]  208.610094:  127434  cycles:  c02f3ef4   kmem_cache_alloc+0xd0 (vmlinux)
 perf   130 [000] 208.610311:  132915  cycles:  c0a44c84   _raw_spin_unlock_irq+0x34 (vmlinux)
 perf   130 [000] 208.619831:  143834  cycles:  c0a44cf4   _raw_spin_unlock_irqrestore+0x3c (vmlinux)
klogd   85 [000]  208.620048:  143834  cycles:  c01a07f8   syslog_print+0x170 (vmlinux)
klogd   85 [000]  208.620241:  126328  cycles:  c0100184   vector_swi+0x44 (vmlinux)
klogd   85 [000]  208.620434:  128451  cycles:  c096f228   unix_dgram_sendmsg+0x46c (vmlinux)
kworker/0:2-mm_ 44 [000] 208.620653: 133104 cycles: c0a44c84 _raw_spin_unlock_irq+0x34 (vmlinux)
 perf   130 [000] 208.620859:  138065  cycles:  c0198460   lock_acquire+0x184 (vmlinux)
...

Using perf trace

perf trace captures and displays all the tracepoints/events triggered while a command executes:

$ perf trace -e "net:*" ping -c 1 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
    0.000 ping/37820 net:net_dev_queue(skbaddr: 0xffff97bbc6a17900, len: 98, name: "enp34s0")
    0.005 ping/37820 net:net_dev_start_xmit(name: "enp34s0",
      skbaddr: 0xffff97bbc6a17900, protocol: 2048, len: 98,
      network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
    0.009 ping/37820 net:net_dev_xmit(skbaddr: 0xffff97bbc6a17900, len: 98,name: "enp34s0")
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.867 ms

Using perf top

perf top analyzes the running kernel in real time

▶ You can sample function calls and sort them by time consumption.

▶ You can PROFILE the entire system:

$ perf top
Samples: 19K of event 'cycles', 4000 Hz, Event count (approx.): 4571734204 lost: 0/0 drop: 0/0
Overhead   Shared Object    Symbol
    2,01%  [nvidia]         [k] _nv023368rm
    0,94%  [kernel]         [k] __static_call_text_end
    0,89%  [vdso]           [.] 0x0000000000000655
    0,81%  [nvidia]         [k] _nv027733rm
    0,79%  [kernel]         [k] clear_page_rep
    0,76%  [kernel]         [k] psi_group_change
    0,70%  [kernel]         [k] check_preemption_disabled
    0,69%  code [.]         0x000000000623108f
    0,60%  code [.]         0x0000000006231083
    0,59%  [kernel]         [k] preempt_count_add
    0,54%  [kernel]         [k] module_get_kallsym
    0,53%  [kernel]         [k] copy_user_generic_string

ftrace and trace-cmd

ftrace

ftrace is the kernel's tracing framework; the name is short for "Function Tracer".

▶ Provides extensive tracing capabilities for observing system behavior

  • It is possible to trace tracepoints (schedulers, interrupts, etc.) that already exist in the kernel
  • relies on GCC's mcount() instrumentation and the kernel code patching mechanism to call the ftrace tracing handlers

▶ All trace data is stored in a ring buffer

▶ The tracefs filesystem is used to control tracing and read the traced events

  • # mount -t tracefs nodev /sys/kernel/tracing.
    

▶ Using ftrace requires the kernel option CONFIG_FTRACE=y

CONFIG_DYNAMIC_FTRACE ensures that tracing has virtually no performance impact when it is not in use

ftrace files

ftrace is controlled through specific files in /sys/kernel/tracing (a short example session follows the list below):

  • current_tracer: the tracer currently in use
  • available_tracers: lists the tracers compiled into the kernel
  • tracing_on: enables/disables tracing
  • trace: displays the trace in a human-readable format; the format depends on the tracer
  • trace_pipe: like trace, but each read consumes the data it returns
  • trace_marker{_raw}: lets user space write markers into the trace buffer, to synchronize user-space events with kernel ones
  • set_ftrace_filter: restricts tracing to specific functions
  • set_graph_function: graphs the sub-functions of the listed functions

▶ Other files control tracing as well; see trace/ftrace

▶ The trace-cmd CLI and the Kernelshark GUI can record and display tracing data
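
A short example session using these files directly (assuming tracefs is already mounted on /sys/kernel/tracing):

# cd /sys/kernel/tracing
# echo function > current_tracer
# echo 1 > tracing_on
# sleep 1
# echo 0 > tracing_on
# head trace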

ftrace tracers

▶ ftrace provides a variety of "tracers".

▶ The tracer to use is selected by writing its name into the current_tracer file

  • nop: does nothing, disables all tracing
  • function: traces all kernel function calls
  • function_graph: like function, but records both function entry and exit
  • hwlat: tracks hardware-induced latencies
  • irqsoff: traces sections executed with interrupts disabled and records the latency
  • branch: traces likely()/unlikely() branch prediction results
  • mmiotrace: traces all MMIO hardware accesses (read[bwlq]/write[bwlq])

▶ Warning: some tracer overheads may be high

# echo "function" > /sys/kernel/tracing/current_tracer

function_graph tracer report example

function_graph traces all functions and reconstructs their call tree

▶ You can display process, CPU, timestamp, and function call graphs

$ trace-cmd report
...
dd-113  [000]  304.526590: funcgraph_entry:                |   sys_write() {
dd-113  [000]  304.526597: funcgraph_entry:                |     ksys_write() {
dd-113  [000]  304.526603: funcgraph_entry:                |       __fdget_pos() {
dd-113  [000]  304.526609: funcgraph_entry:     6.541 us   |         __fget_light();
dd-113  [000]  304.526621: funcgraph_exit:    + 18.500 us  |       }
dd-113  [000]  304.526627: funcgraph_entry:                |       vfs_write() {
dd-113  [000]  304.526634: funcgraph_entry:     6.334 us   |         rw_verify_area();
dd-113  [000]  304.526646: funcgraph_entry:     6.208 us   |         write_null();
dd-113  [000]  304.526658: funcgraph_entry:     6.292 us   |         __fsnotify_parent();
dd-113  [000]  304.526669: funcgraph_exit:    + 43.042 us  |       }
dd-113  [000]  304.526675: funcgraph_exit:    + 78.833 us  |     }
dd-113  [000]  304.526680: funcgraph_exit:    + 91.291 us  |   }
dd-113  [000]  304.526689: funcgraph_entry:                |   sys_read() {
dd-113  [000]  304.526695: funcgraph_entry:                |     ksys_read() {
dd-113  [000]  304.526702: funcgraph_entry:                |       __fdget_pos() {
dd-113  [000]  304.526708: funcgraph_entry:     6.167 us   |         __fget_light();
dd-113  [000]  304.526719: funcgraph_exit:    + 18.083 us  |       }

irqsoff tracer

▶ The ftrace irqsoff tracer tracks latencies caused by interrupts being disabled for too long.

▶ Can help locate problems with high system interruption latency

▶ Requires CONFIG_IRQSOFF_TRACER=y

  • The preemptoff and preemptirqsoff tracers similarly track sections with preemption (and/or interrupts) disabled
image

irqsoff tracer report example

# latency: 276 us, #104/104, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#    -----------------
#    | task: stress-ng-114 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
# => started at: __irq_usr
# => ended at: irq_exit
#
#
#                  _------=> CPU#
#                 /  _-----=> irqs-off
#                 | / _----=> need-resched
#                 || / _---=> hardirq/softirq
#                 ||| / _--=> preempt-depth
#                 |||| /     delay
#    cmd  pid     |||||   time | caller
#      \  /       |||||     \  |   /
stress-n-114      0d...     2us : __irq_usr
stress-n-114      0d...     7us : gic_handle_irq <-__irq_usr
stress-n-114      0d...    10us : __handle_domain_irq <-gic_handle_irq
...
stress-n-114      0d...   270us : __local_bh_disable_ip <-__do_softirq
stress-n-114      .   275us : __do_softirq <-irq_exit
stress-n-114      .   279us+: tracer_hardirqs_on <-irq_exit
stress-n-114      .   290us : <stack trace>

Hardware latency detector

▶ The ftrace hwlat tracer helps determine whether the hardware itself introduces latencies

  • For example, non-maskable System Management Interrupts trigger firmware-handled features that stall the CPU
  • Interrupts handled by a secure monitor can also introduce latencies

▶ If some kind of delay is found using this tracer, it means that the system may not be suitable for real-time use

▶ It works by running a tight loop on a single core with interrupts disabled and measuring the gap between consecutive time reads

▶ Requires CONFIG_HWLAT_TRACER=y (a minimal tracefs session is sketched below)
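
A minimal way to try it through tracefs (assuming tracefs is mounted on /sys/kernel/tracing and CONFIG_HWLAT_TRACER is enabled):

# echo hwlat > /sys/kernel/tracing/current_tracer
# echo 1 > /sys/kernel/tracing/tracing_on
# sleep 60
# cat /sys/kernel/tracing/trace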

image

trace_printk()

▶ trace_printk() outputs strings into the trace buffer

▶ It is handy for tracing specific conditions in your code and seeing them in the trace buffer:

#include <linux/kernel.h>
void read_hw()
{
	if (condition)
		trace_printk("Condition is true!\n");
}

▶ With the function_graph tracer, the trace buffer then shows:

1)             |           read_hw() {
1)             |               /* Condition is true! */
1) 2.657 us    |           }

trace-cmd

trace-cmd is a tool written by Steven Rostedt to interact with ftrace (man 1 trace-cmd)

The tracers supported by trace-cmd are those exposed by ftrace

trace-cmd supports multiple subcommands:

  • list: lists the plugins/events that can be recorded
  • record: records a trace into a trace.dat file
  • report: displays the results stored in trace.dat

▶ At the end of the capture, a trace.dat file is generated

Remote tracing with trace-cmd

The trace-cmd output can be quite large and thus hard to store on embedded platforms with limited storage

▶ To work around this, the listen command can send results over the network:

  • On the remote system that will collect the traces, run trace-cmd listen -p 6578

  • On the target system, use trace-cmd record -N <target_ip>:6578 to point at the remote system collecting the traces

    image

trace-cmd examples

▶ List the available tracers:

$ trace-cmd list -t
blk mmiotrace function_graph function nop

▶ List the available events:

$ trace-cmd list -e
...
migrate:mm_migrate_pages_start
migrate:mm_migrate_pages
tlb:tlb_flush
syscalls:sys_exit_process_vm_writev
...

▶ List the functions that can be filtered for the function and function_graph tracers:

$ trace-cmd list -f
...
wait_for_initramfs
__ftrace_invalid_address___64
calibration_delay_done
calibrate_delay
...

▶ Enable function tracer and log global data on the system:

$ trace-cmd record -p function

▶ Trace the dd command with the function_graph tracer:

$ trace-cmd record -p function_graph dd if=/dev/mmcblk0 of=out bs=512 count=10

▶ Display the recorded data:

$ trace-cmd report

▶ Reset all ftrace buffers and remove tracers:

$ trace-cmd reset

▶ Run the irqsoff tracer system-wide:

$ trace-cmd record -p irqsoff

▶ Record only the system's irq_handler_exit/irq_handler_entry events:

$ trace-cmd record -e irq:irq_handler_exit -e irq:irq_handler_entry

Adding ftrace tracepoints

▶ Customized tracepoints can be added for customization purposes.

▶ The tracepoint must first be declared in a .h header file:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM subsys

#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SUBSYS_H

#include <linux/tracepoint.h>

DECLARE_TRACE(subsys_eventname,
        TP_PROTO(int firstarg, struct task_struct *p),
        TP_ARGS(firstarg, p));

#endif /* _TRACE_SUBSYS_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

▶ The tracepoint is then defined and emitted in C code using that header:

#include <trace/events/subsys.h> /* the header declared above */

#define CREATE_TRACE_POINTS
DEFINE_TRACE(subsys_eventname);

void any_func(void)
{
  ...
  trace_subsys_eventname(arg, task);
  ...
}

▶ For more information, see trace/tracepoints

Kernelshark

▶ Kernelshark is a Qt-based graphical interface that processes reports recorded by trace-cmd

▶ It can also drive trace-cmd to configure and capture data

▶ Use different colors to show recorded CPU and tasks events

▶ Can be used for further analysis of specific bugs

image

LTTng

▶ LTTng is an open source tracing framework for Linux maintained by the EfficiOS company

▶ LTTng provides insight into the interaction between the kernel and the application (C, C++, Java, Python).

  • Applications that are not instrumented can still log through /dev/lttng-logger

▶ Tracepoints will be associated with a payload

▶ LTTng focuses on low overhead tracing

▶ Use the Common Trace Format (so that you can read trace data using software such as babeltrace or trace-compass).

Tracepoints with LTTng

▶ LTTng has a session daemon that receives events generated from the kernel and user-space LTTng tracing components

▶ LTTng can be used to track the following:

  • LTTng kernel tracepoints
  • kprobes and kretprobes
  • Linux kernel system calls
  • Linux userspace probes
  • LTTng tracepoints in user space

Creating userspace tracepoints with LTTng

▶ New user-space tracepoints can be defined using LTTng.

▶ Multiple attributes can be configured for a tracepoint

  • A provider namespace
  • A name to identify tracepoint
  • Various types of parameters (int, char*, etc.)
  • Fields describing how the tracepoint parameters are displayed (decimal, hexadecimal, etc.); see LTTng-ust

▶ Using a UST tracepoint normally requires several steps: write a tracepoint provider (.h), write a tracepoint package (.c), build the package, call the tracepoint in the traced application, and finally build the application, linking it with the lttng-ust library and the package provider.

▶ LTTng provides lttng-gen-tp, which reduces these steps to writing a single template (.tp) file

Defining a LTTng tracepoint

▶ Tracepoint template (hello_world.tp):

LTTNG_UST_TRACEPOINT_EVENT(
  // Tracepoint provider name
  hello_world,
  
  // Tracepoint/event name
  first_tp,
  
  // Tracepoint arguments (input)
  LTTNG_UST_TP_ARGS(
  char *, text
  ),
  
  // Tracepoint/event fields (output)
  LTTNG_UST_TP_FIELDS(
  	lttng_ust_field_string(message, text)
  )
)

▶ lttng-gen-tp will use this template file to generate/build the required files (.h, .c and .o files)

Defining a LTTng tracepoint

▶ Constructing a tracepoint provider:

$ lttng-gen-tp hello_world.tp

▶ Use Tracepoint(hello_world.c)

#include <stdio.h>
#include "hello_world.h"

int main(int argc, char *argv[])
{
    lttng_ust_tracepoint(hello_world, first_tp, "hi there!");
    return 0;
}

▶ Compilation:

$ gcc hello_world.c hello_world.o -llttng-ust -o hello_world

Using LTTng

$ lttng create my-tracing-session --output=./my_traces
$ lttng list --kernel
$ lttng list --userspace
$ lttng enable-event --userspace hello_world:my_first_tracepoint
$ lttng enable-event --kernel --syscall open,close,write
$ lttng start
$ /* Run your application or do something */
$ lttng destroy
$ babeltrace2 ./my_traces

▶ The results can also be displayed with Trace Compass

Remote tracing with LTTng

▶ LTTng can record tracking data over the network

▶ For embedded systems with limited storage only

▶ Run the lttng-relayd command on the remote computer:

$ lttng-relayd --output=${PWD}/traces

▶ Then pass --set-url when creating the session on the target machine:

$ lttng create my-session --set-url=net://remote-system

▶ Traces are then recorded directly on the remote computer

eBPF

The ancestor: Berkeley Packet filter

▶ BPF stands for Berkeley Packet Filter, which was used in the beginning for network message filtering

▶ BPF is used in Linux for socket filtering (see networking/filter)

▶ tcpdump and Wireshark rely heavily on BPF (via libpcap) for message capture

BPF in libpcap: setup

▶ tcpdump can pass the user's message filter string into libpcap

▶ libpcap will convert the capture filter into a binary program

  • The program uses an abstract machine instruction set (BPF instruction set)

▶ libpcap sends the binary program to the kernel via the setsockopt() system call

image

BPF in libpcap: capture

image

▶ The kernel implements a BPF "virtual machine".

▶ The BPF VM executes the BPF program for each message

▶ The program examines the message data and returns a non-zero value if the message needs to be captured

▶ If the return value is non-zero, the message is captured in addition to regular message processing

eBPF

eBPFis a new framework that allows user programs to run safely and efficiently in the kernel. It was introduced in kernel version 3.18 and is still evolving and being updated frequently.

▶ The eBPF program can capture and expose kernel data to user space as well as change kernel behavior based on a number of user-defined rules

▶ eBPF is event-driven: specific kernel events can trigger and execute eBPF programs

▶ One of the main benefits of eBPF is the ability to reprogram kernel behavior without having to develop against the kernel:

  • No kernel crashes due to bugs
  • Faster feature development cycles can be realized

▶ Noteworthy features of eBPF are:

  • A new instruction set, interpreter and verifier
  • A wider range of "attach" locations, so that programs can be hooked almost anywhere in the kernel.
  • Use a specific structure called "maps" to exchange data between multiple eBPF programs or between programs and user space.
  • A dedicated bpf() system call to manipulate eBPF programs and data
  • A large number of kernel helper functions are provided in the eBPF program

eBPF program lifecycle

image

Kernel configuration for eBPF

▶ CONFIG_NET enables the eBPF subsystem

▶ CONFIG_BPF_SYSCALL enables the bpf() system call

▶ CONFIG_BPF_JIT enables JIT compilation of programs for better performance

CONFIG_BPF_JIT_ALWAYS_ON makes the JIT mandatory

CONFIG_BPF_UNPRIV_DEFAULT_OFF=n allows unprivileged (non-root) users to use eBPF during development

▶ You may want to unlock specific hook locations with additional features:

  • CONFIG_KPROBES: programs can be hooked on kprobes
  • CONFIG_TRACING: programs can be hooked on kernel tracepoints
  • CONFIG_NET_CLS_BPF: packet classifiers can be written
  • CONFIG_CGROUP_BPF: programs can be attached to cgroup hooks

eBPF ISA

▶ eBPF is a "virtual" ISA that defines all its instruction sets: load and store instructions, arithmetic instructions, jump instructions, and so on.

▶ It also defines a set of 64-bit registers (R0-R10) and a calling convention:

  • R0: return value of functions and BPF programs
  • R1, R2, R3, R4, R5: function arguments
  • R6, R7, R8, R9: callee-saved registers
  • R10: stack (frame) pointer
; bpf_printk("Hello %s\n", "World");
    0: r1 = 0x0 ll
    2: r2 = 0xa
    3: r3 = 0x0 ll
    5: call 0x6
; return 0;
    6: r0 = 0x0
    7: exit

The eBPF verifier

▶ When loading a program into the kernel, the eBPF verifier checks the validity of the program

▶ The verifier is a complex piece of software that checks an eBPF program against a set of rules to make sure running it cannot compromise the kernel, for example:

  • The program must always terminate: no code path may lead to unbounded execution (e.g. infinite loops)
  • The program must check that pointers are valid before dereferencing them
  • Programs cannot access arbitrary memory addresses; memory must be reached through the context or through valid helpers

▶ Programs that violate the verifier's rules are rejected

▶ In addition to the verifier requirement, extra care must be taken when writing the program. eBPF programs enable preemption (but disable CPU migration), and thus may still suffer from concurrency problems!

  • These problems can be avoided with mechanisms and helpers such as per-CPU map types (see the sketch below)
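
As a rough sketch (not from the original material; the map, section and program names are illustrative), a per-CPU array used as a counter in libbpf-style C sidesteps cross-CPU races because each CPU updates its own copy of the value:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __type(key, __u32);
    __type(value, __u64);
    __uint(max_entries, 1);
} percpu_counter SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int count_writes(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&percpu_counter, &key);

    if (val)
        (*val)++;   /* no race: this slot belongs to the current CPU */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";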

Program types and attach points

▶ eBPF can hook a program in different types of locations:

  • Any kprobe
  • Kernel-defined static tracepoint
  • Specific perf event
  • entire network stack
  • See bpf_attach_type for the full list

▶ A given attach point may only accept certain program types; see bpf_prog_type and bpf/libbpf/program_types

▶ The program type defines the data that is passed into the eBPF program when the program is called, for example:

  • BPF_PROG_TYPE_TRACEPOINT The program receives a structure containing all the data returned to user space by the target tracepoint.
  • A BPF_PROG_TYPE_SCHED_CLS program (used to implement packet classifiers) receives a struct __sk_buff, the kernel representation of a socket buffer.
  • For the context passed to each program type, see include/linux/bpf_types.h

eBPF maps

▶ eBPF can interact with user space or other programs with data through different maps:

  • BPF_MAP_TYPE_ARRAY: Generalized array storage. Can be divided between different CPUs
  • BPF_MAP_TYPE_HASH: a key-value store; keys can be of various types (__u32, a device, an IP address, etc.)
  • BPF_MAP_TYPE_QUEUE: FIFO type queue
  • BPF_MAP_TYPE_CGROUP_STORAGE: a type of hash map that uses the cgroup id as the key, in addition to maps of other object types (inodes, tasks, sockets, etc.)

▶ For simple data, eBPF global variables are a simple and efficient alternative (unlike maps, accessing them does not involve system calls)

The bpf() syscall

▶ The kernel exposes a bpf() system call to interact with the eBPF subsystem

▶ This system call has a set of subcommands and receives specific data based on different subcommands:

  • BPF_PROG_LOAD: Load a bpf program
  • BPF_MAP_CREATE: Allocate maps for use by the program
  • BPF_MAP_LOOKUP_ELEM: lookup table entries in map
  • BPF_MAP_UPDATE_ELEM: Update table entries in map

▶ This system call uses file descriptors pointing to eBPF resources. These resources (programs, maps, links, etc.) will remain valid as long as at least one program holds a valid file descriptor. If no program is using them, these resources will be automatically cleaned up.

▶ See man 2 bpf for more details

Writing eBPF programs

▶ An eBPF program can be written directly in raw eBPF assembly or in a high-level language (e.g., C or rust) and compiled using the clang compiler.

▶ The kernel provides a helper function for the eBPF program:

  • bpf_trace_printk Pass log to trace buffer
  • bpf_map_{lookup,update,delete}_elem Manipulating maps
  • bpf_probe_{read,write}[_user] Securely read/write data from/to kernel or userspace
  • bpf_get_current_pid_tgid Returns the current process ID and thread group ID.
  • bpf_get_current_uid_gid Returns the current user ID and group ID
  • bpf_get_current_comm Returns the name of the executable file in the current task
  • bpf_get_current_task Returns the currentstruct task_struct
  • For more helper functions, seeman 7 bpf-helpers

▶ The kernel also exposes kfuncs (cf.bpf/kfuncs), but in contrast to the bpf helper functions, they are not part of the kernel's stable interface

Manipulating eBPF program

▶ There are several ways to build, load, and manage eBPF programs:

  • Write an eBPF program, build it with clang, then load it, attach it and retrieve its data from a custom user-space program via bpf()
  • It is also possible to use bpftool to manipulate built eBPF programs (load, attach, read maps, etc.) without having to write any user-space tools.
  • Or you can write your own eBPF tool to handle some loads of work through some intermediate libraries such as libbpf
  • It is also possible to use specific frameworks such as BCC or bpftrace

BCC

▶ The BPF Compiler Collection (BCC) is a toolset based on the BPF

▶ BCC provides a large number of ready-to-use BPF-based tools

▶ Also provides a simpler interface for writing, loading and hooking BPF programs than using the "original" BPF language.

▶ Applicable to a large number of platforms (but not ARM32)

  • On Debian, the tools are packaged as <tool>-bpfcc

▶ BCC requires kernel version >=4.1

▶ BCC is evolving quickly, and many distributions have older versions: you may need to compile the latest source code.

BCC tools

image

BCC Tools example

profile is a CPU profiler that samples the stacks currently executing on the CPUs; its output can be converted into a flame graph:

$ git clone https://github.com/brendangregg/FlameGraph
$ profile -df -F 99 10 | ./FlameGraph/flamegraph.pl > flamegraph.svg

tcpconnect shows all new TCP connections:

$ tcpconnect
PID COMM IP SADDR DADDR DPORT
220321 ssh 6 ::1 ::1 22
220321 ssh 4 127.0.0.1 127.0.0.1 22
17676 Chrome_Child 6 2a01:cb15:81e4:8100:37cf:d45b:d87d:d97d 2606:50c0:8003::154 443
[...]

▶ See more at /iovisor/bcc

Using BCC with python

▶ BCC exposes a bcc Python module containing a BPF class

▶ The eBPF program itself is written in C and stored either in an external file or directly in a Python string

▶ When a BPF class instance is created with an eBPF program (from a file or a string), it automatically builds, loads and attaches the program

▶ There are various ways to ATTACH a program:

  • Use the appropriate program name prefix according to the target attach point (this will automatically perform the attach step)
  • By explicitly calling a previously created BPF instance method

Using BCC with python

▶ Hook the clone() system call with a kprobe and print "Hello, World!" each time it is called:

from bcc import BPF

# define BPF program
prog = """
int hello(void *ctx) {
  bpf_trace_printk("Hello, World!\\n");
  return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

libbpf

▶ In addition to using advanced frameworks like BCC, you can use libbpf to build custom tools to better control every aspect of your program!

▶ libbpf is a C-based library that reduces the complexity of eBPF programming by the following features:

  • Userspace API for handling open/load/attach/teardown bpf programs
  • User-space APIs for interacting with attach's programs
  • eBPF APIs that simplify writing eBPF programs

▶ Many distributions and build systems (such as Buildroot) package libbpf

▶ For more see /en/latest/

eBPF programming with libbpf

my_prog.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define TASK_COMM_LEN 16
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(key, __u32);
    __type(value, __u64);
    __uint(max_entries, 1);
} counter_map SEC(".maps");

struct sched_switch_args {
    unsigned long long pad;
    char prev_comm[TASK_COMM_LEN];
    int prev_pid;
    int prev_prio;
    long long prev_state;
    char next_comm[TASK_COMM_LEN];
    int next_pid;
    int next_prio;
};

SEC("tracepoint/sched/sched_switch")
int sched_tracer(struct sched_switch_args *ctx)
{
    __u32 key = 0;
    __u64 *counter;
    char *file;

    char fmt[] = "Old task was %s, new task is %s\n";
    bpf_trace_printk(fmt, sizeof(fmt), ctx->prev_comm, ctx->next_comm);

    counter = bpf_map_lookup_elem(&counter_map, &key);
    if(counter) {
        *counter += 1;
        bpf_map_update_elem(&counter_map, &key, counter, 0);
    }

    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Building eBPF programs

▶ eBPF is written in C and can be constructed as a loadable object via clang:

$ clang -target bpf -O2 -g -c my_prog. -o my_prog.

▶ GCC is also available in recent versions:

  • On Debian/Ubuntu, install the gcc-bpf package to get the toolchain
  • It exposes the bpf-unknown-none target

▶ In order to simplify the operation of libbpf-based programs in user-space programs, we need "skeleton" APIs, which can be generated by bpftool.

bpftool

bpftool is a command-line tool that interacts with bpf object files and the kernel to manage bpf programs:

  • Load the program into the kernel
  • List loaded programs
  • dump program instructions, BPF code or JIT code
  • dump map contents
  • Attach programs to hooks, etc.

▶ The bpf filesystem may need to be mounted in order to pin programs (i.e. keep them loaded after bpftool exits)

$ mount -t bpf none /sys/fs/bpf

▶ List the loaded programs:

$ bpftool prog
348: tracepoint name sched_tracer tag 3051de4551f07909 gpl
loaded_at 2024-08-06T15:43:11+0200 uid 0
xlated 376B jited 215B memlock 4096B map_ids 146,148
btf_id 545

▶ Load and ATTACH a program:

$ mkdir /sys/fs/bpf/myprog
$ bpftool prog loadall trace_execve. /sys/fs/bpf/myprog autoattach

▶ Uninstall a program:

$ rm -rf /sys/fs/bpf/myprog

▶ dump a loaded program:

$ bpftool prog dump xlated id 348
int sched_tracer(struct sched_switch_args * ctx):
; int sched_tracer(struct sched_switch_args *ctx)
  0: (bf) r4 = r1
  1: (b7) r1 = 0
; __u32 key = 0;
	2: (63) *(u32 *)(r10 -4) = r1
; char fmt[] = "Old task was %s, new task is %s\n";
  3: (73) *(u8 *)(r10 -8) = r1
  4: (18) r1 = 0xa7325207369206b
  6: (7b) *(u64 *)(r10 -16) = r1
  7: (18) r1 = 0x7361742077656e20
[...]

▶ dump eBPF program logs:

image

▶ List the created maps:

$ bpftool map
80: array name counter_map flags 0x0
    key 4B value 8B max_entries 1 memlock 256B
    btf_id 421
82: array name .rodata.str1.1 flags 0x80
    key 4B value 33B max_entries 1 memlock 288B
    frozen
96: array name libbpf_global flags 0x0
		key 4B value 32B max_entries 1 memlock 280B
[...] 

▶ Display the contents of a map:

$ sudo bpftool map dump id 80
[{
  "key": 0,
  "value": 4877514 }
]

▶ Generate libbpf APIs to manipulate a program:

$ bpftool gen skeleton trace_execve. name trace_execve > trace_execve.

▶ We can use the high-level API to write our own user-space programs to better operate our eBPF programs:

  • Instantiate a global context object that can be referenced by all programs, maps, links, etc.
  • Load/attact/uninstall programs
  • The eBPF program is embedded directly into the generated header as a byte array.

Userspace code with libbpf

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
#include "trace_sched_switch.skel.h"
int main(int argc, char *argv[])
{
    struct trace_sched_switch *skel;
    int key = 0;
    long counter = 0;

    skel = trace_sched_switch__open_and_load();
    if(!skel)
        exit(EXIT_FAILURE);
    if (trace_sched_switch__attach(skel)) {
        trace_sched_switch__destroy(skel);
        exit(EXIT_FAILURE);
    }

    while(true) {
        bpf_map__lookup_elem(skel->maps.counter_map, &key, sizeof(key), &counter, sizeof(counter), 0);
        fprintf(stderr, "Scheduling switch count: %ld\n", counter);
        sleep(1);
    }

    return 0;
}

eBPF programs portability

▶ In contrast to user-space APIs, stable APIs are not exposed inside the kernel, which means that eBPF programs that can manipulate certain kernel data do not necessarily run on other versions of the kernel

▶ CO-RE (Compile Once - Run Everywhere) is used to solve this problem by making programs portable between different versions of the kernel, and it relies on the following features:

  • The kernel must be built with CONFIG_DEBUG_INFO_BTF=y to embed BTF, a DWARF-like format that compactly encodes data layouts and function signatures
  • The eBPF compiler must be able to emit BTF relocations (recent clang and GCC versions support this, using the -g flag)
  • A BPF loader able to process BTF data and adjust the program's accesses accordingly is needed; libbpf is the de facto standard loader
  • eBPF-side APIs are needed to read/write CO-RE-relocated data; libbpf provides helpers such as bpf_core_read (see the sketch after this list)

▶ See Andrii Nakryiko’s CO-RE guide for more details
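
As an illustrative sketch (assuming a vmlinux.h header generated with bpftool btf dump file /sys/kernel/btf/vmlinux format c), the BPF_CORE_READ helper from bpf_core_read.h relocates field offsets at load time, so the same object can run on different kernel versions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_exit")
int trace_exit(struct pt_regs *ctx)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    int pid;

    /* field offset is resolved against the running kernel's BTF */
    pid = BPF_CORE_READ(task, pid);
    bpf_printk("exiting pid=%d", pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";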

▶ In addition to CO-RE, you may face different restrictions for different kernel versions due to the introduction or change of main kernel features (eBPF subsystem is under continuous and frequent update):

  • eBPF tail calls, which let one program call another, were added in 4.2; mixing them with BPF function calls is only possible since 5.10
  • The eBPF spinlock was added in version 5.1 to prevent concurrent access to shared maps between different CPUs
  • New attach types are introduced regularly, sometimes at different versions on different architectures: for example fentry/fexit attach points appeared in 5.5 on x86 but only in 6.0 on arm32
  • Loops of any kind (even bounded ones) were forbidden before version 5.3
  • CAP_BPF, added in 5.8, allows a task to perform eBPF operations

eBPF for tracing/profiling

▶ eBPF is a very powerful framework for probing the interior of the kernel: with a large number of attach points, it is possible to expose almost any kernel path and code.

▶ At the same time, eBPF programs remain isolated from the kernel code, which makes them safer and simpler to develop than kernel code

▶ Thanks to the in-kernel interpreter and optimizations such as JIT compilation, eBPF is well suited for low-overhead tracing and profiling, even in production environments

▶ This is why eBPF is increasingly used for debugging, tracing and profiling, for example in:

  • tracing frameworks such as BCC and bpftrace
  • network infrastructure components such as Cilium or Calico
  • packet tracing tools such as pwru or dropwatch
  • For more examples, see

eBPF: resources

▶ BCC Tutorial:/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md

▶ libbpf-bootstrap: /libbpf/libbpf-bootstrap

▶ A Beginner’s Guide to eBPF Programming - Liz Rice, 2020

  • Video./watch?v=lrSExTfS-iQ
  • Resources./lizrice/ebpf-beginners

Choosing the right tool

▶ You need to know which type of tool to use before you start a profile or trace.

▶ Tools are usually selected based on the level of the profile

▶ Usually start with application tracing/profiling tools (valgrind, perf, etc.) to analyze/optimize the application level

▶ Then analyze the performance of user space + kernel

▶ Finally, trace or profile the entire system is required if performance problems occur only in a loaded system

  • For "constant" complexity, you can use the snapshot tool
  • For occasional problems, traces can be logged and analyzed

▶ If complex configuration is required before analysis, consider using custom tools: scripts, custom traces, eBPF, etc.

Kernel Debugging

Preventing bugs

Static code analysis

▶ Static code analysis can be performed with the sparse tool

sparse uses annotations to detect errors at compile time:

  • Locking issues (unbalanced locks)
  • Address space issues, such as direct access to user space pointers

▶ Use make C=1 to analyze only the files that are being recompiled

▶ Or use make C=2 to analyze all source files

▶ Unbalanced Lock Example

rzn1_a5psw.c:81:13: warning: context imbalance in 'a5psw_reg_rmw' - wrong count
at exit

Good practices in kernel development

▶ When writing driver code, the user cannot be expected to provide correct values, so it is always necessary to verify these values

▶ Use the WARN_ON() macro to display the call stack when a specific condition is hit

  • During debugging, dump_stack() can also be used to display the current call stack:
static bool check_flags(u32 flags)
{
  if (WARN_ON(flags & STATE_INVALID))
  	return -EINVAL;
  return 0;
}

▶ To check values at compile time (configuration options, sizeof() of structure fields, etc.), use BUILD_BUG_ON() to enforce the condition

BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved));

▶ If you get alerts about unused variables/parameters during compilation, you need to fix these issues

▶ Running checkpatch.pl --strict can also help spot potential problems in the code

Linux Kernel Debugging

▶ There are a variety of Linux kernel feature tools to help simplify kernel debugging

  • Specific logging frameworks
  • Using the standard way of dumping low-level crash messages
  • Multiple runtime checkers to help check for various problems: memory issues, locking issues, undefined behavior, etc.
  • Interactive or after-the-fact debugging

▶ These features must be enabled explicitly in menuconfig; they live under the Kernel hacking -> Kernel debugging configuration entries

  • CONFIG_DEBUG_KERNEL must be set to "y" to enable the other debugging options

Debugging using messages

There are 3 available APIs:

▶ The old printk() is not recommended for new debug messages

pr_*() family functions: pr_emerg(), pr_alert(), pr_crit(), pr_err(), pr_warn(), pr_notice(), pr_info(), pr_cont(), and the special pr_debug() (see later)

  • Defined in include/linux/printk.h
  • Use a classically formatted string as a parameter, such aspr_info("Booting CPU %d\n", cpu);
  • Here is the output kernel log:[ 202.350064] Booting CPU 1

print_hex_dump_debug(): Dump the buffer contents using a hexdump-like format

dev_*() family functions: dev_emerg(), dev_alert(), dev_crit(), dev_err(), dev_warn(), dev_notice(), dev_info(), and the special dev_dbg() (see below):

  • They take a pointer to a struct device as first parameter, followed by a format string

  • Defined in include/linux/dev_printk.h

  • Can be used in drivers integrated with Linux device modules

  • Usage:dev_info(&pdev->dev, "in probe\n");

  • Kernel output:

    [ 25.878382] serial : in probe
    [ 25.884873] serial : in probe
    

The *_ratelimited() variants limit the amount of output when called at a high rate, based on the /proc/sys/kernel/printk_ratelimit{_burst} values

▶ Compared to the standardprintf(), the kernel defines more format descriptors:

  • %p: defaults to displaying the hash of the pointer
  • %px: always real pointer address (for insensitive addresses)
  • %pK: displays either a hashed pointer value, 0, or the real address depending on the kptr_restrict sysctl
  • %pOF: Device Tree Node Format Descriptor
  • %pr: Resource structure format descriptors
  • %pa: Display physical address (all 32/64 bits supported)
  • %pe: Error pointer (show the string corresponding to the corresponding error value)

▶ When using %pK, /proc/sys/kernel/kptr_restrict should be set to 1

▶ For all supported format specifiers, see core-api/printk-formats

pr_debug() and dev_dbg()

▶ When the driver is compiled with DEBUG defined, all these messages are compiled in and printed at the debug level. DEBUG can be defined with #define DEBUG at the top of the driver, or with ccflags-$(CONFIG_DRIVER) += -DDEBUG in the Makefile

▶ When the kernel is compiled with CONFIG_DYNAMIC_DEBUG, these messages can be enabled dynamically per file, per module or per message (via /proc/dynamic_debug/control). They are not enabled by default

  • For details, see admin-guide/dynamic-debug-howto
  • You can get debug messages of interest only

▶ When neither DEBUG nor CONFIG_DYNAMIC_DEBUG is used, these messages are not compiled in at all

pr_debug() and dev_dbg() usage

▶ Debug prints can be enabled through the /proc/dynamic_debug/control file

  • cat /proc/dynamic_debug/control displays all the debug message lines known to the kernel and their state
  • e.g.: init/main.c:1427 [main]run_init_process =p " \%s\012"

▶ Individual lines, files, or modules can be enabled with the following syntax:

  • echo "file drivers/pinctrl/ +p" > /proc/dynamic_debug/control will enabledrivers/pinctrl/ All debugging information in the
  • echo "module pciehp +p" > /proc/dynamic_debug/control will enablepciehp Debug Printing in Modules
  • echo "file init/ line 1427 +p" > /proc/dynamic_debug/control reactivateinit/ Debug printout of line 1247 of file
  • commander-in-chief (military)+p exchange (sth) for (sth else)-p To disable debug printing

Debug logs troubleshooting

▶ When using dynamic debug, make sure the debug call is activated: it must appear in the debugfs control file and be enabled (=p)

▶ Is the log output located only in the kernel log buffer?

  • Check it with dmesg
  • Lower the loglevel so that messages are output directly to the console (see the example below)
  • Or set ignore_loglevel on the kernel command line to force all kernel messages to the console
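
For example, to let all kernel messages reach the console at runtime (both forms are standard interfaces):

# dmesg -n 8
# echo 8 > /proc/sys/kernel/printk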

▶ When working on external modules, it may be necessary to define DEBUG in the module source or Makefile rather than relying on dynamic debug

▶ If configurations are made via the kernel command line, will they be parsed correctly?

  • Since 5.14, the kernel reports unknown command-line parameters:

    Unknown kernel command line parameters foo, will be passed to user space.
    
  • Care needs to be taken with special string escaping (e.g. quotes)

▶ Note that some subsystems use their own logging infrastructure with specific configuration/controls (e.g. drm.debug=0x1ff)

Kernel early debug

▶ During the booting phase, the kernel may crash before displaying the system message

▶ On ARM, if the kernel does not boot or hangs without any output, early debug options can be enabled:

  • CONFIG_DEBUG_LL=y enables the ARM low-level early serial output
  • CONFIG_EARLYPRINTK=y allows printk to output messages early in the boot process

▶ The earlyprintk command-line parameter is required to actually enable early printk output

Kernel crashes and oops

Kernel crashes

▶ The kernel is not immune to crashes, and many errors may cause crashes

  • Memory access errors (null pointers, out-of-bounds accesses, etc.)
  • Explicit calls to panic() when an error is detected
  • Incorrect use of kernel execution modes (e.g. sleeping in atomic context)
  • Kernel detects deadlock

▶ On such an error, the kernel displays a "kernel oops" message on the console.

Kernel oops

▶ Message content depends on the architecture used

▶ Most architectures will display at least the following information:

  • CPU state at the time of oops
  • Register contents
  • Backtracking function calls that cause crashes
  • Stack contents (last X bytes)

▶ Depending on the architecture, the content of the PC register (sometimes called IP, EIP, etc.) identifies where the crash occurred

▶ With CONFIG_KALLSYMS=y, symbol names are embedded in the kernel image, so the backtrace shows meaningful symbol names

▶ The format of the symbols displayed in the traceback stack is:

  • <symbol_name>+<hex_offset>/<symbol_size>

▶ If the oops is not critical (it occurred in process context), the kernel kills the offending process and continues execution

  • Kernel stability may nevertheless be compromised!

▶ Tasks that hang for too long may also generate oops (CONFIG_DETECT_HUNG_TASK)

▶ If KGDB is supported, the kernel switches to KGDB mode when oops occur

Oops example

imageimage

Kernel oops debugging: addr2line

▶ You can use addr2line to convert the displayed address/symbol to a source line:

  • addr2line -e vmlinux <address>

▶ GNU binutils >= 2.39 addr2line can handle the symbol+offset notation directly

  • addr2line -e vmlinux <symbol_name>+<off>

▶ For older binutils versions, the faddr2line script from the kernel sources handles the symbol+offset notation

  • scripts/faddr2line vmlinux <symbol_name>+<off>

▶ The kernel must be built with CONFIG_DEBUG_INFO=y so that debug information is embedded in the vmlinux file

Kernel oops debugging: decode_stacktrace.sh

▶ The decode_stacktrace.sh script shipped with the kernel sources automates the addr2line decoding of an oops

▶ This script converts all symbolic names/addresses to the corresponding file/line and shows the assembly code that triggered the crash

./scripts/decode_stacktrace.sh vmlinux linux_source_path/ < oops_ > decoded_oops.txt

▶ Note: set the CROSS_COMPILE and ARCH environment variables to obtain the correct disassembly

Oops behavior configuration

▶ Sometimes, crashes can be more severe, causing the kernel to panic and stop execution altogether in a busy loop

▶ CONFIG_PANIC_TIMEOUT controls automatic reboot on panic:

  • 0: with no reboot
  • Negative value: Immediate restart
  • Positive: number of seconds to wait before rebooting

▶ You can configure OOPS to always be panic

  • At boot time, by adding oops=panic to the command line
  • At build time, by setting CONFIG_PANIC_ON_OOPS=y (see also the runtime sysctls below)
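
The same behavior can also be adjusted at runtime through standard sysctls:

# echo 10 > /proc/sys/kernel/panic         # reboot 10 s after a panic
# echo 1 > /proc/sys/kernel/panic_on_oops  # same effect as oops=panic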

The Magic SysRq

A feature provided by the serial driver

▶ Multiple debugging/recovery commands can be executed in case of serious problems with the kernel

  • On embedded systems: send a break character on the serial console (in the terminal program, e.g. press [Ctrl]+a then [Ctrl]+\), then press <character>
  • Or write <character> to /proc/sysrq-trigger (see the example below)

▶ Example:

  • h: show available commands
  • s: synchronize all mounted file systems
  • b: Reboot the system
  • w: show the kernel stacks of all sleeping processes
  • t: show kernel stacks for all running processes
  • g: enter kgdb mode
  • z: flush trace buffer
  • c: trigger a crash (kernel panic)
  • You can also register your own commands

▶ See admin-guide/sysrq for details
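
For example, from a shell (assuming SysRq is allowed via /proc/sys/kernel/sysrq):

# echo w > /proc/sysrq-trigger    # dump the stacks of sleeping tasks
# dmesg | tail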

Built-in Kernel self tests

Kernel memory issue debugging

▶ The same memory problems that occur in user space also occur when writing kernel code:

  • Out-of-bounds accesses
  • Use-after-free (dereferencing a pointer after kfree())
  • Memory leaks (missing kfree()) leading to memory exhaustion

▶ A variety of tools are available to capture these issues

  • KASAN can find use-after-free and out-of-bounds accesses
  • KFENCE can find use-after-free and out-of-bounds accesses on production systems
  • Kmemleak can find memory leaks caused by forgetting to free memory

KASAN

▶ You can look for the use of freed memory and out-of-bounds access problems.

▶ It relies on GCC instrumenting the kernel at compile time

▶ Supports almost all architectures (ARM, ARM64, PowerPC, RISC-V, S390, Xtensa and X86)

▶ KASAN is enabled with the CONFIG_KASAN kernel option

▶ You can enable KASAN for a specific file by modifying the Makefile.

  • KASAN_SANITIZE_file.o := y Enabling KASAN for specific files
  • KASAN_SANITIZE := y Enable KASAN for all files in the Makefile folder

Kmemleak

▶ Kmemleak can find memory leaks of objects allocated dynamically with kmalloc()

  • It scans memory to detect whether allocated addresses are still referenced

▶ Once CONFIG_DEBUG_KMEMLEAK is enabled, kmemleak is controlled through files in debugfs

▶ Scan for memory leaks every 10 minutes

  • It can be disabled with CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=n

▶ A scan can be triggered immediately as follows

  • # echo scan > /sys/kernel/debug/kmemleak

▶ Results are displayed in the debugfs

  • # cat /sys/kernel/debug/kmemleak

▶ For more information, see dev-tools/kmemleak

Kmemleak report

# cat /sys/kernel/debug/kmemleak
unreferenced object 0x82d43100 (size 64):
  comm "insmod", pid 140, jiffies 4294943424 (age 270.420s)
  hex dump (first 32 bytes):
    b4 bb e1 8f c8 a4 e1 8f 8c ce e1 8f 88 c6 e1 8f ................
    10 a5 e1 8f 18 e2 e1 8f ac c6 e1 8f 0c c1 e1 8f ................
  backtrace:
    [<c31f5b59>] slab_post_alloc_hook+0xa8/0x1b8
    [<c8200adb>] kmem_cache_alloc_trace+0xb8/0x104
    [<1836406b>] 0x7f005038
    [<89fff56d>] do_one_initcall+0x80/0x1a8
    [<31d908e3>] do_init_module+0x50/0x210
    [<2658dd55>] load_module+0x208c/0x211c
    [<e1d48f15>] sys_finit_module+0xe4/0xf4
    [<1de12529>] ret_fast_syscall+0x0/0x54
    [<7ee81f34>] 0x7eca8c80

UBSAN

▶ UBSAN is a runtime detector that detects undefined code behavior

  • Shifts by an amount larger than the type width
  • integer overflow
  • Unaligned pointer access
  • Out-of-bounds access to static arrays
  • /docs/

▶ Use compile-time detection to insert checks performed at runtime

▶ Requires CONFIG_UBSAN=y

▶ UBSAN can be enabled for specific files by modifying the Makefile

  • UBSAN_SANITIZE_file.o := y enables UBSAN for a specific file
  • UBSAN_SANITIZE := y enables UBSAN for all files in the Makefile's directory

UBSAN: example of UBSAN report

▶ The following report shows an undefined behavior: a shift exponent larger than the width of the 32-bit type

UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/:425
...
RIP: 0033:0x4497b9
Code: e8 8c 9f 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 0f 83 9b 6b fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fb5ef0e2c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fb5ef0e36cc RCX: 00000000004497b9
RDX: 0000000020000040 RSI: 0000000000000258 RDI: 0000000000000014
RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000005490 R14: 00000000006ed530 R15: 00007fb5ef0e3700

Debugging locking

▶ Lock Debugging: Verifying Lock Correctness

  • CONFIG_PROVE_LOCKING
  • Instruments the kernel locking code
  • Detects violations of locking rules during the lifetime of the system, e.g.:
    • Locks acquired in different orders (it keeps track of and compares lock ordering)
    • Spinlocks taken both in interrupt handlers and in process context with interrupts enabled
  • Not suitable for production systems
  • For details, see locking/lockdep-design

CONFIG_DEBUG_ATOMIC_SLEEP allows detecting code that mistakenly sleeps in atomic sections (typically while holding a spinlock).

  • The detected problems can be displayed via dmesg
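▶ As a sketch, a debug build for chasing locking problems might enable the following options (the exact set depends on what is being investigated):

  • CONFIG_PROVE_LOCKING=y
  • CONFIG_DEBUG_SPINLOCK=y
  • CONFIG_DEBUG_MUTEXES=y
  • CONFIG_DEBUG_ATOMIC_SLEEP=y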

Concurrency issues

▶ KCSAN: the Kernel Concurrency SANitizer framework

▶ Introduced in Linux 5.8, enabled with CONFIG_KCSAN

▶ A dynamic race detector relying on compile-time instrumentation

▶ Can detect concurrency problems in the system (mainly data races)

▶ For more information, see dev-tools/kcsan and /Articles/816850/

KGDB

kgdb - A kernel debugger

CONFIG_KGDB

▶ The execution of the kernel is completely controlled by gdb on another machine connected using a serial line

▶ Allows doing almost anything, including inserting breakpoints in interrupt handlers

▶ Supports the most popular CPU architectures

CONFIG_GDB_SCRIPTS allows building the GDB Python scripts provided by the kernel

  • For more information, see dev-tools/gdb-kernel-debugging

kgdb kernel config

CONFIG_DEBUG_KERNEL=y Make KGDB support available

CONFIG_KGDB=y Enable KGDB

CONFIG_DEBUG_INFO=y Compile the kernel with debugging information (-g)

CONFIG_FRAME_POINTER=y For more reliable stack traces

CONFIG_KGDB_SERIAL_CONSOLE=y Enable Serial KGDB

CONFIG_GDB_SCRIPTS=y Enabling kernel GDB python scripts

CONFIG_RANDOMIZE_BASE=n Disable KASLR

CONFIG_WATCHDOG=n Disable the watchdog

CONFIG_MAGIC_SYSRQ=y Enabling Magic SysReq Support

CONFIG_STRICT_KERNEL_RWX=n Disable memory protection on kernel sections, so that breakpoints can be inserted

kgdb pitfalls

▶ KASLR must be disabled so that gdb does not operate on randomized (and thus mismatching) kernel addresses

  • If KASLR is enabled in the configuration, disable it at boot time with the nokaslr kernel parameter

▶ Disable platform watchdog to prevent rebooting during debugging

  • While execution is stopped in KGDB, interrupts are disabled and the watchdog is not serviced
  • The watchdog is sometimes enabled by an earlier boot stage (e.g. the bootloader); be sure to disable it there too!

▶ The gdb interrupt command or Ctrl+C cannot be used to interrupt kernel execution

▶ Insertion of breakpoints at arbitrary positions is not supported (see CONFIG_KGDB_HONOUR_BLOCKLIST)

▶ A terminal driver that supports polling is required.

▶ Some architectures lack the necessary features (e.g. no watchpoints on arm32) and may thus be unstable

Using kgdb

▶ See the kernel documentation for details: dev-tools/kgdb

▶ A kgdb I/O driver must be selected, e.g. kgdboc ("kgdb over console") for use over a serial console, enabled with CONFIG_KGDB_SERIAL_CONSOLE

▶ Configure kgdboc at boot time by passing the following kernel parameter:

  • kgdboc=<tty-device>,<baud-rate>, e.g. kgdboc=ttyS0,115200

▶ Or configure it at runtime via sysfs:

  • echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc
  • If the console does not support polling, this command will return an error

▶ The kgdbwait parameter can also be passed to the kernel: it makes kgdb wait for a debugger connection during boot

▶ Boot the kernel; once the console is initialized, interrupt it by sending a break character followed by g on the serial console (see Magic SysRq)
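▶ Putting the parameters together, a typical debug command line could look like the following (device name and baud rate are examples to adapt):

  • console=ttyS0,115200 kgdboc=ttyS0,115200 kgdbwait nokaslr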

▶ On the host workstation, start gdb:

  • arm-linux-gdb ./vmlinux
  • (gdb) set remotebaud 115200
  • (gdb) target remote /dev/ttyS0

▶ Once connected, you can debug the kernel as if it were an application program

▶ On the GDB side, the first threads represent the CPU contexts (shadow CPUs), and the remaining threads each represent a task

Kernel GDB scripts

CONFIG_GDB_SCRIPTS builds Python scripts that simplify kernel debugging (they add new commands and functions)

▶ When running gdb vmlinux, the scripts present in the build directory are loaded automatically

  • lx-symbols: reload symbols for vmlinux and modules
  • lx-dmesg: show the kernel dmesg log
  • lx-lsmod: show loaded modules
  • lx-device-{bus|class|tree}: display device buses, classes and trees
  • lx-ps: list tasks, similar to ps
  • $lx_current() returns the current task_struct
  • $lx_per_cpu(var, cpu) returns a per-cpu variable
  • apropos lx shows all available functions

dev-tools/gdb-kernel-debugging
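▶ A short example session (output omitted; this assumes the scripts were auto-loaded when opening vmlinux):

  • (gdb) lx-symbols
  • (gdb) lx-dmesg
  • (gdb) lx-ps
  • (gdb) p $lx_current().comm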

KDB

CONFIG_KGDB_KDB includes a kgdb frontend named "KDB".

▶ This front-end exposes a debugging prompt on the serial terminal to debug the kernel without the need for an external gdb

▶ You can enter the KDB using the same mechanism as for entering the kgdb mode.

▶ You can use both KDB and KGDB.

  • In KDB, use the kgdb command to enter kgdb mode
  • From gdb, send the maintenance packet 3 command to switch from kgdb to KDB mode
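▶ For instance, after breaking into the debugger (e.g. with the g SysRq command), a few typical commands at the kdb prompt (an illustrative selection):

  • kdb> ps
  • kdb> bt
  • kdb> dmesg
  • kdb> go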

kdmx

▶ When the system has only one serial port, KGDB and the serial console cannot be used at the same time, because only one application can access the port

▶ Fortunately, the kdmx tool can split the GDB traffic and the standard console coming from a single serial port into two ptys (/dev/pts/x), allowing KGDB and serial output to be used simultaneously (see the example below)

/pub/scm/utils/kernel/kgdb/

  • in the kdmx subdirectory of this repository
(figure: kdmx splitting a single serial port into two ptys, one for the terminal emulator and one for gdb)
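▶ An invocation looks roughly like the following (the flags and device name are assumptions based on the tool's examples, check the kdmx help for the exact syntax); kdmx then prints the two pty paths to use for the terminal emulator and for gdb:

  • $ kdmx -n -d -p /dev/ttyACM0 -b 115200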

Going further with KGDB

▶ See the following link for more examples and explanations:

  • Video: /watch?v=HBOwoSyRmys
  • Slides: /images/1/1b/ELC19_Serial_kdb_kgdb.pdf

crash

crash is a CLI tool for interacting with a kernel, either live or post-mortem ("dead or alive")

  • On a live system it uses /dev/mem or /proc/kcore
  • Requires CONFIG_STRICT_DEVMEM=n

▶ It can also analyze coredump files generated by kdump, kvmdump, etc.

▶ Based on gdb, and provides many specific commands to inspect the kernel state

  • Stacks, dmesg, memory mapping of processes, irqs, virtual memory domains, etc.

▶ All tasks running on the system can be checked.

/crash-utility/crash

crash example

$ crash vmlinux vmcore
[...]
	TASKS: 75
NODENAME: buildroot
  RELEASE: 5.13.0
  VERSION: #1 SMP PREEMPT Tue Nov 15 14:42:25 CET 2022
  MACHINE: armv7l (unknown Mhz)
  MEMORY: 512 MB
    PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 00000070"
    	PID: 127
  COMMAND: "watchdog"
    TASK: c3f163c0 [THREAD_INFO: c3f00000]
    	CPU: 1
    STATE: TASK_RUNNING (PANIC)
    
crash> mach
   MACHINE TYPE: armv7l
  	MEMORY SIZE: 512 MB
  		     CPUS: 1
PROCESSOR SPEED: (unknown)
             HZ: 100
      PAGE SIZE: 4096
KERNEL VIRTUAL BASE: c0000000
KERNEL MODULES BASE: bf000000
KERNEL VMALLOC BASE: e0000000
KERNEL STACK SIZE: 8192
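▶ From the crash> prompt, a few typical commands (an illustrative selection; see the project documentation for the full list):

  • crash> bt (backtrace of the task that triggered the panic)
  • crash> ps (list all tasks)
  • crash> log (dump the kernel log buffer)
  • crash> vm (virtual memory areas of the current context)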

post-mortem analysis

Kernel crash post-mortem analysis

▶ Sometimes, it is not possible to access a crashed system or to keep the system offline while waiting for debugging

▶ The kernel can generate a crash dump (a vmcore file) that can be analyzed remotely, allowing the system to be rebooted quickly while still supporting post-mortem analysis with gdb

▶ This feature relies on kexec and kdump: right after a crash, another kernel is booted in order to dump the vmcore file

  • The vmcore file can then be saved to local storage or sent over SSH, FTP, etc.

kexec & kdump

▶ On panic, kexec support allows booting a "dump-capture kernel" directly from the crashed kernel

  • Most of the time, a dedicated dump-capture kernel is built for this task (minimal configuration, with an initramfs/initrd)

▶ The kexec system reserves a portion of RAM for kdump kernel execution at startup.

  • The crashkernel parameter specifies the physical memory region to reserve for the crash kernel

▶ kexec-tools is then used to load the dump-capture kernel into this memory region

  • Internally, this uses the kexec_load system call (man 2 kexec_load)

▶ Finally, on panic, the kernel reboots into the dump-capture kernel, allowing the user to save the kernel coredump (/proc/vmcore) to any medium

▶ Depending on the architecture, additional command line options may be required

▶ See admin-guide/kdump/kdump to fully understand how to configure the kdump kernel with kexec!

▶ There are also user-space services and tools to automatically collect vmcore dumps and send them to a remote host

  • The kdump systemd service and the makedumpfile tool, which can also compress the vmcore into a smaller file (x86, PPC, IA64, S390 only)
  • /makedumpfile/makedumpfile
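▶ For example, from the dump-capture environment, the dump can simply be copied, or saved compressed and filtered with makedumpfile (the flags below are commonly used values, to be adapted):

  • # cp /proc/vmcore /var/crash/vmcore
  • # makedumpfile -l -d 31 /proc/vmcore /var/crash/vmcore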

kdump

(figure: kdump flow: on panic, the running kernel kexec-boots the dump-capture kernel from the reserved memory region, from which /proc/vmcore can be saved)

kexec config and setup

▶ On the standard (first) kernel:

  • CONFIG_KEXEC=y Enable KEXEC support

  • kexec-tools, providing the kexec command

  • A kernel and DTB accessible by kexec

▶ dump-capture kernel:

  • CONFIG_CRASH_DUMP=y Kernel with dump crash enabled
  • CONFIG_PROC_VMCORE=y Enable /proc/vmcore support
  • CONFIG_AUTO_ZRELADDR=y On ARM32 platforms

▶ Set the correct crashkernel command line option:

  • crashkernel=size[KMG][@offset[KMG]]

▶ Then, from the running (first) kernel, load the dump-capture kernel with kexec:

  • kexec --type zImage -p my_zImage --dtb=my_dtb.dtb -- initrd=my_initrd --append="command line option"
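▶ As a concrete sketch, one could reserve 256 MB with crashkernel=256M on the first kernel's command line, then check after boot that the region was actually reserved:

  • # dmesg | grep -i crashkernel
  • # grep -i "crash kernel" /proc/iomem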

Going further with kexec & kdump

▶ See the following for more information on kexec/kdump:

  • Video: /watch?v=aUGNDJPpUUg
  • Slides: /hosted_files/ossna2022/c0/Postmortem_ Kexec%2C Kdump and