Linux debugging, profiling and tracing training
This article is based on Bootlin's public training materials.
Debugging, Profiling, Tracing
Debugging
▶ Finding and fixing problems in the software/system
▶ Different tools and methods may be used:
- Interactive debugging (e.g. GDB)
- Post-mortem analysis (e.g. coredump)
- Control flow analysis (using tracing tools)
- Testing (integration testing)
▶ Most debugging is done in the development environment
▶ Usually intrusive: execution of the program can be suspended and resumed
Profiling
▶ Helps optimize performance by analyzing the runtime of a program
▶ Usually collects counters while the program is running
▶ Measures performance using specific tools, libraries, and operating system features such as perf and OProfile
▶ Data is first aggregated while the program runs (number of function calls, memory usage, CPU load, cache misses, etc.); meaningful information is then extracted from this data and used to optimize the program
Tracing
▶ Understanding bottlenecks and problems by tracing the execution flow of an application
▶ Instrumentation code is executed at compile time or at run time. Dedicated tracers such as LTTng, trace-cmd and SystemTap can be used to trace function calls from user space to kernel space.
▶ Allows you to view the functions and values used in the execution of the application
▶ Trace data is usually recorded at runtime and displayed at the end of the run
- A large amount of tracing data is generated at the end of a trace
- Usually much larger than the profiling data
▶ Since data can be extracted via tracepoints, it can also be used for debugging purposes
Linux Application Stack
User/Kernel mode
▶ User mode and kernel mode usually refer to the privilege level of execution (privilege level)
▶ This mode actually refers to the processor execution mode, i.e., the hardware mode
▶ The kernel can control the complete processor state (exception handling, MMU, etc.), while user space can only do some basic control and execution under kernel supervision.
Processes and Threads
▶ A process is a group of resources, such as memory, threads, and file descriptors, allocated for the execution of a program.
▶ A process is identified by a PID; all of its information is exposed under /proc/<pid>
- /proc/self always shows information about the process that is accessing this directory
▶ When a process is started, the kernel sets up a struct task_struct that represents a thread of execution that can be scheduled
- A process is embodied in the kernel as a thread associated with multiple resources.
▶ A thread is an independent execution unit that shares resources within the process, such as address space, file descriptors, etc.
▶ The fork() system call creates a new process, and pthread_create() creates a new thread (see the sketch after this list)
▶ Only one task executes on a given CPU core at any time (the get_current() function returns the currently executing task), and a task executes on only one CPU core at a time
▶ Different CPU cores can perform different tasks
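A minimal sketch (hypothetical example, not from the training material) showing both primitives mentioned above: fork() creating a new process and pthread_create() creating a new thread that shares the process resources. Build with gcc demo.c -pthread.
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>
static void *thread_fn(void *arg)
{
    /* Threads share the address space and file descriptors of their process */
    printf("thread running in pid %d\n", getpid());
    return NULL;
}
int main(void)
{
    pid_t pid = fork();                        /* new process: separate address space */
    if (pid == 0) {
        printf("child process, pid %d\n", getpid());
        return 0;
    }
    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL); /* new thread: shared resources */
    pthread_join(t, NULL);
    waitpid(pid, NULL, 0);
    return 0;
}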
MMU and memory management
▶ In the Linux kernel (when configured with CONFIG_MMU=y), all addresses accessed by the CPU are virtual addresses
▶ The Memory Management Unit (MMU) can map this virtual memory to physical memory (RAM or IO)
▶ The basic mapping unit of the MMU is the page; the page size is fixed (depending on the architecture/kernel configuration)
▶ Address mapping information is inserted into the page table of the MMU hardware and is used to translate virtual addresses accessed by the CPU into physical addresses
▶ The MMU can restrict page map access through certain attributes such as No Execute, Writable, Readable bits, Privileged/User bit, cacheability, etc.
Userspace/Kernel memory layout
▶ Each process has its own virtual address space (the mm field of struct task_struct) and its own page table (but all processes share the same kernel mapping)
▶ By default, all user mapping addresses (base of heap, stack, text, data, etc.) are randomized to mitigate attacks; this can be disabled with the norandmaps kernel parameter
Different processes have different user memory spaces:
Kernel memory map
▶ The kernel has its own specific memory map
▶ The linear mapping is configured at boot time by inserting entries into the kernel's initial page table
▶ It contains different memory areas depending on their location and purpose
▶ Kernel address space layout randomization can be disabled with the nokaslr command-line parameter
Userspace memory segments
▶ When a process is started, the kernel sets up several Virtual Memory Areas (VMAs, managed through struct vm_area_struct), each configured with different properties
▶ A VMA is a memory area mapped with specific attributes (R/W/X)
▶ A segmentation fault occurs when a program accesses an unmapped memory area, or accesses a mapped area in a way its attributes do not allow, for example:
- Writing to a read-only memory area
- Trying to execute a non-executable memory area
▶ New memory areas can be created with mmap() (see the sketch after the example below)
▶ The mappings of a process are visible in /proc/<pid>/maps:
7f1855b2a000-7f1855b2c000 rw-p 00030000 103:01 3408650 ld-2.
7ffc01625000-7ffc01646000 rw-p 00000000 00:00 0 [stack]
7ffc016e5000-7ffc016e9000 r--p 00000000 00:00 0 [vvar]
7ffc016e9000-7ffc016eb000 r-xp 00000000 00:00 0 [vdso]
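A minimal sketch (hypothetical example, not from the training material) creating a new anonymous read/write memory area with mmap(); while the program is paused, the new VMA shows up in /proc/<pid>/maps.
#include <stdio.h>
#include <sys/mman.h>
int main(void)
{
    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("new mapping at %p\n", p);
    getchar();          /* pause here and inspect /proc/<pid>/maps */
    munmap(p, len);
    return 0;
}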
Userspace memory types
Terms for memory in Linux tools
▶ When using Linux tools, the following four terms are used to describe memory:
- VSS/VSZ: Virtual Set Size (virtual memory size, including shared libraries)
- RSS: Resident Set Size (total physical memory used, including shared libraries)
- PSS: Proportional Set Size (private memory plus a proportional share of shared memory; if a process has 10MB of private memory and shares 10MB with one other process, its PSS is 15MB)
- USS: Unique Set Size (physical memory occupied by the process, excluding shared mapped memory)
▶ VSS >= RSS >= PSS >= USS.
Process context
▶ The process context can be seen as the content of the CPU registers associated with a process: execution registers, stack pointer, etc.
▶ The process context also denotes an execution state in which sleeping is allowed in kernel mode
▶ Code executing in process context can be preempted
▶ When executing in process context, the current struct task_struct can be accessed with get_current()
Scheduling
▶ There are several reasons to wake up the scheduler
- Periodic ticks (timer interrupts) generated at HZ frequency
- Programmed interrupts on tickless systems (CONFIG_NO_HZ=y)
- Explicit calls to schedule() in the code
- Implicit calls to functions that can sleep (blocking operations such as kmalloc(), wait_event(), etc.)
▶ When the scheduling function is entered, the scheduler selects a new struct task_struct to run and finally calls the switch_to() macro
▶ switch_to() saves the process context of the current task and restores the process context of the next task, which is then set as the new current task
The Linux Kernel Scheduler
▶ The Linux kernel scheduler is a key component in enabling real-time behavior
▶ It is responsible for deciding which runnable tasks to perform
▶ It is also responsible for selecting the CPU on which the task runs and is tightly coupled with CPUidle and CPUFreq.
▶ Responsible for task scheduling in both kernel space and user space
▶ Each task is assigned a scheduling class or policy
▶ The scheduling algorithm selects which task to run based on its class
▶ Tasks with different scheduling classes can coexist in the system
Non-Realtime Scheduling Classes
There are 3 non-real-time scheduling classes:
▶ SCHED_OTHER: the default policy, using a time-slice algorithm
▶ SCHED_BATCH: similar to SCHED_OTHER, but intended for CPU-intensive batch tasks
▶ SCHED_IDLE: very low priority
▶ SCHED_OTHER and SCHED_BATCH tasks can use the nice value to increase or decrease their scheduling frequency
- A higher nice value means a lower scheduling frequency
Realtime Scheduling Classes
There are 3 real-time scheduling classes:
▶ Runnable real-time tasks preempt any other lower-priority task
▶ SCHED_FIFO: tasks with the same priority are scheduled first-in, first-out
▶ SCHED_RR: similar to SCHED_FIFO, but time-slice round-robin is used between tasks with the same priority
▶ SCHED_FIFO and SCHED_RR tasks can be assigned a priority from 1 to 99
▶ SCHED_DEADLINE: used for repetitive jobs, with additional attributes attached to the task:
- computation time: the time the job needs to complete
- deadline: the maximum time allowed to complete the job
- period: the period within which only one job may run
▶ Defining task classes alone is not sufficient to obtain real-time behavior
Changing the Scheduling Class
▶ Each task has a scheduling class (Scheduling Class), which is SCHED_OTHER by default
▶ The sched_setscheduler() system call (man 2 sched_setscheduler) modifies the scheduling class of a task (see the sketch after this list)
▶ The chrt tool:
- Modify the scheduling class of a running task: chrt -f/-b/-o/-r/-d -p PRIO PID
- Launch a program with a specific scheduling class: chrt -f/-b/-o/-r/-d PRIO CMD
- Display the scheduling class and priority of a process: chrt -p PID
▶ A new process inherits the scheduling class of its parent, unless the SCHED_RESET_ON_FORK flag was set with sched_setscheduler() (man 2 sched_setscheduler)
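A minimal sketch (hypothetical example, not from the training material) switching the calling process to the SCHED_FIFO class with priority 50 via sched_setscheduler(); running it requires root or CAP_SYS_NICE.
#include <stdio.h>
#include <sched.h>
int main(void)
{
    struct sched_param param = { .sched_priority = 50 };
    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running with SCHED_FIFO priority %d\n", param.sched_priority);
    return 0;
}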
Context switching
▶ Context switching here refers to changing the execution mode of the processor (Kernel ↔ User), which happens on:
- Explicit execution of a system call instruction (a synchronous request from user mode to the kernel)
- Implicitly received exceptions (MMU exceptions, interrupts, breakpoints, etc.)
▶ This state change will eventually be reflected in a kernel entry (usually a call vector) that will execute the necessary code and set the correct state for kernel mode execution.
▶ The kernel handles behaviors such as register saving and switching to the kernel stack:
- The kernel stack size is fixed for security purposes
Exceptions
▶ Exceptions are events that cause the CPU to enter exception mode (to handle the exception)
▶ There are two main types of exceptions: synchronous and asynchronous
- Asynchronous exceptions are typically generated by MMU faults, bus aborts, or interrupts received from hardware and software
- Synchronous exceptions are raised by executing specific instructions, such as breakpoints, system calls, etc.
▶ When such an exception is triggered, the processor jumps to the exception vector and executes the exception code
Interrupts
▶ Interrupts are asynchronous signals generated by hardware peripherals
- They can also be generated by specific instructions (e.g. Inter-Processor Interrupts)
▶ When an interrupt is received, the CPU changes its execution mode, jumps to a specific vector and switches to kernel mode to handle the interrupt
▶ When there are multiple CPUs (cores), interrupts are usually directed to a certain core
▶ The interrupt load of each CPU can be controlled through "IRQ affinity"
- See core-api/irq/irq-affinity and man 1 irqbalance
▶ When handling an interrupt, the kernel runs in a special context called interrupt context
▶ This context has no access to user space and must not use get_current()
▶ Depending on the architecture, an IRQ stack may be used
▶ It runs with interrupts disabled (nested interrupts are not supported)!
System Calls
▶ System calls allow user space to request services from the kernel by executing a dedicated instruction (man 2 syscall)
- A system call is usually executed when calling functions provided by the libc (e.g. read(), write(), etc.)
▶ Each system call is identified by a numeric identifier passed in a register:
- The kernel defines the system call identifiers with __NR_<syscall>, for example: #define __NR_read 63 #define __NR_write 64
▶ The kernel holds a table of function pointers to these identifiers, which are used to call the correct handler function after validation of the system call has been completed.
▶ Passing system call parameters via registers (max. 6 parameters)
▶ When a system call is executed, the CPU changes its execution state and switches to kernel mode
▶ Each architecture has a specific hardware mechanism (man 2 syscall)
mov w8, #__NR_getpid
svc #0
tstne x0, x1
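The assembly above issues the system call directly; a minimal C-level sketch (hypothetical example, not from the training material) does the same through the syscall() wrapper (man 2 syscall), passing the numeric identifier explicitly.
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
int main(void)
{
    /* Equivalent to getpid(): SYS_getpid (__NR_getpid) selects the handler
     * in the kernel's system call table */
    long pid = syscall(SYS_getpid);
    printf("pid = %ld\n", pid);
    return 0;
}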
Kernel execution contexts
▶ The kernel executes code in different contexts depending on the event being processed
▶ May include disabling interrupts (by disabling interrupts, you can ensure that a particular interrupt handler does not preempt the current code), specific stacks, etc.
Kernel threads
▶ Kernel threads (kthreads) are a special kind of struct task_struct that is not associated with any user-space resources (mm == NULL)
▶ Kernel threads are cloned from the kthreadd process and can be created with kthread_create()
▶ Similar to user processes, you can schedule as well as hibernate kernel threads in the process context
▶ The ps command shows the kernel threads, with their names displayed between square brackets:
$ ps --ppid 2 -p 2 -o uname,pid,ppid,cmd,cls
USER PID PPID CMD CLS
root 2 0 [kthreadd] TS
root 3 2 [rcu_gp] TS
root 4 2 [rcu_par_gp] TS
root 5 2 [netns] TS
root 7 2 [kworker/0:0H-events_highpr TS
root 10 2 [mm_percpu_wq] TS
root 11 2 [rcu_tasks_kthread] TS
Workqueues
▶ Workqueues allow scheduling the execution of work at a future point in time.
▶ Workqueues execute work functions in kernel threads:
- Sleeping is allowed while executing the deferred work
- They run with interrupts enabled
▶ Work can be executed in a specific workqueue or in a global workqueue shared by multiple users.
softirq
▶ SoftIRQs are a kernel mechanism that runs in a software interrupt context
▶ They run code that must execute after interrupt handling and that needs low latency:
- Executed right after hard interrupt handlers have run, still in interrupt context
- Because they execute in interrupt context, sleeping is not allowed
▶ Code that must run in softirq context should use the existing softirq mechanisms such as tasklets or BH workqueues (which replace tasklets since Linux 6.9) rather than implementing new softirqs
Threaded interrupts
▶ Threaded interrupts are a mechanism that splits interrupt handling between a hard interrupt handler (IRQ handler) and a threaded interrupt handler
▶ The threaded handler runs in a kthread and can therefore execute work that may sleep
▶ The kernel creates one kthread per interrupt line that requested a threaded handler
- The kthread is named irq/<irq>-<name> and is visible with the ps command
Allocations and context
▶ You can use the following function to request memory in the kernel:
void *kmalloc(size_t size, gfp_t gfp_mask);
void *kzalloc(size_t size, gfp_t gfp_mask);
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
▶ All allocation functions take a gfp_mask parameter that specifies the type of allocation (see the sketch below):
- GFP_KERNEL: normal allocation, may sleep while allocating memory (cannot be used in interrupt context)
- GFP_ATOMIC: atomic allocation, does not sleep while allocating memory
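A minimal sketch (hypothetical kernel code, not from the training material) showing how the gfp_mask is typically chosen according to the execution context.
#include <linux/slab.h>
#include <linux/gfp.h>
void *alloc_in_process_context(size_t size)
{
    /* Process context: sleeping is allowed, GFP_KERNEL is fine */
    return kzalloc(size, GFP_KERNEL);
}
void *alloc_in_atomic_context(size_t size)
{
    /* Interrupt/atomic context: must not sleep, use GFP_ATOMIC */
    return kmalloc(size, GFP_ATOMIC);
}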
Linux Common Analysis & Observability Tools
Pseudo Filesystems
▶ The kernel exposes some virtual file systems to provide system information
▶ procfs contains process and system information
- Mounted at /proc
- Often parsed by tools to present the raw data in a more user-friendly way
▶ sysfs provides hardware/logical information about devices and drivers; mounted at /sys
▶ debugfs exposes debugging-related information
- Normally mounted at /sys/kernel/debug/ with: mount -t debugfs none /sys/kernel/debug
procfs
▶ procfs exposes process- and system-related information (man 5 proc)
- /proc/cpuinfo: CPU information
- /proc/meminfo: memory information (used, free, total, etc.)
- /proc/sys/: tunable system parameters; the list of tunables that can be modified is documented in admin-guide/sysctl/index
- /proc/interrupts: interrupt counts for each CPU
- /proc/irq: configuration of each interrupt line
▶ /proc/<pid>/: process-related information
- /proc/<pid>/status: basic information about the process
- /proc/<pid>/maps: memory mappings
- /proc/<pid>/fd: file descriptors of the process
- /proc/<pid>/task: descriptors of the threads belonging to the process
▶ /proc/self/: information about the process accessing the file
▶ See filesystems/proc and man 5 proc for the documentation of the available procfs content
sysfs
▶ The sysfs filesystem exposes information about various kernel subsystems, hardware devices and their drivers (man 5 sysfs)
▶ Its file hierarchy represents the device tree as seen by the kernel, showing the relationship between drivers and devices
▶ /sys/kernel contains kernel-related files:
- irq: interrupt-related information (mapping, counts, etc.)
- tracing: tracing control
▶ See admin-guide/abi-stable for the documentation of the stable sysfs ABI
debugfs
▶ debugfsis a simple RAM-based file system that exposes debugging information
▶ Some subsystems (clk, block, dma, gpio, etc.) use it to expose internal debugging information
▶ Usually mounted at /sys/kernel/debug
- /sys/kernel/debug/dynamic_debug enables dynamic debugging
- /sys/kernel/debug/clk/clk_summary exposes the clock tree
ELF files analysis
ELF files
ELF stands for Executable and Linkable Format
▶ The file starts with a header that describes the structure of the binary
▶ It contains a series of segments and sections holding the data:
- .text section: code
- .data section: data
- .rodata section: read-only data
- .debug_info section: debugging information
▶ Sections are part of segments, which can be loaded into memory
▶ The same format is used for all architectures supported by the kernel; vmlinux uses it as well
- Many other operating systems also use ELF as their standard executable format
binutils for ELF analysis
▶ binutils is a set of tools for working with binary files (object files or executables)
- It includes ld, as and other useful tools
▶ readelf displays information about ELF files (header, sections, segments, etc.)
▶ objdump displays and disassembles ELF files
▶ objcopy converts ELF files or extracts/translates parts of ELF files
▶ nm displays the list of symbols embedded in an ELF file
▶ addr2line finds the source file/line corresponding to an address in an ELF file
binutils example
▶ Use nm to find the address of the ksys_read() kernel function:
$ nm vmlinux | grep ksys_read
c02c7040 T ksys_read
▶ Use addr2line to find the source line corresponding to a kernel OOPS address or symbol name:
$ addr2line -s -f -e vmlinux ffffffff8145a8b0
queue_wc_show
:516
▶ Use readelf to display an ELF header:
$ readelf -h binary
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
...
▶ Use objcopy to convert an ELF file to a flat binary file:
$ objcopy -O binary
ldd
▶ ldd displays the shared libraries used by an ELF binary (man 1 ldd)
▶ ldd lists all the libraries that were specified at link time
- It does not show libraries loaded with dlopen()
$ ldd /usr/bin/bash
.1 (0x00007ffdf3fc6000)
.8 => /usr/lib/.8 (0x00007fa2d2aef000)
.6 => /usr/lib/.6 (0x00007fa2d2905000)
.6 => /usr/lib/.6 (0x00007fa2d288e000)
/lib64/.2 => /usr/lib64/.2 (0x00007fa2d2c88000)
Processor and CPU monitoring Tools
▶ Many tools are available to monitor various parts of the system
▶ Most tools are interactive CLI programs
- Processes: ps, top, htop, etc.
- Memory: free, vmstat, etc.
- Network
▶ Most tools rely on the sysfs or procfs filesystems to obtain process, memory and system information
- Networking tools use the netlink interface of the kernel networking subsystem
ps & top (omitted)
mpstat
▶ Displaying multiprocessor information (man 1 mpstat)
▶ For detecting unbalanced CPU load, incorrect IRQ affinity, etc.
$ mpstat -P ALL
Linux 6.0.0-1-amd64 (fixe) 19/10/2022 _x86_64_ (4 CPU)
17:02:50 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
17:02:50 all 6,77 0,00 2,09 11,67 0,00 0,06 0,00 0,00 0,00 79,40
17:02:50 0 6,88 0,00 1,93 8,22 0,00 0,13 0,00 0,00 0,00 82,84
17:02:50 1 4,91 0,00 1,50 8,91 0,00 0,03 0,00 0,00 0,00 84,64
17:02:50 2 6,96 0,00 1,74 7,23 0,00 0,01 0,00 0,00 0,00 84,06
17:02:50 3 9,32 0,00 2,80 54,67 0,00 0,00 0,00 0,00 0,00 33,20
17:02:50 4 5,40 0,00 1,29 4,92 0,00 0,00 0,00 0,00 0,00 88,40
Memory monitoring tools
free
▶ free is a simple program that shows the amount of used and remaining memory on the system (man 1 free)
- Useful to check whether the system is running out of memory
- Uses /proc/meminfo to obtain memory information
$ free -h
total used free shared buff/cache available
Mem: 15Gi 7.5Gi 1.4Gi 192Mi 6.6Gi 7.5Gi
Swap: 14Gi 20Mi 14Gi
▶ A small value in the free column does not mean memory is exhausted: Linux uses otherwise unused memory for caches to optimize performance. See drop_caches in man 5 proc to observe the impact of buffers/cache on free/available memory
vmstat
▶ vmstat shows information about system virtual memory usage
▶ It can also show information about processes, memory, paging, block IO, traps, disks and CPU activity (man 8 vmstat)
▶ Data can be collected periodically: vmstat <interval> <count>
$ vmstat 1 6
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 253440 1237236 194936 9286980 3 6 186 540 134 157 3 5 82 10 0
▶ Note: vmstat counts memory using kernel blocks of 1024 bytes
pmap
▶ pmap displays the content of /proc/<pid>/maps in a more readable form (man 1 pmap)
# pmap 2002
2002: /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
...
00007f3f958bb000 56K r---- .3.32.1
00007f3f958c9000 192K r-x-- .3.32.1
00007f3f958f9000 84K r---- .3.32.1
00007f3f9590e000 8K r---- .3.32.1
00007f3f95910000 4K rw--- .3.32.1
00007f3f95937000 8K rw--- [ anon ]
00007f3f95939000 8K r---- .2
00007f3f9593b000 152K r-x-- .2
00007f3f95961000 44K r---- .2
00007f3f9596c000 8K r---- .2
00007f3f9596e000 8K rw--- .2
00007ffe13857000 132K rw--- [ stack ]
00007ffe13934000 16K r---- [ anon ]
00007ffe13938000 8K r-x-- [ anon ]
total 11088K
I/O monitoring tools
iostat
▶ iostat shows the IOs per device on the system
▶ Useful to see whether a device is overloaded with IOs
$ iostat
Linux 5.19.0-2-amd64 (fixe) 11/10/2022 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
8,43 0,00 1,52 8,77 0,00 81,28
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme0n1 55,89 1096,88 149,33 0,00 5117334 696668 0
sda 0,03 0,92 0,00 0,00 4308 0 0
sdb 104,42 274,55 2126,64 0,00 1280853 9921488 0
iotop
▶ iotop shows IO information per process
▶ Useful to find which process generates a lot of IO traffic
- Requires the kernel options CONFIG_TASKSTATS=y, CONFIG_TASK_DELAY_ACCT=y and CONFIG_TASK_IO_ACCOUNTING=y
- Also needs to be enabled at runtime: sysctl -w kernel.task_delayacct=1
# iotop
Total DISK READ: 20.61 K/s | Total DISK WRITE: 51.52 K/s
Current DISK READ: 20.61 K/s | Current DISK WRITE: 24.04 K/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
2629 be/4 cleger 20.61 K/s 44.65 K/s firefox-esr [Cache2 I/O]
322 be/3 root 0.00 B/s 3.43 K/s [jbd2/nvme0n1p1-8]
39055 be/4 cleger 0.00 B/s 3.43 K/s firefox-esr [DOMCacheThread]
1 be/4 root 0.00 B/s 0.00 B/s init
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
3 be/0 root 0.00 B/s 0.00 B/s [rcu_gp]
4 be/0 root 0.00 B/s 0.00 B/s [rcu_par_gp]
Networking Observability tools
ss
▶ ss shows the state of network sockets
- IPv4, IPv6, UDP, TCP, ICMP and UNIX domain sockets
▶ Replaces netstat
▶ Gets its information from /proc/net
▶ Usage:
- ss: by default, display connected sockets
- ss -l: display listening sockets
- ss -a: display both listening and connected sockets
- ss -4/-6/-x: display only IPv4, IPv6 or UNIX sockets
- ss -t/-u: display only TCP or UDP sockets
- ss -p: display the process using each socket
- ss -n: display numeric addresses
- ss -s: display a summary of existing sockets
▶ See the ss manpage
# ss
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
u_dgr ESTAB 0 0 * 304840 * 26673
u_str ESTAB 0 0 /run/dbus/system_bus_socket 42871 * 26100
icmp6 UNCONN 0 0 *:ipv6-icmp *:*
udp ESTAB 0 0 192.168.10.115%wlp0s20f3:bootpc 192.168.10.88:bootps
tcp ESTAB 0 136 172.16.0.1:41376 172.16.11.42:ssh
tcp ESTAB 0 273 192.168.1.77:55494 87.98.181.233:https
tcp ESTAB 0 0 [2a02:...:dbdc]:38466 [2001:...:9]:imap2
...
#
iftop
▶ iftop displays bandwidth usage per remote host
▶ Bandwidth is displayed as a histogram
▶ iftop -i eth0
▶ The output can be customized
▶ See the iftop manpage
tcpdump
▶ tcpdump captures network traffic and decodes many protocols
▶ Packet capture is based on the libpcap library
▶ Captured packets can be saved to a file and read back later
- Saved in the pcap or the newer pcapng format
tcpdump -i eth0 -w
tcpdump -r
▶ Filters can be used to avoid capturing irrelevant packets:
tcpdump -i eth0 tcp and not port 22
Wireshark (omitted)
Application Debugging
Good practices
▶ Modern compilers can detect many errors at compile time through warnings
- To catch errors as early as possible, it is recommended to use -Werror -Wall -Wextra
▶ Compilers can also perform static analysis
- GCC provides it through the -fanalyzer flag
- LLVM provides dedicated tools for the build process
▶ Component-specific helpers/hardening can also be used
- For example, with the GNU C library, the _FORTIFY_SOURCE macro adds runtime input checks
Building with debug information
Debugging with ELF files
▶ GDB debugs ELF files that contain debugging information
▶ Debugging information uses the DWARF format
▶ It allows the debugger to map addresses to symbol names, call sites, etc.
▶ Debugging information is generated by the compiler and embedded into the ELF file when compiling with -g
- -g1: minimal debugging information (enough for backtraces)
- -g2: the default level when specifying -g
- -g3: includes extra debugging information (macro definitions)
▶ See the GCC documentation for more information
Debugging with compiler optimizations
▶ Compiler optimizations (-O<level>) can result in variables and function calls being optimized out
▶ GDB shows this when trying to display such optimized-out information:
$1 = <value optimized out>
▶ To inspect variables and functions, it is best to compile with -O0 (no optimization)
- Note: the kernel can only be compiled with -O2 or -Os
▶ Individual functions can also be annotated with a compiler attribute: __attribute__((optimize("O0")))
▶ Removing the static qualifier from a function helps avoid inlining it
- Note: LTO (Link Time Optimization) can defeat this!
▶ Declaring a specific variable as volatile prevents the compiler from optimizing it out (see the sketch after this list)
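A minimal sketch (hypothetical example, not from the training material) combining the two techniques above: the optimize("O0") attribute keeps one function unoptimized, and volatile keeps a local variable visible to the debugger.
#include <stdio.h>
/* Compile this single function without optimization */
__attribute__((optimize("O0")))
static int compute(int x)
{
    /* volatile prevents the compiler from optimizing the variable away */
    volatile int tmp = x * 2;
    return tmp + 1;
}
int main(void)
{
    printf("%d\n", compute(20));
    return 0;
}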
Instrumenting code crashes
▶ The GNU extension backtrace() (man 3 backtrace) can be used to display the application call stack:
char **backtrace_symbols(void *const *buffer, int size);
▶ With signal() (man 3 signal), hooks can be added on specific signals to print the call stack:
- For example, catching SIGSEGV to dump the current call stack (see the sketch below)
void (*signal(int sig, void (*func)(int)))(int);
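A minimal sketch (hypothetical example, not from the training material) installing a SIGSEGV handler that dumps the call stack with backtrace(). backtrace_symbols_fd() is used here because it avoids malloc(); note that it is still not formally async-signal-safe.
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>
static void segv_handler(int sig)
{
    void *buffer[32];
    int nptrs = backtrace(buffer, 32);
    backtrace_symbols_fd(buffer, nptrs, STDERR_FILENO);  /* dump the stack */
    _exit(1);
}
int main(void)
{
    signal(SIGSEGV, segv_handler);
    int *ptr = NULL;
    *ptr = 1;            /* trigger the segfault */
    return 0;
}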
The ptrace system call
ptrace
▶ ptrace allows tracing a process by giving access to the tracee's memory and registers
▶ A tracer can observe and control the execution state of another process
▶ Tracing is set up by attaching to a tracee process with the ptrace() system call (man 2 ptrace)
▶ ptrace() can be called directly, but it is usually used indirectly through tools:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
▶ Debugging tools such as GDB and strace use it to access the state of the tracee process (see the sketch below)
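A minimal sketch (hypothetical example, not from the training material) using ptrace() directly: attach to a process, read one word of its memory, then detach. Error handling is kept short.
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid = atoi(argv[1]);
    unsigned long addr = strtoul(argv[2], NULL, 16);
    ptrace(PTRACE_ATTACH, pid, NULL, NULL);   /* stop and attach to the tracee */
    waitpid(pid, NULL, 0);                    /* wait until it is stopped */
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    printf("word at 0x%lx: 0x%lx\n", addr, word);
    ptrace(PTRACE_DETACH, pid, NULL, NULL);   /* let the tracee continue */
    return 0;
}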
GDB
Here is a brief overview of the most common gdb commands:
- gdb <program>: start debugging a program with gdb
- gdb -p <pid>: attach gdb to a running program through its PID
- (gdb) run [prog_arg1 [prog_arg2] ...]: run the program under GDB with the given arguments
- break foobar (b): put a breakpoint at the entry of function foobar()
- break <file>:42: put a breakpoint on line 42 of the given file
- print var, print $reg or print task->files[0].fd (p): print the variable var, the register $reg or a more complex expression
- info registers: display register information
- continue (c): continue execution after a breakpoint
- next (n): continue to the next line, stepping over function calls
- step (s): continue to the next line, entering called functions
- stepi (si): continue to the next instruction
- finish: execute until the current function returns
- backtrace (bt): display the program call stack
- info threads (i threads): display the list of available threads
- info breakpoints (i b): display the list of breakpoints/watchpoints
- delete (d): delete a breakpoint
- thread (t): select a thread
- frame (f) <n>: select a specific frame of the call stack, n being its number
- watch <variable> or watch \*<address>: add a watchpoint on a specific variable/address
- print variable = value (p variable = value): modify the content of a specific variable
- break <file>:42 if condition == value: break only if the given condition is true
- watch <variable> if condition == value: trigger the watchpoint only if the given condition is true
- display <expr>: automatically print the expression each time the program stops
- x/<n><u> <address>: display the memory at the given address; n is the number of units to display and u is the unit type (b/h/w/g); the i unit displays instructions
- list <expr>: display the source code around the current program counter
- disassemble <location,start_offset,end_offset> (disas): display the assembly code currently being executed
- p function(arguments): execute a function from GDB; beware of the possible side effects of running it
- p $newvar = value: declare a new gdb variable that can be used locally or in a command sequence
- define <command_name>: define a new command sequence; it can then be invoked directly from GDB
remote debugging
▶ In a non-embedded environment, gdb is used directly as the debugging front-end
▶ gdb directly accesses the binaries and libraries compiled with debug symbols
▶ In an embedded context, however, the target platform is usually too limited to run gdb directly
▶ Remote debugging is used instead:
- ARCH-linux-gdb runs on the development workstation and provides the debugging features to the user
- gdbserver runs on the target system (only about 400KB on the arm architecture)
Remote debugging: architecture
Remote debugging: target configuration
▶ On the target, run the program through gdbserver; the program does not start executing immediately:
gdbserver :<port> <executable> <args>
gdbserver /dev/ttyS0 <executable> <args>
▶ Alternatively, gdbserver can attach to an already running program:
gdbserver --attach :<port> <pid>
▶ Or gdbserver can be started without any program to run (the target program is then set from the client side):
gdbserver --multi :<port>
Remote debugging: host configuration
▶ On the host side, start ARCH-linux-gdb <executable> and use the following gdb commands:
- Tell gdb where the shared libraries are: gdb> set sysroot <library-path>
- Connect to the target:
gdb> target remote <ip-addr>:<port> (networking) gdb> target remote /dev/ttyUSB0 (serial link)
- If gdbserver was started with the --multi option, use target extended-remote instead of target remote
- If the program to debug was not specified on the gdbserver command line, run: gdb> set remote exec-file <path_to_program_on_target>
Coredumps
▶ When a program crashes with a segmentation fault, it is no longer under the control of a debugger
▶ Fortunately, Linux can generate a core file: an ELF image containing the memory of the program at the time of the crash. gdb can use the core file to analyze the state of the crashed program
▶ On the target:
- Run ulimit -c unlimited before starting the application so that a core file is generated if the program crashes
- The name of the output coredump file can be changed via /proc/sys/kernel/core_pattern (man 5 core)
- On systems using systemd, the coredump feature may be handled differently for security reasons; it can be enabled temporarily with echo core > /proc/sys/kernel/core_pattern
▶ On the host side:
- After the crash, transfer the core file from the target to the host, then run ARCH-linux-gdb -c core-file application-binary
minicoredumper
▶ For complex programs, coredumps can be large
▶ minicoredumper is a user-space tool based on the standard core dump feature
▶ It relies on the ability to redirect the core dump output to a user-space program through a pipe
▶ It uses a JSON-based configuration:
- Save only the relevant sections (stack, heap, selected ELF sections)
- Compress the output file
- Save additional information from /proc
▶ /diamon/minicoredumper
▶ "Efficient and Practical Crash Data Acquisition for Embedded Systems"
- Video:/watch?v=q2zmwrgLJGs
- Slides:/images/8/81/Eoss2023_ogness_minicoredumper.pdf
GDB: going further
▶ Tutorial: Debugging Embedded Devices using GDB - Chris Simmonds, 2020
- Slides: /images/0/01/
- Video: /watch?v=JGhAgd2a_Ck
GDB Python Extension
▶ GDB provides a Python integration, allowing debugging operations to be scripted
▶ When running Python from GDB, a module named gdb is available, containing all the GDB-related classes
▶ New commands, breakpoints and pointer types can be added
▶ The debugged program can be fully controlled and observed from Python scripts through the GDB capabilities:
- Control execution, add breakpoints, watchpoints, etc.
- Access the process memory, frames, symbols, etc.
GDB Python Extension
class PrintOpenFD(gdb.FinishBreakpoint):
    def __init__(self, file):
        self.file = file
        super(PrintOpenFD, self).__init__()

    def stop(self):
        print("---> File " + self.file + " opened with fd " + str(self.return_value))
        return False

class PrintOpen(gdb.Breakpoint):
    def stop(self):
        PrintOpenFD(gdb.parse_and_eval("file").string())
        return False

class TraceFDs(gdb.Command):
    def __init__(self):
        super(TraceFDs, self).__init__("tracefds", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        print("Hooking open() with custom breakpoint")
        PrintOpen("open")

TraceFDs()
▶ Python scripts are loaded with the gdb source command
- If the script is named <program>-gdb.py, GDB loads it automatically:
(gdb) source trace_fds.py
(gdb) tracefds
Hooking open() with custom breakpoint
Breakpoint 1 at 0x33e0
(gdb) run
Starting program: /usr/bin/touch foo bar
Temporary breakpoint 2 at 0x5555555587da
---> File foo opened with fd 3
Temporary breakpoint 3 at 0x5555555587da
---> File bar opened with fd 0
Common debugging issues
▶ Various problems may be encountered while debugging: bad address-to-symbol conversion, "optimized out" values or functions, empty call stacks, etc.
▶ Here is a checklist to help solve such problems:
- Make sure the binary is built with debug symbols: with gcc, use -g; with gdb, use the non-stripped version of the binary
- If possible, disable optimizations in the final binary or use a less intrusive level (-Og)
- For example, depending on the optimization level, static functions can be folded into their callers and thus disappear from the call stack
- Avoid the optimization that reuses the frame pointer register: with GCC, make sure to use -fno-omit-frame-pointer
- This is not only for debugging: many profiling/tracing tools also rely on the call stack!
▶ Your application may use many libraries: these settings need to be applied to all the components used
Application Tracing
strace
The system call tracer
▶ Available on all GNU/Linux systems; can be built with a cross-compilation toolchain or a build system
▶ Shows what the program is doing: file accesses, memory allocations; useful to find simple problems
▶ Usage:
- strace <command>: start and trace a new process
- strace -f <command>: also trace child processes
- strace -p <pid>: trace an existing process
- strace -c <command>: show statistics per system call
- strace -e <expr> <command>: use advanced filtering expressions
For more information, check out the strace manual
strace example output
> strace cat Makefile
[...]
fstat64(3, {st_mode=S_IFREG|0644, st_size=111585, ...}) = 0
mmap2(NULL, 111585, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f69000
close(3) = 0
access("/etc/", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/cmov/.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320h\1\0004\0\0\0\344"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1442180, ...}) = 0
mmap2(NULL, 1451632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e06000
mprotect(0xb7f62000, 4096, PROT_NONE) = 0
mmap2(0xb7f66000, 9840, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7f66000
close(3) = 0
[...]
openat(AT_FDCWD, "Makefile", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=173, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7290d28000
read(3, "ifneq ($(KERNELRELEASE),)\nobj-m "..., 131072) = 173
write(1, "ifneq ($(KERNELRELEASE),)\nobj-m "..., 173ifneq ($(KERNELRELEASE),)
strace -c example output
> strace -c cheese
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
36.24 0.523807 19 27017 poll
28.63 0.413833 5 75287 115 ioctl
25.83 0.373267 6 63092 57321 recvmsg
3.03 0.043807 8 5527 writev
2.69 0.038865 10 3712 read
2.14 0.030927 3 10807 getpid
0.28 0.003977 1 3341 34 futex
0.21 0.002991 3 1030 269 openat
0.20 0.002889 2 1619 975 stat
0.18 0.002534 4 568 mmap
0.13 0.001851 5 356 mprotect
0.10 0.001512 2 784 close
0.08 0.001171 3 461 315 access
0.07 0.001036 2 538 fstat
...
ltrace
Traces the shared library calls made by a program, and all the signals it receives
▶ A good complement to strace, which only shows system calls
▶ Works even when the source of the libraries is not available
▶ Library calls can be filtered with regular expressions or a list of function names
▶ The -S option also displays system calls
▶ The -c option displays a summary
▶ Manual page
▶ Works better with glibc
▶ For more information, check out /wiki/Ltrace
ltrace example output
# ltrace ffmpeg -f video4linux2 -video_size 544x288 -input_format mjpeg -i /dev
/video0 -pix_fmt rgb565le -f fbdev /dev/fb0
__libc_start_main([ "ffmpeg", "-f", "video4linux2", "-video_size"... ] <unfinished ...>
setvbuf(0xb6a0ec80, nil, 2, 0) = 0
av_log_set_flags(1, 0, 1, 0) = 1
strchr("f", ':') = nil
strlen("f") = 1
strncmp("f", "L", 1) = 26
strncmp("f", "h", 1) = -2
strncmp("f", "?", 1) = 39
strncmp("f", "help", 1) = -2
strncmp("f", "-help", 1) = 57
strncmp("f", "version", 1) = -16
strncmp("f", "buildconf", 1) = 4
strncmp("f", "formats", 1) = 0
strlen("formats") = 7
strncmp("f", "muxers", 1) = -7
strncmp("f", "demuxers", 1) = 2
strncmp("f", "devices", 1) = 2
strncmp("f", "codecs", 1) = 3
...
ltrace summary
Using the -c option:
% time seconds usecs/call calls function
------ ----------- ----------- --------- --------------------
52.64 5.958660 5958660 1 __libc_start_main
20.64 2.336331 2336331 1 avformat_find_stream_info
14.87 1.682895 421 3995 strncmp
7.17 0.811210 811210 1 avformat_open_input
0.75 0.085290 584 146 av_freep
0.49 0.055150 434 127 strlen
0.29 0.033008 660 50 av_log
0.22 0.025090 464 54 strcmp
0.20 0.022836 22836 1 avformat_close_input
0.16 0.017788 635 28 av_dict_free
0.15 0.016819 646 26 av_dict_get
0.15 0.016753 440 38 strchr
0.13 0.014536 581 25 memset
...
------ ----------- ----------- --------- --------------------
100.00 11.318773 4762 total
LD_PRELOAD
Shared libraries
▶ Most shared libraries are ELF files ending in .so
- Loaded at startup by the dynamic loader
- Or loaded at runtime with dlopen()
▶ When a program (an ELF file) is started, the kernel parses it and loads the corresponding interpreter
- In most cases, the PT_INTERP program header of the ELF file is set to the dynamic loader
▶ During loading, the dynamic linker resolves all the symbols that live in dynamic libraries
▶ Dynamic libraries are loaded only once by the OS and then mapped into all the applications that use them
- This reduces the memory used by libraries
Hooking Library Calls
▶ More complex library call hooks can be set up with the LD_PRELOAD environment variable
▶ LD_PRELOAD specifies a shared library that the dynamic loader loads before all the other libraries
▶ Library calls can be intercepted by preloading another library:
- Library symbols are overridden by symbols with the same name
- Only the symbols of interest need to be redefined
- The "real" symbols can still be loaded with dlsym (man 3 dlsym), as shown in the sketch after this list
▶ Debugging/tracing libraries (libSegFault, libefence) use this environment variable
▶ Works for both C and C++
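A minimal sketch (hypothetical example, not from the training material) of an LD_PRELOAD library that wraps malloc(): the "real" libc symbol is looked up with dlsym(RTLD_NEXT, ...) and each allocation is logged. Note that fprintf() may itself allocate memory, so this is only a sketch. Build with: gcc -shared -fPIC -o wrap_malloc.so wrap_malloc.c -ldl
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t) = NULL;
    if (!real_malloc)
        real_malloc = dlsym(RTLD_NEXT, "malloc");  /* look up the libc symbol */
    void *ptr = real_malloc(size);
    fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
    return ptr;
}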
LD_PRELOAD example
▶ Write the library to be preloaded with LD_PRELOAD:
#include <string.h>
#include <unistd.h>
ssize_t read(int fd, void *data, size_t size) {
memset(data, 0x42, size);
return size;
}
▶ Compile the library:
$ gcc -shared -fPIC -o my_lib.so my_lib.c
▶ Preload the new library with LD_PRELOAD:
$ LD_PRELOAD=./my_lib.so ./exe
uprobes and perf
uprobes
▶ uprobe is a mechanism provided by the kernel for tracing user-space code
▶ Tracepoints can be added dynamically on any user-space symbol
- The kernel tracing system places breakpoints in the .text section
▶ Tracing information is exposed through /sys/kernel/debug/tracing/uprobe_events
▶ It is usually used through tools such as perf, bcc, etc. that wrap it
▶ See trace/uprobetracer
The perf tool
▶ perf is a tool that uses performance counters to profile an application (man 1 perf)
▶ It can also manage tracepoints, kprobes and uprobes
▶ perf can profile both user space and kernel space
▶ perf is based on the perf_event interface exposed by the kernel
▶ It provides a set of sub-commands, each with specific parameters:
- stat, record, report, top, annotate, ftrace, list, probe, etc.
Using perf record
▶ perf can record performance data per thread, per process or per CPU
▶ It only requires the kernel configuration option CONFIG_PERF_EVENTS=y
▶ Data is collected during program execution and written to a perf.data file
▶ perf.data can then be analyzed with perf annotate and perf report
- Data collected on an embedded system can be analyzed on another computer
Probing userspace functions
▶ List functions that can be probed in a specific executable file
$ perf probe --source=<source_dir> -x my_app -F
▶ List the lines that can be probed in a specific executable/function
$ perf probe --source=<source_dir> -x my_app -L my_func
▶ Creating uprobes in functions of user space libraries/executables
$ perf probe -x /lib/.6 printf
$ perf probe -x app my_func:3 my_var
$ perf probe -x app my_func%return ret=%r0
▶ Record the execution of these tracepoints:
$ perf record -e probe_app:my_func -e probe_libc:printf
Memory issues
Usual Memory Issues
▶ Programs almost always need to access memory
▶ If memory accesses are not handled properly, many kinds of errors can occur:
- Segmentation faults when accessing invalid memory (NULL pointers or already-freed memory)
- Buffer overflows when accessing addresses outside a buffer
- Memory leaks when memory is allocated and then never freed
Segmentation Faults
▶ The kernel generates a segment error when a program tries to access a memory region that is not allowed to be accessed, or accesses a memory region in an incorrect way:
- If you write to a read-only memory area
- Trying to execute a piece of memory that can't be executed
int *ptr = NULL;
*ptr = 1;
▶ When a segmentation fault occurs, it is reported on the terminal as Segmentation fault
$ ./program
Segmentation fault
Buffer Overflows
▶ A buffer overflow occurs when an array is accessed out of its bounds
▶ Depending on the access, it may or may not crash the program:
- Writing past the end of an array allocated with malloc() usually overwrites the allocator's data structures, leading to a crash
- Writing past the end of an array allocated on the stack corrupts the stack data
- Reading past the end of an array does not always cause a segmentation fault; it depends on the memory area being accessed
uint32_t *array = malloc(10 * sizeof(*array));
array[10] = 0xDEADBEEF;
Memory Leaks
▶ A memory leak is an error that does not crash the program immediately (although it eventually can), but consumes system memory
▶ It happens when memory is allocated by the program but never freed
▶ The program may run for a long time in production before the leak is noticed
- Such issues are best identified early, during the development phase
void func1(void) {
uint32_t *array = malloc(10 * sizeof(*array));
do_something_with_array(array);
}
Valgrind memcheck
Valgrind
▶ Valgrind is a framework for building dynamic analysis tools
▶ Valgrind also designates the tools built on this framework, providing memory error detection, heap profiling and other profiling features
▶ Supports all popular platforms: Linux on x86, x86_64, arm (armv7 only), arm64, mips32, s390, ppc32 and ppc64
▶ It works by instrumenting the code and running it on a virtual CPU; this slows down execution significantly, which is acceptable for debugging and analysis
▶ Memcheck is the default valgrind tool; it detects memory management errors:
- Access to invalid memory areas, use of uninitialized values, memory leaks, bad freeing of heap blocks, etc.
- It can run any application without recompiling it
$ valgrind --tool=memcheck --leak-check=full <program>
Valgrind Memcheck usage and report
$ valgrind ./mem_leak
==202104== Memcheck, a memory error detector
==202104== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==202104== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==202104== Command: ./mem_leak
==202104==
==202104== Conditional jump or move depends on uninitialised value(s)
==202104== at 0x109161: do_actual_jump (in /home/user/mem_leak)
==202104== by 0x109187: compute_address (in /home/user/mem_leak)
==202104== by 0x1091A2: do_jump (in /home/user/mem_leak)
==202104== by 0x1091D7: main (in /home/user/mem_leak)
==202104==
==202104== HEAP SUMMARY:
==202104== in use at exit: 120 bytes in 1 blocks
==202104== total heap usage: 1 allocs, 0 frees, 120 bytes allocated
==202104==
==202104== LEAK SUMMARY:
==202104== definitely lost: 120 bytes in 1 blocks
==202104== indirectly lost: 0 bytes in 0 blocks
==202104== possibly lost: 0 bytes in 0 blocks
==202104== still reachable: 0 bytes in 0 blocks
==202104== suppressed: 0 bytes in 0 blocks
==202104== Rerun with --leak-check=full to see details of leaked memory
Valgrind and VGDB
▶ Valgrind can also act as a GDB server that accepts commands. Users can interact with the valgrind gdb server through a gdb client or through vgdb. vgdb can be used in several ways:
- As a standalone CLI program that sends "monitor" commands to valgrind
- As a relay between a gdb client and an existing valgrind session
- As a server for multiple valgrind sessions accessed from remote gdb clients
▶ See man 1 vgdb for more details
Using GDB with Memcheck
▶ valgrind allows attaching GDB to the process being analyzed:
$ valgrind --tool=memcheck --leak-check=full --vgdb=yes --vgdb-error=0 ./mem_leak
▶ Then attach gdb to the valgrind gdbserver through vgdb:
$ gdb ./mem_leak
(gdb) target remote | vgdb
▶ If valgrind detects an error, it stops execution and enters the GDB
(gdb) continue
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000109161 in do_actual_jump (p=0x4a52040) at mem_leak.c:5
5 if (p[1])
(gdb) bt
#0 0x0000000000109161 in do_actual_jump (p=0x4a52040) at mem_leak.c:5
#1 0x0000000000109188 in compute_address (p=0x4a52040) at mem_leak.c:11
#2 0x00000000001091a3 in do_jump (p=0x4a52040) at mem_leak.c:16
#3 0x00000000001091d8 in main () at mem_leak.c:27
Electric Fence
libefence
▶ libefence is more lightweight than valgrind, but also less precise
▶ It catches two common kinds of memory errors:
- Buffer overflows and use of freed memory
▶ libefence triggers a segmentation fault at the first error encountered, generating a coredump
▶ It can be linked statically, or preloaded as a shared library with LD_PRELOAD
$ gcc -g -o program
$ LD_PRELOAD=.0.0 ./program
Electric Fence 2.2 Copyright (C) 1987-1999 Bruce Perens <bruce@>
Segmentation fault (core dumped)
▶ Depending on the coredump configuration, a coredump file is generated in the current directory
▶ You can use GDB to open this coredump and locate the location where the error occurred
$ gdb ./program core-program-3485
Reading symbols from ./libefence...
[New LWP 57462]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./libefence'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 main () at :8
8 data[99] = 1;
(gdb)
Application Profiling
Profiling
▶ Profiling is the act of collecting data from a program run in order to analyze the program, optimize it or fix problems
▶ Profiling can be done by explicitly instrumenting the code or by using kernel/user-space mechanisms:
- Profile function calls and call counts to optimize performance
- Profile processor usage to optimize performance and reduce power consumption
- Profile memory usage to optimize the memory used
▶ After profiling, the collected data must be analyzed to identify potential improvements
Performance issues
▶ Profiling is often used to identify and fix performance problems
▶ Performance can be affected by memory usage, IO load, CPU usage, etc.
▶ Profiling data should be captured before starting to fix performance issues
▶ A first coarse-grained profiling pass is usually done with standard tools
▶ Once the type of problem is identified, fine-grained profiling can be performed
Profiling metrics
▶ Profile metrics can be collected with a variety of tools.
▶ Use Massif, heaptrack or memusage to profile memory usage
▶ Use perf or callgrind to profile function calls
▶ Use perf to profile CPU hardware usage (caches, MMU, etc.)
▶ Profiling data can cover both the user-space application and the kernel
Visualizing data with flamegraphs
▶ Stack-based visualization
▶ You can quickly find performance bottlenecks and navigate call stacks
▶ Brendan Gregg's tools (which made flame graphs popular) can generate flame graphs from perf output
- Flame graph generation scripts: /brendangregg/FlameGraph
Going further with Flamegraphs
▶ For more, see the following technical presentation by Brendan Gregg on using various metrics with flame graphs:
- Video
- Slides
Memory profiling
▶ Profiling an application's memory usage (heap/stack) helps optimize performance
▶ Allocating too much memory can lead the system to run out of memory
▶ Frequent allocations/frees cause the kernel to spend a lot of time in clear_page()
- The kernel must clear memory pages before handing them to a process, to avoid data leaks
▶ Reducing an application's memory footprint can improve cache usage and reduce page misses
Massif usage
▶ Massif is a valgrind tool that profiles heap usage during program execution (user space only)
▶ It works by taking snapshots of the heap allocations:
$ valgrind --tool=massif --time-unit=B program
▶ Once the program has finished, it generates a massif.out.<pid> file
▶ The ms_print tool then displays the heap allocation graph:
$ ms_print massif.out.275099
▶ #: peak memory allocation
▶ @: detailed snapshot (the frequency can be tuned with --detailed-freq)
Massif report
massif-visualizer - Visualizing massif profiling data
heaptrack usage
▶ heaptrack is a heap memory profiling tool
- It relies on an LD_PRELOAD library
▶ It offers better tracking and visualization than Massif
- Every allocation is associated with a stack trace
- Memory leaks, allocation hotspots and temporary allocations can be detected
▶ Results can be viewed with a GUI (heaptrack_gui) or a CLI tool (heaptrack_print)
▶ /KDE/heaptrack
$ heaptrack program
▶ It generates a heaptrack.<process_name>.<pid>.zst file that can be analyzed on another computer with heaptrack_gui
heaptrack_gui - Visualizing heaptrack profiling data
heaptrack_gui - Flamegraph view
memusage
▶ memusage is a program that profiles the memory usage of another program (man 1 memusage) (user space only)
▶ It can profile heap, stack and mmap memory usage
▶ It can display the profiling information in the terminal, or output it to a data file or a PNG image
▶ Compared to valgrind's Massif, it is more lightweight (because it relies on the LD_PRELOAD mechanism)
memusage usage
$ memusage convert
Memory usage summary: heap total: 2635857, heap peak: 2250856, stack peak: 83696
total calls total memory failed calls
malloc| 1496 2623648 0
realloc| 6 3744 0 (nomove:0, dec:0, free:0)
calloc| 16 8465 0
free| 1480 2521334
Histogram for block sizes:
0-15 329 21% ==================================================
16-31 239 15% ====================================
32-47 287 18% ===========================================
48-63 321 21% ================================================
64-79 43 2% ======
80-95 141 9% =====================
...
21424-21439 1 <1%
32768-32783 1 <1%
32816-32831 1 <1%
large 3 <1%
Execution profiling
▶ To optimize a program, it is necessary to understand which hardware resources it uses
▶ Many hardware elements can affect how the program runs:
- If the application does not take memory spatial locality into account, CPU cache performance degrades and cache misses increase
- Unaligned accesses cause alignment faults
Using perf stat
▶ perf stat profiles an application by capturing performance counters
- Using performance counters may require root permissions; this can be changed with # echo -1 > /proc/sys/kernel/perf_event_paranoid
▶ The number of hardware performance counters is usually limited
▶ Collecting too many events leads to multiplexing, and perf scales the results
▶ Counters are collected and then extrapolated:
- For more accurate values, reduce the number of events and run perf several times, changing the set of observed events
- See the perf wiki for more details
perf stat example
$ perf stat convert
Performance counter stats for 'convert ':
45,52 msec task-clock # 1,333 CPUs utilized
4 context-switches # 87,874 /sec
0 cpu-migrations # 0,000 /sec
1 672 page-faults # 36,731 K/sec
146 154 800 cycles # 3,211 GHz (81,16%)
6 984 741 stalled-cycles-frontend # 4,78% frontend cycles idle (91,21%)
81 002 469 stalled-cycles-backend # 55,42% backend cycles idle (91,36%)
222 687 505 instructions # 1,52 insn per cycle
# 0,36 stalled cycles per insn (91,21%)
37 776 174 branches # 829,884 M/sec (74,51%)
567 408 branch-misses # 1,50% of all branches (70,62%)
0,034156819 seconds time elapsed
0,041509000 seconds user
0,004612000 seconds sys
▶ Note: the percentage at the end of each line is the fraction of time the event was actually measured by the kernel when multiplexing occurred
▶ List all events:
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
...
▶ Count the L1-dcache-load-misses and branch-load-misses events for a specific command:
$ perf stat -e L1-dcache-load-misses,branch-load-misses cat /etc/fstab
...
Performance counter stats for 'cat /etc/fstab':
23 418 L1-dcache-load-misses
7 192 branch-load-misses
...
Cachegrind
▶ Cachegrind is a valgrind tool for profiling how an application uses the instruction and data cache hierarchy
- Cachegrind can also profile branch prediction success
▶ It simulates a machine with independent first-level I$ and D$ caches, backed by a unified L2 cache
▶ Very useful for detecting cache usage problems (too many misses, etc.)
$ valgrind --tool=cachegrind --cache-sim=yes ./my_program
▶ It generates a cachegrind.out.<pid> file containing the measurement results
▶ cg_annotate is a CLI tool for displaying Cachegrind simulation results
▶ Its --diff option compares two measurement result files
▶ Cachegrind's cache simulation has some accuracy limitations, see Cachegrind accuracy
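A typical way to inspect the results with cg_annotate (the pids are illustrative):
$ cg_annotate cachegrind.out.1234
$ cg_annotate --diff cachegrind.out.1234 cachegrind.out.5678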
Kcachegrind - Visualizing Cachegrind profiling data
Callgrind
▶ Callgrind is a valgrind tool that profiles the application call graph (userspace only)
▶ It records the number of instructions executed and associates them with the corresponding source lines
▶ It also records the caller/callee relationship between functions and the number of calls:
$ valgrind --tool=callgrind ./my_program
▶ callgrind_annotate is a CLI tool for displaying callgrind simulation results
▶ Kcachegrind can also display callgrind results
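callgrind_annotate can then be run on the generated output file (the pid is illustrative):
$ callgrind_annotate callgrind.out.1234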
Kcachegrind - Visualizing Callgrind profiling data
System-wide Profiling & Tracing
▶ The root cause of a problem is not always in the application itself; it may involve several layers (driver, application, kernel)
▶ In this case, the entire stack needs to be analyzed
▶ The kernel provides a large number of tracepoints that can be logged by specific tools.
▶ New tracepoints can be created statically or dynamically by various mechanisms (e.g., kprobes)
Kprobes
▶ Kprobes can dynamically insert breakpoints at virtually any kernel address and extract debugging and performance information
▶ The targeted kernel text is patched so that it calls a specific handler
- kprobes run a handler when the hooked instruction is executed
- kretprobes run a handler when the hooked function returns, giving access to its return value (and to the arguments of the call)
▶ Requires the kernel option CONFIG_KPROBES=y
▶ Since probes are usually inserted from a module, CONFIG_MODULES=y and CONFIG_MODULE_UNLOAD=y are needed to register and unregister them
▶ Hooking a probe by symbol_name requires CONFIG_KALLSYMS_ALL=y
▶ See trace/kprobes for more details
Registering a Kprobe
▶ A kprobe is registered dynamically, typically from a module, by passing a struct kprobe to register_kprobe()
▶ The probe must be unregistered with unregister_kprobe() when the module is removed:
struct kprobe probe = {
.symbol_name = "do_exit",
.pre_handler = probe_pre,
.post_handler = probe_post,
};
register_kprobe(&probe);
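Putting the snippet above into context, a minimal module sketch could look like the following (handler bodies and names are illustrative, not from the slides):
#include <linux/module.h>
#include <linux/kprobes.h>

/* Called right before the probed instruction is executed */
static int probe_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("do_exit hit\n");
	return 0;
}

/* Called after the probed instruction has been single-stepped */
static void probe_post(struct kprobe *p, struct pt_regs *regs,
		       unsigned long flags)
{
}

static struct kprobe probe = {
	.symbol_name = "do_exit",
	.pre_handler = probe_pre,
	.post_handler = probe_post,
};

static int __init kprobe_example_init(void)
{
	return register_kprobe(&probe);
}

static void __exit kprobe_example_exit(void)
{
	unregister_kprobe(&probe);
}

module_init(kprobe_example_init);
module_exit(kprobe_example_exit);
MODULE_LICENSE("GPL");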
Registering a kretprobe
▶ A kretprobe is registered like a normal probe, except that a struct kretprobe is passed to register_kretprobe()
- The provided handlers are called on function entry and exit
- On module exit, the probe must be unregistered with unregister_kretprobe()
int (*kretprobe_handler_t) (struct kretprobe_instance *, struct pt_regs *);
struct kretprobe probe = {
.kp.symbol_name = "do_fork",
.entry_handler = probe_entry,
.handler = probe_exit,
};
register_kretprobe(&probe);
perf
▶ perf can do much more than counting events: it can also record them
▶ The kernel already provides many events and tracepoints, which can be listed with perf list
▶ Syscall tracepoints require CONFIG_FTRACE_SYSCALLS
▶ New dynamic tracepoints can be created on any symbol and on registers, even without debug info
▶ Tracing variables and function parameters by name requires CONFIG_DEBUG_INFO
▶ If perf cannot find vmlinux, pass its location with -k <vmlinux>
perf example
▶ List all events matching syscalls:*:
$ perf list syscalls:*
List of pre-defined events (to be used in -e):
syscalls:sys_enter_accept [Tracepoint event]
syscalls:sys_enter_accept4 [Tracepoint event]
syscalls:sys_enter_access [Tracepoint event]
syscalls:sys_enter_adjtimex_time32 [Tracepoint event]
syscalls:sys_enter_bind [Tracepoint event]
...
▶ Record the syscalls:sys_enter_read events generated while running sha256sum:
$ perf record -e syscalls:sys_enter_read sha256sum /bin/busybox
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB (215 samples) ]
perf report example
▶ Display the collected samples according to the time spent:
$ perf report
Samples: 591 of event 'cycles', Event count (approx.): 393877062
Overhead Command Shared Object Symbol
22,88% firefox-esr [nvidia] [k] _nv031568rm
3,21% firefox-esr .2 [.] __minimal_realloc
2,00% firefox-esr .6 [.] __stpncpy_ssse3
1,86% firefox-esr libglib-2..0.7400.0 [.] g_hash_table_lookup
1,62% firefox-esr .2 [.] _dl_strtoul
1,56% firefox-esr [] [k] clear_page_rep
1,52% firefox-esr .6 [.] __strncpy_sse2_unaligned
1,37% firefox-esr .2 [.] strncmp
1,30% firefox-esr firefox-esr [.] malloc
1,27% firefox-esr .6 [.] __GI___strcasecmp_l_ssse3
1,23% firefox-esr [nvidia] [k] _nv013165rm
1,09% firefox-esr [nvidia] [k] _nv007298rm
1,03% firefox-esr [] [k] unmap_page_range
0,91% firefox-esr .2 [.] __minimal_free
perf probe
▶ perf probe can create dynamic tracepoints on both kernel and userspace functions
▶ Inserting probes requires a kernel with CONFIG_KPROBES enabled
- Note: perf must be built with libelf support to create probes
▶ Once created, a dynamic probe can be used with perf record
▶ vmlinux is often not available on embedded platforms; in that case only symbols and registers can be used
perf probe examples
▶ Lists all kernel symbols that can be probed:
$ perf probe --funcs
▶ Create a new probe on do_sys_openat2 that records the filename argument:
$ perf probe --vmlinux=vmlinux_file do_sys_openat2 filename:string
Added new event:
probe:do_sys_openat2 (on do_sys_openat2 with filename:string)
▶ Run tail and capture the previously created probe events:
$ perf record -e probe:do_sys_openat2 tail /var/log/messages
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB (19 samples) ]
▶ Display the recorded tracepoints with perf script:
$ perf script
tail 164 [000] 3552.956573: probe:do_sys_openat2: (c02c3750) filename_string="/etc/"
tail 164 [000] 3552.956642: probe:do_sys_openat2: (c02c3750) filename_string="/lib/tls/v7l/neon/vfp/.2"
...
▶ Create a new probe on the return of ksys_read, capturing the r0 register (ARM) as the return value, named ret:
$ perf probe ksys_read%return ret=%r0
▶ Run sha256sum and capture the previously created probe events:
$ perf record -e probe:ksys_read__return sha256sum /etc/fstab
▶ Shows all probes created:
$ perf probe -l
probe:ksys_read__return (on ksys_read%return with ret)
▶ Remove an existing tracepoint:
$ perf probe -d probe:ksys_read__return
perf record example
▶ Record events on all CPUs (system-wide mode):
$ perf record -a
^C
▶ Display the recorded events using perf script:
$ perf script
...
klogd 85 [000] 208.609712: 116584 cycles: b6dd551c memset+0x2c (/lib/.6)
klogd 85 [000] 208.609898: 121267 cycles: c0a44c84 _raw_spin_unlock_irq+0x34 (vmlinux)
klogd 85 [000] 208.610094: 127434 cycles: c02f3ef4 kmem_cache_alloc+0xd0 (vmlinux)
perf 130 [000] 208.610311: 132915 cycles: c0a44c84 _raw_spin_unlock_irq+0x34 (vmlinux)
perf 130 [000] 208.619831: 143834 cycles: c0a44cf4 _raw_spin_unlock_irqrestore+0x3c (vmlinux)
klogd 85 [000] 208.620048: 143834 cycles: c01a07f8 syslog_print+0x170 (vmlinux)
klogd 85 [000] 208.620241: 126328 cycles: c0100184 vector_swi+0x44 (vmlinux)
klogd 85 [000] 208.620434: 128451 cycles: c096f228 unix_dgram_sendmsg+0x46c (vmlinux)
kworker/0:2-mm_ 44 [000] 208.620653: 133104 cycles: c0a44c84 _raw_spin_unlock_irq+0x34 (vmlinux)
perf 130 [000] 208.620859: 138065 cycles: c0198460 lock_acquire+0x184 (vmlinux)
...
Using perf trace
▶ perf trace captures and displays all the tracepoints/events triggered while running a command:
$ perf trace -e "net:*" ping -c 1 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
0.000 ping/37820 net:net_dev_queue(skbaddr: 0xffff97bbc6a17900, len: 98, name: "enp34s0")
0.005 ping/37820 net:net_dev_start_xmit(name: "enp34s0",
skbaddr: 0xffff97bbc6a17900, protocol: 2048, len: 98,
network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
0.009 ping/37820 net:net_dev_xmit(skbaddr: 0xffff97bbc6a17900, len: 98,name: "enp34s0")
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.867 ms
Using perf top
▶ perf top analyzes the running kernel in real time
▶ It samples function calls and sorts them by time spent
▶ It can profile the whole system:
$ perf top
Samples: 19K of event 'cycles', 4000 Hz, Event count (approx.): 4571734204 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
2,01% [nvidia] [k] _nv023368rm
0,94% [kernel] [k] __static_call_text_end
0,89% [vdso] [.] 0x0000000000000655
0,81% [nvidia] [k] _nv027733rm
0,79% [kernel] [k] clear_page_rep
0,76% [kernel] [k] psi_group_change
0,70% [kernel] [k] check_preemption_disabled
0,69% code [.] 0x000000000623108f
0,60% code [.] 0x0000000006231083
0,59% [kernel] [k] preempt_count_add
0,54% [kernel] [k] module_get_kallsym
0,53% [kernel] [k] copy_user_generic_string
ftrace and trace-cmd
ftrace
▶ ftrace is the kernel's internal tracing framework; the name stands for "Function Tracer"
▶ Provides extensive tracing capabilities for observing system behavior
- It is possible to trace tracepoints (schedulers, interrupts, etc.) that already exist in the kernel
- Relies on GCC's mcount() instrumentation and on kernel code patching to call the ftrace tracing handlers
▶ All trace data is stored in a ring buffer
▶ The tracefs filesystem is used to control and read the tracing events
- # mount -t tracefs nodev /sys/kernel/tracing
▶ Using ftrace requires the kernel option CONFIG_FTRACE=y
▶ CONFIG_DYNAMIC_FTRACE makes the tracing instrumentation have nearly zero overhead when tracing is not in use
ftrace files
▶ ftrace is controlled through specific files in /sys/kernel/tracing:
- current_tracer: the tracer currently in use
- available_tracers: the tracers compiled into the kernel
- tracing_on: enable/disable tracing
- trace: displays the trace in a human-readable format (the format may differ between tracers)
- trace_pipe: like trace, but each read consumes the data it returns
- trace_marker{_raw}: lets user space insert synchronization marks into the kernel trace buffer
- set_ftrace_filter: trace only the listed functions
- set_graph_function: graph only the children of the listed functions
▶ Other files also control tracing, see trace/ftrace
▶ The trace-cmd CLI and the Kernelshark GUI can be used to record and display tracing data
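A possible shell session using these files directly (assuming tracefs is mounted at /sys/kernel/tracing; the filter pattern is only an example):
# cd /sys/kernel/tracing
# echo 'vfs_*' > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# sleep 1
# echo 0 > tracing_on
# cat trace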
ftrace tracers
▶ ftrace provides a variety of "tracers".
▶ The tracer to use is selected by writing its name to the current_tracer file
- nop: does nothing, disables all tracing
- function: traces all kernel function calls
- function_graph: like function, but records both function entry and exit
- hwlat: tracks hardware-induced latencies
- irqsoff: records the sections where interrupts are disabled and the resulting latencies
- branch: traces likely()/unlikely() branch prediction hits and misses
- mmiotrace: traces all hardware accesses (read[bwlq]/write[bwlq])
▶ Warning: some tracer overheads may be high
# echo "function" > /sys/kernel/tracing/current_tracer
function_graph tracer report example
▶ function_graph keeps track of all the executed functions and their call tree
▶ You can display process, CPU, timestamp, and function call graphs
$ trace-cmd report
...
dd-113 [000] 304.526590: funcgraph_entry: | sys_write() {
dd-113 [000] 304.526597: funcgraph_entry: | ksys_write() {
dd-113 [000] 304.526603: funcgraph_entry: | __fdget_pos() {
dd-113 [000] 304.526609: funcgraph_entry: 6.541 us | __fget_light();
dd-113 [000] 304.526621: funcgraph_exit: + 18.500 us | }
dd-113 [000] 304.526627: funcgraph_entry: | vfs_write() {
dd-113 [000] 304.526634: funcgraph_entry: 6.334 us | rw_verify_area();
dd-113 [000] 304.526646: funcgraph_entry: 6.208 us | write_null();
dd-113 [000] 304.526658: funcgraph_entry: 6.292 us | __fsnotify_parent();
dd-113 [000] 304.526669: funcgraph_exit: + 43.042 us | }
dd-113 [000] 304.526675: funcgraph_exit: + 78.833 us | }
dd-113 [000] 304.526680: funcgraph_exit: + 91.291 us | }
dd-113 [000] 304.526689: funcgraph_entry: | sys_read() {
dd-113 [000] 304.526695: funcgraph_entry: | ksys_read() {
dd-113 [000] 304.526702: funcgraph_entry: | __fdget_pos() {
dd-113 [000] 304.526708: funcgraph_entry: 6.167 us | __fget_light();
dd-113 [000] 304.526719: funcgraph_exit: + 18.083 us | }
irqsoff tracer
▶ ftrace irqsoff The tracer can track interrupt delays caused by disabling interrupts for too long.
▶ Can help locate problems with high system interruption latency
▶ Requires CONFIG_IRQSOFF_TRACER=y
- The preemptoff and preemptirqsoff tracers similarly track sections where preemption (and interrupts) are disabled
irqsoff tracer report example
# latency: 276 us, #104/104, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
# -----------------
# | task: stress-ng-114 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: __irq_usr
# => ended at: irq_exit
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
stress-n-114 0d... 2us : __irq_usr
stress-n-114 0d... 7us : gic_handle_irq <-__irq_usr
stress-n-114 0d... 10us : __handle_domain_irq <-gic_handle_irq
...
stress-n-114 0d... 270us : __local_bh_disable_ip <-__do_softirq
stress-n-114 . 275us : __do_softirq <-irq_exit
stress-n-114 . 279us+: tracer_hardirqs_on <-irq_exit
stress-n-114 . 290us : <stack trace>
Hardware latency detector
▶ ftrace hwlat The tracer can help find out if the hardware is causing delays
- For example, non-maskable system management interrupts can directly trigger certain firmware support features, causing the CPU to suspend execution
- Interruptions from some security monitoring may also cause delays
▶ If some kind of delay is found using this tracer, it means that the system may not be suitable for real-time use
▶ The principle is to execute instructions cyclically on a single core with interrupts disabled and calculate the time difference between two consecutive reads
▶ Requires CONFIG_HWLAT_TRACER=y
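A possible way to run the hardware latency detector through tracefs (assuming it is mounted at /sys/kernel/tracing):
# cd /sys/kernel/tracing
# echo hwlat > current_tracer
# echo 1 > tracing_on
# sleep 60
# echo 0 > tracing_on
# cat trace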
trace_printk()
▶ trace_printk() outputs strings into the trace buffer
▶ It can be used to trace specific conditions in your code and display them in the trace buffer:
#include <linux/kernel.h>
void read_hw()
{
if (condition)
trace_printk("Condition is true!\n");
}
▶ With the function_graph tracer, the trace buffer then shows:
1) | read_hw() {
1) | /* Condition is true! */
1) 2.657 us | }
trace-cmd
▶ trace-cmd is a tool written by Steven Rostedt for interacting with ftrace (man 1 trace-cmd)
▶ trace-cmd supports the tracers exposed by ftrace
▶ trace-cmd provides multiple commands:
- list: list the plugins/events that can be recorded
- record: record a trace into a trace.dat file
- report: display the results from trace.dat
▶ At the end of a capture, a trace.dat file is generated
Remote tracing with trace-cmd
▶ trace-cmd output can be quite large, which makes it hard to store on embedded platforms with limited storage
▶ The listen command can be used to send the results over the network instead:
- On the remote system that will collect the traces, run
trace-cmd listen -p 6578
- On the system being traced, point trace-cmd at the collecting system with
trace-cmd record -N <target_ip>:6578
trace-cmd examples
▶ List the available tracers:
$ trace-cmd list -t
blk mmiotrace function_graph function nop
▶ List the available events:
$ trace-cmd list -e
...
migrate:mm_migrate_pages_start
migrate:mm_migrate_pages
tlb:tlb_flush
syscalls:sys_exit_process_vm_writev
...
▶ List the functions that can be filtered by the function and function_graph tracers:
$ trace-cmd list -f
...
wait_for_initramfs
__ftrace_invalid_address___64
calibration_delay_done
calibrate_delay
...
▶ Enable the function tracer and record globally on the system:
$ trace-cmd record -p function
▶ Trace the dd command with the function_graph tracer:
$ trace-cmd record -p function_graph dd if=/dev/mmcblk0 of=out bs=512 count=10
▶ Display the recorded data:
$ trace-cmd report
▶ Reset all ftrace buffers and remove tracers:
$ trace-cmd reset
▶ Run the irqsoff tracer system-wide:
$ trace-cmd record -p irqsoff
▶ Record only the irq_handler_entry/irq_handler_exit events, system-wide:
$ trace-cmd record -e irq:irq_handler_exit -e irq:irq_handler_entry
Adding ftrace tracepoints
▶ Custom tracepoints can be added to instrument your own code
▶ The tracepoint must first be declared in a .h header file:
#undef TRACE_SYSTEM
#define TRACE_SYSTEM subsys
#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SUBSYS_H
#include <linux/tracepoint.h>
DECLARE_TRACE(subsys_eventname,
TP_PROTO(int firstarg, struct task_struct *p),
TP_ARGS(firstarg, p));
#endif /* _TRACE_SUBSYS_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
▶ The tracepoint is then defined and used through this header file:
#include <trace/events/subsys.h>
#define CREATE_TRACE_POINTS
DEFINE_TRACE(subsys_eventname);
void any_func(void)
{
...
trace_subsys_eventname(arg, task);
...
}
▶ For more information, seetrace/tracepoints
Kernelshark
▶ Kernelshark is a Qt-based graphical interface for processing trace-cmd reports
▶ It can also drive trace-cmd to configure and capture the data
▶ Use different colors to show recorded CPU and tasks events
▶ Can be used for further analysis of specific bugs
LTTng
▶ LTTng is an open source tracing framework for Linux, maintained by EfficiOS
▶ LTTng provides insight into the interaction between the kernel and the application (C, C++, Java, Python).
- Non-instrumented applications can also emit events through /dev/lttng-logger
▶ Tracepoints are associated with a payload (the traced data)
▶ LTTng focuses on low overhead tracing
▶ Use the Common Trace Format (so that you can read trace data using software such as babeltrace or trace-compass).
Tracepoints with LTTng
▶ LTTng has a session daemon that receives events generated from the kernel and user-space LTTng tracing components
▶ LTTng can be used to track the following:
- LTTng kernel tracepoints
- kprobes and kretprobes
- Linux kernel system calls
- Linux userspace probes
- LTTng tracepoints in user space
Creating userspace tracepoints with LTTng
▶ New user-space tracepoints can be defined using LTTng.
▶ Multiple attributes can be configured for a tracepoint
- A provider namespace
- A name to identify tracepoint
- Various types of parameters (int, char*, etc.)
- Fields describing how the tracepoint parameters are displayed (decimal, hexadecimal, etc.), see LTTng-ust
▶ To use a UST tracepoint, the developer normally has to: write a tracepoint provider header (.h), write the tracepoint provider package (.c), build the package, call the tracepoint from the traced application, and finally build the application, linking it against liblttng-ust and the provider package
▶ LTTng provides lttng-gen-tp, which reduces these steps to writing a single template (.tp) file
Defining a LTTng tracepoint
▶ Tracepoint template (hello_world.tp):
LTTNG_UST_TRACEPOINT_EVENT(
// Tracepoint provider name
hello_world,
// Tracepoint/event name
my_first_tracepoint,
// Tracepoint arguments (input)
LTTNG_UST_TP_ARGS(
char *, text
),
// Tracepoint/event fields (output)
LTTNG_UST_TP_FIELDS(
lttng_ust_field_string(message, text)
)
)
▶ lttng-gen-tp will use this template file to generate/build the required files (.h, .c and .o files)
Defining a LTTng tracepoint
▶ Constructing a tracepoint provider:
$ lttng-gen-tp hello_world.tp
▶ Using the tracepoint (hello_world.c):
#include <stdio.h>
#include "hello_world.h"
int main(int argc, char *argv[])
{
lttng_ust_tracepoint(hello_world, my_first_tracepoint, "hi there!");
return 0;
}
▶ Compilation:
$ gcc hello_world.c hello_world.o -llttng-ust -o hello_world
Using LTTng
$ lttng create my-tracing-session --output=./my_traces
$ lttng list --kernel
$ lttng list --userspace
$ lttng enable-event --userspace hello_world:my_first_tracepoint
$ lttng enable-event --kernel --syscall open,close,write
$ lttng start
$ /* Run your application or do something */
$ lttng destroy
$ babeltrace2 ./my_traces
▶ Trace Compass can also be used to display the results
Remote tracing with LTTng
▶ LTTng can record tracking data over the network
▶ Useful for embedded systems with limited storage
▶ On the remote computer, run lttng-relayd:
$ lttng-relayd --output=${PWD}/traces
▶ On the target, create the session with --set-url:
$ lttng create my-session --set-url=net://remote-system
▶ The traces are then recorded directly on the remote computer
eBPF
The ancestor: Berkeley Packet filter
▶ BPF stands for Berkeley Packet Filter, which was used in the beginning for network message filtering
▶ In Linux, BPF is used for socket filtering (see networking/filter)
▶ tcpdump and Wireshark rely heavily on BPF (through libpcap) for packet capture
BPF in libpcap: setup
▶ tcpdump passes the user's capture filter string to libpcap
▶ libpcap compiles the capture filter into a binary program
- The program uses an abstract machine instruction set (the BPF instruction set)
▶ libpcap sends the binary program to the kernel via the setsockopt() system call
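The compiled filter can be inspected with tcpdump's -d option, which dumps the generated packet-matching code as readable BPF instructions, for example:
$ tcpdump -d 'ip and udp'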
BPF in libpcap: capture
▶ The kernel implements a BPF "virtual machine"
▶ The BPF VM runs the BPF program against each packet
▶ The program inspects the packet data and returns a non-zero value if the packet must be captured
▶ If the return value is non-zero, the packet is captured in addition to the regular packet processing
eBPF
▶ eBPF is a framework allowing user programs to run safely and efficiently inside the kernel. It was introduced in kernel 3.18 and is still evolving and frequently updated
▶ The eBPF program can capture and expose kernel data to user space as well as change kernel behavior based on a number of user-defined rules
▶ eBPF is event-driven: specific kernel events can trigger and execute eBPF programs
▶ One of the main benefits of eBPF is the ability to reprogram kernel behavior without having to develop against the kernel:
- No kernel crashes due to bugs
- Faster feature development cycles can be realized
▶ Noteworthy features of eBPF:
- A new instruction set, interpreter and verifier
- A wider range of "attach" locations: programs can be hooked almost anywhere in the kernel
- Dedicated data structures called "maps", to exchange data between several eBPF programs or between programs and user space
- A dedicated bpf() system call to manipulate eBPF programs and data
- Many kernel helper functions available to eBPF programs
eBPF program lifecycle
Kernel configuration for eBPF
▶ CONFIG_NET enables the eBPF subsystem
▶ CONFIG_BPF_SYSCALL enables the bpf() system call
▶ CONFIG_BPF_JIT enables JIT compilation of programs, improving performance
▶ CONFIG_BPF_JIT_ALWAYS_ON makes the JIT mandatory
▶ CONFIG_BPF_UNPRIV_DEFAULT_OFF=n allows unprivileged users to use eBPF, which can be convenient during development
▶ Additional options may be needed to unlock specific hook locations:
- CONFIG_KPROBES to attach programs to kprobes
- CONFIG_TRACING to attach programs to kernel tracepoints
- CONFIG_NET_CLS_BPF to write packet classifiers
- CONFIG_CGROUP_BPF to attach programs to cgroup hooks
eBPF ISA
▶ eBPF is a "virtual" ISA that defines all its instruction sets: load and store instructions, arithmetic instructions, jump instructions, and so on.
▶ It also defines ten general-purpose 64-bit registers, a read-only frame pointer and a calling convention:
- R0: return value of functions and BPF programs
- R1-R5: function arguments
- R6-R9: callee-saved registers
- R10: stack/frame pointer
; bpf_printk("Hello %s\n", "World");
0: r1 = 0x0 ll
2: r2 = 0xa
3: r3 = 0x0 ll
5: call 0x6
; return 0;
6: r0 = 0x0
7: exit
The eBPF verifier
▶ When loading a program into the kernel, the eBPF verifier checks the validity of the program
▶ The verifier is a complex piece of software that checks eBPF programs against a set of rules to ensure that running them cannot compromise the whole kernel. For example:
- The program must always terminate; code paths that could run forever (e.g. unbounded loops) are rejected
- The program must ensure that the referenced pointer is valid
- Programs cannot access memory addresses at will, they must be accessed via context or valid helpers
▶ Programs that violate these rules are rejected
▶ Even with the verifier, care is still needed when writing a program: eBPF programs run with preemption enabled (but CPU migration disabled), so they can still suffer from concurrency issues
- These issues can be avoided with mechanisms and helpers such as per-CPU map types
Program types and attach points
▶ eBPF can hook a program in different types of locations:
- Any kprobe
- Kernel-defined static tracepoint
- Specific perf event
- entire network stack
- See morebpf_attach_type
▶ A given attach point only accepts specific program types, see bpf_prog_type and bpf/libbpf/program_types
▶ The program type defines the context data passed to the eBPF program when it is invoked, for example:
- BPF_PROG_TYPE_TRACEPOINT programs receive a structure containing all the data the target tracepoint returns to user space
- BPF_PROG_TYPE_SCHED_CLS programs (used to implement packet classifiers) receive a struct __sk_buff, the kernel representation of a socket buffer
- For the context passed to each program type, see include/linux/bpf_types.h
eBPF maps
▶ eBPF programs can exchange data with user space or with other programs through different kinds of maps:
- BPF_MAP_TYPE_ARRAY: generic array storage, with a per-CPU variant
- BPF_MAP_TYPE_HASH: a key-value store whose keys can be of various types (__u32, a device type, an IP address, etc.)
- BPF_MAP_TYPE_QUEUE: a FIFO queue
- BPF_MAP_TYPE_CGROUP_STORAGE: a hash map keyed by cgroup id; similar maps exist for other object types (inodes, tasks, sockets, etc.)
▶ For simple data, it is easier and more efficient to use eBPF global variables directly (contrary to maps, they do not involve system calls)
The bpf() syscall
▶ The kernel exposes the bpf() system call to interact with the eBPF subsystem
▶ The system call takes a subcommand, plus data specific to each subcommand:
- BPF_PROG_LOAD: Load a bpf program
- BPF_MAP_CREATE: Allocate maps for use by the program
- BPF_MAP_LOOKUP_ELEM: lookup table entries in map
- BPF_MAP_UPDATE_ELEM: Update table entries in map
▶ The system call works with file descriptors referring to eBPF resources. Resources (programs, maps, links, etc.) stay alive as long as at least one program holds a valid file descriptor to them, and are automatically cleaned up once nobody uses them
▶ See man 2 bpf for more details
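To make the subcommands more concrete, here is a minimal, hypothetical user-space sketch using libbpf's thin wrappers around bpf() from <bpf/bpf.h> (the map name and values are arbitrary; needs root or CAP_BPF):
#include <stdio.h>
#include <bpf/bpf.h>	/* thin wrappers around the bpf() system call */

int main(void)
{
	__u32 key = 0;
	__u64 value = 42;

	/* BPF_MAP_CREATE: allocate an array map with one 8-byte entry */
	int map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "demo_map",
				    sizeof(key), sizeof(value), 1, NULL);
	if (map_fd < 0) {
		perror("bpf_map_create");
		return 1;
	}

	/* BPF_MAP_UPDATE_ELEM, then BPF_MAP_LOOKUP_ELEM */
	bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
	value = 0;
	bpf_map_lookup_elem(map_fd, &key, &value);
	printf("value = %llu\n", (unsigned long long)value);
	return 0;
}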
Writing eBPF programs
▶ An eBPF program can be written directly in raw eBPF assembly or in a high-level language (e.g., C or rust) and compiled using the clang compiler.
▶ The kernel provides helper functions to eBPF programs:
- bpf_trace_printk: log to the trace buffer
- bpf_map_{lookup,update,delete}_elem: manipulate maps
- bpf_probe_{read,write}[_user]: safely read/write data from/to kernel or user space
- bpf_get_current_pid_tgid: return the current process ID and thread group ID
- bpf_get_current_uid_gid: return the current user ID and group ID
- bpf_get_current_comm: return the executable name of the current task
- bpf_get_current_task: return the current struct task_struct
- For more helpers, see man 7 bpf-helpers
▶ The kernel also exposes kfuncs (cf.bpf/kfuncs), but in contrast to the bpf helper functions, they are not part of the kernel's stable interface
Manipulating eBPF program
▶ There are several ways to build, load, and manage eBPF programs:
- Write an eBPF program, build it with clang, then load it, attach it and read its data from a custom user-space program using the bpf() system call
- Use bpftool to manipulate an already built eBPF program (load, attach, read maps, etc.) without writing any user-space tool
- Write a custom eBPF tool on top of an intermediate library such as libbpf, which handles most of the plumbing
- Use higher-level frameworks such as BCC or bpftrace
BCC
▶ The BPF Compiler Collection (BCC) is a set of tools built on top of eBPF
▶ BCC provides a large number of ready-to-use BPF-based tools
▶ Also provides a simpler interface for writing, loading and hooking BPF programs than using the "original" BPF language.
▶ Available on many architectures (but not ARM32)
- On Debian, the tools are named <tool>-bpfcc
▶ BCC requires kernel version >=4.1
▶ BCC is evolving quickly, and many distributions have older versions: you may need to compile the latest source code.
BCC tools
BCC Tools example
▶ profile is a CPU profiler that samples the stacks currently being executed; its output can be turned into a flame graph:
$ git clone https://github.com/brendangregg/FlameGraph
$ profile-bpfcc -df -F 99 10 | ./FlameGraph/flamegraph.pl > flamegraph.svg
▶ Shows all new TCP connections:
$ tcpconnect
PID COMM IP SADDR DADDR DPORT
220321 ssh 6 ::1 ::1 22
220321 ssh 4 127.0.0.1 127.0.0.1 22
17676 Chrome_Child 6 2a01:cb15:81e4:8100:37cf:d45b:d87d:d97d 2606:50c0:8003::154 443
[...]
▶ See more at https://github.com/iovisor/bcc
Using BCC with python
▶ BCC exposes a bcc Python module providing a BPF class
▶ The eBPF program is written in C, stored either in a separate file or directly in a Python string
▶ When a BPF class instance is created with the eBPF program (file or string), it automatically builds, loads and, when possible, attaches the program
▶ A program can be attached in several ways:
- By giving its function a name whose prefix matches the target attach point (attaching is then automatic)
- By explicitly calling an attach method on the BPF instance
Using BCC with python
▶ Hook the clone() system call with a kprobe and print "Hello, World!" every time it is called:
from bcc import BPF
# define BPF program
prog = """
int hello(void *ctx) {
bpf_trace_printk("Hello, World!\\n");
return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
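As written, the script exits right after attaching. To actually see the output, BCC's own hello_world example then reads the trace pipe, which could be added here as a final line:
b.trace_print()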
libbpf
▶ In addition to using advanced frameworks like BCC, you can use libbpf to build custom tools to better control every aspect of your program!
▶ libbpf is a C-based library that reduces the complexity of eBPF programming by the following features:
- Userspace API for handling open/load/attach/teardown bpf programs
- User-space APIs for interacting with attached programs
- eBPF APIs that simplify writing eBPF programs
▶ Many distributions and build systems (such as Buildroot) package libbpf
▶ For more see /en/latest/
eBPF programming with libbpf
my_prog.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#define TASK_COMM_LEN 16
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, __u32);
__type(value, __u64);
__uint(max_entries, 1);
} counter_map SEC(".maps");
struct sched_switch_args {
unsigned long long pad;
char prev_comm[TASK_COMM_LEN];
int prev_pid;
int prev_prio;
long long prev_state;
char next_comm[TASK_COMM_LEN];
int next_pid;
int next_prio;
};
SEC("tracepoint/sched/sched_switch")
int sched_tracer(struct sched_switch_args *ctx)
{
__u32 key = 0;
__u64 *counter;
char *file;
char fmt[] = "Old task was %s, new task is %s\n";
bpf_trace_printk(fmt, sizeof(fmt), ctx->prev_comm, ctx->next_comm);
counter = bpf_map_lookup_elem(&counter_map, &key);
if(counter) {
*counter += 1;
bpf_map_update_elem(&counter_map, &key, counter, 0);
}
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
Building eBPF programs
▶ eBPF is written in C and can be constructed as a loadable object via clang:
$ clang -target bpf -O2 -g -c my_prog.bpf.c -o my_prog.bpf.o
▶ Recent GCC versions can also be used:
- On Debian/Ubuntu, install the gcc-bpf toolchain
- It exposes the bpf-unknown-none target
▶ To ease the manipulation of libbpf-based programs from user space, "skeleton" APIs can be generated with bpftool
bpftool
▶ bpftool is a command-line tool to interact with bpf object files and with the kernel, in order to manage bpf programs:
- Load the program into the kernel
- List loaded programs
- dump program instructions, BPF code or JIT code
- dump map contents
- Attach programs to hooks, etc.
▶ The bpf filesystem may need to be mounted in order to pin programs (i.e. keep them loaded after bpftool exits):
$ mount -t bpf none /sys/fs/bpf
▶ List the loaded programs:
$ bpftool prog
348: tracepoint name sched_tracer tag 3051de4551f07909 gpl
loaded_at 2024-08-06T15:43:11+0200 uid 0
xlated 376B jited 215B memlock 4096B map_ids 146,148
btf_id 545
▶ Load and attach a program:
$ mkdir /sys/fs/bpf/myprog
$ bpftool prog loadall trace_execve.bpf.o /sys/fs/bpf/myprog autoattach
▶ Unload a program:
$ rm -rf /sys/fs/bpf/myprog
▶ dump a loaded program:
$ bpftool prog dump xlated id 348
int sched_tracer(struct sched_switch_args * ctx):
; int sched_tracer(struct sched_switch_args *ctx)
0: (bf) r4 = r1
1: (b7) r1 = 0
; __u32 key = 0;
2: (63) *(u32 *)(r10 -4) = r1
; char fmt[] = "Old task was %s, new task is %s\n";
3: (73) *(u8 *)(r10 -8) = r1
4: (18) r1 = 0xa7325207369206b
6: (7b) *(u64 *)(r10 -16) = r1
7: (18) r1 = 0x7361742077656e20
[...]
▶ Dump eBPF program logs (the output of bpf_trace_printk()):
$ bpftool prog tracelog
▶ List the created maps:
$ bpftool map
80: array name counter_map flags 0x0
key 4B value 8B max_entries 1 memlock 256B
btf_id 421
82: array name .rodata.str1.1 flags 0x80
key 4B value 33B max_entries 1 memlock 288B
frozen
96: array name libbpf_global flags 0x0
key 4B value 32B max_entries 1 memlock 280B
[...]
▶ Display the contents of a map:
$ sudo bpftool map dump id 80
[{
"key": 0,
"value": 4877514 }
]
▶ Generate libbpf APIs to manipulate a program:
$ bpftool gen skeleton trace_execve.bpf.o name trace_execve > trace_execve.skel.h
▶ The generated high-level API can then be used from our own user-space program to drive the eBPF program:
- It instantiates a global context object referencing all the programs, maps, links, etc.
- It provides functions to load/attach/unload the programs
- The eBPF program is embedded in the generated header as a byte array
Userspace code with libbpf
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
#include "trace_sched_switch.skel.h"
int main(int argc, char *argv[])
{
struct trace_sched_switch *skel;
int key = 0;
long counter = 0;
skel = trace_sched_switch__open_and_load();
if(!skel)
exit(EXIT_FAILURE);
if (trace_sched_switch__attach(skel)) {
trace_sched_switch__destroy(skel);
exit(EXIT_FAILURE);
}
while(true) {
bpf_map__lookup_elem(skel->maps.counter_map, &key, sizeof(key), &counter, sizeof(counter), 0);
fprintf(stderr, "Scheduling switch count: %ld\n", counter);
sleep(1);
}
return 0;
}
eBPF programs portability
▶ In contrast to user-space APIs, stable APIs are not exposed inside the kernel, which means that eBPF programs that can manipulate certain kernel data do not necessarily run on other versions of the kernel
▶ CO-RE (Compile Once - Run Everywhere) solves this problem and makes programs portable between kernel versions. It relies on:
- A kernel built with CONFIG_DEBUG_INFO_BTF=y to embed BTF data; BTF is a format similar to DWARF that encodes data layouts and function signatures in a compact way
- An eBPF compiler able to emit BTF relocations (recent clang and GCC versions, with the -g flag)
- A BPF loader able to process BTF data and adjust the program's accesses accordingly: libbpf is the de-facto standard BPF loader
- eBPF APIs to read/write CO-RE relocated variables: libbpf provides helpers such as bpf_core_read
▶ See Andrii Nakryiko's CO-RE guide for more details
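As a minimal sketch of what a CO-RE access can look like with libbpf (assuming a vmlinux.h generated with bpftool btf dump; the program and variable names are illustrative):
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_exit")
int trace_exit(struct pt_regs *ctx)
{
	struct task_struct *task = (void *)bpf_get_current_task();
	/* BPF_CORE_READ() records BTF relocations, so field offsets are
	 * adjusted by the loader (libbpf) for the running kernel */
	pid_t ppid = BPF_CORE_READ(task, real_parent, tgid);

	bpf_printk("exiting task's parent tgid: %d", ppid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";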
▶ Beyond CO-RE, different kernel versions impose different restrictions, as major eBPF features keep being introduced or changed (the eBPF subsystem evolves quickly and frequently):
- eBPF tail calls, which allow a program to call a function, were introduced in 4.2; calling another program is only possible since 5.10
- eBPF spinlocks, preventing concurrent accesses to shared maps from different CPUs, were introduced in 5.1
- New attach types keep being introduced, at different times on different architectures: for example the fentry/fexit attach points exist since kernel 5.5 on x86 but only since 6.0 on arm32
- Any kind of loop, even bounded, was forbidden before 5.3
- CAP_BPF, introduced in 5.8, allows a task to perform eBPF operations
eBPF for tracing/profiling
▶ eBPF is a very powerful framework for probing the interior of the kernel: with a large number of attach points, it is possible to expose almost any kernel path and code.
▶ At the same time, eBPF programs are isolated from the kernel code, which makes them safer and simpler to work with than kernel development
▶ Thanks to in-kernel optimizations such as JIT compilation, eBPF is well suited for low-overhead tracing and profiling, and remains flexible enough for production environments
▶ This is why eBPF adoption keeps growing for debugging, tracing and profiling. It is used, among others, by:
- tracing frameworks such as BCC and bpftrace
- network infrastructure components such as Cilium or Calico
- network packet tracers such as pwru or dropwatch
- For more examples, see
eBPF: resources
▶ BCC tutorial: https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md
▶ libbpf-bootstrap: https://github.com/libbpf/libbpf-bootstrap
▶ A Beginner's Guide to eBPF Programming - Liz Rice, 2020
- Video: https://www.youtube.com/watch?v=lrSExTfS-iQ
- Resources: https://github.com/lizrice/ebpf-beginners
Choosing the right tool
▶ You need to know which type of tool to use before you start a profile or trace.
▶ The tool to use depends mainly on the layer you need to profile
▶ Usually start with application tracing/profiling tools (valgrind, perf, etc.) to analyze/optimize the application level
▶ Then analyze the performance of user space + kernel
▶ Finally, trace or profile the whole system if the performance problem only shows up when the system is loaded
- For problems that are always present, snapshot tools can be used
- For sporadic problems, record traces and analyze them afterwards
▶ If complex configuration is required before analysis, consider using custom tools: scripts, custom traces, eBPF, etc.
Kernel Debugging
Preventing bugs
Static code analysis
▶ Static analysis can be performed with the sparse tool
▶ sparse relies on annotations to detect errors at compile time:
- Locking issues (unbalanced locking)
- Address space issues, such as dereferencing user-space pointers directly
▶ Use make C=2 to analyze all source files
▶ Or make C=1 to analyze only the files that are about to be recompiled
▶ Example of an unbalanced locking warning:
rzn1_a5psw.c:81:13: warning: context imbalance in 'a5psw_reg_rmw' - wrong count
at exit
Good practices in kernel development
▶ When writing driver code, the user cannot be expected to provide correct values, so it is always necessary to verify these values
▶ To display the call stack when a specific condition is met, use the WARN_ON() macro
- During debugging, dump_stack() can also be used to display the current call stack:
static int check_flags(u32 flags)
{
if (WARN_ON(flags & STATE_INVALID))
return -EINVAL;
return 0;
}
▶ To check a condition at compile time (on configuration values, sizeof() of structure fields, etc.), use BUILD_BUG_ON():
BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved));
▶ If you get alerts about unused variables/parameters during compilation, you need to fix these issues
▶ Running checkpatch.pl --strict can also help spot potential problems in the code
Linux Kernel Debugging
▶ There are a variety of Linux kernel feature tools to help simplify kernel debugging
- Specific logging frameworks
- Using the standard way of dumping low-level crash messages
- Multiple runtime checkers to help check for various problems: memory issues, locking issues, undefined behavior, etc.
- Interactive or after-the-fact debugging
▶ These features must be enabled explicitly in the kernel configuration; they live under the Kernel hacking -> Kernel debugging menu entries
- CONFIG_DEBUG_KERNEL must be set to "y" to enable the other debug options
Debugging using messages
There are 3 available APIs:
▶ The old printk() is not recommended for new debug messages
▶ pr_*()
Family Functions:pr_emerg(), pr_alert(), pr_crit(), pr_err(), pr_warn(), pr_notice(), pr_info(), pr_cont()as well as specialpr_debug()(see later)
- Defined in include/linux/printk.h
- Use a classically formatted string as a parameter, such as
pr_info("Booting CPU %d\n", cpu);
- Here is the output kernel log:
[ 202.350064] Booting CPU 1
▶ print_hex_dump_debug(): Dump the buffer contents using a hexdump-like format
▶ dev_*() family functions: dev_emerg(), dev_alert(), dev_crit(), dev_err(), dev_warn(), dev_notice(), dev_info(), as well as the special dev_dbg() (see below):
- They take a pointer to a struct device as first argument, followed by a format string and its arguments
- Defined in include/linux/dev_printk.h
- To be used in drivers integrated with the Linux device model
- Usage example:
dev_info(&pdev->dev, "in probe\n");
- Kernel output:
[ 25.878382] serial : in probe
[ 25.884873] serial : in probe
▶ The *_ratelimited() variants limit the amount of output when called at a high rate, based on the /proc/sys/kernel/printk_ratelimit{_burst} values
▶ Compared to the standard printf(), the kernel supports additional format specifiers:
- %p: displays a hashed version of the pointer by default
- %px: always displays the real pointer address (for non-sensitive addresses)
- %pK: displays a hashed pointer value, 0 or the real address depending on the kptr_restrict sysctl
- %pOF: device tree node
- %pr: resource structure
- %pa: physical address (on both 32 and 64 bit)
- %pe: error pointer (displays the string corresponding to the error value)
▶ For %pK to be useful, set /proc/sys/kernel/kptr_restrict to 1
▶ For more supported format descriptors, seecore-api/printk-formats
pr_debug() and dev_dbg()
▶ When the driver is compiled with DEBUG defined, all these messages are compiled in and printed at the debug level. DEBUG can be defined with #define DEBUG at the top of the driver, or with ccflags-$(CONFIG_DRIVER) += -DDEBUG in the Makefile
▶ When the kernel is compiled with CONFIG_DYNAMIC_DEBUG, these messages can be enabled per file, per module or per message through /proc/dynamic_debug/control. Messages are disabled by default
- See admin-guide/dynamic-debug-howto for details
- This allows getting only the debug messages you are interested in
▶ When neither DEBUG nor CONFIG_DYNAMIC_DEBUG is used, these messages are not compiled in at all
pr_debug() and dev_dbg() usage
▶ Debug messages are enabled through the /proc/dynamic_debug/control file
- cat /proc/dynamic_debug/control displays all the debug message lines available in the kernel
- Example:
init/main.c:1427 [main]run_init_process =p " \%s\012"
▶ Individual lines, files or modules can be enabled with the following syntax:
- echo "file drivers/pinctrl/ +p" > /proc/dynamic_debug/control enables all the debug messages in drivers/pinctrl/
- echo "module pciehp +p" > /proc/dynamic_debug/control enables the debug messages of the pciehp module
- echo "file init/ line 1427 +p" > /proc/dynamic_debug/control enables the debug message at line 1427 of init/
- Replace +p with -p to disable a debug message
Debug logs troubleshooting
▶ When using dynamic debug, make sure the debug call is enabled: it must show up in the control file in debugfs and be activated (=p)
▶ Is the log output only going to the kernel log buffer?
- It can be viewed with dmesg
- Lowering loglevel sends messages directly to the console
- Setting ignore_loglevel on the kernel command line forces all kernel logs to the console
▶ For out-of-tree modules, you may need to define DEBUG in the module source or Makefile instead of using dynamic debug
▶ If the configuration is done on the kernel command line, is it parsed correctly?
- Since 5.14, the kernel reports unknown parameters:
Unknown kernel command line parameters foo, will be passed to user space.
- Watch out for special character escaping (e.g. quotes)
▶ Note that some subsystems use their own logging infrastructure with specific configuration/controls (e.g. drm.debug=0x1ff)
Kernel early debug
▶ During the booting phase, the kernel may crash before displaying the system message
▶ On ARM, if the kernel fails to boot or hangs without any message, early debug options can be enabled:
- CONFIG_DEBUG_LL=y enables the ARM low-level early serial output
- CONFIG_EARLYPRINTK=y allows printk to output messages very early
▶ The earlyprintk command line parameter is required to enable early printk output
Kernel crashes and oops
Kernel crashes
▶ The kernel is not immune to crashes, and many errors may cause crashes
- Memory access errors (null pointers, out-of-bounds accesses, etc.)
- Explicit panic() calls when an error is detected
- Incorrect kernel API usage (e.g. sleeping in atomic context)
- Kernel detects deadlock
▶ On such an error, the kernel displays a message on the console known as a "kernel oops"
Kernel oops
▶ Message content depends on the architecture used
▶ Most architectures will display at least the following information:
- CPU state at the time of oops
- Register contents
- Backtracking function calls that cause crashes
- Stack contents (last X bytes)
▶ Depending on the architecture, the crash location can be identified from the PC register (also called IP, EIP, etc.)
▶ UseCONFIG_KALLSYMS=ySymbolic names can be embedded in the kernel image, which in turn allows for meaningful symbolic names in the traceback stack
▶ The format of the symbols displayed in the traceback stack is:
<symbol_name>+<hex_offset>/<symbol_size>
▶ If the oops is not critical (it happened in process context), the kernel kills the offending process and continues running
- Kernel stability can no longer be guaranteed, though
▶ Tasks that hang for too long may also generate oops (CONFIG_DETECT_HUNG_TASK)
▶ If KGDB is supported, the kernel switches to KGDB mode when oops occur
Oops example
Kernel oops debugging: addr2line
▶ You can use addr2line to convert the displayed address/symbol to a source line:
addr2line -e vmlinux <address>
▶ GNU binutils >= 2.39 can resolve <symbol_name>+<offset> notations directly:
addr2line -e vmlinux <symbol_name>+<off>
▶ With older versions, the faddr2line script from the kernel sources handles the symbol+offset notation:
scripts/faddr2line vmlinux <symbol_name>+<off>
▶ Must be passedCONFIG_DEBUG_INFO=yCompiling the kernel to embed debugging information into vmlinux files
Kernel oops debugging: decode_stacktrace.sh
▶ The decode_stacktrace.sh script provided in the kernel sources automates the addr2line decoding of an oops
▶ This script converts all symbolic names/addresses to the corresponding file/line and shows the assembly code that triggered the crash
▶ ./scripts/decode_stacktrace.sh vmlinux linux_source_path/ < oops_ > decoded_oops.txt
▶ Note: set the CROSS_COMPILE and ARCH environment variables to get the correct disassembly
Oops behavior configuration
▶ Sometimes, crashes can be more severe, causing the kernel to panic and stop execution altogether in a busy loop
▶ You can passCONFIG_PANIC_TIMEOUTEnable automatic reboot on panic
- 0: with no reboot
- Negative value: Immediate restart
- Positive: number of seconds to wait before rebooting
▶ Oopses can be configured to always trigger a panic:
- at boot time, by adding oops=panic to the command line
- at build time, with CONFIG_PANIC_ON_OOPS=y
The Magic SysRq
Functionality provided by serial drivers
▶ Allows running debugging/rescue commands even when the kernel is in serious trouble
- On embedded systems: send a break character on the serial console (for instance press [Ctrl]+a then [Ctrl]+\ in the terminal emulator), then press <character>
- Or echo <character> into /proc/sysrq-trigger
▶ Example:
- h: show available commands
- s: synchronize all mounted file systems
- b: Reboot the system
- w: show the kernel stacks of all sleeping processes
- t: show kernel stacks for all running processes
- g: enter kgdb mode
- z: flush trace buffer
- c: trigger a crash (kernel panic)
- You can also register your own commands
▶ See detailsadmin-guide/sysrq
Built-in Kernel self tests
Kernel memory issue debugging
▶ The same kinds of memory problems as in user space can occur when writing kernel code:
- Out-of-bounds accesses
- Use-after-free (dereferencing a pointer after kfree())
- Memory leaks due to missing kfree() calls
▶ A variety of tools are available to capture these issues
- KASANCan look for use of freed memory and out-of-bounds access issues
- KFENCECan look for use of freed memory and out-of-bounds access issues in production systems
- KmemleakCan find memory leaks caused by forgetting to free memory
KASAN
▶ KASAN can find use-after-free and out-of-bounds memory accesses
▶ It works by instrumenting the kernel at compile time with GCC
▶ Supported on most architectures (ARM, ARM64, PowerPC, RISC-V, S390, Xtensa and X86)
▶ Enabled with the CONFIG_KASAN kernel option
▶ KASAN can be enabled for specific files via the Makefile:
- KASAN_SANITIZE_file.o := y enables KASAN for a specific file
- KASAN_SANITIZE := y enables KASAN for all the files of the Makefile's directory
Kmemleak
▶ Kmemleak can find leaks of objects dynamically allocated with kmalloc()
- It works by scanning memory to detect whether allocated addresses are still referenced
▶ Once CONFIG_DEBUG_KMEMLEAK is enabled, kmemleak is controlled through files in debugfs
▶ A scan for memory leaks runs every 10 minutes
- It can be disabled with CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=n
▶ A scan can be triggered immediately as follows
# echo scan > /sys/kernel/debug/kmemleak
▶ Results are displayed in the debugfs
# cat /sys/kernel/debug/kmemleak
▶ For more information seedev-tools/kmemleak
Kmemleak report
# cat /sys/kernel/debug/kmemleak
unreferenced object 0x82d43100 (size 64):
comm "insmod", pid 140, jiffies 4294943424 (age 270.420s)
hex dump (first 32 bytes):
b4 bb e1 8f c8 a4 e1 8f 8c ce e1 8f 88 c6 e1 8f ................
10 a5 e1 8f 18 e2 e1 8f ac c6 e1 8f 0c c1 e1 8f ................
backtrace:
[<c31f5b59>] slab_post_alloc_hook+0xa8/0x1b8
[<c8200adb>] kmem_cache_alloc_trace+0xb8/0x104
[<1836406b>] 0x7f005038
[<89fff56d>] do_one_initcall+0x80/0x1a8
[<31d908e3>] do_init_module+0x50/0x210
[<2658dd55>] load_module+0x208c/0x211c
[<e1d48f15>] sys_finit_module+0xe4/0xf4
[<1de12529>] ret_fast_syscall+0x0/0x54
[<7ee81f34>] 0x7eca8c80
UBSAN
▶ UBSAN is a runtime detector that detects undefined code behavior
- Shifts by an amount larger than the type width
- integer overflow
- Unaligned pointer access
- Out-of-bounds access to static arrays
- /docs/
▶ Use compile-time detection to insert checks performed at runtime
▶ Requires CONFIG_UBSAN=y
▶ UBSAN can be enabled for specific files via the Makefile:
- UBSAN_SANITIZE_file.o := y enables UBSAN for a specific file
- UBSAN_SANITIZE := y enables UBSAN for all the files of the Makefile's directory
UBSAN: example of UBSAN report
▶ The following reports an undefined behavior: shifting with a value of >32
UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/:425
...
RIP: 0033:0x4497b9
Code: e8 8c 9f 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 0f 83 9b 6b fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fb5ef0e2c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fb5ef0e36cc RCX: 00000000004497b9
RDX: 0000000020000040 RSI: 0000000000000258 RDI: 0000000000000014
RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000005490 R14: 00000000006ed530 R15: 00007fb5ef0e3700
Debugging locking
▶ Lock debugging: prove locking correctness
- CONFIG_PROVE_LOCKING
- Adds instrumentation to the kernel locking code
- Detects violations of locking rules during the life of the system, such as:
- Inconsistent lock ordering (lock orders are tracked and compared)
- Spinlocks taken both in interrupt handlers and in process context with interrupts enabled
- Not suitable for production systems
- See locking/lockdep-design for details
▶ CONFIG_DEBUG_ATOMIC_SLEEP detects code that mistakenly sleeps in atomic sections (typically while holding a lock)
- Detected problems are reported in dmesg
Concurrency issues
▶ KCSAN, the Kernel Concurrency Sanitizer
▶ Introduced in Linux 5.8, enabled with CONFIG_KCSAN
▶ A dynamic race detector based on compile-time instrumentation
▶ Can detect concurrency problems (mostly data races) in the system
▶ See dev-tools/kcsan and https://lwn.net/Articles/816850/ for more details
KGDB
kgdb - A kernel debugger
▶ CONFIG_KGDB
▶ The execution of the kernel is completely controlled by gdb on another machine connected using a serial line
▶ Allows doing almost anything, including inserting breakpoints in interrupt handlers
▶ Supports the most popular CPU architectures
▶ CONFIG_GDB_SCRIPTS It is possible to build the GDB python scripts provided by the kernel
- See moredev-tools/gdb-kernel-debugging
kgdb kernel config
▶ CONFIG_DEBUG_KERNEL=y needed for KGDB support
▶ CONFIG_KGDB=y Enable KGDB
▶ CONFIG_DEBUG_INFO=y Compile the kernel with debugging information (-g
)
▶ CONFIG_FRAME_POINTER=y for more reliable stack traces
▶ CONFIG_KGDB_SERIAL_CONSOLE=y Enable Serial KGDB
▶ CONFIG_GDB_SCRIPTS=y Enabling kernel GDB python scripts
▶ CONFIG_RANDOMIZE_BASE=n Disable KASLR
▶ CONFIG_WATCHDOG=nDisable watchdog
▶ CONFIG_MAGIC_SYSRQ=y Enabling Magic SysReq Support
▶ CONFIG_STRICT_KERNEL_RWX=n Disabling memory protection for kernel segments can allow adding breakpoints
kgdb pitfalls
▶ KASLR must be disabled so that gdb is not confused by randomized kernel addresses
- If KASLR is enabled in your kernel, disable it with the nokaslr command line parameter
▶ Disable platform watchdog to prevent rebooting during debugging
- When KGDB is interrupted, all interrupts are disabled and watchdog is not serviced
- The watchdog may also be enabled by earlier boot stages (e.g. the bootloader); make sure to disable it there too!
▶ Kernel execution cannot be stopped with the gdb interrupt command or Ctrl+C
▶ Insertion of breakpoints at arbitrary positions is not supported (seeCONFIG_KGDB_HONOUR_BLOCKLIST)
▶ A console driver with polling support is required
▶ Some architectures lack proper support (e.g. no watchpoints on arm32) and may be unstable
Using kgdb
▶ See the kernel documentation for details:dev-tools/kgdb
▶ A kgdb I/O driver must be included, e.g. kgdb over the serial console (kgdboc: kgdb over console, enabled with CONFIG_KGDB_SERIAL_CONSOLE)
▶ Configure kgdboc at boot time via a kernel parameter:
- kgdboc=<tty-device>,<baudrate>, e.g. kgdboc=ttyS0,115200
▶ Or configure it at runtime through sysfs:
echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc
- If the console driver does not support polling, an error is reported
▶ Then pass kgdbwait on the kernel command line: it tells kgdb to wait until a debugger connects
▶ Boot the kernel and, once the console is initialized, interrupt it by sending a break character followed by g over the serial console (see Magic SysRq)
▶ On your workstation, start gdb:
arm-linux-gdb ./vmlinux
(gdb) set remotebaud 115200
(gdb) target remote /dev/ttyS0
▶ Once connected, you can debug the kernel much like a regular application (see the short example after this list)
▶ On the GDB side, the first threads represent the CPU contexts (ShadowCPU)
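The short sketch below illustrates the flow described above; the echo command is an alternative to sending a break character over the serial line (it requires Magic SysRq support), and the breakpoint symbol is only an example:
# On the target: break into the debugger
echo g > /proc/sysrq-trigger
# On the host, once gdb is attached:
(gdb) break panic
(gdb) continue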
Kernel GDB scripts
▶ CONFIG_GDB_SCRIPTS builds Python scripts that simplify kernel debugging (adding new commands and functions)
▶ The scripts in the kernel build directory are loaded automatically when running gdb vmlinux from there (a short usage example follows this list)
- lx-symbols: reload symbols for vmlinux and the loaded modules
- lx-dmesg: show the kernel dmesg buffer
- lx-lsmod: show the loaded modules
- lx-device-{bus|class|tree}: show the device buses, classes and trees
- lx-ps: list tasks, similar to ps
- $lx_current(): contains the current task_struct
- $lx_per_cpu(var, cpu): returns a per-cpu variable
- apropos lx: show all the available functions
▶ dev-tools/gdb-kernel-debugging
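A short illustrative session using a few of these helpers (output omitted): lx-symbols reloads the symbols, lx-dmesg and lx-ps inspect the kernel log and the task list, and $lx_current() gives access to the fields of the current task_struct:
(gdb) lx-symbols
(gdb) lx-dmesg
(gdb) lx-ps
(gdb) p $lx_current().pid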
KDB
▶ CONFIG_KGDB_KDB includes a kgdb frontend named "KDB"
▶ This front-end exposes a debugging prompt on the serial terminal to debug the kernel without the need for an external gdb
▶ You can enter the KDB using the same mechanism as for entering the kgdb mode.
▶ You can use both KDB and KGDB.
- In KDB, use the kgdb command to enter kgdb mode
- From gdb, send the maintenance packet 3 command to switch from kgdb to KDB mode
kdmx
▶ When the system has only one serial port, KGDB and the serial console cannot be used at the same time, since only one application can access the port
▶ Fortunately, the kdmx tool can split the GDB messages and the standard console coming from a single serial port into two ptys (/dev/pts/x), allowing KGDB and the serial console to be used simultaneously
▶ /pub/scm/utils/kernel/kgdb/
- kdmx subdirectory
Going further with KGDB
▶ See the following link for more examples and explanations:
- Video: /watch?v=HBOwoSyRmys
- Slides: /images/1/1b/ELC19_Serial_kdb_kgdb.pdf
crash
▶ crash is a CLI tool for interacting with the kernel (dead or alive)
- Uses /dev/mem or /proc/kcore
- Requires CONFIG_STRICT_DEVMEM=n
▶ Works with coredump files generated by kdump, kvmdump, etc.
▶ Based on gdb and provides many specific commands to inspect the kernel state
- Stacks, dmesg, process memory maps, irqs, virtual memory areas, etc.
▶ All tasks running on the system can be checked.
▶ /crash-utility/crash
crash example
$ crash vmlinux vmcore
[...]
TASKS: 75
NODENAME: buildroot
RELEASE: 5.13.0
VERSION: #1 SMP PREEMPT Tue Nov 15 14:42:25 CET 2022
MACHINE: armv7l (unknown Mhz)
MEMORY: 512 MB
PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 00000070"
PID: 127
COMMAND: "watchdog"
TASK: c3f163c0 [THREAD_INFO: c3f00000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> mach
MACHINE TYPE: armv7l
MEMORY SIZE: 512 MB
CPUS: 1
PROCESSOR SPEED: (unknown)
HZ: 100
PAGE SIZE: 4096
KERNEL VIRTUAL BASE: c0000000
KERNEL MODULES BASE: bf000000
KERNEL VMALLOC BASE: e0000000
KERNEL STACK SIZE: 8192
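From the crash> prompt, a few standard commands give a quick first look at the dump: bt prints the backtrace of the panicked task, log dumps the kernel message buffer, ps lists all tasks and vm shows the virtual memory of the current context:
crash> bt
crash> log
crash> ps
crash> vm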
post-mortem analysis
Kernel crash post-mortem analysis
▶ Sometimes, it is not possible to access a crashed system or to keep the system offline while waiting for debugging
▶ The kernel can generate a crash dump (vmcore file) to be analyzed remotely, allowing the system to be rebooted quickly while still supporting post-mortem analysis with gdb
▶ This feature relies on kexec and kdump, which boot another kernel right after the crash and dump the vmcore file from it
- The vmcore file can then be saved to local storage or transferred via SSH, FTP, etc.
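For instance, once the dump-capture kernel has booted, the dump can simply be copied off the target (the host name and destination path below are hypothetical):
scp /proc/vmcore user@debug-host:/srv/crashes/vmcore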
kexec & kdump
▶ On panic, the kernel's kexec support allows booting a "dump-capture kernel" directly from the crashed kernel
- Most of the time, a dedicated dump-capture kernel with a minimal configuration and initramfs/initrd is built for this purpose
▶ kexec requires reserving, at boot time, a portion of RAM in which the dump-capture kernel will run
- This is done with the crashkernel parameter, which reserves a specific physical memory region for the crash kernel
▶ kexec-tools is then used to load the dump-capture kernel into this memory region
- Internally, it uses the kexec_load system call (man 2 kexec_load)
▶ Finally, on panic, the kernel reboots into the dump-capture kernel, allowing the user to dump the kernel coredump (/proc/vmcore) onto any medium
▶ Additional command-line options may be needed depending on the architecture
▶ See admin-guide/kdump/kdump for a full explanation of how to configure the kdump kernel with kexec
▶ There are also user-space services and tools to automatically collect vmcore dumps and send them to a remote host
- The kdump systemd service and the makedumpfile tool, which can also compress vmcore into a smaller file (x86, PPC, IA64, S390 only)
- /makedumpfile/makedumpfile
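As a sketch, makedumpfile is typically run in the dump-capture environment to filter and compress the dump before it is transferred (the dump level and output path are examples; check the options supported by your version):
makedumpfile -c -d 31 /proc/vmcore /var/crash/vmcore.compressed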
kdump
kexec config and setup
▶ On the standard kernel:
- CONFIG_KEXEC=y to enable kexec support
- kexec-tools, which provides the kexec command
- A kernel image and DTB accessible to kexec
▶ On the dump-capture kernel:
- CONFIG_CRASH_DUMP=y to enable crash dump support
- CONFIG_PROC_VMCORE=y to enable /proc/vmcore support
- CONFIG_AUTO_ZRELADDR=y on ARM32 platforms
▶ Set the correct crashkernel command-line option:
- crashkernel=size[KMG][@offset[KMG]]
▶ Load the dump-capture kernel with kexec from the standard (first) kernel:
kexec --type zImage -p my_zImage --dtb=my_dtb.dtb --initrd=my_initrd --append="command line option"
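A minimal end-to-end check of the setup, assuming Magic SysRq is enabled (this deliberately crashes the running kernel):
# Boot the standard kernel with e.g. crashkernel=128M, load the dump-capture kernel
# with the kexec command above, then trigger a crash:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
# The system should restart in the dump-capture kernel, where /proc/vmcore is available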
Going further with kexec & kdump
▶ See the following for more information on kexec/kdump:
- Video: /watch?v=aUGNDJPpUUg
- Slides: /hosted_files/ossna2022/c0/Postmortem_ Kexec%2C Kdump and