[kernel] Looking at the source code with questions -- how scripts are called by execve

preamble

In the "Process Creation -> exec -> Interpreter Files" section of the article "[apue] Process control stuff", it was mentioned that script files are recognized by the kernel as part of exec syscall processing, and that this recognition has the following characteristics:

The first line of an interpreter file, which starts with #! (shebang), must not be longer than 128 characters
The shebang can specify at most one parameter
The command and parameter specified by the shebang become the first two parameters of the new process, with the user-supplied parameters following in order

How are these features implemented? With this question in mind, let's look at the corresponding kernel source code to see what is going on.

locating the source code

As in "[kernel] Looking at the source code with questions - how process IDs are assigned", bootlin is used here to browse the kernel version 3.10.0 source. Script files are parsed during execve, so search for sys_execve first.

The entire call chain is as follows:

sys_execve -> do_execve -> do_execve_common -> search_binary_handler -> load_binary -> load_script (binfmt_script.c)

To get to the point quickly, the earlier functions in the chain are skipped; only search_binary_handler is explained here.

Linux loads different executable file formats through an extensible mechanism based on kernel modules: each module implements one format, so a new format can be supported by writing a kernel module, without modifying the rest of the kernel source. A quick browse through the code gives a first impression:

These are some of the format modules currently shipped with the kernel:

binfmt_elf: ELF, the most commonly used Linux binary executable format
binfmt_elf_fdpic: ELF variant for architectures without an MMU
binfmt_em86: runs Intel Linux binaries on an Alpha machine
binfmt_aout: a.out, the old Linux executable format
binfmt_script: script files
binfmt_misc: a mechanism for binding file formats to their interpreters at runtime

They fall into three main categories:

executable files
scripts
extension mechanism (misc)

The misc mechanism is similar to the way Windows binds files to applications by suffix, but besides the suffix it can also match a magic field in the file to determine the file type. Its main application today is running binaries across architectures, such as running arm64 or even Windows programs (via wine) on x86 machines, which is far more convenient than writing a kernel module.

This article focuses on the script file process.

binfmt

The kernel module itself is not difficult to implement, using script as an example:

static struct linux_binfmt script_format = {
	.module		= THIS_MODULE,
	.load_binary	= load_script,
};

static int __init init_script_binfmt(void)
{
	register_binfmt(&script_format);
	return 0;
}

static void __exit exit_script_binfmt(void)
{
	unregister_binfmt(&script_format);
}

core_initcall(init_script_binfmt);
module_exit(exit_script_binfmt);

Its main job is to insert and remove a linux_binfmt node through register_binfmt / unregister_binfmt.

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};

linux_binfmt has only a few fields, and not every callback has to be implemented; unused ones are simply left unset. Here is how a node is inserted:

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt, int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh, &formats) :
		 list_add_tail(&fmt->lh, &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt, 0);
}

The linux_binfmt.lh field (a list_head) links the node into a linked list whose head is the global variable formats.

search_binary_handler

Now look at how search_binary_handler uses formats to traverse the linked list:

retval = -ENOENT;
for (try=0; try<2; try++) {

At most 2 attempts

    read_lock(&binfmt_lock);
	list_for_each_entry(fmt, &formats, lh) {

Take the read lock; traverse the entire linked list starting from formats

        int (*fn)(struct linux_binprm *) = fmt->load_binary;
		if (!fn)
			continue;
		if (!try_module_get(fmt->module))
			continue;
		read_unlock(&binfmt_lock);
		bprm->recursion_depth = depth + 1;

Skip formats without a load_binary callback; check that the kernel module is still alive (and pin it); release the lock on the formats list before calling load_binary, because the call may nest; update the nesting depth

        retval = fn(bprm);
		bprm->recursion_depth = depth;
		if (retval >= 0) {
			if (depth == 0) {
				trace_sched_process_exec(current, old_pid, bprm);
				ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
			}
			put_binfmt(fmt);
			allow_write_access(bprm->file);
			if (bprm->file)
				fput(bprm->file);
			bprm->file = NULL;
			current->did_exec = 1;
			proc_exec_connector(current);
			return retval;
		}

Restore the nesting depth; on success, clean up and return early

        read_lock(&binfmt_lock);
		put_binfmt(fmt);
		if (retval != -ENOEXEC || bprm->mm == NULL)
			break;
		if (!bprm->file) {
			read_unlock(&binfmt_lock);
			return retval;
		}
	}

On failure, re-take the lock and drop the module reference; any error other than -ENOEXEC ends the search, while -ENOEXEC moves on to the next fmt

    read_unlock(&binfmt_lock);
	break;
}

Iteration complete. Exit.

The list_for_each_entry macro is a Linux wrapper for list traversal:

/**
 * list_for_each_entry	-	iterate over list of given type
 * @pos:	the type * to use as a loop cursor.
 * @head:	the head for your list.
 * @member:	the name of the list_struct within the struct.
 */
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, typeof(*pos), member);	\
	     &pos->member != (head); 	\
	     pos = list_entry(pos->member.next, typeof(*pos), member))

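It's essentially a for loop. Also, in the code as excerpted here, the outer for (try < 2) loop never reaches a second iteration, because it always hits the break at the end (in the full 3.10 source, a second pass happens only when CONFIG_MODULES is enabled and a matching binfmt module is requested on demand).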

This also hints at how a load_binary callback should be written: returning -ENOEXEC means "this file is not my format", so the traversal of the formats list continues with the next handler. Keep this in mind when reading load_script.

binfmt handling can also nest. Assuming a script is called whose interpreter is awk, the whole execution looks roughly like this:

execve () -> load_script (binfmt_script) -> load_elf_binary (binfmt_elf)

This is because awk, as an executable, itself requires binfmt processing, as you'll see in load_script in a moment.

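Note that nesting is not unlimited: the depth is tracked in bprm->recursion_depth, and in kernel 3.10 search_binary_handler begins by rejecting a chain that is too deep with -ELOOP (a check not included in the excerpt above).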

source code analysis

After a bit of background knowledge, it's finally time to take a look at binfmt_script:

static int load_script(struct linux_binprm *bprm)
{
	const char *i_arg, *i_name;
	char *cp;
	struct file *file;
	char interp[BINPRM_BUF_SIZE];
	int retval;

	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
		return -ENOEXEC;

Files that do not start with #! are rejected with -ENOEXEC so that the next format can be tried. Note the length of the interp array, #define BINPRM_BUF_SIZE 128: this is where the claim that a shebang line cannot exceed 128 characters comes from.

    /*
	 * This section does the #! interpretation.
	 * Sorta complicated, but hopefully it will work.  -TYT
	 */

	allow_write_access(bprm->file);
	fput(bprm->file);
	bprm->file = NULL;

The kernel has already read the first bytes of the file into bprm->buf, so the script file itself is no longer needed and is released

    bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
	if ((cp = strchr(bprm->buf, '\n')) == NULL)
		cp = bprm->buf+BINPRM_BUF_SIZE-1;
	*cp = '\0';

Truncate to at most 127 characters, then search for the end of the shebang line (\n); if one is found, terminate the string there

    while (cp > bprm->buf) {
		cp--;
		if ((*cp == ' ') || (*cp == '\t'))
			*cp = '\0';
		else
			break;
	}
	for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++);
	if (*cp == '\0') 
		return -ENOEXEC; /* No interpreter name found */

Trim blanks at both ends; if nothing is left, reject the file. Note that cp initially points at the end of the string and finally points at the start of the interpreter name

    i_name = cp;
	i_arg = NULL;
	for ( ; *cp && (*cp != ' ') && (*cp != '\t'); cp++)
		/* nothing */ ;
	while ((*cp == ' ') || (*cp == '\t'))
		*cp++ = '\0';
	if (*cp)
		i_arg = cp;
	strcpy (interp, i_name);

Skip over the command name; skip whitespace; treat everything that remains as a single argument; copy the command name into the interp array as a backup

    /*
	 * OK, we've parsed out the interpreter name and
	 * (optional) argument.
	 * Splice in (1) the interpreter's name for argv[0]
	 *           (2) (optional) argument to interpreter
	 *           (3) filename of shell script (replace argv[0])
	 *
	 * This is done in reverse order, because of how the
	 * user environment and arguments are stored.
	 */
	retval = remove_arg_zero(bprm);
	if (retval)
		return retval;
	retval = copy_strings_kernel(1, &bprm->interp, bprm);
	if (retval < 0) return retval; 
	bprm->argc++;
	if (i_arg) {
		retval = copy_strings_kernel(1, &i_arg, bprm);
		if (retval < 0) return retval; 
		bprm->argc++;
	}
	retval = copy_strings_kernel(1, &i_name, bprm);
	if (retval) return retval; 
	bprm->argc++;
	retval = bprm_change_interp(interp, bprm);
	if (retval < 0)
		return retval;

Remove the first element of argv, then place the command name (i_name), the argument (i_arg, if any), and the script filename (bprm->interp) in the first three slots of argv, in that order.

Note that the calls are made in reverse order: bprm->interp, i_arg, i_name. This is due to the way argv is stored in the process, explained later. Finally, bprm_change_interp updates the interpreter name stored in bprm.

    /*
	 * OK, now restart the process with the interpreter's dentry.
	 */
	file = open_exec(interp);
	if (IS_ERR(file))
		return PTR_ERR(file);

	bprm->file = file;
	retval = prepare_binprm(bprm);
	if (retval < 0)
		return retval;

Open the file at the path given by the command name and attach it to bprm, then call prepare_binprm to gather the information needed before loading, including pre-reading the first bytes of the file

    return search_binary_handler(bprm);
}

Continue searching for a binfmt handler and loading, this time with the information of the new command

One thing to add: the command name in a shebang cannot contain spaces, otherwise it is cut off at the first space, even if it is surrounded by quotes (the parsing code does not handle quotes at all). Here is an example:

> pwd
/ext/code/apue//test black
> ls -lh
total 52K
-rwxr-xr-x 1 yunhai01 DOORGOD 48K Aug 23 19:17 echo
-rwxr--r-- 1 yunhai01 DOORGOD  47 Aug 23 19:17 
> cat 
#! /ext/code/apue//test black/demo

> ./echo a b c
argv[0] = ./echo
argv[1] = a
argv[2] = b
argv[3] = c
> ./ a b c
bash: ./: /ext/code/apue//test: bad interpreter: No such file or directory

Header pre-reading

There are two main points to explain here. The first is that prepare_binprm pre-reads some data from the head of the file, which the binfmt handlers later use to recognize the format:

/* 
 * Fill the binprm structure from the inode. 
 * Check permissions, then read the first 128 (BINPRM_BUF_SIZE) bytes
 *
 * This may be called multiple times for binary chains (scripts for example).
 */
int prepare_binprm(struct linux_binprm *bprm)
{
	umode_t mode;
	struct inode * inode = file_inode(bprm->file);
	int retval;

	mode = inode->i_mode;
	if (bprm->file->f_op == NULL)
		return -EACCES;

    ...

	/* fill in binprm security blob */
	retval = security_bprm_set_creds(bprm);
	if (retval)
		return retval;
	bprm->cred_prepared = 1;

	memset(bprm->buf, 0, BINPRM_BUF_SIZE);
	return kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
}

BINPRM_BUF_SIZE is the same constant seen in load_script, currently 128:

#define BINPRM_BUF_SIZE 128

prepare_binprm is also called in do_execve_common, to prepare for the first round of binfmt recognition:

/*
 * sys_execve() executes a new program.
 */
static int do_execve_common(const char *filename,
				struct user_arg_ptr argv,
				struct user_arg_ptr envp)
{
	struct linux_binprm *bprm;
	struct file *file;
	struct files_struct *displaced;
	bool clear_in_exec;
	int retval;
	const struct cred *cred = current_cred();

    ...

	file = open_exec(filename);
	retval = PTR_ERR(file);
	if (IS_ERR(file))
		goto out_unmark;

	sched_exec();

	bprm->file = file;
	bprm->filename = filename;
	bprm->interp = filename;

	retval = bprm_mm_init(bprm);
	if (retval)
		goto out_file;

	bprm->argc = count(argv, MAX_ARG_STRINGS);
	if ((retval = bprm->argc) < 0)
		goto out;

	bprm->envc = count(envp, MAX_ARG_STRINGS);
	if ((retval = bprm->envc) < 0)
		goto out;

	retval = prepare_binprm(bprm);
	if (retval < 0)
		goto out;

Yes, this is it.

    retval = copy_strings_kernel(1, &bprm->filename, bprm);
	if (retval < 0)
		goto out;

	bprm->exec = bprm->p;
	retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out;

	retval = copy_strings(bprm->argc, argv, bprm);
	if (retval < 0)
		goto out;

	retval = search_binary_handler(bprm);
	if (retval < 0)
		goto out;

    ...
}

argv Adjustment

The other point is the layout of argv in memory. This was covered in the earlier article "[apue] Process environment stuff"; the relevant passage is quoted directly:

Command-line arguments and environment variables are strings placed at the top of the process's address space, separated by \0. Because the high end is a fixed "ceiling", the start position of each string has to be computed from its length before the whole string is copied. In addition, to keep the address of argv[0] lower than that of argv[1], the array has to be traversed from back to front. Here is an example borrowed from that article to demonstrate this:

#include <stdio.h>
#include <stdlib.h>

int data1 = 2;
int data2 = 3;
int data3;
int data4;

int main (int argc, char *argv[])
{
  char buf1[1024] = { 0 };
  char buf2[1024] = { 0 };
  char *buf3 = malloc(1024);
  char *buf4 = malloc(1024);
  printf ("onstack %p, %p\n",
    buf1,
    buf2);

  extern char ** environ;
  printf ("env %p\n", environ);
  for (int i=0; environ[i] != 0; ++ i)
    printf ("env[%d] %p\n", i, environ[i]);

  printf ("arg %p\n", argv);
  for (int i=0; i < argc; ++ i)
    printf ("arg[%d] %p\n", i, argv[i]);

  printf ("onheap %p, %p\n",
    buf3,
    buf4);

  free (buf3);
  free (buf4);

  printf ("on bss %p, %p\n",
    &data3,
    &data4);

  printf ("on init %p, %p\n",
    &data1,
    &data2);

  printf ("on code %p\n", main);
  return 0;
}

Give it a few arbitrary parameters and run it:

> ./layout a b c d
onstack 0x7fff2757a970, 0x7fff2757a570
env 0x7fff2757aea8
env[0] 0x7fff2757b4fb
env[1] 0x7fff2757b511
env[2] 0x7fff2757b534
env[3] 0x7fff2757b544
env[4] 0x7fff2757b558
env[5] 0x7fff2757b566
env[6] 0x7fff2757b587
env[7] 0x7fff2757b5af
env[8] 0x7fff2757b5c7
env[9] 0x7fff2757b5e7
env[10] 0x7fff2757b5fa
env[11] 0x7fff2757b608
env[12] 0x7fff2757bcc0
env[13] 0x7fff2757bcc8
env[14] 0x7fff2757be1d
env[15] 0x7fff2757be3b
env[16] 0x7fff2757be59
env[17] 0x7fff2757be6a
env[18] 0x7fff2757be81
env[19] 0x7fff2757be9b
env[20] 0x7fff2757bea3
env[21] 0x7fff2757beb3
env[22] 0x7fff2757bec4
env[23] 0x7fff2757bee0
env[24] 0x7fff2757bf13
env[25] 0x7fff2757bf36
env[26] 0x7fff2757bf62
env[27] 0x7fff2757bf83
env[28] 0x7fff2757bfa1
env[29] 0x7fff2757bfc3
env[30] 0x7fff2757bfce
arg 0x7fff2757ae78
arg[0] 0x7fff2757b4ea
arg[1] 0x7fff2757b4f3
arg[2] 0x7fff2757b4f5
arg[3] 0x7fff2757b4f7
arg[4] 0x7fff2757b4f9
onheap 0x1056010, 0x1056420
on bss 0x6066b8, 0x6066bc
on init 0x606224, 0x606228
on code 0x40179d

Look at the addresses of argv and envp: the envp strings are higher than the argv strings, and within each array a lower index has a lower address. Combined with the memory layout described above, the parameters therefore have to be laid out as follows:

Lay out envp first, traversing envp from back to front
Then lay out argv, traversing argv from back to front

The code does say that too:

    retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out;

	retval = copy_strings(bprm->argc, argv, bprm);
	if (retval < 0)
		goto out;

This excerpt was shown earlier in do_execve_common: envp is laid out first, followed by argv. Now look at how each array is processed internally:

/*
 * 'copy_strings()' copies argument/environment strings from the old
 * processes's memory to the new process's stack.  The call to get_user_pages()
 * ensures the destination page is created and not swapped out.
 */
static int copy_strings(int argc, struct user_arg_ptr argv,
			struct linux_binprm *bprm)
{
	struct page *kmapped_page = NULL;
	char *kaddr = NULL;
	unsigned long kpos = 0;
	int ret;

	while (argc-- > 0) {

Iterate through the array in reverse order

        const char __user *str;
		int len;
		unsigned long pos;
        ret = -EFAULT;

		str = get_user_arg_ptr(argv, argc);
		if (IS_ERR(str))
			goto out;

		len = strnlen_user(str, MAX_ARG_STRLEN);
		if (!len)
			goto out;

		ret = -E2BIG;
		if (!valid_arg_len(bprm, len))
			goto out;

		/* We're going to work our way backwords. */
		pos = bprm->p;
		str += len;
		bprm->p -= len;

Compute the length of the current string and reserve space for it. Note that a copy may cross a page boundary, so the string itself is also copied in chunks, from its end towards its beginning

        while (len > 0) {
			int offset, bytes_to_copy;

			if (fatal_signal_pending(current)) {
				ret = -ERESTARTNOHAND;
				goto out;
			}
			cond_resched();

			offset = pos % PAGE_SIZE;
			if (offset == 0)
				offset = PAGE_SIZE;

			bytes_to_copy = offset;
			if (bytes_to_copy > len)
				bytes_to_copy = len;

			offset -= bytes_to_copy;
			pos -= bytes_to_copy;
			str -= bytes_to_copy;
			len -= bytes_to_copy;

			if (!kmapped_page || kpos != (pos & PAGE_MASK)) {
				struct page *page;

				page = get_arg_page(bprm, pos, 1);
				if (!page) {
					ret = -E2BIG;
					goto out;
				}

				if (kmapped_page) {
					flush_kernel_dcache_page(kmapped_page);
					kunmap(kmapped_page);
					put_arg_page(kmapped_page);
				}
				kmapped_page = page;
				kaddr = kmap(kmapped_page);
				kpos = pos & PAGE_MASK;
				flush_arg_page(bprm, kpos, kmapped_page);
			}
			if (copy_from_user(kaddr+offset, str, bytes_to_copy)) {
				ret = -EFAULT;
				goto out;
			}
		}
	}
	ret = 0;

Copy a single string, which may be large enough to span several pages; the actual work is done by copy_from_user

out:
	if (kmapped_page) {
		flush_kernel_dcache_page(kmapped_page);
		kunmap(kmapped_page);
		put_arg_page(kmapped_page);
	}
	return ret;
}

Error handling: unmap and release the last mapped page before returning

With the layout of argv and envp understood, inserting an element at the front of the array becomes easy. The first element has to be removed first, though, and Linux uses a trick here: it simply moves the argv pointer (bprm->p) past the first argument:

/*
 * Arguments are '\0' separated strings found at the location bprm->p
 * points to; chop off the first by relocating brpm->p to right after
 * the first '\0' encountered.
 */
int remove_arg_zero(struct linux_binprm *bprm)
{
	int ret = 0;
	unsigned long offset;
	char *kaddr;
	struct page *page;

	if (!bprm->argc)
		return 0;

	do {
		offset = bprm->p & ~PAGE_MASK;
		page = get_arg_page(bprm, bprm->p, 0);
		if (!page) {
			ret = -EFAULT;
			goto out;
		}
		kaddr = kmap_atomic(page);

		for (; offset < PAGE_SIZE && kaddr[offset];
				offset++, bprm->p++)
			;

		kunmap_atomic(kaddr);
		put_arg_page(page);

		if (offset == PAGE_SIZE)
			free_arg_page(bprm, (bprm->p >> PAGE_SHIFT) - 1);
	} while (offset == PAGE_SIZE);

	bprm->p++;
	bprm->argc--;
	ret = 0;

out:
	return ret;
}

After the update, bprm->p points to the second argument and argc is reduced by 1; the skipped space is overwritten automatically when new arguments are inserted later:

    retval = copy_strings_kernel(1, &bprm->interp, bprm);
	if (retval < 0) return retval; 
	bprm->argc++;
	if (i_arg) {
		retval = copy_strings_kernel(1, &i_arg, bprm);
		if (retval < 0) return retval; 
		bprm->argc++;
	}
	retval = copy_strings_kernel(1, &i_name, bprm);
	if (retval) return retval; 
	bprm->argc++;

copy_strings_kernel calls copy_strings with a source string that lives in kernel space, so everything goes through the logic described above; the arguments just have to be pushed in reverse order. This code was shown in load_script earlier; does it make sense now?

summary

Back to the three characteristics listed at the beginning:

The first line of an interpreter file, which starts with a shebang, must not exceed 128 characters
The shebang can specify at most one parameter
The command and parameter specified by the shebang become the first two parameters of the new process, with the user-supplied parameters following in order

The shebang length limit of 128 is tied to the read-ahead length used throughout execve (BINPRM_BUF_SIZE) and to the formats specified through binfmt_misc, so it does not seem easy to lift.

Also by reading through the source code, the following additional knowledge was gained:

Interpreter files can be nested; the nesting depth is tracked in bprm->recursion_depth, and kernel 3.10 fails with -ELOOP when the chain gets too deep
The command name in the first line of an interpreter file cannot contain whitespace characters
Command-line arguments are copied into memory from back to front, which mainly makes inserting elements at the head of the array convenient; if elements had to be inserted at the tail instead, the layout would probably have to be filled from the front

Finally, as for shebang support for multiple arguments, it looks achievable by modifying binfmt_script; this is left to interested readers as homework, haha.
