From: Menglong Dong <[email protected]>
As Luis Chamberlain suggested, I split the patch:
[init/initramfs.c: make initramfs support pivot_root]
(https://lore.kernel.org/linux-fsdevel/[email protected]/)
into three.
The goal of the series patches is to make pivot_root() support initramfs.
In the first patch, I introduce the function ramdisk_exec_exist(), which
is used to check the exist of 'ramdisk_execute_command' in relative path
mode.
In the second patch, I create a second mount, which is called
'user root', and make it become the root. Therefore, the root has a
parent mount, and it can be umounted or pivot_root.
Before change root, I have to check the exist of ramdisk_execute_command,
because 'user root' should be umounted if ramdisk_execute_command not
exist. 'user root' is mounted on '/root', and cpio is unpacked to it. So
I have to use relative path to do this check, as 'user root' is not the
root yet.
Maybe I can do the check after change root, but it seems complex to
change root back to '/'. What's weird is that I try to move 'user root'
from '/root' to '/', but the absolute path lookup seems never follow the
mount. That's why I introduced ramdisk_exec_exist.
In the third patch, I fix rootfs_fs_type with ramfs, as it is not used
directly any more, and it make no sense to switch it between ramfs and
tmpfs, just fix it with ramfs to simplify the code.
Menglong Dong (3):
init/main.c: introduce function ramdisk_exec_exist()
init/do_cmounts.c: introduce 'user_root' for initramfs
init/do_mounts.c: fix rootfs_fs_type with ramfs
fs/namespace.c | 2 --
include/linux/init.h | 1 -
init/do_mounts.c | 82 +++++++++++++++++++++++++++++++++++++-------
init/do_mounts.h | 7 +++-
init/initramfs.c | 10 ++++++
init/main.c | 17 ++++++++-
6 files changed, 101 insertions(+), 18 deletions(-)
--
2.31.1
From: Menglong Dong <[email protected]>
Introduce the function ramdisk_exec_exist, which is used to check
the exist of 'ramdisk_execute_command'.
It can do absolute path and relative path check. For relative path,
it will ignore '/' and '.' in the start of
'ramdisk_execute_command'.
Signed-off-by: Menglong Dong <[email protected]>
---
init/main.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/init/main.c b/init/main.c
index eb01e121d2f1..95cab17046e0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1522,6 +1522,21 @@ void __init console_on_rootfs(void)
fput(file);
}
+bool __init ramdisk_exec_exist(bool absolute)
+{
+ char *tmp_command = ramdisk_execute_command;
+
+ if (!tmp_command)
+ return false;
+
+ if (!absolute) {
+ while (*tmp_command == '/' || *tmp_command == '.')
+ tmp_command++;
+ }
+
+ return init_eaccess(tmp_command) == 0;
+}
+
static noinline void __init kernel_init_freeable(void)
{
/*
@@ -1568,7 +1583,7 @@ static noinline void __init kernel_init_freeable(void)
* check if there is an early userspace init. If yes, let it do all
* the work
*/
- if (init_eaccess(ramdisk_execute_command) != 0) {
+ if (!ramdisk_exec_exist(true)) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
--
2.31.1
From: Menglong Dong <[email protected]>
During the kernel initialization, the root of the mount tree is
created with file system type of ramfs or tmpfs.
While using initramfs as the root file system, cpio file is unpacked
into the rootfs. Thus, this rootfs is exactly what users see in user
space, and some problems arose: this rootfs has no parent mount,
which make it can't be umounted or pivot_root.
'pivot_root' is used to change the rootfs and clean the old mounts,
and it is essential for some users, such as docker. Docker use
'pivot_root' to change the root fs of a process if the current root
fs is a block device of initrd. However, when it comes to initramfs,
things is different: docker has to use 'chroot()' to change the root
fs, as 'pivot_root()' is not supported in initramfs.
The usage of 'chroot()' to create root fs for a container introduced
a lot problems.
First, 'chroot()' can't clean the old mountpoints which inherited
from the host. It means that the mountpoints in host will have a
duplicate in every container. Users may remove a USB after it
is umounted successfully in the host. However, the USB may still
be mounted in containers, although it can't be seen by the 'mount'
commond. This means the USB is not released yet, and data may not
write back. Therefore, data lose arise.
Second, net-namespace leak is another problem. The net-namespace
of containers will be mounted in /var/run/docker/netns/ in host
by dockerd. It means that the net-namespace of a container will
be mounted in containers which are created after it. Things
become worse now, as the net-namespace can't be remove after
the destroy of that container, as it is still mounted in other
containers. If users want to recreate that container, he will
fail if a certain mac address is to be binded with the container,
as it is not release yet.
In this patch, a second mount, which is called 'user root', is
created before 'cpio' unpacking. The file system of 'user root'
is exactly the same as rootfs, and both ramfs and tmpfs are
supported. Then, the 'cpio' is unpacked into the 'user root'.
Now, the rootfs has a parent mount, and pivot_root() will be
supported for initramfs.
What's more, after this patch, 'rootflags' in boot cmd is supported
by initramfs. Therefore, users can set the size of tmpfs with
'rootflags=size=1024M'.
Signed-off-by: Menglong Dong <[email protected]>
---
init/do_mounts.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
init/do_mounts.h | 7 ++++-
init/initramfs.c | 10 +++++++
3 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/init/do_mounts.c b/init/do_mounts.c
index a78e44ee6adb..943cb58846fb 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -617,6 +617,78 @@ void __init prepare_namespace(void)
init_chroot(".");
}
+#ifdef CONFIG_TMPFS
+static __init bool is_tmpfs_enabled(void)
+{
+ return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
+ !saved_root_name[0];
+}
+#endif
+
+static __init bool is_ramfs_enabled(void)
+{
+ return true;
+}
+
+struct fs_user_root {
+ bool (*enabled)(void);
+ char *dev_name;
+ char *fs_name;
+};
+
+static struct fs_user_root user_roots[] __initdata = {
+#ifdef CONFIG_TMPFS
+ {.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
+#endif
+ {.fs_name = "ramfs", .enabled = is_ramfs_enabled }
+};
+static struct fs_user_root * __initdata user_root;
+
+/* Mount the user_root on '/'. */
+int __init mount_user_root(void)
+{
+ return do_mount_root(user_root->dev_name,
+ user_root->fs_name,
+ root_mountflags & ~MS_RDONLY,
+ root_mount_data);
+}
+
+/*
+ * This function is used to chroot to new initramfs root that
+ * we unpacked on success. It will chdir to '/' and umount
+ * the secound mount on failure.
+ */
+void __init end_mount_user_root(bool succeed)
+{
+ if (!succeed)
+ goto on_failed;
+
+ if (!ramdisk_exec_exist(false))
+ goto on_failed;
+
+ init_mount(".", "/", NULL, MS_MOVE, NULL);
+ init_chroot(".");
+ return;
+
+on_failed:
+ init_chdir("/");
+ init_umount("/..", 0);
+}
+
+void __init init_user_rootfs(void)
+{
+ struct fs_user_root *root;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(user_roots); i++) {
+ root = &user_roots[i];
+ if (root->enabled()) {
+ user_root = root;
+ break;
+ }
+ }
+}
+
static bool is_tmpfs;
static int rootfs_init_fs_context(struct fs_context *fc)
{
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 7a29ac3e427b..92c004bdd320 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -10,9 +10,14 @@
#include <linux/root_dev.h>
#include <linux/init_syscalls.h>
+extern int root_mountflags;
+
void mount_block_root(char *name, int flags);
void mount_root(void);
-extern int root_mountflags;
+int mount_user_root(void);
+void end_mount_user_root(bool succeed);
+void init_user_rootfs(void);
+bool ramdisk_exec_exist(bool abs);
static inline __init int create_dev(char *name, dev_t dev)
{
diff --git a/init/initramfs.c b/init/initramfs.c
index af27abc59643..ffa78932ae65 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -16,6 +16,8 @@
#include <linux/namei.h>
#include <linux/init_syscalls.h>
+#include "do_mounts.h"
+
static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
loff_t *pos)
{
@@ -682,15 +684,23 @@ static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
else
printk(KERN_INFO "Unpacking initramfs...\n");
+ init_user_rootfs();
+
+ if (mount_user_root())
+ panic("Failed to create user root");
+
err = unpack_to_rootfs((char *)initrd_start, initrd_end - initrd_start);
if (err) {
+ end_mount_user_root(false);
#ifdef CONFIG_BLK_DEV_RAM
populate_initrd_image(err);
#else
printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
#endif
+ goto done;
}
+ end_mount_user_root(true);
done:
/*
* If the initrd region is overlapped with crashkernel reserved region,
--
2.31.1
From: Menglong Dong <[email protected]>
As for the existence of 'user root' which is introduced in previous
patch, 'rootfs_fs_type', which is used as the root of mount tree,
is not used directly any more. So it make no sense to switch it
between ramfs and tmpfs, just fix it with ramfs to simplify the
code.
Signed-off-by: Menglong Dong <[email protected]>
---
fs/namespace.c | 2 --
include/linux/init.h | 1 -
init/do_mounts.c | 18 +-----------------
3 files changed, 1 insertion(+), 20 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index f63337828e1c..8d2b57938e3a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -17,7 +17,6 @@
#include <linux/security.h>
#include <linux/cred.h>
#include <linux/idr.h>
-#include <linux/init.h> /* init_rootfs */
#include <linux/fs_struct.h> /* get_fs_root et.al. */
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
#include <linux/file.h>
@@ -4241,7 +4240,6 @@ void __init mnt_init(void)
if (!fs_kobj)
printk(KERN_WARNING "%s: kobj create error\n", __func__);
shmem_init();
- init_rootfs();
init_mount_tree();
}
diff --git a/include/linux/init.h b/include/linux/init.h
index 045ad1650ed1..86bd92bb9550 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -148,7 +148,6 @@ extern unsigned int reset_devices;
/* used by init/main.c */
void setup_arch(char **);
void prepare_namespace(void);
-void __init init_rootfs(void);
extern struct file_system_type rootfs_fs_type;
#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 943cb58846fb..6d1253f75bb0 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -689,24 +689,8 @@ void __init init_user_rootfs(void)
}
}
-static bool is_tmpfs;
-static int rootfs_init_fs_context(struct fs_context *fc)
-{
- if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
- return shmem_init_fs_context(fc);
-
- return ramfs_init_fs_context(fc);
-}
-
struct file_system_type rootfs_fs_type = {
.name = "rootfs",
- .init_fs_context = rootfs_init_fs_context,
+ .init_fs_context = ramfs_init_fs_context,
.kill_sb = kill_litter_super,
};
-
-void __init init_rootfs(void)
-{
- if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
- (!root_fs_names || strstr(root_fs_names, "tmpfs")))
- is_tmpfs = true;
-}
--
2.31.1
On Sat, May 22, 2021 at 07:31:55PM +0800, [email protected] wrote:
> From: Menglong Dong <[email protected]>
>
> As for the existence of 'user root' which is introduced in previous
> patch, 'rootfs_fs_type', which is used as the root of mount tree,
> is not used directly any more. So it make no sense to switch it
> between ramfs and tmpfs, just fix it with ramfs to simplify the
> code.
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index 943cb58846fb..6d1253f75bb0 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -689,24 +689,8 @@ void __init init_user_rootfs(void)
> }
> }
>
> -static bool is_tmpfs;
> -static int rootfs_init_fs_context(struct fs_context *fc)
> -{
> - if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
> - return shmem_init_fs_context(fc);
> -
> - return ramfs_init_fs_context(fc);
> -}
> -
> struct file_system_type rootfs_fs_type = {
Then why not also just rename rootfs_fs_type to ram_rootfs_fs_type to
make this even clearer now?
> .name = "rootfs",
> - .init_fs_context = rootfs_init_fs_context,
> + .init_fs_context = ramfs_init_fs_context,
> .kill_sb = kill_litter_super,
> };
Luis
Cc'ing Josh as I think he might be interested in this.
On Sat, May 22, 2021 at 07:31:54PM +0800, [email protected] wrote:
> From: Menglong Dong <[email protected]>
>
> During the kernel initialization, the root of the mount tree is
> created with file system type of ramfs or tmpfs.
ramfs (initrd)
> While using initramfs as the root file system, cpio file is unpacked
> into the rootfs. Thus, this rootfs is exactly what users see in user
> space, and some problems arose: this rootfs has no parent mount,
> which make it can't be umounted or pivot_root.
> 'pivot_root' is used to change the rootfs and clean the old mounts,
> and it is essential for some users, such as docker. Docker use
> 'pivot_root' to change the root fs of a process if the current root
> fs is a block device of initrd. However, when it comes to initramfs,
> things is different: docker has to use 'chroot()' to change the root
> fs, as 'pivot_root()' is not supported in initramfs.
>
> The usage of 'chroot()' to create root fs for a container introduced
> a lot problems.
>
> First, 'chroot()' can't clean the old mountpoints which inherited
> from the host. It means that the mountpoints in host will have a
> duplicate in every container. Users may remove a USB after it
> is umounted successfully in the host. However, the USB may still
> be mounted in containers, although it can't be seen by the 'mount'
> commond. This means the USB is not released yet, and data may not
> write back. Therefore, data lose arise.
>
> Second, net-namespace leak is another problem. The net-namespace
> of containers will be mounted in /var/run/docker/netns/ in host
> by dockerd. It means that the net-namespace of a container will
> be mounted in containers which are created after it. Things
> become worse now, as the net-namespace can't be remove after
> the destroy of that container, as it is still mounted in other
> containers. If users want to recreate that container, he will
> fail if a certain mac address is to be binded with the container,
> as it is not release yet.
I think you can clarify this a bit more with:
If using container platforms such as Docker, upon initialization it
wants to use pivot_root() so that currently mounted devices do not
propagate to containers. An example of value in this is that
a USB device connected prior to the creation of a containers on the
host gets disconnected after a container is created; if the
USB device was mounted on containers, but already removed and
umounted on the host, the mount point will not go away untill all
containers unmount the USB device.
Another reason for container platforms such as Docker to use pivot_root
is that upon initialization the net-namspace is mounted under
/var/run/docker/netns/ on the host by dockerd. Without pivot_root
Docker must either wait to create the network namespace prior to
the creation of containers or simply deal with leaking this to each
container.
pivot_root is supported if the rootfs is a ramfs (initrd), but its
not supported if the rootfs uses an initramfs (tmpfs). This means
container platforms today must resort to using ramfs (initrd) if
they want to pivot_root from the rootfs. A workaround to use chroot()
is not a clean viable option given every container will have a
duplicate of every mount point on the host.
In order to support using container platforms such as Docker on
all the supported rootfs types we must extend Linux to support
pivot_root on initramfs as well. This patch does the work to do
just that.
So remind me.. so it would seem that if the rootfs uses a ramfs (initrd)
that pivot_root works just fine. Why is that? Did someone add support
for that? Has that always been the case that it works? If not, was it a
consequence of how ramfs (initrd) works?
And finally, why can't we share the same mechanism used for ramfs
(initrd) for initramfs (tmpfs)?
> In this patch, a second mount, which is called 'user root', is
> created before 'cpio' unpacking. The file system of 'user root'
> is exactly the same as rootfs, and both ramfs and tmpfs are
> supported. Then, the 'cpio' is unpacked into the 'user root'.
> Now, the rootfs has a parent mount, and pivot_root() will be
> supported for initramfs.
How about something like:
In order to support pivot_root on initramfs we introduce a second
"user_root" mount which is created before we do the cpio unpacking.
The filesystem of the "user_root" mount is the same the rootfs.
It begs the question, why add this infrastructure to suppor this for
ramfs (initrd) if we only need this hack for initramfs (tmpfs)?
> What's more, after this patch, 'rootflags' in boot cmd is supported
> by initramfs. Therefore, users can set the size of tmpfs with
> 'rootflags=size=1024M'.
Why is that exactly?
> Signed-off-by: Menglong Dong <[email protected]>
> ---
> init/do_mounts.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
> init/do_mounts.h | 7 ++++-
> init/initramfs.c | 10 +++++++
> 3 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index a78e44ee6adb..943cb58846fb 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -617,6 +617,78 @@ void __init prepare_namespace(void)
> init_chroot(".");
> }
>
> +#ifdef CONFIG_TMPFS
> +static __init bool is_tmpfs_enabled(void)
> +{
> + return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
> + !saved_root_name[0];
> +}
> +#endif
> +
> +static __init bool is_ramfs_enabled(void)
> +{
> + return true;
> +}
> +
> +struct fs_user_root {
> + bool (*enabled)(void);
> + char *dev_name;
What's the point of dev_name if its never set?
> + char *fs_name;
> +};
> +
> +static struct fs_user_root user_roots[] __initdata = {
> +#ifdef CONFIG_TMPFS
> + {.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
> +#endif
> + {.fs_name = "ramfs", .enabled = is_ramfs_enabled }
> +};
> +static struct fs_user_root * __initdata user_root;
> +
> +/* Mount the user_root on '/'. */
> +int __init mount_user_root(void)
> +{
> + return do_mount_root(user_root->dev_name,
See, isn't dev_name here always NULL?
> + user_root->fs_name,
> + root_mountflags & ~MS_RDONLY,
> + root_mount_data);
> +}
> +
> +/*
> + * This function is used to chroot to new initramfs root that
> + * we unpacked on success.
Might be a good place to document that we do this so folks can
pivot_root on rootfs, and why that is desirable (mentioned above on the
commit log edits I suggested). Otherwise I don't think its easy for a
reader of the code to understand why we are doing all this work.
> It will chdir to '/' and umount
> + * the secound mount on failure.
> + */
> +void __init end_mount_user_root(bool succeed)
> +{
> + if (!succeed)
> + goto on_failed;
> +
> + if (!ramdisk_exec_exist(false))
> + goto on_failed;
> +
> + init_mount(".", "/", NULL, MS_MOVE, NULL);
> + init_chroot(".");
> + return;
> +
> +on_failed:
> + init_chdir("/");
> + init_umount("/..", 0);
> +}
Is anything extra needed on shutdown / reboot?
Luis
On Sat, May 22, 2021 at 07:31:55PM +0800, [email protected] wrote:
> From: Menglong Dong <[email protected]>
>
> As for the existence of 'user root' which is introduced in previous
> patch, 'rootfs_fs_type', which is used as the root of mount tree,
> is not used directly any more. So it make no sense to switch it
> between ramfs and tmpfs, just fix it with ramfs to simplify the
> code.
You also noted this could be arbitrary. I don't see why its true that
the context used for init_mount_tree() can be arbitrary. And if it can
be, then why remove it / replace it with something more NULL like?
Luis
> Signed-off-by: Menglong Dong <[email protected]>
> ---
> fs/namespace.c | 2 --
> include/linux/init.h | 1 -
> init/do_mounts.c | 18 +-----------------
> 3 files changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index f63337828e1c..8d2b57938e3a 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -17,7 +17,6 @@
> #include <linux/security.h>
> #include <linux/cred.h>
> #include <linux/idr.h>
> -#include <linux/init.h> /* init_rootfs */
> #include <linux/fs_struct.h> /* get_fs_root et.al. */
> #include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
> #include <linux/file.h>
> @@ -4241,7 +4240,6 @@ void __init mnt_init(void)
> if (!fs_kobj)
> printk(KERN_WARNING "%s: kobj create error\n", __func__);
> shmem_init();
> - init_rootfs();
> init_mount_tree();
> }
>
> diff --git a/include/linux/init.h b/include/linux/init.h
> index 045ad1650ed1..86bd92bb9550 100644
> --- a/include/linux/init.h
> +++ b/include/linux/init.h
> @@ -148,7 +148,6 @@ extern unsigned int reset_devices;
> /* used by init/main.c */
> void setup_arch(char **);
> void prepare_namespace(void);
> -void __init init_rootfs(void);
> extern struct file_system_type rootfs_fs_type;
>
> #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index 943cb58846fb..6d1253f75bb0 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -689,24 +689,8 @@ void __init init_user_rootfs(void)
> }
> }
>
> -static bool is_tmpfs;
> -static int rootfs_init_fs_context(struct fs_context *fc)
> -{
> - if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
> - return shmem_init_fs_context(fc);
> -
> - return ramfs_init_fs_context(fc);
> -}
> -
> struct file_system_type rootfs_fs_type = {
> .name = "rootfs",
> - .init_fs_context = rootfs_init_fs_context,
> + .init_fs_context = ramfs_init_fs_context,
> .kill_sb = kill_litter_super,
> };
You mentioned
> -
> -void __init init_rootfs(void)
> -{
> - if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
> - (!root_fs_names || strstr(root_fs_names, "tmpfs")))
> - is_tmpfs = true;
> -}
> --
> 2.31.1
>
On Sat, May 22, 2021 at 07:31:53PM +0800, [email protected] wrote:
> Introduce the function ramdisk_exec_exist, which is used to check
> the exist of 'ramdisk_execute_command'.
>
> It can do absolute path and relative path check. For relative path,
> it will ignore '/' and '.' in the start of
> 'ramdisk_execute_command'.
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1522,6 +1522,21 @@ void __init console_on_rootfs(void)
> fput(file);
> }
>
> +bool __init ramdisk_exec_exist(bool absolute)
> +{
> + char *tmp_command = ramdisk_execute_command;
> +
> + if (!tmp_command)
> + return false;
> +
> + if (!absolute) {
> + while (*tmp_command == '/' || *tmp_command == '.')
> + tmp_command++;
As far as I can tell, this will break if the user wants to use
".mybinary" or ".mydir/mybinary" as the name of their init program.
For that matter, it would break "...prog" or "...somedir/prog", which
would be strange but not something the kernel should prevent.
I don't think this code should be attempting to recreate
relative-to-absolute filename resolution.
On Tue, May 25, 2021 at 9:02 AM Josh Triplett <[email protected]> wrote:
>
......
>
> As far as I can tell, this will break if the user wants to use
> ".mybinary" or ".mydir/mybinary" as the name of their init program.
>
> For that matter, it would break "...prog" or "...somedir/prog", which
> would be strange but not something the kernel should prevent.
>
Wow, seems I didn't give enough thought to it.
> I don't think this code should be attempting to recreate
> relative-to-absolute filename resolution.
Trust me, I don't want to do it either. However, I need to check if
ramdisk_execute_command exist before chroot while the cpio is unpacked
to '/root'.
Maybe I can check it after chroot, but I need to chroot back if it not
exist. Can I chroot back in a nice way?
I tried to move the mount on '/root' to '/' before I do this check in
absolute path, but seems '/' is special, the lookup of '/init' never
follow the mount on '/' and it can't be found. However, if I lookup
'/../init', it can be found!
Is there any one have a good idea? Or I have to dig into the code
of 'kern_path()' and figure out the reason.
Thanks!
Menglong Dong
On Tue, May 25, 2021 at 8:44 AM Luis Chamberlain <[email protected]> wrote:
>
> Cc'ing Josh as I think he might be interested in this.
>
......
>
> I think you can clarify this a bit more with:
>
> If using container platforms such as Docker, upon initialization it
> wants to use pivot_root() so that currently mounted devices do not
> propagate to containers. An example of value in this is that
> a USB device connected prior to the creation of a containers on the
> host gets disconnected after a container is created; if the
> USB device was mounted on containers, but already removed and
> umounted on the host, the mount point will not go away untill all
> containers unmount the USB device.
Thanks! It's really difficult for me to organize these words.
>
> So remind me.. so it would seem that if the rootfs uses a ramfs (initrd)
> that pivot_root works just fine. Why is that? Did someone add support
> for that? Has that always been the case that it works? If not, was it a
> consequence of how ramfs (initrd) works?
>
> And finally, why can't we share the same mechanism used for ramfs
> (initrd) for initramfs (tmpfs)?
In fact, initrd is totally different from initramfs. Initrd is not using
ramfs, it actually is a block fs, which is mounted on the first mount.
And initramfs can use ramfs or tmpfs.
During pivot_root, the mount of the root will be unmounted from its parent
mount. Initrd or block device fs has a parent mount, which is the first mount.
However, initramfs doesn't has a parent mount, because the first mount is
actually the root, which cpio is unpacked to.
The first mount is used by init_task, and I think it can't be unmounted,
because it is used by the kernel.
So the primary cause that pivot_root doesn't support is that it use
the first mount as its root.
>
> > What's more, after this patch, 'rootflags' in boot cmd is supported
> > by initramfs. Therefore, users can set the size of tmpfs with
> > 'rootflags=size=1024M'.
>
> Why is that exactly?
During the mount of user_mount, I passed root_mountflags and root_mount_data
to do_mount_root(), which make 'rootflags' works for 'user root'.
> > +
> > +struct fs_user_root {
> > + bool (*enabled)(void);
> > + char *dev_name;
>
> What's the point of dev_name if its never set?
Seems it's better to make it be set, I'll do it.
>
> Might be a good place to document that we do this so folks can
> pivot_root on rootfs, and why that is desirable (mentioned above on the
> commit log edits I suggested). Otherwise I don't think its easy for a
> reader of the code to understand why we are doing all this work.
>
Ok, sounds nice!
>
> Is anything extra needed on shutdown / reboot?
>
I'm not sure, seems no. The way I create 'user root' is exactly the same
as a block root fs does.
Thanks!
Menglong Dong
On Tue, May 25, 2021 at 11:43 AM Menglong Dong <[email protected]> wrote:
>
> On Tue, May 25, 2021 at 9:02 AM Josh Triplett <[email protected]> wrote:
> >
> ......
> >
> > As far as I can tell, this will break if the user wants to use
> > ".mybinary" or ".mydir/mybinary" as the name of their init program.
> >
> > For that matter, it would break "...prog" or "...somedir/prog", which
> > would be strange but not something the kernel should prevent.
> >
>
> Wow, seems I didn't give enough thought to it.
>
> > I don't think this code should be attempting to recreate
> > relative-to-absolute filename resolution.
>
> Trust me, I don't want to do it either. However, I need to check if
> ramdisk_execute_command exist before chroot while the cpio is unpacked
> to '/root'.
>
> Maybe I can check it after chroot, but I need to chroot back if it not
> exist. Can I chroot back in a nice way?
>
> I tried to move the mount on '/root' to '/' before I do this check in
> absolute path, but seems '/' is special, the lookup of '/init' never
> follow the mount on '/' and it can't be found. However, if I lookup
> '/../init', it can be found!
>
I have figured it out. While path lookup, '/' won't follow the mount.
However, with the set of LOOKUP_DOWN, it will be followed.
So I will move the mount on '/root' to '/' and check the exist of
ramdisk_execute_command with LOOKUP_DOWN setted.
Seems there is still a long way to go on kernel......
Thanks!
Menglong Dong