In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
affected tiers, more than 10% of hosts show multiple files with different
content but the same inode number, and some servers have as many as 150
duplicated inode numbers with differing file content.
This causes tangible problems in production. For example, engineers working
on remote caches report that their application flags cache corruption: it
uses (device, inodenum) to establish the identity of a particular cache
object, and when that pair is no longer unique, the application refuses to
continue. Even worse, an application may not detect the collision at all and
carry on regardless, causing phantom, hard-to-debug behaviour.
In general, userspace applications expect that (device, inodenum) is enough
to uniquely identify one inode, which seems fair enough.
One might also need to check the generation, but in this case:
1. That's not currently exposed to userspace
(ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
same inode number on one device.
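As an illustration, the identity check such applications perform amounts to
something like this (a hypothetical sketch, not any particular cache's code):

	#include <sys/stat.h>

	/*
	 * Treat two paths as the same object iff they share
	 * (st_dev, st_ino): exactly the invariant that i_ino
	 * wraparound breaks. Returns <0 on stat error.
	 */
	static int same_object(const char *a, const char *b)
	{
		struct stat sa, sb;

		if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
			return -1;
		return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
	}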
In order to mitigate this, we take a two-pronged approach:
1. Moving inum generation from being global to per-sb for tmpfs. This by
itself reduces i_ino churn, and works on both 64- and 32-bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported only on machines
with a 64-bit ino_t: we allow users to mount tmpfs with a new inode64
option that uses the full width of ino_t, or to set CONFIG_TMPFS_INODE64.
You can see how this compares to previous related patches which didn't
implement this per-superblock:
- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/
Chris Down (2):
tmpfs: Add per-superblock i_ino support
tmpfs: Support 64-bit inums per-sb
Documentation/filesystems/tmpfs.txt | 11 ++++
fs/Kconfig | 15 +++++
include/linux/shmem_fs.h | 2 +
mm/shmem.c | 94 ++++++++++++++++++++++++++++-
4 files changed, 121 insertions(+), 1 deletion(-)
--
2.24.1
get_next_ino has a number of problems:
- It uses and returns a uint, which is susceptible to overflow if a lot of
volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
means it's not that difficult to cause inode number wraparounds on a
single device, which can result in having multiple distinct inodes
with the same inode number.
This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size
per-device, for example, allowing users to choose whether to use 32- or
64-bit inodes for each tmpfs mount. This is implemented in the next
commit.
Signed-off-by: Chris Down <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/shmem_fs.h | 1 +
mm/shmem.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)
v5: Nothing in code, just resending with correct linux-mm domain.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index de8e4b71e3ba..7fac91f490dc 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
+ ino_t next_ino; /* The next per-sb inode number to use */
struct mempolicy *mpol; /* default memory policy for mappings */
spinlock_t shrinklist_lock; /* Protects shrinklist */
struct list_head shrinklist; /* List of shinkable inodes */
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8cc1a48..9e97ba972225 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2236,6 +2236,12 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+/*
+ * shmem_get_inode - reserve, allocate, and initialise a new inode
+ *
+ * If this tmpfs is from kern_mount we use get_next_ino, which is global, since
+ * inum churn there is low and this avoids taking locks.
+ */
static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
umode_t mode, dev_t dev, unsigned long flags)
{
@@ -2248,7 +2254,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
inode = new_inode(sb);
if (inode) {
- inode->i_ino = get_next_ino();
+ if (sb->s_flags & SB_KERNMOUNT) {
+ /*
+ * __shmem_file_setup, one of our callers, is lock-free:
+ * it doesn't hold stat_lock in shmem_reserve_inode
+ * since max_inodes is always 0, and is called from
+ * potentially unknown contexts. As such, use the global
+ * allocator which doesn't require the per-sb stat_lock.
+ */
+ inode->i_ino = get_next_ino();
+ } else {
+ spin_lock(&sbinfo->stat_lock);
+ if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+ /*
+ * Emulate get_next_ino uint wraparound for
+ * compatibility
+ */
+ sbinfo->next_ino = 1;
+ }
+ inode->i_ino = sbinfo->next_ino++;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -3662,6 +3689,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
#else
sb->s_flags |= SB_NOUSER;
#endif
+ sbinfo->next_ino = 1;
sbinfo->max_blocks = ctx->blocks;
sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
sbinfo->uid = ctx->uid;
--
2.24.1
The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:
1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
The inode64 and inode32 names are used based on existing precedent from
XFS.
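As a usage sketch (hypothetical mount point; option names as introduced
here): 'mount -t tmpfs -o size=1G,inode64 tmpfs /mytmpfs' opts a single
mount in, and 'mount -o remount,inode32 /mytmpfs' switches it back, so long
as no inum above 32 bits has been handed out yet.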
Signed-off-by: Chris Down <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/filesystems/tmpfs.txt | 11 +++++
fs/Kconfig | 15 +++++++
include/linux/shmem_fs.h | 1 +
mm/shmem.c | 66 ++++++++++++++++++++++++++++-
4 files changed, 92 insertions(+), 1 deletion(-)
v5: Nothing in code, just resending with correct linux-mm domain.
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 5ecbc03e6b2f..203e12a684c9 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -136,6 +136,15 @@ These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+inode64 Use 64-bit inode numbers
+inode32 Use 32-bit inode numbers
+
+On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
+not legal and will produce an error at mount time.
+
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
@@ -147,3 +156,5 @@ Updated:
Hugh Dickins, 4 June 2007
Updated:
KOSAKI Motohiro, 16 Mar 2010
+Updated:
+ Chris Down, 2 Jan 2020
diff --git a/fs/Kconfig b/fs/Kconfig
index 7b623e9fc1b0..e74f127df22a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -199,6 +199,21 @@ config TMPFS_XATTR
If unsure, say N.
+config TMPFS_INODE64
+ bool "Use 64-bit ino_t by default in tmpfs"
+ depends on TMPFS && 64BIT
+ default n
+ help
+ tmpfs has historically used only inode numbers as wide as an unsigned
+ int. In some cases this can cause wraparound, potentially resulting in
+ multiple files with the same inode number on a single device. This option
+ makes tmpfs use the full width of ino_t by default, similarly to the
+ inode64 mount option.
+
+ To override this default, use the inode32 or inode64 mount options.
+
+ If unsure, say N.
+
config HUGETLBFS
bool "HugeTLB file system support"
depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 7fac91f490dc..8925eb774518 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
+ bool full_inums; /* If i_ino should be uint or ino_t */
ino_t next_ino; /* The next per-sb inode number to use */
struct mempolicy *mpol; /* default memory policy for mappings */
spinlock_t shrinklist_lock; /* Protects shrinklist */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9e97ba972225..05de65112bed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,11 +115,13 @@ struct shmem_options {
kuid_t uid;
kgid_t gid;
umode_t mode;
+ bool full_inums;
int huge;
int seen;
#define SHMEM_SEEN_BLOCKS 1
#define SHMEM_SEEN_INODES 2
#define SHMEM_SEEN_HUGE 4
+#define SHMEM_SEEN_INUMS 8
};
#ifdef CONFIG_TMPFS
@@ -2261,15 +2263,24 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
* since max_inodes is always 0, and is called from
* potentially unknown contexts. As such, use the global
* allocator which doesn't require the per-sb stat_lock.
+ *
+ * No special behaviour is needed for
+ * sbinfo->full_inums, because it cannot be set
+ * manually for mounts of this type, and
+ * CONFIG_TMPFS_INODE64 only applies to user-visible
+ * mounts.
*/
inode->i_ino = get_next_ino();
} else {
spin_lock(&sbinfo->stat_lock);
- if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+ if (unlikely(!sbinfo->full_inums &&
+ sbinfo->next_ino > UINT_MAX)) {
/*
* Emulate get_next_ino uint wraparound for
* compatibility
*/
+ pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
+ __func__, MINOR(sb->s_dev));
sbinfo->next_ino = 1;
}
inode->i_ino = sbinfo->next_ino++;
@@ -3406,6 +3417,8 @@ enum shmem_param {
Opt_nr_inodes,
Opt_size,
Opt_uid,
+ Opt_inode32,
+ Opt_inode64,
};
static const struct fs_parameter_spec shmem_param_specs[] = {
@@ -3417,6 +3430,8 @@ static const struct fs_parameter_spec shmem_param_specs[] = {
fsparam_string("nr_inodes", Opt_nr_inodes),
fsparam_string("size", Opt_size),
fsparam_u32 ("uid", Opt_uid),
+ fsparam_flag ("inode32", Opt_inode32),
+ fsparam_flag ("inode64", Opt_inode64),
{}
};
@@ -3441,6 +3456,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
unsigned long long size;
char *rest;
int opt;
+ const char *err;
opt = fs_parse(fc, &shmem_fs_parameters, param, &result);
if (opt < 0)
@@ -3502,6 +3518,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
break;
}
goto unsupported_parameter;
+ case Opt_inode32:
+ ctx->full_inums = false;
+ ctx->seen |= SHMEM_SEEN_INUMS;
+ break;
+ case Opt_inode64:
+ if (sizeof(ino_t) < 8) {
+ err = "Cannot use inode64 with <64bit inums in kernel";
+ goto err_msg;
+ }
+ ctx->full_inums = true;
+ ctx->seen |= SHMEM_SEEN_INUMS;
+ break;
}
return 0;
@@ -3509,6 +3537,8 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key);
bad_value:
return invalf(fc, "tmpfs: Bad value for '%s'", param->key);
+err_msg:
+ return invalf(fc, "tmpfs: %s", err);
}
static int shmem_parse_options(struct fs_context *fc, void *data)
@@ -3593,8 +3623,16 @@ static int shmem_reconfigure(struct fs_context *fc)
}
}
+ if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
+ sbinfo->next_ino > UINT_MAX) {
+ err = "Current inum too high to switch to 32-bit inums";
+ goto out;
+ }
+
if (ctx->seen & SHMEM_SEEN_HUGE)
sbinfo->huge = ctx->huge;
+ if (ctx->seen & SHMEM_SEEN_INUMS)
+ sbinfo->full_inums = ctx->full_inums;
if (ctx->seen & SHMEM_SEEN_BLOCKS)
sbinfo->max_blocks = ctx->blocks;
if (ctx->seen & SHMEM_SEEN_INODES) {
@@ -3634,6 +3672,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+
+ /*
+ * Showing inode{64,32} might be useful even if it's the system default,
+ * since then people don't have to resort to checking both here and
+ * /proc/config.gz to confirm 64-bit inums were successfully applied
+ * (which may not even exist if IKCONFIG_PROC isn't enabled).
+ *
+ * We hide it when inode64 isn't the default and we are using 32-bit
+ * inodes, since that probably just means the feature isn't even under
+ * consideration.
+ *
+ * As such:
+ *
+ *                    +-----------------+-----------------+
+ *                    | TMPFS_INODE64=y | TMPFS_INODE64=n |
+ * +------------------+-----------------+-----------------+
+ * | full_inums=true  | show            | show            |
+ * | full_inums=false | show            | hide            |
+ * +------------------+-----------------+-----------------+
+ *
+ */
+ if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
+ seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
if (sbinfo->huge)
@@ -3681,6 +3742,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
ctx->blocks = shmem_default_max_blocks();
if (!(ctx->seen & SHMEM_SEEN_INODES))
ctx->inodes = shmem_default_max_inodes();
+ if (!(ctx->seen & SHMEM_SEEN_INUMS))
+ ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
} else {
sb->s_flags |= SB_NOUSER;
}
@@ -3694,6 +3757,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
sbinfo->uid = ctx->uid;
sbinfo->gid = ctx->gid;
+ sbinfo->full_inums = ctx->full_inums;
sbinfo->mode = ctx->mode;
sbinfo->huge = ctx->huge;
sbinfo->mpol = ctx->mpol;
--
2.24.1
On 2020/1/5 20:06, Chris Down wrote:
> [... full patch quoted; snipped down to the relevant hunk ...]
> + } else {
> + spin_lock(&sbinfo->stat_lock);
Using a spin_lock will affect performance. How about defining
unsigned long __percpu *last_ino_number; /* Last inode number */
atomic64_t shared_last_ino_number; /* Shared last inode number */
in shmem_sb_info instead? Which would perform better?
> [...]
On Mon, Jan 6, 2020 at 4:04 AM zhengbin (A) <[email protected]> wrote:
>
>
> On 2020/1/5 20:06, Chris Down wrote:
> > [... full patch quoted; snipped down to the relevant hunk ...]
> > + spin_lock(&sbinfo->stat_lock);
>
> Using a spin_lock will affect performance. How about defining
>
> unsigned long __percpu *last_ino_number; /* Last inode number */
> atomic64_t shared_last_ino_number; /* Shared last inode number */
> in shmem_sb_info instead? Which would perform better?
>
Please take a look at shmem_reserve_inode().
The spin lock is already being taken in shmem_get_inode(),
so there is nothing to be gained from complicating the next_ino counter.
This fact would have been a lot clearer if next_ino were incremented
inside shmem_reserve_inode() and its value returned for use by
shmem_get_inode(), but I am also fine with the code as it is, with the
comment above.
Thanks,
Amir.
zhengbin (A) writes:
>Using a spin_lock will affect performance
"Performance" isn't a binary. In discussions, you should avoid invoking the
performance boogeyman without presenting any real-world data. :-)
We already take this spin lock when setting the free inode count. The two
sites could be merged, but it seems unnecessary to conflate their purposes
at this time.
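For reference, the existing lock site looks roughly like this (paraphrased
and simplified from v5.5's mm/shmem.c):

	static int shmem_reserve_inode(struct super_block *sb)
	{
		struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

		if (sbinfo->max_inodes) {
			/* the same stat_lock the new code uses */
			spin_lock(&sbinfo->stat_lock);
			if (!sbinfo->free_inodes) {
				spin_unlock(&sbinfo->stat_lock);
				return -ENOSPC;
			}
			sbinfo->free_inodes--;
			spin_unlock(&sbinfo->stat_lock);
		}
		return 0;
	}

so incrementing next_ino under the same lock adds no extra locking.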
On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote:
> The default is still set to inode32 for backwards compatibility, but
> system administrators can opt in to the new 64-bit inode numbers by
> either:
>
> 1. Passing inode64 on the command line when mounting, or
> 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
>
> The inode64 and inode32 names are used based on existing precedent from
> XFS.
Please don't copy this misfeature of XFS.
The inode32/inode64 XFS options were a horrible hack made more than
20 years ago when NFSv2 was still in use and 64 bit inodes could
not be used for NFSv2 exports. It was then continued to be used
because 32bit NFSv3 clients were unable to handle 64 bit inodes.
It took 15 years for us to be able to essentially deprecate
inode32 (inode64 is the default behaviour), and we were very happy
to get that albatross off our necks. In reality, almost everything
out there in the world handles 64 bit inodes correctly
including 32 bit machines and 32bit binaries on 64 bit machines.
And, IMNSHO, there's no excuse these days for 32 bit binaries that
don't use the *64() syscall variants directly and hence support
64 bit inodes correctly out of the box on all platforms.
I don't think we should be repeating past mistakes by trying to
cater for broken 32 bit applications on 64 bit machines in this day
and age.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Dave Chinner writes:
>It took 15 years for us to be able to essentially deprecate
>inode32 (inode64 is the default behaviour), and we were very happy
>to get that albatross off our necks. In reality, almost everything
>out there in the world handles 64 bit inodes correctly
>including 32 bit machines and 32bit binaries on 64 bit machines.
>And, IMNSHO, there's no excuse these days for 32 bit binaries that
>don't use the *64() syscall variants directly and hence support
>64 bit inodes correctly out of the box on all platforms.
>
>I don't think we should be repeating past mistakes by trying to
>cater for broken 32 bit applications on 64 bit machines in this day
>and age.
I'm very glad to hear that. I strongly support moving to 64-bit inums in all
cases if there is precedent that it's not a compatibility issue, but from the
comments on my original[0] patch (especially that they strayed from the
original patches' change to use ino_t directly into slab reuse), I'd been given
the impression that it was known to be one.
From my perspective I have no evidence that inode32 is needed other than the
comment from Jeff above get_next_ino. If that turns out not to be a problem, I
am more than happy to just wholesale migrate 64-bit inodes per-sb in tmpfs.
0: https://lore.kernel.org/patchwork/patch/1170963/
On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> Dave Chinner writes:
> > [... snipped ...]
>
> I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> cases if there is precedent that it's not a compatibility issue, but from
> the comments on my original[0] patch (especially that they strayed from the
> original patches' change to use ino_t directly into slab reuse), I'd been
> given the impression that it was known to be one.
>
> From my perspective I have no evidence that inode32 is needed other than the
> comment from Jeff above get_next_ino. If that turns out not to be a problem,
> I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> tmpfs.
Well, that's my comment above about 32 bit apps using non-LFS
compliant interfaces in this day and age. It's essentially a legacy
interface these days, and anyone trying to access a modern linux
filesystem (btrfs, XFS, ext4, etc) on 64 bit systems needs to handle
64 bit inodes because they all can create >32bit inode numbers
in their default configurations.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <[email protected]> wrote:
>
> On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > [... snipped ...]
>
> Well, that's my comment above about 32 bit apps using non-LFS
> compliant interfaces in this day and age. It's essentially a legacy
> interface these days, and anyone trying to access a modern linux
> filesystem (btrfs, XFS, ext4, etc) on 64 bit systems needs to handle
> 64 bit inodes because they all can create >32bit inode numbers
> in their default configurations.
>
Chris,
Following Dave's comment, let's keep the config option, but make it
default to Y and remove the mount option for now.
If somebody shouts, the mount option can be re-added later.
This way we at least leave a workaround for an unlikely breakage.
Thanks,
Amir.
On Mon, 6 Jan 2020, Amir Goldstein wrote:
> On Mon, Jan 6, 2020 at 4:04 AM zhengbin (A) <[email protected]> wrote:
> > On 2020/1/5 20:06, Chris Down wrote:
> > > [... full patch quoted; snipped down to the relevant hunk ...]
> > > + } else {
> > > + spin_lock(&sbinfo->stat_lock);
> >
> > Using a spin_lock will affect performance. How about defining
> >
> > unsigned long __percpu *last_ino_number; /* Last inode number */
> > atomic64_t shared_last_ino_number; /* Shared last inode number */
> > in shmem_sb_info instead? Which would perform better?
> >
>
> Please take a look at shmem_reserve_inode().
> The spin lock is already being taken in shmem_get_inode(),
> so there is nothing to be gained from complicating the next_ino counter.
>
> This fact would have been a lot clearer if next_ino were incremented
> inside shmem_reserve_inode() and its value returned for use by
> shmem_get_inode(), but I am also fine with the code as it is, with the
> comment above.
Chris, Amir, thank you both for all the work you have been putting
into this over the holiday. I'm only now catching up, sorry.
It appears that what you are ending up with above is very close to
the patch Google has been using for several years. I'm glad Chris
has explored some interesting options, disappointed that you had no
more success than I had in trying to solve it efficiently with 32-bit
inos, agree with the way in which Amir cut it back. That we've come to
the same conclusions independently is good confirmation of this approach.
Only yesterday did I get to see that Amir had asked to see my patch on
the 27th: rediffed against 5.5-rc5, appended now below. I'm showing it,
though it's known deficient in three ways (not to mention lack of config
or mount options, but I now see Dave Chinner has an interesting take on
those) - better to make it visible to you now than to delay you further.
It does indicate a couple of improvements to Chris's current patch:
reducing use of stat_lock, as Amir suggested (in both nr_inodes limited
and unlimited cases); and the need to fix shmem_encode_fh(), which neither
of us did yet. Where we should go from here, fix Chris's or fix mine,
I've not decided.
From: Hugh Dickins <[email protected]>
Date: Fri, 7 Aug 2015 20:08:53 -0700
Subject: [PATCH] tmpfs: provide 64-bit inode numbers
Some programs (including ld.so and clang) try to deduplicate opened
files by comparing st_dev and st_ino, which breaks down when two files
on the same tmpfs have the same inode number: we are now hitting this
problem too often. J. R. Okajima has reported the same problem with
backup tools.
tmpfs is currently sharing the same 32-bit get_next_ino() pool as
sockets, pipes, ramfs, hugetlbfs etc. It delivers 32-bit inos even
on a 64-bit kernel for one reason: because if a 32-bit executable
compiled without _FILE_OFFSET_BITS=64 tries to stat a file with an
ino which won't fit in 32 bits, glibc fails that with EOVERFLOW.
glibc is being correct, but unhelpful: so 2.6.22 commit 866b04fccbf1
("inode numbering: make static counters in new_inode and iunique be
32 bits") forced get_next_ino() to unsigned int.
But whatever the upstream need to avoid surprising a mis-built
32-bit executable, Google production can be sure of the 64-bit
environment, and any remaining 32-bit executables built with
_FILE_OFFSET_BITS=64 (our files are sometimes bigger than 2G).
So, leave get_next_ino() as it is, but convert tmpfs to supply
unsigned long inos, and let each superblock keep its own account:
it was weird to be sharing a pool with such disparate uses before.
shmem_reserve_inode() already provides a good place to do this;
and normally it has to take stat_lock to update free_inodes, so
no overhead to increment its next_ino under the same lock. But
if it's the internal shm_mnt, or mounted with "nr_inodes=0" to
increase scalability by avoiding that stat_lock, then use the
same percpu batching technique as get_next_ino().
Take on board 4.2 commit 2adc376c5519 ("vfs: avoid creation of
inode number 0 in get_next_ino"): for safety, skip any ino whose
low 32 bits is 0.
Upstream? That's tougher: maybe try to upstream this as is, and
see what falls out; maybe add ino32 or ino64 mount options before
trying; or maybe upstream has to stick with the 32-bit ino, and a
scheme more complicated than this be implemented for tmpfs.
Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
1) Probably needs minor corrections for the 32-bit kernel: I wasn't
worrying about that at the time, and expect some "unsigned long"s
I didn't need to change for the 64-bit kernel actually need to be
"u64"s or "ino_t"s now.
2) The "return 1" from shmem_encode_fh() would nowadays be written as
"return FILEID_INO32_GEN" (and overlayfs takes particular interest
in that fileid); yet it already put the high 32 bits of the ino into
the fh: probably needs a different fileid once high 32 bits are set.
3) When this patch was written, a tmpfs could not be remounted from
limited nr_inodes to unlimited nr_inodes=0; but that restriction
was accidentally (though sensibly) lifted during v5.4's mount API
mods; so would need to be taken into account (or reverted) here.
---
include/linux/shmem_fs.h | 2 +
mm/shmem.c | 59 ++++++++++++++++++++++++++++++++++---
2 files changed, 57 insertions(+), 4 deletions(-)
--- 5.5-rc5/include/linux/shmem_fs.h 2019-11-24 16:32:01.000000000 -0800
+++ linux/include/linux/shmem_fs.h 2020-01-06 19:58:04.487301841 -0800
@@ -30,6 +30,8 @@ struct shmem_sb_info {
struct percpu_counter used_blocks; /* How many are allocated */
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
+ unsigned long next_ino; /* Next inode number to be allocated */
+ unsigned long __percpu *ino_batch; /* Next per cpu, if unlimited */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
umode_t mode; /* Mount mode for root directory */
unsigned char huge; /* Whether to try for hugepages */
--- 5.5-rc5/mm/shmem.c 2019-12-08 18:04:55.053566640 -0800
+++ linux/mm/shmem.c 2020-01-06 19:58:04.487301841 -0800
@@ -261,9 +261,24 @@ bool vma_is_shmem(struct vm_area_struct
static LIST_HEAD(shmem_swaplist);
static DEFINE_MUTEX(shmem_swaplist_mutex);
-static int shmem_reserve_inode(struct super_block *sb)
+/*
+ * shmem_reserve_inode() reserves "space" for a shmem inode, and allocates
+ * an ino. Unlike get_next_ino() in fs/inode.c, the 64-bit kernel here goes
+ * for a 64-bit st_ino, expecting even 32-bit userspace to have been built
+ * with _FILE_OFFSET_BITS=64 nowadays; but lest it has not, each sb uses its
+ * own supply, independent of sockets and pipes etc, and of other tmpfs sbs.
+ * Maybe add a 32-bit ino mount option later, if it turns out to be needed.
+ * Don't worry about 64-bit ino support from a 32-bit kernel at this time.
+ *
+ * shmem_reserve_inode() is also called when making a hard link, to allow for
+ * the space needed by each dentry; but no new ino is wanted then (NULL inop).
+ */
+#define SHMEM_INO_BATCH 1024
+static int shmem_reserve_inode(struct super_block *sb, unsigned long *inop)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ unsigned long ino;
+
if (sbinfo->max_inodes) {
spin_lock(&sbinfo->stat_lock);
if (!sbinfo->free_inodes) {
@@ -271,7 +286,35 @@ static int shmem_reserve_inode(struct su
return -ENOSPC;
}
sbinfo->free_inodes--;
+ if (inop) {
+ ino = sbinfo->next_ino++;
+ /* Avoid 0 in the low 32 bits: might appear deleted */
+ if (unlikely((unsigned int)ino == 0))
+ ino = sbinfo->next_ino++;
+ *inop = ino;
+ }
spin_unlock(&sbinfo->stat_lock);
+ } else if (inop) {
+ unsigned long *next_ino;
+ /*
+ * Internal shm_mnt, or mounted with unlimited nr_inodes=0 for
+ * maximum scalability: we don't need to take stat_lock every
+ * time here, so use percpu batching like get_next_ino().
+ */
+ next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
+ ino = *next_ino;
+ if (unlikely((ino & (SHMEM_INO_BATCH-1)) == 0)) {
+ spin_lock(&sbinfo->stat_lock);
+ ino = sbinfo->next_ino;
+ sbinfo->next_ino += SHMEM_INO_BATCH;
+ spin_unlock(&sbinfo->stat_lock);
+ /* Avoid 0 in the low 32 bits: might appear deleted */
+ if (unlikely((unsigned int)ino == 0))
+ ino++;
+ }
+ *inop = ino;
+ *next_ino = ++ino;
+ put_cpu();
}
return 0;
}
@@ -2241,13 +2284,14 @@ static struct inode *shmem_get_inode(str
struct inode *inode;
struct shmem_inode_info *info;
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ unsigned long ino;
- if (shmem_reserve_inode(sb))
+ if (shmem_reserve_inode(sb, &ino))
return NULL;
inode = new_inode(sb);
if (inode) {
- inode->i_ino = get_next_ino();
+ inode->i_ino = ino;
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -2960,7 +3004,7 @@ static int shmem_link(struct dentry *old
* first link must skip that, to get the accounting right.
*/
if (inode->i_nlink) {
- ret = shmem_reserve_inode(inode->i_sb);
+ ret = shmem_reserve_inode(inode->i_sb, NULL);
if (ret)
goto out;
}
@@ -3621,6 +3665,7 @@ static void shmem_put_super(struct super
{
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+ free_percpu(sbinfo->ino_batch);
percpu_counter_destroy(&sbinfo->used_blocks);
mpol_put(sbinfo->mpol);
kfree(sbinfo);
@@ -3663,6 +3708,12 @@ static int shmem_fill_super(struct super
#endif
sbinfo->max_blocks = ctx->blocks;
sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
+ /* If inodes are unlimited for scalability, use percpu ino batching */
+ if (!sbinfo->max_inodes) {
+ sbinfo->ino_batch = alloc_percpu(unsigned long);
+ if (!sbinfo->ino_batch)
+ goto failed;
+ }
sbinfo->uid = ctx->uid;
sbinfo->gid = ctx->gid;
sbinfo->mode = ctx->mode;
> Chris, Amir, thank you both for all the work you have been putting
> into this over the holiday. I'm only now catching up, sorry.
>
> It appears that what you are ending up with above is very close to
> the patch Google has been using for several years. I'm glad Chris
> has explored some interesting options, disappointed that you had no
> more success than I had in trying to solve it efficiently with 32-bit
> inos, agree with the way in which Amir cut it back. That we've come to
> the same conclusions independently is good confirmation of this approach.
>
Indeed :)
> Only yesterday did I get to see that Amir had asked to see my patch on
> the 27th: rediffed against 5.5-rc5, appended now below. I'm showing it,
> though it's known deficient in three ways (not to mention lack of config
> or mount options, but I now see Dave Chinner has an interesting take on
> those) - better make it visible to you now, than me delay you further.
>
> It does indicate a couple of improvements to Chris's current patch:
> reducing use of stat_lock, as Amir suggested (in both nr_inodes limited
> and unlimited cases); and need to fix shmem_encode_fh(), which neither
> of us did yet. Where we should go from here, fix Chris's or fix mine,
> I've not decided.
>
I vote in favor of the best of both patches, to result in a simpler outcome:
1. use inop approach from Hugh's patch
2. use get_next_ino() instead of per-sb ino_batch for kern_mount
Hugh, do you have any evidence, or do you suspect, that shmem_file_setup()
could be contributing to depleting the global get_next_ino pool?
Do you have an objection to the combination above?
> From: Hugh Dickins <[email protected]>
> Date: Fri, 7 Aug 2015 20:08:53 -0700
> Subject: [PATCH] tmpfs: provide 64-bit inode numbers
>
> [... commit message snipped ...]
> 2) The "return 1" from shmem_encode_fh() would nowadays be written as
> "return FILEID_INO32_GEN" (and overlayfs takes particular interest
> in that fileid); yet it already put the high 32 bits of the ino into
> the fh: probably needs a different fileid once high 32 bits are set.
Nice spotting, but this really isn't a problem for overlayfs.
I agree that it would be nice to follow the same practice as xfs with
XFS_FILEID_TYPE_64FLAG, but as one can see from the private
flag, this is by no means a standard of any sort.
As the name_to_handle_at() man page says:
"Other than the use of the handle_bytes field, the caller should treat the
file_handle structure as an opaque data type: the handle_type and f_handle
fields are needed only by a subsequent call to open_by_handle_at()."
It is true that one of my initial versions checked the value returned from
encode_fh, but Miklos pointed out to me that this value is arbitrary and we
shouldn't rely on it.
In current overlayfs, the value FILEID_INO32_GEN is only used internally
to indicate the case where filesystem has no encode_fh op (see comment
above ovl_can_decode_fh()). IOW, only the special case of encoding
by the default export_encode_fh() is considered a proof for 32bit inodes.
So tmpfs has never been considered as a 32bit inodes filesystem by
overlayfs.
Thanks,
Amir.
On Tue, 7 Jan 2020, Amir Goldstein wrote:
> On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <[email protected]> wrote:
> > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > Dave Chinner writes:
> > > > It took 15 years for us to be able to essentially deprecate
> > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > to get that albatross off our necks. In reality, almost everything
> > > > out there in the world handles 64 bit inodes correctly
> > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > And, IMNSHO, there's no excuse these days for 32 bit binaries that
> > > > don't use the *64() syscall variants directly and hence support
> > > > 64 bit inodes correctly out of the box on all platforms.
Interesting take on it. I'd all along imagined we would have to resort
to a mount option for safety, but Dave is right that I was too focused on
avoiding tmpfs regressions, without properly realizing that people were
very unlikely to have written such tools for tmpfs in particular, but
written them for all filesystems, and already encountered and fixed
such EOVERFLOWs for other filesystems.
Hmm, though how readily does XFS actually reach the high inos on
ordinary users' systems?
> > > >
> > > > I don't think we should be repeating past mistakes by trying to
> > > > cater for broken 32 bit applications on 64 bit machines in this day
> > > > and age.
> > >
> > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all
> > > cases if there is precedent that it's not a compatibility issue, but from
> > > the comments on my original[0] patch (especially that they strayed from the
> > > original patches' change to use ino_t directly into slab reuse), I'd been
> > > given the impression that it was known to be one.
> > >
> > > From my perspective I have no evidence that inode32 is needed other than the
> > > comment from Jeff above get_next_ino. If that turns out not to be a problem,
> > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in
> > > tmpfs.
> >
> > Well, that's my comment above about 32 bit apps using non-LFS
> > compliant interfaces in this day and age. It's essentially a legacy
> > interface these days, and anyone trying to access a modern linux
> > filesystem (btrfs, XFS, ext4, etc) on 64 bit systems needs to handle
> > 64 bit inodes because they all can create >32bit inode numbers
> > in their default configurations.
I thought ext4 still deals in 32-bit inos, so ext4-world would not
necessarily have caught up? Though, arguing the other way, IIUC 64-bit
ino support comes bundled with file sizes over 2GB (on all architectures?),
and it's hard to imagine who gets by with a 2GB file size limit nowadays.
> >
>
> Chris,
>
> Following Dave's comment, let's keep the config option, but make it
> default to Y and remove the mount option for now.
> If somebody shouts, mount option could be re-added later.
> This way at least we leave an option for workaround for an unlikely
> breakage.
Though I don't share Dave's strength of aversion to the inode32 and
inode64 mount options, I do agree that it's preferable not to offer
them if they're so very unlikely to be necessary.
Do we even need the config option? We certainly do need to hold the
patch with config option and mount options "in our back pocket", so
that if a regression does get reported, then upstream and stable can
respond to fix it quickly; but do we need more than that?
My main concern is that a new EOVERFLOW might not be reported as such,
and waste a lot of someone's time to track down.
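To illustrate (a hypothetical sketch, not from any report): a 32-bit binary
built without _FILE_OFFSET_BITS=64 just sees stat() itself fail, and nothing
in the error points at inode numbers:

	#include <stdio.h>
	#include <string.h>
	#include <errno.h>
	#include <sys/stat.h>

	int main(int argc, char *argv[])
	{
		struct stat st;

		/*
		 * With a legacy 32-bit struct stat, glibc returns EOVERFLOW
		 * when st_ino (or st_size) does not fit - easily mistaken
		 * for a plain I/O error by the application.
		 */
		if (argc < 2 || stat(argv[1], &st) < 0) {
			fprintf(stderr, "stat: %s\n", strerror(errno));
			return 1;
		}
		return 0;
	}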
Hugh
On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <[email protected]> wrote:
>
> [... snipped ...]
>
> Hmm, though how readily does XFS actually reach the high inos on
> ordinary users' systems?
>
Define 'ordinary'.
If my calculations are correct, with default mkfs.xfs any inode allocated
from a logical offset > 2TB on a volume has its high ino bits set.
Besides, a deployment with more than 4G inodes shouldn't be hard to find.
Thanks,
Amir.
On Tue, Jan 07, 2020 at 12:36:25AM -0800, Hugh Dickins wrote:
> [... snipped ...]
>
> Interesting take on it. I'd all along imagined we would have to resort
> to a mount option for safety, but Dave is right that I was too focused on
> avoiding tmpfs regressions, without properly realizing that people were
> very unlikely to have written such tools for tmpfs in particular, but
> written them for all filesystems, and already encountered and fixed
> such EOVERFLOWs for other filesystems.
>
> Hmm, though how readily does XFS actually reach the high inos on
> ordinary users' systems?
Quite frequently these days - the threshold for exceeding 32 bits
in the inode number is dependent on the inode size.
Old v4 filesystems use 256 byte inodes by default, so they overflow
32bits when the filesystem size is >1TB.
Current v5 filesystems use 512 byte inodes, so they overflow 32 bits
on filesystems >2TB.
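(The arithmetic, roughly: the inode number encodes the inode's disk address
divided by the inode size, so 32 bits are exhausted once inodes can sit
beyond 2^32 * 256 bytes = 1TiB for v4, or 2^32 * 512 bytes = 2TiB for v5.)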
> > [... snipped ...]
>
> I thought ext4 still deals in 32-bit inos, so ext4-world would not
> necessarily have caught up?
Hmmm - I might have got my wires crossed there - there *was* a
project some time ago to increase the ext4 inode number size for
large filesystems. Ah, yeah, there we go:
https://lore.kernel.org/linux-ext4/[email protected]/
I thought it had been rolled in with all the other "make stuff
larger" features that ext4 has...
> Though, arguing the other way, IIUC 64-bit
> ino support comes bundled with file sizes over 2GB (on all architectures?),
> and it's hard to imagine who gets by with a 2GB file size limit nowadays.
Right, you need to define LFS support for >2GB file support and
you get 64 bit inode support with that for free. It's only legacy
binaries that haven't been rebuilt in the past 15 years that are an
issue here, but there are so many other things that can go trivially
wrong with such binaries I think that inode numbers are the least of
the worries here...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jan 07, 2020 at 12:12:00PM +0200, Amir Goldstein wrote:
> On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <[email protected]> wrote:
> >
> > On Tue, 7 Jan 2020, Amir Goldstein wrote:
> > > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <[email protected]> wrote:
> > > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote:
> > > > > Dave Chinner writes:
> > > > > > It took 15 years for us to be able to essentially deprecate
> > > > > > inode32 (inode64 is the default behaviour), and we were very happy
> > > > > > to get that albatross off our necks. In reality, almost everything
> > > > > > out there in the world handles 64 bit inodes correctly
> > > > > > including 32 bit machines and 32bit binaries on 64 bit machines.
> > > > > > And, IMNSHO, there is no excuse these days for 32 bit binaries that
> > > > > > don't use the *64() syscall variants directly and hence support
> > > > > > 64 bit inodes correctly out of the box on all platforms.
> >
> > Interesting take on it. I'd all along imagined we would have to resort
> > to a mount option for safety, but Dave is right that I was too focused on
> > avoiding tmpfs regressions, without properly realizing that people were
> > very unlikely to have written such tools for tmpfs in particular, but
> > written them for all filesystems, and already encountered and fixed
> > such EOVERFLOWs for other filesystems.
> >
> > Hmm, though how readily does XFS actually reach the high inos on
> > ordinary users' systems?
> >
>
> Define 'ordinary'
> If my calculations are correct, with default mkfs.xfs any inode allocated
> from a logical offset > 2TB on a volume has the high ino bits set.
> Besides, a deployment with more than 4G inodes shouldn't be hard to find.
You don't need to allocate 4 billion inodes to get >32bit inodes in
XFS - the inode number is an encoding of the physical location of
the inode in the filesystem. Hence we just have to allocate the
inode at a disk address higher than 2TB into the device and we
overflow 32 bits.
e.g. make a sparse fs image file of 10TB, mount it, create 50
subdirs, then start creating zero length files spread across the 50
separate subdirectories. You should see >32bit inode numbers almost
immediately. (i.e. as soon as we allocate inodes in AG 2 or higher)
IOWs, there are *lots* of 64bit inode numbers out there on XFS
filesystems....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 7 Jan 2020, at 16:07, Dave Chinner wrote:
> IOWs, there are *lots* of 64bit inode numbers out there on XFS
> filesystems....
It's less likely in btrfs but +1 to all of Dave's comments. I'm happy
to run a scan on machines in the fleet and see how many have 64 bit
inodes (either buttery or x-y), but it's going to be a lot.
-chris
On Tue, 7 Jan 2020, Chris Mason wrote:
> On 7 Jan 2020, at 16:07, Dave Chinner wrote:
>
> > IOWs, there are *lots* of 64bit inode numbers out there on XFS
> > filesystems....
>
> It's less likely in btrfs but +1 to all of Dave's comments. I'm happy
> to run a scan on machines in the fleet and see how many have 64 bit
> inodes (either buttery or x-y), but it's going to be a lot.
Dave, Amir, Chris, many thanks for the info you've filled in -
and absolutely no need to run any scan on your fleet for this,
I think we can be confident that even if fb had some 15-year-old tool
in use on its fleet of 2GB-file filesystems, it would not be the one
to insist on a kernel revert of 64-bit tmpfs inos.
The picture looks clear now: while ChrisD does need to hold on to his
config option and inode32/inode64 mount option patch, it is much better
left out of the kernel until (very unlikely) proved necessary.
Thanks,
Hugh
On Tue, 7 Jan 2020, Amir Goldstein wrote:
>
> I vote in favor of the best of both patches, to result in a simpler outcome:
> 1. use inop approach from Hugh's patch
> 2. use get_next_ino() instead of per-sb ino_batch for kern_mount
>
> Hugh, do you have any evidence or suspicion that shmem_file_setup()
> could be contributing to depleting the global get_next_ino pool?
Depends on what the system is used for: on one system, it will make
very little use of that pool, on another it will be one of the major
depleters of the pool. Generally, it would be kinder to the other
users of the pool (those that might also care about ino duplication)
if shmem were to cede all use of it; but a bigger patch, yes.
> Do you have an objection to the combination above?
Objection would be too strong: I'm uncomfortable with it, and not
tempted to replace our internal implementation by one reverting to
use get_next_ino(); but not as uncomfortable as I am with holding
up progress on the issue.
Uncomfortable because of the "depletion" you mention. Uncomfortable
because half the reason for ever offering the unlimited "nr_inodes=0"
mount option was to avoid stat_lock overhead (though, for all I know,
maybe nobody anywhere uses that option now - and I'll be surprised if
0-day uses it and reports any slowdown).
Also uncomfortable because your (excellent, and I'd never thought
about it that way) argument that the shm_mnt objects are internal
and unstatable (that may not resemble how you expressed it, I've not
gone back to search the mails to find where you made the point), is
not entirely correct nowadays - memfd_create()d objects come from
that same shm_mnt, and they can be fstat()ed. What is the
likelihood that anything would care about duplicated inos of
memfd_create()d objects? Fairly low, I think we'll agree; and
we can probably also say that their inos are an odd implementation
detail that no memfd_create() user should depend upon anyway. But
I mention it in case it changes your own feeling about the shm_mnt.
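That point is easy to demonstrate from userspace; a tiny sketch,
assuming glibc 2.27+ for memfd_create():

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
        struct stat st;
        int fd = memfd_create("demo", 0); /* backed by the internal shm_mnt */

        if (fd < 0 || fstat(fd, &st) != 0) {
                perror("memfd");
                return 1;
        }
        /* This ino comes from the kernel-internal shm_mnt superblock. */
        printf("memfd ino: %llu\n", (unsigned long long)st.st_ino);
        return 0;
}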
> > Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
> >
> > 1) Probably needs minor corrections for the 32-bit kernel: I wasn't
> > worrying about that at the time, and expect some "unsigned long"s
> > I didn't need to change for the 64-bit kernel actually need to be
> > "u64"s or "ino_t"s now.
> > 2) The "return 1" from shmem_encode_fh() would nowadays be written as
> > "return FILEID_INO32_GEN" (and overlayfs takes particular interest
> > in that fileid); yet it already put the high 32 bits of the ino into
> > the fh: probably needs a different fileid once high 32 bits are set.
>
> Nice spotting, but this really isn't a problem for overlayfs.
> I agree that it would be nice to follow the same practice as xfs with
> XFS_FILEID_TYPE_64FLAG, but as one can see from the private
> flag, this is by no means a standard of any sort.
>
> As the name_to_handle_at() man page says:
> "Other than the use of the handle_bytes field, the caller should treat the
> file_handle structure as an opaque data type: the handle_type and f_handle
> fields are needed only by a subsequent call to open_by_handle_at()."
>
> It is true that one of my initial versions was checking the value returned from
> encode_fh, but Miklos has pointed out to me that this value is arbitrary
> and we shouldn't rely on it.
>
> In current overlayfs, the value FILEID_INO32_GEN is only used internally
> to indicate the case where the filesystem has no encode_fh op (see comment
> above ovl_can_decode_fh()). IOW, only the special case of encoding
> by the default export_encode_fh() is considered proof of 32bit inodes.
> So tmpfs has never been considered a 32bit-inode filesystem by
> overlayfs.
Thanks, you know far more about encode_fh() than I ever shall, so for
now I'll take your word for it that we don't need to make any change
there for this 64-bit ino support. One day I should look back, go
through the encode_fh() callsites, and decide what that "return 1"
really ought to say.
It's inconvenient, I'm sorry, but I shall have to go quiet again
for a few days - I'm here, but cannot afford to split my attention.
Hugh
On Wed, Jan 8, 2020 at 12:59 PM Hugh Dickins <[email protected]> wrote:
>
> On Tue, 7 Jan 2020, Amir Goldstein wrote:
> >
> > I vote in favor of the best of both patches, to result in a simpler outcome:
> > 1. use inop approach from Hugh's patch
> > 2. use get_next_ino() instead of per-sb ino_batch for kern_mount
> >
> > Hugh, do you have any evidence or suspicion that shmem_file_setup()
> > could be contributing to depleting the global get_next_ino pool?
>
> Depends on what the system is used for: on one system, it will make
> very little use of that pool, on another it will be one of the major
> depleters of the pool. Generally, it would be kinder to the other
> users of the pool (those that might also care about ino duplication)
> if shmem were to cede all use of it; but a bigger patch, yes.
>
> > Do you have an objection to the combination above?
>
> Objection would be too strong: I'm uncomfortable with it, and not
> tempted to replace our internal implementation by one reverting to
> use get_next_ino(); but not as uncomfortable as I am with holding
> up progress on the issue.
>
> Uncomfortable because of the "depletion" you mention. Uncomfortable
> because half the reason for ever offering the unlimited "nr_inodes=0"
> mount option was to avoid stat_lock overhead (though, for all I know,
> maybe nobody anywhere uses that option now - and I'll be surprised if
> 0-day uses it and reports any slowdown).
>
> Also uncomfortable because your (excellent, and I'd never thought
> about it that way) argument that the shm_mnt objects are internal
> and unstatable (that may not resemble how you expressed it, I've not
> gone back to search the mails to find where you made the point), is
> not entirely correct nowadays - memfd_create()d objects come from
> that same shm_mnt, and they can be fstat()ed. What is the
> likelihood that anything would care about duplicated inos of
> memfd_create()d objects? Fairly low, I think we'll agree; and
> we can probably also say that their inos are an odd implementation
> detail that no memfd_create() user should depend upon anyway. But
> I mention it in case it changes your own feeling about the shm_mnt.
>
I have no strong feeling either. I just didn't know how intensive its use
is. You have provided sufficient arguments IMO to go with your version
of the patch.
> > > Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
> > >
> > > 1) Probably needs minor corrections for the 32-bit kernel: I wasn't
> > > worrying about that at the time, and expect some "unsigned long"s
> > > I didn't need to change for the 64-bit kernel actually need to be
> > > "u64"s or "ino_t"s now.
> > > 2) The "return 1" from shmem_encode_fh() would nowadays be written as
> > > "return FILEID_INO32_GEN" (and overlayfs takes particular interest
> > > in that fileid); yet it already put the high 32 bits of the ino into
> > > the fh: probably needs a different fileid once high 32 bits are set.
> >
> > Nice spotting, but this really isn't a problem for overlayfs.
> > I agree that it would be nice to follow the same practice as xfs with
> > XFS_FILEID_TYPE_64FLAG, but as one can see from the private
> > flag, this is by no means a standard of any sort.
> >
> > As the name_to_handle_at() man page says:
> > "Other than the use of the handle_bytes field, the caller should treat the
> > file_handle structure as an opaque data type: the handle_type and f_handle
> > fields are needed only by a subsequent call to open_by_handle_at()."
> >
> > It is true that one of my initial versions was checking the value returned from
> > encode_fh, but Miklos has pointed out to me that this value is arbitrary
> > and we shouldn't rely on it.
> >
> > In current overlayfs, the value FILEID_INO32_GEN is only used internally
> > to indicate the case where the filesystem has no encode_fh op (see comment
> > above ovl_can_decode_fh()). IOW, only the special case of encoding
> > by the default export_encode_fh() is considered proof of 32bit inodes.
> > So tmpfs has never been considered a 32bit-inode filesystem by
> > overlayfs.
>
> Thanks, you know far more about encode_fh() than I ever shall, so for
> now I'll take your word for it that we don't need to make any change
> there for this 64-bit ino support. One day I should look back, go
> through the encode_fh() callsites, and decide what that "return 1"
> really ought to say.
>
TBH, I meant to say that the return 1 shouldn't matter functionally,
but it would be much nicer to change it to FILEID_INO64_GEN,
define that as 0x81 in include/linux/exportfs.h, and actually
order the gen word after the inode64 words.
But that is independent of the per-sb ino change, because the
layout of the file handle has always been [gen,inode64] and never
really matched the standard FILEID_INO32_GEN layout.
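For concreteness, a sketch of that cleanup (FILEID_INO64_GEN and its
0x81 value are the proposal above, not anything in exportfs.h yet):

#define FILEID_INO64_GEN 0x81   /* proposed; does not exist yet */

static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len,
                           struct inode *parent)
{
        if (*len < 3) {
                *len = 3;
                return FILEID_INVALID;
        }
        /* inode64 words first, the gen word ordered after them: */
        fh[0] = inode->i_ino;                   /* low 32 bits */
        fh[1] = ((__u64)inode->i_ino) >> 32;    /* high 32 bits */
        fh[2] = inode->i_generation;
        *len = 3;
        return FILEID_INO64_GEN;
}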
I may send a patch later on to change that.
Thanks,
Amir.
Dave Chinner writes:
> On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote:
> > The default is still set to inode32 for backwards compatibility, but
> > system administrators can opt in to the new 64-bit inode numbers by
> > either:
> >
> > 1. Passing inode64 on the command line when mounting, or
> > 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
> >
> > The inode64 and inode32 names are used based on existing precedent from
> > XFS.
>
> Please don't copy this misfeature of XFS.
>
> The inode32/inode64 XFS options were a horrible hack made more than
> 20 years ago when NFSv2 was still in use and 64 bit inodes could
> not be used for NFSv2 exports. It was then continued to be used
> because 32bit NFSv3 clients were unable to handle 64 bit inodes.
>
> It took 15 years for us to be able to essentially deprecate
> inode32 (inode64 is the default behaviour), and we were very happy
> to get that albatross off our necks. In reality, almost everything
> out there in the world handles 64 bit inodes correctly
> including 32 bit machines and 32bit binaries on 64 bit machines.
> And, IMNSHO, there is no excuse these days for 32 bit binaries that
> don't use the *64() syscall variants directly and hence support
> 64 bit inodes correctly out of the box on all platforms.
>
> I don't think we should be repeating past mistakes by trying to
> cater for broken 32 bit applications on 64 bit machines in this day
> and age.
Hi,
It's unfortunately not true that everything handles this correctly.
32-bit binaries for games on Steam that use stat() without the 64 are
so prevalent that I got tired of adding the LD_PRELOAD wrapper script
and just patched out the EOVERFLOW return from glibc instead. (They
obviously don't care about the inode value at all, and I don't use any
other 32-bit binaries that do). This is probably a class of binaries
you don't care very much about, and not very likely to be installed on
a tmpfs that has wrapped around, but I thought it was worth mentioning
that they do exist anyway.
--
Mikael Magnusson
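For reference, the LD_PRELOAD trick Mikael describes boils down to
something like the sketch below: intercept the non-LFS stat(), do the
64-bit stat underneath, and truncate st_ino rather than letting glibc
fail with EOVERFLOW. Symbol names vary by glibc version (older glibc
routes stat() through __xstat()), so treat it as illustrative rather
than drop-in; build with something like
cc -m32 -shared -fPIC -o preload.so preload.c -ldl.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/stat.h>

int stat(const char *path, struct stat *buf)
{
        static int (*real_stat64)(const char *, struct stat64 *);
        struct stat64 st64;

        if (!real_stat64)
                real_stat64 = (int (*)(const char *, struct stat64 *))
                        dlsym(RTLD_NEXT, "stat64");
        if (real_stat64(path, &st64) != 0)
                return -1;

        /* Copy what the caller can represent; st_ino is silently
         * truncated here instead of the call failing. */
        memset(buf, 0, sizeof(*buf));
        buf->st_dev   = st64.st_dev;
        buf->st_ino   = (ino_t)st64.st_ino;
        buf->st_mode  = st64.st_mode;
        buf->st_nlink = st64.st_nlink;
        buf->st_uid   = st64.st_uid;
        buf->st_gid   = st64.st_gid;
        buf->st_size  = (off_t)st64.st_size;    /* >2GB would still truncate */
        buf->st_mtime = st64.st_mtime;
        return 0;
}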
On Wed, 2020-01-08 at 03:24 -0800, Hugh Dickins wrote:
> On Tue, 7 Jan 2020, Chris Mason wrote:
> > On 7 Jan 2020, at 16:07, Dave Chinner wrote:
> >
> > > IOWs, there are *lots* of 64bit inode numbers out there on XFS
> > > filesystems....
> >
> > It's less likely in btrfs but +1 to all of Dave's comments. I'm happy
> > to run a scan on machines in the fleet and see how many have 64 bit
> > inodes (either buttery or x-y), but it's going to be a lot.
>
> Dave, Amir, Chris, many thanks for the info you've filled in -
> and absolutely no need to run any scan on your fleet for this,
> I think we can be confident that even if fb had some 15-year-old tool
> in use on its fleet of 2GB-file filesystems, it would not be the one
> to insist on a kernel revert of 64-bit tmpfs inos.
>
> The picture looks clear now: while ChrisD does need to hold on to his
> config option and inode32/inode64 mount option patch, it is much better
> left out of the kernel until (very unlikely) proved necessary.
This approach seems like the best course to me.
FWIW, at the time we capped this at 32-bits (2007), 64-bit machines were
really just becoming widely available, and it was quite common to run
32-bit, non-LFS apps on a 64-bit kernel. Users were hitting spurious
EOVERFLOW errors all over the place so this seemed like the best way to
address it.
The world has changed a lot since then though, and one would hope that
almost everything these days is compiled with _FILE_OFFSET_BITS=64.
Fingers crossed!
--
Jeff Layton <[email protected]>
Hi Hugh, Dave,
Hugh Dickins writes:
>Dave, Amir, Chris, many thanks for the info you've filled in -
>and absolutely no need to run any scan on your fleet for this,
>I think we can be confident that even if fb had some 15-year-old tool
>in use on its fleet of 2GB-file filesystems, it would not be the one
>to insist on a kernel revert of 64-bit tmpfs inos.
>
>The picture looks clear now: while ChrisD does need to hold on to his
>config option and inode32/inode64 mount option patch, it is much better
>left out of the kernel until (very unlikely) proved necessary.
Based on Mikael's comment above about Steam binaries, and how unlikely
it is that they can be rebuilt, I'm inclined to still keep inode{64,32},
but make legacy behaviour require explicit opt-in. That is:
- Default it to inode64
- Remove the Kconfig option
- Only print it as an option if tmpfs was explicitly mounted with inode32
The reason I suggest keeping this is that I'm mildly concerned that the kind of
users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS --
like the not-too-uncommon case that Mikael brings up -- seem unlikely to be the
kind of people that would find it in an rc.
Other than that, the first patch could be similar to how it is now,
incorporating Hugh's improvements to the first patch to put everything under
the same stat_lock in shmem_reserve_inode.
What do you folks think?
Thanks,
Chris
On Wed, 8 Jan 2020, Mikael Magnusson wrote:
>
> It's unfortunately not true that everything handles this correctly.
> 32-bit binaries for games on Steam that use stat() without the 64 are
> so prevalent that I got tired of adding the LD_PRELOAD wrapper script
> and just patched out the EOVERFLOW return from glibc instead. (They
> obviously don't care about the inode value at all, and I don't use any
> other 32-bit binaries that do). This is probably a class of binaries
> you don't care very much about, and not very likely to be installed on
> a tmpfs that has wrapped around, but I thought it was worth mentioning
> that they do exist anyway.
Thank you for alerting us to reality, Mikael: not what any of us wanted
to hear, but we do care about them, and it was well worth mentioning.
Hugh
On Fri, 10 Jan 2020, Chris Down wrote:
> Hugh Dickins writes:
> > Dave, Amir, Chris, many thanks for the info you've filled in -
> > and absolutely no need to run any scan on your fleet for this,
> > I think we can be confident that even if fb had some 15-year-old tool
> > in use on its fleet of 2GB-file filesystems, it would not be the one
> > to insist on a kernel revert of 64-bit tmpfs inos.
> >
> > The picture looks clear now: while ChrisD does need to hold on to his
> > config option and inode32/inode64 mount option patch, it is much better
> > left out of the kernel until (very unlikely) proved necessary.
>
> Based on Mikael's comment above about Steam binaries, and how unlikely
> it is that they can be rebuilt, I'm inclined to still keep inode{64,32},
> but make legacy behaviour require explicit opt-in. That is:
>
> - Default it to inode64
> - Remove the Kconfig option
> - Only print it as an option if tmpfs was explicitly mounted with inode32
>
> The reason I suggest keeping this is that I'm mildly concerned that the kind
> of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS
> -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to
> be the kind of people that would find it in an rc.
Okay. None of us are thrilled with it, but I agree that
Mikael's observation should override our developers' preference.
So the "inode64" option will be accepted (though redundant) on mounting,
and exists for use as a remount option after mounting or remounting
with "inode32": allowing the admin to switch temporarily to mask off
the high ino bits with "inode32" when needing to run a limited binary.
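In other words, roughly this dance via mount(2) (a sketch; "/mnt/tmp"
is a made-up mount point, and the option names are those of the
proposed patch):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Mask the high ino bits before running the limited binary: */
        if (mount(NULL, "/mnt/tmp", NULL, MS_REMOUNT, "inode32"))
                perror("remount inode32");

        /* ... run the legacy 32-bit binary here ... */

        /* Then restore full-width inode numbers: */
        if (mount(NULL, "/mnt/tmp", NULL, MS_REMOUNT, "inode64"))
                perror("remount inode64");
        return 0;
}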
Documentation and commit message to alert Andrew and Linus and distros
that we are risking some breakage with this, but supplying the antidote
(not breakage of any distros themselves, no doubt they're all good;
but breakage of what some users might run on them).
>
> Other than that, the first patch could be similar to how it is now,
> incorporating Hugh's improvements to the first patch to put everything under
> the same stat_lock in shmem_reserve_inode.
So, I persuaded Amir to the other aspects of my version, but did not
persuade you? Well, I can live with that (or if not, can send mods
on top of yours): but please read again why I was uncomfortable with
yours, to check that you still prefer it (I agree that your patch is
simpler, and none of my discomfort is decisive).
Thanks,
Hugh
Hi Hugh,
Sorry this response took so long, I had some non-work issues that took a lot of
time last week.
Hugh Dickins writes:
>On Fri, 10 Jan 2020, Chris Down wrote:
>> Hugh Dickins writes:
>> > Dave, Amir, Chris, many thanks for the info you've filled in -
>> > and absolutely no need to run any scan on your fleet for this,
>> > I think we can be confident that even if fb had some 15-year-old tool
>> > in use on its fleet of 2GB-file filesystems, it would not be the one
>> > to insist on a kernel revert of 64-bit tmpfs inos.
>> >
>> > The picture looks clear now: while ChrisD does need to hold on to his
>> > config option and inode32/inode64 mount option patch, it is much better
>> > left out of the kernel until (very unlikely) proved necessary.
>>
>> Based on Mikael's comment above about Steam binaries, and how unlikely
>> it is that they can be rebuilt, I'm inclined to still keep inode{64,32},
>> but make legacy behaviour require explicit opt-in. That is:
>>
>> - Default it to inode64
>> - Remove the Kconfig option
>> - Only print it as an option if tmpfs was explicitly mounted with inode32
>>
>> The reason I suggest keeping this is that I'm mildly concerned that the kind
>> of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS
>> -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to
>> be the kind of people that would find it in an rc.
>
>Okay. None of us are thrilled with it, but I agree that
>Mikael's observation should override our developer's preference.
>
>So the "inode64" option will be accepted (though redundant) on mounting,
>and exists for use as a remount option after mounting or remounting
>with "inode32": allowing the admin to switch temporarily to mask off
>the high ino bits with "inode32" when needing to run a limited binary.
>
>Documentation and commit message to alert Andrew and Linus and distros
>that we are risking some breakage with this, but supplying the antidote
>(not breakage of any distros themselves, no doubt they're all good;
>but breakage of what some users might run on them).
Sounds good.
>>
>> Other than that, the first patch could be similar to how it is now,
>> incorporating Hugh's improvements to the first patch to put everything under
>> the same stat_lock in shmem_reserve_inode.
>
>So, I persuaded Amir to the other aspects of my version, but did not
>persuade you? Well, I can live with that (or if not, can send mods
>on top of yours): but please read again why I was uncomfortable with
>yours, to check that you still prefer it (I agree that your patch is
>simpler, and none of my discomfort is decisive).
Hmm, which bit were you thinking of? The lack of batching, shmem_encode_fh(),
or the fact that nr_inodes can now be 0 on non-internal mounts?
For batching, I'm neutral. I'm happy to use the approach from your patch and
integrate it (and credit you, of course).
For shmem_encode_fh, I'm not totally sure I understand the concern, if that's
what you mean.
For nr_inodes, I agree that intentional or unintentional, we should at least
handle this case for now and can adjust later if the behaviour changes.
Thanks again,
Chris
On Mon, 20 Jan 2020, Chris Down wrote:
> Hi Hugh,
>
> Sorry this response took so long, I had some non-work issues that took a lot
> of time last week.
No, clearly it's I who must apologize to you for my very slow response.
>
> Hugh Dickins writes:
> >
> > So the "inode64" option will be accepted (though redundant) on mounting,
> > and exists for use as a remount option after mounting or remounting
> > with "inode32": allowing the admin to switch temporarily to mask off
> > the high ino bits with "inode32" when needing to run a limited binary.
> >
> > Documentation and commit message to alert Andrew and Linus and distros
> > that we are risking some breakage with this, but supplying the antidote
> > (not breakage of any distros themselves, no doubt they're all good;
> > but breakage of what some users might run on them).
>
> Sounds good.
>
> > >
> > > Other than that, the first patch could be similar to how it is now,
> > > incorporating Hugh's improvements to the first patch to put everything
> > > under
> > > the same stat_lock in shmem_reserve_inode.
> >
> > So, I persuaded Amir to the other aspects of my version, but did not
> > persuade you? Well, I can live with that (or if not, can send mods
> > on top of yours): but please read again why I was uncomfortable with
> > yours, to check that you still prefer it (I agree that your patch is
> > simpler, and none of my discomfort is decisive).
>
> Hmm, which bit were you thinking of? The lack of batching, shmem_encode_fh(),
> or the fact that nr_inodes can now be 0 on non-internal mounts?
I was uncomfortable with tmpfs depleting get_next_ino()'s pool in some
configurations, and wanted the (get_next_ino-like) per-cpu but per-sb
batching for nr_inodes=0, to minimize its stat_lock contention.
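Roughly this shape, modeled on fs/inode.c's get_next_ino() but with
the counter moved into the superblock (a sketch with illustrative
names, not the actual patch):

#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define SHMEM_INO_BATCH 1024ULL

struct shmem_sb_info {                  /* sketch: relevant fields only */
        spinlock_t stat_lock;
        ino_t next_ino;                 /* protected by stat_lock */
        ino_t __percpu *ino_batch;      /* per-cpu cursor into a batch */
};

static ino_t shmem_get_ino(struct shmem_sb_info *sbinfo)
{
        ino_t *next = get_cpu_ptr(sbinfo->ino_batch);
        ino_t ino = *next;

        if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
                /*
                 * Cursor at a batch boundary (or first use on this CPU):
                 * take stat_lock once per SHMEM_INO_BATCH allocations to
                 * carve a fresh batch out of the shared per-sb counter.
                 */
                spin_lock(&sbinfo->stat_lock);
                ino = sbinfo->next_ino;
                sbinfo->next_ino += SHMEM_INO_BATCH;
                spin_unlock(&sbinfo->stat_lock);
        }
        *next = ino + 1;
        put_cpu_ptr(sbinfo->ino_batch);
        return ino;
}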
I did not have a patch to shmem_encode_fh(), but had gone through it
thinking that 64-bit inos made an additional fix there necessary; Amir
then educated us that it is safe as is, though it could be cleaned up later.
nr_inodes has been allowed to be 0 on non-internal mounts for many years.
>
> For batching, I'm neutral. I'm happy to use the approach from your patch and
> integrate it (and credit you, of course).
Credit not important, but you may well want to blame me for that
complication :)
>
> For shmem_encode_fh, I'm not totally sure I understand the concern, if that's
> what you mean.
My concern had been that shmem_encode_fh() builds up an fh from i_ino
and more, looks well prepared for a 64-bit ino, but appeared to be
announcing a 32-bit ino in its return value: Amir reassures us that
that return value does not matter.
>
> For nr_inodes, I agree that intentional or unintentional, we should at least
> handle this case for now and can adjust later if the behaviour changes.
nr_inodes=0 is an intentional configuration (but 0 denoting infinity:
not very clean, and I've sometimes regretted that choice).
If there's any behavior change, that's a separate matter from these
64-bit ino patches; maybe I mentioned it in passing and confused us -
that we seem to have recently allowed a remounting limited<->unlimited
that was not permitted before, and might or might not need a fix patch.
Sorry again for delaying you, Chris: not at all what I'd wanted to do.
Hugh