Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp9930682ybl; Thu, 26 Dec 2019 07:49:14 -0800 (PST) X-Google-Smtp-Source: APXvYqzJuqtdC4Gdqr9q/WUyDmFfmRsZBsAxApKOP/Avy1vEMXly+ktxAPIDESPEPNWaBfMzTIVc X-Received: by 2002:a9d:f26:: with SMTP id 35mr53430813ott.260.1577375354633; Thu, 26 Dec 2019 07:49:14 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1577375354; cv=none; d=google.com; s=arc-20160816; b=jgDuyoiHfRHy3+NVaFyHnOX99VEl9xLSWKhybZGAmzSARqi8LpwGrVHfl5LxDj0a9H GdFtg2RBAyQCZ+0kkor6I4Pwz7iwnUVdwTo1rCCiu28EcclwX+E/S49kM2ReBn/lR0hw 4qQ6yVmR1aCipfwA4eoESOylFCzF/fmH4mAKDEIt2P9XVqqKKp/lkd9gBfBghvksop0o rgVTbKd74R1v8KK2S7HCIyO4NFyiYCxLMXlG4caurc/5ETfisklIOdHSJKSpDzftLxan uL2gCaErVzhEwtAe8wPJojsUcWzc7+A1AJcAEzVVm3Raq13OzZNZtfTZSACSWkPpyRM3 pl5w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-disposition:mime-version :message-id:subject:cc:to:from:date:dkim-signature; bh=RnNpSJkpTTHSSXdkTIGiG3GQdw5OOuvH5LxHNOHSFDE=; b=IpDeQ2lkw2sewKCqxe0AKL4NQdsRxSUSZugURXa9nOa5SOYZPpz3ML39HWRP4XoZqY zyW602EjeWzfssV01uauGPbBBestvItPekGyTeF7JaaWf948zBbK7VBFu1SmWiGvQB0J MU089ZyGtQQr+iHuK9VDRSB0EfIiEkanY9/Rdu3hYsmnV5afCXsCCB5oS8V6lozCZ53y 5GK5Vc6HuyRhNi2bPIUt7hvmmH3q41v5nz1snguxf/q2i6pzEYaM2CA37PFbp9jSLQQk toimjvm78lJOka6iahIhJZGSuEmffU11CQc2/pc6gxZn7oMZKqIT0NlgkcFG+A1fxwcu ghfA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chrisdown.name header.s=google header.b="nF6ehLf/"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chrisdown.name Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id w197si14997449oif.150.2019.12.26.07.49.03; Thu, 26 Dec 2019 07:49:14 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@chrisdown.name header.s=google header.b="nF6ehLf/"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chrisdown.name Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726628AbfLZPsO (ORCPT + 99 others); Thu, 26 Dec 2019 10:48:14 -0500 Received: from mail-wr1-f66.google.com ([209.85.221.66]:40083 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726480AbfLZPsO (ORCPT ); Thu, 26 Dec 2019 10:48:14 -0500 Received: by mail-wr1-f66.google.com with SMTP id c14so23917044wrn.7 for ; Thu, 26 Dec 2019 07:48:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chrisdown.name; s=google; h=date:from:to:cc:subject:message-id:mime-version:content-disposition; bh=RnNpSJkpTTHSSXdkTIGiG3GQdw5OOuvH5LxHNOHSFDE=; b=nF6ehLf/8fpuGvCYw5OHcTzrw6jh6/6C1d1DqyOLAGH/CcYOO75zG/mXseEbe4M1s/ uV18piB0OPWBEnrE34ie6FqcWtAzYTsEAMQKs2R40oxGwKc9+WTwAdieNPk3OrPqKZ+9 C79sGBv/p7xL2ydPW4XkE418onpuayegsATbU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:mime-version :content-disposition; bh=RnNpSJkpTTHSSXdkTIGiG3GQdw5OOuvH5LxHNOHSFDE=; b=SGYMagcXUzDiKBuDKUJHP1DRDLcAmjcfpVuY8behZa8JtkvsXST4n4mdMZQbkK00CL eKMrOj4iCp1G5XyDPWWqxY6jLt1xNPyD18TcBH0Azqejs6KKIpFani8Y3axKRXtcTxM7 Me+zGLcwrjZ9L5PWrPqNwAj43/WqJ/HOZd9fxZ2Z3JyGvsrOx7WZG9M3dv96GHEvqB9m gSBrYgHDC3LhrSwII4AeOXxawiwnEXaWPlOyrIw83RYoxujzCekj+YhPYQS90jrM6u5n 3kYxtXrsoRJayAal3gfNKQhmhya5JYEsm6H+wzYHCNDxuW2zxwAnXxTaBPUVEkyaHjHD Y9Bg== X-Gm-Message-State: APjAAAVN+e31atJp5/kVs3vuKVwkHBuM7RMc7eIDIjX2Pr3qtz64UKfc kZtFcZMANaDdSyQTl2iWTAMBOw== X-Received: by 2002:adf:e6cb:: with SMTP id y11mr47061397wrm.345.1577375290394; Thu, 26 Dec 2019 07:48:10 -0800 (PST) Received: from localhost ([85.255.234.78]) by smtp.gmail.com with ESMTPSA id z21sm8800006wml.5.2019.12.26.07.48.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 26 Dec 2019 07:48:09 -0800 (PST) Date: Thu, 26 Dec 2019 15:48:08 +0000 From: Chris Down To: linux-fsdevel@vger.kernel.org Cc: Al Viro , Matthew Wilcox , Amir Goldstein , Jeff Layton , Johannes Weiner , Tejun Heo , linux-kernel@vger.kernel.org, kernel-team@fb.com, "zhengbin (A)" Subject: [PATCH] fs: inode: Recycle inodenum from volatile inode slabs Message-ID: <20191226154808.GA418948@chrisdown.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On affected tiers, in excess of 10% of hosts show multiple files with different content and the same inode number, with some servers even having as many as 150 duplicated inode numbers with differing file content. This causes actual, tangible problems in production. For example, we have complaints from those working on remote caches that their application is reporting cache corruptions because it uses (device, inodenum) to establish the identity of a particular cache object, but because it's not unique any more, the application refuses to continue and reports cache corruption. Even worse, sometimes applications may not even detect the corruption but may continue anyway, causing phantom and hard to debug behaviour. In general, userspace applications expect that (device, inodenum) should be enough to be uniquely point to one inode, which seems fair enough. One might also need to check the generation, but in this case: 1. That's not currently exposed to userspace (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY); 2. Even with generation, there shouldn't be two live inodes with the same inode number on one device. In order to fix this, we reuse inode numbers from recycled slabs where possible, allowing us to significantly reduce the risk of 32 bit wraparound. There are probably some other potential users of this, like some FUSE internals, and {proc,sys,kern}fs style APIs, but doing a general opt-out codemod requires some thinking depending on the particular callsites and how far up the stack they are, we might end up recycling an i_ino value that actually does have some semantic meaning. As such, to start with this patch only opts in a few get_next_ino-heavy filesystems, and those which looked straightforward and without likelihood for corner cases: - bpffs - configfs - debugfs - efivarfs - hugetlbfs - ramfs - tmpfs Another alternative considered was to change get_next_ino to use and return ino_t with internal use of 64-bit atomics, but this has a couple of setbacks that might make it not possible to use in all scenarios: 1. This may break some 32-bit userspace applications on a 64-bit kernel which cannot handle a 64-bit ino_t, see the comment above get_next_ino; 2. Some applications inside the kernel already make use of the ino_t high bits. For example, overlayfs' xino feature uses these to merge inode numbers and fsid indexes to form a new identifier. One limitation to this approach is that slab recycling is currently only per-memcg. This means workloads which heavily exercise get_next_ino with the same memcg are most likely to benefit, rather than those with a wide range of cgroups thrashing it. Depending on the workload, I've seen from 10%-50% recycle rate, which seems like a reasonable win with no significant increase in code complexity. Signed-off-by: Chris Down Reported-by: Phyllipe Medeiros Cc: Al Viro Cc: Matthew Wilcox Cc: Amir Goldstein Cc: Jeff Layton Cc: Johannes Weiner Cc: Tejun Heo Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: kernel-team@fb.com --- fs/configfs/inode.c | 2 +- fs/debugfs/inode.c | 2 +- fs/efivarfs/inode.c | 2 +- fs/hugetlbfs/inode.c | 4 ++-- fs/inode.c | 31 +++++++++++++++++++++++++++++++ fs/ramfs/inode.c | 2 +- include/linux/fs.h | 2 ++ kernel/bpf/inode.c | 2 +- mm/shmem.c | 2 +- 9 files changed, 41 insertions(+), 8 deletions(-) diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c index fd0b5dd68f9e..c0157f9b3e33 100644 --- a/fs/configfs/inode.c +++ b/fs/configfs/inode.c @@ -114,7 +114,7 @@ struct inode *configfs_new_inode(umode_t mode, struct configfs_dirent *sd, { struct inode * inode = new_inode(s); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode->i_mapping->a_ops = &configfs_aops; inode->i_op = &configfs_inode_operations; diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index bda942afc644..64af16104661 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -66,7 +66,7 @@ static struct inode *debugfs_get_inode(struct super_block *sb) { struct inode *inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); } diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index 96c0c86f3fff..ba2f3c6a4042 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -19,7 +19,7 @@ struct inode *efivarfs_get_inode(struct super_block *sb, struct inode *inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode->i_mode = mode; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); inode->i_flags = is_removable ? 0 : S_IMMUTABLE; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index d5c2a3158610..a867035b6460 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -732,7 +732,7 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode->i_mode = S_IFDIR | ctx->mode; inode->i_uid = ctx->uid; inode->i_gid = ctx->gid; @@ -775,7 +775,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, if (inode) { struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode); - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode_init_owner(inode, dir, mode); lockdep_set_class(&inode->i_mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key); diff --git a/fs/inode.c b/fs/inode.c index aff2b5831168..0df68367587b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -880,6 +881,11 @@ static struct inode *find_inode_fast(struct super_block *sb, #define LAST_INO_BATCH 1024 static DEFINE_PER_CPU(unsigned int, last_ino); +/* + * As get_next_ino returns a type with a small width (typically 32 bits), + * consider calling recycle_or_get_next_ino instead if your callsite may be able + * to reuse a recycled inode's i_ino to reduce the risk of inode wraparound. + */ unsigned int get_next_ino(void) { unsigned int *p = &get_cpu_var(last_ino); @@ -904,6 +910,31 @@ unsigned int get_next_ino(void) } EXPORT_SYMBOL(get_next_ino); +/* + * get_next_ino() returns an unsigned int, which can wrap around on workloads + * havily making use of it. + * + * To reduce the risks, callsites can instead use recycle_or_get_next_ino to + * only get a new inode number when the slab wasn't recycled. old_ino should be + * i_ino from the (potentially) recycled slab. + */ +unsigned int recycle_or_get_next_ino(ino_t old_ino) +{ + /* + * get_next_ino returns unsigned int. If this fires then i_ino must be + * >32 bits and have been changed later, so the caller shouldn't be + * recycling inode numbers + */ + WARN_ONCE(old_ino > UINT_MAX, + "Recyclable i_ino not from get_next_ino: %llu", (u64)old_ino); + + if (old_ino) + return old_ino; + else + return get_next_ino(); +} +EXPORT_SYMBOL(recycle_or_get_next_ino); + /** * new_inode_pseudo - obtain an inode * @sb: superblock diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index 35624ca2a2f9..e517202dd607 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -66,7 +66,7 @@ struct inode *ramfs_get_inode(struct super_block *sb, struct inode * inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode_init_owner(inode, dir, mode); inode->i_mapping->a_ops = &ramfs_aops; mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); diff --git a/include/linux/fs.h b/include/linux/fs.h index 190c45039359..d90be3ab5fc4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3053,6 +3053,8 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { }; extern void unlock_new_inode(struct inode *); extern void discard_new_inode(struct inode *); extern unsigned int get_next_ino(void); +extern unsigned int recycle_or_get_next_ino(ino_t old_ino); + extern void evict_inodes(struct super_block *sb); extern void __iget(struct inode * inode); diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index ecf42bec38c0..a459a7da74d5 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -97,7 +97,7 @@ static struct inode *bpf_get_inode(struct super_block *sb, if (!inode) return ERR_PTR(-ENOSPC); - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode->i_atime = current_time(inode); inode->i_mtime = inode->i_atime; inode->i_ctime = inode->i_atime; diff --git a/mm/shmem.c b/mm/shmem.c index 165fa6332993..8c358d2c24d3 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2247,7 +2247,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = recycle_or_get_next_ino(inode->i_ino); inode_init_owner(inode, dir, mode); inode->i_blocks = 0; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); -- 2.24.1