From: Amir Goldstein
Date: Thu, 26 Dec 2019 20:04:00 +0200
Subject: Re: [PATCH] fs: inode: Recycle inodenum from volatile inode slabs
To: Chris Down
Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner,
    Tejun Heo, linux-kernel, kernel-team@fb.com, "zhengbin (A)"
In-Reply-To: <20191226154808.GA418948@chrisdown.name>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 26, 2019 at 5:48 PM Chris Down wrote:
>
> In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
> On affected tiers, in excess of 10% of hosts show multiple files with
> different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to uniquely point to one inode, which seems fair enough.
> One might also need to check the generation, but in this case:
>
> 1. That's not currently exposed to userspace
>    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY);
> 2. Even with generation, there shouldn't be two live inodes with the
>    same inode number on one device.
>
> In order to fix this, we reuse inode numbers from recycled slabs where
> possible, allowing us to significantly reduce the risk of 32-bit
> wraparound.
>
> There are probably some other potential users of this, like some FUSE
> internals, and {proc,sys,kern}fs style APIs, but doing a general opt-out
> codemod requires some thinking: depending on the particular callsites
> and how far up the stack they are, we might end up recycling an i_ino
> value that actually does have some semantic meaning. As such, to start
> with this patch only opts in a few get_next_ino-heavy filesystems, and
> those which looked straightforward and without likelihood for corner
> cases:
>
> - bpffs
> - configfs
> - debugfs
> - efivarfs
> - hugetlbfs
> - ramfs
> - tmpfs
>

I'm confused about this list.

I suggested converting tmpfs and hugetlbfs because they use a private
inode cache pool, so you can know for sure that a recycled i_ino was
allocated by get_next_ino().

If I am not mistaken, the other filesystems above use the common
inode_cache pool, so when you recycle i_ino from that pool you don't
know where it came from and cannot trust its uniqueness in the
get_next_ino() domain.

Even if *all* filesystems that currently use the common inode_cache use
get_next_ino() exclusively to allocate ino numbers, that could change in
the future.

I'd go even further and say that introducing a generic helper for this
sort of thing is asking for trouble. It is best to keep the recycle logic
well within the bounds of the specific filesystem driver, which owns the
private inode cache and is responsible for allocating ino numbers in
this pool.

Thanks and happy holidays,
Amir.
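
To make the private-pool argument concrete, here is a minimal userspace
sketch in C. It is a model under stated assumptions, not kernel code and
not the patch under discussion: model_inode, model_pool,
model_get_next_ino(), pool_alloc() and pool_free() are hypothetical names
invented for illustration, and the fixed-size slot array only stands in
for a slab cache. The idea it shows is that when one filesystem owns the
pool and is the only writer of i_ino in it, a non-zero i_ino found in a
recycled slot must have come from the shared counter and can be reused
without consuming a new number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SLOTS 4

/* Hypothetical stand-in for the global 32-bit counter behind get_next_ino(). */
static unsigned int next_ino_counter;

/* Hand out the next number from the shared counter; 0 is reserved as "unset". */
unsigned int model_get_next_ino(void)
{
    unsigned int ino = ++next_ino_counter;

    if (ino == 0)                 /* counter wrapped: skip the reserved value */
        ino = ++next_ino_counter;
    return ino;
}

/* Hypothetical inode object: i_ino is left in place when the slot is freed. */
struct model_inode {
    unsigned int i_ino;           /* 0 means the slot never held a number */
    bool in_use;
};

/* A private pool: only its owning filesystem ever writes i_ino here. */
struct model_pool {
    struct model_inode slots[POOL_SLOTS];
};

/* Take a free slot; reuse its old number if it has one, else draw a new one. */
struct model_inode *pool_alloc(struct model_pool *pool)
{
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        struct model_inode *inode = &pool->slots[i];

        if (inode->in_use)
            continue;
        if (inode->i_ino == 0)
            inode->i_ino = model_get_next_ino();
        inode->in_use = true;
        return inode;
    }
    return NULL;                  /* pool exhausted */
}

/* Release a slot but keep i_ino so the next allocation can recycle it. */
void pool_free(struct model_pool *pool, struct model_inode *inode)
{
    assert(inode >= pool->slots && inode < pool->slots + POOL_SLOTS);
    inode->in_use = false;
}
```

In this model, freeing an inode and allocating again returns the same
number and leaves the shared counter untouched. If several unrelated
users shared one pool and some of them set i_ino by other means, the
"non-zero means it came from the counter" assumption in pool_alloc()
would no longer hold, which is the concern raised above about the common
inode_cache.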