Subject: Re: [PATCH] fs: inode: Recycle inodenum from volatile inode slabs
To: Amir Goldstein, Chris Down
CC: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-kernel
References: <20191226154808.GA418948@chrisdown.name>
From: "zhengbin (A)"
Message-ID: <88698fed-528b-85b2-1d07-e00051d6db60@huawei.com>
Date: Fri, 27 Dec 2019 10:15:55 +0800

On 2019/12/27 2:04, Amir Goldstein wrote:
> On Thu, Dec 26, 2019 at 5:48 PM Chris Down wrote:
>> In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
>> On affected tiers, in excess of 10% of hosts show multiple files with
>> different content and the same inode number, with some servers even
>> having as many as 150 duplicated inode numbers with differing file
>> content.
>>
>> This causes actual, tangible problems in production. For example, we
>> have complaints from those working on remote caches that their
>> application is reporting cache corruptions because it uses (device,
>> inodenum) to establish the identity of a particular cache object, but
>> because it's not unique any more, the application refuses to continue
>> and reports cache corruption. Even worse, sometimes applications may
>> not even detect the corruption but continue anyway, causing phantom
>> and hard-to-debug behaviour.
>>
>> In general, userspace applications expect that (device, inodenum)
>> should be enough to uniquely identify one inode, which seems fair
>> enough. One might also need to check the generation, but in this case:
>>
>> 1. That's not currently exposed to userspace
>>    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY);
>> 2. Even with generation, there shouldn't be two live inodes with the
>>    same inode number on one device.
>>
>> In order to fix this, we reuse inode numbers from recycled slabs where
>> possible, allowing us to significantly reduce the risk of 32 bit
>> wraparound.
>>
>> There are probably some other potential users of this, like some FUSE
>> internals and {proc,sys,kern}fs style APIs, but doing a general
>> opt-out codemod requires some thinking: depending on the particular
>> callsites and how far up the stack they are, we might end up recycling
>> an i_ino value that actually does have some semantic meaning. As such,
>> to start with this patch only opts in a few get_next_ino-heavy
>> filesystems, and those which looked straightforward and without
>> likelihood for corner cases:
>>
>> - bpffs
>> - configfs
>> - debugfs
>> - efivarfs
>> - hugetlbfs
>> - ramfs
>> - tmpfs
>>
> I'm confused about this list.
> I suggested converting tmpfs and hugetlbfs because they use a private
> inode cache pool, therefore you can know for sure that a recycled i_ino
> was allocated by get_next_ino().

How about having tmpfs and hugetlbfs use their own get_next_ino(), e.g. a
static DEFINE_PER_CPU(unsigned int, tmpfs_last_ino)? That would reduce the
risk of 32-bit wraparound even further. (A rough sketch is appended after
the quoted thread below.)

> If I am not mistaken, the other filesystems above are using the common
> inode_cache pool, so when you recycle i_ino from that pool you don't
> know where it came from and cannot trust its uniqueness in the
> get_next_ino() domain. Even if *all* filesystems that currently use the
> common inode_cache use get_next_ino() exclusively to allocate ino
> numbers, that could change in the future.
>
> I'd go even further and say that introducing a generic helper for this
> sort of thing is asking for trouble. It is best to keep the recycling
> logic well within the bounds of the specific filesystem driver, which
> is the owner of the private inode cache and is responsible for
> allocating ino numbers in this pool.
>
> Thanks and happy holidays,
> Amir.
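
To make the suggestion above concrete, here is a rough, untested sketch of
what a tmpfs-private allocator could look like, modeled directly on the
existing get_next_ino() in fs/inode.c. The names shmem_next_ino,
shmem_last_ino, shmem_shared_last_ino and SHMEM_INO_BATCH are made up for
illustration only; they are not from the patch under discussion.

#include <linux/percpu.h>
#include <linux/atomic.h>

/* Rough sketch only -- identifiers are illustrative, not from the patch. */
#define SHMEM_INO_BATCH 1024

static DEFINE_PER_CPU(unsigned int, shmem_last_ino);

static unsigned int shmem_next_ino(void)
{
	unsigned int *p = &get_cpu_var(shmem_last_ino);
	unsigned int res = *p;

#ifdef CONFIG_SMP
	/* Refill this CPU's range from a shared counter, in batches. */
	if (unlikely((res & (SHMEM_INO_BATCH - 1)) == 0)) {
		static atomic_t shmem_shared_last_ino;
		int next = atomic_add_return(SHMEM_INO_BATCH,
					     &shmem_shared_last_ino);

		res = next - SHMEM_INO_BATCH;
	}
#endif

	res++;
	/* Never hand out inode number 0. */
	if (unlikely(!res))
		res++;
	*p = res;
	put_cpu_var(shmem_last_ino);
	return res;
}

The tradeoff is the same as for the global get_next_ino(): the counter is
still 32 bits wide and will eventually wrap, but tmpfs would no longer
share that space with every other get_next_ino() user, so the wraparound
point is pushed much further out.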