From: Amir Goldstein
Date: Tue, 24 Dec 2019 05:04:11 +0200
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
To: Chris Down
Cc: Matthew Wilcox, linux-fsdevel, Al Viro, Jeff Layton, Johannes Weiner,
    Tejun Heo, linux-kernel, kernel-team@fb.com, Hugh Dickins, Miklos Szeredi,
    "zhengbin (A)", Roman Gushchin
In-Reply-To: <20191223204551.GA272672@chrisdown.name>
References: <20191220024936.GA380394@chrisdown.name>
 <20191220121615.GB388018@chrisdown.name>
 <20191220164632.GA26902@bombadil.infradead.org>
 <20191220195025.GA9469@bombadil.infradead.org>
 <20191223204551.GA272672@chrisdown.name>

> The slab i_ino recycling approach works somewhat, but is unfortunately
> neutered quite a lot by the fact that slab recycling is per-memcg.
> That is, replacing with recycle_or_get_next_ino(old_ino)[0] for shmfs and a
> few other trivial callsites only leads to about 10% slab reuse, which doesn't
> really stem the bleeding of 32-bit inums on an affected workload:
>
>     # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>        4454 recycle_or_get_next_ino: not recycled
>         546 recycle_or_get_next_ino: recycled
>

Too bad... Maybe ino recycling should be implemented all the same, because it
is simple and may improve workloads that are not so memcg intensive.

> Roman (who I've just added to cc) tells me that currently we only have
> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
> contributes fairly significantly here since there are multiple tasks across
> multiple cgroups which are contributing to the get_next_ino() thrash.
>
> I think this is a good start, but we need something of a different magnitude
> in order to actually solve this problem with the current slab infrastructure.
> How about something like the following?
>
> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
> 2. Use get_next_ino_full in tmpfs (et al.)

I would prefer that filesystems making heavy use of get_next_ino be converted
to use a private ino pool per sb (a rough, untested sketch is at the end of
this mail):

    ino_pool_create()
    ino_pool_get_next()

Flags to ino_pool_create() can determine the desired ino range.

Does the Facebook use case involve a single large tmpfs or many small ones?
I would guess the latter, and therefore we are trying to solve a problem that
nobody really needs to solve (i.e. a globally efficient ino pool).

> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>    pass if they want the 32-bit inode numbers back. This would still allow
>    people who want to make this tradeoff to use xino.

inode32|inode64 (see man xfs(5)).

> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>

I don't know about disable, but the default mode for tmpfs (inode32|inode64)
might best be determined by a CONFIG option, so distro builders could decide
whether they want to take the risk of breaking applications on tmpfs.

But if you implement a per-sb ino pool, maybe inode64 will no longer be
required for your use case?

Thanks,
Amir.
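
P.S. To illustrate what I mean by a private per-sb ino pool, here is a rough,
untested sketch. None of this exists in the kernel today; ino_pool_create(),
ino_pool_get_next(), struct ino_pool and the INO_POOL_32BIT flag are only
placeholder names for the idea, not a proposed final API:

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Hypothetical pool flag: hand out inode numbers only from the 32-bit range */
#define INO_POOL_32BIT	(1U << 0)

struct ino_pool {
	atomic64_t	next;		/* last inode number handed out */
	unsigned int	flags;
};

struct ino_pool *ino_pool_create(unsigned int flags)
{
	struct ino_pool *pool;

	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return NULL;

	/* Start above 1 so that ino 0 and the usual root ino stay reserved */
	atomic64_set(&pool->next, 1);
	pool->flags = flags;
	return pool;
}

ino_t ino_pool_get_next(struct ino_pool *pool)
{
	u64 ino;

	do {
		ino = atomic64_inc_return(&pool->next);
		/* inode32 mode: wrap within 32 bits for legacy userspace */
		if (pool->flags & INO_POOL_32BIT)
			ino &= 0xffffffffU;
	} while (ino < 2);	/* skip reserved values 0 and 1 on wraparound */

	return ino;
}

The point is that each tmpfs instance only races with itself on its own
atomic counter instead of every get_next_ino() user in the system drawing
from the same shared 32-bit counter, and a pool created without
INO_POOL_32BIT uses the full width of ino_t, so a single sb would have to
create on the order of 2^64 inodes before wrapping.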