Date: Wed, 25 Dec 2019 12:54:48 +0000
From: Chris Down
To: Amir Goldstein
Cc: Matthew Wilcox, linux-fsdevel, Al Viro, Jeff Layton, Johannes Weiner,
    Tejun Heo, linux-kernel, kernel-team@fb.com, Hugh Dickins,
    Miklos Szeredi, "zhengbin (A)", Roman Gushchin
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when
    ino_t is 64 bit
Message-ID: <20191225125448.GA309148@chrisdown.name>
References: <20191220024936.GA380394@chrisdown.name>
    <20191220121615.GB388018@chrisdown.name>
    <20191220164632.GA26902@bombadil.infradead.org>
    <20191220195025.GA9469@bombadil.infradead.org>
    <20191223204551.GA272672@chrisdown.name>
X-Mailing-List: linux-kernel@vger.kernel.org

Amir Goldstein writes:
>> The slab i_ino recycling approach works somewhat, but is unfortunately
>> neutered quite a lot by the fact that slab recycling is per-memcg.
>> That is, replacing get_next_ino() with recycle_or_get_next_ino(old_ino)[0]
>> for shmfs and a few other trivial callsites only leads to about 10% slab
>> reuse, which doesn't really stem the bleeding of 32-bit inums on an
>> affected workload:
>>
>> # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>    4454 recycle_or_get_next_ino: not recycled
>>     546 recycle_or_get_next_ino: recycled
>>
>
>Too bad..
>Maybe recycled ino should be implemented all the same, because it is simple
>and may improve workloads that are not so memcg intensive.

Yeah, I agree. I'll send the full patch over separately (i.e. not as v2 for
this), since it's not a total solution for the problem, but it still helps
somewhat and we all seem to agree that it's overall an uncontroversial
improvement.

>> Roman (who I've just added to cc) tells me that currently we only have
>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>> contributes fairly significantly here, since there are multiple tasks
>> across multiple cgroups contributing to the get_next_ino() thrash.
>>
>> I think this is a good start, but we need something of a different
>> magnitude in order to actually solve this problem with the current slab
>> infrastructure. How about something like the following?
>>
>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>> 2. Use get_next_ino_full in tmpfs (et al.)
>
>I would prefer that filesystems making heavy use of get_next_ino be
>converted to use a private ino pool per sb:
>
>ino_pool_create()
>ino_pool_get_next()
>
>flags to ino_pool_create() can determine the desired ino range.
>Does the Facebook use case involve a single large tmpfs or many small
>ones? I would guess the latter, and therefore we are trying to solve a
>problem that nobody really needs to solve (i.e. a global efficient ino
>pool).

Unfortunately, in the case under discussion it's all in one large tmpfs in
/dev/shm.
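The recycle_or_get_next_ino(old_ino) helper being measured above can be
sketched in userspace roughly as follows. Only the function name and the
old_ino argument come from the thread; the body is an inferred sketch of its
semantics, not the actual patch:

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of recycle_or_get_next_ino(): a shmem
 * inode's slab object still carries its previous i_ino, so when the
 * allocator hands back a recycled object (old_ino != 0) the old number
 * can be reused instead of consuming a fresh one.
 */

static uint64_t next_ino = 1;   /* stands in for the global get_next_ino() counter */

uint64_t recycle_or_get_next_ino(uint64_t old_ino)
{
	if (old_ino)
		return old_ino;         /* the "recycled" case in the trace */
	return next_ino++;          /* the "not recycled" case */
}
```

The trace counts show why this underperforms here: with per-memcg slab
caches, only around 10% of allocations receive an object carrying a
non-zero old_ino.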
I can empathise with that -- application owners often prefer to use the
mounts provided to them rather than having to set up their own. For this one
case we can change that, but I think it's reasonable to support this case,
since using a single tmpfs can be a sensible decision as an application
developer, especially if you only have unprivileged access to the system.

>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people
>> can pass if they want the 32-bit inode numbers back. This would still
>> allow people who want to make this tradeoff to use xino.
>
>inode32|inode64 (see man xfs(5)).

Ah great, thanks! I'll reuse the precedent from those.

>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>
>I don't know about disable, but the default mode for tmpfs
>(inode32|inode64) might be best determined by a CONFIG option, so distro
>builders could decide whether they want to take the risk of breaking
>applications on tmpfs.

Sounds good.

>But if you implement a per-sb ino pool, maybe inode64 will no longer be
>required for your use case?

In this case I think a per-sb ino pool will help a bit, but unfortunately
not by an order of magnitude. As with the recycling patch, it will reduce
thrash somewhat but not conclusively prevent the problem from happening
long-term. To fix that, I think we really do need the option to use an
ino_t-sized get_next_ino_full (or a per-sb equivalent).

Happy holidays, and thanks for your feedback!

Chris
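Amir's proposed per-sb pool API might be modelled along these lines. Only
the ino_pool_create() and ino_pool_get_next() names come from the thread;
the flag, the wrap handling, and all internals are assumptions for
illustration:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define INO_POOL_32BIT 0x1      /* hypothetical flag: inode32-style range */

/* Hypothetical per-superblock inode number pool. */
struct ino_pool {
	atomic_uint_fast64_t next;  /* next number to hand out */
	uint64_t max;               /* last valid number before wrapping */
};

struct ino_pool *ino_pool_create(unsigned int flags)
{
	struct ino_pool *pool = malloc(sizeof(*pool));

	if (!pool)
		return NULL;
	atomic_init(&pool->next, 1);    /* inum 0 is conventionally reserved */
	pool->max = (flags & INO_POOL_32BIT) ? UINT32_MAX : UINT64_MAX;
	return pool;
}

uint64_t ino_pool_get_next(struct ino_pool *pool)
{
	uint64_t ino = atomic_fetch_add(&pool->next, 1);

	if (ino > pool->max) {
		/*
		 * Wrapped: restart at 1. A real implementation would have to
		 * skip numbers still owned by live inodes, and avoid the race
		 * between this store and concurrent fetch_adds; this sketch
		 * ignores both.
		 */
		atomic_store(&pool->next, 2);
		ino = 1;
	}
	return ino;
}
```

A per-sb pool keeps unrelated mounts from draining each other's number
space, but as noted in the thread, a single large tmpfs can still exhaust a
32-bit range on its own.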