Date: Fri, 20 Dec 2019 12:16:15 +0000
From: Chris Down
To: Amir Goldstein
Cc: linux-fsdevel, Al Viro, Jeff Layton, Johannes Weiner, Tejun Heo, linux-kernel, kernel-team@fb.com, Hugh Dickins, Matthew Wilcox, Miklos Szeredi
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Message-ID: <20191220121615.GB388018@chrisdown.name>
References: <20191220024936.GA380394@chrisdown.name>

Hi Amir,

Thanks for getting back, I appreciate it.
Amir Goldstein writes:
>How about something like this:
>
>/* just to explain - use an existing macro */
>shmem_ino_shift = ilog2(sizeof(void *));
>inode->i_ino = (__u64)inode >> shmem_ino_shift;
>
>This should solve the reported problem with little complexity,
>but it exposes internal kernel address to userspace.

One problem I can see with that approach is that get_next_ino doesn't
discriminate based on context (for example, when it is called for a
particular tmpfs mount), which means that wraparound risk is still
eventually pushed to the limit on such machines for the other users of
get_next_ino (named pipes, sockets, procfs, and so on). Granted,
collisions among those users are less likely given how many inodes they
typically hold at one time compared to some tmpfs workloads, but still.

>Can we do anything to mitigate this risk?
>
>For example, instead of trying to maintain a unique map of
>ino_t to struct shmem_inode_info * in the system,
>it would be enough (and less expensive) to maintain a unique map of
>shmem_ino_range_t to slab.
>The ino_range id can then be mixed with the relative object index in
>the slab to compose i_ino.
>
>The big win here is not having to allocate an id per batch of inodes
>instead of per inode, but the fact that recycled (i.e. delete/create)
>shmem_inode_info objects get the same i_ino without having to
>allocate any id.
>
>This mimics a standard behavior of blockdev filesystems like ext4/xfs,
>where the inode number is determined by logical offset on disk and is
>quite often recycled on delete/create.
>
>I realize that the method I described with slab is crossing module layers
>and would probably be NACKED.

Yeah, that's more or less my concern with that approach as well, hence
why I went for something that seemed less intrusive and keeps with the
current inode allocation strategy :-)

>A similar result could be achieved by shmem keeping a small stash of
>recycled inode objects, which are not returned to the slab right away
>and retain their allocated i_ino. This at least should significantly
>reduce the rate at which get_next_ino allocations are burned.

While this issue happens to present itself currently on tmpfs, I'm
worried that future users of get_next_ino, going by historic precedent,
might end up hitting this as well. That's the main reason why I'm
inclined to try to improve get_next_ino's strategy itself.

>Anyway, to add another consideration to the mix, overlayfs uses
>the high ino bits to multiplex several layers into a single ino domain
>(mount option xino=on).
>
>tmpfs is a very commonly used filesystem as an overlayfs upper layer,
>so many users are going to benefit from keeping the uppermost bits
>of tmpfs inode numbers unused.
>
>For this reason, I dislike the current "grow forever" approach of
>get_next_ino() and prefer that we use a smarter scheme when
>switching over to 64bit values.

By "a smarter scheme when switching over to 64bit values", you mean
keeping i_ino as low in magnitude as possible while still avoiding
simultaneous reuse, right?

To that extent, if we can reliably and expediently recycle inode
numbers, I'm not against sticking to the existing typing scheme in
get_next_ino. It's just a matter of agreeing by what method and at what
level of the stack that should take place :-)

I'd appreciate your thoughts on approaches forward. One potential
option is to reimplement get_next_ino using an IDA, as mentioned in my
patch message (rough sketch below my signature). Other than the
potential to upset microbenchmarks, do you have concerns with that as a
patch?

Thanks,

Chris
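P.S. To make the IDA option concrete, here is a rough, untested sketch
of the kind of thing I have in mind. It is not the posted patch, and the
helper names (get_next_ino_ida/put_ino_ida) are made up purely for
illustration; the point is only that an IDA always hands back the lowest
free id, so numbers stay small and are recycled on delete:

	#include <linux/idr.h>
	#include <linux/gfp.h>
	#include <linux/types.h>

	/* One global pool of inode numbers, shared by get_next_ino users. */
	static DEFINE_IDA(shared_ino_ida);

	/* Allocate the lowest free inode number; 0 stays reserved as "no ino". */
	static int get_next_ino_ida(ino_t *ino)
	{
		int id = ida_alloc_min(&shared_ino_ida, 1, GFP_KERNEL);

		if (id < 0)
			return id;
		*ino = id;
		return 0;
	}

	/* Return the number to the pool when the inode is destroyed. */
	static void put_ino_ida(ino_t ino)
	{
		ida_free(&shared_ino_ida, ino);
	}

Compared to the current per-cpu batching in get_next_ino, every
allocation and free here takes the IDA's internal lock, which is exactly
where the microbenchmark concern comes from; whether that cost matters
outside of microbenchmarks is the open question above.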