Subject: Re: directory delegations
To: Chuck Lever, Jeff Layton
Cc: Bruce Fields, Trond Myklebust, Linux NFS Mailing List
From: "Bradley C. Kuszmaul"
Kuszmaul" Message-ID: <9ca6e116-818b-6615-1532-47611b6fcc6f@oracle.com> Date: Thu, 4 Apr 2019 16:03:42 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 In-Reply-To: <97542732-49F5-4BEA-9903-D9801370A221@oracle.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9217 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904040127 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9217 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1011 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904040127 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org It would also be possible with our file system to preallocate inode numbers (inumbers). This isn't necessarily directly related to NFS, but one could imagine further extending NSF to allow a CREATE to happen entirely on the client by letting the client maintain a cache of preallocated inumbers. Just for the fun of it, I'll tell you a little bit more about how we preallocate inumbers. For Oracle's File Storage Service (FSS), Inumbers are cheap to allocate, and it's not a big deal if a few of them end up unused. Unused inode numbers don't use up any space. I would imagine that most B-tree-based file systems are like this.   In contrast in an ext-style file system, unused inumbers imply unused storage. Furthermore, FSS never reuses inumbers when files are deleted. It just keeps allocating new ones. There's a tradeoff between preallocating lots of inumbers to get better performance but potentially wasting the inumbers if the client were to crash just after getting a batch.   If you only ask for one at a time, you don't get much performance, but if you ask for 1000 at a time, there's a chance that the client could start, ask for 1000 and then immediately crash, and then repeat the cycle, quickly using up many inumbers.  Here's a 2-competetive algorithm to solve this problem (by "2-competetive" I mean that it's guaranteed to waste at most half of the inode numbers):  * A client that has successfully created K files without crashing is allowed, when it's preallocated cache of inumbers goes empty, to ask for another K inumbers. The worst-case lossage occurs if the client crashes just after getting K inumbers, and those inumbers go to waste.   But we know that the client successfully created K files, so we are wasting at most half the inumbers. For a long-running client, each time it asks for another batch of inumbers, it doubles the size of the request.  For the first file created, it does it the old-fashioned way.   For the second file, it preallocated a single inumber.   For the third file, it preallocates 2 inumbers.   On the fifth file creation, it preallocates 4 inumbers.  And so forth. One obstacle to getting FSS to use any of these ideas is that we currently support only NFSv3.   
One obstacle to getting FSS to use any of these ideas is that we currently
support only NFSv3. We need to get an NFSv4 server going, and then we'll be
interested in doing the server work to speed up these kinds of metadata
workloads.

-Bradley

On 4/4/19 11:22 AM, Chuck Lever wrote:
>
>> On Apr 4, 2019, at 11:09 AM, Jeff Layton wrote:
>>
>> On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org wrote:
>>> On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
>>>> This proposal does look like it would be helpful. How does this
>>>> kind of proposal play out in terms of actually seeing the light of
>>>> day in deployed systems?
>>> We need some people to commit to implementing it.
>>>
>>> We have 2-3 testing events a year, so ideally we'd agree to show up with
>>> implementations at one of those to test and hash out any issues.
>>>
>>> We revise the draft based on any experience or feedback we get. If
>>> nothing else, it looks like it needs some updates for v4.2.
>>>
>>> The on-the-wire protocol change seems small, and my feeling is that if
>>> there's running code then documenting the protocol and getting it
>>> through the IETF process shouldn't be a big deal.
>>>
>>> --b.
>>>
>>>> On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
>>>>> On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
>>>>>> The create itself needs to be sync, but the attribute delegations mean
>>>>>> that the client, not the server, is authoritative for the timestamps.
>>>>>> So the client now owns the atime and mtime, and just sets them as part
>>>>>> of the (asynchronous) delegreturn some time after you are done writing.
>>>>>>
>>>>>> Were you perhaps thinking about this earlier proposal?
>>>>>> https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
>>>>> That's it, thanks!
>>>>>
>>>>> Bradley is concerned about performance of something like untar on a
>>>>> backend filesystem with particularly high-latency metadata operations,
>>>>> so something like your unstable file creation proposal (or actual write
>>>>> delegations) seems like it should help.
>>>>>
>>>>> --b.
>> The serialized create with something like an untar is a
>> performance-killer though.
>>
>> FWIW, I'm working on something similar right now for Ceph. If a ceph
>> client has adequate caps [1] for a directory and the dentry inode,
>> then we should (in principle) be able to buffer up directory morphing
>> operations and flush them out to the server asynchronously.
>>
>> I'm starting with unlink (mostly because it's simpler), and am mainly
>> just returning early when we do have the right caps -- after issuing
>> the call but before the reply comes in. We should be able to do the
>> same for link, rename and create too. Create will require the Ceph MDS
>> to delegate out a range of inode numbers (and that bit hasn't been
>> implemented yet).
>>
>> My thinking with all of this is that the buffering of directory
>> morphing operations is not as helpful as something like a pagecache
>> write is, as we aren't that interested in merging operations that
>> change the same dentry. However, being able to do them asynchronously
>> should work really well. That should allow us to better parallelize
>> create/link/unlink/rename on different dentries even when they are
>> issued serially by a single task.
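Just to picture the "return early when the right caps are held" idea Jeff
describes, a minimal sketch of the unlink path might look like the
following. Every helper here (build_unlink_request(),
client_holds_unlink_caps(), and so on) is hypothetical and does not
correspond to actual Ceph client code:

/*
 * Sketch only: the mds_request helpers and the caps check below are
 * hypothetical, not real Ceph client interfaces.
 */
#include <linux/fs.h>
#include <linux/err.h>

struct mds_request;
struct mds_request *build_unlink_request(struct inode *dir,
					 struct dentry *dentry);
void send_mds_request(struct mds_request *req);
int wait_for_mds_reply(struct mds_request *req);
bool client_holds_unlink_caps(struct inode *dir, struct dentry *dentry);

static int unlink_maybe_async(struct inode *dir, struct dentry *dentry)
{
	struct mds_request *req = build_unlink_request(dir, dentry);

	if (IS_ERR(req))
		return PTR_ERR(req);

	/* The unlink request always goes to the MDS. */
	send_mds_request(req);

	/*
	 * With caps covering the directory and the dentry's inode, drop
	 * the dentry locally and return before the reply arrives; the
	 * reply is then processed asynchronously later.
	 */
	if (client_holds_unlink_caps(dir, dentry)) {
		d_drop(dentry);
		return 0;
	}

	/* Otherwise fall back to the usual synchronous behaviour. */
	return wait_for_mds_reply(req);
}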
> What happens if an asynchronous directory change fails (e.g. ENOSPC)?
>
>
>> RFC5661 doesn't currently provide for writeable directory delegations,
>> AFAICT, but they could eventually be implemented in a similar way.
>>
>> [1]: cephfs capabilities (aka caps) are like a delegation for a subset
>> of inode metadata
>> --
>> Jeff Layton
> --
> Chuck Lever