Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp4739193iob; Sun, 8 May 2022 23:50:44 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzQSofs36LVyCPpaofudagdeeCJGcMnjAehjSbVzN5eonXFCaZhkoA68H8DJFz2iGVXy88U X-Received: by 2002:a05:6a00:1352:b0:510:4c0e:d230 with SMTP id k18-20020a056a00135200b005104c0ed230mr15068991pfu.79.1652079043944; Sun, 08 May 2022 23:50:43 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1652079043; cv=none; d=google.com; s=arc-20160816; b=z1v35o0CUNzJwWcpeGowI8nzhOkqipwkjcnigQVOWMOCSD71pabgGLvqiNhEwJqkLW yx/kT5+ytYqcgr2auTD4WHISJ9MhmJg0O9XHc/uMSSJOHUWD7ygfS6zUWotWQvnldGPT oJbptGZKTNr23cXd3lyEjTL/lG/0Hn5xqEv71cCtR0N0KGyIje5g0fYdQ0c9oBCPAJqd YaxnICJ4yC1oBEQskflUG7x+7Gr+h7lBu+5iHFdQ3WQJUgNPRqnP1Vn/U9EPgYX3rEoi 5kARsxHKsdpgHXVe2DqOmgsAet77UQUEdqPc3wFqiphJsbPS0hks23RW9AjQ+In1BeyU V9iQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:cc:in-reply-to:user-agent :content-transfer-encoding:mime-version:references:message-id:date :subject:from:to; bh=SHnHg1qgdwIhqzRDRkCAhrUkCBvhtGph/d+MTIeGKIE=; b=XTSLVl1LgAoxun1JLHzK5VeuM/WU/QoQbEptAXMSecHojcxiNDamb9qa33m8WTSGHX +Y+fNF9sVFm4t2OGtHZrYMYMU1o0SbdAN2lxR9f2cUXVp+x97Kr6C0JsC4OxDrGBylvz UCsCIAwFhTvAKRCbwLm6kbUHKXPvgKMqoQqflMCbyFdY3okn6/0snj7C+kBtWVIxQ14A VUZDSu4Bbx8BNEDbI7V+7hNYtATWi/YV6rewsptksi5NdT/77wRYSdP4INkea5jXIJb0 V3yYYzJqmRG44XDDqRjIHcuRNkhzx559/uulWAQxNT+eD8AIgMvK442OL7wdlVVinweo JUpA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id m17-20020a170902db1100b0015eb6621630si12759425plx.307.2022.05.08.23.50.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 08 May 2022 23:50:43 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 153E914AA56; Sun, 8 May 2022 23:46:54 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376383AbiEGKyJ (ORCPT + 99 others); Sat, 7 May 2022 06:54:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34402 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1386163AbiEGKxv (ORCPT ); Sat, 7 May 2022 06:53:51 -0400 Received: from ciao.gmane.io (ciao.gmane.io [116.202.254.214]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D8EA0A1AB for ; Sat, 7 May 2022 03:50:04 -0700 (PDT) Received: from list by ciao.gmane.io with local (Exim 4.92) (envelope-from ) id 1nnI0g-0004G2-AM for linux-kernel@vger.kernel.org; Sat, 07 May 2022 12:50:02 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: linux-kernel@vger.kernel.org From: =?UTF-8?Q?Jean-Pierre_Andr=c3=a9?= Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create Date: Sat, 7 May 2022 12:42:56 +0200 Message-ID: References: <20220502102521.22875-1-dharamhans87@gmail.com> <20220502102521.22875-2-dharamhans87@gmail.com> <78c2beed-b221-71b4-019f-b82522d98f1e@ddn.com> <90fbe06b-4af7-c9ce-4aca-393aed709722@ddn.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.0 In-Reply-To: <90fbe06b-4af7-c9ce-4aca-393aed709722@ddn.com> Cc: linux-fsdevel@vger.kernel.org, fuse-devel@lists.sourceforge.net Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Bernd Schubert wrote on 5/6/22 8:45 PM: > > > On 5/6/22 19:07, Vivek Goyal wrote: >> On Fri, May 06, 2022 at 06:41:17PM +0200, Bernd Schubert wrote: >>> >>> >>> On 5/6/22 16:12, Vivek Goyal wrote: >>> >>> [...] >>> >>>> On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote: >>> >>>> >>>> Ok, looks like your fuse file server is talking to a another file >>>> server on network and that's why you are mentioning two network trips. >>>> >>>> Let us differentiate between two things first. >>>> >>>> A. FUSE protocol semantics >>>> B. Implementation of FUSE protocl by libfuse. >>>> >>>> I think I am stressing on A and you are stressing on B. I just want >>>> to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE >>>> from fuse protocol point of view. Again look at from kernel's point of >>>> view and don't worry about libfuse is going to implement it. >>>> Implementations can vary. >>> >>> Agreed, I don't think we need to bring in network for the kernel to >>> libfuse >>> API. >>> >>>> >>>>   From kernel's perspective FUSE_CREATE is supposed to create + open a >>>> file. It is possible file already exists. Look at >>>> include/fuse_lowlevel.h >>>> description for create(). >>>> >>>>           /** >>>>            * Create and open a file >>>>            * >>>>            * If the file does not exist, first create it with the >>>> specified >>>>            * mode, and then open it. >>>>            */ >>>> >>>> I notice that fuse is offering a high level API as well as low level >>>> API. I primarily know about low level API. To me these are just two >>>> different implementation but things don't change how kernel sends >>>> fuse messages and what it expects from server in return. >>>> >>>> Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference >>>> is that in reply message file server will also indicate if file was >>>> actually created or not. Is that right? >>>> >>>> And I am focussing on this FUSE API apsect. I am least concerned at >>>> this point of time who libfuse decides to actually implement >>>> FUSE_CREATE >>>> or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse >>>> server (instead of two) and that's performance optimization in libfuse. >>>> Kernel does not care how many calls did you make in file server to >>>> implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that >>>> create and open the file. >>>> >>>> So while you might do things in more atomic manner in file server and >>>> cut down on network traffic, kernel fuse API does not care. All it >>>> cares >>>> about is create + open a file. >>>> >>>> Anyway, from kernel's perspective, I think you should be able to >>>> just use FUSE_CREATE and still be do "lookup + create + open". >>>> FUSE_ATOMIC_CREATE is just allows one additional optimization so >>>> that you know whether to invalidate parent dir's attrs or not. >>>> >>>> In fact kernel is not putting any atomicity requirements as well on >>>> file server. And that's why I think this new command should probably >>>> be called FUSE_CREATE_EXT because it just sends back additional >>>> info. >>>> >>>> All the atomicity stuff you have been describing is that you are >>>> trying to do some optimizations in libfuse implementation to implement >>>> FUSE_ATOMIC_CREATE so that you send less number of commands over >>>> network. That's a good idea but fuse kernel API does not require you >>>> do these atomically, AFAICS. >>>> >>>> Given I know little bit of fuse low level API, If I were to implement >>>> this in virtiofs/passthrough_ll.c, I probably will do following. >>>> >>>> A. Check if caller provided O_EXCL flag. >>>> B. openat(O_CREAT | O_EXCL) >>>> C. If success, we created the file. Set file_created = 1. >>>> >>>> D. If error and error != -EEXIST, send error back to client. >>>> E. If error and error == -EEXIST, if caller did provide O_EXCL flag, >>>>      return error. >>>> F. openat() returned -EEXIST and caller did not provide O_EXCL flag, >>>>      that means file already exists.  Set file_created = 0. >>>> G. Do lookup() etc to create internal lo_inode and stat() of file. >>>> H. Send response back to client using fuse_reply_create(). >>>> This is one sample implementation for fuse lowlevel API. There could >>>> be other ways to implement. But all that is libfuse + filesystem >>>> specific and kernel does not care how many operations you use to >>>> complete and what's the atomicity etc. Of course less number of >>>> operations you do better it is. >>>> >>>> Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE >>>> descritpion (fuse_lowlevel.h) already mentions that "If the file >>>> does not >>>> exist, first create it with the specified mode and then open it". That >>>> means intent of protocol is that file could already be there as well. >>>> So I think we probably should implement this optimization (in kernel) >>>> using FUSE_CREATE command and then add FUSE_CREATE_EXT to add >>>> optimization >>>> about knowing whether file was actually created or not. >>>> >>>> W.r.t libfuse optimizations, I am not sure why can't you do >>>> optimizations >>>> with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If >>>> are you worried that some existing filesystems will break, I think >>>> you can create an internal helper say fuse_create_atomic() and then >>>> use that if filesystem offers it. IOW, libfuse will have two >>>> ways to implement FUSE_CREATE. And if filesystem offers a new way which >>>> cuts down on network traffic, libfuse uses more efficient method. We >>>> should not have to change kernel FUSE API just because libfuse can >>>> do create + open operation more efficiently. >>> >>> Ah right, I like this. As I had written before, the first patch >>> version was >>> using FUSE_CREATE and I was worried to break something. Yes, it >>> should be >>> possible split into lookup+create on the libfuse side. That being said, >>> libfuse will need to know which version it is - there might be an old >>> kernel >>> sending the non-optimized version - libfuse should not do another lookup >>> then. >> >> I am confused about one thing. For FUSE_CREATE command, how does it >> matter whether kernel has done lookup() before sending FUSE_CREATE. All >> FUSE_CREATE seems to say that crate a file (if it does not exist already) >> and then open it and return file handle as well as inode attributes. It >> does not say anything about whether a LOOKUP has already been done >> by kernel or not. >> >> It looks like you are assuming that if FUSE_CREATE is coming, that >> means client has already done FUSE_LOOKUP. So there is something we >> are not on same page about. >> >> I looked at fuse_lowlevel API and passthrough_ll.c and there is no >> assumption whether FUSE_LOOKUP has already been called by client >> before calling FUSE_CREATE. Similarly, I looked at virtiofs code >> and I can't see any such assumption there as well. > > The current linux kernel code does this right now, skipping the lookup > just changes behavior.  Personally I would see it as bug if the > userspace implementation does not handle EEXIST for FUSE_CREATE. > Implementation developer and especially users might see it differently > if a kernel update breaks/changes things out of the sudden. > passthrough_ll.c is not the issue here, it handles it correctly, but > what about the XYZ other file systems out there - do you want to check > them one by one, including closed source ones? And wouldn't even a > single broken application count as regression? > >> >> https://github.com/qemu/qemu/blob/master/tools/virtiofsd/passthrough_ll.c >> >> So I am sort of lost. May be you can help me understsand this. > > I guess it would be more interesting to look at different file systems > that are not overlay based. Like ntfs-3g - I have not looked at the code > yet, but especially disk based file system didn't have a reason so far > to handle EEXIST. And AFAIK ntfs-3g proper does not keep a context across calls and does not know what LOOKUP was preparing a CREATE. However this may have consequences in the high level of libfuse for managing the file tree. The kernel apparently issues a LOOKUP to decide whether issuing a CREATE (create+open) or an OPEN. If it sent blindly a CREATE, ntfs-3g would return EEXIST if the name was already present in the directory. For a test, can you suggest a way to force ignore of such lookup within libfuse, without applying your kernel patches ? Is there a way to detect the purpose of a lookup ? (A possible way is to hardcode a directory inode within which the lookups return ENOENT). > >> >>> Now there is 'fi.flags = arg->flags', but these are already taken by >>> open/fcntl flags - I would not feel comfortable to overload these. At >>> best, >>> struct fuse_create_in currently had a padding field, we could convert >>> these >>> to something like 'ext_fuse_open_flags' and then use it for fuse >>> internal >>> things. Difficulty here is that I don't know if all kernel >>> implementations >>> zero the struct (BSD, MacOS), so I guess we would need to negotiate at >>> startup/init time and would need another main feature flag? And with >>> that >>> I'm not be sure anymore if the result would be actually more simple than >>> what we have right now for the first patch. >> >> If FUSE_CREATE indeed has a dependency on FUSE_LOOKUP have been called >> before that, then I agree that we can't implement new semantics with >> FUSE_CREATE and we will have to introduce a new op say >> FUSE_ATOMIC_CREATE/FUSE_LOOKUP_CREATE/FUSE_CREATE_EXT. >> >> But looking at fuse API, I don't see FUSE_CREATE ever guaranteeing that >> a FUSE_LOOKUP has been done before this. > > It is not in written document, but in the existing (linux) kernel behavior. > > > You can look up "regressions due to 64-bit ext4 directory cookies" - > patches from me had been once already the reason for accidental breakage > and back that time I did not even have the slightest chance to predict > it, as glusterfs was relying on max 32bit telldir results and using the > other 32bit for its own purposes. Kernel behavior changed and > application broke, even though the application was relying on non-posix > behavior. In case of fuse there is no document like posix, but just API > headers and current behavior... > > Thanks, > Bernd > > > > >