Received: by 2002:a25:e7d8:0:0:0:0:0 with SMTP id e207csp1744881ybh; Fri, 13 Mar 2020 06:42:55 -0700 (PDT) X-Google-Smtp-Source: ADFU+vsCvJ3pqIfRI7mDSLO3XN81susdy4MssJhX5QMJZrUX+4wRXbMNtG+eU1Ri7NPybLJnpUiq X-Received: by 2002:aca:c488:: with SMTP id u130mr7136325oif.154.1584106975080; Fri, 13 Mar 2020 06:42:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1584106975; cv=none; d=google.com; s=arc-20160816; b=pcoqGyI0ID5cSw/NJ9KArQ1DYAoqiijAmRt2iSI5FnLvkFy6HQAxaw6kJqb3NxQjza zRRm3tO471+D2VJm8D895OEh+DVSezF+cke0S2wOMm5Xto4TAFhQN+gMZFAE7Sb+/jwP UXE1oefpDpNUPOx8q/U3R/aWtMNt28doJc5JYNR2UT/clg7/OVBUrpMNnzbrH5fkYJy4 TY1HupVDAQTcxXlv6bFmJpV0eKymEfm/BXBOxRw2a1i7buJuM9SpPZxc+4G9D8Aicv0u BD2Qwgzf+5VBmq50vI7fDaycybQdI5d23/gkE/tkxicOwTuSn/VZx5sX2B3LEu2VdlUO 8i+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=WmJMCPjLeZOGPd5XvclGChfYUeyS2atWd4pQtWyZc14=; b=IRQVW7YjjO9VxPLsBZPsAmfPMavOaA0ypCHN4SQP2OFuCWr3dWB4nKJoiL//u1P2QW bsfVNs5Xm0n0SmfljEoAA05L/mVP62Zl6iU8ZIvkgwkH8fNpx+ldeTFfdF7q1wUR6CiO Hx/lynNfxEg52H9iuQX1FEUTFJsHat8aLQqgEov8I6W2dyWcArSbESTAaB15/jiLFqYN JIcsR8OGtYCjTse9+TbcPoMlx8hyOlxNyw0PxqJsOcjSUiPnlX8Dx3glhnmSRXLMnXTd PAufSqrSVGhqAH2JmsdCOKiruMAHatR0bvAuz9sazC1SAark+vYIXHBNHtUsbsKnigRQ SzBg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="H6/1F5IN"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l16si4340244otp.260.2020.03.13.06.42.40; Fri, 13 Mar 2020 06:42:55 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="H6/1F5IN"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726636AbgCMNmQ (ORCPT + 99 others); Fri, 13 Mar 2020 09:42:16 -0400 Received: from us-smtp-1.mimecast.com ([205.139.110.61]:44352 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726327AbgCMNmP (ORCPT ); Fri, 13 Mar 2020 09:42:15 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1584106934; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=WmJMCPjLeZOGPd5XvclGChfYUeyS2atWd4pQtWyZc14=; b=H6/1F5IN3m8BybRL0RM81CtAAAyEuYErKvkoFUE9BlS+qhg44uH1X3Ge0di0OcMJmHvzmJ xYA9freBuSyzSsK2IhPlwBWIJpODTbZZVu/Bj9h1A78BaP6KnuDLyBOZAIsGmJ1b9YWHR5 vP2Vk0VttwGj2c0oJLHHUFBu1EONGnk= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-408-dxlPnEmCMRqY71uj9jQetw-1; Fri, 13 Mar 2020 09:42:06 -0400 X-MC-Unique: dxlPnEmCMRqY71uj9jQetw-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 0C61518A8CAF; Fri, 13 Mar 2020 13:42:04 +0000 (UTC) Received: from horse.redhat.com (unknown [10.18.25.210]) by smtp.corp.redhat.com (Postfix) with ESMTP id A89A4101D480; Fri, 13 Mar 2020 13:41:55 +0000 (UTC) Received: by horse.redhat.com (Postfix, from userid 10451) id 3896722021D; Fri, 13 Mar 2020 09:41:55 -0400 (EDT) Date: Fri, 13 Mar 2020 09:41:55 -0400 From: Vivek Goyal To: Miklos Szeredi Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm , virtio-fs@redhat.com, Stefan Hajnoczi , "Dr. David Alan Gilbert" , "Michael S. Tsirkin" , Miklos Szeredi , Liu Bo , Peng Tao Subject: Re: [PATCH 13/20] fuse, dax: Implement dax read/write operations Message-ID: <20200313134155.GA156804@redhat.com> References: <20200304165845.3081-1-vgoyal@redhat.com> <20200304165845.3081-14-vgoyal@redhat.com> <20200312160208.GB114720@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 13, 2020 at 11:18:15AM +0100, Miklos Szeredi wrote: [..] > > > > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */ > > > > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset, > > > > + struct fuse_dax_mapping *dmap, bool writable, > > > > + bool upgrade) > > > > +{ > > > > + struct fuse_conn *fc = get_fuse_conn(inode); > > > > + struct fuse_inode *fi = get_fuse_inode(inode); > > > > + struct fuse_setupmapping_in inarg; > > > > + FUSE_ARGS(args); > > > > + ssize_t err; > > > > + > > > > + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ); > > > > + WARN_ON(fc->nr_free_ranges < 0); > > > > + > > > > + /* Ask fuse daemon to setup mapping */ > > > > + memset(&inarg, 0, sizeof(inarg)); > > > > + inarg.foffset = offset; > > > > + inarg.fh = -1; > > > > + inarg.moffset = dmap->window_offset; > > > > + inarg.len = FUSE_DAX_MEM_RANGE_SZ; > > > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ; > > > > + if (writable) > > > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE; > > > > + args.opcode = FUSE_SETUPMAPPING; > > > > + args.nodeid = fi->nodeid; > > > > + args.in_numargs = 1; > > > > + args.in_args[0].size = sizeof(inarg); > > > > + args.in_args[0].value = &inarg; > > > > > > args.force = true? > > > > I can do that but I am not sure what exactly does args.force do and > > why do we need it in this case. > > Hm, it prevents interrupts. Looking closely, however it will only > prevent SIGKILL from immediately interrupting the request, otherwise > it will send an INTERRUPT request and the filesystem can ignore that. > Might make sense to have a args.nonint flag to prevent the sending of > INTERRUPT... Hi Miklos, virtiofs does not support interrupt requests yet. Its fiq interrupt handler just does not do anything. static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq) __releases(fiq->lock) { /* * TODO interrupts. * * Normal fs operations on a local filesystems aren't interruptible. * Exceptions are blocking lock operations; for example fcntl(F_SETLKW) * with shared lock between host and guest. */ spin_unlock(&fiq->lock); } So as of now setting force or not will not make any difference. We will still end up waiting for request to finish. Infact, I think there is no mechanism to set fc->no_interrupt in virtio_fs. If I am reading request_wait_answer(), correctly, it will see fc->no_interrupt is not set. That means filesystem supports interrupt requests and it will do wait_event_interruptible() and not even check for FR_FORCE bit. Right now fc->no_interrupt is set in response to INTERRUPT request reply. Will it make sense to also be able to set it as part of connection negotation protocol and filesystem can tell in the beginning itself that it does not support interrupt and virtiofs can make use of that. So force flag is only useful if filesystem does not support interrupt and in that case we do wait_event_killable() and upon receiving SIGKILL, cancel request if it is still in pending queue. For virtiofs, we take request out of fiq->pending queue in submission path itself and if it can't be dispatched it waits on virtiofs speicfic queue with FR_PENDING cleared. That means, setting FR_FORCE for virtiofs does not mean anything as caller will end up waiting for request to finish anyway. IOW, setting FR_FORCE will make sense when we have mechanism to detect that request is still queued in virtiofs queues and have mechanism to cancel it. We don't have it. In fact, given we are a push model, we dispatch request immediately to filesystem, until and unless virtqueue is full. So probability of a request still in virtiofs queue is low. So may be we can start setting force at some point of time later when we have mechanism to cancel detect and cancel pending requests in virtiofs. > > > First thing it does is that request is allocated with flag __GFP_NOFAIL. > > Second thing it does is that caller is forced to wait for request > > completion and its not an interruptible sleep. > > > > I am wondering what makes FUSE_SETUPMAPING/FUSE_REMOVEMAPPING requests > > special that we need to set force flag. > > Maybe not for SETUPMAPPING (I was confused by the error log). > > However if REMOVEMAPPING fails for some reason, than that dax mapping > will be leaked for the lifetime of the filesystem. Or am I > misunderstanding it? FUSE_REMVOEMAPPING is not must. If we send another FUSE_SETUPMAPPING, then it will create the new mapping and free up resources associated with the previous mapping, IIUC. So at one point of time we were thinking that what's the point of sending FUSE_REMOVEMAPPING. It helps a bit with freeing up filesystem resources earlier. So if cache size is big, then there will not be much reclaim activity going and if we don't send FUSE_REMOVEMAPPING, all these filesystem resources will remain busy on host for a long time. > > > > > + ret = fuse_setup_one_mapping(inode, > > > > + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ), > > > > + dmap, true, true); > > > > + if (ret < 0) { > > > > + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n", > > > > + ret, pos); > > > > > > Again. > > > > Will remove. How about converting some of them to pr_debug() instead? It > > can help with debugging if something is not working. > > Okay, and please move it to fuse_setup_one_mapping() where there's > already a pr_debug() for the success case. Will do. > > > > + > > > > + /* Do not use dax for file extending writes as its an mmap and > > > > + * trying to write beyong end of existing page will generate > > > > + * SIGBUS. > > > > > > Ah, here it is. So what happens in case of a race? Does that > > > currently crash KVM? > > > > In case of race, yes, KVM hangs. So no shared directory operation yet > > till we have designed proper error handling in kvm path. > > I think before this is merged we have to fix the KVM crash; that's not > acceptable even if we explicitly say that shared directory is not > supported for the time being. Ok, I will look into it. I had done some work in the past and realized its not trivial to fix kvm error paths. There are no users and propagating signals back into qemu instances and finding the right process is going to be tricky. Given the complexity of that work, I thought that for now we say that shared directory is not supported and once basic dax patches get merged, focus on kvm work. Thanks Vivek