Date: Wed, 23 May 2018 15:34:03 +0200
From: Cornelia Huck
To: Halil Pasic
Cc: Alex Williamson, kwankhede@nvidia.com, Dong Jia, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 0/2] vfio/mdev: Device namespace protection
Message-ID: <20180523153403.01c84046.cohuck@redhat.com>
References: <20180518190145.3187.7620.stgit@gimli.home> <20180522123829.4e758646@w520.home> <20180523105641.0d89701b.cohuck@redhat.com>
Organization: Red Hat GmbH
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 23 May 2018 14:29:28 +0200
Halil Pasic wrote:

> On 05/23/2018 10:56 AM, Cornelia Huck wrote:
> > On Tue, 22 May 2018 12:38:29 -0600
> > Alex Williamson wrote:
> >
> >> On Tue, 22 May 2018 19:17:07 +0200
> >> Halil Pasic wrote:
> >>
> >>> From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
> >>> be fine with these changes. I'm however not too deeply involved with
> >>> the mdev framework, thus I don't feel comfortable r-b-ing. That results
> >>> in
> >>> Acked-by: Halil Pasic
> >>> for both patches.
> >>>
> >>> While at it I have would like to ask about the semantics and intended
> >>> use of the mdev interfaces.
> >>>
> >>> static int vfio_ccw_sch_probe(struct subchannel *sch)
> >>> {
> >>>
> >>> /* HALIL: 8< Not so interesting stuff happens here. >8 */
> >>
> >> This was interesting:
> >>
> >> 	private->state = VFIO_CCW_STATE_NOT_OPER;
> >>
> >>> 	ret = vfio_ccw_mdev_reg(sch);
> >>> 	if (ret)
> >>> 		goto out_disable;
> >>> /*
> >>> * HALIL:
> >>> * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
> >>> * is made available (it calls mdev_register_device()). For instance create will
> >>> * attempt to decrement private->avail which is initialized below. I fail to
> >>> * understand how is this well synchronized.
> >>> */
> >>> 	INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
> >>> 	atomic_set(&private->avail, 1);
> >>> 	private->state = VFIO_CCW_STATE_STANDBY;
> >>>
> >>> 	return 0;
> >>>
> >>> out_disable:
> >>> 	cio_disable_subchannel(sch);
> >>> out_free:
> >>> 	dev_set_drvdata(&sch->dev, NULL);
> >>> 	kfree(private);
> >>> 	return ret;
> >>> }
> >>>
> >>> Should not initialization of go before mdev_register_device(), and then rolled
> >>> back if necessary if mdev_register_device() fails?
> >>>
> >>> In practice it does not seem very likely that userspace can trigger
> >>> mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
> >>> not be a practical problem. But I would like to understand how synchronization
> >>> is supposed to work.
> >>>
> >>> [Added Dong Jia, maybe he is also able to answer my question.]
> >>
> >> vfio_ccw_mdev_create() requires that private->state is not
> >> VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
> >> to this value before calling vfio_ccw_mdev_reg(), so a create should
> >> return -ENODEV if racing with parent registration. Is there something
> >> else that I'm missing? Thanks,
> >>
> >
> Disclaimer: I did not do much kernel work up until now. I still have
> much to learn.
>
> I mostly agree with your analysis but I'm not sure if the conclusion should be
> 'and thus everything is good' or 'and thus indeed we do have a race, a
> poorly handled one'.

Let me throw in that there is more than one way to handle a race, and
one of them is to return an error if something happens at an
inconvenient time :)

>
> One thing I'm not sure about is: can atomic_set(&private->avail, 1) and
> private->state = VFIO_CCW_STATE_STANDBY be perceived as reordered by
> e.g. some other cpu and thus vfio_ccw_mdev_create() or not. I tried to
> figure it out based on Documentation/atomic_t.txt but was not very successful.
> If these can be reordered we could observe -EPERM instead of -ENODEV, I
> think.

I don't think that matters (see below).

>
> Furthermore from your analysis I deduce that the client code (I think mdev
> calls it vendor code) may rely on mdev_register_device() containing a
> (RELEASE) barrier. We use a mutex in there so the barrier is there. And
> the client code may rely on a (ACQUIRE) barrier before the create callback
> is called. That should also be true and was true in the past too again because
> of mutex usage.
>
>
> >> Alex
> >
> > No, I think your understanding is correct. We move the state from
> > NOT_OPER to STANDBY only after we're set up completely, so our create
> > callback will simply fail early with -ENODEV. This looks fine to me.
> >
>
> This -ENODEV looks strange to me. Which device does not exist? The
> userspace were supposed to retry on this? It's not even -EAGAIN. Is it
> documented somewhere?

-ENODEV looks very reasonable if we consider a device in the NOT_OPER
state.

>
> If it's unavoidable (which I don't see why) I would prefer -EAGAIN. I
> think throwing an -ENODEV at our userspace once in a blue moon (if ever)
> because that is the way we 'handle' races in our code instead of avoiding
> them is not very friendly.
>
> And I'm not sure -EPERM is not possible (see my statement
> about reordering of the writes above).

I don't think the actual return code does matter in this case. User
space must be prepared for an error (and -ENODEV was even possible
before, see the discussion in the v3 thread.) We're dealing with a hard
to trigger corner case that is easily handled by user space here: let's
not overthink this.
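
For illustration only, the state gate discussed above can be modelled in
plain userspace C11. This is a rough sketch, not the vfio-ccw code: the
model_* names and the two-value state enum are made up for the example,
and C11 atomics stand in for the kernel primitives. The explicit
release/acquire pair is an assumption of the sketch; in the kernel the
ordering was argued to come from the mutex in mdev_register_device().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the driver states, not the kernel definitions. */
enum model_state { MODEL_NOT_OPER, MODEL_STANDBY };

struct model_private {
	atomic_int state;	/* gate checked first by the create path */
	atomic_int avail;	/* how many devices may still be created */
};

/* Start of probe: close the gate before create can become reachable. */
void model_probe_begin(struct model_private *p)
{
	atomic_init(&p->avail, 0);
	atomic_init(&p->state, MODEL_NOT_OPER);
}

/*
 * End of probe: finish the setup, then publish STANDBY with release
 * ordering so that a reader who sees STANDBY also sees avail == 1.
 */
void model_probe_finish(struct model_private *p)
{
	/* The sketch expects exactly one finish after each begin. */
	assert(atomic_load(&p->state) == MODEL_NOT_OPER);

	atomic_store_explicit(&p->avail, 1, memory_order_relaxed);
	atomic_store_explicit(&p->state, MODEL_STANDBY, memory_order_release);
}

/* Create path: fail early while the parent is still being set up. */
int model_create(struct model_private *p)
{
	int avail;

	if (atomic_load_explicit(&p->state, memory_order_acquire) ==
	    MODEL_NOT_OPER)
		return -ENODEV;

	/* Decrement avail only if it is positive, retrying on contention. */
	avail = atomic_load_explicit(&p->avail, memory_order_relaxed);
	while (avail > 0) {
		if (atomic_compare_exchange_weak_explicit(&p->avail, &avail,
							  avail - 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return 0;
	}

	return -EPERM;
}
```

In this model a create that races with probe gets -ENODEV, the first
create after probe succeeds, and a second one gets -EPERM. If both the
store to state and the load in the create path were relaxed instead, a
reader could in principle see STANDBY while avail still reads 0 and
return -EPERM, which is the reordering concern raised above. Either way
the caller only sees an error it has to be prepared for.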