Received: by 2002:a05:6a11:4021:0:0:0:0 with SMTP id ky33csp920488pxb; Wed, 29 Sep 2021 12:32:13 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxMuaj4EkipBMZfTWda37mthfO37WsK0/aJ7EkymYWOODecctPrFIXUqsKNAqf2msCVZWnR X-Received: by 2002:a17:906:8a6e:: with SMTP id hy14mr1847902ejc.366.1632943933370; Wed, 29 Sep 2021 12:32:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1632943933; cv=none; d=google.com; s=arc-20160816; b=LKZ0UWTmcy7+lsr/tLRf9rYqz0lVgqahF9q/fXVIvgcn5DRfw2idf//XJY9mKRF+iD BQMqBuPJajaCwmyZ8eiD9ORozgMOqhXTn4PNCw9W3vxAub9yX9vleD7OyEyyqDvfy/mx nKQuvuGt9LMkDucvbe9hzi7gJ6NwfvcS4XiyZxoyatCRbNeh1fIKP6sEYNOqS/iDyJCk tOj9NH8sbQxVBGMn6LcQMZr9B1lYx5wigPyAhOWzTj2IT4K2tQTynSO82/WdvkmV68hz pWOk815pNiMs4oXnu1EfaZXKZC1iIxMXeHnZLoBfLWWGmk/pCVTHyTgGItDnPtK3Y73R 2BMA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date :dkim-signature; bh=5fEUwAikCXlY0hPFc3Zca5cd4O3adbLSONrps+zhumI=; b=mpyuI0OD33od5jxtS6PiDphxtvFjpTMS9IumvnRXmS0De4Qu/HVQQ1B+Lj7v4hxEcJ 0BJDJfjeHk2pAIXRxvOfjn22b8u3aka6cl2/N/iuk9mlkg5TQizAPgSI6hmH3vmiqga/ eHZoURcelA3cnmWuhJcm39+pfWkJMU3aL1KJ1UFeLDlOhNFF4SM3Vuc1vbv61ZNnBmli vxBTZvUAMrwXXbxXvW2aGjIwI5qB9DVWNKj221Yixot8Orv3aejdTNL38kO3/UQyyEMj uvye2jwS/DzQDCsG+lTWvIYhNkrKz5X4uTUr9I0TNNJH4HPKRZILUDy8DgkIiDCqsSJm C7Mw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=BJRL6SjO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id er10si800989ejc.675.2021.09.29.12.31.49; Wed, 29 Sep 2021 12:32:13 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=BJRL6SjO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345239AbhI2SIm (ORCPT + 99 others); Wed, 29 Sep 2021 14:08:42 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:31841 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344935AbhI2SIl (ORCPT ); Wed, 29 Sep 2021 14:08:41 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1632938820; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5fEUwAikCXlY0hPFc3Zca5cd4O3adbLSONrps+zhumI=; b=BJRL6SjOCTRVgliVkMU89HH0FQjkLTfEaec/JDHfaAvchXkqS3R9SyfL0eJeVhF3dJfWtA +nkXp8tPJZ1/4+oLycZ8/yXxhjTMSkKaffJDvR4voq2pgPS7eR/9LppsmhGQ+lyG/b5wAR VnCHp+1eNEVwFei04iqQJj9p3gg5Fzc= Received: from mail-ot1-f71.google.com (mail-ot1-f71.google.com [209.85.210.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-166-nPzfmv0tOXiCYiRbkQMnkg-1; Wed, 29 Sep 2021 14:06:58 -0400 X-MC-Unique: nPzfmv0tOXiCYiRbkQMnkg-1 Received: by mail-ot1-f71.google.com with SMTP id x27-20020a9d705b000000b005470c0ed087so2456533otj.10 for ; Wed, 29 Sep 2021 11:06:58 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=5fEUwAikCXlY0hPFc3Zca5cd4O3adbLSONrps+zhumI=; b=6vguwE6THJffn4Aunek5+Pol4IqwRA1kA1QgQuZuX+UVsDYeZKxIBQQNNIfRFhO6JR KgKCjtU9/hd5mt3BWxuFMGjZXhs0GBfWDnFQOLbCbwy4vgVtfhZVLdEcAUKV+2bWfThW 3zNslXFLvBmoXp+1WfoqmeTZkbLwNlZy8D0LtN+TkCt684Hflz747SchnhsSY0EQWVfT pt47l8X27ZWDm/X/69J9Gh1PcYUk9Xq+6a2NL8Ki49JIOB5EqipsNL1OCCSK+T0X4Rf4 G7QhOq/ZQaYP5UEaLKzL6oFCvh3ommRZAH/zHgNSfzeDyPw+4BjLrxMzGmmPVuVA+S7+ g/dA== X-Gm-Message-State: AOAM533TUMnuQtftKV42RQdInfk24G4D9JZGmJtV96BBQyGVEXZ/nquq 7kmDwnrk/5lS+1BAn+5MpG8nc5JRTjTZ0ORS4TpMcVTefgf34KLoJmPfbCKNaKF1G1Z2KLMhnAj 0OX9ClGj+9RfCjTb0CY6FHlFD X-Received: by 2002:a05:6808:243:: with SMTP id m3mr9254417oie.54.1632938817868; Wed, 29 Sep 2021 11:06:57 -0700 (PDT) X-Received: by 2002:a05:6808:243:: with SMTP id m3mr9254391oie.54.1632938817587; Wed, 29 Sep 2021 11:06:57 -0700 (PDT) Received: from redhat.com ([198.99.80.109]) by smtp.gmail.com with ESMTPSA id k1sm97828otr.43.2021.09.29.11.06.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Sep 2021 11:06:57 -0700 (PDT) Date: Wed, 29 Sep 2021 12:06:55 -0600 From: Alex Williamson To: Jason Gunthorpe Cc: Leon Romanovsky , Doug Ledford , Yishai Hadas , Bjorn Helgaas , "David S. Miller" , Jakub Kicinski , Kirti Wankhede , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Saeed Mahameed , Cornelia Huck Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Message-ID: <20210929120655.28e0b3a6.alex.williamson@redhat.com> In-Reply-To: <20210929161602.GA1805739@nvidia.com> References: <20210927164648.1e2d49ac.alex.williamson@redhat.com> <20210927231239.GE3544071@ziepe.ca> <20210928131958.61b3abec.alex.williamson@redhat.com> <20210928193550.GR3544071@ziepe.ca> <20210928141844.15cea787.alex.williamson@redhat.com> <20210929161602.GA1805739@nvidia.com> X-Mailer: Claws Mail 3.18.0 (GTK+ 2.24.33; x86_64-redhat-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 29 Sep 2021 13:16:02 -0300 Jason Gunthorpe wrote: > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote: > > On Tue, 28 Sep 2021 16:35:50 -0300 > > Jason Gunthorpe wrote: > > > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote: > > > > > > > In defining the device state, we tried to steer away from defining it > > > > in terms of the QEMU migration API, but rather as a set of controls > > > > that could be used to support that API to leave us some degree of > > > > independence that QEMU implementation might evolve. > > > > > > That is certainly a different perspective, it would have been > > > better to not express this idea as a FSM in that case... > > > > > > So each state in mlx5vf_pci_set_device_state() should call the correct > > > combination of (un)freeze, (un)quiesce and so on so each state > > > reflects a defined operation of the device? > > > > I'd expect so, for instance the implementation of entering the _STOP > > state presumes a previous state that where the device is apparently > > already quiesced. That doesn't support a direct _RUNNING -> _STOP > > transition where I argued in the linked threads that those states > > should be reachable from any other state. Thanks, > > If we focus on mlx5 there are two device 'flags' to manage: > - Device cannot issue DMAs > - Device internal state cannot change (ie cannot receive DMAs) > > This is necessary to co-ordinate across multiple devices that might be > doing peer to peer DMA between them. The whole multi-device complex > should be moved to "cannot issue DMA's" then the whole complex would > go to "state cannot change" and be serialized. Are you anticipating p2p from outside the VM? The typical scenario here would be that p2p occurs only intra-VM, so all the devices would stop issuing DMA (modulo trying to quiesce devices simultaneously). > The expected sequence at the device is thus > > Resuming > full stop -> does not issue DMAs -> full operation > Suspend > full operation -> does not issue DMAs -> full stop > > Further the device has two actions > - Trigger serializating the device state > - Trigger de-serializing the device state > > So, what is the behavior upon each state: > > * 000b => Device Stopped, not saving or resuming > Does not issue DMAs > Internal state cannot change > > * 001b => Device running, which is the default state > Neither flags > > * 010b => Stop the device & save the device state, stop-and-copy state > Does not issue DMAs > Internal state cannot change > > * 011b => Device running and save the device state, pre-copy state > Neither flags > (future, DMA tracking turned on) > > * 100b => Device stopped and the device state is resuming > Does not issue DMAs > Internal state cannot change cannot change... other that as loaded via migration region. > > * 110b => Error state > ??? > > * 101b => Invalid state > * 111b => Invalid state > > ??? > > What should the ??'s be? It looks like mlx5 doesn't use these, so it > should just refuse to enter these states in the first place.. _SAVING and _RESUMING are considered mutually exclusive, therefore any combination of both of them is invalid. We've chosen to use the combination of 110b as an error state to indicate the device state is undefined, but not _RUNNING. This state is only reachable by an internal error of the driver during a state transition. The expected protocol is that if the user write to the device_state register returns an errno, the user reevaluates the device_state to determine if the desired transition is unavailable (previous state value is returned) or generated a fault (error state value returned). Due to the undefined state of the device, the only exit from the error state is to re-initialize the device state via a reset. Therefore a successful device reset should always return the device to the 001b state. The 111b state is also considered unreachable through normal means due to the _SAVING | _RESUMING conflict, but suggests the device is also _RUNNING in this undefined state. This combination has no currently defined use case and should not be reachable. The 101b state indicates _RUNNING while _RESUMING, which is simply not a mode that has been spec'd at this time as it would require some mechanism for the device to fault in state on demand. > The two actions: > trigger serializing the device state > Done when asked to go to 010b ? When the _SAVING bit is set. The exact mechanics depends on the size and volatility of the device state. A GPU might begin in pre-copy (011b) to transmit chunks of framebuffer data, recording hashes of blocks read by the user to avoid re-sending them during the stop-and-copy (010b) phase. A device with a small internal state representation may choose to forgo providing data in the pre-copy phase and entirely serialize internal state at stop-and-copy. > trigger de-serializing the device state > Done when transition from 100b -> 000b ? 100b -> 000b is not a required transition, generally this would be 100b -> 001b, ie. end state of _RUNNING vs _STOP. I think the requirement is that de-serialization is complete when the _RESUMING bit is cleared. Whether the driver chooses to de-serialize piece-wise as each block of data is written to the device or in bulk from a buffer is left to the implementation. In either case, the driver can fail the transition to !_RESUMING if the state is incomplete or otherwise corrupt. It would again be the driver's discretion if the device enters the error state or remains in _RESUMING. If the user has no valid state with which to exit the _RESUMING phase, a device reset should return the device to _RUNNING with a default initial state. > There is a missing state "Stop Active Transactions" which would be > only "does not issue DMAs". I've seen a proposal to add that. This would be to get all devices to stop issuing DMA while internal state can be modified to avoid the synchronization issue of trying to stop devices concurrently? For PCI devices we obviously have the bus master bit to manage that, but I could see how a migration extension for such support (perhaps even just wired through to BM for PCI) could be useful. Thanks, Alex