Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp902693pxb; Tue, 1 Feb 2022 12:49:07 -0800 (PST) X-Google-Smtp-Source: ABdhPJzf4hGz82R/+FdsRwPsFjKOXX6uWPk+Q7H2pgALynGskk6Vpn/lxTt1hrMqA5NHGVlp6VDx X-Received: by 2002:a17:90a:8409:: with SMTP id j9mr4352134pjn.101.1643748547158; Tue, 01 Feb 2022 12:49:07 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1643748547; cv=none; d=google.com; s=arc-20160816; b=d2E+WUypstYi7oGwMCDIGYWUXxa0Whjqn5auqLrEbQUz4h8AGUSw1MPTq6tIT9AkNW 9l67UImmrZOvYBHn1SOJ4FgLQRTqeV6+D/gicKMjA2KJsKqS7PNUvo0JRYM2mcBtt5dd CFQOmM0GLn3OqF50D625o0XD2YLKO+HofJw/oVAvwQDVTixlTVWeleL17VOB4QhD1VDf L7m3yqVj8vPUUahL93brLOZ+YcHs/lI8WNXSjrqiyBKE/c/byi1Dq7QLNPk9FxysWmEe RMjUu1XyHl5tV4WjpbY3MzkVlcDxCk35E5UGrK28C9zIrUi0fkU0ToylYxLKIGZGPI/7 M3VQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date :dkim-signature; bh=o8rYk6jbX3z7S18LDcnIJR9gFkPNH5yqGudlMFBgYTo=; b=ikaBCPsWi5UuffgB1KVMg0bOCXgKRm5gn8PUKBjIhbrtBHIWM5x03jv3Mq3I2Izn0k GvTtwSVCJCtEuPhHcpMiRQK6fCU7IBB9sWx0X6aMO8aSeITsvGobe8KlpOqHcXgnV6Nd c10HuL6shepxWfGWAzJ+rIcW6RIyPqJYDL7+Ui70RCoxwNOPOUHylAaVJ+60cF2XCnKQ XQiZ5+LsScF1yDNGaB/y1sGV8AIOPfkxTGRtAtOLphbOjSTzI6OEdM+u32upzSfWXlZQ ZF2xE6102T7XIIujv4DEJ1HalO/U3tzL52eM+3vPvA02+EnxZOCqN66ji+4JGMPdPCTW kE5Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Ctr3Scwt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ik28si15078848plb.65.2022.02.01.12.48.55; Tue, 01 Feb 2022 12:49:07 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Ctr3Scwt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231500AbiAaUL5 (ORCPT + 99 others); Mon, 31 Jan 2022 15:11:57 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:42351 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229781AbiAaULz (ORCPT ); Mon, 31 Jan 2022 15:11:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1643659915; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=o8rYk6jbX3z7S18LDcnIJR9gFkPNH5yqGudlMFBgYTo=; b=Ctr3ScwtdF7dLeoSIHfA+t1WTXPVbTU3gPeB26bKsclePuOBqR4GERkx/B4/0+60UvU7rh onpvU4FVngBMiMtZYFAJA7FrFw5KrkSLZap9cS7G67w4SnYx2lR2/SosVEK1R64y1sa0IC 4QlQsWdtkJWFFCTJB5WexBdKEvgo+I8= Received: from mail-oi1-f198.google.com (mail-oi1-f198.google.com [209.85.167.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-119-an-qqbCdPli2_ViWAIFsMg-1; Mon, 31 Jan 2022 15:11:53 -0500 X-MC-Unique: an-qqbCdPli2_ViWAIFsMg-1 Received: by mail-oi1-f198.google.com with SMTP id bl33-20020a05680830a100b002cf47784c5bso7272165oib.3 for ; Mon, 31 Jan 2022 12:11:53 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=o8rYk6jbX3z7S18LDcnIJR9gFkPNH5yqGudlMFBgYTo=; b=GNVC5wSm45fqFH3atnP7FG3kl2UanAQZ82SZ2osbFy3iCisaelVc3TF/WLPyZPVa1I u4trbW6ocL1CnxKLqld3sH1ObYla7yMg86mPCF6IzADXiNRv3liluRystnAZMzx+o1zy webWvPjmnFE4ed+mtLy4gST79jU0GWilt0j44HpLiwhZq7IvA0T7W3tav0rwv6o7SRZi gMulamVWmuHsJRSUO2lHjSlx8c8aQ6JnKP7T4AkOn0oTluZp28CZb+AQjr6RNHjin1/e WOhDNjlIEjXQcgXHZR0KdAMAVql9WwzBR8h4WWXD95V583cqH59b1leKQUYu/c4loapz BFjg== X-Gm-Message-State: AOAM533QMdGL/dN2+LV5EZ3vxl8rjOlkPNsMMC0+SHwrNWbjs4VG/v1C tobSGhmUnhgdcqlthj9mE8xqfdDEtEj6zxcGIAU+XA+B/H29bmkGl+CGjC9l2N/NQN4WeRE7DuW ztnK2as03TxyucblvQ1rW/E1u X-Received: by 2002:a05:6808:170c:: with SMTP id bc12mr18203845oib.171.1643659912772; Mon, 31 Jan 2022 12:11:52 -0800 (PST) X-Received: by 2002:a05:6808:170c:: with SMTP id bc12mr18203833oib.171.1643659912516; Mon, 31 Jan 2022 12:11:52 -0800 (PST) Received: from redhat.com ([38.15.36.239]) by smtp.gmail.com with ESMTPSA id n67sm9584822oib.31.2022.01.31.12.11.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 31 Jan 2022 12:11:52 -0800 (PST) Date: Mon, 31 Jan 2022 13:11:51 -0700 From: Alex Williamson To: Abhishek Sahu Cc: , Cornelia Huck , Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , Subject: Re: [PATCH v2] vfio/pci: fix memory leak during D3hot to D0 transition Message-ID: <20220131131151.4f113557.alex.williamson@redhat.com> In-Reply-To: <20220131112450.3550-1-abhsahu@nvidia.com> References: <20220131112450.3550-1-abhsahu@nvidia.com> X-Mailer: Claws Mail 3.18.0 (GTK+ 2.24.33; x86_64-redhat-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 31 Jan 2022 16:54:50 +0530 Abhishek Sahu wrote: > If needs_pm_restore is set (PCI device does not have support for no > soft reset), then the current PCI state will be saved during D0->D3hot > transition and same will be restored back during D3hot->D0 transition. > For saving the PCI state locally, pci_store_saved_state() is being > used and the pci_load_and_free_saved_state() will free the allocated > memory. > > But for reset related IOCTLs, vfio driver calls PCI reset related > API's which will internally change the PCI power state back to D0. So, > when the guest resumes, then it will get the current state as D0 and it > will skip the call to vfio_pci_set_power_state() for changing the > power state to D0 explicitly. In this case, the memory pointed by > pm_save will never be freed. In a malicious sequence, the state changing > to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be > run in a loop and it can cause an OOM situation. > > Also, pci_pm_reset() returns -EINVAL if we try to reset a device that > isn't currently in D0. Therefore any path where we're triggering a > function reset that could use a PM reset and we don't know if the device > is in D0, should wake up the device before we try that reset. > > This patch changes the device power state to D0 by invoking > vfio_pci_set_power_state() before calling reset related API's. > It will help in fixing the mentioned memory leak and making sure > that the device is in D0 during reset. Also, to prevent any similar > memory leak for future development, this patch frees memory first > before overwriting 'pm_save'. > > Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition") > Signed-off-by: Abhishek Sahu > --- > > * Changes in v2 > > - Add the Fixes tag and sent this patch independently. > - Invoke vfio_pci_set_power_state() before invoking reset related API's. > - Removed saving of power state locally. > - Removed warning before 'kfree(vdev->pm_save)'. > - Updated comments and commit message according to updated changes. > > * v1 of this patch was sent in > https://lore.kernel.org/lkml/20220124181726.19174-4-abhsahu@nvidia.com/ > > drivers/vfio/pci/vfio_pci_core.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index f948e6cd2993..d6dd4f7c4b2c 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -228,6 +228,13 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > if (!ret) { > /* D3 might be unsupported via quirk, skip unless in D3 */ > if (needs_save && pdev->current_state >= PCI_D3hot) { > + /* > + * If somehow, the vfio driver was not able to free the > + * memory allocated in pm_save, then free the earlier > + * memory first before overwriting pm_save to prevent > + * memory leak. > + */ > + kfree(vdev->pm_save); > vdev->pm_save = pci_store_saved_state(pdev); > } else if (needs_restore) { > pci_load_and_free_saved_state(pdev, &vdev->pm_save); > @@ -322,6 +329,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > /* For needs_reset */ > lockdep_assert_held(&vdev->vdev.dev_set->lock); > > + /* > + * This function can be invoked while the power state is non-D0, > + * Change the device power state to D0 first. I think we need to describe more why we're doing this than what we're doing. We need to make sure the device is in D0 in case we have a reset method that depends on that directly, ex. pci_pm_reset(), or possibly device specific resets that may access device BAR resources. I think it's placed here in the function so that the config space changes below aren't overwritten by restoring the saved state and maybe also because the set_irqs_ioctl() call might access device MMIO space. > + */ > + vfio_pci_set_power_state(vdev, PCI_D0); > + > /* Stop the device from further DMA */ > pci_clear_master(pdev); > > @@ -921,6 +934,13 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, > return -EINVAL; > > vfio_pci_zap_and_down_write_memory_lock(vdev); > + > + /* > + * This function can be invoked while the power state is non-D0, > + * Change the device power state to D0 before doing reset. > + */ See below, reconsidering this... > + vfio_pci_set_power_state(vdev, PCI_D0); > + > ret = pci_try_reset_function(vdev->pdev); > up_write(&vdev->memory_lock); > > @@ -2055,6 +2075,13 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > } > cur_mem = NULL; > > + /* > + * This function can be invoked while the power state is non-D0. > + * Change power state of all devices to D0 before doing reset. > + */ Here I have trouble convincing myself exactly what we're doing. As you note in patch 1/ of the RFC series, pci_reset_bus(), or more precisely pci_dev_save_and_disable(), wakes the device to D0 before reset, so we can't be doing this only to get the device into D0. The function level resets do the same. Actually, now I'm remembering and debugging where I got myself confused previously with pci_pm_reset(). The scenario was a Windows guest with an assigned Intel 82574L NIC. When doing a shutdown from the guest the device is placed in D3hot and we enter vfio_pci_core_disable() in that state. That function however uses __pci_reset_function_locked(), which skips the pci_dev_save_and_disable() since much of it is redundant for that call path (I think I generalized this to all flavors of pci_reset_function() in my head). The standard call to pci_try_reset_function(), as in the previous chunk, will make use of pci_dev_save_and_disable(), so for either of these latter cases the concern cannot be simply having the device in D0, we need a reason that we want the previously saved state restored on the device before the reset, and thus restored to the device after the reset as the rationale for the change. > + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) > + vfio_pci_set_power_state(cur, PCI_D0); > + > ret = pci_reset_bus(pdev); > > err_undo: We also call pci_reset_bus() in vfio_pci_dev_set_try_reset(). In that case, none of the other devices can be in use by the user, but they can certainly be in D3hot with previous device state saved off into our pm_save cache. If we don't have a good reason to restore in that case, I'm wondering if we really have a good reason to restore in the above two cases. Perhaps we just need the first chunk above to resolve the memory leak, and the second chunk as a separate patch to resolve the issue with devices entering vfio_pci_core_disable() in non-D0 state. Sorry if I mislead you that we had a more widespread issue with pci_pm_reset(), it wasn't clear in my head until now that it was only the __pci_reset_function_locked() caller that had the issue. Thanks, Alex