Received: by 2002:a05:6a10:206:0:0:0:0 with SMTP id 6csp578077pxj; Fri, 28 May 2021 10:19:34 -0700 (PDT) X-Google-Smtp-Source: ABdhPJypqycwuW5RUst7fJ4b+Vl1dpFn9/QgmJNZWICuFpJ6cs+bSed1tHNAcF6WZLtsvlE/D2iU X-Received: by 2002:a05:6e02:4af:: with SMTP id e15mr7915675ils.131.1622222373950; Fri, 28 May 2021 10:19:33 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1622222373; cv=none; d=google.com; s=arc-20160816; b=iAGenj0anFr07GbKMC6/h8Gy9nCIcYyZRjouWn7aBeXHZI/n0tjZHRoxyBiZCN6cMZ JV2KWIHJRK25n5XRNh42ATWxYgeqqZ4K1nddCssCfQ7sHJnobuvOEJ+wQ84bvQ9/X58B lv+uWe0efOJYjRfOM3p9oBzXSXa53gSnl8Hl/gwKCN/sptdpdZL/qe+bqA1RJK8Sot2w V3kJ/4vhp/DnacU/KLvw/7Ggubhyb02B9PFmTrOG6C+Cd2dvJThexLwocH/z+625eIOv 4gIkGUnIK4nu4um6kLHk6MOXMcnHcnWTtYlb5B42fOyXKxwTdiDKTzYs0huY+kNhKVfH We4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:subject :organization:from:references:cc:to:dkim-signature; bh=hCpZDLzGfuvpzK9WauKlTe9JpEyhJUtHOj4EKePAcT8=; b=jFkwi+18NAlW5YBAxJISwCSNq5NZjsXHUeVrNrQImfaoFn3X1Ur4kjvf2mBeGPvdwp mCnLyMtjmYxCwpyP5hRFIF7vXTJKpXqvLZJa+jJokESYnIcTYUcGOAh6L53gFoE+k6zT I15hfjr1WwQ1MINbf/5HRu1ka0gzDXlE+4Nu/gBggltFwMfcIDKSvcLb7C3cET6TkIdU kYUYaGrSgZknleYekh8X/dQVy057JN6pEjL+E4dLz6kRig+xB4d0L9W/87orfg3A5ztu 7kWk8UAOYyX8diHMPa4QNbgGGh76jh8Z+p3efFGD8zDSvAzR1wclYOVOqsBF/Ps0wptQ b06A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=fZeAHGfa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id j23si6022099ioo.31.2021.05.28.10.19.21; Fri, 28 May 2021 10:19:33 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=fZeAHGfa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236880AbhE1Qxg (ORCPT + 99 others); Fri, 28 May 2021 12:53:36 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:42357 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236566AbhE1Qxa (ORCPT ); Fri, 28 May 2021 12:53:30 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622220714; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=hCpZDLzGfuvpzK9WauKlTe9JpEyhJUtHOj4EKePAcT8=; b=fZeAHGfagx2aWKDdolk2ZncZwcHwQ+fxnObAZQI+H8cLxN1/JlVfngKe4iXyHpJTYbhO/C BioPdT++PgGeP9l8KSzkVwCKkgUWI6hy3Tp6POwJMuCHun7kwAocplMe5HIj+R3zxQ16ti b8OVBa5wlf/lqTJ2yPVPL0UNyE4mPQw= Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com [209.85.208.70]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-446-ydqm8zFyNCCTDjzx_41t3w-1; Fri, 28 May 2021 12:51:52 -0400 X-MC-Unique: ydqm8zFyNCCTDjzx_41t3w-1 Received: by mail-ed1-f70.google.com with SMTP id q18-20020a50cc920000b029038cf491864cso2412377edi.14 for ; Fri, 28 May 2021 09:51:52 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:cc:references:from:organization:subject :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=hCpZDLzGfuvpzK9WauKlTe9JpEyhJUtHOj4EKePAcT8=; b=dgD3v+hYQuPmiX0xbYlMoi6DXU9NpI2/Eeb9/tLtSumNhUmqssRCTntTNf2PfHHxM0 7GhNUa0YlRL8XUWghgFyoiF8uSUAKzP3TaV6vz3zvemjWu6wn+RiypR32pQ4gF98cvrh TZaRXsvJFl3PVgP/HRB6h58kVfolOy7MejJtt/s1lkxPY+7+yh53EpaT/TEsFK39Qz4z FlmHcm1w3ebkjqyv+SjLl+yTp0OcpLkp+02XURXMdrNcaFthrWSkBfvp57BHBFVtsRfr kFan4d5IrWsY/i5+46KfkK5umTFcSryap1blu3WQX+KT+IZw0SGZVblT1iamEfPhbsSM 5JxQ== X-Gm-Message-State: AOAM5311OGRrzAkNvZaLlyvPtAw20tcaaqtx7drQy1uDdlpDS5EFnbZi kasByhIUHBMixD2CHyCMoehyKRhp5u57oJsszGf3jvFb/12g2HJlU1MNGgtWexZoxzrFNpL8F8O eRM2vA/fXPiEEK/M+7wgb2iMg X-Received: by 2002:a05:6402:22f1:: with SMTP id dn17mr10807140edb.286.1622220711650; Fri, 28 May 2021 09:51:51 -0700 (PDT) X-Received: by 2002:a05:6402:22f1:: with SMTP id dn17mr10807115edb.286.1622220711360; Fri, 28 May 2021 09:51:51 -0700 (PDT) Received: from [192.168.3.132] (p5b0c6870.dip0.t-ipconnect.de. [91.12.104.112]) by smtp.gmail.com with ESMTPSA id bu1sm2598573ejb.116.2021.05.28.09.51.50 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 28 May 2021 09:51:50 -0700 (PDT) To: Dan Williams Cc: Bjorn Helgaas , Greg KH , Arnd Bergmann , Ingo Molnar , Kees Cook , Matthew Wilcox , Russell King , Andrew Morton , Linux Kernel Mailing List , Linux MM , Linux PCI , Daniel Vetter , =?UTF-8?Q?Krzysztof_Wilczy=c5=84ski?= , Jason Gunthorpe , Christoph Hellwig References: <159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com> <20210527205845.GA1421476@bjorn-Precision-5520> <7eaf4c10-292e-18d7-e8ce-3a6b72122381@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v4] /dev/mem: Revoke mappings when a driver claims the region Message-ID: Date: Fri, 28 May 2021 18:51:50 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 28.05.21 18:42, Dan Williams wrote: > On Fri, May 28, 2021 at 1:58 AM David Hildenbrand wrote: >> >> On 27.05.21 23:30, Dan Williams wrote: >>> On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas wrote: >>>> >>>> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci] >>>> >>>> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote: >>>>> Close the hole of holding a mapping over kernel driver takeover event of >>>>> a given address range. >>>>> >>>>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges") >>>>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the >>>>> kernel against scenarios where a /dev/mem user tramples memory that a >>>>> kernel driver owns. However, this protection only prevents *new* read(), >>>>> write() and mmap() requests. Established mappings prior to the driver >>>>> calling request_mem_region() are left alone. >>>>> >>>>> Especially with persistent memory, and the core kernel metadata that is >>>>> stored there, there are plentiful scenarios for a /dev/mem user to >>>>> violate the expectations of the driver and cause amplified damage. >>>>> >>>>> Teach request_mem_region() to find and shoot down active /dev/mem >>>>> mappings that it believes it has successfully claimed for the exclusive >>>>> use of the driver. Effectively a driver call to request_mem_region() >>>>> becomes a hole-punch on the /dev/mem device. >>>> >>>> This idea of hole-punching /dev/mem has since been extended to PCI >>>> BARs via [1]. >>>> >>>> Correct me if I'm wrong: I think this means that if a user process has >>>> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests >>>> that region via pci_request_region() or similar, we punch holes in the >>>> the user process mmap. The driver might be happy, but my guess is the >>>> user starts seeing segmentation violations for no obvious reason and >>>> is not happy. >>>> >>>> Apart from the user process issue, the implementation of [1] is >>>> problematic for PCI because the mmappable sysfs attributes now depend >>>> on iomem_init_inode(), an fs_initcall, which means they can't be >>>> static attributes, which ultimately leads to races in creating them. >>> >>> See the comments in iomem_get_mapping(), and revoke_iomem(): >>> >>> /* >>> * Check that the initialization has completed. Losing the race >>> * is ok because it means drivers are claiming resources before >>> * the fs_initcall level of init and prevent iomem_get_mapping users >>> * from establishing mappings. >>> */ >>> >>> ...the observation being that it is ok for the revocation inode to >>> come on later in the boot process because userspace won't be able to >>> use the fs yet. So any missed calls to revoke_iomem() would fall back >>> to userspace just seeing the resource busy in the first instance. I.e. >>> through the normal devmem_is_allowed() exclusion. >>> >>>> >>>> So I'm raising the question of whether this hole-punch is the right >>>> strategy. >>>> >>>> - Prior to revoke_iomem(), __request_region() was very >>>> self-contained and really only depended on the resource tree. Now >>>> it depends on a lot of higher-level MM machinery to shoot down >>>> mappings of other tasks. This adds quite a bit of complexity and >>>> some new ordering constraints. >>>> >>>> - Punching holes in the address space of an existing process seems >>>> unfriendly. Maybe the driver's __request_region() should fail >>>> instead, since the driver should be prepared to handle failure >>>> there anyway. >>> >>> It's prepared to handle failure, but in this case it is dealing with a >>> root user of 2 minds. >>> >>>> >>>> - [2] suggests that the hole punch protects drivers from /dev/mem >>>> writers, especially with persistent memory. I'm not really >>>> convinced. The hole punch does nothing to prevent a user process >>>> from mmapping and corrupting something before the driver loads. >>> >>> The motivation for this was a case that was swapping between /dev/mem >>> access and /dev/pmem0 access and they forgot to stop using /dev/mem >>> when they switched to /dev/pmem0. If root wants to use /dev/mem it can >>> use it, if root wants to stop the driver from loading it can set >>> mopdrobe policy or manually unbind, and if root asks the kernel to >>> load the driver while it is actively using /dev/mem something has to >>> give. Given root has other options to stop a driver the decision to >>> revoke userspace access when root messes up and causes a collision >>> seems prudent to me. >>> >> >> Is there a real use case for mapping pmem via /dev/mem or could we just >> prohibit the access to these areas completely? > > The kernel offers conflicting access to iomem resources and a > long-standing mechanism to enforce mutual exclusion > (CONFIG_IO_STRICT_DEVMEM) between those interfaces. That mechanism was > found to be incomplete for the case where a /dev/mem mapping is > maintained after a kernel driver is attached, and incomplete for other > mechanisms to map iomem like pci-sysfs. This was found with PMEM, but > the issue is larger and applies to userspace drivers / debug in > general. > >> What's the use case for "swapping between /dev/mem access and /dev/pmem0 >> access" ? > > "Who knows". I mean, I know in this case it was a platform validation > test using /dev/mem for "reasons", but I am not sure that is relevant > to the wider concern. If CONFIG_IO_STRICT_DEVMEM=n exclusion is > enforced when drivers pass the IORESOURCE_EXCLUSIVE flag, if > CONFIG_IO_STRICT_DEVMEM=y exclusion is enforced whenever the kernel > marks a resource IORESOURCE_BUSY, and if kernel lockdown is enabled > the driver state is moot as LOCKDOWN_DEV_MEM and LOCKDOWN_PCI_ACCESS > policy is in effect. > I was thinking about a mechanism to permanently disallow /dev/mem access to specific memory regions (BUSY or not) in any /dev/mem mode. In my case, it would apply to the whole virtio-mem provided memory region. Once the driver is loaded, it would disallow access to the whole region. I thought about doing it via the kernel resource tree, extending the EXCLUSIVE flag to !BUSY SYSRAM regions. But a simplistic list managed in /dev/mem code would also be possible. That's why I wondered if we could just disallow access to these physical PMEM memory regions right from the start similarly, such that we don't have to really care about revoking in case of PMEM anymore. -- Thanks, David / dhildenb