2021-08-16 14:26:21

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 0/3] virtio-mem: disallow mapping virtio-mem memory via /dev/mem

Let's add the basic infrastructure to exclude some physical memory
regions completely from /dev/mem access, on any architecture and under
any system configuration (independent of CONFIG_STRICT_DEVMEM and
independent of "iomem=").

Use it for virtio-mem, to disallow mapping any virtio-mem memory via
/dev/mem to user space after the virtio-mem driver was loaded: there is
no sane use case to access the device-managed memory region via /dev/mem
once the driver is actively (un)plugging memory within that region and
we want to make sure that nobody will accidentially access unplugged
memory in a sane environment.

Details can be found in patch #1.

v1 -> v2:
- "/dev/mem: disallow access to explicitly excluded system RAM regions"
-- Introduce and use for_each_resource() and next_resource_skip_children()
-- s/iomem_range_contains_excluded/iomem_range_contains_excluded_devmem/
- "kernel/resource: cleanup and optimize iomem_is_exclusive()"
-- Use for_each_resource()

Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: [email protected]
Cc: [email protected]

David Hildenbrand (3):
/dev/mem: disallow access to explicitly excluded system RAM regions
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: cleanup and optimize iomem_is_exclusive()

drivers/char/mem.c | 22 ++++++--------
drivers/virtio/virtio_mem.c | 4 ++-
include/linux/ioport.h | 1 +
kernel/resource.c | 60 +++++++++++++++++++++++++++++++++----
lib/Kconfig.debug | 4 ++-
5 files changed, 71 insertions(+), 20 deletions(-)


base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
--
2.31.1


2021-08-16 14:26:28

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible. On the one hand, the
virtio-mem driver adds/removes whole memory blocks, creating/removing busy
IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
memory inside added memory blocks, dynamically either exposing them to
the buddy or hiding them from the buddy and marking them PG_offline.

virtio-mem wants to make sure that in a sane environment, nobody
"accidentially" accesses unplugged memory inside the device managed
region. After /proc/kcore has been sanitized and /dev/kmem has been
removed, /dev/mem is the remaining interface that still allows uncontrolled
access to the device-managed region of virtio-mem devices from user
space.

There is no known sane use case for mapping virtio-mem device memory
via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
that region. So once the driver was loaded and detected the device
along the device-managed region, we just want to disallow any access via
/dev/mem to it.

Let's add the basic infrastructure to exclude some physical memory
regions completely from /dev/mem access, on any architecture and under
any system configuration (independent of CONFIG_STRICT_DEVMEM and
independent of "iomem=").

Any range marked with "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE"
will be excluded, even if not busy. For now, there are no applicable
ranges and we'll modify virtio-mem next to properly set
IORESOURCE_EXCLUSIVE on the parent resource.

As next_resource() will iterate over children although we might want to
skip a certain range completely, let's add and use
next_range_skip_children() and for_each_resource(), to optimize that case,
avoding having to traverse subtrees that are not of interest.

Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/char/mem.c | 22 ++++++++------------
include/linux/ioport.h | 1 +
kernel/resource.c | 47 ++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 4 +++-
4 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1c596b5cdb27..1829dc6a1f29 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -60,13 +60,18 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
}
#endif

-#ifdef CONFIG_STRICT_DEVMEM
static inline int page_is_allowed(unsigned long pfn)
{
- return devmem_is_allowed(pfn);
+#ifdef CONFIG_STRICT_DEVMEM
+ if (!devmem_is_allowed(pfn))
+ return 0;
+#endif /* CONFIG_STRICT_DEVMEM */
+ return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), PAGE_SIZE);
}
+
static inline int range_is_allowed(unsigned long pfn, unsigned long size)
{
+#ifdef CONFIG_STRICT_DEVMEM
u64 from = ((u64)pfn) << PAGE_SHIFT;
u64 to = from + size;
u64 cursor = from;
@@ -77,18 +82,9 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
cursor += PAGE_SIZE;
pfn++;
}
- return 1;
-}
-#else
-static inline int page_is_allowed(unsigned long pfn)
-{
- return 1;
-}
-static inline int range_is_allowed(unsigned long pfn, unsigned long size)
-{
- return 1;
+#endif /* CONFIG_STRICT_DEVMEM */
+ return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), size);
}
-#endif

#ifndef unxlate_dev_mem_ptr
#define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 8359c50f9988..d31f83281327 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -308,6 +308,7 @@ extern struct resource * __devm_request_region(struct device *dev,
extern void __devm_release_region(struct device *dev, struct resource *parent,
resource_size_t start, resource_size_t n);
extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size);
+extern bool iomem_range_contains_excluded_devmem(u64 addr, u64 size);
extern bool iomem_is_exclusive(u64 addr);

extern int
diff --git a/kernel/resource.c b/kernel/resource.c
index ca9f5198a01f..f57a14617c49 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -73,6 +73,18 @@ static struct resource *next_resource(struct resource *p)
return p->sibling;
}

+static struct resource *next_resource_skip_children(struct resource *p)
+{
+ while (!p->sibling && p->parent)
+ p = p->parent;
+ return p->sibling;
+}
+
+#define for_each_resource(_root, _p, _skip_children) \
+ for ((_p) = (_root)->child; (_p); \
+ (_p) = (_skip_children) ? next_resource_skip_children(_p) : \
+ next_resource(_p))
+
static void *r_next(struct seq_file *m, void *v, loff_t *pos)
{
struct resource *p = v;
@@ -1700,6 +1712,41 @@ int iomem_map_sanity_check(resource_size_t addr, unsigned long size)
return err;
}

+/*
+ * Check if a physical memory range is completely excluded from getting
+ * mapped/accessed via /dev/mem.
+ */
+bool iomem_range_contains_excluded_devmem(u64 addr, u64 size)
+{
+ const unsigned int flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE;
+ bool skip_children = false, excluded = false;
+ struct resource *p;
+
+ read_lock(&resource_lock);
+ for_each_resource(&iomem_resource, p, skip_children) {
+ if (p->start >= addr + size)
+ break;
+ if (p->end < addr) {
+ skip_children = true;
+ continue;
+ }
+ skip_children = false;
+
+ /*
+ * A system RAM resource is excluded if IORESOURCE_EXCLUSIVE
+ * is set, even if not busy and even if we don't have strict
+ * checks enabled -- no ifs or buts.
+ */
+ if ((p->flags & flags) == flags) {
+ excluded = true;
+ break;
+ }
+ }
+ read_unlock(&resource_lock);
+
+ return excluded;
+}
+
#ifdef CONFIG_STRICT_DEVMEM
static int strict_iomem_checks = 1;
#else
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5ddd575159fb..d0ce6e23a6db 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1780,7 +1780,9 @@ config STRICT_DEVMEM
access to this is obviously disastrous, but specific access can
be used by people debugging the kernel. Note that with PAT support
enabled, even in this case there are restrictions on /dev/mem
- use due to the cache aliasing requirements.
+ use due to the cache aliasing requirements. Further, some drivers
+ will still restrict access to some physical memory regions either
+ already used or to be used in the future as system RAM.

If this option is switched on, and IO_STRICT_DEVMEM=n, the /dev/mem
file only allows userspace access to PCI space and the BIOS code and
--
2.31.1

2021-08-16 14:26:45

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 2/3] virtio-mem: disallow mapping virtio-mem memory via /dev/mem

By creating our parent IORESOURCE_SYSTEM_RAM resource with
IORESOURCE_EXCLUSIVE, we will disallow any /dev/mem access to our
device-managed region.

Note that access to the region would still be possible if someone simply
doesn't load the virtio-mem driver; however, there is no way of
protecting against someone that just wants to do nasty things.

Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/virtio/virtio_mem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 09ed55de07d7..c8f914700a42 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2516,8 +2516,10 @@ static int virtio_mem_create_resource(struct virtio_mem *vm)
if (!name)
return -ENOMEM;

+ /* Disallow mapping device memory via /dev/mem completely. */
vm->parent_resource = __request_mem_region(vm->addr, vm->region_size,
- name, IORESOURCE_SYSTEM_RAM);
+ name, IORESOURCE_SYSTEM_RAM |
+ IORESOURCE_EXCLUSIVE);
if (!vm->parent_resource) {
kfree(name);
dev_warn(&vm->vdev->dev, "could not reserve device region\n");
--
2.31.1

2021-08-16 14:26:56

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 3/3] kernel/resource: cleanup and optimize iomem_is_exclusive()

Let's clean it up a bit, reusing for_each_resource() and avoiding
traversing subtrees we are not interested in.

Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/resource.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index f57a14617c49..7e59b57afa56 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1759,10 +1759,9 @@ static int strict_iomem_checks;
*/
bool iomem_is_exclusive(u64 addr)
{
- struct resource *p = &iomem_resource;
- bool err = false;
- loff_t l;
+ bool skip_children = false, err = false;
int size = PAGE_SIZE;
+ struct resource *p;

if (!strict_iomem_checks)
return false;
@@ -1770,15 +1769,19 @@ bool iomem_is_exclusive(u64 addr)
addr = addr & PAGE_MASK;

read_lock(&resource_lock);
- for (p = p->child; p ; p = r_next(NULL, p, &l)) {
+ for_each_resource(&iomem_resource, p, skip_children) {
/*
* We can probably skip the resources without
* IORESOURCE_IO attribute?
*/
if (p->start >= addr + size)
break;
- if (p->end < addr)
+ if (p->end < addr) {
+ skip_children = true;
continue;
+ }
+ skip_children = false;
+
/*
* A resource is exclusive if IORESOURCE_EXCLUSIVE is set
* or CONFIG_IO_STRICT_DEVMEM is enabled and the
--
2.31.1

2021-08-23 19:18:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] virtio-mem: disallow mapping virtio-mem memory via /dev/mem

On 16.08.21 16:25, David Hildenbrand wrote:
> Let's add the basic infrastructure to exclude some physical memory
> regions completely from /dev/mem access, on any architecture and under
> any system configuration (independent of CONFIG_STRICT_DEVMEM and
> independent of "iomem=").
>
> Use it for virtio-mem, to disallow mapping any virtio-mem memory via
> /dev/mem to user space after the virtio-mem driver was loaded: there is
> no sane use case to access the device-managed memory region via /dev/mem
> once the driver is actively (un)plugging memory within that region and
> we want to make sure that nobody will accidentially access unplugged
> memory in a sane environment.
>
> Details can be found in patch #1.
>
> v1 -> v2:
> - "/dev/mem: disallow access to explicitly excluded system RAM regions"
> -- Introduce and use for_each_resource() and next_resource_skip_children()
> -- s/iomem_range_contains_excluded/iomem_range_contains_excluded_devmem/
> - "kernel/resource: cleanup and optimize iomem_is_exclusive()"
> -- Use for_each_resource()
>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Hanjun Guo <[email protected]>
> Cc: Andy Shevchenko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> David Hildenbrand (3):
> /dev/mem: disallow access to explicitly excluded system RAM regions
> virtio-mem: disallow mapping virtio-mem memory via /dev/mem
> kernel/resource: cleanup and optimize iomem_is_exclusive()
>
> drivers/char/mem.c | 22 ++++++--------
> drivers/virtio/virtio_mem.c | 4 ++-
> include/linux/ioport.h | 1 +
> kernel/resource.c | 60 +++++++++++++++++++++++++++++++++----
> lib/Kconfig.debug | 4 ++-
> 5 files changed, 71 insertions(+), 20 deletions(-)
>
>
> base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
>

More review welcome; I'd suggest this should go via the -mm tree, and
not via the vhost tree.

--
Thanks,

David / dhildenb

2021-08-25 01:00:36

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand <[email protected]> wrote:
>
> virtio-mem dynamically exposes memory inside a device memory region as
> system RAM to Linux, coordinating with the hypervisor which parts are
> actually "plugged" and consequently usable/accessible. On the one hand, the
> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
> memory inside added memory blocks, dynamically either exposing them to
> the buddy or hiding them from the buddy and marking them PG_offline.
>
> virtio-mem wants to make sure that in a sane environment, nobody
> "accidentially" accesses unplugged memory inside the device managed
> region. After /proc/kcore has been sanitized and /dev/kmem has been
> removed, /dev/mem is the remaining interface that still allows uncontrolled
> access to the device-managed region of virtio-mem devices from user
> space.
>
> There is no known sane use case for mapping virtio-mem device memory
> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
> that region. So once the driver was loaded and detected the device
> along the device-managed region, we just want to disallow any access via
> /dev/mem to it.
>
> Let's add the basic infrastructure to exclude some physical memory
> regions completely from /dev/mem access, on any architecture and under
> any system configuration (independent of CONFIG_STRICT_DEVMEM and
> independent of "iomem=").

I'm certainly on team "/dev/mem considered harmful", but this approach
feels awkward. It feels wrong for being non-committal about whether
CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
turned on all the time, and the configuration option dropped, or there
are users clinging onto /dev/mem where they expect to be able to build
a debug kernel to turn all of these restrictions off, even the
virtio-mem ones. This splits the difference and says some /dev/mem
accesses are always disallowed for "reasons", but I could say the same
thing about pmem, there's no sane reason to allow /dev/mem which has
no idea about the responsibilities of properly touching pmem to get
access to it.


>
> Any range marked with "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE"
> will be excluded, even if not busy. For now, there are no applicable
> ranges and we'll modify virtio-mem next to properly set
> IORESOURCE_EXCLUSIVE on the parent resource.
>
> As next_resource() will iterate over children although we might want to
> skip a certain range completely, let's add and use
> next_range_skip_children() and for_each_resource(), to optimize that case,
> avoding having to traverse subtrees that are not of interest.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> drivers/char/mem.c | 22 ++++++++------------
> include/linux/ioport.h | 1 +
> kernel/resource.c | 47 ++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig.debug | 4 +++-
> 4 files changed, 60 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 1c596b5cdb27..1829dc6a1f29 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -60,13 +60,18 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
> }
> #endif
>
> -#ifdef CONFIG_STRICT_DEVMEM
> static inline int page_is_allowed(unsigned long pfn)
> {
> - return devmem_is_allowed(pfn);
> +#ifdef CONFIG_STRICT_DEVMEM
> + if (!devmem_is_allowed(pfn))
> + return 0;
> +#endif /* CONFIG_STRICT_DEVMEM */
> + return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), PAGE_SIZE);
> }
> +
> static inline int range_is_allowed(unsigned long pfn, unsigned long size)
> {
> +#ifdef CONFIG_STRICT_DEVMEM
> u64 from = ((u64)pfn) << PAGE_SHIFT;
> u64 to = from + size;
> u64 cursor = from;
> @@ -77,18 +82,9 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
> cursor += PAGE_SIZE;
> pfn++;
> }
> - return 1;
> -}
> -#else
> -static inline int page_is_allowed(unsigned long pfn)
> -{
> - return 1;
> -}
> -static inline int range_is_allowed(unsigned long pfn, unsigned long size)
> -{
> - return 1;
> +#endif /* CONFIG_STRICT_DEVMEM */
> + return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), size);
> }
> -#endif
>
> #ifndef unxlate_dev_mem_ptr
> #define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 8359c50f9988..d31f83281327 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -308,6 +308,7 @@ extern struct resource * __devm_request_region(struct device *dev,
> extern void __devm_release_region(struct device *dev, struct resource *parent,
> resource_size_t start, resource_size_t n);
> extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size);
> +extern bool iomem_range_contains_excluded_devmem(u64 addr, u64 size);
> extern bool iomem_is_exclusive(u64 addr);
>
> extern int
> diff --git a/kernel/resource.c b/kernel/resource.c
> index ca9f5198a01f..f57a14617c49 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -73,6 +73,18 @@ static struct resource *next_resource(struct resource *p)
> return p->sibling;
> }
>
> +static struct resource *next_resource_skip_children(struct resource *p)
> +{
> + while (!p->sibling && p->parent)
> + p = p->parent;
> + return p->sibling;
> +}
> +
> +#define for_each_resource(_root, _p, _skip_children) \
> + for ((_p) = (_root)->child; (_p); \
> + (_p) = (_skip_children) ? next_resource_skip_children(_p) : \
> + next_resource(_p))
> +
> static void *r_next(struct seq_file *m, void *v, loff_t *pos)
> {
> struct resource *p = v;
> @@ -1700,6 +1712,41 @@ int iomem_map_sanity_check(resource_size_t addr, unsigned long size)
> return err;
> }
>
> +/*
> + * Check if a physical memory range is completely excluded from getting
> + * mapped/accessed via /dev/mem.
> + */
> +bool iomem_range_contains_excluded_devmem(u64 addr, u64 size)
> +{
> + const unsigned int flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE;
> + bool skip_children = false, excluded = false;
> + struct resource *p;
> +
> + read_lock(&resource_lock);
> + for_each_resource(&iomem_resource, p, skip_children) {
> + if (p->start >= addr + size)
> + break;
> + if (p->end < addr) {
> + skip_children = true;
> + continue;
> + }
> + skip_children = false;
> +
> + /*
> + * A system RAM resource is excluded if IORESOURCE_EXCLUSIVE
> + * is set, even if not busy and even if we don't have strict
> + * checks enabled -- no ifs or buts.
> + */
> + if ((p->flags & flags) == flags) {
> + excluded = true;
> + break;
> + }
> + }
> + read_unlock(&resource_lock);
> +
> + return excluded;
> +}
> +
> #ifdef CONFIG_STRICT_DEVMEM
> static int strict_iomem_checks = 1;
> #else
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 5ddd575159fb..d0ce6e23a6db 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1780,7 +1780,9 @@ config STRICT_DEVMEM
> access to this is obviously disastrous, but specific access can
> be used by people debugging the kernel. Note that with PAT support
> enabled, even in this case there are restrictions on /dev/mem
> - use due to the cache aliasing requirements.
> + use due to the cache aliasing requirements. Further, some drivers
> + will still restrict access to some physical memory regions either
> + already used or to be used in the future as system RAM.
>
> If this option is switched on, and IO_STRICT_DEVMEM=n, the /dev/mem
> file only allows userspace access to PCI space and the BIOS code and
> --
> 2.31.1
>

2021-08-25 07:25:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

On 25.08.21 02:58, Dan Williams wrote:
> On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand <[email protected]> wrote:
>>
>> virtio-mem dynamically exposes memory inside a device memory region as
>> system RAM to Linux, coordinating with the hypervisor which parts are
>> actually "plugged" and consequently usable/accessible. On the one hand, the
>> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
>> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
>> memory inside added memory blocks, dynamically either exposing them to
>> the buddy or hiding them from the buddy and marking them PG_offline.
>>
>> virtio-mem wants to make sure that in a sane environment, nobody
>> "accidentially" accesses unplugged memory inside the device managed
>> region. After /proc/kcore has been sanitized and /dev/kmem has been
>> removed, /dev/mem is the remaining interface that still allows uncontrolled
>> access to the device-managed region of virtio-mem devices from user
>> space.
>>
>> There is no known sane use case for mapping virtio-mem device memory
>> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
>> that region. So once the driver was loaded and detected the device
>> along the device-managed region, we just want to disallow any access via
>> /dev/mem to it.
>>
>> Let's add the basic infrastructure to exclude some physical memory
>> regions completely from /dev/mem access, on any architecture and under
>> any system configuration (independent of CONFIG_STRICT_DEVMEM and
>> independent of "iomem=").
>
> I'm certainly on team "/dev/mem considered harmful", but this approach
> feels awkward. It feels wrong for being non-committal about whether
> CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
> turned on all the time, and the configuration option dropped, or there
> are users clinging onto /dev/mem where they expect to be able to build
> a debug kernel to turn all of these restrictions off, even the
> virtio-mem ones. This splits the difference and says some /dev/mem
> accesses are always disallowed for "reasons", but I could say the same
> thing about pmem, there's no sane reason to allow /dev/mem which has
> no idea about the responsibilities of properly touching pmem to get
> access to it.

For virtio-mem, there is no use case *and* access could be harmful; I
don't even want to allow if for debugging purposes. If you want to
inspect virtio-mem device memory content, use /proc/kcore, which
performs proper synchronized access checks. Modifying random virtio-mem
memory via /dev/mem in a debug kernel will not be possible: if you
really have to play with fire, use kdb or better don't load the
virtio-mem driver during boot, such that the kernel won't even be making
use of device memory.

I don't want people disabling CONFIG_STRICT_DEVMEM, or booting with
"iomem=relaxed", and "accidentally" accessing any of virtio-mem memory
via /dev/mem, while it gets concurrently plugged/unplugged by the
virtio-mem driver. Not even for debugging purposes.

We disallow mapping to some other regions independent of
CONFIG_STRICT_DEVMEM already, so the idea to ignore CONFIG_STRICT_DEVMEM
is not completely new:

"Note that with PAT support enabled, even in this case there are
restrictions on /dev/mem use due to the cache aliasing requirements."

Maybe you even want to do something similar with PMEM now that there is
infrastructure for it and just avoid having to deal with revoking
/dev/mem mappings later.


I think there are weird debugging/educational setups [1] that still
require CONFIG_STRICT_DEVMEM=n even with iomem=relaxed. Take a look at
lib/devmem_is_allowed.c:devmem_is_allowed(), it disallows any access to
(what's currently added as) System RAM. It might just do what people
want when dealing with system RAM that doesn't suddenly vanish , so I
don't ultimately see why we should remove CONFIG_STRICT_DEVMEM=n.

[1] https://bakhi.github.io/devmem/

Thanks!

--
Thanks,

David / dhildenb

2021-08-25 17:10:25

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

On Wed, Aug 25, 2021 at 12:23 AM David Hildenbrand <[email protected]> wrote:
>
> On 25.08.21 02:58, Dan Williams wrote:
> > On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> virtio-mem dynamically exposes memory inside a device memory region as
> >> system RAM to Linux, coordinating with the hypervisor which parts are
> >> actually "plugged" and consequently usable/accessible. On the one hand, the
> >> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
> >> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
> >> memory inside added memory blocks, dynamically either exposing them to
> >> the buddy or hiding them from the buddy and marking them PG_offline.
> >>
> >> virtio-mem wants to make sure that in a sane environment, nobody
> >> "accidentially" accesses unplugged memory inside the device managed
> >> region. After /proc/kcore has been sanitized and /dev/kmem has been
> >> removed, /dev/mem is the remaining interface that still allows uncontrolled
> >> access to the device-managed region of virtio-mem devices from user
> >> space.
> >>
> >> There is no known sane use case for mapping virtio-mem device memory
> >> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
> >> that region. So once the driver was loaded and detected the device
> >> along the device-managed region, we just want to disallow any access via
> >> /dev/mem to it.
> >>
> >> Let's add the basic infrastructure to exclude some physical memory
> >> regions completely from /dev/mem access, on any architecture and under
> >> any system configuration (independent of CONFIG_STRICT_DEVMEM and
> >> independent of "iomem=").
> >
> > I'm certainly on team "/dev/mem considered harmful", but this approach
> > feels awkward. It feels wrong for being non-committal about whether
> > CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
> > turned on all the time, and the configuration option dropped, or there
> > are users clinging onto /dev/mem where they expect to be able to build
> > a debug kernel to turn all of these restrictions off, even the
> > virtio-mem ones. This splits the difference and says some /dev/mem
> > accesses are always disallowed for "reasons", but I could say the same
> > thing about pmem, there's no sane reason to allow /dev/mem which has
> > no idea about the responsibilities of properly touching pmem to get
> > access to it.
>
> For virtio-mem, there is no use case *and* access could be harmful; I
> don't even want to allow if for debugging purposes. If you want to
> inspect virtio-mem device memory content, use /proc/kcore, which
> performs proper synchronized access checks. Modifying random virtio-mem
> memory via /dev/mem in a debug kernel will not be possible: if you
> really have to play with fire, use kdb or better don't load the
> virtio-mem driver during boot, such that the kernel won't even be making
> use of device memory.
>
> I don't want people disabling CONFIG_STRICT_DEVMEM, or booting with
> "iomem=relaxed", and "accidentally" accessing any of virtio-mem memory
> via /dev/mem, while it gets concurrently plugged/unplugged by the
> virtio-mem driver. Not even for debugging purposes.

That sounds more an argument that all of the existing "kernel is using
this region" cases should become mandatory exclusions. If unloading
the driver removes the exclusion then that's precisely
CONFIG_IO_STRICT_DEVMEM. Why is the virtio-mem driver more special
than any other driver that expects this integrity guarantee?

> We disallow mapping to some other regions independent of
> CONFIG_STRICT_DEVMEM already, so the idea to ignore CONFIG_STRICT_DEVMEM
> is not completely new:
>
> "Note that with PAT support enabled, even in this case there are
> restrictions on /dev/mem use due to the cache aliasing requirements."
>
> Maybe you even want to do something similar with PMEM now that there is
> infrastructure for it and just avoid having to deal with revoking
> /dev/mem mappings later.

That would be like blocking writes to /dev/sda just because a
filesytem might later be mounted on it. If the /dev/mem access is not
actively colliding with other kernel operations what business does the
kernel have saying no?

I'm pushing on this topic because I am also considering an exclusion
on PCI configuration access to the "DOE mailbox" since it can disrupt
the kernel's operation, at the same time, root can go change PCI BARs
to nonsensical values whenever it wants which is also in the category
of "has no use case && could be harmful".

> I think there are weird debugging/educational setups [1] that still
> require CONFIG_STRICT_DEVMEM=n even with iomem=relaxed. Take a look at
> lib/devmem_is_allowed.c:devmem_is_allowed(), it disallows any access to
> (what's currently added as) System RAM. It might just do what people
> want when dealing with system RAM that doesn't suddenly vanish , so I
> don't ultimately see why we should remove CONFIG_STRICT_DEVMEM=n.

Yes, I wanted to tease out more of your rationale on where the line
should be drawn, I think a mostly unfettered /dev/mem mode is here to
stay.

2021-08-25 17:29:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

On 25.08.21 19:07, Dan Williams wrote:
> On Wed, Aug 25, 2021 at 12:23 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 25.08.21 02:58, Dan Williams wrote:
>>> On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> virtio-mem dynamically exposes memory inside a device memory region as
>>>> system RAM to Linux, coordinating with the hypervisor which parts are
>>>> actually "plugged" and consequently usable/accessible. On the one hand, the
>>>> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
>>>> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
>>>> memory inside added memory blocks, dynamically either exposing them to
>>>> the buddy or hiding them from the buddy and marking them PG_offline.
>>>>
>>>> virtio-mem wants to make sure that in a sane environment, nobody
>>>> "accidentially" accesses unplugged memory inside the device managed
>>>> region. After /proc/kcore has been sanitized and /dev/kmem has been
>>>> removed, /dev/mem is the remaining interface that still allows uncontrolled
>>>> access to the device-managed region of virtio-mem devices from user
>>>> space.
>>>>
>>>> There is no known sane use case for mapping virtio-mem device memory
>>>> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
>>>> that region. So once the driver was loaded and detected the device
>>>> along the device-managed region, we just want to disallow any access via
>>>> /dev/mem to it.
>>>>
>>>> Let's add the basic infrastructure to exclude some physical memory
>>>> regions completely from /dev/mem access, on any architecture and under
>>>> any system configuration (independent of CONFIG_STRICT_DEVMEM and
>>>> independent of "iomem=").
>>>
>>> I'm certainly on team "/dev/mem considered harmful", but this approach
>>> feels awkward. It feels wrong for being non-committal about whether
>>> CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
>>> turned on all the time, and the configuration option dropped, or there
>>> are users clinging onto /dev/mem where they expect to be able to build
>>> a debug kernel to turn all of these restrictions off, even the
>>> virtio-mem ones. This splits the difference and says some /dev/mem
>>> accesses are always disallowed for "reasons", but I could say the same
>>> thing about pmem, there's no sane reason to allow /dev/mem which has
>>> no idea about the responsibilities of properly touching pmem to get
>>> access to it.
>>
>> For virtio-mem, there is no use case *and* access could be harmful; I
>> don't even want to allow if for debugging purposes. If you want to
>> inspect virtio-mem device memory content, use /proc/kcore, which
>> performs proper synchronized access checks. Modifying random virtio-mem
>> memory via /dev/mem in a debug kernel will not be possible: if you
>> really have to play with fire, use kdb or better don't load the
>> virtio-mem driver during boot, such that the kernel won't even be making
>> use of device memory.
>>
>> I don't want people disabling CONFIG_STRICT_DEVMEM, or booting with
>> "iomem=relaxed", and "accidentally" accessing any of virtio-mem memory
>> via /dev/mem, while it gets concurrently plugged/unplugged by the
>> virtio-mem driver. Not even for debugging purposes.
>
> That sounds more an argument that all of the existing "kernel is using
> this region" cases should become mandatory exclusions. If unloading
> the driver removes the exclusion then that's precisely
> CONFIG_IO_STRICT_DEVMEM. Why is the virtio-mem driver more special
> than any other driver that expects this integrity guarantee?

Unloading the driver will only remove exclusion if the driver can be
unloaded cleanly -- if there is no memory added to Linux. Similar to
force-unbinding dax/kmem without offlining memory, the whole device
range will remain excluded.

(unloading the driver is only even implemented because there is no way
to not implement it; there is no sane use case for virtio-mem to do that)

There are 2 things that are relevant for virtio-mem memory in regards of
this series:

1. Kernel is currently using it (added virtio-mem memory). Don't allow
access. Pretty much like most other things we want to exclude, I agree.

2. Kernel is currently not using it (not yet added virtio-mem memory),
or not using it right now any more (removed virtio-mem memory). In
contrast to other devices (DIMM, PMEM, ...) there is no sane use case
for this memory, because the VM must not use it (as defined in the
virtio-spec).


I care about 2) a lot because I don't want people looking at
/proc/iomem, figuring out that there is something to map. And by the
time they try to map it via /dev/mem, virtio-mem emoved that memory, yet
a /dev/mem mapping happened and we have invalid memory access.

Mapping /dev/mem and accidentally being able to read/write virtio-mem
memory has to be forbidden in sane environments. Force unloading a
driver or preventing it from loading just to touch virtio-mem memory via
/dev/mem is not a sane environment, someone is explicitly is asking for
trouble, which is fine.

>
>> We disallow mapping to some other regions independent of
>> CONFIG_STRICT_DEVMEM already, so the idea to ignore CONFIG_STRICT_DEVMEM
>> is not completely new:
>>
>> "Note that with PAT support enabled, even in this case there are
>> restrictions on /dev/mem use due to the cache aliasing requirements."
>>
>> Maybe you even want to do something similar with PMEM now that there is
>> infrastructure for it and just avoid having to deal with revoking
>> /dev/mem mappings later.
>
> That would be like blocking writes to /dev/sda just because a
> filesytem might later be mounted on it. If the /dev/mem access is not
> actively colliding with other kernel operations what business does the
> kernel have saying no?

That the spec defines that that memory must not be read/written, because
there might not be any memory after all anymore backing the virtio-mem
device, or there is and the hypervisor will flag you as "malicious" and
eventually zap the VM. That's different to most physical devices I am
aware of.

>
> I'm pushing on this topic because I am also considering an exclusion
> on PCI configuration access to the "DOE mailbox" since it can disrupt
> the kernel's operation, at the same time, root can go change PCI BARs
> to nonsensical values whenever it wants which is also in the category
> of "has no use case && could be harmful".

Right.

>
>> I think there are weird debugging/educational setups [1] that still
>> require CONFIG_STRICT_DEVMEM=n even with iomem=relaxed. Take a look at
>> lib/devmem_is_allowed.c:devmem_is_allowed(), it disallows any access to
>> (what's currently added as) System RAM. It might just do what people
>> want when dealing with system RAM that doesn't suddenly vanish , so I
>> don't ultimately see why we should remove CONFIG_STRICT_DEVMEM=n.
>
> Yes, I wanted to tease out more of your rationale on where the line
> should be drawn, I think a mostly unfettered /dev/mem mode is here to
> stay.

I could most certainly be convinced to

a) Leave CONFIG_STRICT_DEVMEM=n untouched
b) Restrict what I propose to CONFIG_STRICT_DEVMEM=y.

I could even go ahead and require CONFIG_STRICT_DEVMEM for virtio-mem.

--
Thanks,

David / dhildenb