2018-12-05 12:30:47

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 0/7] mm: PG_reserved cleanups and documentation

I was recently going over all users of PG_reserved. Short story: it is
difficult and sometimes not really clear if setting/checking for
PG_reserved is only a relict from the past. Easy to break things.

I had way more cleanups in this series inititally,
but some architectures take PG_reserved as a way to apply a different
caching strategy (for MMIO pages). So I decided to only include the most
obvious changes (that are less likely to break something).

So let's see if the documentation update for PG_reserved I crafted
actually covers most cases or if there is plenty more.

Most notably, for device memory we can hopefully soon stop setting
it PG_reserved

I only briefly tested this on s390x.

David Hildenbrand (7):
agp: efficeon: no need to set PG_reserved on GATT tables
s390/vdso: don't clear PG_reserved
powerpc/vdso: don't clear PG_reserved
riscv/vdso: don't clear PG_reserved
m68k/mm: use __ClearPageReserved()
arm64: kexec: no need to ClearPageReserved()
mm: better document PG_reserved

arch/arm64/kernel/machine_kexec.c | 1 -
arch/m68k/mm/memory.c | 2 +-
arch/powerpc/kernel/vdso.c | 2 --
arch/riscv/kernel/vdso.c | 1 -
arch/s390/kernel/vdso.c | 2 --
drivers/char/agp/efficeon-agp.c | 2 --
include/linux/page-flags.h | 18 ++++++++++++++++--
7 files changed, 17 insertions(+), 11 deletions(-)

--
2.17.2



2018-12-05 12:31:08

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 1/7] agp: efficeon: no need to set PG_reserved on GATT tables

The l1 GATT page table is kept in a special on-chip page with 64 entries.
We allocate the l2 page table pages via get_zeroed_page() and enter them
into the table. These l2 pages are modified accordingly when
inserting/removing memory via efficeon_insert_memory and
efficeon_remove_memory.

Apart from that, these pages are not exposed or ioremap'ed. We can stop
setting them reserved (propably copied from generic code).

Cc: David Airlie <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/char/agp/efficeon-agp.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/drivers/char/agp/efficeon-agp.c b/drivers/char/agp/efficeon-agp.c
index 7f88490b5479..c53f0f9ef5b0 100644
--- a/drivers/char/agp/efficeon-agp.c
+++ b/drivers/char/agp/efficeon-agp.c
@@ -163,7 +163,6 @@ static int efficeon_free_gatt_table(struct agp_bridge_data *bridge)
unsigned long page = efficeon_private.l1_table[index];
if (page) {
efficeon_private.l1_table[index] = 0;
- ClearPageReserved(virt_to_page((char *)page));
free_page(page);
freed++;
}
@@ -219,7 +218,6 @@ static int efficeon_create_gatt_table(struct agp_bridge_data *bridge)
efficeon_free_gatt_table(agp_bridge);
return -ENOMEM;
}
- SetPageReserved(virt_to_page((char *)page));

for (offset = 0; offset < PAGE_SIZE; offset += clflush_chunk)
clflush((char *)page+offset);
--
2.17.2


2018-12-05 12:31:16

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 2/7] s390/vdso: don't clear PG_reserved

The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Suggested-by: Martin Schwidefsky <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/kernel/vdso.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index ebe748a9f472..9e24d23c26c0 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -292,7 +292,6 @@ static int __init vdso_init(void)
BUG_ON(vdso32_pagelist == NULL);
for (i = 0; i < vdso32_pages - 1; i++) {
struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
- ClearPageReserved(pg);
get_page(pg);
vdso32_pagelist[i] = pg;
}
@@ -310,7 +309,6 @@ static int __init vdso_init(void)
BUG_ON(vdso64_pagelist == NULL);
for (i = 0; i < vdso64_pages - 1; i++) {
struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
- ClearPageReserved(pg);
get_page(pg);
vdso64_pagelist[i] = pg;
}
--
2.17.2


2018-12-05 12:31:19

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 4/7] riscv/vdso: don't clear PG_reserved

The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Tobias Klauser <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/riscv/kernel/vdso.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index 582cb153eb24..0cd044122234 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -54,7 +54,6 @@ static int __init vdso_init(void)
struct page *pg;

pg = virt_to_page(vdso_start + (i << PAGE_SHIFT));
- ClearPageReserved(pg);
vdso_pagelist[i] = pg;
}
vdso_pagelist[i] = virt_to_page(vdso_data);
--
2.17.2


2018-12-05 12:31:19

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 3/7] powerpc/vdso: don't clear PG_reserved

The VDSO is part of the kernel image and therefore the struct pages are
marked as reserved during boot.

As we install a special mapping, the actual struct pages will never be
exposed to MM via the page tables. We can therefore leave the pages
marked as reserved.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/kernel/vdso.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 65b3bdb99f0b..d59dc2e9a695 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -795,7 +795,6 @@ static int __init vdso_init(void)
BUG_ON(vdso32_pagelist == NULL);
for (i = 0; i < vdso32_pages; i++) {
struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
- ClearPageReserved(pg);
get_page(pg);
vdso32_pagelist[i] = pg;
}
@@ -809,7 +808,6 @@ static int __init vdso_init(void)
BUG_ON(vdso64_pagelist == NULL);
for (i = 0; i < vdso64_pages; i++) {
struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
- ClearPageReserved(pg);
get_page(pg);
vdso64_pagelist[i] = pg;
}
--
2.17.2


2018-12-05 12:31:28

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 7/7] mm: better document PG_reserved

The usage of PG_reserved and how PG_reserved pages are to be treated is
burried deep down in different parts of the kernel. Let's shine some light
onto these details by documenting (most?) current users and expected
behavior.

I don't see a reason why we have to document "Some of them might not even
exist". If there is a user, we should document it. E.g. for balloon
drivers we now use PG_offline to indicate that a page might currently
not be backed by memory in the hypervisor. And that is independent from
PG_reserved.

Cc: Andrew Morton <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Anthony Yznaga <[email protected]>
Cc: Miles Chen <[email protected]>
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-flags.h | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 68b8495e2fbc..112526f5ba61 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -17,8 +17,22 @@
/*
* Various page->flags bits:
*
- * PG_reserved is set for special pages, which can never be swapped out. Some
- * of them might not even exist...
+ * PG_reserved is set for special pages. The "struct page" of such a page
+ * should in general not be touched (e.g. set dirty) except by their owner.
+ * Pages marked as PG_reserved include:
+ * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
+ * - Pages allocated early during boot (bootmem, memblock)
+ * - Zero pages
+ * - Pages that have been associated with a zone but are not available for
+ * the page allocator (e.g. excluded via online_page_callback())
+ * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
+ * - MMIO pages (communicate with a device, special caching strategy needed)
+ * - MCA pages on ia64 (pages with memory errors)
+ * - Device memory (e.g. PMEM, DAX, HMM)
+ * Some architectures don't allow to ioremap pages that are not marked
+ * PG_reserved (as they might be in use by somebody else who does not respect
+ * the caching strategy). Consequently, PG_reserved for a page mapped into
+ * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
*
* The PG_private bitflag is set on pagecache pages if they contain filesystem
* specific data (which is normally at page->private). It can be used by
--
2.17.2


2018-12-05 12:32:38

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 6/7] arm64: kexec: no need to ClearPageReserved()

This will already be done by free_reserved_page().

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Bhupesh Sharma <[email protected]>
Cc: James Morse <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arm64/kernel/machine_kexec.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 922add8adb74..0ef4ea73aa54 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -353,7 +353,6 @@ void crash_free_reserved_phys_range(unsigned long begin, unsigned long end)

for (addr = begin; addr < end; addr += PAGE_SIZE) {
page = phys_to_page(addr);
- ClearPageReserved(page);
free_reserved_page(page);
}
}
--
2.17.2


2018-12-05 12:32:39

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 5/7] m68k/mm: use __ClearPageReserved()

The PG_reserved flag is cleared from memory that is part of the kernel
image (and therefore marked as PG_reserved). Avoid using PG_reserved
directly.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/m68k/mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/m68k/mm/memory.c b/arch/m68k/mm/memory.c
index b86a2e21693b..227c04fe60d2 100644
--- a/arch/m68k/mm/memory.c
+++ b/arch/m68k/mm/memory.c
@@ -51,7 +51,7 @@ void __init init_pointer_table(unsigned long ptable)
pr_debug("init_pointer_table: %lx, %x\n", ptable, PD_MARKBITS(dp));

/* unreserve the page so it's possible to free that page */
- PD_PAGE(dp)->flags &= ~(1 << PG_reserved);
+ __ClearPageReserved(PD_PAGE(dp));
init_page_count(PD_PAGE(dp));

return;
--
2.17.2


2018-12-05 12:58:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC 0/7] mm: PG_reserved cleanups and documentation

On Wed 05-12-18 13:28:44, David Hildenbrand wrote:
[...]
> Most notably, for device memory we can hopefully soon stop setting
> it PG_reserved

I am busy as hell so I am not likely to look at specific patche this
week. But could you be more specific why we need to get rid of other
PG_reserved users before we can do so for device memory?

I am all for removing relicts because they just confuse people but I
fail to see any relation here.

--
Michal Hocko
SUSE Labs

2018-12-05 13:00:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On Wed 05-12-18 13:28:51, David Hildenbrand wrote:
> The usage of PG_reserved and how PG_reserved pages are to be treated is
> burried deep down in different parts of the kernel. Let's shine some light
> onto these details by documenting (most?) current users and expected
> behavior.
>
> I don't see a reason why we have to document "Some of them might not even
> exist". If there is a user, we should document it. E.g. for balloon
> drivers we now use PG_offline to indicate that a page might currently
> not be backed by memory in the hypervisor. And that is independent from
> PG_reserved.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Anthony Yznaga <[email protected]>
> Cc: Miles Chen <[email protected]>
> Cc: [email protected]
> Cc: Dan Williams <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

This looks like an improvement. The essential part is that PG_reserved
page belongs to its user and no generic code should touch it. The rest
is a description of current users which I haven't checked due to to lack
of time but yeah, I like the updated wording because I have seen
multiple people confused from the swapped out part which is not true for
many many years. I have tried to dig out when it was actually the case
but failed.

So I cannot give my Ack because I didn't really do a real review but I
like this FWIW.

> ---
> include/linux/page-flags.h | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 68b8495e2fbc..112526f5ba61 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -17,8 +17,22 @@
> /*
> * Various page->flags bits:
> *
> - * PG_reserved is set for special pages, which can never be swapped out. Some
> - * of them might not even exist...
> + * PG_reserved is set for special pages. The "struct page" of such a page
> + * should in general not be touched (e.g. set dirty) except by their owner.
> + * Pages marked as PG_reserved include:
> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
> + * - Pages allocated early during boot (bootmem, memblock)
> + * - Zero pages
> + * - Pages that have been associated with a zone but are not available for
> + * the page allocator (e.g. excluded via online_page_callback())
> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
> + * - MMIO pages (communicate with a device, special caching strategy needed)
> + * - MCA pages on ia64 (pages with memory errors)
> + * - Device memory (e.g. PMEM, DAX, HMM)
> + * Some architectures don't allow to ioremap pages that are not marked
> + * PG_reserved (as they might be in use by somebody else who does not respect
> + * the caching strategy). Consequently, PG_reserved for a page mapped into
> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
> *
> * The PG_private bitflag is set on pagecache pages if they contain filesystem
> * specific data (which is normally at page->private). It can be used by
> --
> 2.17.2
>

--
Michal Hocko
SUSE Labs

2018-12-05 13:07:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 0/7] mm: PG_reserved cleanups and documentation

On 05.12.18 13:56, Michal Hocko wrote:
> On Wed 05-12-18 13:28:44, David Hildenbrand wrote:
> [...]
>> Most notably, for device memory we can hopefully soon stop setting
>> it PG_reserved
>
> I am busy as hell so I am not likely to look at specific patche this
> week. But could you be more specific why we need to get rid of other
> PG_reserved users before we can do so for device memory?
>

No worries, this has time.

For device memory, nothing should really be needed. I am only collecting
and docuumenting users and this is one user soon to go (eventually) :)

> I am all for removing relicts because they just confuse people but I
> fail to see any relation here.
>

It's really only "why is this patch set not bigger", nothing related to
device memory actually :)

--

Thanks,

David / dhildenb

2018-12-05 14:01:38

by James Morse

[permalink] [raw]
Subject: Re: [PATCH RFC 6/7] arm64: kexec: no need to ClearPageReserved()

Hi David,

(CC: +Akashi)

On 05/12/2018 12:28, David Hildenbrand wrote:
> This will already be done by free_reserved_page().

(will already be -> will be ?)

So it does!


> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index 922add8adb74..0ef4ea73aa54 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -353,7 +353,6 @@ void crash_free_reserved_phys_range(unsigned long begin, unsigned long end)
>
> for (addr = begin; addr < end; addr += PAGE_SIZE) {
> page = phys_to_page(addr);
> - ClearPageReserved(page);
> free_reserved_page(page);
> }
> }
>

Acked-by: James Morse <[email protected]>


Thanks,

James

2018-12-05 14:36:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
> I don't see a reason why we have to document "Some of them might not even
> exist". If there is a user, we should document it. E.g. for balloon
> drivers we now use PG_offline to indicate that a page might currently
> not be backed by memory in the hypervisor. And that is independent from
> PG_reserved.

I think you're confused by the meaning of "some of them might not even
exist". What this means is that there might not be memory there; maybe
writes to that memory will be discarded, or maybe they'll cause a machine
check. Maybe reads will return ~0, or 0, or cause a machine check.
We just don't know what's there, and we shouldn't try touching the memory.

> +++ b/include/linux/page-flags.h
> @@ -17,8 +17,22 @@
> /*
> * Various page->flags bits:
> *
> - * PG_reserved is set for special pages, which can never be swapped out. Some
> - * of them might not even exist...
> + * PG_reserved is set for special pages. The "struct page" of such a page
> + * should in general not be touched (e.g. set dirty) except by their owner.
> + * Pages marked as PG_reserved include:
> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
> + * - Pages allocated early during boot (bootmem, memblock)
> + * - Zero pages
> + * - Pages that have been associated with a zone but are not available for
> + * the page allocator (e.g. excluded via online_page_callback())
> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
> + * - MMIO pages (communicate with a device, special caching strategy needed)
> + * - MCA pages on ia64 (pages with memory errors)
> + * - Device memory (e.g. PMEM, DAX, HMM)
> + * Some architectures don't allow to ioremap pages that are not marked
> + * PG_reserved (as they might be in use by somebody else who does not respect
> + * the caching strategy). Consequently, PG_reserved for a page mapped into
> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.

So maybe just add one more option to the list.

2018-12-05 15:07:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On 05.12.18 15:35, Matthew Wilcox wrote:
> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>> I don't see a reason why we have to document "Some of them might not even
>> exist". If there is a user, we should document it. E.g. for balloon
>> drivers we now use PG_offline to indicate that a page might currently
>> not be backed by memory in the hypervisor. And that is independent from
>> PG_reserved.
>
> I think you're confused by the meaning of "some of them might not even
> exist". What this means is that there might not be memory there; maybe
> writes to that memory will be discarded, or maybe they'll cause a machine
> check. Maybe reads will return ~0, or 0, or cause a machine check.
> We just don't know what's there, and we shouldn't try touching the memory.

If there are users, let's document it. And I need more details for that :)

1. machine check: if there is a HW error, we set PG_hwpoison (except
ia64 MCA, see the list)

2. Writes to that memory will be discarded

Who is the user of that? When will we have such pages right now?

3. Reads will return ~0, / 0?

I think this is a special case of e.g. x86? But where do we have that,
are there any user?


In summary: When can we have memory sections that are online but pages
reserved and not accessible? (one example is ballooning I mention here)

(I classify this as dangerous as dump tools will happily dump
PG_reserved pages (unless PG_hwpoison/PG_offline) and that's the right
thing to do).

I want to avoid documenting things that are not actually getting used.

>
>> +++ b/include/linux/page-flags.h
>> @@ -17,8 +17,22 @@
>> /*
>> * Various page->flags bits:
>> *
>> - * PG_reserved is set for special pages, which can never be swapped out. Some
>> - * of them might not even exist...
>> + * PG_reserved is set for special pages. The "struct page" of such a page
>> + * should in general not be touched (e.g. set dirty) except by their owner.
>> + * Pages marked as PG_reserved include:
>> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
>> + * - Pages allocated early during boot (bootmem, memblock)
>> + * - Zero pages
>> + * - Pages that have been associated with a zone but are not available for
>> + * the page allocator (e.g. excluded via online_page_callback())
>> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
>> + * - MMIO pages (communicate with a device, special caching strategy needed)
>> + * - MCA pages on ia64 (pages with memory errors)
>> + * - Device memory (e.g. PMEM, DAX, HMM)
>> + * Some architectures don't allow to ioremap pages that are not marked
>> + * PG_reserved (as they might be in use by somebody else who does not respect
>> + * the caching strategy). Consequently, PG_reserved for a page mapped into
>> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
>
> So maybe just add one more option to the list.
>


--

Thanks,

David / dhildenb

2018-12-05 17:34:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
> On 05.12.18 15:35, Matthew Wilcox wrote:
> > On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
> >> I don't see a reason why we have to document "Some of them might not even
> >> exist". If there is a user, we should document it. E.g. for balloon
> >> drivers we now use PG_offline to indicate that a page might currently
> >> not be backed by memory in the hypervisor. And that is independent from
> >> PG_reserved.
> >
> > I think you're confused by the meaning of "some of them might not even
> > exist". What this means is that there might not be memory there; maybe
> > writes to that memory will be discarded, or maybe they'll cause a machine
> > check. Maybe reads will return ~0, or 0, or cause a machine check.
> > We just don't know what's there, and we shouldn't try touching the memory.
>
> If there are users, let's document it. And I need more details for that :)
>
> 1. machine check: if there is a HW error, we set PG_hwpoison (except
> ia64 MCA, see the list)
>
> 2. Writes to that memory will be discarded
>
> Who is the user of that? When will we have such pages right now?
>
> 3. Reads will return ~0, / 0?
>
> I think this is a special case of e.g. x86? But where do we have that,
> are there any user?

When there are gaps in the physical memory. As in, if you put that
physical address on the bus (or in a packet), no device will respond
to it. Look:

00000000-00000fff : Reserved
00001000-00057fff : System RAM
00058000-00058fff : Reserved
00059000-0009dfff : System RAM
0009e000-000fffff : Reserved

Those examples I gave are examples of how various different architectures
respond to "no device responded to this memory access".


2018-12-05 18:14:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On 05.12.18 18:32, Matthew Wilcox wrote:
> On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
>> On 05.12.18 15:35, Matthew Wilcox wrote:
>>> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>>>> I don't see a reason why we have to document "Some of them might not even
>>>> exist". If there is a user, we should document it. E.g. for balloon
>>>> drivers we now use PG_offline to indicate that a page might currently
>>>> not be backed by memory in the hypervisor. And that is independent from
>>>> PG_reserved.
>>>
>>> I think you're confused by the meaning of "some of them might not even
>>> exist". What this means is that there might not be memory there; maybe
>>> writes to that memory will be discarded, or maybe they'll cause a machine
>>> check. Maybe reads will return ~0, or 0, or cause a machine check.
>>> We just don't know what's there, and we shouldn't try touching the memory.
>>
>> If there are users, let's document it. And I need more details for that :)
>>
>> 1. machine check: if there is a HW error, we set PG_hwpoison (except
>> ia64 MCA, see the list)
>>
>> 2. Writes to that memory will be discarded
>>
>> Who is the user of that? When will we have such pages right now?
>>
>> 3. Reads will return ~0, / 0?
>>
>> I think this is a special case of e.g. x86? But where do we have that,
>> are there any user?
>
> When there are gaps in the physical memory. As in, if you put that
> physical address on the bus (or in a packet), no device will respond
> to it. Look:
>
> 00000000-00000fff : Reserved
> 00001000-00057fff : System RAM
> 00058000-00058fff : Reserved
> 00059000-0009dfff : System RAM
> 0009e000-000fffff : Reserved
>
> Those examples I gave are examples of how various different architectures
> respond to "no device responded to this memory access".
>

Okay, so for this memory we will have
a) vmmaps
b) Memory block devices
c) A sections that is online

So essentially "Gaps in physical memory" which is part of a online section.

This might be a candidate for PG_offline as well.

Thanks for the info, I'll try to find out how such things are handled.
In general I assume this memory has to be readable, because otherwise
kdump and friends would crash already when trying to dump?

--

Thanks,

David / dhildenb

2018-12-06 10:49:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] mm: better document PG_reserved

On 05.12.18 19:13, David Hildenbrand wrote:
> On 05.12.18 18:32, Matthew Wilcox wrote:
>> On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
>>> On 05.12.18 15:35, Matthew Wilcox wrote:
>>>> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>>>>> I don't see a reason why we have to document "Some of them might not even
>>>>> exist". If there is a user, we should document it. E.g. for balloon
>>>>> drivers we now use PG_offline to indicate that a page might currently
>>>>> not be backed by memory in the hypervisor. And that is independent from
>>>>> PG_reserved.
>>>>
>>>> I think you're confused by the meaning of "some of them might not even
>>>> exist". What this means is that there might not be memory there; maybe
>>>> writes to that memory will be discarded, or maybe they'll cause a machine
>>>> check. Maybe reads will return ~0, or 0, or cause a machine check.
>>>> We just don't know what's there, and we shouldn't try touching the memory.
>>>
>>> If there are users, let's document it. And I need more details for that :)
>>>
>>> 1. machine check: if there is a HW error, we set PG_hwpoison (except
>>> ia64 MCA, see the list)
>>>
>>> 2. Writes to that memory will be discarded
>>>
>>> Who is the user of that? When will we have such pages right now?
>>>
>>> 3. Reads will return ~0, / 0?
>>>
>>> I think this is a special case of e.g. x86? But where do we have that,
>>> are there any user?
>>
>> When there are gaps in the physical memory. As in, if you put that
>> physical address on the bus (or in a packet), no device will respond
>> to it. Look:
>>
>> 00000000-00000fff : Reserved
>> 00001000-00057fff : System RAM
>> 00058000-00058fff : Reserved
>> 00059000-0009dfff : System RAM
>> 0009e000-000fffff : Reserved
>>
>> Those examples I gave are examples of how various different architectures
>> respond to "no device responded to this memory access".
>>
>
> Okay, so for this memory we will have
> a) vmmaps
> b) Memory block devices
> c) A sections that is online
>
> So essentially "Gaps in physical memory" which is part of a online section.
>
> This might be a candidate for PG_offline as well.
>
> Thanks for the info, I'll try to find out how such things are handled.
> In general I assume this memory has to be readable, because otherwise
> kdump and friends would crash already when trying to dump?
>

So I finally understood how physical memory holes in online sections are
handled when dumping. They won't be dumped because the list of dumpable
chunks (contained in /proc/kcore and after a crash /proc/vmcore) is
built using walk_system_ram_range(). So anything not listed as RAM will
be ignored.

I will update the documentation, describing that if we have an online
section that is not completely IORESOURCE_SYSTEM_RAM, that the physical
memory gaps will also be set to PG_reserved.

Trying to touch this memory is indeed dangerous, luckily dumping is
properly handled.

I'll think about if marking these ranges as PG_offline might make sense
(and if it can be easily added). Then we directly know when seeing that
page type that we should not touch it. Ever.

That hint was really helpful :)

--

Thanks,

David / dhildenb

2018-12-07 18:47:26

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH RFC 4/7] riscv/vdso: don't clear PG_reserved

On Wed, 05 Dec 2018 04:28:48 PST (-0800), [email protected] wrote:
> The VDSO is part of the kernel image and therefore the struct pages are
> marked as reserved during boot.
>
> As we install a special mapping, the actual struct pages will never be
> exposed to MM via the page tables. We can therefore leave the pages
> marked as reserved.
>
> Cc: Palmer Dabbelt <[email protected]>
> Cc: Albert Ou <[email protected]>
> Cc: Tobias Klauser <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/riscv/kernel/vdso.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
> index 582cb153eb24..0cd044122234 100644
> --- a/arch/riscv/kernel/vdso.c
> +++ b/arch/riscv/kernel/vdso.c
> @@ -54,7 +54,6 @@ static int __init vdso_init(void)
> struct page *pg;
>
> pg = virt_to_page(vdso_start + (i << PAGE_SHIFT));
> - ClearPageReserved(pg);
> vdso_pagelist[i] = pg;
> }
> vdso_pagelist[i] = virt_to_page(vdso_data);

I'm going to assume this will go in through another tree along with the rest of
the set assuming everyone else is happy with it.

Acked-by: Palmer Dabbelt <[email protected]>

Thanks!