Changes since v5 [1]:
* Move put_page() before memory_failure() in madvise_inject_error()
(Naoya)
* The previous change uncovered a latent bug / broken assumption in
__put_devmap_managed_page(). We need to preserve page->mapping for
dax pages when they go idle.
* Rename mapping_size() to dev_pagemap_mapping_size() (Naoya)
* Catch and fail attempts to soft-offline dax pages (Naoya)
* Collect Naoya's ack on "mm, memory_failure: Collect mapping size in
collect_procs()"
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-July/016682.html
---
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculative-access
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings; however, the primary motivation is to allow the system to
survive userspace consumption of hardware poison via dax. Specifically,
the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks.
---
Dan Williams (13):
device-dax: Convert to vmf_insert_mixed and vm_fault_t
device-dax: Enable page_mapping()
device-dax: Set page->index
filesystem-dax: Set page->index
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, memory_failure: Collect mapping size in collect_procs()
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
x86/memory_failure: Introduce {set,clear}_mce_nospec()
libnvdimm, pmem: Restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 42 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 --
arch/x86/kernel/cpu/mcheck/mce.c | 38 -----
arch/x86/mm/pat.c | 16 ++
drivers/dax/device.c | 75 +++++++---
drivers/nvdimm/pmem.c | 26 ++++
drivers/nvdimm/pmem.h | 13 ++
fs/dax.c | 125 ++++++++++++++++-
include/linux/dax.h | 13 ++
include/linux/huge_mm.h | 5 -
include/linux/mm.h | 1
include/linux/set_memory.h | 14 ++
kernel/memremap.c | 1
mm/hmm.c | 2
mm/huge_memory.c | 4 -
mm/madvise.c | 16 ++
mm/memory-failure.c | 210 +++++++++++++++++++++++------
17 files changed, 481 insertions(+), 135 deletions(-)
Use the new return type vm_fault_t for the fault and huge_fault
handlers. For now, this just documents that the functions return a
VM_FAULT value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type.
See commit 1c8f422059ae ("mm: change return type to vm_fault_t").
Previously, vm_insert_mixed() returned an error code which the driver
mapped into a VM_FAULT_* type. The new function vmf_insert_mixed()
removes this inefficiency by returning a VM_FAULT_* code directly.
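As a rough sketch (not the verbatim mm/memory.c implementation), the
translation that drivers used to open-code, and that vmf_insert_mixed()
now performs internally, looks like:

    static vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma,
                    unsigned long addr, pfn_t pfn)
    {
            int err = vm_insert_mixed(vma, addr, pfn);

            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            if (err < 0 && err != -EBUSY)
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;
    }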
Signed-off-by: Souptick Joarder <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/device.c | 26 +++++++++++---------------
include/linux/huge_mm.h | 5 +++--
mm/huge_memory.c | 4 ++--
3 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index de2f8297a210..ad5e7b4a15dc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
return -1;
}
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+ struct vm_fault *vmf)
{
struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
- int rc = VM_FAULT_SIGBUS;
phys_addr_t phys;
pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
- if (rc == -ENOMEM)
- return VM_FAULT_OOM;
- if (rc < 0 && rc != -EBUSY)
- return VM_FAULT_SIGBUS;
-
- return VM_FAULT_NOPAGE;
+ return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
}
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+ struct vm_fault *vmf)
{
unsigned long pmd_addr = vmf->address & PMD_MASK;
struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+ struct vm_fault *vmf)
{
unsigned long pud_addr = vmf->address & PUD_MASK;
struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
vmf->flags & FAULT_FLAG_WRITE);
}
#else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+ struct vm_fault *vmf)
{
return VM_FAULT_FALLBACK;
}
#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
{
int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
return rc;
}
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
{
return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
#define _LINUX_HUGE_MM_H
#include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
#include <linux/fs.h> /* only for vma_is_dax() */
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, pfn_t pfn, bool write);
enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..feba371169ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -752,7 +752,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
spin_unlock(ptl);
}
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, bool write)
{
pgprot_t pgprot = vma->vm_page_prot;
@@ -812,7 +812,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
spin_unlock(ptl);
}
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, pfn_t pfn, bool write)
{
pgprot_t pgprot = vma->vm_page_prot;
In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.
The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.
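For reference, the rmap walk recovers the user virtual address from
->index with the usual linear-mapping arithmetic (a simplified sketch
of vma_address()):

    /* sketch: translate page->index back to a user address in @vma */
    unsigned long addr = vma->vm_start +
            ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

...so a zero or stale ->index would point the reverse map at the wrong
address.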
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/device.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 95cfcfd612df..361a11089591 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -416,6 +416,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
if (rc == VM_FAULT_NOPAGE) {
unsigned long i;
+ pgoff_t pgoff;
/*
* In the device-dax case the only possibility for a
@@ -423,6 +424,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
* mapped. No need to consider the zero page, or racing
* conflicting mappings.
*/
+ pgoff = linear_page_index(vmf->vma, vmf->address
+ & ~(fault_size - 1));
for (i = 0; i < fault_size / PAGE_SIZE; i++) {
struct page *page;
@@ -430,6 +433,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
if (page->mapping)
continue;
page->mapping = filp->f_mapping;
+ page->index = pgoff + i;
}
}
dax_read_unlock(id);
In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 641192808bb6..4de11ed463ce 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,27 @@ static unsigned long dax_radix_end_pfn(void *entry)
for (pfn = dax_radix_pfn(entry); \
pfn < dax_radix_end_pfn(entry); pfn++)
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+/*
+ * TODO: for reflink+dax we need a way to associate a single page with
+ * multiple address_space instances at different linear_page_index()
+ * offsets.
+ */
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+ struct vm_area_struct *vma, unsigned long address)
{
- unsigned long pfn;
+ unsigned long size = dax_entry_size(entry), pfn, index;
+ int i = 0;
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;
+ index = linear_page_index(vma, address & ~(size - 1));
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
+ page->index = index + i++;
}
}
@@ -348,6 +357,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
+ page->index = 0;
}
}
@@ -701,7 +711,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
new_entry = dax_radix_locked_entry(pfn, flags);
if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
dax_disassociate_entry(entry, mapping, false);
- dax_associate_entry(new_entry, mapping);
+ dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
}
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.
The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed / later reinitialized to zero.
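To illustrate why rmap needs this association, here is a simplified
sketch of the file-backed reverse-map walk (mirroring the
collect_procs_file() pattern) that this series relies on:

    /* sketch: find vmas mapping the poisoned page via ->mapping */
    struct address_space *mapping = page_mapping(page);
    struct vm_area_struct *vma;
    pgoff_t pgoff = page->index;

    i_mmap_lock_read(mapping);
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
            /* consider each process mapping this pgoff for the kill list */
    }
    i_mmap_unlock_read(mapping);

Without a valid ->mapping, page_mapping() returns NULL and the walk has
nowhere to start.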
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/device.c | 55 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 38 insertions(+), 17 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index ad5e7b4a15dc..95cfcfd612df 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,12 +245,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
}
static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
- struct vm_fault *vmf)
+ struct vm_fault *vmf, pfn_t *pfn)
{
struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
- pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
if (check_vma(dev_dax, vmf->vma, __func__))
@@ -272,20 +271,19 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
}
static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
- struct vm_fault *vmf)
+ struct vm_fault *vmf, pfn_t *pfn)
{
unsigned long pmd_addr = vmf->address & PMD_MASK;
struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
- pfn_t pfn;
unsigned int fault_size = PMD_SIZE;
if (check_vma(dev_dax, vmf->vma, __func__))
@@ -321,22 +319,21 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+ return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn,
vmf->flags & FAULT_FLAG_WRITE);
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
- struct vm_fault *vmf)
+ struct vm_fault *vmf, pfn_t *pfn)
{
unsigned long pud_addr = vmf->address & PUD_MASK;
struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
- pfn_t pfn;
unsigned int fault_size = PUD_SIZE;
@@ -373,14 +370,14 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+ return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn,
vmf->flags & FAULT_FLAG_WRITE);
}
#else
static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
- struct vm_fault *vmf)
+ struct vm_fault *vmf, pfn_t *pfn)
{
return VM_FAULT_FALLBACK;
}
@@ -389,8 +386,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
{
- int rc, id;
struct file *filp = vmf->vma->vm_file;
+ unsigned long fault_size;
+ int rc, id;
+ pfn_t pfn;
struct dev_dax *dev_dax = filp->private_data;
dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -400,17 +399,39 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
id = dax_read_lock();
switch (pe_size) {
case PE_SIZE_PTE:
- rc = __dev_dax_pte_fault(dev_dax, vmf);
+ fault_size = PAGE_SIZE;
+ rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
break;
case PE_SIZE_PMD:
- rc = __dev_dax_pmd_fault(dev_dax, vmf);
+ fault_size = PMD_SIZE;
+ rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
break;
case PE_SIZE_PUD:
- rc = __dev_dax_pud_fault(dev_dax, vmf);
+ fault_size = PUD_SIZE;
+ rc = __dev_dax_pud_fault(dev_dax, vmf, &pfn);
break;
default:
rc = VM_FAULT_SIGBUS;
}
+
+ if (rc == VM_FAULT_NOPAGE) {
+ unsigned long i;
+
+ /*
+ * In the device-dax case the only possibility for a
+ * VM_FAULT_NOPAGE result is when device-dax capacity is
+ * mapped. No need to consider the zero page, or racing
+ * conflicting mappings.
+ */
+ for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+ struct page *page;
+
+ page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+ if (page->mapping)
+ continue;
+ page->mapping = filp->f_mapping;
+ }
+ }
dax_read_unlock(id);
return rc;
In preparation for using set_memory_uc() instead of set_memory_np() for
isolating poison from speculation, teach the memtype code to sanitize
physical addresses against __PHYSICAL_MASK.
The motivation for using set_memory_uc() for this case is to allow
ongoing access to persistent memory pages via the pmem-driver +
memcpy_mcsafe() until the poison is repaired.
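A small worked example of why masking suffices (PAGE_OFFSET value
assumed for illustration): with PAGE_OFFSET = 0xffff880000000000, bit
63 is set in every linear-map address, so the decoy

    decoy = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

has bit 63 clear (non-canonical), and __pa(decoy) comes out as
(pfn << PAGE_SHIFT) | BIT(63). Clearing everything above
__PHYSICAL_MASK in sanitize_phys() recovers the real physical address.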
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/mm/pat.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 1555bd7d3449..6788ffa990f8 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -512,6 +512,17 @@ static int free_ram_pages_type(u64 start, u64 end)
return 0;
}
+static u64 sanitize_phys(u64 address)
+{
+ /*
+ * When changing the memtype for pages containing poison allow
+ * for a "decoy" virtual address (bit 63 clear) passed to
+ * set_memory_X(). __pa() on a "decoy" address results in a
+ * physical address with bit 63 set.
+ */
+ return address & __PHYSICAL_MASK;
+}
+
/*
* req_type typically has one of the:
* - _PAGE_CACHE_MODE_WB
@@ -533,6 +544,8 @@ int reserve_memtype(u64 start, u64 end, enum page_cache_mode req_type,
int is_range_ram;
int err = 0;
+ start = sanitize_phys(start);
+ end = sanitize_phys(end);
BUG_ON(start >= end); /* end is exclusive */
if (!pat_enabled()) {
@@ -609,6 +622,9 @@ int free_memtype(u64 start, u64 end)
if (!pat_enabled())
return 0;
+ start = sanitize_phys(start);
+ end = sanitize_phys(end);
+
/* Low ISA region is always mapped WB. No need to track */
if (x86_platform.is_untracked_pat_range(start, end))
return 0;
Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.
Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_kpfn() to set_mce_nospec(), remapping the page
as UC to hide it from speculative accesses.
Given that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().
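The intended lifecycle, as a sketch spanning this patch and the later
pmem patch in the series:

    /* poison consumed: hide the pfn from speculation */
    if (!memory_failure(pfn, 0))
            set_mce_nospec(pfn);    /* remap 1:1 mapping UC */

    /* ...later, once the driver has cleared the media error: */
    clear_mce_nospec(pfn);          /* restore WB, full speculation */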
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Acked-by: Tony Luck <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/set_memory.h | 42 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 ----------
arch/x86/kernel/cpu/mcheck/mce.c | 38 ++------------------------
include/linux/set_memory.h | 14 ++++++++++
4 files changed, 59 insertions(+), 50 deletions(-)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index bd090367236c..cf5e9124b45e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,46 @@ extern int kernel_set_to_readonly;
void set_kernel_text_rw(void);
void set_kernel_text_ro(void);
+#ifdef CONFIG_X86_64
+static inline int set_mce_nospec(unsigned long pfn)
+{
+ unsigned long decoy_addr;
+ int rc;
+
+ /*
+ * Mark the linear address as UC to make sure we don't log more
+ * errors because of speculative access to the page.
+ * We would like to just call:
+ * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * but doing that would radically increase the odds of a
+ * speculative access to the poison page because we'd have
+ * the virtual address of the kernel 1:1 mapping sitting
+ * around in registers.
+ * Instead we get tricky. We create a non-canonical address
+ * that looks just like the one we want, but has bit 63 flipped.
+ * This relies on set_memory_uc() properly sanitizing any __pa()
+ * results with __PHYSICAL_MASK or PTE_PFN_MASK.
+ */
+ decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+
+ rc = set_memory_uc(decoy_addr, 1);
+ if (rc)
+ pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+ return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+ return set_memory_wb((unsigned long) pfn_to_kaddr(pfn), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
#endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct notifier_block *nb) { }
static inline void mce_unregister_injector_chain(struct notifier_block *nb) { }
#endif
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- * are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- * recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
struct mca_config {
bool dont_log_ce;
bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c102ad51025e..42a061ce1f5d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
#include <linux/irq_work.h>
#include <linux/export.h>
#include <linux/jump_label.h>
+#include <linux/set_memory.h>
#include <asm/intel-family.h>
#include <asm/processor.h>
@@ -50,7 +51,6 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/reboot.h>
-#include <asm/set_memory.h>
#include "mce-internal.h"
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
/*
* CPU/chipset specific EDAC code can register a notifier call here to print
* MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0))
- mce_unmap_kpfn(pfn);
+ set_mce_nospec(pfn);
}
return NOTIFY_OK;
@@ -1072,38 +1068,10 @@ static int do_memory_failure(struct mce *m)
if (ret)
pr_err("Memory error not recovered");
else
- mce_unmap_kpfn(m->addr >> PAGE_SHIFT);
+ set_mce_nospec(m->addr >> PAGE_SHIFT);
return ret;
}
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn)
-{
- unsigned long decoy_addr;
-
- /*
- * Unmap this page from the kernel 1:1 mappings to make sure
- * we don't log more errors because of speculative access to
- * the page.
- * We would like to just call:
- * set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
- * but doing that would radically increase the odds of a
- * speculative access to the poison page because we'd have
- * the virtual address of the kernel 1:1 mapping sitting
- * around in registers.
- * Instead we get tricky. We create a non-canonical address
- * that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_np() not checking whether we passed
- * a legal address.
- */
-
- decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
-
- if (set_memory_np(decoy_addr, 1))
- pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
-}
-#endif
-
/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index da5178216da5..2a986d282a97 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -17,6 +17,20 @@ static inline int set_memory_x(unsigned long addr, int numpages) { return 0; }
static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
#endif
+#ifndef set_mce_nospec
+static inline int set_mce_nospec(unsigned long pfn)
+{
+ return 0;
+}
+#endif
+
+#ifndef clear_mce_nospec
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+ return 0;
+}
+#endif
+
#ifndef CONFIG_ARCH_HAS_MEM_ENCRYPT
static inline int set_memory_encrypted(unsigned long addr, int numpages)
{
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical-address-to-file-offset associations. Also,
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internally remap / repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.
Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.
Cc: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mm.h | 1
mm/memory-failure.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 124 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..374e5e9284f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2725,6 +2725,7 @@ enum mf_action_page_type {
MF_MSG_TRUNCATED_LRU,
MF_MSG_BUDDY,
MF_MSG_BUDDY_2ND,
+ MF_MSG_DAX,
MF_MSG_UNKNOWN,
};
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8a81680d00dd..effaa7c7a1a4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
#include <linux/hugetlb.h>
#include <linux/memory_hotplug.h>
#include <linux/mm_inline.h>
+#include <linux/memremap.h>
#include <linux/kfifo.h>
#include <linux/ratelimit.h>
#include "internal.h"
@@ -263,6 +264,40 @@ void shake_page(struct page *p, int access)
}
EXPORT_SYMBOL_GPL(shake_page);
+static unsigned long dev_pagemap_mapping_size(struct page *page,
+ struct vm_area_struct *vma)
+{
+ unsigned long address = vma_address(page, vma);
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(vma->vm_mm, address);
+ if (!pgd_present(*pgd))
+ return 0;
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return 0;
+ pud = pud_offset(p4d, address);
+ if (!pud_present(*pud))
+ return 0;
+ if (pud_devmap(*pud))
+ return PUD_SIZE;
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+ if (pmd_devmap(*pmd))
+ return PMD_SIZE;
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if (pte_devmap(*pte))
+ return PAGE_SIZE;
+ return 0;
+}
+
/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
@@ -292,7 +327,10 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
}
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;
- tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+ if (is_zone_device_page(p))
+ tk->size_shift = ilog2(dev_pagemap_mapping_size(p, vma));
+ else
+ tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
/*
* In theory we don't have to kill when the page was
@@ -300,7 +338,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
* likely very rare kill anyways just out of paranoia, but use
* a SIGKILL because the error is not contained anymore.
*/
- if (tk->addr == -EFAULT) {
+ if (tk->addr == -EFAULT || tk->size_shift == 0) {
pr_info("Memory failure: Unable to find user space address %lx in %s\n",
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
@@ -514,6 +552,7 @@ static const char * const action_page_types[] = {
[MF_MSG_TRUNCATED_LRU] = "already truncated LRU page",
[MF_MSG_BUDDY] = "free buddy page",
[MF_MSG_BUDDY_2ND] = "free buddy page (2nd try)",
+ [MF_MSG_DAX] = "dax page",
[MF_MSG_UNKNOWN] = "unknown page",
};
@@ -1111,6 +1150,83 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
return res;
}
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+ struct dev_pagemap *pgmap)
+{
+ struct page *page = pfn_to_page(pfn);
+ const bool unmap_success = true;
+ unsigned long size = 0;
+ struct to_kill *tk;
+ LIST_HEAD(tokill);
+ int rc = -EBUSY;
+ loff_t start;
+
+ /*
+ * Prevent the inode from being freed while we are interrogating
+ * the address_space, typically this would be handled by
+ * lock_page(), but dax pages do not use the page lock. This
+ * also prevents changes to the mapping of this pfn until
+ * poison signaling is complete.
+ */
+ if (!dax_lock_mapping_entry(page))
+ goto out;
+
+ if (hwpoison_filter(page)) {
+ rc = 0;
+ goto unlock;
+ }
+
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
+ /*
+ * TODO: Handle HMM pages which may need coordination
+ * with device-side memory.
+ */
+ goto unlock;
+ default:
+ break;
+ }
+
+ /*
+ * Use this flag as an indication that the dax page has been
+ * remapped UC to prevent speculative consumption of poison.
+ */
+ SetPageHWPoison(page);
+
+ /*
+ * Unlike System-RAM there is no possibility to swap in a
+ * different physical page at a given virtual address, so all
+ * userspace consumption of ZONE_DEVICE memory necessitates
+ * SIGBUS (i.e. MF_MUST_KILL)
+ */
+ flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+ collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+
+ list_for_each_entry(tk, &tokill, nd)
+ if (tk->size_shift)
+ size = max(size, 1UL << tk->size_shift);
+ if (size) {
+ /*
+ * Unmap the largest mapping to avoid breaking up
+ * device-dax mappings which are constant size. The
+ * actual size of the mapping being torn down is
+ * communicated in siginfo, see kill_proc()
+ */
+ start = (page->index << PAGE_SHIFT) & ~(size - 1);
+ unmap_mapping_range(page->mapping, start, start + size, 0);
+ }
+ kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
+ rc = 0;
+unlock:
+ dax_unlock_mapping_entry(page);
+out:
+ /* drop pgmap ref acquired in caller */
+ put_dev_pagemap(pgmap);
+ action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
+ return rc;
+}
+
/**
* memory_failure - Handle memory failure of a page.
* @pfn: Page Number of the corrupted page
@@ -1133,6 +1249,7 @@ int memory_failure(unsigned long pfn, int flags)
struct page *p;
struct page *hpage;
struct page *orig_head;
+ struct dev_pagemap *pgmap;
int res;
unsigned long page_flags;
@@ -1145,6 +1262,10 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
+ pgmap = get_dev_pagemap(pfn, NULL);
+ if (pgmap)
+ return memory_failure_dev_pagemap(pfn, flags, pgmap);
+
p = pfn_to_page(pfn);
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
Given that dax / device-mapped pages are never subject to page
allocations, remove them from consideration by the soft-offline
mechanism.
Reported-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/memory-failure.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..988f977db3d2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1751,6 +1751,14 @@ int soft_offline_page(struct page *page, int flags)
int ret;
unsigned long pfn = page_to_pfn(page);
+ if (is_zone_device_page(page)) {
+ pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
+ pfn);
+ if (flags & MF_COUNT_INCREASED)
+ put_page(page);
+ return -EIO;
+ }
+
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
if (flags & MF_COUNT_INCREASED)
In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
error handling requires rmap and needs guarantees that the page->mapping
association is maintained / valid (inode not freed) for the duration of
the lookup.
In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.
Export the entry lock via dax_lock_mapping_entry() that uses
rcu_read_lock() to protect against the inode being freed, and
revalidates the page->mapping association under xa_lock().
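The expected caller pattern (as the memory_failure() patch later in
this series uses it) is, roughly:

    /* sketch: stabilize page->mapping before interrogating it */
    if (!dax_lock_mapping_entry(page))
            return -EBUSY;  /* not (or no longer) a dax mapping */
    /* page->mapping / page->index are stable here */
    ...
    dax_unlock_mapping_entry(page);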
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/dax.h | 13 ++++++
2 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4de11ed463ce..57ec272038da 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -226,8 +226,8 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
*
* Must be called with the i_pages lock held.
*/
-static void *get_unlocked_mapping_entry(struct address_space *mapping,
- pgoff_t index, void ***slotp)
+static void *__get_unlocked_mapping_entry(struct address_space *mapping,
+ pgoff_t index, void ***slotp, bool (*wait_fn)(void))
{
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
@@ -237,6 +237,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.wait.func = wake_exceptional_entry_func;
for (;;) {
+ bool revalidate;
+
entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
&slot);
if (!entry ||
@@ -251,14 +253,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
xa_unlock_irq(&mapping->i_pages);
- schedule();
+ revalidate = wait_fn();
finish_wait(wq, &ewait.wait);
xa_lock_irq(&mapping->i_pages);
+ if (revalidate)
+ return ERR_PTR(-EAGAIN);
}
}
-static void dax_unlock_mapping_entry(struct address_space *mapping,
- pgoff_t index)
+static bool entry_wait(void)
+{
+ schedule();
+ /*
+ * Never return an ERR_PTR() from
+ * __get_unlocked_mapping_entry(), just keep looping.
+ */
+ return false;
+}
+
+static void *get_unlocked_mapping_entry(struct address_space *mapping,
+ pgoff_t index, void ***slotp)
+{
+ return __get_unlocked_mapping_entry(mapping, index, slotp, entry_wait);
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
{
void *entry, **slot;
@@ -277,7 +296,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
static void put_locked_mapping_entry(struct address_space *mapping,
pgoff_t index)
{
- dax_unlock_mapping_entry(mapping, index);
+ unlock_mapping_entry(mapping, index);
}
/*
@@ -374,6 +393,84 @@ static struct page *dax_busy_page(void *entry)
return NULL;
}
+static bool entry_wait_revalidate(void)
+{
+ rcu_read_unlock();
+ schedule();
+ rcu_read_lock();
+
+ /*
+ * Tell __get_unlocked_mapping_entry() to take a break, we need
+ * to revalidate page->mapping after dropping locks
+ */
+ return true;
+}
+
+bool dax_lock_mapping_entry(struct page *page)
+{
+ pgoff_t index;
+ struct inode *inode;
+ bool did_lock = false;
+ void *entry = NULL, **slot;
+ struct address_space *mapping;
+
+ rcu_read_lock();
+ for (;;) {
+ mapping = READ_ONCE(page->mapping);
+
+ if (!dax_mapping(mapping))
+ break;
+
+ /*
+ * In the device-dax case there's no need to lock, a
+ * struct dev_pagemap pin is sufficient to keep the
+ * inode alive, and we assume we have dev_pagemap pin
+ * otherwise we would not have a valid pfn_to_page()
+ * translation.
+ */
+ inode = mapping->host;
+ if (S_ISCHR(inode->i_mode)) {
+ did_lock = true;
+ break;
+ }
+
+ xa_lock_irq(&mapping->i_pages);
+ if (mapping != page->mapping) {
+ xa_unlock_irq(&mapping->i_pages);
+ continue;
+ }
+ index = page->index;
+
+ entry = __get_unlocked_mapping_entry(mapping, index, &slot,
+ entry_wait_revalidate);
+ if (!entry) {
+ xa_unlock_irq(&mapping->i_pages);
+ break;
+ } else if (IS_ERR(entry)) {
+ WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+ continue;
+ }
+ lock_slot(mapping, slot);
+ did_lock = true;
+ xa_unlock_irq(&mapping->i_pages);
+ break;
+ }
+ rcu_read_unlock();
+
+ return did_lock;
+}
+
+void dax_unlock_mapping_entry(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+
+ if (S_ISCHR(inode->i_mode))
+ return;
+
+ unlock_mapping_entry(mapping, page->index);
+}
+
/*
* Find radix tree entry at given index. If it points to an exceptional entry,
* return it with the radix tree entry locked. If the radix tree doesn't
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3855e3800f48..cf8ac51cf0d7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -88,6 +88,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);
struct page *dax_layout_busy_page(struct address_space *mapping);
+bool dax_lock_mapping_entry(struct page *page);
+void dax_unlock_mapping_entry(struct page *page);
#else
static inline bool bdev_dax_supported(struct block_device *bdev,
int blocksize)
@@ -119,6 +121,17 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
{
return -EOPNOTSUPP;
}
+
+static inline bool dax_lock_mapping_entry(struct page *page)
+{
+ if (IS_DAX(page->mapping->host))
+ return true;
+ return false;
+}
+
+static inline void dax_unlock_mapping_entry(struct page *page)
+{
+}
#endif
int dax_read_lock(void);
MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
is only ever cleared by truncation, not final put.
Without this fix dax pages may forget their mapping association at the
end of every page pin event.
Move this atypical behavior that HMM wants into the HMM ->page_free()
callback.
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ross Zwisler <[email protected]>
Fixes: d2c997c0f145 ("fs, dax: use page->mapping...")
Signed-off-by: Dan Williams <[email protected]>
---
kernel/memremap.c | 1 -
mm/hmm.c | 2 ++
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5857267a4af5..62603634a1d2 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -339,7 +339,6 @@ void __put_devmap_managed_page(struct page *page)
__ClearPageActive(page);
__ClearPageWaiters(page);
- page->mapping = NULL;
mem_cgroup_uncharge(page);
page->pgmap->page_free(page, page->pgmap->data);
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..f9d1d89dec4d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -963,6 +963,8 @@ static void hmm_devmem_free(struct page *page, void *data)
{
struct hmm_devmem *devmem = data;
+ page->mapping = NULL;
+
devmem->ops->free(devmem, page);
}
In preparation for supporting memory_failure() for dax mappings, teach
collect_procs() to also determine the mapping size. Unlike typical
mappings, the dax mapping size is determined by walking page-table
entries rather than using the compound-page accounting for THP pages.
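The collected shift ultimately lands in the BUS_MCEERR_* siginfo as
si_addr_lsb, so userspace can tell how much of its mapping was lost. A
hypothetical handler consuming it (names illustrative; registered with
sigaction() + SA_SIGINFO):

    /* sketch of a userspace SIGBUS handler using si_addr_lsb */
    static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
    {
            void *addr = info->si_addr;
            size_t len = 1UL << info->si_addr_lsb; /* poisoned extent */

            /* remediate / log the range [addr, addr + len) */
    }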
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/memory-failure.c | 81 +++++++++++++++++++++++++--------------------------
1 file changed, 40 insertions(+), 41 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 988f977db3d2..8a81680d00dd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -174,22 +174,51 @@ int hwpoison_filter(struct page *p)
EXPORT_SYMBOL_GPL(hwpoison_filter);
/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+ struct list_head nd;
+ struct task_struct *tsk;
+ unsigned long addr;
+ short size_shift;
+ char addr_valid;
+};
+
+/*
* Send all the processes who have the page mapped a signal.
* ``action optional'' if they are not immediately affected by the error
* ``action required'' if error happened in current execution context
*/
-static int kill_proc(struct task_struct *t, unsigned long addr,
- unsigned long pfn, struct page *page, int flags)
+static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
{
- short addr_lsb;
+ struct task_struct *t = tk->tsk;
+ short addr_lsb = tk->size_shift;
int ret;
pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
pfn, t->comm, t->pid);
- addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
- ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
+ ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
addr_lsb, current);
} else {
/*
@@ -198,7 +227,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr,
* This could cause a loop when the user sets SIGBUS
* to SIG_IGN, but hopefully no one will do that?
*/
- ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
+ ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
addr_lsb, t); /* synchronous? */
}
if (ret < 0)
@@ -235,35 +264,6 @@ void shake_page(struct page *p, int access)
EXPORT_SYMBOL_GPL(shake_page);
/*
- * Kill all processes that have a poisoned page mapped and then isolate
- * the page.
- *
- * General strategy:
- * Find all processes having the page mapped and kill them.
- * But we keep a page reference around so that the page is not
- * actually freed yet.
- * Then stash the page away
- *
- * There's no convenient way to get back to mapped processes
- * from the VMAs. So do a brute-force search over all
- * running processes.
- *
- * Remember that machine checks are not common (or rather
- * if they are common you have other problems), so this shouldn't
- * be a performance issue.
- *
- * Also there are some races possible while we get from the
- * error detection to actually handle it.
- */
-
-struct to_kill {
- struct list_head nd;
- struct task_struct *tsk;
- unsigned long addr;
- char addr_valid;
-};
-
-/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
*/
@@ -292,6 +292,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
}
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;
+ tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
/*
* In theory we don't have to kill when the page was
@@ -317,9 +318,8 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
* Also when FAIL is set do a force kill because something went
* wrong earlier.
*/
-static void kill_procs(struct list_head *to_kill, int forcekill,
- bool fail, struct page *page, unsigned long pfn,
- int flags)
+static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
+ unsigned long pfn, int flags)
{
struct to_kill *tk, *next;
@@ -342,8 +342,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill,
* check for that, but we need to tell the
* process anyways.
*/
- else if (kill_proc(tk->tsk, tk->addr,
- pfn, page, flags) < 0)
+ else if (kill_proc(tk, pfn, flags) < 0)
pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
pfn, tk->tsk->comm, tk->tsk->pid);
}
@@ -1012,7 +1011,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
* any accesses to the poisoned memory.
*/
forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
- kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+ kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);
return unmap_success;
}
The madvise_inject_error() routine uses get_user_pages() to look up the
pfn and other information for the injected error, but it does not
release that pin. The assumption is that failed pages should be taken
out of circulation.
However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks,
or device-dax capacity. They also typically represent persistent memory
which has an error clearing capability.
In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference and do not specify MF_COUNT_INCREASED to
memory_failure().
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/madvise.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..972a9eaa898b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
for (; start < end; start += PAGE_SIZE << order) {
+ unsigned long pfn;
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
if (ret != 1)
return ret;
+ pfn = page_to_pfn(page);
/*
* When soft offlining hugepages, after migrating the page
@@ -651,17 +653,25 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
- page_to_pfn(page), start);
+ pfn, start);
ret = soft_offline_page(page, MF_COUNT_INCREASED);
if (ret)
return ret;
continue;
}
+
pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
- page_to_pfn(page), start);
+ pfn, start);
- ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+ /*
+ * Drop the page reference taken by get_user_pages_fast(). In
+ * the absence of MF_COUNT_INCREASED the memory_failure()
+ * routine is responsible for pinning the page to prevent it
+ * from being released back to the page allocator.
+ */
+ put_page(page);
+ ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.
The 'HWPoison' flag is only cleared when overwriting an entire page.
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 26 ++++++++++++++++++++++++++
drivers/nvdimm/pmem.h | 13 +++++++++++++
2 files changed, 39 insertions(+)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 68940356cad3..bdaaa7ed49fd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
#include <linux/hdreg.h>
#include <linux/init.h>
#include <linux/platform_device.h>
+#include <linux/set_memory.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/badblocks.h>
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
}
+static void hwpoison_clear(struct pmem_device *pmem,
+ phys_addr_t phys, unsigned int len)
+{
+ unsigned long pfn_start, pfn_end, pfn;
+
+ /* only pmem in the linear map supports HWPoison */
+ if (is_vmalloc_addr(pmem->virt_addr))
+ return;
+
+ pfn_start = PHYS_PFN(phys);
+ pfn_end = pfn_start + PHYS_PFN(len);
+ for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+ struct page *page = pfn_to_page(pfn);
+
+ /*
+ * Note, no need to hold a get_dev_pagemap() reference
+ * here since we're in the driver I/O path and
+ * outstanding I/O requests pin the dev_pagemap.
+ */
+ if (test_and_clear_pmem_poison(page))
+ clear_mce_nospec(pfn);
+ }
+}
+
static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
phys_addr_t offset, unsigned int len)
{
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
if (cleared < len)
rc = BLK_STS_IOERR;
if (cleared > 0 && cleared / 512) {
+ hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
cleared /= 512;
dev_dbg(dev, "%#llx clear %ld sector%s\n",
(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __NVDIMM_PMEM_H__
#define __NVDIMM_PMEM_H__
+#include <linux/page-flags.h>
#include <linux/badblocks.h>
#include <linux/types.h>
#include <linux/pfn_t.h>
@@ -27,4 +28,16 @@ struct pmem_device {
long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+ return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+ return false;
+}
+#endif
#endif /* __NVDIMM_PMEM_H__ */
On Fri, Jul 13, 2018 at 09:49:56PM -0700, Dan Williams wrote:
> Given that dax / device-mapped pages are never subject to page
> allocations remove them from consideration by the soft-offline
> mechanism.
>
> Reported-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> mm/memory-failure.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9d142b9b86dc..988f977db3d2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1751,6 +1751,14 @@ int soft_offline_page(struct page *page, int flags)
> int ret;
> unsigned long pfn = page_to_pfn(page);
>
> + if (is_zone_device_page(page)) {
> + pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
> + pfn);
> + if (flags & MF_COUNT_INCREASED)
> + put_page(page);
put_hwpoison_page(), which is just an alias of put_page(), is better
for consistency.
With this adjustment, feel free to add my ack.
Acked-by: Naoya Horiguchi <[email protected]>
Thanks,
Naoya Horiguchi
> + return -EIO;
> + }
> +
> if (PageHWPoison(page)) {
> pr_info("soft offline: %#lx page already poisoned\n", pfn);
> if (flags & MF_COUNT_INCREASED)
On Fri, Jul 13, 2018 at 09:50:06PM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to lookup the
> pfn and other information for injected error, but it does not release
> that pin. The assumption is that failed pages should be taken out of
> circulation.
>
> However, for dax mappings it is not possible to take pages out of
> circulation since they are 1:1 physically mapped as filesystem blocks,
> or device-dax capacity. They also typically represent persistent memory
> which has an error clearing capability.
>
> In preparation for adding a special handler for dax mappings, shift the
> responsibility of taking the page reference to memory_failure(). I.e.
> drop the page reference and do not specify MF_COUNT_INCREASED to
> memory_failure().
>
> Cc: Michal Hocko <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Ingo,
Is it possible to ack the x86 bits in this patch series? I'm hoping to
get this pulled through the libnvdimm tree for 4.19. Thanks!
On Mon, Jul 23, 2018 at 09:12:06AM -0700, Dave Jiang wrote:
> Jerome,
> Is it possible to get an ack for this? Thanks!
>
> On 07/13/2018 09:50 PM, Dan Williams wrote:
> > MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
> > is only ever cleared by truncation, not final put.
> >
> > Without this fix dax pages may forget their mapping association at the
> > end of every page pin event.
> >
> > Move this atypical behavior that HMM wants into the HMM ->page_free()
> > callback.
> >
> > Cc: <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Ross Zwisler <[email protected]>
> > Fixes: d2c997c0f145 ("fs, dax: use page->mapping...")
> > Signed-off-by: Dan Williams <[email protected]>
Acked-by: Jérôme Glisse <[email protected]>
> > ---
> > kernel/memremap.c | 1 -
> > mm/hmm.c | 2 ++
> > 2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 5857267a4af5..62603634a1d2 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -339,7 +339,6 @@ void __put_devmap_managed_page(struct page *page)
> > __ClearPageActive(page);
> > __ClearPageWaiters(page);
> >
> > - page->mapping = NULL;
> > mem_cgroup_uncharge(page);
> >
> > page->pgmap->page_free(page, page->pgmap->data);
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index de7b6bf77201..f9d1d89dec4d 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -963,6 +963,8 @@ static void hmm_devmem_free(struct page *page, void *data)
> > {
> > struct hmm_devmem *devmem = data;
> >
> > + page->mapping = NULL;
> > +
> > devmem->ops->free(devmem, page);
> > }
> >
> >
* Dan Williams <[email protected]> wrote:
> In preparation for using set_memory_uc() instead of set_memory_np() for
> isolating poison from speculation, teach the memtype code to sanitize
> physical addresses vs __PHYSICAL_MASK.
>
> The motivation for using set_memory_uc() for this case is to allow
> ongoing access to persistent memory pages via the pmem-driver +
> memcpy_mcsafe() until the poison is repaired.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> arch/x86/mm/pat.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> index 1555bd7d3449..6788ffa990f8 100644
> --- a/arch/x86/mm/pat.c
> +++ b/arch/x86/mm/pat.c
> @@ -512,6 +512,17 @@ static int free_ram_pages_type(u64 start, u64 end)
> return 0;
> }
>
> +static u64 sanitize_phys(u64 address)
> +{
> + /*
> + * When changing the memtype for pages containing poison allow
> + * for a "decoy" virtual address (bit 63 clear) passed to
> + * set_memory_X(). __pa() on a "decoy" address results in a
> + * physical address with it 63 set.
> + */
> + return address & __PHYSICAL_MASK;
s/it/bit
Thanks,
Ingo
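To illustrate the decoy scheme the comment describes: only the high bit
distinguishes a decoy-derived physical address from the real one, so the
sanitization is a single mask (hypothetical values; assumes x86_64 where
__PHYSICAL_MASK covers bits 0-51):

	u64 decoy_pa = (1ULL << 63) | 0x20cb00000ULL; /* __pa() of a decoy vaddr */
	u64 real_pa = decoy_pa & __PHYSICAL_MASK;     /* 0x20cb00000, bit 63 clear */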
* Dave Jiang <[email protected]> wrote:
> Ingo,
> Is it possible to ack the x86 bits in this patch series? I'm hoping to
> get this pulled through the libnvdimm tree for 4.19. Thanks!
With the minor typo fixed in the first patch, both patches are looking good to me:
Acked-by: Ingo Molnar <[email protected]>
Assuming that it gets properly tested, etc.
Thanks,
Ingo
On 07/24/2018 12:36 AM, Ingo Molnar wrote:
>
> * Dan Williams <[email protected]> wrote:
>
>> [..]
>
> s/it/bit
Thanks Ingo! I'll update when I pull in the patch.
>
> Thanks,
>
> Ingo
On Fri 13-07-18 21:50:16, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
>
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
>
> Export the entry lock via dax_lock_mapping_entry() that uses
> rcu_read_lock() to protect against the inode being freed, and
> revalidates the page->mapping association under xa_lock().
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
Just got back from vacation... This patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/dax.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++---
> include/linux/dax.h | 13 ++++++
> 2 files changed, 116 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 4de11ed463ce..57ec272038da 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -226,8 +226,8 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
> *
> * Must be called with the i_pages lock held.
> */
> -static void *get_unlocked_mapping_entry(struct address_space *mapping,
> - pgoff_t index, void ***slotp)
> +static void *__get_unlocked_mapping_entry(struct address_space *mapping,
> + pgoff_t index, void ***slotp, bool (*wait_fn)(void))
> {
> void *entry, **slot;
> struct wait_exceptional_entry_queue ewait;
> @@ -237,6 +237,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
> ewait.wait.func = wake_exceptional_entry_func;
>
> for (;;) {
> + bool revalidate;
> +
> entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> &slot);
> if (!entry ||
> @@ -251,14 +253,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
> prepare_to_wait_exclusive(wq, &ewait.wait,
> TASK_UNINTERRUPTIBLE);
> xa_unlock_irq(&mapping->i_pages);
> - schedule();
> + revalidate = wait_fn();
> finish_wait(wq, &ewait.wait);
> xa_lock_irq(&mapping->i_pages);
> + if (revalidate)
> + return ERR_PTR(-EAGAIN);
> }
> }
>
> -static void dax_unlock_mapping_entry(struct address_space *mapping,
> - pgoff_t index)
> +static bool entry_wait(void)
> +{
> + schedule();
> + /*
> + * Never return an ERR_PTR() from
> + * __get_unlocked_mapping_entry(), just keep looping.
> + */
> + return false;
> +}
> +
> +static void *get_unlocked_mapping_entry(struct address_space *mapping,
> + pgoff_t index, void ***slotp)
> +{
> + return __get_unlocked_mapping_entry(mapping, index, slotp, entry_wait);
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> {
> void *entry, **slot;
>
> @@ -277,7 +296,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
> static void put_locked_mapping_entry(struct address_space *mapping,
> pgoff_t index)
> {
> - dax_unlock_mapping_entry(mapping, index);
> + unlock_mapping_entry(mapping, index);
> }
>
> /*
> @@ -374,6 +393,84 @@ static struct page *dax_busy_page(void *entry)
> return NULL;
> }
>
> +static bool entry_wait_revalidate(void)
> +{
> + rcu_read_unlock();
> + schedule();
> + rcu_read_lock();
> +
> + /*
> + * Tell __get_unlocked_mapping_entry() to take a break, we need
> + * to revalidate page->mapping after dropping locks
> + */
> + return true;
> +}
> +
> +bool dax_lock_mapping_entry(struct page *page)
> +{
> + pgoff_t index;
> + struct inode *inode;
> + bool did_lock = false;
> + void *entry = NULL, **slot;
> + struct address_space *mapping;
> +
> + rcu_read_lock();
> + for (;;) {
> + mapping = READ_ONCE(page->mapping);
> +
> + if (!dax_mapping(mapping))
> + break;
> +
> + /*
> + * In the device-dax case there's no need to lock, a
> + * struct dev_pagemap pin is sufficient to keep the
> + * inode alive, and we assume we have dev_pagemap pin
> + * otherwise we would not have a valid pfn_to_page()
> + * translation.
> + */
> + inode = mapping->host;
> + if (S_ISCHR(inode->i_mode)) {
> + did_lock = true;
> + break;
> + }
> +
> + xa_lock_irq(&mapping->i_pages);
> + if (mapping != page->mapping) {
> + xa_unlock_irq(&mapping->i_pages);
> + continue;
> + }
> + index = page->index;
> +
> + entry = __get_unlocked_mapping_entry(mapping, index, &slot,
> + entry_wait_revalidate);
> + if (!entry) {
> + xa_unlock_irq(&mapping->i_pages);
> + break;
> + } else if (IS_ERR(entry)) {
> + WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
> + continue;
> + }
> + lock_slot(mapping, slot);
> + did_lock = true;
> + xa_unlock_irq(&mapping->i_pages);
> + break;
> + }
> + rcu_read_unlock();
> +
> + return did_lock;
> +}
> +
> +void dax_unlock_mapping_entry(struct page *page)
> +{
> + struct address_space *mapping = page->mapping;
> + struct inode *inode = mapping->host;
> +
> + if (S_ISCHR(inode->i_mode))
> + return;
> +
> + unlock_mapping_entry(mapping, page->index);
> +}
> +
> /*
> * Find radix tree entry at given index. If it points to an exceptional entry,
> * return it with the radix tree entry locked. If the radix tree doesn't
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3855e3800f48..cf8ac51cf0d7 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -88,6 +88,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> struct block_device *bdev, struct writeback_control *wbc);
>
> struct page *dax_layout_busy_page(struct address_space *mapping);
> +bool dax_lock_mapping_entry(struct page *page);
> +void dax_unlock_mapping_entry(struct page *page);
> #else
> static inline bool bdev_dax_supported(struct block_device *bdev,
> int blocksize)
> @@ -119,6 +121,17 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
> {
> return -EOPNOTSUPP;
> }
> +
> +static inline bool dax_lock_mapping_entry(struct page *page)
> +{
> + if (IS_DAX(page->mapping->host))
> + return true;
> + return false;
> +}
> +
> +static inline void dax_unlock_mapping_entry(struct page *page)
> +{
> +}
> #endif
>
> int dax_read_lock(void);
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
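As a usage sketch (a hypothetical caller; the real consumer is the
memory_failure() enabling later in the series), the pair behaves like
lock_page()/unlock_page() for dax:

	if (!dax_lock_mapping_entry(page))
		return -EBUSY;	/* no (longer a) dax mapping, nothing to do */
	/*
	 * page->mapping and page->index are now stable, safe for rmap
	 * walks and unmap_mapping_range()
	 */
	handle_poison(page);	/* placeholder for the caller's work */
	dax_unlock_mapping_entry(page);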
On Fri 13-07-18 21:50:21, Dan Williams wrote:
> mce: Uncorrected hardware memory error in user-access at af34214200
> {1}[Hardware Error]: It has been corrected by h/w and requires no further action
> mce: [Hardware Error]: Machine check events logged
> {1}[Hardware Error]: event severity: corrected
> Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
> [..]
> Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
> mce: Memory error not recovered
>
> In contrast to typical memory, dev_pagemap pages may be dax mapped. With
> dax there is no possibility to map in another page dynamically since dax
> establishes 1:1 physical address to file offset associations. Also
> dev_pagemap pages associated with NVDIMM / persistent memory devices can
> internally remap/repair addresses with poison. While memory_failure()
> assumes that it can discard typical poisoned pages and keep them
> unmapped indefinitely, dev_pagemap pages may be returned to service
> after the error is cleared.
>
> Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
> dev_pagemap pages that have poison consumed by userspace. Mark the
> memory as UC instead of unmapping it completely to allow ongoing access
> via the device driver (nd_pmem). Later, nd_pmem will grow support for
> marking the page back to WB when the error is cleared.
>
> Cc: Jan Kara <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jérôme Glisse <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
I'm not very familiar with memory failure code but at least from DAX POV
and from my rudimentary understanding of memory-failure the patch looks
sane to me.
Honza
> ---
> include/linux/mm.h | 1
> mm/memory-failure.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 124 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9ffe380..374e5e9284f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2725,6 +2725,7 @@ enum mf_action_page_type {
> MF_MSG_TRUNCATED_LRU,
> MF_MSG_BUDDY,
> MF_MSG_BUDDY_2ND,
> + MF_MSG_DAX,
> MF_MSG_UNKNOWN,
> };
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8a81680d00dd..effaa7c7a1a4 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -55,6 +55,7 @@
> #include <linux/hugetlb.h>
> #include <linux/memory_hotplug.h>
> #include <linux/mm_inline.h>
> +#include <linux/memremap.h>
> #include <linux/kfifo.h>
> #include <linux/ratelimit.h>
> #include "internal.h"
> @@ -263,6 +264,40 @@ void shake_page(struct page *p, int access)
> }
> EXPORT_SYMBOL_GPL(shake_page);
>
> +static unsigned long dev_pagemap_mapping_size(struct page *page,
> + struct vm_area_struct *vma)
> +{
> + unsigned long address = vma_address(page, vma);
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(vma->vm_mm, address);
> + if (!pgd_present(*pgd))
> + return 0;
> + p4d = p4d_offset(pgd, address);
> + if (!p4d_present(*p4d))
> + return 0;
> + pud = pud_offset(p4d, address);
> + if (!pud_present(*pud))
> + return 0;
> + if (pud_devmap(*pud))
> + return PUD_SIZE;
> + pmd = pmd_offset(pud, address);
> + if (!pmd_present(*pmd))
> + return 0;
> + if (pmd_devmap(*pmd))
> + return PMD_SIZE;
> + pte = pte_offset_map(pmd, address);
> + if (!pte_present(*pte))
> + return 0;
> + if (pte_devmap(*pte))
> + return PAGE_SIZE;
> + return 0;
> +}
> +
> /*
> * Failure handling: if we can't find or can't kill a process there's
> * not much we can do. We just print a message and ignore otherwise.
> @@ -292,7 +327,10 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
> }
> tk->addr = page_address_in_vma(p, vma);
> tk->addr_valid = 1;
> - tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
> + if (is_zone_device_page(p))
> + tk->size_shift = ilog2(dev_pagemap_mapping_size(p, vma));
> + else
> + tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
>
> /*
> * In theory we don't have to kill when the page was
> @@ -300,7 +338,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
> * likely very rare kill anyways just out of paranoia, but use
> * a SIGKILL because the error is not contained anymore.
> */
> - if (tk->addr == -EFAULT) {
> + if (tk->addr == -EFAULT || tk->size_shift == 0) {
> pr_info("Memory failure: Unable to find user space address %lx in %s\n",
> page_to_pfn(p), tsk->comm);
> tk->addr_valid = 0;
> @@ -514,6 +552,7 @@ static const char * const action_page_types[] = {
> [MF_MSG_TRUNCATED_LRU] = "already truncated LRU page",
> [MF_MSG_BUDDY] = "free buddy page",
> [MF_MSG_BUDDY_2ND] = "free buddy page (2nd try)",
> + [MF_MSG_DAX] = "dax page",
> [MF_MSG_UNKNOWN] = "unknown page",
> };
>
> @@ -1111,6 +1150,83 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
> return res;
> }
>
> +static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + struct page *page = pfn_to_page(pfn);
> + const bool unmap_success = true;
> + unsigned long size = 0;
> + struct to_kill *tk;
> + LIST_HEAD(tokill);
> + int rc = -EBUSY;
> + loff_t start;
> +
> + /*
> + * Prevent the inode from being freed while we are interrogating
> + * the address_space, typically this would be handled by
> + * lock_page(), but dax pages do not use the page lock. This
> + * also prevents changes to the mapping of this pfn until
> + * poison signaling is complete.
> + */
> + if (!dax_lock_mapping_entry(page))
> + goto out;
> +
> + if (hwpoison_filter(page)) {
> + rc = 0;
> + goto unlock;
> + }
> +
> + switch (pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_PUBLIC:
> + /*
> + * TODO: Handle HMM pages which may need coordination
> + * with device-side memory.
> + */
> + goto unlock;
> + default:
> + break;
> + }
> +
> + /*
> + * Use this flag as an indication that the dax page has been
> + * remapped UC to prevent speculative consumption of poison.
> + */
> + SetPageHWPoison(page);
> +
> + /*
> + * Unlike System-RAM there is no possibility to swap in a
> + * different physical page at a given virtual address, so all
> + * userspace consumption of ZONE_DEVICE memory necessitates
> + * SIGBUS (i.e. MF_MUST_KILL)
> + */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> +
> + list_for_each_entry(tk, &tokill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> + if (size) {
> + /*
> + * Unmap the largest mapping to avoid breaking up
> + * device-dax mappings which are constant size. The
> + * actual size of the mapping being torn down is
> + * communicated in siginfo, see kill_proc()
> + */
> + start = (page->index << PAGE_SHIFT) & ~(size - 1);
> + unmap_mapping_range(page->mapping, start, start + size, 0);
> + }
> + kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
> + rc = 0;
> +unlock:
> + dax_unlock_mapping_entry(page);
> +out:
> + /* drop pgmap ref acquired in caller */
> + put_dev_pagemap(pgmap);
> + action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
> + return rc;
> +}
> +
> /**
> * memory_failure - Handle memory failure of a page.
> * @pfn: Page Number of the corrupted page
> @@ -1133,6 +1249,7 @@ int memory_failure(unsigned long pfn, int flags)
> struct page *p;
> struct page *hpage;
> struct page *orig_head;
> + struct dev_pagemap *pgmap;
> int res;
> unsigned long page_flags;
>
> @@ -1145,6 +1262,10 @@ int memory_failure(unsigned long pfn, int flags)
> return -ENXIO;
> }
>
> + pgmap = get_dev_pagemap(pfn, NULL);
> + if (pgmap)
> + return memory_failure_dev_pagemap(pfn, flags, pgmap);
> +
> p = pfn_to_page(pfn);
> if (PageHuge(p))
> return memory_failure_hugetlb(pfn, flags);
>
--
Jan Kara <[email protected]>
SUSE Labs, CR