I've recently been looking at the possibility of mapping large PCIe BARs
with huge mappings (just like THP or hugetlb), and I noticed these issues
while reading the code.
I don't think I'm familiar enough with the whole PAT system, but I figured
I should post them to collect some comments, hence marking this small
series as RFC.
Please feel free to have a look at each of them; I've put more words in the
commit messages than here, as the two issues are not related. Any comments
are welcome.
Thanks,
Peter Xu (2):
mm/x86/pat: Only untrack the pfn range if unmap region
mm/x86/pat: Do proper PAT bit shift for large mappings
mm/huge_memory.c | 4 ++--
mm/memory.c | 5 ++---
2 files changed, 4 insertions(+), 5 deletions(-)
--
2.45.0
X86 uses pfn tracking to do pfnmaps. It looks like the tracking is
normally started at mmap() of device drivers, then dropped at munmap().
However in the current code the untrack is done in unmap_single_vma().
This might be problematic.
For example, unmap_single_vma() can nowadays be used even for zapping a
single page rather than a whole vma. It's very confusing for this
function to do whole-vma untracking even when a caller only wants to zap
one page, as sketched below.
That may not yet be a problem, maybe because of multiple reasons:
(1) Things like MADV_DONTNEED won't ever work for pfnmaps and it'll fail
already before reaching here.
(2) If some pfn tracking is lost by accident, the default fallback is to
use UC-, which might be safe enough even if it may not be what's wanted.
One can see track_pfn_insert() -> lookup_memtype(), which can fall back
to the default _PAGE_CACHE_MODE_UC_MINUS for an MMIO range.
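To illustrate where the tracking begins and where it's currently dropped
(the driver below is hypothetical; remap_pfn_range() -> track_pfn_remap()
is the mainline path that starts the tracking, untrack_pfn() the one that
drops it):

/* mmap() time: this registers the memtype for the whole pfn range */
static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}

/*
 * Meanwhile, any partial zap currently goes:
 *
 *   zap_page_range_single()
 *     -> unmap_single_vma()
 *          -> untrack_pfn(vma, 0, 0, ...)   // whole-vma untrack
 */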
However there's ongoing work [1] on the VFIO driver to allow tearing down
MMIO pgtables before an munmap(), in which case we may not want to untrack
the pfns if we're only tearing down the pgtables. IOW, we may want to keep
the pfn tracking information, as those pfn mappings can be restored later
with the same vma object. In VFIO's use case, the MMIO region can be
remapped later once it gets re-enabled (e.g. PCI_COMMAND_MEMORY set). In
this case we should keep the pfn tracking for the whole lifecycle of the
vma.
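As a sketch of that flow (illustrative only; a VFIO-like teardown ends up
in zap_page_range_single(), e.g. via unmap_mapping_range()):

	/* MMIO disabled (e.g. PCI_COMMAND_MEMORY cleared): tear down
	 * the pgtables, but keep the vma -- and, with this patch, the
	 * pfn tracking -- alive. */
	zap_page_range_single(vma, vma->vm_start,
			      vma->vm_end - vma->vm_start, NULL);

	/* MMIO re-enabled later: the same vma is re-populated (e.g.
	 * by refaulting), so the tracked memtype is still wanted. */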
IIUC, this was overlooked when zap_page_range_single() was introduced; in
the past unmap_single_vma() was only used in the munmap() path, which
always unmaps the region completely. E.g., commit f5cc4eef9987 ("VM: make
zap_page_range() callers that act on a single VMA use separate helper") is
the initial commit that introduced unmap_single_vma(), in which the chunk
of untrack_pfn() was moved over from unmap_vmas().
Restore that behavior by untracking the pfnmap only when unmapping the
whole region.
[1] https://lore.kernel.org/r/[email protected]
Cc: Alex Williamson <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index b5453b86ec4b..9db5e8730d12 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1821,9 +1821,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
if (vma->vm_file)
uprobe_munmap(vma, start, end);
- if (unlikely(vma->vm_flags & VM_PFNMAP))
- untrack_pfn(vma, 0, 0, mm_wr_locked);
-
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
/*
@@ -1888,6 +1885,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
unsigned long start = start_addr;
unsigned long end = end_addr;
hugetlb_zap_begin(vma, &start, &end);
+ if (unlikely(vma->vm_flags & VM_PFNMAP))
+ untrack_pfn(vma, 0, 0, mm_wr_locked);
unmap_single_vma(tlb, vma, start, end, &details,
mm_wr_locked);
hugetlb_zap_end(vma, &details);
--
2.45.0
For large mappings, the pgtable PAT is set on bit 12 (_PAGE_PAT_LARGE)
rather than bit 7 (_PAGE_PAT), since bit 7 is used as the PSE bit on large
mappings. Do the proper shifting when injecting large pfn pgtable mappings
to get the cache mode right.
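For reference, the relevant bit definitions (abridged from
arch/x86/include/asm/pgtable_types.h):

	#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
	#define _PAGE_BIT_PAT		7	/* on 4KB pages */
	#define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */

pgprot_4k_2_large() relocates the PAT bit between those two positions,
which is what this patch uses below.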
Cc: Alex Williamson <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 317de2afd371..c4a2356b1a54 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1135,7 +1135,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
- entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
+ entry = pmd_mkhuge(pfn_t_pmd(pfn, pgprot_4k_2_large(prot)));
if (pfn_t_devmap(pfn))
entry = pmd_mkdevmap(entry);
if (write) {
@@ -1233,7 +1233,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
- entry = pud_mkhuge(pfn_t_pud(pfn, prot));
+ entry = pud_mkhuge(pfn_t_pud(pfn, pgprot_4k_2_large(prot)));
if (pfn_t_devmap(pfn))
entry = pud_mkdevmap(entry);
if (write) {
--
2.45.0
On 5/23/24 15:37, Peter Xu wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 317de2afd371..c4a2356b1a54 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1135,7 +1135,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> goto out_unlock;
> }
>
> - entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> + entry = pmd_mkhuge(pfn_t_pmd(pfn, pgprot_4k_2_large(prot)));
> if (pfn_t_devmap(pfn))
> entry = pmd_mkdevmap(entry);
> if (write) {
Does this even compile on non-x86 architectures?
On Thu, May 23, 2024 at 03:48:22PM -0700, Dave Hansen wrote:
> On 5/23/24 15:37, Peter Xu wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 317de2afd371..c4a2356b1a54 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1135,7 +1135,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> > goto out_unlock;
> > }
> >
> > - entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> > + entry = pmd_mkhuge(pfn_t_pmd(pfn, pgprot_4k_2_large(prot)));
> > if (pfn_t_devmap(pfn))
> > entry = pmd_mkdevmap(entry);
> > if (write) {
>
> Does this even compile on non-x86 architectures?
Probably not.. I think I can define a pgprot_to_large() globally, pointing
it to pgprot_4k_2_large() on x86 and making the fallback a no-op. And if
there's a new version I'll make sure to run it through my cross compilers.
Any comments on the idea itself? Do we have a problem, or maybe I
overlooked something?
Thanks,
--
Peter Xu
On Thu, May 23, 2024 at 07:07:06PM -0400, Peter Xu wrote:
> On Thu, May 23, 2024 at 03:48:22PM -0700, Dave Hansen wrote:
> > On 5/23/24 15:37, Peter Xu wrote:
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 317de2afd371..c4a2356b1a54 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1135,7 +1135,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > goto out_unlock;
> > > }
> > >
> > > - entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> > > + entry = pmd_mkhuge(pfn_t_pmd(pfn, pgprot_4k_2_large(prot)));
> > > if (pfn_t_devmap(pfn))
> > > entry = pmd_mkdevmap(entry);
> > > if (write) {
> >
> > Does this even compile on non-x86 architectures?
>
> Probably not.. I think I can define a pgprot_to_large() globally, pointing
> it to pgprot_4k_2_large() on x86 and making the fallback a no-op. And if
> there's a new version I'll make sure to run it through my cross compilers.
>
> Any comments on the idea itself? Do we have a problem, or maybe I
> overlooked something?
I've also attached a new version of patch 2 that should pass the cross
builds. Reviewers, please feel free to look at this one instead. From the
x86 perspective they should be the same thing.
Thanks,
===8<===
From 1cce12c872cb01aaa8686d8f5c7cd6b266ca4e38 Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Thu, 23 May 2024 18:19:35 -0400
Subject: [PATCH rfcv1.1] mm/x86/pat: Do proper PAT bit shift for large mappings
For large mappings, the pgtable PAT is set on bit 12 (_PAGE_PAT_LARGE)
rather than bit 7 (_PAGE_PAT), since bit 7 is used as the PSE bit on large
mappings. Do the proper shifting when injecting large pfn pgtable mappings
to get the cache mode right.
Cc: Alex Williamson <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 1 +
include/linux/pgtable.h | 4 ++++
mm/huge_memory.c | 4 ++--
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b78644962626..f9edb2bb1512 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -512,6 +512,7 @@ static inline pgprot_t pgprot_large_2_4k(pgprot_t pgprot)
return __pgprot(protval_large_2_4k(pgprot_val(pgprot)));
}
+#define pgprot_to_large(pgprot) pgprot_4k_2_large(pgprot)
typedef struct page *pgtable_t;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f037bae..54487d2b3e40 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1956,4 +1956,8 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags) \
} \
EXPORT_SYMBOL(vm_get_page_prot);
+#ifndef pgprot_to_large
+#define pgprot_to_large(pgprot) pgprot
+#endif
+
#endif /* _LINUX_PGTABLE_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 317de2afd371..4c134a60fb64 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1135,7 +1135,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
- entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
+ entry = pmd_mkhuge(pfn_t_pmd(pfn, pgprot_to_large(prot)));
if (pfn_t_devmap(pfn))
entry = pmd_mkdevmap(entry);
if (write) {
@@ -1233,7 +1233,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
- entry = pud_mkhuge(pfn_t_pud(pfn, prot));
+ entry = pud_mkhuge(pfn_t_pud(pfn, pgprot_to_large(prot)));
if (pfn_t_devmap(pfn))
entry = pud_mkdevmap(entry);
if (write) {
--
2.45.0
--
Peter Xu
On 5/23/24 16:07, Peter Xu wrote:
> Probably not.. I think I can define a pgprot_to_large() globally, pointing
> it to pgprot_4k_2_large() on x86 and making the fallback a no-op. And if
> there's a new version I'll make sure to run it through my cross compilers.
I guess that would be functional, but it would be a bit mean to
everybody else.
> Any comments on the idea itself? Do we have a problem, or maybe I
> overlooked something?
I think it's probably unnecessary to inflict this particular x86-ism on
generic code. The arch-generic 'prot' should have PAT at its 4k
(_PAGE_BIT_PAT) position and then p*d_mkhuge() can shift it into the
_PAGE_BIT_PAT_LARGE spot.
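Something like this, maybe (completely untested, just to sketch the idea;
mainline pmd_mkhuge() only sets _PAGE_PSE today):

static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
	pmdval_t v = pmd_val(pmd);

	/*
	 * Bit 7 is _PAGE_PAT on 4k entries but _PAGE_PSE on large
	 * ones, so relocate PAT to bit 12 before setting PSE.
	 */
	if (v & _PAGE_PAT) {
		v &= ~(pmdval_t)_PAGE_PAT;
		v |= _PAGE_PAT_LARGE;
	}
	return __pmd(v | _PAGE_PSE);
}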
On Thu, May 23, 2024 at 08:30:19PM -0700, Dave Hansen wrote:
> On 5/23/24 16:07, Peter Xu wrote:
> > Probably not.. I think I can define a pgprot_to_large() globally, pointing
> > it to pgprot_4k_2_large() on x86 and making the fallback a no-op. And if
> > there's a new version I'll make sure to run it through my cross compilers.
>
> I guess that would be functional, but it would be a bit mean to
> everybody else.
>
> > Any comments on the idea itself? Do we have a problem, or maybe I
> > overlooked something?
>
> I think it's probably unnecessary to inflict this particular x86-ism on
> generic code. The arch-generic 'prot' should have PAT at its 4k
> (_PAGE_BIT_PAT) position and then p*d_mkhuge() can shift it into the
> _PAGE_BIT_PAT_LARGE spot.
Right, that's another option indeed.
It's just that I think it might in many cases be better to have the API
defined separately and properly, with the pairs matching each other.
For example, it could be clearer if pxx_mkhuge() does exactly what
pxx_leaf() would check against.
PS: I had hoped it was already called pxx_huge() so the names would pair
with each other; afaict we only call it pxx_leaf() because pxx_huge() used
to be "abused" by hugetlbfs.. now that's gone.
The other thing is that we mostly only need these knobs for special
mappings like pfnmaps, am I right? OTOH we use WB for RAM, and maybe we
don't want to bother with any PAT stuff when the kernel is installing an
anonymous THP?
IMHO having pgprot_to_large() is fine even if only x86 has it; it's really
like pfn tracking itself, which is a no-op for !x86. But I'll follow your
advice if you insist; I don't really have a strong opinion.
But if so I'd also like to mention a 3rd option, which is to have
pxx_mkhuge_prot(), falling back to pxx_mkhuge() for !x86. That'll keep
pxx_mkhuge() untainted on x86. I'm not sure whether that would ease the
same concern, though.
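Something like (untested sketch; pxx_mkhuge_prot() doesn't exist yet, the
names are just to show the idea):

/* generic fallback: no prot-specific fixup needed */
#ifndef pmd_mkhuge_prot
#define pmd_mkhuge_prot(pfn, prot) pmd_mkhuge(pfn_t_pmd(pfn, prot))
#endif

/* x86: do the PAT bit shift internally */
#define pmd_mkhuge_prot(pfn, prot) \
	pmd_mkhuge(pfn_t_pmd(pfn, pgprot_4k_2_large(prot)))

Then insert_pfn_pmd() would call pmd_mkhuge_prot(pfn, prot) directly.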
In any case, thanks for confirming this issue, I appreciate it. Let me
know if you have any comments on patch 1 too; that one isn't a problem so
far iiuc, but it could be soon.
Thanks,
--
Peter Xu