On a ppc64 platform running 2.6.13-1, the virtual to physical mapping
established by mmap'ing a hugetlbfs file does not seem to match the
mapping described by get_user_pages().
Specifically, I have written a driver for a PCI device that writes data
into a user-allocated memory buffer. The user passes the virtual
address of the buffer to the driver, which calls get_user_pages() to
lock the pages and obtain the page information needed to build a
scatter list of contiguous physical blocks to pass to the DMA engine on
the device. The device then writes a known pattern to the
buffer, which the user space program can verify.
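For reference, a minimal sketch of what that driver-side path might
look like on a 2.6.13-era kernel (this is not code from the actual
driver; add_dma_segment() is a made-up stub and error handling is
minimal):

/*
 * Hypothetical sketch: pin a user buffer with the 2.6.13-era
 * get_user_pages() and coalesce physically contiguous pages into DMA
 * segments.
 */
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <asm/io.h>

/* Made-up helper: record one physically contiguous DMA segment. */
static void add_dma_segment(unsigned long phys, unsigned long len)
{
	printk(KERN_DEBUG "segment: 0x%lx + 0x%lx\n", phys, len);
}

static int pin_user_buffer(unsigned long uaddr, unsigned long len,
			   struct page **pages)
{
	int nr_pages = ((uaddr & ~PAGE_MASK) + len + PAGE_SIZE - 1)
			>> PAGE_SHIFT;
	unsigned long seg_start = 0, seg_len = 0;
	int pinned, i;

	down_read(&current->mm->mmap_sem);
	pinned = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
				nr_pages, 1 /* write */, 0 /* force */,
				pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (pinned < nr_pages)
		goto out_release;

	/* Merge adjacent physical pages into one scatter list entry. */
	for (i = 0; i < pinned; i++) {
		unsigned long phys = page_to_phys(pages[i]);

		if (seg_len && phys == seg_start + seg_len) {
			seg_len += PAGE_SIZE;
			continue;
		}
		if (seg_len)
			add_dma_segment(seg_start, seg_len);
		seg_start = phys;
		seg_len = PAGE_SIZE;
	}
	if (seg_len)
		add_dma_segment(seg_start, seg_len);
	return pinned;

out_release:
	for (i = 0; i < pinned; i++)
		page_cache_release(pages[i]);
	return -EFAULT;
}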
This process works fine on ia32 and ppc64 using malloc'ed memory. It
also works fine on ia32 when obtaining the memory by mmap'ing a
file on a hugetlbfs filesystem. The 2MB pages are used to reduce the
number of entries in the scatter list.
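The user-space side of that setup might look roughly like the sketch
below; the hugetlbfs mount point, device node and ioctl number are all
placeholders, not the actual interface:

/*
 * Hypothetical user-space sketch: obtain the buffer by mmap'ing a file
 * on hugetlbfs and hand its address to the driver.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE		(32UL * 1024 * 1024)	/* e.g. two 16MB pages */
#define MY_IOC_SET_BUFFER	0x4d01			/* made-up ioctl number */

int main(void)
{
	int hfd = open("/mnt/huge/dmabuf", O_CREAT | O_RDWR, 0600);
	int dfd = open("/dev/mydriver", O_RDWR);
	void *buf;

	if (hfd < 0 || dfd < 0) {
		perror("open");
		return 1;
	}
	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		   hfd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0, BUF_SIZE);	/* touch the pages before the DMA */

	/* Tell the driver where the buffer is; it pins the pages with
	 * get_user_pages() and programs the device. */
	if (ioctl(dfd, MY_IOC_SET_BUFFER, buf) < 0) {
		perror("ioctl");
		return 1;
	}
	/* ... wait for the DMA, then verify the pattern in buf ... */
	return 0;
}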
The process doesn't work so well on ppc64 with hugetlbfs (and 16MB
pages). Often, the data is written to the wrong 16MB pages, from the
perspective of the user space program. The data is correct within a
16MB page; it's just written to the wrong page. It seems that the
information returned by get_user_pages() doesn't match the virtual to
physical mapping used by the user process.
Any suggestions on what I could be doing wrong in this specific case?
Any known problems with the kernel in this case?
Please CC me on any replies.
Thanks,
Matt
Hi,
> On a ppc64 platform running 2.6.13-1, the virtual to physical mapping
> established by mmap'ing a hugetlbfs file does not seem to match the
> mapping described by get_user_pages().
I just tried a simple test - I created a program that allocated a
hugetlb page and wrote some stuff in it. I then attached with gdb and
dumped memory at that address and things came out OK.
That should have exercised the ptrace -> access_process_vm -> get_user_pages
path. So at least for this case get_user_pages seems to be working :)
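A rough reconstruction of that kind of test (not Anton's actual
program; the hugetlbfs mount point is a placeholder) could look like
this:

/*
 * Map one 16MB huge page, fill it with a pattern, then pause so gdb
 * can attach and dump memory at the printed address.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(16UL * 1024 * 1024)

int main(void)
{
	int fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0xa5, HPAGE_SIZE);
	printf("pid %d, hugetlb buffer at %p\n", (int)getpid(), p);
	pause();	/* attach gdb to this pid and dump memory at p */
	return 0;
}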
It would be useful to simplify your problem a bit and take DMA out of
the picture. Moving this over to [email protected] would also
make sense.
Anton
On Fri, 2005-09-16 at 18:52 -0400, Sexton, Matt wrote:
> On a ppc64 platform running 2.6.13-1, the virtual to physical mapping
> established by mmap'ing a hugetlbfs file does not seem to match the
> mapping described by get_user_pages().
Matt, you might want to try with something newer like 2.6.14-rc2-git6
and the following patch from Ben Herrenschmidt. We found a few issues
with the hardware hash table and they should be fixed now (with the
patch below). Is the system you are testing on LPAR or native? Power4
or 5?
--- snip ---
My previous patch fixing invalidation of huge PTEs wasn't good enough,
we still had an issue if a PTE invalidation batch contained both small
and large pages. This patch fixes this by making sure the batch is
flushed if the page size fed to it changes.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Index: linux-work/arch/ppc64/mm/hash_native.c
===================================================================
--- linux-work.orig/arch/ppc64/mm/hash_native.c 2005-09-27 11:43:27.000000000 +1000
+++ linux-work/arch/ppc64/mm/hash_native.c 2005-09-27 11:48:06.000000000 +1000
@@ -343,7 +343,7 @@
hpte_t *hptep;
unsigned long hpte_v;
struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
- unsigned long large;
+ unsigned long large = batch->large;
local_irq_save(flags);
@@ -356,7 +356,6 @@
va = (vsid << 28) | (batch->addr[i] & 0x0fffffff);
batch->vaddr[j] = va;
- large = pte_huge(batch->pte[i]);
if (large)
vpn = va >> HPAGE_SHIFT;
else
@@ -406,7 +405,7 @@
asm volatile("ptesync":::"memory");
for (i = 0; i < j; i++)
- __tlbie(batch->vaddr[i], 0);
+ __tlbie(batch->vaddr[i], large);
asm volatile("eieio; tlbsync; ptesync":::"memory");
Index: linux-work/arch/ppc64/mm/tlb.c
===================================================================
--- linux-work.orig/arch/ppc64/mm/tlb.c 2005-09-27 11:43:27.000000000 +1000
+++ linux-work/arch/ppc64/mm/tlb.c 2005-09-27 11:47:35.000000000 +1000
@@ -143,7 +143,8 @@
* up scanning and resetting referenced bits then our batch context
* will change mid stream.
*/
- if (unlikely(i != 0 && context != batch->context)) {
+ if (i != 0 && (context != batch->context ||
+ batch->large != pte_huge(pte))) {
flush_tlb_pending();
i = 0;
}
@@ -151,6 +152,7 @@
if (i == 0) {
batch->context = context;
batch->mm = mm;
+ batch->large = pte_huge(pte);
}
batch->pte[i] = __pte(pte);
batch->addr[i] = addr;
Index: linux-work/include/asm-ppc64/tlbflush.h
===================================================================
--- linux-work.orig/include/asm-ppc64/tlbflush.h 2005-09-27 11:43:27.000000000 +1000
+++ linux-work/include/asm-ppc64/tlbflush.h 2005-09-27 11:45:09.000000000 +1000
@@ -25,6 +25,7 @@
pte_t pte[PPC64_TLB_BATCH_NR];
unsigned long addr[PPC64_TLB_BATCH_NR];
unsigned long vaddr[PPC64_TLB_BATCH_NR];
+ unsigned int large;
};
DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center