LinuxLists.cc - [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow

2008-07-03 21:40:57

Subject: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

Exports needed by the GRU driver.

Signed-off-by: Jack Steiner <[email protected]>

---
mm/memory.c | 2 ++

---
arch/x86/kernel/genapic_64.c | 1 +
mm/memory.c | 2 ++
2 files changed, 3 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2008-07-03 15:57:13.000000000 -0500
+++ linux/mm/memory.c 2008-07-03 16:19:09.000000000 -0500
@@ -995,6 +995,7 @@ unsigned long zap_page_range(struct vm_a
tlb_finish_mmu(tlb, address, end);
return end;
}
+EXPORT_SYMBOL_GPL(zap_page_range);

/*
* Do a quick page-table lookup for a single page.
@@ -1089,6 +1090,7 @@ no_page_table:
}
return page;
}
+EXPORT_SYMBOL_GPL(follow_page);

/* Can we do the FOLL_ANON optimization? */
static inline int use_zero_page(struct vm_area_struct *vma)
Index: linux/arch/x86/kernel/genapic_64.c
===================================================================
--- linux.orig/arch/x86/kernel/genapic_64.c 2008-07-03 15:59:07.000000000 -0500
+++ linux/arch/x86/kernel/genapic_64.c 2008-07-03 16:19:09.000000000 -0500
@@ -102,3 +102,4 @@ int is_uv_system(void)
{
return uv_system_type != UV_NONE;
}
+EXPORT_SYMBOL_GPL(is_uv_system);

--

2008-07-04 07:39:39

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Thu, Jul 03, 2008 at 04:34:00PM -0500, [email protected] wrote:
> +EXPORT_SYMBOL_GPL(zap_page_range);
>
> /*
> * Do a quick page-table lookup for a single page.
> @@ -1089,6 +1090,7 @@ no_page_table:
> }
> return page;
> }
> +EXPORT_SYMBOL_GPL(follow_page);

NACK.

These should never be called by a driver and suggest you need to rething
your VM integration in this driver.

2008-07-07 14:39:45

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

> On Thu, Jul 03, 2008 at 04:34:00PM -0500, [email protected] wrote:
> > +EXPORT_SYMBOL_GPL(zap_page_range);

Investigating.... More later

> >
> > /*
> > * Do a quick page-table lookup for a single page.
> > @@ -1089,6 +1090,7 @@ no_page_table:
> > }
> > return page;
> > }
> > +EXPORT_SYMBOL_GPL(follow_page);
>
> NACK.
>
> These should never be called by a driver and suggest you need to rething
> your VM integration in this driver.

Can you provide some additional details on the type of kernel API
that could be exported to provide a pte lookup in atomic context?

There has been significant work to add kernel support for
drivers with externel TLBs. The GRU is one of these drivers.
In order to efficiently support the GRU, the driver needs to do
virt->physical translations in atomic context.

Some background.

The GRU is a hardware resource located in the system chipset. The GRU
contains memory that is mmaped into the user address space. This memory is
used to communicate with the GRU to perform functions such as load/store,
scatter/gather, bcopy, AMOs, etc. The GRU is directly accessed by user
instructions using user virtual addresses. GRU instructions (ex., bcopy) use
user virtual addresses for operands.

The GRU contains a large TLB that is functionally very similar to processor TLBs.
If a user references a page and no GRU TLB entry exists, the GRU sends
an interrupt to the processor.

Currently, the driver calls follow_page() in interrupt context. If the
lookup is successful, a new TLB entry is dropped into the GRU and
the fault is resolved. If follow_page() fails, the fault is converted
into a different type of fault that will later be seen by the user. The
driver is later called in user context where sleeping is allowed.
The driver then calls get_user_pages() in user context.

Most users of the GRU will be HPC apps that are primarily cpu bound
in user code. The call to follow_page() should normally be successful
and processing the fault in interrupt context is MUCH lower overhead than
handling it in user context.

I'll gladly make whatever changes are needed but need some pointers on
the direction I should take....

--- jack

2008-07-07 14:46:12

by Christoph Lameter

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

Christoph Hellwig wrote:
> On Thu, Jul 03, 2008 at 04:34:00PM -0500, [email protected] wrote:
>> +EXPORT_SYMBOL_GPL(zap_page_range);
>>
>> /*
>> * Do a quick page-table lookup for a single page.
>> @@ -1089,6 +1090,7 @@ no_page_table:
>> }
>> return page;
>> }
>> +EXPORT_SYMBOL_GPL(follow_page);
>
> NACK.
>
> These should never be called by a driver and suggest you need to rething
> your VM integration in this driver.

Well here we cannot run get_user_pages() since the context is atomic. What other choice is there?

2008-07-07 14:54:19

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, Jul 07, 2008 at 09:39:16AM -0500, Jack Steiner wrote:
> Can you provide some additional details on the type of kernel API
> that could be exported to provide a pte lookup in atomic context?

mmu notifiers makes page pinning an obsolete thing so we want to get
rid of get_user_pages in KVM page fault fast path too (we want to keep
it in the slow path the same way as GRU does).

I tried to profile it and time spent in get_user_pages is quite
smaller than the time spent on follow_page. So I estimated we'll be
very lucky if getting rid of get_user_pages will even mark a 1%
speedup in real life benchmark, but still we want to optimize things
given this is so easy.

If calling follow_page directly is the right thing I don't know. Let's
say I'll postpone any minor-optimization work in this area to the time
mmu notifiers are merged in .27 as scheduled.

If the export of follow_page was going in before a more accurate
thought on what else we could do, I wouldn't be opposed because
eventually we'll have to export a new function anyway and this is a
one liner change, so it couldn't make life harder later if it goes in.

> I'll gladly make whatever changes are needed but need some pointers on
> the direction I should take....

Same here, and I'll join this more actively to help out as soon as we
can move to the next step (i.e. once mmu notifiers are in mainline).

Thanks everyone!!

2008-07-07 16:30:46

by Hugh Dickins

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, 7 Jul 2008, Jack Steiner wrote:
> > > +EXPORT_SYMBOL_GPL(follow_page);
> >
> > NACK.
> >
> > These should never be called by a driver and suggest you need to rething
> > your VM integration in this driver.
>
> Can you provide some additional details on the type of kernel API
> that could be exported to provide a pte lookup in atomic context?

I don't see EXPORT_SYMBOL_GPL(follow_page) as objectionable myself:
it rather seems rather to complement EXPORT_SYMBOL(vm_insert_page)
and EXPORT_SYMBOL(vmalloc_to_page); though I'd agree that it's
sufficiently sensitive to need that _GPL on it.

...

> Currently, the driver calls follow_page() in interrupt context.

However, that's a problem, isn't it, given the pte_offset_map_lock
in follow_page? To avoid the possibility of deadlock, wouldn't we
have to change all the page table locking to irq-disabling variants?
Which I think we'd have reason to prefer not to do.

Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
in mm, and do something similar in your GRU driver (falling back to
the slow method when anything's not quite right). It's not nice to
have such code out in a driver, but GRU is going to be exceptional,
and it may be better to have it out there than pretence of generality
in the core mm exporting it.

Note that even the unlocked pte_offset_map which gup_pte_range uses,
is in general unsafe at interrupt time: because of using a KM_PTE0
atomic kmap which might be in use at the time of the interrupt. But
I doubt your GRU driver is intended for use in HIGHMEM architectures,
so that may be enough to excuse it.

Hugh

2008-07-07 16:57:18

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, Jul 07, 2008 at 05:29:54PM +0100, Hugh Dickins wrote:
> On Mon, 7 Jul 2008, Jack Steiner wrote:
> > > > +EXPORT_SYMBOL_GPL(follow_page);
> > >
> > > NACK.
> > >
> ...
>
> > Currently, the driver calls follow_page() in interrupt context.
>
> However, that's a problem, isn't it, given the pte_offset_map_lock
> in follow_page? To avoid the possibility of deadlock, wouldn't we
> have to change all the page table locking to irq-disabling variants?
> Which I think we'd have reason to prefer not to do.

Good catch. I stupidly overlooked the locking. And I agree - changes to
irq-disabling is the wrong way to solve this.

>
> Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> in mm, and do something similar in your GRU driver (falling back to
> the slow method when anything's not quite right). It's not nice to
> have such code out in a driver, but GRU is going to be exceptional,
> and it may be better to have it out there than pretence of generality
> in the core mm exporting it.

Ok, I'll take this approach. Open code a pagetable walker into the GRU
driver using the ideas of fast_gup(). This has the added benefit of being
able to optimize for exactly what is needed for the GRU. For example,
nr_pages is always 1 (at least in the current design).

>
> Note that even the unlocked pte_offset_map which gup_pte_range uses,
> is in general unsafe at interrupt time: because of using a KM_PTE0
> atomic kmap which might be in use at the time of the interrupt. But
> I doubt your GRU driver is intended for use in HIGHMEM architectures,
> so that may be enough to excuse it.

Right. the GRU driver supports only x86_64 & ia64. No HIGHMEM issues.

--- jack

2008-07-07 17:59:15

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Fri, Jul 04, 2008 at 03:39:26AM -0400, Christoph Hellwig wrote:
> On Thu, Jul 03, 2008 at 04:34:00PM -0500, [email protected] wrote:
> > +EXPORT_SYMBOL_GPL(zap_page_range);
> >
>
> NACK.
>

I agree that zap_pages_range() is more general than what is needed.

How about the following. This should prevent abuse of the API &
provide what (I think) drivers need.

Verified with GRU & XPMEM.

--- jack

--------------------------------------------------------------------------------------------
Add function to unmap driver ptes.

zap_vma_ptes() is intended to be used by drivers to unmap
ptes assigned to the driver private vmas. This interface is
similar to zap_page_range() but is less general & less likely to
be abused.

Signed-off-by: Jack Steiner <[email protected]>

---
include/linux/mm.h | 2 ++
mm/memory.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2008-07-07 12:22:17.809892803 -0500
+++ linux/include/linux/mm.h 2008-07-07 12:22:56.126695482 -0500
@@ -742,6 +742,8 @@ struct zap_details {
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);

+int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size);
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
unsigned long unmap_vmas(struct mmu_gather **tlb,
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2008-07-07 12:22:18.357961503 -0500
+++ linux/mm/memory.c 2008-07-07 12:31:51.097688882 -0500
@@ -978,6 +978,29 @@ unsigned long zap_page_range(struct vm_a
}
EXPORT_SYMBOL_GPL(zap_page_range);

+/**
+ * zap_vma_ptes - remove ptes mapping the vma
+ * @vma: vm_area_struct holding ptes to be zapped
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ *
+ * This function only unmaps ptes assigned to VM_PFNMAP vmas.
+ *
+ * The entire address range must be fully contained within the vma.
+ *
+ * Returns 0 if successful.
+ */
+int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size)
+{
+ if (address < vma->vm_start || address + size > vma->vm_end ||
+ !(vma->vm_flags & VM_PFNMAP))
+ return -1;
+ zap_page_range(vma, address, size, NULL);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(zap_vma_ptes);
+
/*
* Do a quick page-table lookup for a single page.
*/

2008-07-07 18:59:55

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, 7 Jul 2008 17:29:54 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> On Mon, 7 Jul 2008, Jack Steiner wrote:
> > > > +EXPORT_SYMBOL_GPL(follow_page);
> > >
> > > NACK.
> > >
> > > These should never be called by a driver and suggest you need to
> > > rething your VM integration in this driver.
> >
> > Can you provide some additional details on the type of kernel API
> > that could be exported to provide a pte lookup in atomic context?
>
> I don't see EXPORT_SYMBOL_GPL(follow_page) as objectionable myself:
> it rather seems rather to complement EXPORT_SYMBOL(vm_insert_page)
> and EXPORT_SYMBOL(vmalloc_to_page); though I'd agree that it's
> sufficiently sensitive to need that _GPL on it.
>
> ...
>
> > Currently, the driver calls follow_page() in interrupt context.
>
> However, that's a problem, isn't it, given the pte_offset_map_lock
> in follow_page? To avoid the possibility of deadlock, wouldn't we
> have to change all the page table locking to irq-disabling variants?
> Which I think we'd have reason to prefer not to do.
>
> Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> in mm, and do something similar in your GRU driver (falling back to
> the slow method when anything's not quite right). It's not nice to
> have such code out in a driver, but GRU is going to be exceptional,
> and it may be better to have it out there than pretence of generality
> in the core mm exporting it.
>

I wonder if GRU even should be a module; it sounds rather like pretty
core functionality and if it's this invasive to the VM it probably
should be a real part of the VM instead

2008-07-07 19:29:42

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, Jul 07, 2008 at 11:58:44AM -0700, Arjan van de Ven wrote:
> On Mon, 7 Jul 2008 17:29:54 +0100 (BST)
> Hugh Dickins <[email protected]> wrote:
>
> > On Mon, 7 Jul 2008, Jack Steiner wrote:
> > > > > +EXPORT_SYMBOL_GPL(follow_page);
> > > >
> > > > NACK.
> > > >
> > > > These should never be called by a driver and suggest you need to
> > > > rething your VM integration in this driver.
> > >
> > > Can you provide some additional details on the type of kernel API
> > > that could be exported to provide a pte lookup in atomic context?
> >
> > I don't see EXPORT_SYMBOL_GPL(follow_page) as objectionable myself:
> > it rather seems rather to complement EXPORT_SYMBOL(vm_insert_page)
> > and EXPORT_SYMBOL(vmalloc_to_page); though I'd agree that it's
> > sufficiently sensitive to need that _GPL on it.
> >
> > ...
> >
> > > Currently, the driver calls follow_page() in interrupt context.
> >
> > However, that's a problem, isn't it, given the pte_offset_map_lock
> > in follow_page? To avoid the possibility of deadlock, wouldn't we
> > have to change all the page table locking to irq-disabling variants?
> > Which I think we'd have reason to prefer not to do.
> >
> > Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> > in mm, and do something similar in your GRU driver (falling back to
> > the slow method when anything's not quite right). It's not nice to
> > have such code out in a driver, but GRU is going to be exceptional,
> > and it may be better to have it out there than pretence of generality
> > in the core mm exporting it.
> >
>
> I wonder if GRU even should be a module; it sounds rather like pretty
> core functionality and if it's this invasive to the VM it probably
> should be a real part of the VM instead

The GRU is not actually very invasive into the VM. It will use the
new MMU-notifier callbacks. Aside from the need to translate
virt->physical & zap ptes belonging to the GRU, it works fine as a module.
No other core changes are needed.

An additional advantage in keeping it as a module is that I expect it
to under a number of changes as the hardware matures. It is easier to
update the GRU if it is a module.

--- jack

2008-07-07 21:04:00

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Mon, Jul 07, 2008 at 02:29:23PM -0500, Jack Steiner wrote:
> The GRU is not actually very invasive into the VM. It will use the
> new MMU-notifier callbacks. Aside from the need to translate
> virt->physical & zap ptes belonging to the GRU, it works fine as a module.
> No other core changes are needed.
>
> An additional advantage in keeping it as a module is that I expect it
> to under a number of changes as the hardware matures. It is easier to
> update the GRU if it is a module.

Agreed, same applies to kvm mmu.c which also is heavily modularized
and hidden to the main Linux VM. The whole point of the mmu notifiers
is to allow all those secondary MMUs to interact fully with the main
Linux VM and get all the benefits from it, but without having to
pollute and mess with it at every hardware change, plus allowing
multiple secondary MMUs to work on the same "mm" simultaneously and
transparently to each other.

2008-07-08 02:16:53

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Tuesday 08 July 2008 02:53, Jack Steiner wrote:
> On Mon, Jul 07, 2008 at 05:29:54PM +0100, Hugh Dickins wrote:

> > Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> > in mm, and do something similar in your GRU driver (falling back to
> > the slow method when anything's not quite right). It's not nice to
> > have such code out in a driver, but GRU is going to be exceptional,
> > and it may be better to have it out there than pretence of generality
> > in the core mm exporting it.
>
> Ok, I'll take this approach. Open code a pagetable walker into the GRU
> driver using the ideas of fast_gup(). This has the added benefit of being
> able to optimize for exactly what is needed for the GRU. For example,
> nr_pages is always 1 (at least in the current design).

Well... err, it's pretty tied to the arch and mm design. I'd rather
if you could just make another entry point to gup.c (perhaps, one
which doesn't automatically fall back to the get_user_pages slowpath
for you) rather than code it again in your driver.

2008-07-09 19:11:59

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Tue, Jul 08, 2008 at 12:16:21PM +1000, Nick Piggin wrote:
> On Tuesday 08 July 2008 02:53, Jack Steiner wrote:
> > On Mon, Jul 07, 2008 at 05:29:54PM +0100, Hugh Dickins wrote:
>
> > > Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> > > in mm, and do something similar in your GRU driver (falling back to
> > > the slow method when anything's not quite right). It's not nice to
> > > have such code out in a driver, but GRU is going to be exceptional,
> > > and it may be better to have it out there than pretence of generality
> > > in the core mm exporting it.
> >
> > Ok, I'll take this approach. Open code a pagetable walker into the GRU
> > driver using the ideas of fast_gup(). This has the added benefit of being
> > able to optimize for exactly what is needed for the GRU. For example,
> > nr_pages is always 1 (at least in the current design).
>
> Well... err, it's pretty tied to the arch and mm design. I'd rather
> if you could just make another entry point to gup.c (perhaps, one
> which doesn't automatically fall back to the get_user_pages slowpath
> for you) rather than code it again in your driver.

Long term, that is probably a good idea. However, for the short term & while
the GRU is stabilizing, I would prefer to keep the code in the driver itself.
I can address the issue of moving it to gup.c later.

I'll post the new GRU patch in a few minutes.

--- jack

2008-07-10 07:32:49

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Thursday 10 July 2008 05:11, Jack Steiner wrote:
> On Tue, Jul 08, 2008 at 12:16:21PM +1000, Nick Piggin wrote:
> > On Tuesday 08 July 2008 02:53, Jack Steiner wrote:
> > > On Mon, Jul 07, 2008 at 05:29:54PM +0100, Hugh Dickins wrote:
> > > > Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> > > > in mm, and do something similar in your GRU driver (falling back to
> > > > the slow method when anything's not quite right). It's not nice to
> > > > have such code out in a driver, but GRU is going to be exceptional,
> > > > and it may be better to have it out there than pretence of generality
> > > > in the core mm exporting it.
> > >
> > > Ok, I'll take this approach. Open code a pagetable walker into the GRU
> > > driver using the ideas of fast_gup(). This has the added benefit of
> > > being able to optimize for exactly what is needed for the GRU. For
> > > example, nr_pages is always 1 (at least in the current design).
> >
> > Well... err, it's pretty tied to the arch and mm design. I'd rather
> > if you could just make another entry point to gup.c (perhaps, one
> > which doesn't automatically fall back to the get_user_pages slowpath
> > for you) rather than code it again in your driver.
>
> Long term, that is probably a good idea. However, for the short term &
> while the GRU is stabilizing, I would prefer to keep the code in the driver
> itself.

Well I disagree and I think it is a bad idea. gup.c is going into 2.6.27
anyway (and if it weren't going in, then it would be due to some discovered
issue in which case your driver should not use it either).

> I can address the issue of moving it to gup.c later.

I guess you wouldn't be moving anything to gup, because it is already
implemented there... Literally all you have to do is extract a
function in gup.c which takes the fastpath body of get_user_pages_fast
and returns failure rather than branching to slowpath.

If ia64 uses the same sort of tlb invalidation and page table teardown
scheme as x86, then you should be able to copy the x86 gup.c straight
to ia64 (minus the PAE crud).

> I'll post the new GRU patch in a few minutes.

It looks broken to me. How does it determine whether it has a
normal page or not?

2008-07-10 13:29:30

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Thu, Jul 10, 2008 at 05:31:54PM +1000, Nick Piggin wrote:
> On Thursday 10 July 2008 05:11, Jack Steiner wrote:
> > On Tue, Jul 08, 2008 at 12:16:21PM +1000, Nick Piggin wrote:
> > > On Tuesday 08 July 2008 02:53, Jack Steiner wrote:
> > > > On Mon, Jul 07, 2008 at 05:29:54PM +0100, Hugh Dickins wrote:
> > > > > Maybe study the assumptions Nick is making in his arch/x86/mm/gup.c
> > > > > in mm, and do something similar in your GRU driver (falling back to
> > > > > the slow method when anything's not quite right). It's not nice to
> > > > > have such code out in a driver, but GRU is going to be exceptional,
> > > > > and it may be better to have it out there than pretence of generality
> > > > > in the core mm exporting it.
> > > >
> > > > Ok, I'll take this approach. Open code a pagetable walker into the GRU
> > > > driver using the ideas of fast_gup(). This has the added benefit of
> > > > being able to optimize for exactly what is needed for the GRU. For
> > > > example, nr_pages is always 1 (at least in the current design).
> > >
> > > Well... err, it's pretty tied to the arch and mm design. I'd rather
> > > if you could just make another entry point to gup.c (perhaps, one
> > > which doesn't automatically fall back to the get_user_pages slowpath
> > > for you) rather than code it again in your driver.
> >
> > Long term, that is probably a good idea. However, for the short term &
> > while the GRU is stabilizing, I would prefer to keep the code in the driver
> > itself.
>
> Well I disagree and I think it is a bad idea. gup.c is going into 2.6.27
> anyway (and if it weren't going in, then it would be due to some discovered
> issue in which case your driver should not use it either).
>
>
> > I can address the issue of moving it to gup.c later.
>
> I guess you wouldn't be moving anything to gup, because it is already
> implemented there... Literally all you have to do is extract a
> function in gup.c which takes the fastpath body of get_user_pages_fast
> and returns failure rather than branching to slowpath.
>
> If ia64 uses the same sort of tlb invalidation and page table teardown
> scheme as x86, then you should be able to copy the x86 gup.c straight
> to ia64 (minus the PAE crud).
>
> > I'll post the new GRU patch in a few minutes.
>
> It looks broken to me. How does it determine whether it has a
> normal page or not?

Right. Hugepages are not currently supported by the GRU. There is code that I
know is missing/broken in this path. I'm trying to get the core driver accepted, then
I'll get the portion dealing with hugepages working.

Eventually the driver will need to support pages up to 1 TB in size. Most of this
is still incomplete. We have a basic high-level design for what we plan to do
but it will be late this year before have the code complete. My plan is to
add support for hugepages at the same time we add the support for the really-huge
pages. (Note: these really-huge pages are NOT the GB huge pages that being added
for AMD support).

Once the GRU can handle the v->p lookup for all the page types that we need
to support, I'll take a second look at moving the code to gup.c.

--- jack

2008-07-10 14:22:16

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Thursday 10 July 2008 23:29, Jack Steiner wrote:
> On Thu, Jul 10, 2008 at 05:31:54PM +1000, Nick Piggin wrote:
> > On Thursday 10 July 2008 05:11, Jack Steiner wrote:

> > > I'll post the new GRU patch in a few minutes.
> >
> > It looks broken to me. How does it determine whether it has a
> > normal page or not?
>
> Right. Hugepages are not currently supported by the GRU. There is code that
> I know is missing/broken in this path. I'm trying to get the core driver
> accepted, then I'll get the portion dealing with hugepages working.

Oh, I meant "normal" pages as in vm_normal_page(), or is there some
other reason this codepath is exempt from them?

Using gup.c code I don't think will prevent your driver from getting
accepted. Conversely, I would not like the open coded page table walk
to go upstream...

2008-07-10 16:33:49

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Fri, Jul 11, 2008 at 12:21:28AM +1000, Nick Piggin wrote:
> On Thursday 10 July 2008 23:29, Jack Steiner wrote:
> > On Thu, Jul 10, 2008 at 05:31:54PM +1000, Nick Piggin wrote:
> > > On Thursday 10 July 2008 05:11, Jack Steiner wrote:
>
> > > > I'll post the new GRU patch in a few minutes.
> > >
> > > It looks broken to me. How does it determine whether it has a
> > > normal page or not?
> >
> > Right. Hugepages are not currently supported by the GRU. There is code that
> > I know is missing/broken in this path. I'm trying to get the core driver
> > accepted, then I'll get the portion dealing with hugepages working.
>
> Oh, I meant "normal" pages as in vm_normal_page(), or is there some
> other reason this codepath is exempt from them?

Maybe...

The GRU deals with cacheable memory only (the check is currently missing).
What is the proper way to catch a reference to a PTE that maps something
other than normal cacheable memory. Note that we support XPMEM. Some cacheable
memory that is valid for GRU references will be memory located on other
partitions. No page struct entries will exist nor will the physical address ranges
be known to the kernel. (Not in efi/e820 tables).

One idea that I had was to use the attributes of the PTE. Is there
better way. vm_flags? ???

Ideas???

>
> Using gup.c code I don't think will prevent your driver from getting
> accepted. Conversely, I would not like the open coded page table walk
> to go upstream...

If that is the concensus, that is ok. How certain are we that gup.c will
go into 2.6.27. Initially, I though it was cleaner to decouple the GRU
from gup.c & to wait until I had all the hugepage & ia64 issues resolved before
trying to push the walker into the kernel. (The driver runs ok as long
as huge pages are not referenced. It detects attempts to reference hugepages
and gives the user an error).

We would also need a gup.c for ia64.

--- jack

2008-07-10 16:52:40

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Friday 11 July 2008 02:33, Jack Steiner wrote:
> On Fri, Jul 11, 2008 at 12:21:28AM +1000, Nick Piggin wrote:
> > On Thursday 10 July 2008 23:29, Jack Steiner wrote:
> > > On Thu, Jul 10, 2008 at 05:31:54PM +1000, Nick Piggin wrote:
> > > > On Thursday 10 July 2008 05:11, Jack Steiner wrote:
> > > > > I'll post the new GRU patch in a few minutes.
> > > >
> > > > It looks broken to me. How does it determine whether it has a
> > > > normal page or not?
> > >
> > > Right. Hugepages are not currently supported by the GRU. There is code
> > > that I know is missing/broken in this path. I'm trying to get the core
> > > driver accepted, then I'll get the portion dealing with hugepages
> > > working.
> >
> > Oh, I meant "normal" pages as in vm_normal_page(), or is there some
> > other reason this codepath is exempt from them?
>
> Maybe...
>
> The GRU deals with cacheable memory only (the check is currently missing).
> What is the proper way to catch a reference to a PTE that maps something
> other than normal cacheable memory. Note that we support XPMEM. Some
> cacheable memory that is valid for GRU references will be memory located on
> other partitions. No page struct entries will exist nor will the physical
> address ranges be known to the kernel. (Not in efi/e820 tables).
>
> One idea that I had was to use the attributes of the PTE. Is there
> better way. vm_flags? ???
>
> Ideas???

lockless gup checks for struct page by checking a bit in the pte.
This should be enough to guarantee it is cacheable memory (unless
another driver has done something tricky like set the the page's
cache attributes to UC or WC -- I don't know if there is a way to
completely avoid all corner cases).

> > Using gup.c code I don't think will prevent your driver from getting
> > accepted. Conversely, I would not like the open coded page table walk
> > to go upstream...
>
> If that is the concensus, that is ok. How certain are we that gup.c will
> go into 2.6.27.

Linus sounded pretty happy to merge gup as soon as 2.6.27 opens
when I asked.

> Initially, I though it was cleaner to decouple the GRU
> from gup.c & to wait until I had all the hugepage & ia64 issues resolved
> before trying to push the walker into the kernel.

If you're just getting things working in -mm I don't mind one way
or the other I guess. But it should not get merged like that.

> (The driver runs ok as
> long as huge pages are not referenced. It detects attempts to reference
> hugepages and gives the user an error).
>
> We would also need a gup.c for ia64.

That may not be such a bad idea. Don't know if they have a spare pte bit.

2008-07-10 17:20:50

by tip-bot for Jack Steiner

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Fri, Jul 11, 2008 at 02:52:00AM +1000, Nick Piggin wrote:
> On Friday 11 July 2008 02:33, Jack Steiner wrote:
> > On Fri, Jul 11, 2008 at 12:21:28AM +1000, Nick Piggin wrote:
> > > On Thursday 10 July 2008 23:29, Jack Steiner wrote:
> > > > On Thu, Jul 10, 2008 at 05:31:54PM +1000, Nick Piggin wrote:
> > > > > On Thursday 10 July 2008 05:11, Jack Steiner wrote:
> > > > > > I'll post the new GRU patch in a few minutes.
> > > > >
> > > > > It looks broken to me. How does it determine whether it has a
> > > > > normal page or not?
> > > >
> > > > Right. Hugepages are not currently supported by the GRU. There is code
> > > > that I know is missing/broken in this path. I'm trying to get the core
> > > > driver accepted, then I'll get the portion dealing with hugepages
> > > > working.
> > >
> > > Oh, I meant "normal" pages as in vm_normal_page(), or is there some
> > > other reason this codepath is exempt from them?
> >
> > Maybe...
> >
> > The GRU deals with cacheable memory only (the check is currently missing).
> > What is the proper way to catch a reference to a PTE that maps something
> > other than normal cacheable memory. Note that we support XPMEM. Some
> > cacheable memory that is valid for GRU references will be memory located on
> > other partitions. No page struct entries will exist nor will the physical
> > address ranges be known to the kernel. (Not in efi/e820 tables).
> >
> > One idea that I had was to use the attributes of the PTE. Is there
> > better way. vm_flags? ???
> >
> > Ideas???
>
> lockless gup checks for struct page by checking a bit in the pte.
> This should be enough to guarantee it is cacheable memory (unless
> another driver has done something tricky like set the the page's
> cache attributes to UC or WC -- I don't know if there is a way to
> completely avoid all corner cases).
>

The GRU itself has no need to reference the page struct.
However, it WILL reference valid ptes that represent pages imported from
other SSIs via xpmem. These will have cacheable ptes but no page structs.

Maybe checking the pte attributes is the best way to do the check.

If we take this approach, what is a good API for the gup.c walker?
Return the pte attributes?

int get_user_pte(struct mm_struct *mm, unsigned long address,
int write, unsigned long *paddr, int *pageshift, pgprot_t *prot)

The GRU would enforce the check for cacheable access.

--- jack

2008-07-11 03:30:23

by Nick Piggin

[permalink] [raw]

Subject: Re: [patch 12/13] GRU Driver V3 - export is_uv_system(), zap_page_range() & follow_page()

On Friday 11 July 2008 03:20, Jack Steiner wrote:
> On Fri, Jul 11, 2008 at 02:52:00AM +1000, Nick Piggin wrote:

> > lockless gup checks for struct page by checking a bit in the pte.
> > This should be enough to guarantee it is cacheable memory (unless
> > another driver has done something tricky like set the the page's
> > cache attributes to UC or WC -- I don't know if there is a way to
> > completely avoid all corner cases).
>
> The GRU itself has no need to reference the page struct.
> However, it WILL reference valid ptes that represent pages imported from
> other SSIs via xpmem. These will have cacheable ptes but no page structs.

Oh, I'm sorry Jack, I misread the patch and thought you still had
the page_to_phys thing in there... OK, then it probably isn't
broken. And in which case you would have to add a little code to
gup.c...

> Maybe checking the pte attributes is the best way to do the check.
>
> If we take this approach, what is a good API for the gup.c walker?
> Return the pte attributes?
>
> int get_user_pte(struct mm_struct *mm, unsigned long address,
> int write, unsigned long *paddr, int *pageshift, pgprot_t *prot)
>
> The GRU would enforce the check for cacheable access.

Yeah that wouldn't be a bad API, although being a lockless, may-fail,
not available on all archs kind of thing, I would prefer a different
name, maybe get_user_pte_fast() to match?

Thanks,
Nick