Terminology
The term 'partition', adopted by the SGI hardware designers and since
percolated up into the software, refers to a single SSI when multiple
SSIs are running on a single Altix. An Altix running multiple SSIs is
said to be 'partitioned', whereas one running only a single SSI is said
to be 'unpartitioned'.
The term '[a]cross partition' refers to functionality that spans two
SSIs on a multi-SSI Altix. ('XP' is its abbreviation.)
Introduction
This feature provides cross partition access to user memory (XPMEM) when
running multiple partitions on a single SGI Altix. XPMEM, like XPNET,
utilizes XPC to communicate between the partitions.
XPMEM allows a user process to identify portions of its address space
that other user processes can attach (i.e. map) into their own address
spaces. The attaching processes can be running on the same partition as,
or a different partition from, the one whose memory they are attaching.
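As background for readers unfamiliar with the interface, the following is a
minimal sketch of how a user process might use an XPMEM-style library to
export and attach memory. The function names (xpmem_make(), xpmem_get(),
xpmem_attach()), the handle types, and the signatures are illustrative
assumptions and are not taken from this patch.

/*
 * Illustrative sketch only; the xpmem_* prototypes and types below are
 * assumed for the example and are not the interface defined by this patch.
 */
#include <stddef.h>

typedef long xpmem_segid_t;	/* assumed handle for an exported segment */
typedef long xpmem_apid_t;	/* assumed handle for an access permit */

xpmem_segid_t xpmem_make(void *vaddr, size_t size);
xpmem_apid_t xpmem_get(xpmem_segid_t segid);
void *xpmem_attach(xpmem_apid_t apid, size_t size);

static char buffer[4096];

int exporter(void)
{
	/* Process A: offer 'buffer' to other processes, possibly running
	 * on another partition, and hand the returned segid to them. */
	xpmem_segid_t segid = xpmem_make(buffer, sizeof(buffer));

	return (segid < 0) ? -1 : 0;
}

int importer(xpmem_segid_t segid)
{
	/* Process B: obtain a permit for the segment and map it into its
	 * own address space; later faults are serviced by XPMEM. */
	xpmem_apid_t apid = xpmem_get(segid);
	char *mapped = xpmem_attach(apid, sizeof(buffer));

	return mapped ? mapped[0] : -1;
}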
Known Issues
XPMEM does not currently use the kthread API (nor does XPC), because that
API was in the process of being changed to require a kthread_stop() for
every kthread_create(), and kthread_stop() could not be called for a thread
that had already exited. In discussion with Eric Biederman, there was some
thought of creating a kthread_orphan() that would eliminate the need to
call kthread_stop().
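For reference, the sketch below shows the kthread_create()/kthread_stop()
pairing that the reworked API requires. It is generic kernel boilerplate
under that assumption; example_thread_fn() and the surrounding names are
made up for illustration and are not XPMEM code.

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>

static struct task_struct *example_task;

static int example_thread_fn(void *data)
{
	/* Under the reworked API the thread must stay around until
	 * kthread_stop() is called; it cannot simply exit on its own. */
	while (!kthread_should_stop())
		msleep(100);
	return 0;
}

static int example_start(void)
{
	example_task = kthread_run(example_thread_fn, NULL, "example");
	return IS_ERR(example_task) ? PTR_ERR(example_task) : 0;
}

static void example_stop(void)
{
	/* Every kthread_run()/kthread_create() must be matched by a
	 * kthread_stop(); this pairing is the constraint discussed above. */
	kthread_stop(example_task);
}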
This patch exports __put_task_struct as it is needed by XPMEM.
Signed-off-by: Dean Nelson <[email protected]>
---
One of the file operations registered by XPMEM, xpmem_open(), calls
'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
'put_task_struct(tg->group_leader)'. The reason for this is given in the
comment block that appears in xpmem_open().
/*
* Increment 'usage' and 'mm->mm_users' for the current task's thread
* group leader. This ensures that both its task_struct and mm_struct
* will still be around when our thread group exits. (The Linux kernel
* normally tears down the mm_struct prior to calling a module's
* 'flush' function.) Since all XPMEM thread groups must go through
* this path, this extra reference to mm_users also allows us to
* directly inc/dec mm_users in xpmem_ensure_valid_PFNs() and avoid
* mmput() which has a scaling issue with the mmlist_lock.
*/
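Condensed into code, the pattern the comment describes looks roughly like
the sketch below. It is illustrative only: the real patch stashes the task
and mm pointers in its own per-thread-group structure rather than in
file-scope variables, and these function bodies are not the patch's actual
code.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sched.h>

static struct task_struct *sketch_leader;
static struct mm_struct *sketch_mm;

static int xpmem_open_sketch(struct inode *inode, struct file *file)
{
	sketch_leader = current->group_leader;
	sketch_mm = sketch_leader->mm;

	get_task_struct(sketch_leader);		/* keep the task_struct alive */
	atomic_inc(&sketch_mm->mm_users);	/* keep the mm_struct alive */
	return 0;
}

static int xpmem_flush_sketch(struct file *file, fl_owner_t id)
{
	/* Drop the references taken in open.  mmput() is safe here even
	 * though exit_mm() has already run for an exiting thread group,
	 * because of the extra mm_users reference taken above. */
	mmput(sketch_mm);
	put_task_struct(sketch_leader);
	return 0;
}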
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2007-08-09 07:07:55.426611601 -0500
+++ linux-2.6/kernel/fork.c 2007-08-09 07:15:43.246391700 -0500
@@ -127,6 +127,7 @@
if (!profile_handoff_task(tsk))
free_task(tsk);
}
+EXPORT_SYMBOL_GPL(__put_task_struct);
void __init fork_init(unsigned long mempages)
{
This patch exports zap_page_range as it is needed by XPMEM.
Signed-off-by: Dean Nelson <[email protected]>
---
XPMEM would have used sys_madvise(), except that madvise_dontneed()
returns -EINVAL if VM_PFNMAP is set, which is always the case for the
pages XPMEM imports from other partitions and also for uncached pages
allocated locally via the mspec allocator. XPMEM needs zap_page_range()
functionality for these types of pages as well as for 'normal' pages.
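For illustration, the call XPMEM needs to be able to make once the symbol
is exported looks like the sketch below; the wrapper name and the
vma/vaddr/size parameters are placeholders for whatever mapping is being
torn down.

#include <linux/mm.h>

/* Illustrative only: unmap a range of a VM_PFNMAP vma, something
 * madvise_dontneed() refuses to do. */
static void xpmem_unmap_range_sketch(struct vm_area_struct *vma,
				     unsigned long vaddr, unsigned long size)
{
	zap_page_range(vma, vaddr, size, NULL);
}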
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2007-08-09 07:07:55.762651612 -0500
+++ linux-2.6/mm/memory.c 2007-08-09 07:15:43.226389312 -0500
@@ -894,6 +894,7 @@
tlb_finish_mmu(tlb, address, end);
return end;
}
+EXPORT_SYMBOL_GPL(zap_page_range);
/*
* Do a quick page-table lookup for a single page.
This patch provides cross partition access to user memory (XPMEM) when
running multiple partitions on a single SGI Altix.
Signed-off-by: Dean Nelson <[email protected]>
On Thu, 9 Aug 2007 20:14:35 -0500 Dean Nelson <[email protected]> wrote:
> This patch provides cross partition access to user memory (XPMEM) when
> running multiple partitions on a single SGI Altix.
>
- Please don't send multiple patches all with the same title.
- Feed the diff through checkpatch.pl, ponder the result.
- Avoid needless casts to and from void* (eg, vm_private_data)
- The test for PF_DUMPCORE in xpmem_fault_handler() is mysterious and
merits a comment.
- xpmem_fault_handler() is scary.
- xpmem_fault_handler() appears to have imposed a kernel-wide rule that
when taking multiple mmap_sems, one should take the lowest-addressed one
first? If so, that probably wants a mention in that locking comment in
filemap.c
- xpmem_fault_handler() does atomic_dec(&seg_tg->mm->mm_users). What
happens if that was the last reference?
- Has it all been tested with lockdep enabled? Judging from all the use
of SPIN_LOCK_UNLOCKED, it has not.
Oh, ia64 doesn't implement lockdep. For this code, that is deeply
regrettable.
- This code all predates the nopage->fault conversion and won't work in
current kernels.
- xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
ia64 doesn't do preempt?
- Stuff like this:
+ ap_tg = xpmem_tg_ref_by_apid(apid);
+ if (IS_ERR(ap_tg))
+ return PTR_ERR(ap_tg);
+
+ ap = xpmem_ap_ref_by_apid(ap_tg, apid);
+ if (IS_ERR(ap)) {
+ xpmem_tg_deref(ap_tg);
+ return PTR_ERR(ap);
+ }
is fragile. It is easy to introduce leaks and locking errors as the
code evolves. The code has a lot of these deeply-embedded `return'
statements preceded by duplicated unwinding. Kernel code generally
prefers to do the `goto out' thing.
- "XPMEM_FLAG_VALIDPTEs"? Someone's pinky got tired at the end ;)
Attention span expired at 19%, sorry. It's a large patch.
On Thu, Aug 09, 2007 at 11:15:42PM -0700, Andrew Morton wrote:
> On Thu, 9 Aug 2007 20:14:35 -0500 Dean Nelson <[email protected]> wrote:
>
> > This patch provides cross partition access to user memory (XPMEM) when
> > running multiple partitions on a single SGI Altix.
> >
>
> - Please don't send multiple patches all with the same title.
Sorry, I'd somehow gotten the impression it was to be done this way.
> - Feed the diff through checkpatch.pl, ponder the result.
I did, and made the necessary changes to address the issues it raised, except
for:
1) ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt
As mentioned in the intro to this set of patches, XPMEM is not currently
using the kthread API because that API was in the process of being changed
to require a kthread_stop() for every kthread_create(), and kthread_stop()
couldn't be called for a thread that had already exited. In talking with
Eric Biederman, there was some talk of creating a kthread_orphan() which
would eliminate the need to call kthread_stop().
2) WARNING: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
I eliminated all but two usages of volatile in XPMEM. One in
xpmem_flush_cpu_cache() and the other in xpmem_recall_cache_lines_phys().
Without the volatile the compiler eliminates the necessary instruction in
each. Any suggestions as to how to eliminate the need for volatile in these
two cases?
3) WARNING: declaring multiple variables together should be avoided
checkpatch.pl is erroneously complaining about the following found in five
different functions in arch/ia64/sn/kernel/xpmem_pfn.c.
int n_pgs = xpmem_num_of_pages(vaddr, size);
> ! Avoid needless casts to and from void* (eg, vm_private_data)
Done.
> ! The test for PF_DUMPCORE in xpmem_fault_handler() is mysterious and
> merits a comment.
Done.
> - xpmem_fault_handler() is scary.
You are very right about that, but consider that this code has been running
at our customer sites on very large systems (18 partitions with 512 CPUs in
each partition). The avoid_deadlock_1 and avoid_deadlock_2 issues were
discovered at such sites and have been resolved (to the best of our
knowledge).
> - xpmem_fault_handler() appears to have imposed a kernel-wide rule that
> when taking multiple mmap_sems, one should take the lowest-addressed one
> first? If so, that probably wants a mention in that locking comment in
> filemap.c
Sure. After looking at the lock ordering comment block in mm/filemap.c, it
wasn't clear to me how best to document this. Any suggestions/help would
be most appreciated.
> - xpmem_fault_handler() does atomic_dec(&seg_tg->mm->mm_users). What
> happens if that was the last reference?
When /dev/xpmem is opened by a user process, xpmem_open() incs mm_users
and when it is flushed, xpmem_flush() decs it (via mmput()) after having
ensured that no XPMEM attachments exist of this mm. Thus the dec in
xpmem_fault_handler() will never take it to 0.
> - Has it all been tested with lockdep enabled? Judging from all the use
> of SPIN_LOCK_UNLOCKED, it has not.
>
> Oh, ia64 doesn't implement lockdep. For this code, that is deeply
> regrettable.
No, it hasn't been tested with lockdep. But I have switched it from using
SPIN_LOCK_UNLOCKED to spin_lock_init().
> ! This code all predates the nopage->fault conversion and won't work in
> current kernels.
I've switched from using nopage to using fault. I read that it is intended
that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().
> - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
> ia64 doesn't do preempt?
Actually, the code is fine as is even with preemption configured on. All it's
doing is ensuring that the thread was previously pinned to the CPU it's
currently running on. If it is, it can't be moved to another CPU via
preemption, and if it isn't, the check will fail and we'll return -EINVAL
and all is well.
> - Stuff like this:
>
> + ap_tg = xpmem_tg_ref_by_apid(apid);
> + if (IS_ERR(ap_tg))
> + return PTR_ERR(ap_tg);
> +
> + ap = xpmem_ap_ref_by_apid(ap_tg, apid);
> + if (IS_ERR(ap)) {
> + xpmem_tg_deref(ap_tg);
> + return PTR_ERR(ap);
> + }
>
> is fragile. It is easy to introduce leaks and locking errors as the
> code evolves. The code has a lot of these deeply-embedded `return'
> statements preceded by duplicated unwinding. Kernel code generally
> prefers to do the `goto out' thing.
Done (assuming I correctly interpreted your intent).
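For the record, the restructured form of the quoted fragment looks roughly
like the sketch below; 'ret' and the label name are illustrative, and the
handling of the references on the success path follows whatever the
surrounding function actually does.

	ap_tg = xpmem_tg_ref_by_apid(apid);
	if (IS_ERR(ap_tg))
		return PTR_ERR(ap_tg);

	ap = xpmem_ap_ref_by_apid(ap_tg, apid);
	if (IS_ERR(ap)) {
		ret = PTR_ERR(ap);
		goto out_deref_tg;
	}

	/* ... main body ... */
	return 0;

out_deref_tg:
	xpmem_tg_deref(ap_tg);
	return ret;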
> ! "XPMEM_FLAG_VALIDPTEs"? Someone's pinky got tired at the end ;)
Yeah, but it's better now (and so is the code).
> Attention span expired at 19%, sorry. It's a large patch.
I've included the new patch as an attachment.
Thanks,
Dean
On Wed, 22 Aug 2007 12:00:11 -0500
Dean Nelson <[email protected]> wrote:
>
> 3) WARNING: declaring multiple variables together should be avoided
>
> checkpatch.pl is erroneously complaining about the following found in five
> different functions in arch/ia64/sn/kernel/xpmem_pfn.c.
>
> int n_pgs = xpmem_num_of_pages(vaddr, size);
What warning does it generate here?
> > - xpmem_fault_handler() appears to have imposed a kernel-wide rule that
> > when taking multiple mmap_sems, one should take the lowest-addressed one
> > first? If so, that probably wants a mention in that locking comment in
> > filemap.c
>
> Sure. After looking at the lock ordering comment block in mm/filemap.c, it
> wasn't clear to me how best to document this. Any suggestions/help would
> be most appreciated.
umm,
* when taking multiple mmap_sems, one should take the lowest-addressed one
* first
;)
> > - xpmem_fault_handler() does atomic_dec(&seg_tg->mm->mm_users). What
> > happens if that was the last reference?
>
> When /dev/xpmem is opened by a user process, xpmem_open() incs mm_users
> and when it is flushed, xpmem_flush() decs it (via mmput()) after having
> ensured that no XPMEM attachments exist of this mm. Thus the dec in
> xpmem_fault_handler() will never take it to 0.
OK. Generally if a reviewer asks a question like this, it indicates that a
code comment is needed. Because it is likely that others will later wonder
the same thing.
> > - Has it all been tested with lockdep enabled? Judging from all the use
> > of SPIN_LOCK_UNLOCKED, it has not.
> >
> > Oh, ia64 doesn't implement lockdep. For this code, that is deeply
> > regrettable.
>
> No, it hasn't been tested with lockdep. But I have switched it from using
> SPIN_LOCK_UNLOCKED to spin_lock_init().
>
> > ! This code all predates the nopage->fault conversion and won't work in
> > current kernels.
>
> I've switched from using nopage to using fault. I read that it is intended
> that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
> is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().
>
> > - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
> > ia64 doesn't do preempt?
>
> Actually, the code is fine as is even with preemption configured on. All it's
> doing is ensuring that the thread was previously pinned to the CPU it's
> currently running on. If it is, it can't be moved to another CPU via
> preemption, and if it isn't, the check will fail and we'll return -EINVAL
> and all is well.
OK. Running smp_processor_id() from within preemptible code will generate
a warning, but the code is sneaky enough to prevent that warning if the
calling task happens to be pinned to a single CPU.
On Wed, Aug 22, 2007 at 11:04:22AM -0700, Andrew Morton wrote:
> On Wed, 22 Aug 2007 12:00:11 -0500
> Dean Nelson <[email protected]> wrote:
>
> >
> > 3) WARNING: declaring multiple variables together should be avoided
> >
> > checkpatch.pl is erroneously complaining about the following found in five
> > different functions in arch/ia64/sn/kernel/xpmem_pfn.c.
> >
> > int n_pgs = xpmem_num_of_pages(vaddr, size);
>
> What warning does it generate here?
The WARNING #3 above "declaring multiple variables together should be avoided".
There is only one variable being declared, which happens to be initialized by
the function xpmem_num_of_pages().
> > > - xpmem_fault_handler() appears to have imposed a kernel-wide rule that
> > > when taking multiple mmap_sems, one should take the lowest-addressed one
> > > first? If so, that probably wants a mention in that locking comment in
> > > filemap.c
> >
> > Sure. After looking at the lock ordering comment block in mm/filemap.c, it
> > wasn't clear to me how best to document this. Any suggestions/help would
> > be most appreciated.
>
> umm,
>
> * when taking multiple mmap_sems, one should take the lowest-addressed one
> * first
>
> ;)
Thanks.
> > > - xpmem_fault_handler() does atomic_dec(&seg_tg->mm->mm_users). What
> > > happens if that was the last reference?
> >
> > When /dev/xpmem is opened by a user process, xpmem_open() incs mm_users
> > and when it is flushed, xpmem_flush() decs it (via mmput()) after having
> > ensured that no XPMEM attachments exist of this mm. Thus the dec in
> > xpmem_fault_handler() will never take it to 0.
>
> OK. Generally if a reviewer asks a question like this, it indicates that a
> code comment is needed. Because it is likely that others will later wonder
> the same thing.
Will do.
> > > - Has it all been tested with lockdep enabled? Judging from all the use
> > > of SPIN_LOCK_UNLOCKED, it has not.
> > >
> > > Oh, ia64 doesn't implement lockdep. For this code, that is deeply
> > > regrettable.
> >
> > No, it hasn't been tested with lockdep. But I have switched it from using
> > SPIN_LOCK_UNLOCKED to spin_lock_init().
> >
> > > ! This code all predates the nopage->fault conversion and won't work in
> > > current kernels.
> >
> > I've switched from using nopage to using fault. I read that it is intended
> > that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
> > is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().
> >
> > > - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
> > > ia64 doesn't do preempt?
> >
> > Actually, the code is fine as is even with preemption configured on. All it's
> > doing is ensuring that the thread was previously pinned to the CPU it's
> > currently running on. If it is, it can't be moved to another CPU via
> > preemption, and if it isn't, the check will fail and we'll return -EINVAL
> > and all is well.
>
> OK. Running smp_processor_id() from within preemptible code will generate
> a warning, but the code is sneaky enough to prevent that warning if the
> calling task happens to be pinned to a single CPU.
Would it make more sense in this particular case to replace the call to
smp_processor_id() in xpmem_attach() with a call to raw_smp_processor_id()
instead, and add a comment explaining why?
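Concretely, something like the sketch below is what I have in mind; the
wrapper function and its name are made up for illustration and are not the
actual xpmem_attach() code, but the cpumask check mirrors the behavior
described above and the comment carries the justification.

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/smp.h>

static int xpmem_check_pinned_sketch(void)
{
	/*
	 * raw_smp_processor_id() avoids the preemption warning.  If the
	 * caller is pinned to exactly the CPU read here, preemption cannot
	 * migrate it; if it is not pinned, the check fails anyway, so a
	 * stale CPU number can at worst cause a spurious -EINVAL.
	 */
	int cpu = raw_smp_processor_id();

	if (cpus_weight(current->cpus_allowed) != 1 ||
	    !cpu_isset(cpu, current->cpus_allowed))
		return -EINVAL;

	return 0;
}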
On Wed, 22 Aug 2007 14:15:16 -0500
Dean Nelson <[email protected]> wrote:
> On Wed, Aug 22, 2007 at 11:04:22AM -0700, Andrew Morton wrote:
> > On Wed, 22 Aug 2007 12:00:11 -0500
> > Dean Nelson <[email protected]> wrote:
> >
> > >
> > > 3) WARNING: declaring multiple variables together should be avoided
> > >
> > > checkpatch.pl is erroneously complaining about the following found in five
> > > different functions in arch/ia64/sn/kernel/xpmem_pfn.c.
> > >
> > > int n_pgs = xpmem_num_of_pages(vaddr, size);
> >
> > What warning does it generate here?
>
> The WARNING #3 above "declaring multiple variables together should be avoided".
> There is only one variable being declared, which happens to be initialized by
> the function xpmem_num_of_pages().
Ah, I think I recall seeing a report of that earlier. Maybe it's been fixed?
> ...
> > > I've switched from using nopage to using fault. I read that it is intended
> > > that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
> > > is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().
> > >
> > > > - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
> > > > ia64 doesn't do preempt?
> > >
> > > Actually, the code is fine as is even with preemption configured on. All it's
> > > doing is ensuring that the thread was previously pinned to the CPU it's
> > > currently running on. If it is, it can't be moved to another CPU via
> > > preemption, and if it isn't, the check will fail and we'll return -EINVAL
> > > and all is well.
> >
> > OK. Running smp_processor_id() from within preemptible code will generate
> > a warning, but the code is sneaky enough to prevent that warning if the
> > calling task happens to be pinned to a single CPU.
>
> Would it make more sense in this particular case to replace the call to
> smp_processor_id() in xpmem_attach() with a call to raw_smp_processor_id()
> instead, and add a comment explaining why?
Your call ;) Either will be OK, I expect.
Andrew Morton wrote:
> On Wed, 22 Aug 2007 14:15:16 -0500
> Dean Nelson <[email protected]> wrote:
>
>> On Wed, Aug 22, 2007 at 11:04:22AM -0700, Andrew Morton wrote:
>>> On Wed, 22 Aug 2007 12:00:11 -0500
>>> Dean Nelson <[email protected]> wrote:
>>>
>>>> 3) WARNING: declaring multiple variables together should be avoided
>>>>
>>>> checkpatch.pl is erroneously complaining about the following found in five
>>>> different functions in arch/ia64/sn/kernel/xpmem_pfn.c.
>>>>
>>>> int n_pgs = xpmem_num_of_pages(vaddr, size);
>>> What warning does it generate here?
>> The WARNING #3 above "declaring multiple variables together should be avoided".
>> There is only one variable being declared, which happens to be initialized by
>> the function xpmem_num_of_pages().
>
> Ah, I think I recall seeing a report of that earlier. Maybe it's been fixed?
Yep, that got fixed. Though the consensus was that there were too many good
uses for the multiple-declaration form, so that warning got put on ice after
that too.
>> ...
>>>> I've switched from using nopage to using fault. I read that it is intended
>>>> that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
>>>> is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().
>>>>
>>>>> - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
>>>>> ia64 doesn't do preempt?
>>>> Actually, the code is fine as is even with preemption configured on. All it's
>>>> doing is ensuring that the thread was previously pinned to the CPU it's
>>>> currently running on. If it is, it can't be moved to another CPU via
>>>> preemption, and if it isn't, the check will fail and we'll return -EINVAL
>>>> and all is well.
>>> OK. Running smp_processor_id() from within preemptible code will generate
>>> a warning, but the code is sneaky enough to prevent that warning if the
>>> calling task happens to be pinned to a single CPU.
>> Would it make more sense in this particular case to replace the call to
>> smp_processor_id() in xpmem_attach() with a call to raw_smp_processor_id()
>> instead, and add a comment explaining why?
>
> Your call ;) Either will be OK, I expect.