Terminology
The term 'partition', adopted by the SGI hardware designers and which
percolated up into the software, refers to a single SSI when multiple
SSIs are running on a single Altix. An Altix running multiple SSIs is
said to be 'partitioned', whereas one running only a single SSI is said
to be 'unpartitioned'.
The term '[a]cross partition' (abbreviated 'XP') refers to functionality
that spans two SSIs on a multi-SSI Altix.
Introduction
This feature provides cross partition access to user memory (XPMEM) when
running multiple partitions on a single SGI Altix. XPMEM, like XPNET,
utilizes XPC to communicate between the partitions.
XPMEM allows a user process to identify portion(s) of its address space
that other user processes can attach (i.e. map) into their own address
spaces. These processes can be running on the same or a different
partition from the one whose memory they are attaching.
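As a concrete illustration, the usage pattern looks roughly like the
following from user space. This is a minimal sketch: the xpmem_make(),
xpmem_get(), and xpmem_attach() calls and the <xpmem.h> header reflect
XPMEM's user-level library, but the exact names and signatures shown here
are assumptions, not a definitive interface.

	#include <xpmem.h>	/* user-level XPMEM library header (assumed) */

	/* In the exporting process: offer 'size' bytes at 'buf' to others.
	 * The returned segid is passed to importers out of band. */
	xpmem_segid_t segid = xpmem_make(buf, size, XPMEM_PERMIT_MODE,
					 (void *)0600);

	/* In an importing process (same or a different partition): map it. */
	xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
				      NULL);
	struct xpmem_addr xaddr = { .apid = apid, .offset = 0 };
	void *vaddr = xpmem_attach(xaddr, size, NULL);	/* now maps 'buf' */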
Known Issues
XPMEM is not currently using the kthread API (which is also true of XPC)
because, at the time, that API was being changed to require that a
kthread_stop() be done for every kthread_create(), and kthread_stop()
couldn't be called for a thread that had already exited. In talking with
Eric Biederman, there was some thought of creating a kthread_orphan()
which would eliminate the need for a call to kthread_stop().
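For reference, the pairing that the in-flux API would have required looks
like this (a sketch only; the daemon function and thread name are
hypothetical):

	static struct task_struct *task;

	static int xpmem_daemon(void *arg)
	{
		while (!kthread_should_stop())
			schedule_timeout_interruptible(HZ);
		return 0;
	}

	/* module init: */
	task = kthread_run(xpmem_daemon, NULL, "xpmemd");

	/* module exit: every kthread_create()/kthread_run() must be paired
	 * with a kthread_stop(), yet kthread_stop() couldn't be called for
	 * a thread that had already exited on its own; hence the idea of a
	 * kthread_orphan() to drop that requirement. */
	kthread_stop(task);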
This patch exports __put_task_struct as it is needed by XPMEM.
Signed-off-by: Dean Nelson <[email protected]>
---
One struct file_operations method registered by XPMEM, xpmem_open(), calls
'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
'put_task_struct(tg->group_leader)'. The reason for this is given in the
comment block that appears in xpmem_open().
/*
 * Increment 'usage' and 'mm->mm_users' for the current task's thread
 * group leader. This ensures that both its task_struct and mm_struct
 * will still be around when our thread group exits. (The Linux kernel
 * normally tears down the mm_struct prior to calling a module's
 * 'flush' function.) Since all XPMEM thread groups must go through
 * this path, this extra reference to mm_users also allows us to
 * directly inc/dec mm_users in xpmem_ensure_valid_PFNs() and avoid
 * mmput() which has a scaling issue with the mmlist_lock.
 */
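In outline, the pairing looks like this (a sketch based on the patch; the
xpmem_thread_group structure and the error handling are simplified, not
the actual implementation):

	static int xpmem_open(struct inode *inode, struct file *file)
	{
		/* pin the group leader's task_struct and mm_struct */
		get_task_struct(current->group_leader);
		atomic_inc(&current->mm->mm_users);
		/* ... create the per-thread-group structure ... */
		return 0;
	}

	static int xpmem_flush(struct file *file, fl_owner_t id)
	{
		struct xpmem_thread_group *tg;

		tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
		if (IS_ERR(tg))
			return 0;	/* no structure; nothing to drop */

		/* ... tear down the structure, then drop the references ... */
		mmput(tg->mm);
		put_task_struct(tg->group_leader);
		return 0;
	}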
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2007-08-09 07:07:55.426611601 -0500
+++ linux-2.6/kernel/fork.c 2007-08-09 07:15:43.246391700 -0500
@@ -127,6 +127,7 @@
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);
 
 void __init fork_init(unsigned long mempages)
 {
This patch exports zap_page_range as it is needed by XPMEM.
Signed-off-by: Dean Nelson <[email protected]>
---
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator. XPMEM needs zap_page_range()
functionality for these types of pages as well as for 'normal' pages.
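The check in question, and the call XPMEM makes instead, look like this
(the madvise_dontneed() test is from mm/madvise.c of this era; the XPMEM
call site shown is an illustration, not the actual code):

	/* madvise_dontneed() rejects PFN-mapped regions outright: */
	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
		return -EINVAL;

	/* so XPMEM has to invoke zap_page_range() on the range directly: */
	zap_page_range(vma, start, end - start, NULL);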
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2007-08-09 07:07:55.762651612 -0500
+++ linux-2.6/mm/memory.c 2007-08-09 07:15:43.226389312 -0500
@@ -894,6 +894,7 @@
 	tlb_finish_mmu(tlb, address, end);
 	return end;
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.
Signed-off-by: Dean Nelson <[email protected]>
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2007-08-09 19:18:15.000000000 -0500
+++ linux-2.6/mm/filemap.c 2007-08-27 09:13:47.435717670 -0500
@@ -78,6 +78,9 @@
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
+ * When taking multiple mmap_sems, one should lock the lowest-addressed
+ * one first, proceeding on up to the highest-addressed one.
+ *
  *  ->i_mutex
  *    ->i_alloc_sem		(various)
  *
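Followed literally, the rule reduces to a helper along these lines (the
function name is hypothetical and not part of the patch):

	static void lock_two_mmap_sems(struct mm_struct *mm1,
				       struct mm_struct *mm2)
	{
		struct mm_struct *lo = mm1 < mm2 ? mm1 : mm2;
		struct mm_struct *hi = mm1 < mm2 ? mm2 : mm1;

		/* taking the lowest-addressed mmap_sem first gives all
		 * lockers a single global order, avoiding ABBA deadlock */
		down_write(&lo->mmap_sem);
		if (hi != lo)
			down_write_nested(&hi->mmap_sem, SINGLE_DEPTH_NESTING);
	}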
On Mon, Aug 27, 2007 at 10:59:33AM -0500, Dean Nelson wrote:
> This patch exports __put_task_struct as it is needed by XPMEM.
>
> Signed-off-by: Dean Nelson <[email protected]>
>
> ---
>
> One struct file_operations method registered by XPMEM, xpmem_open(), calls
> 'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
> 'put_task_struct(tg->group_leader)'.
Does it? Well, then open the file in question and start doing close(dup(fd))
in a loop. Won't take long for an oops...
This patch has been bzip2'd and included as an attachment due to size.
On Mon, Aug 27, 2007 at 05:13:28PM +0100, Al Viro wrote:
> On Mon, Aug 27, 2007 at 10:59:33AM -0500, Dean Nelson wrote:
> > This patch exports __put_task_struct as it is needed by XPMEM.
> >
> > Signed-off-by: Dean Nelson <[email protected]>
> >
> > ---
> >
> > One struct file_operations method registered by XPMEM, xpmem_open(), calls
> > 'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
> > 'put_task_struct(tg->group_leader)'.
>
> Does it? Well, then open the file in question and start doing close(dup(fd))
> in a loop. Won't take long for an oops...
Actually it won't oops. And that's because when the file is opened,
xpmem_open() creates a structure for that thread group, and when
xpmem_flush() is called on the close() it first looks for that structure
and if it finds it then it does what it needs to do (which includes the
put_task_struct() call) and then finishes off by destroying the structure.
So for subsequent closes xpmem_flush() returns without calling
put_task_struct().
On Mon, Aug 27, 2007 at 01:10:56PM -0500, Dean Nelson wrote:
> On Mon, Aug 27, 2007 at 05:13:28PM +0100, Al Viro wrote:
> > On Mon, Aug 27, 2007 at 10:59:33AM -0500, Dean Nelson wrote:
> > > This patch exports __put_task_struct as it is needed by XPMEM.
> > >
> > > Signed-off-by: Dean Nelson <[email protected]>
> > >
> > > ---
> > >
> > > One struct file_operations method registered by XPMEM, xpmem_open(), calls
> > > 'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
> > > 'put_task_struct(tg->group_leader)'.
> >
> > Does it? Well, then open the file in question and start doing close(dup(fd))
> > in a loop. Won't take long for an oops...
>
> Actually it won't oops. And that's because when the file is opened,
> xpmem_open() creates a structure for that thread group, and when
> xpmem_flush() is called on the close() it first looks for that structure
> and if it finds it then it does what it needs to do (which includes the
> put_task_struct() call) and then finishes off by destroying the structure.
> So for subsequent closes xpmem_flush() returns without calling
> put_task_struct().
Then what kind of protection does it get you? It can be called immediately
after the call of ->open(), so you can't rely on it being there for any
operations. Makes no sense...
On Mon, Aug 27, 2007 at 07:15:44PM +0100, Al Viro wrote:
> On Mon, Aug 27, 2007 at 01:10:56PM -0500, Dean Nelson wrote:
> > On Mon, Aug 27, 2007 at 05:13:28PM +0100, Al Viro wrote:
> > > On Mon, Aug 27, 2007 at 10:59:33AM -0500, Dean Nelson wrote:
> > > > This patch exports __put_task_struct as it is needed by XPMEM.
> > > >
> > > > Signed-off-by: Dean Nelson <[email protected]>
> > > >
> > > > ---
> > > >
> > > > One struct file_operations method registered by XPMEM, xpmem_open(), calls
> > > > 'get_task_struct(current->group_leader)' and another, xpmem_flush(), calls
> > > > 'put_task_struct(tg->group_leader)'.
> > >
> > > Does it? Well, then open the file in question and start doing close(dup(fd))
> > > in a loop. Won't take long for an oops...
> >
> > Actually it won't oops. And that's because when the file is opened,
> > xpmem_open() creates a structure for that thread group, and when
> > xpmem_flush() is called on the close() it first looks for that structure
> > and if it finds it then it does what it needs to do (which includes the
> > put_task_struct() call) and then finishes off by destroying the structure.
> > So for subsequent closes xpmem_flush() returns without calling
> > put_task_struct().
>
> Then what kind of protection does it get you? It can be called immediately
> after the call of ->open(), so you can't rely on it being there for any
> operations. Makes no sense...
No operations can be done once it's closed, only while it's opened.
At fault time we need to be able to fault pages that belong to another
thread group. To call get_user_pages() we need to have the task_struct
and the mm_struct for the other thread group.
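Concretely, both pointers feed the call itself; a sketch using the 2.6
get_user_pages() signature, where tg is XPMEM's per-thread-group structure
and vaddr is the user address in the other thread group:

	struct page *page;
	int ret;

	down_read(&tg->mm->mmap_sem);
	ret = get_user_pages(tg->group_leader, tg->mm, vaddr & PAGE_MASK,
			     1, 1 /* write */, 0 /* force */, &page, NULL);
	up_read(&tg->mm->mmap_sem);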
At one time we did look up the task_struct by pid, but that brought the
system to its knees due to heavy lock contention.
Do you know of a better way of doing this?
On Mon, Aug 27, 2007 at 02:19:06PM -0500, Dean Nelson wrote:
> No operations can be done once it's closed, only while it's opened.
What the hell do you mean, can't be done?
fd = open(...);
fp = popen("/bin/date", "r");
/* read from fp */
fclose(fp);
do operations on fd
And you've got yourself
open
fork duplicating descriptor
exit closing that duplicate
IO on original, after call of flush
On Mon, Aug 27, 2007 at 11:41:12AM -0500, Dean Nelson wrote:
> This patch has been bzip2'd and included as an attachment due to size.
200 kB shouldn't be a problem
(AFAIR the current limit on linux-kernel is 400 kB).
cu
Adrian
On Mon, Aug 27, 2007 at 08:35:10PM +0100, Al Viro wrote:
> On Mon, Aug 27, 2007 at 02:19:06PM -0500, Dean Nelson wrote:
>
> > No operations can be done once it's closed, only while it's opened.
>
> What the hell do you mean, can't be done?
>
> fd = open(...);
> fp = popen("/bin/date", "r");
> /* read from fp */
> fclose(fp);
But this will operate on the dup'd fd. We detect that in the flush
(ignore) and ioctl (return errors) operations. All other operations
are not handled by xpmem.
If you look at the fourth patch, at the beginning of the xpmem_flush
function, we have:
+	tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
+	if (IS_ERR(tg))
+		return 0;	/* probably child process who inherited fd */
This will protect the xpmem structures of the parent from closes by
the child.
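In other words, the scenario this check guards against is (a user-space
sketch; the device path is illustrative):

	int fd = open("/dev/xpmem", O_RDWR);	/* tg created for this tgid */
	if (fork() == 0)
		_exit(0);	/* the child's implicit close calls ->flush(),
				 * but current->tgid is the child's, so the
				 * tg lookup fails and flush is a no-op */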
Thanks,
Robin
On Mon, Aug 27, 2007 at 03:24:20PM -0500, Robin Holt wrote:
> On Mon, Aug 27, 2007 at 08:35:10PM +0100, Al Viro wrote:
> > On Mon, Aug 27, 2007 at 02:19:06PM -0500, Dean Nelson wrote:
> >
> > > No operations can be done once it's closed, only while it's opened.
> >
> > What the hell do you mean, can't be done?
> >
> > fd = open(...);
> > fp = popen("/bin/date", "r");
> > /* read from fp */
> > fclose(fp);
>
> But this will operate on the dup'd fd. We detect that in the flush
> (ignore) and ioctl (return errors) operations. All other operations
> are not handled by xpmem.
How the hell do you detect dup'd fd? It's identical to the original
in every respect and it doesn't have to be held by a different task.
Seriously, what you are proposing makes no sense whatsoever...
On Mon, Aug 27, 2007 at 01:10:56PM -0500, Dean Nelson wrote:
> > Does it? Well, then open the file in question and start doing close(dup(fd))
> > in a loop. Won't take long for an oops...
>
> Actually it won't oops. And that's because when the file is opened,
> xpmem_open() creates a structure for that thread group, and when
> xpmem_flush() is called on the close() it first looks for that structure
> and if it finds it then it does what it needs to do (which includes the
> put_task_struct() call) and then finishes off by destroying the structure.
> So for subsequent closes xpmem_flush() returns without calling
> put_task_struct().
Your refcounting is totally broken. fds can be passed around in the same
process group (which, btw, is not something a driver should look at, because
there are variants of different kinds of process groups depending on clone
flags), and your driver is going to go boom in most cases.
We don't export the routines to acquire a reference to task structs for
a reason, and this piece of junk called xpmem is not going to change it.
Big fat NACK, for dirty VM tricks, playing with task_struct lifetimes,
and last but not least the horrible ioctl "API".
On Tue, Aug 28, 2007 at 07:02:35PM +0100, Christoph Hellwig wrote:
> Big fat NACK, for dirty VM tricks, playing with task_struct lifetimes,
> and last but not least the horrible ioctl "API".
The ioctl is sort of historical. IIRC, in ProPack 3 (RHEL4 based 2.4
kernel), we added system calls. When the community started making noise
about system calls being bad, we went to a device special file with a
read/write (couldn't get the needed performance from the ioctl() interface
which used to acquire the BKL). Now that the community fixed the ioctl
issues, we went to using an ioctl, but are completely open to change.
If you want to introduce system calls, we would expect to need, IIRC, 8.
We also pondered an xpmem filesystem today. It really felt wrong,
but we could pursue that as an alternative.
What is the correct direction to go with this? get_user_pages() does
currently require the task_struct. Are you proposing we develop a way
to fault pages without the task_struct of the owning process/thread group?
Thanks,
Robin
On Tue, Aug 28, 2007 at 02:00:43PM -0500, Robin Holt wrote:
> The ioctl is sort of historical. IIRC, in ProPack 3 (RHEL4 based 2.4
> kernel), we added system calls. When the community started making noise
> about system calls being bad, we went to a device special file with a
> read/write (couldn't get the needed performance from the ioctl() interface
> which used to acquire the BKL). Now that the community fixed the ioctl
> issues, we went to using an ioctl, but are completely open to change.
>
> If you want to introduce system calls, we would expect to need, IIRC, 8.
> We also pondered an xpmem filesystem today. It really felt wrong,
> but we could pursue that as an alternative.
The problem is not ioctls per se, but the kind of operation you
export.
> What is the correct direction to go with this? get_user_pages() does
> currently require the task_struct. Are you proposing we develop a way
> to fault pages without the task_struct of the owning process/thread group?
Stop trying to mess with vmas and get_user_pages on processes entirely.
The only region of virtual memory a driver can deal with is the one it
got an mmap request for or, when using get_user_pages, the one it's got
a read/write request for. You're doing a worse variant of the rdma page
pinning scheme we've rejected countless times.
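For contrast, the sanctioned pattern confines a driver to the vma handed
to its own mmap method, roughly like this (illustrative driver names, not
a real interface):

	static unsigned long mydrv_pfn_base;	/* hypothetical device memory */

	static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		/* the driver only ever touches this vma, which userspace
		 * explicitly handed to it via mmap(2) */
		return remap_pfn_range(vma, vma->vm_start, mydrv_pfn_base,
				       size, vma->vm_page_prot);
	}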