LinuxLists.cc - [PATCH] Use MPOL_INTERLEAVE for tmpfs files

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

> This patch causes memory allocation for tmpfs files to be distributed
> evenly across NUMA machines. In most circumstances today, tmpfs files
> will be allocated on the same node as the task writing to the file.
> In many cases, particularly when large files are created, or a large
> number of files are created by a single task, this leads to a severe
> imbalance in free memory amongst nodes. This patch corrects that
> situation.

Yeah, but it also ruins your locality of reference (in a NUMA sense).
Not convinced that's a good idea. You're guaranteeing universally consistent
worse-case performance for everyone. And you're only looking at a situation
where there's one allocator on the system, and that's imbalanced.

You WANT your data to be local. That's the whole idea.

M.

2004-11-02 16:08:59

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
> > This patch causes memory allocation for tmpfs files to be distributed
> > evenly across NUMA machines. In most circumstances today, tmpfs files
> > will be allocated on the same node as the task writing to the file.
> > In many cases, particularly when large files are created, or a large
> > number of files are created by a single task, this leads to a severe
> > imbalance in free memory amongst nodes. This patch corrects that
> > situation.
>
> Yeah, but it also ruins your locality of reference (in a NUMA sense).
> Not convinced that's a good idea. You're guaranteeing universally consistent
> worse-case performance for everyone. And you're only looking at a situation
> where there's one allocator on the system, and that's imbalanced.
>
> You WANT your data to be local. That's the whole idea.

I think it depends on how you use tmpfs. When you use it for read/write
it's a good idea because you likely don't care about a bit of additional
latency and it's better to not fill up your local nodes with temporary
files.

If you use it with mmap then you likely want local policy.

But that's a big ugly to distingush, that is why I suggested the sysctl.

-Andi

2004-11-02 17:05:58

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

--Andi Kleen <[email protected]> wrote (on Tuesday, November 02, 2004 16:55:07 +0100):

> On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
>> > This patch causes memory allocation for tmpfs files to be distributed
>> > evenly across NUMA machines. In most circumstances today, tmpfs files
>> > will be allocated on the same node as the task writing to the file.
>> > In many cases, particularly when large files are created, or a large
>> > number of files are created by a single task, this leads to a severe
>> > imbalance in free memory amongst nodes. This patch corrects that
>> > situation.
>>
>> Yeah, but it also ruins your locality of reference (in a NUMA sense).
>> Not convinced that's a good idea. You're guaranteeing universally consistent
>> worse-case performance for everyone. And you're only looking at a situation
>> where there's one allocator on the system, and that's imbalanced.
>>
>> You WANT your data to be local. That's the whole idea.
>
> I think it depends on how you use tmpfs. When you use it for read/write
> it's a good idea because you likely don't care about a bit of additional
> latency and it's better to not fill up your local nodes with temporary
> files.
>
> If you use it with mmap then you likely want local policy.
>
> But that's a big ugly to distingush, that is why I suggested the sysctl.

As long as it defaults to off, I guess I don't really care. Though I'm still
wholly unconvinced it makes much sense. I think we're just papering over the
underlying problem - that we don't do good balancing between nodes under
mem pressure.

M.

2004-11-02 23:01:09

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

> I fully agree with Martin's statement "You WANT your data to be local."
> However, it can become non-local in two different manners, each of
> which wants tmpfs to behave differently. (The next two paragraphs are
> simply to summarize, no position is being taken.)
>
> The manner I'm concerned about is when a long-lived file (long-lived
> meaning at least for the duration of the run of a large multithreaded app)
> is placed in memory as an accidental artifact of the CPU which happened
> to create the file.

Agreed, I see that point - if it's a globally accessed file that's
created by one CPU, you want it spread around. However ... how the hell
is the VM meant to distinguish that? The correct way is for the application
to tell us that - ie use the NUMA API.

> Imagine, for instance, an application startup
> script copying a large file into a tmpfs filesystem before spawning
> the actual computation application itself. There is a large class of
> HPC applications which are tightly synchronized, and which require nearly
> all of the system memory (i.e. almost all the memory on each node).
> In this case the "victim" node will be forced to go off-node for the
> bulk of its memory accesses, destroying locality, and slowing down
> the entire application. This problem is alleviated if the tmpfs file
> is fairly well distributed across the nodes.

There are definitely situations where you'd want both, I'd agree.

> But I do understand the opposite situation. If a lightly-threaded
> application wants to access the file data, or the application does
> not require large amounts of additional memory (i.e. nearly all the
> memory on each node) getting the file itself allocated close to the
> processor is more beneficial. In this case the distribution of tmpfs
> pages is non-ideal (though I'm not sure it's worst-case).

Well, it's not quite worst case ... you'll only get (n-1)/n of your pages
off node (where n is number of nodes). I guess we could deliberately
force ALL of them off-node just to make sure ;-)

>> > But that's a big ugly to distingush, that is why I suggested the sysctl.
>>
>> As long as it defaults to off, I guess I don't really care. Though I'm still
>> wholly unconvinced it makes much sense. I think we're just papering over the
>> underlying problem - that we don't do good balancing between nodes under
>> mem pressure.
>
> It's a tough situation, as shown above. The HPC workload I mentioned
> would much prefer the tmpfs file to be distributed. A non-HPC workload
> would prefer the tmpfs files be local. Short of a sysctl I'm not sure
> how the system could make an intelligent decision about what to do under
> memory pressure -- it simply isn't knowledge the kernel can have.

It is if you tell it from the app ;-) But otherwise yes, I'd agree.

> I've got a new patch including Andi's suggested sysctl ready to go.
> But I've seen one vote for defaulting to on, and one for defaulting
> to off. I'd vote for on, but then again I'm biased. :) Who arbitrates
> this one, Hugh Dickins (it's a tmpfs change, after all)? From SGI's
> perspective we can live with it either way; we'll simply need to
> document our recommendations for our customers.

I guess Hugh or Andrew. But I'm very relucant to see the current default
changed to benefit one particular set of workloads ... unless there's
overwhelming evidence that those workloads are massively predominant.
A per-arch thing might make sense I suppose if you're convinced that
all ia64 NUMA machines will ever run is HPC apps.

> As soon as I know whether this should default on or off, I'll post
> the new patch, including the sysctl.

Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
personally, but maybe others wouldn't. Hugh, is that nuts?

M.

2004-11-03 00:33:40

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 2 Nov 2004, Martin J. Bligh wrote:

> --Andi Kleen <[email protected]> wrote (on Tuesday, November 02, 2004 16:55:07 +0100):
>
> > On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
> >> > This patch causes memory allocation for tmpfs files to be distributed
> >> > evenly across NUMA machines. In most circumstances today, tmpfs files
> >> > will be allocated on the same node as the task writing to the file.
> >> > In many cases, particularly when large files are created, or a large
> >> > number of files are created by a single task, this leads to a severe
> >> > imbalance in free memory amongst nodes. This patch corrects that
> >> > situation.
> >>
> >> Yeah, but it also ruins your locality of reference (in a NUMA sense).
> >> Not convinced that's a good idea. You're guaranteeing universally consistent
> >> worse-case performance for everyone. And you're only looking at a situation
> >> where there's one allocator on the system, and that's imbalanced.
> >>
> >> You WANT your data to be local. That's the whole idea.
> >
> > I think it depends on how you use tmpfs. When you use it for read/write
> > it's a good idea because you likely don't care about a bit of additional
> > latency and it's better to not fill up your local nodes with temporary
> > files.
> >
> > If you use it with mmap then you likely want local policy.

Yeah, what Andi said. :)

I fully agree with Martin's statement "You WANT your data to be local."
However, it can become non-local in two different manners, each of
which wants tmpfs to behave differently. (The next two paragraphs are
simply to summarize, no position is being taken.)

The manner I'm concerned about is when a long-lived file (long-lived
meaning at least for the duration of the run of a large multithreaded app)
is placed in memory as an accidental artifact of the CPU which happened
to create the file. Imagine, for instance, an application startup
script copying a large file into a tmpfs filesystem before spawning
the actual computation application itself. There is a large class of
HPC applications which are tightly synchronized, and which require nearly
all of the system memory (i.e. almost all the memory on each node).
In this case the "victim" node will be forced to go off-node for the
bulk of its memory accesses, destroying locality, and slowing down
the entire application. This problem is alleviated if the tmpfs file
is fairly well distributed across the nodes.

But I do understand the opposite situation. If a lightly-threaded
application wants to access the file data, or the application does
not require large amounts of additional memory (i.e. nearly all the
memory on each node) getting the file itself allocated close to the
processor is more beneficial. In this case the distribution of tmpfs
pages is non-ideal (though I'm not sure it's worst-case).

> > But that's a big ugly to distingush, that is why I suggested the sysctl.
>
> As long as it defaults to off, I guess I don't really care. Though I'm still
> wholly unconvinced it makes much sense. I think we're just papering over the
> underlying problem - that we don't do good balancing between nodes under
> mem pressure.

It's a tough situation, as shown above. The HPC workload I mentioned
would much prefer the tmpfs file to be distributed. A non-HPC workload
would prefer the tmpfs files be local. Short of a sysctl I'm not sure
how the system could make an intelligent decision about what to do under
memory pressure -- it simply isn't knowledge the kernel can have.

I've got a new patch including Andi's suggested sysctl ready to go.
But I've seen one vote for defaulting to on, and one for defaulting
to off. I'd vote for on, but then again I'm biased. :) Who arbitrates
this one, Hugh Dickins (it's a tmpfs change, after all)? From SGI's
perspective we can live with it either way; we'll simply need to
document our recommendations for our customers.

As soon as I know whether this should default on or off, I'll post
the new patch, including the sysctl.

Thanks,
Brent

--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.

2004-11-03 01:13:09

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 2 Nov 2004, Martin J. Bligh wrote:

> > The manner I'm concerned about is when a long-lived file (long-lived
> > meaning at least for the duration of the run of a large multithreaded app)
> > is placed in memory as an accidental artifact of the CPU which happened
> > to create the file.
>
> Agreed, I see that point - if it's a globally accessed file that's
> created by one CPU, you want it spread around. However ... how the hell
> is the VM meant to distinguish that? The correct way is for the application
> to tell us that - ie use the NUMA API.

Right. The application can already tell us that without this patch,
but only if the file is mapped. Using writes there doesn't appear to
be any way to control this behavior (unless I overlooked something).
So it is impossible to use normal system utilities (i.e. cp, dd, tar)
and get appropriate placement.

This change gives us a method to get a different default (i.e write)
behavior.

> > It's a tough situation, as shown above. The HPC workload I mentioned
> > would much prefer the tmpfs file to be distributed. A non-HPC workload
> > would prefer the tmpfs files be local. Short of a sysctl I'm not sure
> > how the system could make an intelligent decision about what to do under
> > memory pressure -- it simply isn't knowledge the kernel can have.
>
> It is if you tell it from the app ;-) But otherwise yes, I'd agree.

Yep. Also, bear in mind that many applications/utilities are not going
to be very savvy in regard to tmpfs placement, and really shouldn't be.
I wouldn't expect a compiler to have special code to lay out the generated
objects in a friendly manner. Currently, all it takes is one unaware
application/utility to mess things up for us.

With local-by-default placement, as today, every single application and
utility needs to cooperate in order to be successful.

Which, I guess, might be something to consider (thinking out loud here)...

An application which uses the NUMA API obviously cares about placement.
If it causes data to be placed locally when it would otherwise be placed
globally, it will not cause a problem for a seperate application/utility
which doesn't care (much?) about locality.

However, if an application/utility which does not care (much?) about
locality fails to use the NUMA API and causes data to be placed locally,
it may very well cause a problem for a seperate application which does
care about locality.

Thus, to me, a default of spreading the allocation globally makes
sense. The burden of adding code to "do the right thing" falls to
the applications which care. No other program requires changes and
as such everything will "just work".

But, once again, I fess up to a bias. :)

> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> personally, but maybe others wouldn't. Hugh, is that nuts?

I kind of like that -- it shouldn't be too hard to stuff into the tmpfs
superblock. But I agree, Hugh knows better than I do.

Brent

--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.

2004-11-03 01:31:26

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

--On Tuesday, November 02, 2004 19:12:10 -0600 Brent Casavant <[email protected]> wrote:

> On Tue, 2 Nov 2004, Martin J. Bligh wrote:
>
>> > The manner I'm concerned about is when a long-lived file (long-lived
>> > meaning at least for the duration of the run of a large multithreaded app)
>> > is placed in memory as an accidental artifact of the CPU which happened
>> > to create the file.
>>
>> Agreed, I see that point - if it's a globally accessed file that's
>> created by one CPU, you want it spread around. However ... how the hell
>> is the VM meant to distinguish that? The correct way is for the application
>> to tell us that - ie use the NUMA API.
>
> Right. The application can already tell us that without this patch,
> but only if the file is mapped. Using writes there doesn't appear to
> be any way to control this behavior (unless I overlooked something).
> So it is impossible to use normal system utilities (i.e. cp, dd, tar)
> and get appropriate placement.

Ah yes, I see your point.

> This change gives us a method to get a different default (i.e write)
> behavior.

OK. Might still be more useful to have it per filehandle, but we'd need
a way to send info on an existing filehandle.

> Yep. Also, bear in mind that many applications/utilities are not going
> to be very savvy in regard to tmpfs placement, and really shouldn't be.
> I wouldn't expect a compiler to have special code to lay out the generated
> objects in a friendly manner. Currently, all it takes is one unaware
> application/utility to mess things up for us.

Well, on certain extreme workloads, yes. But not for most people.

> With local-by-default placement, as today, every single application and
> utility needs to cooperate in order to be successful.
>
> Which, I guess, might be something to consider (thinking out loud here)...

I don't like apps having to co-operate any more than you do, as they
obviously won't. However, remember that most workloads won't hit this,
and there are a couple of other approaches:

1) your global switch is making more sense to me now.
2) We can balance nodes under mem pressure (we don't currently)

Number 2 fixes a variety of evils, and we've been talking about fixing
that for a while (see the previous one on pagecache ... do we have to
fix each case?).

What scares me is that what all these discussions are pointing towards
is "we want global striping for all allocations, because Linux is too
crap to deal with cross-node balancing pressure". I don't want to see
us go that way ;-). Some things are more obviously global than others
(eg shmem is more likely to be shared) ... whether we think the usage
of tmpfs is local or global is very much up to debate though.

There are a few other indicators I could see us using as hints from the
OS that might help - the size of the file vs the amount of mem in the
node, for one. How many processes have the file open, and which nodes
they're on for another. Neither of which are flawless, or even terribly
simple, but we're making heuristic guesses.

> An application which uses the NUMA API obviously cares about placement.
> If it causes data to be placed locally when it would otherwise be placed
> globally, it will not cause a problem for a seperate application/utility
> which doesn't care (much?) about locality.

99.9% of apps won't be NUMA aware, nor should they have to be. Standard
code should just work out the box ... I'm not going to take the viewpoint
that these are only specialist servers running one or two apps - they'll
run all sorts of stuff, esp with the advent of AMD in the marketplace,
and hopefully Intel will get their ass into gear and do local mem soon
as well.

> However, if an application/utility which does not care (much?) about
> locality fails to use the NUMA API and causes data to be placed locally,
> it may very well cause a problem for a seperate application which does
> care about locality.
>
> Thus, to me, a default of spreading the allocation globally makes
> sense. The burden of adding code to "do the right thing" falls to
> the applications which care. No other program requires changes and
> as such everything will "just work".
>
> But, once again, I fess up to a bias. :)

Yeah ;-) I'm strongly against taking the road of "NUMA performance will
suck unless your app is NUMA aware" (ie default to non-local alloc).
OTOH, I don't like us crapping out under imbalance either. However,
I think there are other ways to solve this (see above).

>> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
>> personally, but maybe others wouldn't. Hugh, is that nuts?
>
> I kind of like that -- it shouldn't be too hard to stuff into the tmpfs
> superblock. But I agree, Hugh knows better than I do.

OK, that'd be under the sysadmin's control fairly easily at least.

M>

2004-11-03 08:44:57

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 2 Nov 2004, Martin J. Bligh wrote:
>
> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> personally, but maybe others wouldn't. Hugh, is that nuts?

Only nuts if I am, I was going to suggest the same: the sysctl idea seems
very inadequate; a mount option at least allows the possibility of having
different tmpfs files allocated with different policies at the same time.

But I'm not usually qualified to comment on NUMA matters, and my tmpfs
maintenance shouldn't be allowed to get in the way of progress. Plus
I've barely been attending in recent days: back to normality tomorrow.

Hugh

2004-11-03 09:01:22

by Andi Kleen

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Wed, Nov 03, 2004 at 08:44:32AM +0000, Hugh Dickins wrote:
> On Tue, 2 Nov 2004, Martin J. Bligh wrote:
> >
> > Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> > personally, but maybe others wouldn't. Hugh, is that nuts?
>
> Only nuts if I am, I was going to suggest the same: the sysctl idea seems
> very inadequate; a mount option at least allows the possibility of having
> different tmpfs files allocated with different policies at the same time.
>
> But I'm not usually qualified to comment on NUMA matters, and my tmpfs
> maintenance shouldn't be allowed to get in the way of progress. Plus
> I've barely been attending in recent days: back to normality tomorrow.

If you want to go more finegraid then you can always use numactl
or even libnuma in the application. For a quick policy decision a sysctl
is fine imho.

-Andi

2004-11-03 16:35:43

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Wed, 3 Nov 2004, Andi Kleen wrote:

> If you want to go more finegraid then you can always use numactl
> or even libnuma in the application. For a quick policy decision a sysctl
> is fine imho.

OK, so I'm not seeing a definitive stance by the interested parties
either way. So since the code's already done, I'm posting the sysctl
method, and defaulting to on. I assume that if we later decide that
a mount option was correct after all, that it's no big deal to axe the
sysctl?

The sysctl code in this patch is based on work originally done by
Andi. It has been changed a bit, mostly to make it appear only
in CONFIG_NUMA && CONFIG_TMPFS kernels.

Signed-off-by: Brent Casavant <[email protected]>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/mempolicy.c 2004-11-03 10:26:30.000000000 -0600
@@ -1027,6 +1027,28 @@
return 0;
}

+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave)
+{
+ info->root = RB_ROOT;
+ init_MUTEX(&info->sem);
+
+ if (unlikely(interleave)) {
+ struct mempolicy *newpol;
+
+ /* Falls back to MPOL_DEFAULT on any error */
+ newpol = mpol_new(MPOL_INTERLEAVE, nodes_addr(node_online_map));
+ if (likely(!IS_ERR(newpol))) {
+ /* Create pseudo-vma that contains just the policy */
+ struct vm_area_struct pvma;
+
+ memset(&pvma, 0, sizeof(struct vm_area_struct));
+ /* Policy covers entire file */
+ pvma.vm_end = ~0UL;
+ mpol_set_shared_policy(info, &pvma, newpol);
+ }
+ }
+}
+
int mpol_set_shared_policy(struct shared_policy *info,
struct vm_area_struct *vma, struct mempolicy *npol)
{
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/fs/hugetlbfs/inode.c 2004-11-03 10:26:30.000000000 -0600
@@ -384,7 +384,7 @@
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
info = HUGETLBFS_I(inode);
- mpol_shared_policy_init(&info->policy);
+ mpol_shared_policy_init(&info->policy, 0);
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
Index: linux/include/linux/mempolicy.h
===================================================================
--- linux.orig/include/linux/mempolicy.h 2004-11-03 10:24:16.000000000 -0600
+++ linux/include/linux/mempolicy.h 2004-11-03 10:26:30.000000000 -0600
@@ -137,11 +137,7 @@
struct semaphore sem;
};

-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
- info->root = RB_ROOT;
- init_MUTEX(&info->sem);
-}
+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave);

int mpol_set_shared_policy(struct shared_policy *info,
struct vm_area_struct *vma,
@@ -198,7 +194,8 @@
return -EINVAL;
}

-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+ unsigned interleave)
{
}

Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/shmem.c 2004-11-03 10:26:30.000000000 -0600
@@ -72,6 +72,12 @@
/* Keep swapped page count in private field of indirect struct page */
#define nr_swapped private

+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+int sysctl_tmpfs_rr = 1;
+#else
+#define sysctl_tmpfs_rr (0)
+#endif
+
/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
enum sgp_type {
SGP_QUICK, /* don't try more than file page cache lookup */
@@ -1236,7 +1242,7 @@
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
- mpol_shared_policy_init(&info->policy);
+ mpol_shared_policy_init(&info->policy, sbinfo ? sysctl_tmpfs_rr : 0);
INIT_LIST_HEAD(&info->swaplist);

switch (mode & S_IFMT) {
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c 2004-11-03 10:24:20.000000000 -0600
+++ linux/kernel/sysctl.c 2004-11-03 10:26:30.000000000 -0600
@@ -74,6 +74,10 @@
void __user *, size_t *, loff_t *);
#endif

+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+extern int sysctl_tmpfs_rr;
+#endif
+
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
static int minolduid;
@@ -622,6 +626,16 @@
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_unknown_nmi_panic,
+ },
+#endif
+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+ {
+ .ctl_name = KERN_NUMA_TMPFS_RR,
+ .procname = "numa-tmpfs-rr",
+ .data = &sysctl_tmpfs_rr,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
},
#endif
{ .ctl_name = 0 }
Index: linux/include/linux/sysctl.h
===================================================================
--- linux.orig/include/linux/sysctl.h 2004-11-03 10:26:20.000000000 -0600
+++ linux/include/linux/sysctl.h 2004-11-03 10:26:41.000000000 -0600
@@ -134,6 +134,7 @@
KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
KERN_HZ_TIMER=65, /* int: hz timer on or off */
KERN_UNKNOWN_NMI_PANIC=66, /* int: unknown nmi panic flag */
+ KERN_NUMA_TMPFS_RR=67, /* int: NUMA interleave tmpfs allocations */
};

--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.

2004-11-03 21:07:05

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

--On Wednesday, November 03, 2004 10:32:56 -0600 Brent Casavant <[email protected]> wrote:

> On Wed, 3 Nov 2004, Andi Kleen wrote:
>
>> If you want to go more finegraid then you can always use numactl
>> or even libnuma in the application. For a quick policy decision a sysctl
>> is fine imho.
>
> OK, so I'm not seeing a definitive stance by the interested parties
> either way. So since the code's already done, I'm posting the sysctl
> method, and defaulting to on. I assume that if we later decide that
> a mount option was correct after all, that it's no big deal to axe the
> sysctl?

Matt has volunteered to write the mount option for this, so let's hold
off for a couple of days until that's done.

M

2004-11-08 19:59:54

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Wed, 3 Nov 2004, Martin J. Bligh wrote:

> Matt has volunteered to write the mount option for this, so let's hold
> off for a couple of days until that's done.

I had the time to do this myself. Updated patch attached below.

Description:

This patch creates a tmpfs mount option and adds code which causes
tmpfs allocations to be interleaved across the nodes of a NUMA machine.
This is useful in situations where a large tmpfs file would otherwise
consume most of the memory on a single node, forcing tasks running on
that node to perform off-node allocations for other (i.e. non-tmpfs)
memory needs.

Tightly synchronized HPC applications with large (on the order of
a single node's total RAM) per-task memory requirements are an example
of a type of application which benefits from this change. Other
types of workloads may prefer local tmpfs allocations, thus a mount
option is provided.

The new mount option is "interleave=", and defaults to 0. Any non-zero
setting turns on interleaving. Interleaving behavior can be changed
via a remount operation.

Signed-off-by: Brent Casavant <[email protected]>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/mempolicy.c 2004-11-03 10:26:30.000000000 -0600
@@ -1027,6 +1027,28 @@
return 0;
}

+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave)
+{
+ info->root = RB_ROOT;
+ init_MUTEX(&info->sem);
+
+ if (unlikely(interleave)) {
+ struct mempolicy *newpol;
+
+ /* Falls back to MPOL_DEFAULT on any error */
+ newpol = mpol_new(MPOL_INTERLEAVE, nodes_addr(node_online_map));
+ if (likely(!IS_ERR(newpol))) {
+ /* Create pseudo-vma that contains just the policy */
+ struct vm_area_struct pvma;
+
+ memset(&pvma, 0, sizeof(struct vm_area_struct));
+ /* Policy covers entire file */
+ pvma.vm_end = ~0UL;
+ mpol_set_shared_policy(info, &pvma, newpol);
+ }
+ }
+}
+
int mpol_set_shared_policy(struct shared_policy *info,
struct vm_area_struct *vma, struct mempolicy *npol)
{
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/fs/hugetlbfs/inode.c 2004-11-03 10:26:30.000000000 -0600
@@ -384,7 +384,7 @@
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
info = HUGETLBFS_I(inode);
- mpol_shared_policy_init(&info->policy);
+ mpol_shared_policy_init(&info->policy, 0);
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
Index: linux/include/linux/mempolicy.h
===================================================================
--- linux.orig/include/linux/mempolicy.h 2004-11-03 10:24:16.000000000 -0600
+++ linux/include/linux/mempolicy.h 2004-11-03 10:26:30.000000000 -0600
@@ -137,11 +137,7 @@
struct semaphore sem;
};

-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
- info->root = RB_ROOT;
- init_MUTEX(&info->sem);
-}
+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave);

int mpol_set_shared_policy(struct shared_policy *info,
struct vm_area_struct *vma,
@@ -198,7 +194,8 @@
return -EINVAL;
}

-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+ unsigned interleave)
{
}

Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c 2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/shmem.c 2004-11-05 17:41:46.000000000 -0600
@@ -66,6 +66,9 @@
#define SHMEM_PAGEIN VM_READ
#define SHMEM_TRUNCATE VM_WRITE

+/* sbinfo flags */
+#define SHMEM_INTERLEAVE 0x1 /* Interleave tmpfs allocations */
+
/* Pretend that each entry is of this size in directory's i_size */
#define BOGO_DIRENT_SIZE 20

@@ -1236,7 +1239,8 @@
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
- mpol_shared_policy_init(&info->policy);
+ mpol_shared_policy_init(&info->policy,
+ sbinfo && (sbinfo->flags & SHMEM_INTERLEAVE) ? 1 : 0);
INIT_LIST_HEAD(&info->swaplist);

switch (mode & S_IFMT) {
@@ -1784,7 +1788,9 @@
#endif
};

-static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes)
+static int shmem_parse_options(char *options, int *mode, uid_t *uid,
+ gid_t *gid, unsigned long *blocks,
+ unsigned long *inodes, int *interleave)
{
char *this_char, *value, *rest;

@@ -1838,6 +1844,12 @@
*gid = simple_strtoul(value,&rest,0);
if (*rest)
goto bad_val;
+ } else if (!strcmp(this_char,"interleave")) {
+ if (!interleave)
+ continue;
+ *interleave = (0 != simple_strtoul(value,&rest,0));
+ if (*rest)
+ goto bad_val;
} else {
printk(KERN_ERR "tmpfs: Bad mount option %s\n",
this_char);
@@ -1858,12 +1870,14 @@
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
unsigned long max_blocks = 0;
unsigned long max_inodes = 0;
+ int interleave = 0;

if (sbinfo) {
max_blocks = sbinfo->max_blocks;
max_inodes = sbinfo->max_inodes;
}
- if (shmem_parse_options(data, NULL, NULL, NULL, &max_blocks, &max_inodes))
+ if (shmem_parse_options(data, NULL, NULL, NULL, &max_blocks,
+ &max_inodes, &interleave))
return -EINVAL;
/* Keep it simple: disallow limited <-> unlimited remount */
if ((max_blocks || max_inodes) == !sbinfo)
@@ -1871,6 +1885,12 @@
/* But allow the pointless unlimited -> unlimited remount */
if (!sbinfo)
return 0;
+ /* Allow change of interleaving */
+ if (interleave)
+ sbinfo->flags |= SHMEM_INTERLEAVE;
+ else
+ sbinfo->flags &= ~SHMEM_INTERLEAVE;
+
return shmem_set_size(sbinfo, max_blocks, max_inodes);
}
#endif
@@ -1900,6 +1920,7 @@
#ifdef CONFIG_TMPFS
unsigned long blocks = 0;
unsigned long inodes = 0;
+ int interleave = 0;

/*
* Per default we only allow half of the physical ram per
@@ -1912,12 +1933,12 @@
if (inodes > blocks)
inodes = blocks;

- if (shmem_parse_options(data, &mode,
- &uid, &gid, &blocks, &inodes))
+ if (shmem_parse_options(data, &mode, &uid, &gid, &blocks,
+ &inodes, &interleave))
return -EINVAL;
}

- if (blocks || inodes) {
+ if (blocks || inodes || interleave) {
struct shmem_sb_info *sbinfo;
sbinfo = kmalloc(sizeof(struct shmem_sb_info), GFP_KERNEL);
if (!sbinfo)
@@ -1928,6 +1949,9 @@
sbinfo->free_blocks = blocks;
sbinfo->max_inodes = inodes;
sbinfo->free_inodes = inodes;
+ sbinfo->flags = 0;
+ if (interleave)
+ sbinfo->flags |= SHMEM_INTERLEAVE;
}
sb->s_xattr = shmem_xattr_handlers;
#endif
Index: linux/include/linux/shmem_fs.h
===================================================================
--- linux.orig/include/linux/shmem_fs.h 2004-10-18 16:54:55.000000000 -0500
+++ linux/include/linux/shmem_fs.h 2004-11-05 17:35:25.000000000 -0600
@@ -26,6 +26,7 @@
unsigned long free_blocks; /* How many are left for allocation */
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
+ unsigned long flags;
spinlock_t stat_lock;
};

--
Brent Casavant If you had nothing to fear,
[email protected] how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars

2004-11-08 21:01:27

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

>> Matt has volunteered to write the mount option for this, so let's hold
>> off for a couple of days until that's done.
>
> I had the time to do this myself. Updated patch attached below.
>
> Description:
>
> This patch creates a tmpfs mount option and adds code which causes
> tmpfs allocations to be interleaved across the nodes of a NUMA machine.
> This is useful in situations where a large tmpfs file would otherwise
> consume most of the memory on a single node, forcing tasks running on
> that node to perform off-node allocations for other (i.e. non-tmpfs)
> memory needs.
>
> Tightly synchronized HPC applications with large (on the order of
> a single node's total RAM) per-task memory requirements are an example
> of a type of application which benefits from this change. Other
> types of workloads may prefer local tmpfs allocations, thus a mount
> option is provided.
>
> The new mount option is "interleave=", and defaults to 0. Any non-zero
> setting turns on interleaving. Interleaving behavior can be changed
> via a remount operation.

Looks good to me, though I don't really know that area of code very well.
Hopefully Hugh or someone can give it a better review.

Thanks for doing that,

M.

2004-11-09 19:04:49

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Mon, 8 Nov 2004, Brent Casavant wrote:
> On Wed, 3 Nov 2004, Martin J. Bligh wrote:
>
> > Matt has volunteered to write the mount option for this, so let's hold
> > off for a couple of days until that's done.
>
> I had the time to do this myself. Updated patch attached below.

Looks pretty good to me.

Doesn't quite play right with what was my "NULL sbinfo" convention.
Given this mpol patch of yours, and Adam's devfs patch, it's becoming
clear that my "NULL sbinfo" was unhelpful, making life harder for both
of you to add things into the tmpfs superblock - unchanged since 2.4.0,
as soon as I mess with it, people come up with valid new uses for it.

Not to say that your patch or Adam's will go further (I've no objection
to the way Adam is using tmpfs, but no opinion on the future of devfs),
but they're two hints that I should rework that to get out of people's
way. I'll do a patch for that, then another something like yours on
top, for you to go back and check.

I think the option should be "mpol=interleave" rather than just
"interleave", who knows what baroque mpols we might want to support
there in future?

I'm irritated to realize that we can't change the default for SysV
shared memory or /dev/zero this way, because that mount is internal.
But neither you nor Andi were wanting that, so okay, never mind;
could use his sysctl instead if the need emerges.

At one time (August) you were worried about MPOL_INTERLEAVE
overloading node 0 on small files - is that still a worry?
Perhaps you skirt that issue in recommending this option
for use with giant files.

There are quite a lot of mpol patches flying around, aren't there?
>From Ray Bryant and from Steve Longerbeam. Would this tmpfs patch
make (adaptable) sense if we went either or both of those ways - or
have they been knocked on the head? I don't mean in the details
(I think one of them does away with the pseudo-vma stuff - great!),
but in basic design - would your mount option mesh together well
with them, or would it be adding a further layer of confusion?

Hugh

2004-11-09 20:10:54

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Sounds sensible.

> I'm irritated to realize that we can't change the default for SysV
> shared memory or /dev/zero this way, because that mount is internal.

Boggle. shmem I can perfectly understand, and have been intending to
change for a while. But why /dev/zero ? Presumably you'd always want
that local?

Thanks,

M.

2004-11-09 21:08:33

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 9 Nov 2004, Martin J. Bligh wrote:
>
> > I'm irritated to realize that we can't change the default for SysV
> > shared memory or /dev/zero this way, because that mount is internal.
>
> Boggle. shmem I can perfectly understand, and have been intending to
> change for a while. But why /dev/zero ? Presumably you'd always want
> that local?

I was meaning the mmap shared writable of /dev/zero, to get memory
shared between parent and child and descendants, a restricted form
of shared memory. I was thinking of them running on different cpus,
you're suggesting they'd at least be on the same node. I dare say,
I don't know. I'm not desperate to be able to set some other mpol
default for all of them (and each object can be set in the established
way), just would have been happier if the possibility of doing so came
for free with the mount option work.

Hugh

2004-11-09 22:08:20

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

--On Tuesday, November 09, 2004 21:08:11 +0000 Hugh Dickins <[email protected]> wrote:

> On Tue, 9 Nov 2004, Martin J. Bligh wrote:
>>
>> > I'm irritated to realize that we can't change the default for SysV
>> > shared memory or /dev/zero this way, because that mount is internal.
>>
>> Boggle. shmem I can perfectly understand, and have been intending to
>> change for a while. But why /dev/zero ? Presumably you'd always want
>> that local?
>
> I was meaning the mmap shared writable of /dev/zero, to get memory
> shared between parent and child and descendants, a restricted form
> of shared memory. I was thinking of them running on different cpus,
> you're suggesting they'd at least be on the same node. I dare say,
> I don't know. I'm not desperate to be able to set some other mpol
> default for all of them (and each object can be set in the established
> way), just would have been happier if the possibility of doing so came
> for free with the mount option work.

Oh yeah ... the anon mem allocator trick. Mmmm. Not sure that should
have a different default than normal alloced memory, but either way,
what you're suggesting makes a whole lot more sense to me know than
just straight /dev/zero ;-)

Thanks for the explanation.

M.

2004-11-10 02:48:15

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

Argh. I fatfingered my mail client and deleted my response rather
than send it this morning. Sorry for the delay.

On Tue, 9 Nov 2004, Hugh Dickins wrote:

> Doesn't quite play right with what was my "NULL sbinfo" convention.

Howso? I thought it played quite nicely with it. We've been using
NULL sbinfo as an indicator that an inode is from tmpfs rather than
from SysV or /dev/zero. Or at least that's the way my brain was
wrapped around it.

> Given this mpol patch of yours, and Adam's devfs patch, it's becoming
> clear that my "NULL sbinfo" was unhelpful, making life harder for both
> of you to add things into the tmpfs superblock - unchanged since 2.4.0,
> as soon as I mess with it, people come up with valid new uses for it.

I haven't seen that other patch, but in this case I didn't see a problem.
The NULL sbinfo scheme worked perfectly for me, with very little hassle.

> Not to say that your patch or Adam's will go further (I've no objection
> to the way Adam is using tmpfs, but no opinion on the future of devfs),
> but they're two hints that I should rework that to get out of people's
> way. I'll do a patch for that, then another something like yours on
> top, for you to go back and check.

Is this something imminent, or on the "someday" queue? Just asking
because I'd like to avoid doing additional work that might get thrown
away soon.

> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Makes sense to me. I'll be happy to do it, pending your answer to
my preceding question.

> I'm irritated to realize that we can't change the default for SysV
> shared memory or /dev/zero this way, because that mount is internal.

Well, the only thing preventing this is that I stuck the flag into
sbinfo, since it's an filesystem-wide setting. I don't see any reason
we couldn't add a new flag in the inode info flag field instead. I
think there would also be some work to set pvma.vm_end more precisely
(in mpol_shared_policy_init()) in the SysV case.

> At one time (August) you were worried about MPOL_INTERLEAVE
> overloading node 0 on small files - is that still a worry?
> Perhaps you skirt that issue in recommending this option
> for use with giant files.

Yeah, there's still a bit of concern about that, but it's dwarfed
in comparison. Taking care of that would be relatively easy,
adding something like the inode number (or other inode-constant
"randomness") in as a offset to pvma.vm_pgoff in shmem_alloc_page()
(or maybe a bit higher up the call-chain, I'd have to look closer).

> There are quite a lot of mpol patches flying around, aren't there?

Yep. SGI solved some of these problems for our own 2.4.x kernel
distributions, but now we want to get things settled out in the
mainline kernel. So far I think we're hitting distinct chunks
of code.

> >From Ray Bryant and from Steve Longerbeam. Would this tmpfs patch
> make (adaptable) sense if we went either or both of those ways - or
> have they been knocked on the head? I don't mean in the details
> (I think one of them does away with the pseudo-vma stuff - great!),
> but in basic design - would your mount option mesh together well
> with them, or would it be adding a further layer of confusion?

I see what you mean. I believe Ray's work is addressing the buffer
cache in general. I'll try to touch base with him again soon (I
admit to losing track of what he's been doing).

Brent

--
Brent Casavant If you had nothing to fear,
[email protected] how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars

2004-11-10 14:32:37

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 9 Nov 2004, Brent Casavant wrote:
> On Tue, 9 Nov 2004, Hugh Dickins wrote:
>
> > Doesn't quite play right with what was my "NULL sbinfo" convention.
>
> Howso? I thought it played quite nicely with it. We've been using
> NULL sbinfo as an indicator that an inode is from tmpfs rather than
> from SysV or /dev/zero. Or at least that's the way my brain was
> wrapped around it.

That was the case you cared about, but remember I extended yours so
that tmpfs mounts could also suppress limiting, and get NULL sbinfo.

> The NULL sbinfo scheme worked perfectly for me, with very little hassle.

Yes, it would have worked just right for the important cases.

> > but they're two hints that I should rework that to get out of people's
> > way. I'll do a patch for that, then another something like yours on
> > top, for you to go back and check.
>
> Is this something imminent, or on the "someday" queue? Just asking
> because I'd like to avoid doing additional work that might get thrown
> away soon.

I understand your concern ;) I'm working on it, today or tomorrow.

> > I'm irritated to realize that we can't change the default for SysV
> > shared memory or /dev/zero this way, because that mount is internal.
>
> Well, the only thing preventing this is that I stuck the flag into
> sbinfo, since it's an filesystem-wide setting. I don't see any reason
> we couldn't add a new flag in the inode info flag field instead. I
> think there would also be some work to set pvma.vm_end more precisely
> (in mpol_shared_policy_init()) in the SysV case.

It's not a matter of where to store the info, it's that we don't have
a user interface for remounting something that's not mounted anywhere.

Hugh

2004-11-11 19:53:43

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Tue, 9 Nov 2004, Hugh Dickins wrote:
>
> Not to say that your patch or Adam's will go further (I've no objection
> to the way Adam is using tmpfs, but no opinion on the future of devfs),
> but they're two hints that I should rework that to get out of people's
> way. I'll do a patch for that, then another something like yours on
> top, for you to go back and check.
>
> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Okay, two pattachments.

The first (against 2.6.10-rc1-mm5) being my reversion of NULL sbinfo
in shmem.c, to make it easier for others to add things into sbinfo
without having to worry about NULL cases. So that goes back to
allocating an sbinfo even for the internal mount: I've rounded up to
L1_CACHE_BYTES to avoid false sharing, but even so, please test it out
on your 512-way to make sure I haven't screwed up the scalability we
got before - thanks. If you find it okay, I'll send to akpm soonish.

The second (against the first) being my take on your patch, with
mpol=interleave, and minor alterations which may irritate you so much
you'll revert them immediately! (mainly, using MPOL_INTERLEAVE and
MPOL_DEFAULT within shmem.c rather than defining separate flags).
Only slightly tested at this end.

Hugh

Attachments:

tmpfs1.patch (8.64 kB)
revert NULL sbinfo tmpfs2.patch (6.09 kB)
mpol=interleave Download all attachments

2004-11-11 23:17:14

[permalink] [raw]

Subject: Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files

On Thu, 11 Nov 2004, Hugh Dickins wrote:

> The first (against 2.6.10-rc1-mm5) being my reversion of NULL sbinfo
> in shmem.c, to make it easier for others to add things into sbinfo
> without having to worry about NULL cases. So that goes back to
> allocating an sbinfo even for the internal mount: I've rounded up to
> L1_CACHE_BYTES to avoid false sharing, but even so, please test it out
> on your 512-way to make sure I haven't screwed up the scalability we
> got before - thanks. If you find it okay, I'll send to akpm soonish.

I won't be able to get a 512 run in until Monday, due to test machine
availability. However runs at 32P and 64P indicate nothing disastrous.
Results seem to be in line with the numbers we were getting when doing
the NULL sbinfo work.

So, thus far a preliminary "Looks good".

> The second (against the first) being my take on your patch, with
> mpol=interleave, and minor alterations which may irritate you so much
> you'll revert them immediately! (mainly, using MPOL_INTERLEAVE and
> MPOL_DEFAULT within shmem.c rather than defining separate flags).
> Only slightly tested at this end.

Seems to work just fine, and I rather like how this was made a bit more
general. Thumbs up!

Brent

--
Brent Casavant If you had nothing to fear,
[email protected] how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars

2004-11-15 22:11:21