Hello,
On a NUMA host, if a driver calls __get_free_pages() then it will eventually invoke ->alloc_pages_current(..). The comment above/within alloc_pages_current() says 'current->mempolicy' will be used. So what memory policy will kick in if the driver is trying to allocate some memory blocks at driver load time (say, from probe_one)? The system-wide default policy, correct?
What if the driver wishes to i) stay confined to a 'cpulist', or ii) use a different mem-policy? How do I achieve this?
I will choose the 'cpulist' after I am successfully able to affinitize the MSI-X vectors.
regards
Rick
On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> On a NUMA host, if a driver calls __get_free_pages() then it will
> eventually invoke ->alloc_pages_current(..). The comment above/within
> alloc_pages_current() says 'current->mempolicy' will be used. So what
> memory policy will kick in if the driver is trying to allocate some
> memory blocks at driver load time (say, from probe_one)? The
> system-wide default policy, correct?
Actually, it's the policy of the modprobe process, or the kernel
boot-time policy if the driver is built in (which is interleaving).
>
> What if the driver wishes to i) stay confined to a 'cpulist', or ii)
> use a different mem-policy? How do I achieve this?
> I will choose the 'cpulist' after I am successfully able to affinitize
> the MSI-X vectors.
You can do that right now by running numactl ... modprobe ...
Yes, there should probably be a better way, like using a policy
based on the affinity of the PCI device.
-Andi
--
[email protected] -- Speaking for myself only.
Hi Andi,
--- On Wed, 4/7/10, Andi Kleen <[email protected]> wrote:
> On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > On a NUMA host, if a driver calls __get_free_pages() then it will
> > eventually invoke ->alloc_pages_current(..). The comment above/within
> > alloc_pages_current() says 'current->mempolicy' will be used. So what
> > memory policy will kick in if the driver is trying to allocate some
> > memory blocks at driver load time (say, from probe_one)? The
> > system-wide default policy, correct?
>
> Actually, it's the policy of the modprobe process, or the kernel
> boot-time policy if the driver is built in (which is interleaving).
>
Interleaving, yup, that's what I thought. I have tight control over the environment. For one driver I need high throughput, so I will use the interleave policy. But for the other 2-3 drivers I need low latency, so I would like to restrict them to the local node. These are just my thoughts; I'll have to experiment and see what the numbers look like. Once I have some numbers I will post them in a few weeks.
> >
> > What if the driver wishes to i) stay confined to a 'cpulist', or ii)
> > use a different mem-policy? How do I achieve this?
> > I will choose the 'cpulist' after I am successfully able to
> > affinitize the MSI-X vectors.
>
> You can do that right now by running numactl ... modprobe ...
>
Perfect. Ok, then I'll probably write a simple user-space wrapper:
1) set the mem-policy type depending on driver-foo-M.
2) load driver-foo-M.
3) goto 1) and repeat for the other driver[s]-foo-X.
BTW - I would know beforehand which adapter is placed in which slot, and so I will be able to deduce its proximity to a node.
> Yes, there should probably be a better way, like using a policy
> based on the affinity of the PCI device.
>
> -Andi
>
Thanks
Rick
On Wed, 2010-04-07 at 08:48 -0700, Rick Sherm wrote:
> Hi Andi,
>
> --- On Wed, 4/7/10, Andi Kleen <[email protected]> wrote:
> > On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > > On a NUMA host, if a driver calls __get_free_pages() then it will
> > > eventually invoke ->alloc_pages_current(..). The comment
> > > above/within alloc_pages_current() says 'current->mempolicy' will
> > > be used. So what memory policy will kick in if the driver is trying
> > > to allocate some memory blocks at driver load time (say, from
> > > probe_one)? The system-wide default policy, correct?
> >
> > Actually, it's the policy of the modprobe process, or the kernel
> > boot-time policy if the driver is built in (which is interleaving).
> >
>
> Interleaving, yup, that's what I thought. I have tight control over
> the environment. For one driver I need high throughput, so I will use
> the interleave policy. But for the other 2-3 drivers I need low
> latency, so I would like to restrict them to the local node. These are
> just my thoughts; I'll have to experiment and see what the numbers
> look like. Once I have some numbers I will post them in a few weeks.
>
> > >
> > > What if the driver wishes to i) stay confined to a 'cpulist', or
> > > ii) use a different mem-policy? How do I achieve this?
> > > I will choose the 'cpulist' after I am successfully able to
> > > affinitize the MSI-X vectors.
> >
> > You can do that right now by running numactl ... modprobe ...
> >
> Perfect. Ok, then I'll probably write a simple user-space wrapper:
> 1) set the mem-policy type depending on driver-foo-M.
> 2) load driver-foo-M.
> 3) goto 1) and repeat for the other driver[s]-foo-X.
> BTW - I would know beforehand which adapter is placed in which slot,
> and so I will be able to deduce its proximity to a node.
>
> > Yes, there should probably be a better way, like using a policy
> > based on the affinity of the PCI device.
> >
Rick:
If you want/need to use __get_free_pages(), you will need to set the
current task's memory policy. If you're loading the driver from user
space, then you can set the mempolicy of the task [shell, modprobe, ...]
using numactl, as you suggest above. From within the kernel, you'd need
to temporarily change current's mempolicy to what you need and then put
it back. We don't have a formal interface to do this, I think, but one
could be added.
Another option, if you just want memory on a specific node, would be to
use kmalloc_node(). But for a multiple-page allocation, this might not
be the best method.
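For example, a minimal sketch ('size' and 'node' here are hypothetical
values from your driver; kmalloc_node() takes the target node
explicitly):

    /* allocate 'size' bytes out of the slab on NUMA node 'node' */
    void *buf = kmalloc_node(size, GFP_KERNEL, node);
    if (!buf)
            return -ENOMEM;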
As to how to find the node where the adapter is attached, from user
space you can look at /sys/devices/pci<pci-bus>/<pci-dev>/numa_node.
You can also find the 'local_cpus' [hex mask] and 'local_cpulist' in the
same directory. From within the driver, you can examine dev->numa_node.
Look at 'local_cpu{s|list}_show()' to see how to find the local cpus for
a device.
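For instance, a sketch of that lookup from a hypothetical probe routine
(the function and message are made up; needs <linux/pci.h> and
<linux/topology.h>):

    static int __devinit foo_probe(struct pci_dev *pdev,
                                   const struct pci_device_id *id)
    {
            int node = dev_to_node(&pdev->dev);  /* i.e., dev->numa_node */

            if (node == -1)                 /* no affinity info recorded */
                    node = numa_node_id();  /* fall back to the local node */

            dev_info(&pdev->dev, "device is local to node %d\n", node);
            return 0;
    }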
Note that if your device is attached to a memoryless node on x86, this
info won't be accurate. x86 arch code removes memoryless nodes and
reassigns cpus to other nodes that do have memory. I'm not sure what it
does with the dev->numa_node info. Maybe not a problem for you.
Regards,
Lee
Hello,
PS - Please 'CC' me on the emails. I have not subscribed to the list.
> Hi Andi,
>
> --- On Wed, 4/7/10, Andi Kleen <[email protected]> wrote:
> > On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > > On a NUMA host, if a driver calls __get_free_pages() then it will
> > > eventually invoke ->alloc_pages_current(..). The comment
> > > above/within alloc_pages_current() says 'current->mempolicy' will
> > > be used. So what memory policy will kick in if the driver is trying
> > > to allocate some memory blocks at driver load time (say, from
> > > probe_one)? The system-wide default policy, correct?
> >
> > Actually, it's the policy of the modprobe process, or the kernel
> > boot-time policy if the driver is built in (which is interleaving).
> >
I may be wrong, but I think there's a difference: the system-wide run-time default policy is MPOL_PREFERRED | MPOL_F_LOCAL, not interleave.
So, if current->mempolicy is set then default_policy will not be used. And now, if you don't want the default_policy mode, then what?
I'm stuck in this confused state too. So we have two cases to take care of:
Case 1) current->mempolicy is initialized, so we can just set it to whatever we like and then reset it once we are done with __get_free_pages(..) etc.
Case 2) current->mempolicy is not initialized, so default_policy is used. Now if we have to muck with the default_policy then we will need to lock it down; otherwise some other consumer will be affected by it.
But both of the above solutions are twisted. Why not just create a different wrapper? That way we can leave both current and default_policy alone.
#ifdef CONFIG_NUMA
/* proposed: allocate pages under an explicit policy */
unsigned long __get_free_policy_pages(struct mempolicy *policy,
                                      gfp_t gfp_mask, unsigned int order);
#endif
For now I may end up hacking my kernel and implementing the above-mentioned quick-and-dirty solution. But if there's a cleaner approach then please let me know.
PS - We should create some wrappers that will automatically figure out the MSI-X affinity (if present/set) and then default the allocation to that node? Also, is there a way to configure irqbalance and ask it to leave these guys alone? Like a config file that says: leave these irqs/pci-devices alone. For now I've shut down irqbalance.
>
> > -Andi
> >
>
> Thanks
> Rick
>
thanks
Chetan Loke
> I may be wrong, but I think there's a difference: the system-wide
> run-time default policy is MPOL_PREFERRED | MPOL_F_LOCAL, not
> interleave.
The policy at early kernel boot time (before mounting root) is interleaving.
-Andi
Andi,
--- On Sat, 4/17/10, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
> Subject: Re: Memory policy question for NUMA arch....
> To: "Chetan Loke" <[email protected]>
> Cc: [email protected], [email protected], [email protected], [email protected]
> Date: Saturday, April 17, 2010, 6:35 AM
> > I may be wrong, but I think there's a difference: the system-wide
> > run-time default policy is MPOL_PREFERRED | MPOL_F_LOCAL, not
> > interleave.
>
> The policy at early kernel boot time (before mounting root) is
> interleaving.
>
Yup.
PS - What do you think about __get_free_policy_pages(..)? Good/bad idea?
> -Andi
>
thanks
Chetan Loke
> PS - What do you think about __get_free_policy_pages(..)? Good/bad idea?
You can do it today by temporarily replacing the policy in current. It's somewhat
ugly. Interrupts don't use the policy, so it's not racy.
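Roughly like this, as a hedged sketch of the swap ('newpol' is a
hypothetical, already-constructed struct mempolicy, and 'order' is
whatever your allocation needs; the refcounting via mpol_get/mpol_put
still deserves care):

    struct mempolicy *saved = current->mempolicy;

    mpol_get(newpol);
    current->mempolicy = newpol;    /* swap our policy in */
    addr = __get_free_pages(GFP_KERNEL, order);
    current->mempolicy = saved;     /* restore the original policy */
    mpol_put(newpol);               /* drop our reference */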
But yes, if it's common it would seem reasonable to have a cleaner function for it,
provided you have a clear use case and can do it without impacting the page allocator fast path.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, 2010-04-16 at 16:17 -0700, Chetan Loke wrote:
> Hello,
>
> PS - Please 'CC' me on the emails. I have not subscribed to the list.
>
> > Hi Andi,
> >
> > --- On Wed, 4/7/10, Andi Kleen <[email protected]> wrote:
> > > On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > > > On a NUMA host, if a driver calls __get_free_pages() then it
> > > > will eventually invoke ->alloc_pages_current(..). The comment
> > > > above/within alloc_pages_current() says 'current->mempolicy'
> > > > will be used. So what memory policy will kick in if the driver
> > > > is trying to allocate some memory blocks at driver load time
> > > > (say, from probe_one)? The system-wide default policy, correct?
> > >
> > > Actually, it's the policy of the modprobe process, or the kernel
> > > boot-time policy if the driver is built in (which is
> > > interleaving).
> > >
>
> I may be wrong, but I think there's a difference: the system-wide
> run-time default policy is MPOL_PREFERRED | MPOL_F_LOCAL, not
> interleave.
>
> So, if current->mempolicy is set then default_policy will not be used.
> And now, if you don't want the default_policy mode, then what?
> I'm stuck in this confused state too. So we have two cases to take
> care of:
>
> Case 1) current->mempolicy is initialized, so we can just set it to
> whatever we like and then reset it once we are done with
> __get_free_pages(..) etc.
Yes, as Andi mentioned. Also, see my response to Rick at:
http://marc.info/?l=linux-kernel&m=127066130315241&w=4
>
> Case 2) current->mempolicy is not initialized, so default_policy is
> used. Now if we have to muck with the default_policy then we will need
> to lock it down; otherwise some other consumer will be affected by
> it.
If current->mempolicy is not initialized, you can create a new one and
set it temporarily. You could probably call do_set_mempolicy() directly
the way numa_policy_init() does and then call numa_default_policy() to
restore it to default.
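Something along these lines, as a hedged sketch (note that
do_set_mempolicy() is static in mm/mempolicy.c today, so it would have
to be exported first; MPOL_BIND and the single-node mask are just one
possible choice, with 'target_node' and 'order' supplied by the caller):

    nodemask_t nodes = nodemask_of_node(target_node);

    do_set_mempolicy(MPOL_BIND, 0, &nodes); /* temporary policy for current */
    addr = __get_free_pages(GFP_KERNEL, order);
    numa_default_policy();                  /* back to MPOL_DEFAULT */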
You should never change the system default once the system is up and
running.
>
> But both of the above solutions are twisted. Why not just create a
> different wrapper? That way we can leave both current and
> default_policy alone.
>
> #ifdef CONFIG_NUMA
> /* proposed: allocate pages under an explicit policy */
> unsigned long __get_free_policy_pages(struct mempolicy *policy,
>                                       gfp_t gfp_mask, unsigned int order);
> #endif
As Andi mentioned in his response, you could certainly do this as long
as it doesn't impact the normal allocation path.
>
> For now I may end up hacking my kernel and implementing the
> above-mentioned quick-and-dirty solution. But if there's a cleaner
> approach then please let me know.
>
> PS - We should create some wrappers that will automatically figure
> out the MSI-X affinity (if present/set) and then default the
> allocation to that node?
Still not clear on what your requirements are, but if existing
interfaces don't suffice, such a wrapper might make sense.
__get_free_pages() is simply a wrapper around alloc_pages() that then
returns page_address() of the resulting page. So, something like
'get_free_pages_node()'--which should probably live in
mm/page_alloc.c--would just be a wrapper around alloc_pages_node() that
then returns the page_address() of the page.
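That is, roughly this (a sketch of the hypothetical helper, not an
existing interface):

    unsigned long get_free_pages_node(int nid, gfp_t gfp_mask,
                                      unsigned int order)
    {
            struct page *page;

            page = alloc_pages_node(nid, gfp_mask, order);
            if (!page)
                    return 0;
            return (unsigned long)page_address(page);
    }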
A device-centric interface--e.g., 'get_free_pages_dev()'--could get the
device/bus node affinity via dev_to_node() and then do the
allocation/conversion. I think this is close to what you're suggesting
above. See dma_generic_alloc_coherent() [in arch/x86/kernel/pci-dma.c]
for an example of a wrapper that does the device affinity lookup and
allocation in one function.
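The device-centric flavor could then just layer on top of the helper
sketched above (again hypothetical):

    unsigned long get_free_pages_dev(struct device *dev, gfp_t gfp_mask,
                                     unsigned int order)
    {
            return get_free_pages_node(dev_to_node(dev), gfp_mask, order);
    }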
Of course, you could just do this in your driver, as well.
> Also, is there a way to configure irqbalance and ask it to leave
> these guys alone? Like a config file that says: leave these
> irqs/pci-devices alone. For now I've shut down irqbalance.
You can set the environment variable IRQBALANCE_BANNED_INTERRUPTS--when
starting irqbalance--to a list of interrupts that irqbalance should
ignore, if you're using a version that supports that. Check the init
script that starts irqbalance on your distro of choice.
Regards,
Lee
Lee,
--- On Mon, 4/19/10, Lee Schermerhorn <[email protected]> wrote:
> You should never change the system default once the system is up and
> running.
>
Not sure what I was thinking. Agreed, we should leave the default policy alone.
> Still not clear on what your requirements are
Thanks for pointing me to the other post. I might have a similar problem. I have a Nehalem box.
1) Drivers support MSI-X.
   1.1) Drivers at load time allocate a chunk of DMA'able memory.
2) Sometime later, after the OS boots, I need to load my apps.
3) Now, the apps and the drivers communicate via a mmap'd region.
I need a deterministic way of allocating memory depending on my needs (interleave or localalloc), so I can't be at the mercy of 'current', the global policy, or anyone else. Also, why reference 'current' to begin with? Every time I reference 'current' from within my driver, current points to the 'work_for_cpu' kthread, so that clearly doesn't help.
> A device-centric interface--e.g., 'get_free_pages_dev()'--could get
> the device/bus node affinity via dev_to_node() and then do the
> allocation/conversion. I think this is close to what you're suggesting
> above. See dma_generic_alloc_coherent() [in arch/x86/kernel/pci-dma.c]
> for an example of a wrapper that does the device affinity lookup and
> allocation in one function.
>
> Of course, you could just do this in your driver, as well.
>
>
Very helpful, thanks. I will mimic 'dma_generic_alloc_coherent' in my driver when I need local-node memory.
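i.e., roughly this pattern (a sketch; 'pdev' and 'size' come from my
driver, and the GFP flags are just an example):

    struct page *page;
    void *buf;

    page = alloc_pages_node(dev_to_node(&pdev->dev),
                            GFP_KERNEL | __GFP_ZERO, get_order(size));
    buf = page ? page_address(page) : NULL;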
> > Also, is there a way to configure irqbalance and ask it to leave
> > these guys alone? Like a config file that says: leave these
> > irqs/pci-devices alone. For now I've shut down irqbalance.
>
> You can set the environment variable IRQBALANCE_BANNED_INTERRUPTS--when
> starting irqbalance--to a list of interrupts that irqbalance should
> ignore, if you're using a version that supports that. Check the init
> script that starts irqbalance on your distro of choice.
>
Aah, mine is old: I can see IRQBALANCE_BANNED_CPUS but not _INTERRUPTS. I will upgrade it.
> Regards,
> Lee
Thanks
Chetan Loke