2015-08-17 10:25:27

by Alexander Gordeev

Subject: Make RCU tree CPU topology aware?

Hi Paul,

Currently the RCU tree distributes CPUs to leaves based on consecutive
CPU IDs. That means CPUs from remote caches and even remote nodes might
end up in the same leaf.

I have not researched the impact, but at first glance that seems at
least sub-optimal, especially in the case of remote nodes, where CPUs
access each other's memory.
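
To make the concern concrete, here is a rough model of the current
assignment (a simplified sketch, not the actual rcu_init_geometry()
code), assuming a hypothetical 32-CPU box with 8 CPUs per NUMA node and
the default leaf fanout of 16:

#include <stdio.h>

#define NR_CPUS        32
#define CPUS_PER_NODE   8   /* hypothetical 4-node topology */
#define FANOUT_LEAF    16   /* default RCU_FANOUT_LEAF */

int main(void)
{
        /* CPUs are packed into leaves purely by consecutive CPU ID,
         * so leaf 0 ends up holding CPUs from nodes 0 and 1. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                printf("CPU %2d  node %d  leaf %d\n",
                       cpu, cpu / CPUS_PER_NODE, cpu / FANOUT_LEAF);
        return 0;
}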

I am thinking of a topology-aware RCU geometry where the RCU tree
reflects the actual system topology, e.g. by borrowing it from
scheduling domains or something like that.

Do you think it is worth the effort to research this question, or am I
missing something and the current access patterns are already optimal?

Thanks!

--
Regards,
Alexander Gordeev
[email protected]


2015-08-17 15:28:24

by Paul E. McKenney

Subject: Re: Make RCU tree CPU topology aware?

On Mon, Aug 17, 2015 at 11:39:34AM +0100, Alexander Gordeev wrote:
> Hi Paul,
>
> Currently the RCU tree distributes CPUs to leaves based on consecutive
> CPU IDs. That means CPUs from remote caches and even remote nodes might
> end up in the same leaf.
>
> I have not researched the impact, but at first glance that seems at
> least sub-optimal, especially in the case of remote nodes, where CPUs
> access each other's memory.
>
> I am thinking of a topology-aware RCU geometry where the RCU tree
> reflects the actual system topology, e.g. by borrowing it from
> scheduling domains or something like that.
>
> Do you think it is worth the effort to research this question, or am I
> missing something and the current access patterns are already optimal?

The first thing to try would be to specify the rcutree.rcu_fanout_leaf
kernel boot parameter to align with the system's hardware boundaries and
to misalign, and see if you can measure any difference whatsoever at the
system level. For example, if you are using a multi-socket eight-core
x86 CPU with hyperthreading enabled, specify rcutree.rcu_fanout_leaf=8
to account for the "interesting" x86 CPU numbering. The default of
rcutree.rcu_fanout_leaf=16 would have the first two sockets sharing the
first leaf rcu_node structure. Perhaps also try rcutree.rcu_fanout_leaf=7
and rcutree.rcu_fanout_leaf=9 to tease out contention effects. I suggest
also running tests with hyperthreading disabled.
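
For concreteness, a rough userspace model of that grouping (a sketch
under the stated assumptions, not RCU's actual geometry code), taking
the usual x86 enumeration for such a two-socket, eight-core,
hyperthreaded box:

#include <stdio.h>

#define NR_CPUS 32              /* 2 sockets x 8 cores x 2 threads */

/* Assumed x86 enumeration: CPUs 0-7 socket 0, 8-15 socket 1,
 * 16-23 socket 0 HT siblings, 24-31 socket 1 HT siblings. */
static int socket_of(int cpu)
{
        return (cpu / 8) % 2;
}

static void show(int fanout_leaf)
{
        printf("rcu_fanout_leaf=%d:\n", fanout_leaf);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                printf("  CPU %2d  socket %d  leaf %d\n",
                       cpu, socket_of(cpu), cpu / fanout_leaf);
}

int main(void)
{
        show(16);       /* default: leaf 0 mixes both sockets */
        show(8);        /* aligned: each leaf holds one socket's threads */
        return 0;
}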

I bet that you won't see any system-level effect. The reason for that
bet is that people have been asking me this for years, but have always
declined to provide any data. In addition, RCU's fast paths are designed
to avoid hitting the rcu_node structures -- even call_rcu() normally is
confined to the per-CPU rcu_data structure.
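
To illustrate that last point, a heavily simplified userspace sketch
(this is not the kernel's call_rcu(), which uses per-CPU rcu_data and
much more machinery): the enqueue fast path only appends to data that
is private to the current CPU, so it never touches the shared rcu_node
tree at all.

#include <stdio.h>

struct callback {
        struct callback *next;
        void (*func)(struct callback *);
};

/* Stand-in for the per-CPU rcu_data structure; thread-local storage
 * plays the role of per-CPU data in this userspace model. */
struct cpu_data {
        struct callback *head;
        struct callback **tail;
};

static __thread struct cpu_data my_data;

static void my_call_rcu(struct callback *cb, void (*func)(struct callback *))
{
        cb->func = func;
        cb->next = NULL;
        if (!my_data.tail)              /* lazily set up the empty list */
                my_data.tail = &my_data.head;
        *my_data.tail = cb;             /* append to the CPU-local list only */
        my_data.tail = &cb->next;       /* no rcu_node lock, no shared line */
}

static void my_cb(struct callback *cb)
{
        (void)cb;
        printf("callback invoked\n");
}

int main(void)
{
        static struct callback cb;

        my_call_rcu(&cb, my_cb);        /* fast path: per-CPU enqueue only */
        /* The grace-period machinery (not modeled here) would later walk
         * the list and invoke cb.func once a grace period has elapsed. */
        return 0;
}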

Please note that I am particularly unhappy with the thought of having
RCU use non-contiguous CPU numbering within the rcu_node structures.
For example, having the first rcu_node structure have CPUs 0-7 and
32-39, the second have 8-15 and 40-47, and so on is really really ugly.
That isn't to say that I am inalterably opposed, but rather that there
had better be extremely good measurable system-level reasons for such
a change.

On the other hand, having some sort of option to allow architectures to
specify the RCU_FANOUT and RCU_FANOUT_LEAF values at boot time is not
that big a deal.

Does that help?

Thanx, Paul

2015-08-18 08:41:36

by Alexander Gordeev

Subject: Re: Make RCU tree CPU topology aware?

On Mon, Aug 17, 2015 at 08:28:16AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 17, 2015 at 11:39:34AM +0100, Alexander Gordeev wrote:
> > Hi Paul,
> >
> > Currently the RCU tree distributes CPUs to leaves based on consecutive
> > CPU IDs. That means CPUs from remote caches and even remote nodes might
> > end up in the same leaf.
> >
> > I have not researched the impact, but at first glance that seems at
> > least sub-optimal, especially in the case of remote nodes, where CPUs
> > access each other's memory.
> >
> > I am thinking of a topology-aware RCU geometry where the RCU tree
> > reflects the actual system topology, e.g. by borrowing it from
> > scheduling domains or something like that.
> >
> > Do you think it is worth the effort to research this question, or am I
> > missing something and the current access patterns are already optimal?
>
> The first thing to try would be to specify the rcutree.rcu_fanout_leaf
> kernel boot parameter to align with the system's hardware boundaries and
> to misalign, and see if you can measure any difference whatsoever at the
> system level. For example, if you are using a multi-socket eight-core
> x86 CPU with hyperthreading enabled, specify rcutree.rcu_fanout_leaf=8
> to account for the "interesting" x86 CPU numbering. The default of
> rcutree.rcu_fanout_leaf=16 would have the first two sockets sharing the
> first leaf rcu_node structure. Perhaps also try rcutree.rcu_fanout_leaf=7
> and rcutree.rcu_fanout_leaf=9 to tease out contention effects. I suggest
> also running tests with hyperthreading disabled.
>
> I bet that you won't see any system-level effect. The reason for that
> bet is that people have been asking me this for years, but have always
> declined to provide any data. In addition, RCU's fast paths are designed
> to avoid hitting the rcu_node structures -- even call_rcu() normally is
> confined to the per-CPU rcu_data structure.
>
> Please note that I am particularly unhappy with the thought of having
> RCU use non-contiguous CPU numbering within the rcu_node structures.
> For example, having the first rcu_node structure have CPUs 0-7 and
> 32-39, the second have 8-15 and 40-47, and so on is really really ugly.
> That isn't to say that I am inalterably opposed, but rather that there
> had better be extremely good measurable system-level reasons for such
> a change.
>
> On the other hand, having some sort of option to allow architectures to
> specify the RCU_FANOUT and RCU_FANOUT_LEAF values at boot time is not
> that big a deal.
>
> Does that help?

A lot!

I suspected there might be no benefit in such a change, and it is good
to know first-hand.

I could only think of large NUMA systems where this might matter, but
if the problem exists, I guess it would be mitigated by the NUMA
balancer anyway.

Thank you, Paul!

> Thanx, Paul
>

--
Regards,
Alexander Gordeev
[email protected]

2015-08-18 13:21:35

by Paul E. McKenney

Subject: Re: Make RCU tree CPU topology aware?

On Tue, Aug 18, 2015 at 09:55:40AM +0100, Alexander Gordeev wrote:
> On Mon, Aug 17, 2015 at 08:28:16AM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 17, 2015 at 11:39:34AM +0100, Alexander Gordeev wrote:
> > > Hi Paul,
> > >
> > > Currently the RCU tree distributes CPUs to leaves based on consecutive
> > > CPU IDs. That means CPUs from remote caches and even remote nodes might
> > > end up in the same leaf.
> > >
> > > I have not researched the impact, but at first glance that seems at
> > > least sub-optimal, especially in the case of remote nodes, where CPUs
> > > access each other's memory.
> > >
> > > I am thinking of a topology-aware RCU geometry where the RCU tree
> > > reflects the actual system topology, e.g. by borrowing it from
> > > scheduling domains or something like that.
> > >
> > > Do you think it is worth the effort to research this question, or am I
> > > missing something and the current access patterns are already optimal?
> >
> > The first thing to try would be to specify the rcutree.rcu_fanout_leaf
> > kernel boot parameter to align with the system's hardware boundaries and
> > to misalign, and see if you can measure any difference whatsoever at the
> > system level. For example, if you are using a multi-socket eight-core
> > x86 CPU with hyperthreading enabled, specify rcutree.rcu_fanout_leaf=8
> > to account for the "interesting" x86 CPU numbering. The default of
> > rcutree.rcu_fanout_leaf=16 would have the first two sockets sharing the
> > first leaf rcu_node structure. Perhaps also try rcutree.rcu_fanout_leaf=7
> > and rcutree.rcu_fanout_leaf=9 to tease out contention effects. I suggest
> > also running tests with hyperthreading disabled.
> >
> > I bet that you won't see any system-level effect. The reason for that
> > bet is that people have been asking me this for years, but have always
> > declined to provide any data. In addition, RCU's fast paths are designed
> > to avoid hitting the rcu_node structures -- even call_rcu() normally is
> > confined to the per-CPU rcu_data structure.
> >
> > Please note that I am particularly unhappy with the thought of having
> > RCU use non-contiguous CPU numbering within the rcu_node structures.
> > For example, having the first rcu_node structure have CPUs 0-7 and
> > 32-39, the second have 8-15 and 40-47, and so on is really really ugly.
> > That isn't to say that I am inalterably opposed, but rather that there
> > had better be extremely good measurable system-level reasons for such
> > a change.
> >
> > On the other hand, having some sort of option to allow architectures to
> > specify the RCU_FANOUT and RCU_FANOUT_LEAF values at boot time is not
> > that big a deal.
> >
> > Does that help?
>
> A lot!
>
> I suspected there might be no benefit in such a change, and it is good
> to know first-hand.
>
> I could only think of large NUMA systems where this might matter, but
> if the problem exists, I guess it would be mitigated by the NUMA balancer

Well, please let me know how the measurement goes for you! As you say,
there is no substitute for first-hand data.

Thanx, Paul