From: Alexander Duyck <[email protected]>

Change the logic used to generate randomness in the suffle path so that we
can avoid cache line bouncing. The previous logic was sharing the offset
and entropy word between all CPUs. As such this can result in cache line
bouncing and will ultimately hurt performance when enabled.

To resolve this I have moved to a per-cpu logic for maintaining a unsigned
long containing some amount of bits, and an offset value for which bit we
can use for entropy with each call.

Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
 mm/shuffle.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce12481b1dc..9ba542ecf335 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -183,25 +183,38 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 		shuffle_zone(z);
 }
 
+struct batched_bit_entropy {
+	unsigned long entropy_bool;
+	int position;
+};
+
+static DEFINE_PER_CPU(struct batched_bit_entropy, batched_entropy_bool);
+
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	struct batched_bit_entropy *batch;
+	unsigned long entropy;
+	int position;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * We shouldn't need to disable IRQs as the only caller is
+	 * __free_one_page and it should only be called with the zone lock
+	 * held and either from IRQ context or with local IRQs disabled.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	batch = raw_cpu_ptr(&batched_entropy_bool);
+	position = batch->position;
+
+	if (--position < 0) {
+		batch->entropy_bool = get_random_long();
+		position = BITS_PER_LONG - 1;
 	}
 
-	if (rand & 1)
+	batch->position = position;
+	entropy = batch->entropy_bool;
+
+	if (1ul & (entropy >> position))
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }
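
For readers following along outside the kernel tree, the batching scheme
in the hunk above can be sketched in userspace C. This is only an
illustration under stated assumptions: _Thread_local storage stands in
for the per-CPU variable, fake_random_long() is a hypothetical xorshift
stand-in for get_random_long() (not a real entropy source), and
next_random_bit() and refills are names invented here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Same layout as struct batched_bit_entropy in the patch. */
struct batched_bit_entropy {
	unsigned long entropy_bool;	/* current batch of random bits */
	int position;			/* index of the bit handed out last */
};

/* Assumption: one instance per thread approximates one instance per CPU. */
static _Thread_local struct batched_bit_entropy batched_entropy_bool;

/* Counts refills so the batching behaviour can be observed. */
static int refills;

/* Hypothetical stand-in for get_random_long(): xorshift64, NOT secure. */
static unsigned long fake_random_long(void)
{
	static _Thread_local uint64_t state = 0x9e3779b97f4a7c15ull;

	state ^= state << 13;
	state ^= state >> 7;
	state ^= state << 17;
	return (unsigned long)state;
}

/* Hand out one bit per call, refilling the word once it is used up. */
static bool next_random_bit(void)
{
	struct batched_bit_entropy *batch = &batched_entropy_bool;
	int position = batch->position;

	if (--position < 0) {
		batch->entropy_bool = fake_random_long();
		position = BITS_PER_LONG - 1;
		refills++;
	}
	batch->position = position;

	/* The shift count must stay inside the width of the word. */
	assert(position >= 0 && position < BITS_PER_LONG);
	return 1ul & (batch->entropy_bool >> position);
}
```

Because the zero-initialised state starts at position 0, the first call
triggers a refill, and each refill then serves BITS_PER_LONG calls, with
bits handed out from the top of the word down.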
On Sat, Sep 07, 2019 at 10:25:12AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Change the logic used to generate randomness in the suffle path so that we
Typo.
> can avoid cache line bouncing. The previous logic was sharing the offset
> and entropy word between all CPUs. As such this can result in cache line
> bouncing and will ultimately hurt performance when enabled.
>
> To resolve this I have moved to a per-cpu logic for maintaining a unsigned
> long containing some amount of bits, and an offset value for which bit we
> can use for entropy with each call.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> mm/shuffle.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 3ce12481b1dc..9ba542ecf335 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -183,25 +183,38 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> shuffle_zone(z);
> }
>
> +struct batched_bit_entropy {
> + unsigned long entropy_bool;
> + int position;
> +};
> +
> +static DEFINE_PER_CPU(struct batched_bit_entropy, batched_entropy_bool);
> +
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + struct batched_bit_entropy *batch;
> + unsigned long entropy;
> + int position;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * We shouldn't need to disable IRQs as the only caller is
> + * __free_one_page and it should only be called with the zone lock
> + * held and either from IRQ context or with local IRQs disabled.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + batch = raw_cpu_ptr(&batched_entropy_bool);
> + position = batch->position;
> +
> + if (--position < 0) {
> + batch->entropy_bool = get_random_long();
> + position = BITS_PER_LONG - 1;
> }
>
> - if (rand & 1)
> + batch->position = position;
> + entropy = batch->entropy_bool;
> +
> + if (1ul & (entropy >> position))
Maybe something like this would be more readable:
if (entropy & BIT(position))
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
>
>
--
Kirill A. Shutemov
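
As a quick check that the suggested BIT() form tests the same bit as the
shift-and-mask form in the patch, here is a small userspace sketch. The
BIT() definition below is a local copy assumed to match the kernel macro
(1UL << nr), and bit_by_shift(), bit_by_mask() and forms_agree() are
hypothetical helper names used only for this comparison.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))
/* Assumed userspace equivalent of the kernel's BIT() macro. */
#define BIT(nr) (1UL << (nr))

/* Form used in the patch: shift the wanted bit down, then mask it. */
static bool bit_by_shift(unsigned long entropy, int position)
{
	assert(position >= 0 && position < BITS_PER_LONG);
	return 1ul & (entropy >> position);
}

/* Form suggested in review: mask the wanted bit in place. */
static bool bit_by_mask(unsigned long entropy, int position)
{
	assert(position >= 0 && position < BITS_PER_LONG);
	return entropy & BIT(position);
}

/* Returns true if both forms agree for every bit position of entropy. */
static bool forms_agree(unsigned long entropy)
{
	for (int position = 0; position < BITS_PER_LONG; position++)
		if (bit_by_shift(entropy, position) !=
		    bit_by_mask(entropy, position))
			return false;
	return true;
}
```

The two forms yield different integer values (1 versus the bit's weight),
but as a condition in an if statement both are true exactly when the
selected bit is set.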
On 07.09.19 19:25, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Change the logic used to generate randomness in the suffle path so that we
> can avoid cache line bouncing. The previous logic was sharing the offset
> and entropy word between all CPUs. As such this can result in cache line
> bouncing and will ultimately hurt performance when enabled.
So, usually we perform such changes if there is real evidence. Do you
have any such performance numbers to back your claims?
>
> To resolve this I have moved to a per-cpu logic for maintaining a unsigned
> long containing some amount of bits, and an offset value for which bit we
> can use for entropy with each call.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> mm/shuffle.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 3ce12481b1dc..9ba542ecf335 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -183,25 +183,38 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> shuffle_zone(z);
> }
>
> +struct batched_bit_entropy {
> + unsigned long entropy_bool;
> + int position;
> +};
> +
> +static DEFINE_PER_CPU(struct batched_bit_entropy, batched_entropy_bool);
> +
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + struct batched_bit_entropy *batch;
> + unsigned long entropy;
> + int position;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * We shouldn't need to disable IRQs as the only caller is
> + * __free_one_page and it should only be called with the zone lock
> + * held and either from IRQ context or with local IRQs disabled.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + batch = raw_cpu_ptr(&batched_entropy_bool);
> + position = batch->position;
> +
> + if (--position < 0) {
> + batch->entropy_bool = get_random_long();
> + position = BITS_PER_LONG - 1;
> }
>
> - if (rand & 1)
> + batch->position = position;
> + entropy = batch->entropy_bool;
> +
> + if (1ul & (entropy >> position))
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
>
>
--
Thanks,
David / dhildenb
On Mon, 2019-09-09 at 12:07 +0300, Kirill A. Shutemov wrote:
> On Sat, Sep 07, 2019 at 10:25:12AM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Change the logic used to generate randomness in the suffle path so that we
>
> Typo.
>
> > can avoid cache line bouncing. The previous logic was sharing the offset
> > and entropy word between all CPUs. As such this can result in cache line
> > bouncing and will ultimately hurt performance when enabled.
> >
> > To resolve this I have moved to a per-cpu logic for maintaining a unsigned
> > long containing some amount of bits, and an offset value for which bit we
> > can use for entropy with each call.
> >
> > Reviewed-by: Dan Williams <[email protected]>
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > mm/shuffle.c | 33 +++++++++++++++++++++++----------
> > 1 file changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/shuffle.c b/mm/shuffle.c
> > index 3ce12481b1dc..9ba542ecf335 100644
> > --- a/mm/shuffle.c
> > +++ b/mm/shuffle.c
> > @@ -183,25 +183,38 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> > shuffle_zone(z);
> > }
> >
> > +struct batched_bit_entropy {
> > + unsigned long entropy_bool;
> > + int position;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct batched_bit_entropy, batched_entropy_bool);
> > +
> > void add_to_free_area_random(struct page *page, struct free_area *area,
> > int migratetype)
> > {
> > - static u64 rand;
> > - static u8 rand_bits;
> > + struct batched_bit_entropy *batch;
> > + unsigned long entropy;
> > + int position;
> >
> > /*
> > - * The lack of locking is deliberate. If 2 threads race to
> > - * update the rand state it just adds to the entropy.
> > + * We shouldn't need to disable IRQs as the only caller is
> > + * __free_one_page and it should only be called with the zone lock
> > + * held and either from IRQ context or with local IRQs disabled.
> > */
> > - if (rand_bits == 0) {
> > - rand_bits = 64;
> > - rand = get_random_u64();
> > + batch = raw_cpu_ptr(&batched_entropy_bool);
> > + position = batch->position;
> > +
> > + if (--position < 0) {
> > + batch->entropy_bool = get_random_long();
> > + position = BITS_PER_LONG - 1;
> > }
> >
> > - if (rand & 1)
> > + batch->position = position;
> > + entropy = batch->entropy_bool;
> > +
> > + if (1ul & (entropy >> position))
>
> Maybe something like this would be more readable:
>
> if (entropy & BIT(position))
>
> > add_to_free_area(page, area, migratetype);
> > else
> > add_to_free_area_tail(page, area, migratetype);
> > - rand_bits--;
> > - rand >>= 1;
> > }
> >
> >
Thanks for the review. I will update these two items for v10.
- Alex
On Mon, 2019-09-09 at 10:14 +0200, David Hildenbrand wrote:
> On 07.09.19 19:25, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Change the logic used to generate randomness in the suffle path so that we
> > can avoid cache line bouncing. The previous logic was sharing the offset
> > and entropy word between all CPUs. As such this can result in cache line
> > bouncing and will ultimately hurt performance when enabled.
>
> So, usually we perform such changes if there is real evidence. Do you
> have any such performance numbers to back your claims?
I'll have to go rerun the test to get the exact numbers. The reason this
came up is that my original test was spanning NUMA nodes and that made
this more expensive as a result since the memory was both not local to the
CPU and was being updated by multiple sockets.
I will try building a pair of host kernels with shuffling enabled and this
patch applied to one and can add that data to the patch description.
- Alex
On Mon 09-09-19 08:11:36, Alexander Duyck wrote:
> On Mon, 2019-09-09 at 10:14 +0200, David Hildenbrand wrote:
> > On 07.09.19 19:25, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Change the logic used to generate randomness in the suffle path so that we
> > > can avoid cache line bouncing. The previous logic was sharing the offset
> > > and entropy word between all CPUs. As such this can result in cache line
> > > bouncing and will ultimately hurt performance when enabled.
> >
> > So, usually we perform such changes if there is real evidence. Do you
> > have any such performance numbers to back your claims?
>
> I'll have to go rerun the test to get the exact numbers. The reason this
> came up is that my original test was spanning NUMA nodes and that made
> this more expensive as a result since the memory was both not local to the
> CPU and was being updated by multiple sockets.
What was the pattern of page freeing in your testing? I am wondering
because order 0 pages should be prevailing and those usually go via pcp
lists so they do not get shuffled unless the batch is full IIRC.
--
Michal Hocko
SUSE Labs
On Mon, 2019-09-09 at 10:14 +0200, David Hildenbrand wrote:
> On 07.09.19 19:25, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Change the logic used to generate randomness in the suffle path so that we
> > can avoid cache line bouncing. The previous logic was sharing the offset
> > and entropy word between all CPUs. As such this can result in cache line
> > bouncing and will ultimately hurt performance when enabled.
>
> So, usually we perform such changes if there is real evidence. Do you
> have any such performance numbers to back your claims?
I don't have any numbers. From what I can tell the effect is too small to
make a measurable difference.
With that being the case I can probably just drop this patch. I will
instead just use "rand & 1" in the 2nd patch to generate the return value
which was what was previously done in add_to_free_area_random.
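
The "rand & 1" fallback corresponds to the removed lines of the patch:
one shared word, consumed from the low bit upward. A rough userspace
rendering, assuming fake_random_u64() as a hypothetical stand-in for
get_random_u64() and with rand_word and next_shared_bit() as names
invented for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared by every caller, like the two function-local statics were. */
static uint64_t rand_word;
static uint8_t rand_bits;

/* Hypothetical stand-in for get_random_u64(): xorshift64, NOT secure. */
static uint64_t fake_random_u64(void)
{
	static uint64_t state = 0x2545f4914f6cdd1dull;

	state ^= state << 13;
	state ^= state >> 7;
	state ^= state << 17;
	return state;
}

/* Return the low bit of the shared word, then shift it out. */
static bool next_shared_bit(void)
{
	bool bit;

	if (rand_bits == 0) {
		rand_bits = 64;
		rand_word = fake_random_u64();
	}
	assert(rand_bits > 0 && rand_bits <= 64);

	bit = rand_word & 1;
	rand_bits--;
	rand_word >>= 1;
	return bit;
}
```

No locking is modelled here; in the original code the unsynchronised
update of the shared state was deliberate.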
On Tue, 2019-09-10 at 14:11 +0200, Michal Hocko wrote:
> On Mon 09-09-19 08:11:36, Alexander Duyck wrote:
> > On Mon, 2019-09-09 at 10:14 +0200, David Hildenbrand wrote:
> > > On 07.09.19 19:25, Alexander Duyck wrote:
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > Change the logic used to generate randomness in the suffle path so that we
> > > > can avoid cache line bouncing. The previous logic was sharing the offset
> > > > and entropy word between all CPUs. As such this can result in cache line
> > > > bouncing and will ultimately hurt performance when enabled.
> > >
> > > So, usually we perform such changes if there is real evidence. Do you
> > > have any such performance numbers to back your claims?
> >
> > I'll have to go rerun the test to get the exact numbers. The reason this
> > came up is that my original test was spanning NUMA nodes and that made
> > this more expensive as a result since the memory was both not local to the
> > CPU and was being updated by multiple sockets.
>
> What was the pattern of page freeing in your testing? I am wondering
> because order 0 pages should be prevailing and those usually go via pcp
> lists so they do not get shuffled unless the batch is full IIRC.
So I am pretty sure my previous data was faulty. One side effect of the
page reporting is that it was evicting pages out of the guest and when the
pages were faulted back in they were coming from local page pools. This
was throwing off my early numbers and making tests look better than they
should have for the reported case.
I had previously merged this patch with another one, so I wasn't testing
it on its own; it was part of a bigger set. Now that I have tested it on
its own I can see that it has no significant impact on performance. With
that being the case I will probably just drop it.