2023-01-17 08:50:38

by Jaewon Kim

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

> Using order 4 pages would be helpful for many IOMMUs, but it could spend
> quite much time in page allocation perspective.
>
> The order 4 allocation with __GFP_RECLAIM may spend much time in
> reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> unpredictable delay.
>
> To get reasonable allocation speed from dma-buf system heap, use
> HIGH_ORDER_GFP for order 4 to avoid reclaim.
>
> Signed-off-by: Jaewon Kim <[email protected]>
> ---
> drivers/dma-buf/heaps/system_heap.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index e8bd10e60998..5a405e99ef1e 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -42,11 +42,10 @@ struct dma_heap_attachment {
> };
>
> #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
> -#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
> #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
> | __GFP_NORETRY) & ~__GFP_RECLAIM) \
> | __GFP_COMP)
> -static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
> +static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> /*
> * The selection of the orders used for allocation (1MB, 64K, 4K) is designed
> * to match with the sizes often found in IOMMUs. Using order 4 pages instead
> --
> 2.17.1

added John Stultz <[email protected]>


2023-01-18 08:38:49

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > quite much time in page allocation perspective.
> >
> > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > unpredictable delay.
> >
> > To get reasonable allocation speed from dma-buf system heap, use
> > HIGH_ORDER_GFP for order 4 to avoid reclaim.

Thanks for sharing this!
The case where the allocation gets stuck behind reclaim under pressure
does sound undesirable, but I'd be a bit hesitant to tweak numbers
that have been used for a long while (going back to ion) without a bit
more data.

It might be good to also better understand the tradeoff of potential
on-going impact to performance from using low order pages when the
buffer is used. Do you have any details like or tests that you could
share to help ensure this won't impact other users?

TJ: Do you have any additional thoughts on this?

thanks
-john

2023-01-18 21:33:16

by T.J. Mercier

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > quite much time in page allocation perspective.
> > >
> > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > > unpredictable delay.
> > >
> > > To get reasonable allocation speed from dma-buf system heap, use
> > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
>
> Thanks for sharing this!
> The case where the allocation gets stuck behind reclaim under pressure
> does sound undesirable, but I'd be a bit hesitant to tweak numbers
> that have been used for a long while (going back to ion) without a bit
> more data.
>
> It might be good to also better understand the tradeoff of potential
> on-going impact to performance from using low order pages when the
> buffer is used. Do you have any details like or tests that you could
> share to help ensure this won't impact other users?
>
> TJ: Do you have any additional thoughts on this?
>
I don't have any data on how often we hit reclaim for mid order
allocations. That would be interesting to know. However the 70th
percentile of system-wide buffer sizes while running the camera on my
phone is still only 1 page, so it looks like this change would affect
a subset of use-cases.

Wouldn't this change make it less likely to get an order 4 allocation
(under memory pressure)? The commit message makes me think the goal of
the change is to get more of them.

Actually with the low order being 0, I don't think __GFP_COMP makes
sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.

> thanks
> -john

2023-01-25 09:56:56

by Jaewon Kim

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

> On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > quite much time in page allocation perspective.
> > > >
> > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > > > unpredictable delay.
> > > >
> > > > To get reasonable allocation speed from dma-buf system heap, use
> > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> >
> > Thanks for sharing this!
> > The case where the allocation gets stuck behind reclaim under pressure
> > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > that have been used for a long while (going back to ion) without a bit
> > more data.
> >
> > It might be good to also better understand the tradeoff of potential
> > on-going impact to performance from using low order pages when the
> > buffer is used. Do you have any details like or tests that you could
> > share to help ensure this won't impact other users?
> >
> > TJ: Do you have any additional thoughts on this?
> >
> I don't have any data on how often we hit reclaim for mid order
> allocations. That would be interesting to know. However the 70th
> percentile of system-wide buffer sizes while running the camera on my
> phone is still only 1 page, so it looks like this change would affect
> a subset of use-cases.
>
> Wouldn't this change make it less likely to get an order 4 allocation
> (under memory pressure)? The commit message makes me think the goal of
> the change is to get more of them.

Hello John Stultz

I've been waiting for your next reply.

With my commit, we may gather less number of order 4 pages and fill the
requested size with more number of order 0 pages. I think, howerver, stable
allocation speed is quite important so that corresponding user space
context can move on within a specific time.

Not only compaction but reclaim also, I think, would be invoked more if the
__GFP_RECLAIM is added on order 4. I expect the reclaim could be decreased
if we move to order 0.

Thank you
Jaewon Kim

>
> Actually with the low order being 0, I don't think __GFP_COMP makes
> sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.
>
> > thanks
> > -john


2023-01-25 10:20:10

by Jaewon Kim

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

> > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > quite much time in page allocation perspective.
> > > > >
> > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > > > > unpredictable delay.
> > > > >
> > > > > To get reasonable allocation speed from dma-buf system heap, use
> > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > >
> > > Thanks for sharing this!
> > > The case where the allocation gets stuck behind reclaim under pressure
> > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > that have been used for a long while (going back to ion) without a bit
> > > more data.
> > >
> > > It might be good to also better understand the tradeoff of potential
> > > on-going impact to performance from using low order pages when the
> > > buffer is used. Do you have any details like or tests that you could
> > > share to help ensure this won't impact other users?
> > >
> > > TJ: Do you have any additional thoughts on this?
> > >
> > I don't have any data on how often we hit reclaim for mid order
> > allocations. That would be interesting to know. However the 70th
> > percentile of system-wide buffer sizes while running the camera on my
> > phone is still only 1 page, so it looks like this change would affect
> > a subset of use-cases.
> >
> > Wouldn't this change make it less likely to get an order 4 allocation
> > (under memory pressure)? The commit message makes me think the goal of
> > the change is to get more of them.
>
> Hello John Stultz
>
> I've been waiting for your next reply.
>
> With my commit, we may gather less number of order 4 pages and fill the
> requested size with more number of order 0 pages. I think, howerver, stable
> allocation speed is quite important so that corresponding user space
> context can move on within a specific time.
>
> Not only compaction but reclaim also, I think, would be invoked more if the
> __GFP_RECLAIM is added on order 4. I expect the reclaim could be decreased
> if we move to order 0.
>

Additionally I'd like to say the old legacy ion system heap also used the
__GFP_RECLAIM only for order 8, not for order 4.

drivers/staging/android/ion/ion_system_heap.c

static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
__GFP_NORETRY) & ~__GFP_RECLAIM;
static gfp_t low_order_gfp_flags = GFP_HIGHUSER | __GFP_ZERO;
static const unsigned int orders[] = {8, 4, 0};

static int ion_system_heap_create_pools(struct ion_page_pool **pools)
{
int i;

for (i = 0; i < NUM_ORDERS; i++) {
struct ion_page_pool *pool;
gfp_t gfp_flags = low_order_gfp_flags;

if (orders[i] > 4)
gfp_flags = high_order_gfp_flags;


> Thank you
> Jaewon Kim
>
> >
> > Actually with the low order being 0, I don't think __GFP_COMP makes
> > sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.
> >
> > > thanks
> > > -john

2023-01-25 20:32:56

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <[email protected]> wrote:
> > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > > quite much time in page allocation perspective.
> > > > > >
> > > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > > > > > unpredictable delay.
> > > > > >
> > > > > > To get reasonable allocation speed from dma-buf system heap, use
> > > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > > >
> > > > Thanks for sharing this!
> > > > The case where the allocation gets stuck behind reclaim under pressure
> > > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > > that have been used for a long while (going back to ion) without a bit
> > > > more data.
> > > >
> > > > It might be good to also better understand the tradeoff of potential
> > > > on-going impact to performance from using low order pages when the
> > > > buffer is used. Do you have any details like or tests that you could
> > > > share to help ensure this won't impact other users?
> > > >
> > > > TJ: Do you have any additional thoughts on this?
> > > >
> > > I don't have any data on how often we hit reclaim for mid order
> > > allocations. That would be interesting to know. However the 70th
> > > percentile of system-wide buffer sizes while running the camera on my
> > > phone is still only 1 page, so it looks like this change would affect
> > > a subset of use-cases.
> > >
> > > Wouldn't this change make it less likely to get an order 4 allocation
> > > (under memory pressure)? The commit message makes me think the goal of
> > > the change is to get more of them.
> >
> > Hello John Stultz
> >
> > I've been waiting for your next reply.

Sorry, I was thinking you were gathering data on the tradeoffs. Sorry
for my confusion.

> > With my commit, we may gather less number of order 4 pages and fill the
> > requested size with more number of order 0 pages. I think, howerver, stable
> > allocation speed is quite important so that corresponding user space
> > context can move on within a specific time.
> >
> > Not only compaction but reclaim also, I think, would be invoked more if the
> > __GFP_RECLAIM is added on order 4. I expect the reclaim could be decreased
> > if we move to order 0.
> >
>
> Additionally I'd like to say the old legacy ion system heap also used the
> __GFP_RECLAIM only for order 8, not for order 4.
>
> drivers/staging/android/ion/ion_system_heap.c
>
> static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> __GFP_NORETRY) & ~__GFP_RECLAIM;
> static gfp_t low_order_gfp_flags = GFP_HIGHUSER | __GFP_ZERO;
> static const unsigned int orders[] = {8, 4, 0};
>
> static int ion_system_heap_create_pools(struct ion_page_pool **pools)
> {
> int i;
>
> for (i = 0; i < NUM_ORDERS; i++) {
> struct ion_page_pool *pool;
> gfp_t gfp_flags = low_order_gfp_flags;
>
> if (orders[i] > 4)
> gfp_flags = high_order_gfp_flags;


This seems a bit backwards from your statement. It's only removing
__GFP_RECLAIM on order 8 (high_order_gfp_flags).

So apologies again, but how is that different from the existing code?

#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
| __GFP_NORETRY) & ~__GFP_RECLAIM) \
| __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};

Where the main reason we introduced the mid-order flags is to avoid
the warnings on order 4 allocation failures when we'll fall back to
order 0

The only substantial difference I see between the old ion code and
what we have now is the GFP_COMP addition, which is a bit hazy in my
memory. I unfortunately don't have a record of why it was added (don't
have access to my old mail box), so I suspect it was something brought
up in private review. Dropping that from the low order flags probably
makes sense as TJ pointed out, but this isn't what your patch is
changing.

Your patch is changing that for mid-order allocations we'll use the
high order flags, so we'll not retry and not reclaim, so there will be
more failing and falling back to single page allocations.
This makes sense to make allocation time faster and more deterministic
(I like it!), but potentially has the tradeoff of losing the
performance benefit of using mid order page sizes.

I suspect your change is a net win overall, as the cumulative effect
of using larger pages probably won't benefit more than the large
indeterministic allocation time, particularly under pressure.

But because your change is different from what the old ion code did, I
want to be a little cautious. So it would be nice to see some
evaluation of not just the benefits the patch provides you but also of
what negative impact it might have. And so far you haven't provided
any details there.

A quick example might be for the use case where mid-order allocations
are causing you trouble, you could see how the performance changes if
you force all mid-order allocations to be single page allocations (so
orders[] = {8, 0, 0};) and compare it with the current code when
there's no memory pressure (right after reboot when pages haven't been
fragmented) so the mid-order allocations will succeed. That will let
us know the potential downside if we have brief / transient pressure
at allocation time that forces small pages.

Does that make sense?

thanks
-john

2023-01-26 04:42:33

by Jaewon Kim

[permalink] [raw]
Subject: RE:(2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

> On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <[email protected]> wrote:
> > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <[email protected]> wrote:
> > > > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > > > quite much time in page allocation perspective.
> > > > > > >
> > > > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > > > reclaim and compation logic. __GFP_NORETRY also may affect. These cause
> > > > > > > unpredictable delay.
> > > > > > >
> > > > > > > To get reasonable allocation speed from dma-buf system heap, use
> > > > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > > > >
> > > > > Thanks for sharing this!
> > > > > The case where the allocation gets stuck behind reclaim under pressure
> > > > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > > > that have been used for a long while (going back to ion) without a bit
> > > > > more data.
> > > > >
> > > > > It might be good to also better understand the tradeoff of potential
> > > > > on-going impact to performance from using low order pages when the
> > > > > buffer is used. Do you have any details like or tests that you could
> > > > > share to help ensure this won't impact other users?
> > > > >
> > > > > TJ: Do you have any additional thoughts on this?
> > > > >
> > > > I don't have any data on how often we hit reclaim for mid order
> > > > allocations. That would be interesting to know. However the 70th
> > > > percentile of system-wide buffer sizes while running the camera on my
> > > > phone is still only 1 page, so it looks like this change would affect
> > > > a subset of use-cases.
> > > >
> > > > Wouldn't this change make it less likely to get an order 4 allocation
> > > > (under memory pressure)? The commit message makes me think the goal of
> > > > the change is to get more of them.
> > >
> > > Hello John Stultz
> > >
> > > I've been waiting for your next reply.
>
> Sorry, I was thinking you were gathering data on the tradeoffs. Sorry
> for my confusion.
>
> > > With my commit, we may gather less number of order 4 pages and fill the
> > > requested size with more number of order 0 pages. I think, howerver, stable
> > > allocation speed is quite important so that corresponding user space
> > > context can move on within a specific time.
> > >
> > > Not only compaction but reclaim also, I think, would be invoked more if the
> > > __GFP_RECLAIM is added on order 4. I expect the reclaim could be decreased
> > > if we move to order 0.
> > >
> >
> > Additionally I'd like to say the old legacy ion system heap also used the
> > __GFP_RECLAIM only for order 8, not for order 4.
> >
> > drivers/staging/android/ion/ion_system_heap.c
> >
> > static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> > __GFP_NORETRY) & ~__GFP_RECLAIM;
> > static gfp_t low_order_gfp_flags = GFP_HIGHUSER | __GFP_ZERO;
> > static const unsigned int orders[] = {8, 4, 0};
> >
> > static int ion_system_heap_create_pools(struct ion_page_pool **pools)
> > {
> > int i;
> >
> > for (i = 0; i < NUM_ORDERS; i++) {
> > struct ion_page_pool *pool;
> > gfp_t gfp_flags = low_order_gfp_flags;
> >
> > if (orders[i] > 4)
> > gfp_flags = high_order_gfp_flags;
>
>
> This seems a bit backwards from your statement. It's only removing
> __GFP_RECLAIM on order 8 (high_order_gfp_flags).

Oh sorry, my fault. I also read wrongly. But as far as I know, most of
AP chipset vendors have been using __GFP_RECLAIM only for order 0.
I can't say in detail though.

>
> So apologies again, but how is that different from the existing code?
>
> #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
> #define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
> #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
> | __GFP_NORETRY) & ~__GFP_RECLAIM) \
> | __GFP_COMP)
> static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
>
> Where the main reason we introduced the mid-order flags is to avoid
> the warnings on order 4 allocation failures when we'll fall back to
> order 0
>
> The only substantial difference I see between the old ion code and
> what we have now is the GFP_COMP addition, which is a bit hazy in my
> memory. I unfortunately don't have a record of why it was added (don't
> have access to my old mail box), so I suspect it was something brought
> up in private review. Dropping that from the low order flags probably
> makes sense as TJ pointed out, but this isn't what your patch is
> changing.
>
> Your patch is changing that for mid-order allocations we'll use the
> high order flags, so we'll not retry and not reclaim, so there will be
> more failing and falling back to single page allocations.
> This makes sense to make allocation time faster and more deterministic
> (I like it!), but potentially has the tradeoff of losing the
> performance benefit of using mid order page sizes.
>
> I suspect your change is a net win overall, as the cumulative effect
> of using larger pages probably won't benefit more than the large
> indeterministic allocation time, particularly under pressure.
>
> But because your change is different from what the old ion code did, I
> want to be a little cautious. So it would be nice to see some
> evaluation of not just the benefits the patch provides you but also of
> what negative impact it might have. And so far you haven't provided
> any details there.
>
> A quick example might be for the use case where mid-order allocations
> are causing you trouble, you could see how the performance changes if
> you force all mid-order allocations to be single page allocations (so
> orders[] = {8, 0, 0};) and compare it with the current code when
> there's no memory pressure (right after reboot when pages haven't been
> fragmented) so the mid-order allocations will succeed. That will let
> us know the potential downside if we have brief / transient pressure
> at allocation time that forces small pages.
>
> Does that make sense?

Let me try this. It make take some days. But I guess it depends on memory
status as you said. If there were quite many order 4 pages, then 8 4 0
should be faster than 8 0 0.

I don't know this is a right approach. In my opinion, except the specific
cases like right after reboot, there are not many order 4 pages. And
in determinisitic allocation time perspective, I think avoiding too long
allocations is more important than making faster with already existing
free order 4 pages.

BR
Jaewon Kim

>
> thanks
> -john

2023-01-26 05:05:00

by John Stultz

[permalink] [raw]
Subject: Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Wed, Jan 25, 2023 at 8:42 PM 김재원 <[email protected]> wrote:
> > On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <[email protected]> wrote:
> > > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
> > But because your change is different from what the old ion code did, I
> > want to be a little cautious. So it would be nice to see some
> > evaluation of not just the benefits the patch provides you but also of
> > what negative impact it might have. And so far you haven't provided
> > any details there.
> >
> > A quick example might be for the use case where mid-order allocations
> > are causing you trouble, you could see how the performance changes if
> > you force all mid-order allocations to be single page allocations (so
> > orders[] = {8, 0, 0};) and compare it with the current code when
> > there's no memory pressure (right after reboot when pages haven't been
> > fragmented) so the mid-order allocations will succeed. That will let
> > us know the potential downside if we have brief / transient pressure
> > at allocation time that forces small pages.
> >
> > Does that make sense?
>
> Let me try this. It make take some days. But I guess it depends on memory
> status as you said. If there were quite many order 4 pages, then 8 4 0
> should be faster than 8 0 0.
>
> I don't know this is a right approach. In my opinion, except the specific
> cases like right after reboot, there are not many order 4 pages. And
> in determinisitic allocation time perspective, I think avoiding too long
> allocations is more important than making faster with already existing
> free order 4 pages.

I suspect you are right, and do think your change will be helpful.
But I just want to make sure we're doing some due diligence, instead
of going on just gut instinct.

Thanks so much for helping with this!
-john

2023-02-04 15:02:29

by Jaewon Kim

[permalink] [raw]
Subject: RE:(2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

>
> --------- Original Message ---------
>
> Sender : John Stultz <[email protected]>
>
> Date : 2023-01-26 14:04 (GMT+9)
>
> Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
>
>
>
> On Wed, Jan 25, 2023 at 8:42 PM 김재원 <[email protected]> wrote:
>
> > > On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <[email protected]> wrote:
>
> > > > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <[email protected]> wrote:
>
> > > But because your change is different from what the old ion code did, I
>
> > > want to be a little cautious. So it would be nice to see some
>
> > > evaluation of not just the benefits the patch provides you but also of
>
> > > what negative impact it might have. And so far you haven't provided
>
> > > any details there.
>
> > >
>
> > > A quick example might be for the use case where mid-order allocations
>
> > > are causing you trouble, you could see how the performance changes if
>
> > > you force all mid-order allocations to be single page allocations (so
>
> > > orders[] = {8, 0, 0};) and compare it with the current code when
>
> > > there's no memory pressure (right after reboot when pages haven't been
>
> > > fragmented) so the mid-order allocations will succeed. That will let
>
> > > us know the potential downside if we have brief / transient pressure
>
> > > at allocation time that forces small pages.
>
> > >
>
> > > Does that make sense?
>
> >
>
> > Let me try this. It make take some days. But I guess it depends on memory
>
> > status as you said. If there were quite many order 4 pages, then 8 4 0
>
> > should be faster than 8 0 0.
>
> >
>
> > I don't know this is a right approach. In my opinion, except the specific
>
> > cases like right after reboot, there are not many order 4 pages. And
>
> > in determinisitic allocation time perspective, I think avoiding too long
>
> > allocations is more important than making faster with already existing
>
> > free order 4 pages.
>
>
>
> I suspect you are right, and do think your change will be helpful.
>
> But I just want to make sure we're doing some due diligence, instead
>
> of going on just gut instinct.
>
>
>
> Thanks so much for helping with this!
>
> -john
>
>

Hello John Stultz, sorry for late reply.
I had to manage other urgent things and this test also took some time to finish.
Any I hope you to be happy with following my test results.


1. system heap modification

To avoid effect of allocation from the pool, all the freed dma
buffer were passed to buddy without keeping them in the pool.
Some trace_printk and order counting logic were added.

2. the test tool

To test the dma-buf system heap allocation speed, I prepared
a userspace test program which requests a specified size to a heap.
With the program, I tried to request 16 times of 10 MB size and
added 1 sleep between each request. Each memory was not freed
until the total 16 times total memory was allocated.

3. the test device

The test device has arm64 CPU cores and v5.15 based kernel.
To get stable results, the CPU clock was fixed not to be changed
in run time, and the test tool was set to some specific CPU cores
running in the same CPU clock.

4. test results

As we expected if order 4 exist in the buddy, the order 8, 4, 0
allocation was 1 to 4 times faster than the order 8, 0, 0. But
the order 8, 0, 0 also looks fast enough.

Here's time diff, and number of each order.

order 8, 4, 0 in the enough order 4 case

diff 8 4 0
665 usec 0 160 0
1,148 usec 0 160 0
1,089 usec 0 160 0
1,154 usec 0 160 0
1,264 usec 0 160 0
1,414 usec 0 160 0
873 usec 0 160 0
1,148 usec 0 160 0
1,158 usec 0 160 0
1,139 usec 0 160 0
1,169 usec 0 160 0
1,174 usec 0 160 0
1,210 usec 0 160 0
995 usec 0 160 0
1,151 usec 0 160 0
977 usec 0 160 0

order 8, 0, 0 in the enough order 4 case

diff 8 4 0
441 usec 10 0 0
747 usec 10 0 0
2,330 usec 2 0 2048
2,469 usec 0 0 2560
2,518 usec 0 0 2560
1,176 usec 0 0 2560
1,487 usec 0 0 2560
1,402 usec 0 0 2560
1,449 usec 0 0 2560
1,330 usec 0 0 2560
1,089 usec 0 0 2560
1,481 usec 0 0 2560
1,326 usec 0 0 2560
3,057 usec 0 0 2560
2,758 usec 0 0 2560
3,271 usec 0 0 2560

From the perspective of responsiveness, the deterministic
memory allocation speed, I think, is quite important. So I
tested other case where the free memory are not enough.

On this test, I ran the 16 times allocation sets twice
consecutively. Then it showed the first set order 8, 4, 0
became very slow and varied, but the second set became
faster because of the already created the high order.

order 8, 4, 0 in low memory

diff 8 4 0
584 usec 0 160 0
28,428 usec 0 160 0
100,701 usec 0 160 0
76,645 usec 0 160 0
25,522 usec 0 160 0
38,798 usec 0 160 0
89,012 usec 0 160 0
23,015 usec 0 160 0
73,360 usec 0 160 0
76,953 usec 0 160 0
31,492 usec 0 160 0
75,889 usec 0 160 0
84,551 usec 0 160 0
84,352 usec 0 160 0
57,103 usec 0 160 0
93,452 usec 0 160 0

diff 8 4 0
808 usec 10 0 0
778 usec 4 96 0
829 usec 0 160 0
700 usec 0 160 0
937 usec 0 160 0
651 usec 0 160 0
636 usec 0 160 0
811 usec 0 160 0
622 usec 0 160 0
674 usec 0 160 0
677 usec 0 160 0
738 usec 0 160 0
1,130 usec 0 160 0
677 usec 0 160 0
553 usec 0 160 0
1,048 usec 0 160 0


order 8, 0, 0 in low memory

diff 8 4 0
1,699 usec 2 0 2048
2,082 usec 0 0 2560
840 usec 0 0 2560
875 usec 0 0 2560
845 usec 0 0 2560
1,706 usec 0 0 2560
967 usec 0 0 2560
1,000 usec 0 0 2560
1,905 usec 0 0 2560
2,451 usec 0 0 2560
3,384 usec 0 0 2560
2,397 usec 0 0 2560
3,171 usec 0 0 2560
2,376 usec 0 0 2560
3,347 usec 0 0 2560
2,554 usec 0 0 2560

diff 8 4 0
1,409 usec 2 0 2048
1,438 usec 0 0 2560
1,035 usec 0 0 2560
1,108 usec 0 0 2560
825 usec 0 0 2560
927 usec 0 0 2560
1,931 usec 0 0 2560
2,024 usec 0 0 2560
1,884 usec 0 0 2560
1,769 usec 0 0 2560
2,136 usec 0 0 2560
1,738 usec 0 0 2560
1,328 usec 0 0 2560
1,438 usec 0 0 2560
1,972 usec 0 0 2560
2,963 usec 0 0 2560

Finally if we change order 4 to use HIGH_ORDER_GFP,
I expect that we could avoid the very slow cases.

order 8, 4, 0 in low memory with HIGH_ORDER_GFP

diff 8 4 0
1,356 usec 0 155 80
1,901 usec 0 11 2384
1,912 usec 0 0 2560
1,911 usec 0 0 2560
1,884 usec 0 0 2560
1,577 usec 0 0 2560
1,366 usec 0 0 2560
1,711 usec 0 0 2560
1,635 usec 0 28 2112
544 usec 10 0 0
633 usec 2 128 0
848 usec 0 160 0
729 usec 0 160 0
1,000 usec 0 160 0
1,358 usec 0 160 0
2,638 usec 0 31 2064

diff 8 4 0
669 usec 10 0 0
789 usec 8 32 0
603 usec 3 112 0
578 usec 0 160 0
562 usec 0 160 0
564 usec 0 160 0
686 usec 0 160 0
1,621 usec 0 160 0
2,080 usec 0 40 1920
1,749 usec 0 0 2560
2,244 usec 0 0 2560
2,333 usec 0 0 2560
1,257 usec 0 0 2560
1,703 usec 0 0 2560
1,782 usec 0 1 2544
2,225 usec 0 0 2560

Thank you
Jaewon Kim

2023-02-07 04:37:59

by John Stultz

[permalink] [raw]
Subject: Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <[email protected]> wrote:
> Hello John Stultz, sorry for late reply.
> I had to manage other urgent things and this test also took some time to finish.
> Any I hope you to be happy with following my test results.
>
>
> 1. system heap modification
>
> To avoid effect of allocation from the pool, all the freed dma
> buffer were passed to buddy without keeping them in the pool.
> Some trace_printk and order counting logic were added.
>
> 2. the test tool
>
> To test the dma-buf system heap allocation speed, I prepared
> a userspace test program which requests a specified size to a heap.
> With the program, I tried to request 16 times of 10 MB size and
> added 1 sleep between each request. Each memory was not freed
> until the total 16 times total memory was allocated.

Oof. I really appreciate all your effort that I'm sure went in to
generate these numbers, but this wasn't quite what I was asking for.
I know you've been focused on allocation performance under memory
pressure, but I was hoping to see the impact of __using__ order 0
pages over order 4 pages in real world conditions (for camera or video
recording or other use cases that use large allocations). These
results seem to be still just focused on the difference in allocation
performance between order 0 and order 4 with and without contention.

That said, re-reading my email, I probably could have been more clear
on this aspect.


> 3. the test device
>
> The test device has arm64 CPU cores and v5.15 based kernel.
> To get stable results, the CPU clock was fixed not to be changed
> in run time, and the test tool was set to some specific CPU cores
> running in the same CPU clock.
>
> 4. test results
>
> As we expected if order 4 exist in the buddy, the order 8, 4, 0
> allocation was 1 to 4 times faster than the order 8, 0, 0. But
> the order 8, 0, 0 also looks fast enough.
>
> Here's time diff, and number of each order.
>
> order 8, 4, 0 in the enough order 4 case
>
> diff 8 4 0
> 665 usec 0 160 0
> 1,148 usec 0 160 0
> 1,089 usec 0 160 0
> 1,154 usec 0 160 0
> 1,264 usec 0 160 0
> 1,414 usec 0 160 0
> 873 usec 0 160 0
> 1,148 usec 0 160 0
> 1,158 usec 0 160 0
> 1,139 usec 0 160 0
> 1,169 usec 0 160 0
> 1,174 usec 0 160 0
> 1,210 usec 0 160 0
> 995 usec 0 160 0
> 1,151 usec 0 160 0
> 977 usec 0 160 0
>
> order 8, 0, 0 in the enough order 4 case
>
> diff 8 4 0
> 441 usec 10 0 0
> 747 usec 10 0 0
> 2,330 usec 2 0 2048
> 2,469 usec 0 0 2560
> 2,518 usec 0 0 2560
> 1,176 usec 0 0 2560
> 1,487 usec 0 0 2560
> 1,402 usec 0 0 2560
> 1,449 usec 0 0 2560
> 1,330 usec 0 0 2560
> 1,089 usec 0 0 2560
> 1,481 usec 0 0 2560
> 1,326 usec 0 0 2560
> 3,057 usec 0 0 2560
> 2,758 usec 0 0 2560
> 3,271 usec 0 0 2560
>
> From the perspective of responsiveness, the deterministic
> memory allocation speed, I think, is quite important. So I
> tested other case where the free memory are not enough.
>
> On this test, I ran the 16 times allocation sets twice
> consecutively. Then it showed the first set order 8, 4, 0
> became very slow and varied, but the second set became
> faster because of the already created the high order.
>
> order 8, 4, 0 in low memory
>
> diff 8 4 0
> 584 usec 0 160 0
> 28,428 usec 0 160 0
> 100,701 usec 0 160 0
> 76,645 usec 0 160 0
> 25,522 usec 0 160 0
> 38,798 usec 0 160 0
> 89,012 usec 0 160 0
> 23,015 usec 0 160 0
> 73,360 usec 0 160 0
> 76,953 usec 0 160 0
> 31,492 usec 0 160 0
> 75,889 usec 0 160 0
> 84,551 usec 0 160 0
> 84,352 usec 0 160 0
> 57,103 usec 0 160 0
> 93,452 usec 0 160 0
>
> diff 8 4 0
> 808 usec 10 0 0
> 778 usec 4 96 0
> 829 usec 0 160 0
> 700 usec 0 160 0
> 937 usec 0 160 0
> 651 usec 0 160 0
> 636 usec 0 160 0
> 811 usec 0 160 0
> 622 usec 0 160 0
> 674 usec 0 160 0
> 677 usec 0 160 0
> 738 usec 0 160 0
> 1,130 usec 0 160 0
> 677 usec 0 160 0
> 553 usec 0 160 0
> 1,048 usec 0 160 0
>
>
> order 8, 0, 0 in low memory
>
> diff 8 4 0
> 1,699 usec 2 0 2048
> 2,082 usec 0 0 2560
> 840 usec 0 0 2560
> 875 usec 0 0 2560
> 845 usec 0 0 2560
> 1,706 usec 0 0 2560
> 967 usec 0 0 2560
> 1,000 usec 0 0 2560
> 1,905 usec 0 0 2560
> 2,451 usec 0 0 2560
> 3,384 usec 0 0 2560
> 2,397 usec 0 0 2560
> 3,171 usec 0 0 2560
> 2,376 usec 0 0 2560
> 3,347 usec 0 0 2560
> 2,554 usec 0 0 2560
>
> diff 8 4 0
> 1,409 usec 2 0 2048
> 1,438 usec 0 0 2560
> 1,035 usec 0 0 2560
> 1,108 usec 0 0 2560
> 825 usec 0 0 2560
> 927 usec 0 0 2560
> 1,931 usec 0 0 2560
> 2,024 usec 0 0 2560
> 1,884 usec 0 0 2560
> 1,769 usec 0 0 2560
> 2,136 usec 0 0 2560
> 1,738 usec 0 0 2560
> 1,328 usec 0 0 2560
> 1,438 usec 0 0 2560
> 1,972 usec 0 0 2560
> 2,963 usec 0 0 2560

So, thank you for generating all of this. I think this all looks as
expected, showing the benefit of your change to allocation under
contention and showing the potential downside in the non-contention
case.

I still worry about the performance impact outside of allocation time
of using the smaller order pages (via map and unmap through iommu to
devices, etc), so it would still be nice to have some confidence this
won't introduce other regressions, but I do agree the worse case
impact is very bad.

> Finally if we change order 4 to use HIGH_ORDER_GFP,
> I expect that we could avoid the very slow cases.
>

Yeah. Again, this all aligns with the upside of your changes, which
I'm eager for.
I just want to have a strong sense of any regressions it might also cause.

I don't mean to discourage you, especially after all the effort here.
Do you think evaluating the before and after impact to buffer usage
(not just allocation) would be doable in the near term?

If you don't think so, given the benefit to allocation under pressure
is large (and I don't mean to give you hurdles to jump), I'm willing
to ack your change to get it merged, but if we later see performance
trouble, I'll be quick to advocate for reverting it. Is that ok?

thanks
-john

2023-02-07 07:33:47

by Jaewon Kim

[permalink] [raw]
Subject: RE: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

>
>
>
>
>--------- Original Message ---------
>
>Sender : John Stultz <[email protected]>
>
>Date : 2023-02-07 13:37 (GMT+9)
>
>Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
>
>
>
>On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <[email protected]> wrote:
>
>> Hello John Stultz, sorry for late reply.
>
>> I had to manage other urgent things and this test also took some time to finish.
>
>> Any I hope you to be happy with following my test results.
>
>>
>
>>
>
>> 1. system heap modification
>
>>
>
>> To avoid effect of allocation from the pool, all the freed dma
>
>> buffer were passed to buddy without keeping them in the pool.
>
>> Some trace_printk and order counting logic were added.
>
>>
>
>> 2. the test tool
>
>>
>
>> To test the dma-buf system heap allocation speed, I prepared
>
>> a userspace test program which requests a specified size to a heap.
>
>> With the program, I tried to request 16 times of 10 MB size and
>
>> added 1 sleep between each request. Each memory was not freed
>
>> until the total 16 times total memory was allocated.
>
>
>
>Oof. I really appreciate all your effort that I'm sure went in to
>
>generate these numbers, but this wasn't quite what I was asking for.
>
>I know you've been focused on allocation performance under memory
>
>pressure, but I was hoping to see the impact of __using__ order 0
>
>pages over order 4 pages in real world conditions (for camera or video
>
>recording or other use cases that use large allocations). These
>
>results seem to be still just focused on the difference in allocation
>
>performance between order 0 and order 4 with and without contention.
>
>
>
>That said, re-reading my email, I probably could have been more clear
>
>on this aspect.
>
>
>
>
>
>> 3. the test device
>
>>
>
>> The test device has arm64 CPU cores and v5.15 based kernel.
>
>> To get stable results, the CPU clock was fixed not to be changed
>
>> in run time, and the test tool was set to some specific CPU cores
>
>> running in the same CPU clock.
>
>>
>
>> 4. test results
>
>>
>
>> As we expected if order 4 exist in the buddy, the order 8, 4, 0
>
>> allocation was 1 to 4 times faster than the order 8, 0, 0. But
>
>> the order 8, 0, 0 also looks fast enough.
>
>>
>
>> Here's time diff, and number of each order.
>
>>
>
>> order 8, 4, 0 in the enough order 4 case
>
>>
>
>> diff 8 4 0
>
>> 665 usec 0 160 0
>
>> 1,148 usec 0 160 0
>
>> 1,089 usec 0 160 0
>
>> 1,154 usec 0 160 0
>
>> 1,264 usec 0 160 0
>
>> 1,414 usec 0 160 0
>
>> 873 usec 0 160 0
>
>> 1,148 usec 0 160 0
>
>> 1,158 usec 0 160 0
>
>> 1,139 usec 0 160 0
>
>> 1,169 usec 0 160 0
>
>> 1,174 usec 0 160 0
>
>> 1,210 usec 0 160 0
>
>> 995 usec 0 160 0
>
>> 1,151 usec 0 160 0
>
>> 977 usec 0 160 0
>
>>
>
>> order 8, 0, 0 in the enough order 4 case
>
>>
>
>> diff 8 4 0
>
>> 441 usec 10 0 0
>
>> 747 usec 10 0 0
>
>> 2,330 usec 2 0 2048
>
>> 2,469 usec 0 0 2560
>
>> 2,518 usec 0 0 2560
>
>> 1,176 usec 0 0 2560
>
>> 1,487 usec 0 0 2560
>
>> 1,402 usec 0 0 2560
>
>> 1,449 usec 0 0 2560
>
>> 1,330 usec 0 0 2560
>
>> 1,089 usec 0 0 2560
>
>> 1,481 usec 0 0 2560
>
>> 1,326 usec 0 0 2560
>
>> 3,057 usec 0 0 2560
>
>> 2,758 usec 0 0 2560
>
>> 3,271 usec 0 0 2560
>
>>
>
>> From the perspective of responsiveness, the deterministic
>
>> memory allocation speed, I think, is quite important. So I
>
>> tested other case where the free memory are not enough.
>
>>
>
>> On this test, I ran the 16 times allocation sets twice
>
>> consecutively. Then it showed the first set order 8, 4, 0
>
>> became very slow and varied, but the second set became
>
>> faster because of the already created the high order.
>
>>
>
>> order 8, 4, 0 in low memory
>
>>
>
>> diff 8 4 0
>
>> 584 usec 0 160 0
>
>> 28,428 usec 0 160 0
>
>> 100,701 usec 0 160 0
>
>> 76,645 usec 0 160 0
>
>> 25,522 usec 0 160 0
>
>> 38,798 usec 0 160 0
>
>> 89,012 usec 0 160 0
>
>> 23,015 usec 0 160 0
>
>> 73,360 usec 0 160 0
>
>> 76,953 usec 0 160 0
>
>> 31,492 usec 0 160 0
>
>> 75,889 usec 0 160 0
>
>> 84,551 usec 0 160 0
>
>> 84,352 usec 0 160 0
>
>> 57,103 usec 0 160 0
>
>> 93,452 usec 0 160 0
>
>>
>
>> diff 8 4 0
>
>> 808 usec 10 0 0
>
>> 778 usec 4 96 0
>
>> 829 usec 0 160 0
>
>> 700 usec 0 160 0
>
>> 937 usec 0 160 0
>
>> 651 usec 0 160 0
>
>> 636 usec 0 160 0
>
>> 811 usec 0 160 0
>
>> 622 usec 0 160 0
>
>> 674 usec 0 160 0
>
>> 677 usec 0 160 0
>
>> 738 usec 0 160 0
>
>> 1,130 usec 0 160 0
>
>> 677 usec 0 160 0
>
>> 553 usec 0 160 0
>
>> 1,048 usec 0 160 0
>
>>
>
>>
>
>> order 8, 0, 0 in low memory
>
>>
>
>> diff 8 4 0
>
>> 1,699 usec 2 0 2048
>
>> 2,082 usec 0 0 2560
>
>> 840 usec 0 0 2560
>
>> 875 usec 0 0 2560
>
>> 845 usec 0 0 2560
>
>> 1,706 usec 0 0 2560
>
>> 967 usec 0 0 2560
>
>> 1,000 usec 0 0 2560
>
>> 1,905 usec 0 0 2560
>
>> 2,451 usec 0 0 2560
>
>> 3,384 usec 0 0 2560
>
>> 2,397 usec 0 0 2560
>
>> 3,171 usec 0 0 2560
>
>> 2,376 usec 0 0 2560
>
>> 3,347 usec 0 0 2560
>
>> 2,554 usec 0 0 2560
>
>>
>
>> diff 8 4 0
>
>> 1,409 usec 2 0 2048
>
>> 1,438 usec 0 0 2560
>
>> 1,035 usec 0 0 2560
>
>> 1,108 usec 0 0 2560
>
>> 825 usec 0 0 2560
>
>> 927 usec 0 0 2560
>
>> 1,931 usec 0 0 2560
>
>> 2,024 usec 0 0 2560
>
>> 1,884 usec 0 0 2560
>
>> 1,769 usec 0 0 2560
>
>> 2,136 usec 0 0 2560
>
>> 1,738 usec 0 0 2560
>
>> 1,328 usec 0 0 2560
>
>> 1,438 usec 0 0 2560
>
>> 1,972 usec 0 0 2560
>
>> 2,963 usec 0 0 2560
>
>
>
>So, thank you for generating all of this. I think this all looks as
>
>expected, showing the benefit of your change to allocation under
>
>contention and showing the potential downside in the non-contention
>
>case.
>
>
>
>I still worry about the performance impact outside of allocation time
>
>of using the smaller order pages (via map and unmap through iommu to
>
>devices, etc), so it would still be nice to have some confidence this
>
>won't introduce other regressions, but I do agree the worse case
>
>impact is very bad.
>
>
>
>> Finally if we change order 4 to use HIGH_ORDER_GFP,
>
>> I expect that we could avoid the very slow cases.
>
>>
>
>
>
>Yeah. Again, this all aligns with the upside of your changes, which
>
>I'm eager for.
>
>I just want to have a strong sense of any regressions it might also cause.
>
>
>
>I don't mean to discourage you, especially after all the effort here.
>
>Do you think evaluating the before and after impact to buffer usage
>
>(not just allocation) would be doable in the near term?
>

Hello sorry but I don't have expertise on iommu. Actually I'm also wondering
all IOMMU can use order 4 free pages, if they are allocated. I am not sure
but I remember I heard order 9 (2MB) could be used, but I don't know about order 8 4.

I guess IOMMU mmap also be same patern like we expect. I mean if order 4 is
prepared it could be faster like 1 to 4 times. But it, I think, should NOT be
that much slow even though the entire free memory is prepared as order 0 pages.

>
>
>If you don't think so, given the benefit to allocation under pressure
>
>is large (and I don't mean to give you hurdles to jump), I'm willing
>
>to ack your change to get it merged, but if we later see performance
>
>trouble, I'll be quick to advocate for reverting it. Is that ok?
>

Yes sure. I also want to know if it is.
Thank you

>
>
>thanks
>
>-john
>
>

2023-02-07 16:56:26

by John Stultz

[permalink] [raw]
Subject: Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

On Mon, Feb 6, 2023 at 11:33 PM Jaewon Kim <[email protected]> wrote:
> >I don't mean to discourage you, especially after all the effort here.
> >
> >Do you think evaluating the before and after impact to buffer usage
> >
> >(not just allocation) would be doable in the near term?
> >
>
> Hello sorry but I don't have expertise on iommu. Actually I'm also wondering
> all IOMMU can use order 4 free pages, if they are allocated. I am not sure
> but I remember I heard order 9 (2MB) could be used, but I don't know about order 8 4.
>
> I guess IOMMU mmap also be same patern like we expect. I mean if order 4 is
> prepared it could be faster like 1 to 4 times. But it, I think, should NOT be
> that much slow even though the entire free memory is prepared as order 0 pages.
>
> >
> >
> >If you don't think so, given the benefit to allocation under pressure
> >
> >is large (and I don't mean to give you hurdles to jump), I'm willing
> >
> >to ack your change to get it merged, but if we later see performance
> >
> >trouble, I'll be quick to advocate for reverting it. Is that ok?
> >
>
> Yes sure. I also want to know if it is.

Ok. Please resend your latest patch and I'll go ahead and ack it and
we'll watch.

Thanks again for your efforts here!
-john