2022-04-28 10:47:40

by Thomas Weißschuh

Subject: [PATCH] nvme-pci: fix host memory buffer allocation size

We want to allocate the smallest possible amount of buffers with the
largest possible size (1 buffer of size "hmpre").

Previously we were allocating as many buffers as possible of the smallest
possible size.
This also led to "hmpre" not being satisfied, as not enough buffer slots
were available.

Signed-off-by: Thomas Weißschuh <[email protected]>
---

Also discussed at https://lore.kernel.org/linux-nvme/[email protected]/

drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aacf1c0d5a5..0546523cc20b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2090,7 +2090,7 @@ static int __nvme_alloc_host_mem(struct nvme_dev *dev, u64 preferred,

static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
{
- u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
+ u64 min_chunk = max_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
u64 chunk_size;


base-commit: 46cf2c613f4b10eb12f749207b0fd2c1bfae3088
--
2.36.0


2022-04-29 01:04:13

by Thomas Weißschuh

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On 2022-04-28 16:36+0200, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 12:19:22PM +0200, Thomas Weißschuh wrote:
> > We want to allocate the smallest possible amount of buffers with the
> > largest possible size (1 buffer of size "hmpre").
> >
> > Previously we were allocating as many buffers as possible of the smallest
> > possible size.
> > This also led to "hmpre" not being satisfied, as not enough buffer slots
> > were available.
> >
> > Signed-off-by: Thomas Weißschuh <[email protected]>
> > ---
> >
> > Also discussed at https://lore.kernel.org/linux-nvme/[email protected]/
> >
> > drivers/nvme/host/pci.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 3aacf1c0d5a5..0546523cc20b 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2090,7 +2090,7 @@ static int __nvme_alloc_host_mem(struct nvme_dev *dev, u64 preferred,
> >
> > static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
> > {
> > - u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
> > + u64 min_chunk = max_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
>
> preferred is based on the HMPRE field in the spec, which documents the
> preferred size. So the max here would not make any sense at all.

Is the current code supposed to reach HMPRE? It does not for me.

The code tries to allocate memory for HMPRE in chunks.
The best allocation would be to allocate one chunk for all of HMPRE.
If this fails we halve the chunk size on each iteration and try again.

On my hardware we start with a chunk_size of 4 MiB and just allocate
8 (hmmaxd) * 4 MiB = 32 MiB, which is worse than 1 * 200 MiB.
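
For reference, the loop in question looks roughly like this (paraphrased from
drivers/nvme/host/pci.c, so the exact details may differ):

	/* start big and work our way down */
	for (chunk_size = min_chunk; chunk_size >= hmminds; chunk_size /= 2) {
		if (!__nvme_alloc_host_mem(dev, preferred, chunk_size)) {
			if (!min || dev->host_mem_size >= min)
				return 0;
			nvme_free_host_mem(dev);
		}
	}

	return -ENOMEM;

Since only hmmaxd descriptors are accepted, the total handed to the device is
bounded by hmmaxd * chunk_size, so the first (largest) chunk_size decides how
close we get to hmpre.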

What am I missing?

2022-04-29 10:02:23

by Thomas Weißschuh

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On 2022-04-28 17:06+0200, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 04:44:47PM +0200, Thomas Weißschuh wrote:
> > Is the current code supposed to reach HMPRE? It does not for me.
> >
> > The code tries to allocate memory for HMPRE in chunks.
> > The best allocation would be to allocate one chunk for all of HMPRE.
> > If this fails we halve the chunk size on each iteration and try again.
> >
> > On my hardware we start with a chunk_size of 4MiB and just allocate
> > 8 (hmmaxd) * 4 = 32 MiB which is worse than 1 * 200MiB.
>
> And that is because the hardware only has a limited set of descriptors.

Wouldn't it make more sense then to allocate as much memory as possible for
each descriptor that is available?

The comment in nvme_alloc_host_mem() tries to "start big".
But it actually starts with at most 4 MiB.

And on devices that have hmminds > 4 MiB the loop condition will never
succeed at all and the HMB will not be used.
My fairly boring hardware is already at an hmminds of 3.3 MiB.
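
Spelled out with the numbers from my device (hmpre = 200 MiB, hmmaxd = 8,
hmminds = 3.3 MiB, PAGE_SIZE * MAX_ORDER_NR_PAGES = 4 MiB):

	min_chunk = min(200 MiB, 4 MiB) = 4 MiB
	/* 4 MiB >= 3.3 MiB, but 4 MiB / 2 = 2 MiB < 3.3 MiB, so the only
	 * chunk size ever tried is 4 MiB; with 8 descriptors that caps the
	 * HMB at 8 * 4 MiB = 32 MiB of the preferred 200 MiB.  A device
	 * with hmminds > 4 MiB never enters the loop at all. */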

> Is there any real problem you are fixing with this? Do you actually
> see a performance difference on a relevant workload?

I don't have a concrete problem or performance issue.
During some debugging I stumbled in my kernel logs upon
"nvme nvme0: allocated 32 MiB host memory buffer"
and investigated why it was so low.

2022-05-01 21:36:00

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On Thu, Apr 28, 2022 at 12:19:22PM +0200, Thomas Weißschuh wrote:
> We want to allocate the smallest possible amount of buffers with the
> largest possible size (1 buffer of size "hmpre").
>
> Previously we were allocating as many buffers as possible of the smallest
> possible size.
> This also led to "hmpre" not being satisfied, as not enough buffer slots
> were available.
>
> Signed-off-by: Thomas Weißschuh <[email protected]>
> ---
>
> Also discussed at https://lore.kernel.org/linux-nvme/[email protected]/
>
> drivers/nvme/host/pci.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3aacf1c0d5a5..0546523cc20b 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2090,7 +2090,7 @@ static int __nvme_alloc_host_mem(struct nvme_dev *dev, u64 preferred,
>
> static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
> {
> - u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
> + u64 min_chunk = max_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);

preferred is based on the HMPRE field in the spec, which documents the
preferred size. So the max here would not make any sense at all.

2022-05-03 00:26:49

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On Thu, Apr 28, 2022 at 04:44:47PM +0200, Thomas Weißschuh wrote:
> Is the current code supposed to reach HMPRE? It does not for me.
>
> The code tries to allocate memory for HMPRE in chunks.
> The best allocation would be to allocate one chunk for all of HMPRE.
> If this fails we halve the chunk size on each iteration and try again.
>
> On my hardware we start with a chunk_size of 4MiB and just allocate
> 8 (hmmaxd) * 4 = 32 MiB which is worse than 1 * 200MiB.

And that is because the hardware only has a limited set of descriptors.

Is there any real problem you are fixing with this? Do you actually
see a performance difference on a relevant workload?

2022-05-10 08:24:07

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On Thu, Apr 28, 2022 at 06:09:11PM +0200, Thomas Weißschuh wrote:
> > > On my hardware we start with a chunk_size of 4MiB and just allocate
> > > 8 (hmmaxd) * 4 = 32 MiB which is worse than 1 * 200MiB.
> >
> > And that is because the hardware only has a limited set of descriptors.
>
> Wouldn't it make more sense then to allocate as much memory as possible for
> each descriptor that is available?
>
> The comment in nvme_alloc_host_mem() tries to "start big".
> But it actually starts with at most 4MiB.

Compared to what other operating systems offer, that is quite large.

> And on devices that have hmminds > 4 MiB the loop condition will never
> succeed at all and the HMB will not be used.
> My fairly boring hardware is already at an hmminds of 3.3 MiB.
>
> > Is there any real problem you are fixing with this? Do you actually
> > see a performance difference on a relevant workload?
>
> I don't have a concrete problem or performance issue.
> During some debugging I stumbled in my kernel logs upon
> "nvme nvme0: allocated 32 MiB host memory buffer"
> and investigated why it was so low.

Until recently we could not even support these large sizes at all on
typical x86 configs. With my fairly recent change to allow vmap
remapped iommu allocations on x86 we can do that now. But if we
unconditionally enabled it I'd be a little worried about using too
much memory very easily.

We could look into removing the min with
PAGE_SIZE * MAX_ORDER_NR_PAGES to try to do larger segments for
"segment challenged" controllers now that it could work on a lot
of iommu enabled setups. But I'd rather have a very good reason for
that.

2022-05-10 20:33:33

by Thomas Weißschuh

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On 2022-05-10 09:03+0200, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 06:09:11PM +0200, Thomas Weißschuh wrote:
> > > > On my hardware we start with a chunk_size of 4MiB and just allocate
> > > > 8 (hmmaxd) * 4 = 32 MiB which is worse than 1 * 200MiB.
> > >
> > > And that is because the hardware only has a limited set of descriptors.
> >
> > Wouldn't it make more sense then to allocate as much memory as possible for
> > each descriptor that is available?
> >
> > The comment in nvme_alloc_host_mem() tries to "start big".
> > But it actually starts with at most 4MiB.
>
> Compared to what other operating systems offer, that is quite large.

Ok. I only looked at FreeBSD, which uses up to 5% of total memory per
device. [0]

> > And on devices that have hmminds > 4 MiB the loop condition will never
> > succeed at all and the HMB will not be used.
> > My fairly boring hardware is already at an hmminds of 3.3 MiB.
> >
> > > Is there any real problem you are fixing with this? Do you actually
> > > see a performance difference on a relevant workload?
> >
> > I don't have a concrete problem or performance issue.
> > During some debugging I stumbled in my kernel logs upon
> > "nvme nvme0: allocated 32 MiB host memory buffer"
> > and investigated why it was so low.
>
> Until recently we could not even support these large sizes at all on
> typical x86 configs. With my fairly recent change to allow vmap
> remapped iommu allocations on x86 we can do that now. But if we
> unconditionally enabled it I'd be a little worried about using too
> much memory very easily.

This should still be limited by max_host_mem_size_mb, which defaults to
128 MiB, shouldn't it?
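
For reference, the knob I mean is this module parameter in pci.c (quoted from
memory, so the exact wording may differ):

	static unsigned int max_host_mem_size_mb = 128;
	module_param(max_host_mem_size_mb, uint, 0444);
	MODULE_PARM_DESC(max_host_mem_size_mb,
		"Maximum Host Memory Buffer (HMB) size per controller (in MiB)");

So even with larger chunks the HMB stays capped at 128 MiB per controller
unless nvme.max_host_mem_size_mb is raised explicitly.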

> We could look into removing the min with
> PAGE_SIZE * MAX_ORDER_NR_PAGES to try to do larger segments for
> "segment challenged" controllers now that it could work on a lot
> of iommu enabled setups. But I'd rather have a very good reason for
> that.

On my current setup (WD SN770 in a ThinkPad X1 Carbon Gen 9) the NVMe
controller frequently stops responding. Switching from no scheduler to
mq-deadline reduced this but did not eliminate it.
Since switching to an HMB of 1 * 200 MiB and no scheduler this has not
happened anymore. (But I'll need some more time to gain real confidence in this.)

Initially I assumed that PAGE_SIZE * MAX_ORDER_NR_PAGES was indeed meant as
a minimum for DMA allocation.
As that is not the case, removing the min() completely, instead of the max()
I proposed, would obviously be the correct thing to do.
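
I.e. something like this (untested sketch):

	-	u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
	+	u64 min_chunk = preferred;

That would let the allocation really "start big" at hmpre and only fall back
to smaller chunks when the large allocations fail.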

[0] https://manpages.debian.org/testing/freebsd-manpages/nvme.4freebsd.en.html

2022-05-26 02:18:42

by Thomas Weißschuh

Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On 2022-05-10 12:20+0200, Thomas Weißschuh wrote:
> [..]
> > We could look into removing the min with
> > PAGE_SIZE * MAX_ORDER_NR_PAGES to try to do larger segments for
> > "segment challenged" controllers now that it could work on a lot
> > of iommu enabled setups. But I'd rather have a very good reason for
> > that.
>
> On my current setup (WD SN770 in a ThinkPad X1 Carbon Gen 9) the NVMe
> controller frequently stops responding. Switching from no scheduler to
> mq-deadline reduced this but did not eliminate it.
> Since switching to an HMB of 1 * 200 MiB and no scheduler this has not
> happened anymore. (But I'll need some more time to gain real confidence in this.)

So this patch dramatically improves the stability of my disk.

Without it and with queue/scheduler=none the controller stops responding
after a few minutes. mq-deadline reduced it to every few hours.

With the patch it happens roughly once a week.

I'll still RMA the disk and see if the replacement changes anything.

Maybe some of the Western Digital employees here could take a look, or check
if there is a new firmware available.
(The official updater requires Windows and there is no external documentation
about the firmware.)

Not sure if a change from very broken to only slightly broken is a good
enough reason, to be honest.

Thomas