2015-02-02 14:02:35

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On 01/30/2015, 10:54 PM, Tim Chen wrote:
> On Sat, 2015-01-31 at 00:03 +0300, Sergei Shtylyov wrote:
>> On 01/30/2015 10:54 PM, Tim Chen wrote:
>
>>>
>>> return NULL;
>>> }
>>> + /* round up to full page size */
>>> + size = (((size-1) >> PAGE_SHIFT) + 1) * PAGE_SIZE;
>>
>> This is quite suboptimal formula, we can do without shifts and
>> multiplications (hopefully, still converted to shifts by gcc):
>>
>> size = (size + ~PAGE_MASK) & PAGE_MASK;
>
> Agree. I've updated patch below
>
> Thanks.
>
> Tim
>
> --->8---
>
> From: Tim Chen <[email protected]>
> Subject: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned
>
>
> Commit d92ef66c4f8f ("x86: make dma_alloc_coherent() return zeroed memory
> if CMA is enabled") changed the dma_alloc_coherent page clearance from
> using an __GFP_ZERO in page allocation to not setting the flag but doing
> an explicit memory clear at the end.
>
> However the memory clear only covered the memory size that
> was requested, but may not be up to the full extent of the
> last page, if the total pages returned exceed the
> memory size requested. This behavior has caused problem with XHCI
> and caused it to hang:
>
> kernel: xhci_hcd 0000:00:14.0: Stopped the command ring failed, maybe the host is dead
> kernel: xhci_hcd 0000:00:14.0: Abort command ring failed
> kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
> kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
> kernel: xhci_hcd 0000:00:14.0: Max number of devices this xHCI host supports is 64.
>
> Other drivers may have similar issue if it assumes that the pages
> allocated are completely zeroed.
>
> This patch ensures that the pages returned are fully cleared.
>
> Signed-off-by: Tim Chen <[email protected]>
> Cc: <[email protected]> # 3.16+
>
> ---
> arch/x86/kernel/pci-dma.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index a25e202..e9d8dee 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -125,6 +125,8 @@ again:
>
> return NULL;
> }
> + /* round up to full page size */
> + size = (size + ~PAGE_MASK) & PAGE_MASK;

Hi, is this an open-coded version of PAGE_ALIGN?

> memset(page_address(page), 0, size);
> *dma_addr = addr;
> return page_address(page);
>


--
js
suse labs


2015-02-02 16:39:26

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

Hello.

On 02/02/2015 05:02 PM, Jiri Slaby wrote:

>> From: Tim Chen <[email protected]>
>> Subject: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned


>> Commit d92ef66c4f8f ("x86: make dma_alloc_coherent() return zeroed memory
>> if CMA is enabled") changed the dma_alloc_coherent page clearance from
>> using an __GFP_ZERO in page allocation to not setting the flag but doing
>> an explicit memory clear at the end.

>> However the memory clear only covered the memory size that
>> was requested, but may not be up to the full extent of the
>> last page, if the total pages returned exceed the
>> memory size requested. This behavior has caused problem with XHCI
>> and caused it to hang:

>> kernel: xhci_hcd 0000:00:14.0: Stopped the command ring failed, maybe the host is dead
>> kernel: xhci_hcd 0000:00:14.0: Abort command ring failed
>> kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
>> kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
>> kernel: xhci_hcd 0000:00:14.0: Max number of devices this xHCI host supports is 64.

>> Other drivers may have similar issue if it assumes that the pages
>> allocated are completely zeroed.

>> This patch ensures that the pages returned are fully cleared.

>> Signed-off-by: Tim Chen <[email protected]>
>> Cc: <[email protected]> # 3.16+

>> ---
>> arch/x86/kernel/pci-dma.c | 2 ++
>> 1 file changed, 2 insertions(+)

>> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
>> index a25e202..e9d8dee 100644
>> --- a/arch/x86/kernel/pci-dma.c
>> +++ b/arch/x86/kernel/pci-dma.c
>> @@ -125,6 +125,8 @@ again:
>>
>> return NULL;
>> }
>> + /* round up to full page size */
>> + size = (size + ~PAGE_MASK) & PAGE_MASK;

> Hi, is this an open-coded version of PAGE_ALIGN?

Yes, it appears so. :-)

WBR, Sergei

2015-02-04 18:30:38

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Mon, 2015-02-02 at 19:39 +0300, Sergei Shtylyov wrote:

>
> > Hi, is this an open-coded version of PAGE_ALIGN?
>
> Yes, it appears so. :-)
>
> WBR, Sergei
>

Thanks for the suggestion by Jiri. I updated the patch to use PAGE_ALIGN
below.

Regards,
Tim

--->8---
From: Tim Chen <[email protected]>
Subject: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

Commit d92ef66c4f8f ("x86: make dma_alloc_coherent() return zeroed memory
if CMA is enabled") changed the dma_alloc_coherent page clearance from
using an __GFP_ZERO in page allocation to not setting the flag but doing
an explicit memory clear at the end.

However the memory clear only covered the memory size that
was requested, but may not be up to the full extent of the
last page, if the total pages returned exceed the
memory size requested. This behavior has caused problem with XHCI
and caused it to hang:

kernel: xhci_hcd 0000:00:14.0: Stopped the command ring failed, maybe the host is dead
kernel: xhci_hcd 0000:00:14.0: Abort command ring failed
kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
kernel: xhci_hcd 0000:00:14.0: Max number of devices this xHCI host supports is 64.

Other drivers may have similar issue if it assumes that the pages
allocated are completely zeroed.

This patch ensures that the pages returned are fully cleared.

Signed-off-by: Tim Chen <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/pci-dma.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index a25e202..3bdee55 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -125,6 +125,8 @@ again:

return NULL;
}
+ /* round up to full page size */
+ size = PAGE_ALIGN(size);
memset(page_address(page), 0, size);
*dma_addr = addr;
return page_address(page);
--
1.9.3


2015-02-18 19:40:15

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 2015-02-04 at 10:30 -0800, Tim Chen wrote:
> On Mon, 2015-02-02 at 19:39 +0300, Sergei Shtylyov wrote:
>
> >
> > > Hi, is this an open-coded version of PAGE_ALIGN?
> >
> > Yes, it appears so. :-)
> >
> > WBR, Sergei
> >
>
> Thanks for the suggestion by Jiri. I updated the patch to use PAGE_ALIGN
> below.
>
> Regards,
> Tim
>

Is there any resolution on this patch? I haven't seen fixes from the
XHCI folks yet. This is breaking many of our systems.

Thanks.

Tim

> --->8---
> From: Tim Chen <[email protected]>
> Subject: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned
>
> Commit d92ef66c4f8f ("x86: make dma_alloc_coherent() return zeroed memory
> if CMA is enabled") changed the dma_alloc_coherent page clearance from
> using an __GFP_ZERO in page allocation to not setting the flag but doing
> an explicit memory clear at the end.
>
> However the memory clear only covered the memory size that
> was requested, but may not be up to the full extent of the
> last page, if the total pages returned exceed the
> memory size requested. This behavior has caused problem with XHCI
> and caused it to hang:
>
> kernel: xhci_hcd 0000:00:14.0: Stopped the command ring failed, maybe the host is dead
> kernel: xhci_hcd 0000:00:14.0: Abort command ring failed
> kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
> kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
> kernel: xhci_hcd 0000:00:14.0: Max number of devices this xHCI host supports is 64.
>
> Other drivers may have similar issue if it assumes that the pages
> allocated are completely zeroed.
>
> This patch ensures that the pages returned are fully cleared.
>
> Signed-off-by: Tim Chen <[email protected]>
> Cc: [email protected]
> ---
> arch/x86/kernel/pci-dma.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index a25e202..3bdee55 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -125,6 +125,8 @@ again:
>
> return NULL;
> }
> + /* round up to full page size */
> + size = PAGE_ALIGN(size);
> memset(page_address(page), 0, size);
> *dma_addr = addr;
> return page_address(page);

2015-02-18 19:53:07

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 18 Feb 2015, Tim Chen wrote:

> On Wed, 2015-02-04 at 10:30 -0800, Tim Chen wrote:
> > On Mon, 2015-02-02 at 19:39 +0300, Sergei Shtylyov wrote:
> >
> > >
> > > > Hi, is this an open-coded version of PAGE_ALIGN?
> > >
> > > Yes, it appears so. :-)
> > >
> > > WBR, Sergei
> > >
> >
> > Thanks for the suggestion by Jiri. I updated the patch to use PAGE_ALIGN
> > below.
> >
> > Regards,
> > Tim
> >
>
> Is there any resolution on this patch? I haven't seen fixes from the
> XHCI folks yet. This is breaking many of our systems.

Have you tried doing the experiments I suggested in

http://marc.info/?l=linux-usb&m=142272448620716&w=2

to determine where the problem occurs?

Alan Stern

2015-02-18 20:19:08

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 2015-02-18 at 14:53 -0500, Alan Stern wrote:
> On Wed, 18 Feb 2015, Tim Chen wrote:
>
> > On Wed, 2015-02-04 at 10:30 -0800, Tim Chen wrote:
> > > On Mon, 2015-02-02 at 19:39 +0300, Sergei Shtylyov wrote:
> > >
> > > >
> > > > > Hi, is this an open-coded version of PAGE_ALIGN?
> > > >
> > > > Yes, it appears so. :-)
> > > >
> > > > WBR, Sergei
> > > >
> > >
> > > Thanks for the suggestion by Jiri. I updated the patch to use PAGE_ALIGN
> > > below.
> > >
> > > Regards,
> > > Tim
> > >
> >
> > Is there any resolution on this patch? I haven't seen fixes from the
> > XHCI folks yet. This is breaking many of our systems.
>
> Have you tried doing the experiments I suggested in
>
> http://marc.info/?l=linux-usb&m=142272448620716&w=2
>
> to determine where the problem occurs?
>

I was bogged down with other things lately and I haven't got a chance to
test that. But as you said, there's very few places where xhci
call this memory allocation. So I think the problem has been fairly
narrowed down for the XHCI folks.

Tim

> Alan Stern
>

2015-02-18 20:36:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, Feb 18, 2015 at 12:19:03PM -0800, Tim Chen wrote:
> On Wed, 2015-02-18 at 14:53 -0500, Alan Stern wrote:
> > On Wed, 18 Feb 2015, Tim Chen wrote:
> >
> > > On Wed, 2015-02-04 at 10:30 -0800, Tim Chen wrote:
> > > > On Mon, 2015-02-02 at 19:39 +0300, Sergei Shtylyov wrote:
> > > >
> > > > >
> > > > > > Hi, is this an open-coded version of PAGE_ALIGN?
> > > > >
> > > > > Yes, it appears so. :-)
> > > > >
> > > > > WBR, Sergei
> > > > >
> > > >
> > > > Thanks for the suggestion by Jiri. I updated the patch to use PAGE_ALIGN
> > > > below.
> > > >
> > > > Regards,
> > > > Tim
> > > >
> > >
> > > Is there any resolution on this patch? I haven't seen fixes from the
> > > XHCI folks yet. This is breaking many of our systems.
> >
> > Have you tried doing the experiments I suggested in
> >
> > http://marc.info/?l=linux-usb&m=142272448620716&w=2
> >
> > to determine where the problem occurs?
> >
>
> I was bogged down with other things lately and I haven't got a chance to
> test that. But as you said, there's very few places where xhci
> call this memory allocation. So I think the problem has been fairly
> narrowed down for the XHCI folks.

The "XHCI folks" are in a cube near you, go poke them in person if they
aren't answering their emails :)

good luck,

greg k-h

2015-02-18 20:39:15

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

> > Have you tried doing the experiments I suggested in
> >
> > http://marc.info/?l=linux-usb&m=142272448620716&w=2
> >
> > to determine where the problem occurs?
> >
>
> I was bogged down with other things lately and I haven't got a chance to
> test that. But as you said, there's very few places where xhci
> call this memory allocation. So I think the problem has been fairly
> narrowed down for the XHCI folks.

Also I don't really understand why we're even discussing this. The patch
only makes an widely used API behave as it was before. Who knows who
else was broken with this change. There's no sane way to audit all
users. There is no real advantage of the new behavior.

The only good way is to revert to old behavior, like in Tim's
original patch. And doing it quickly for mainline and stable.

FWIW we have a large number of systems here that are broken
without this change.

-Andi

--
[email protected] -- Speaking for myself only.

2015-02-18 20:45:34

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 18 Feb 2015, Tim Chen wrote:

> > Have you tried doing the experiments I suggested in
> >
> > http://marc.info/?l=linux-usb&m=142272448620716&w=2
> >
> > to determine where the problem occurs?
> >
>
> I was bogged down with other things lately and I haven't got a chance to
> test that. But as you said, there's very few places where xhci
> call this memory allocation. So I think the problem has been fairly
> narrowed down for the XHCI folks.

I disagree. _You_ reported the error. How can you expect other people
to figure out where it is with no help from you?

I looked briefly at the xhci-hcd DMA allocation code. It does not
contain any obvious mistakes.

Alan Stern

2015-02-18 21:15:39

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 18 Feb 2015, Andi Kleen wrote:

> > > Have you tried doing the experiments I suggested in
> > >
> > > http://marc.info/?l=linux-usb&m=142272448620716&w=2
> > >
> > > to determine where the problem occurs?
> > >
> >
> > I was bogged down with other things lately and I haven't got a chance to
> > test that. But as you said, there's very few places where xhci
> > call this memory allocation. So I think the problem has been fairly
> > narrowed down for the XHCI folks.
>
> Also I don't really understand why we're even discussing this. The patch
> only makes an widely used API behave as it was before. Who knows who
> else was broken with this change. There's no sane way to audit all
> users. There is no real advantage of the new behavior.

We are discussing it because fixing problems is better than papering
around them.

> The only good way is to revert to old behavior, like in Tim's
> original patch. And doing it quickly for mainline and stable.

I will agree that applying the patch is a reasonable thing to do.
However, I also believe that it is important to fix the bugs revealed
by the API change.

> FWIW we have a large number of systems here that are broken
> without this change.

For all you know, they will still be broken even after the change is
applied. The breakage may become less obvious, but that doesn't mean
it will disappear entirely.

Alan Stern

2015-02-18 23:59:44

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] pci-dma: Fix x86 dma_alloc_coherent to fully clear all pages returned

On Wed, 2015-02-18 at 15:45 -0500, Alan Stern wrote:
> On Wed, 18 Feb 2015, Tim Chen wrote:
>
> > > Have you tried doing the experiments I suggested in
> > >
> > > http://marc.info/?l=linux-usb&m=142272448620716&w=2
> > >
> > > to determine where the problem occurs?
> > >
> >
> > I was bogged down with other things lately and I haven't got a chance to
> > test that. But as you said, there's very few places where xhci
> > call this memory allocation. So I think the problem has been fairly
> > narrowed down for the XHCI folks.
>
> I disagree. _You_ reported the error. How can you expect other people
> to figure out where it is with no help from you?
>
> I looked briefly at the xhci-hcd DMA allocation code. It does not
> contain any obvious mistakes.
>
> Alan Stern
>

The error and my quick fix is right here. And xhci still needs to be
fixed properly. I'll send out the patch below in a proper patch.

Tim

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 5cb3d7a..39e7196 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -1658,7 +1658,7 @@ static int scratchpad_alloc(struct xhci_hcd *xhci,
gfp_t flags)
goto fail_sp;

xhci->scratchpad->sp_array = dma_alloc_coherent(dev,
- num_sp * sizeof(u64),
+ PAGE_ALIGN(num_sp * sizeof(u64)),
&xhci->scratchpad->sp_dma, flags);
if (!xhci->scratchpad->sp_array)