2012-06-05 17:03:39

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Regression introduced by 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6 ("X86: integrate CMA with DMA-mapping subsystem" Re: Bug in BUG: Bad page state in process work_for_cpu pfn:cf800

On Sat, Jun 2, 2012 at 7:36 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
> On Thu, May 31, 2012 at 3:19 AM, Marek Szyprowski
> <[email protected]> wrote:
>> Hi Konrad,
>>
>> On Thursday, May 31, 2012 2:45 AM Konrad Rzeszutek Wilk wrote:
>>
>>> About two-three days ago I started getting this on one of the AMD
>>> machines I run nighly bootup test (full bootup log attached):
>>> [Note: This is baremetal]
>>>
>>> ehci_hcd 0000:00:02.1: reset hcc_params a086 caching frame 256/512/1024 park
>>> BUG: Bad page state in process work_for_cpu ?pfn:cf800
>>> page:ffffea0002d64000 count:-1 mapcount:0 ing: ? ? ? ? ?(null) index:0x0
>>> page flags: 0x100000000000000()
>>> Modules linked in:
>>> Pid: 1207, comm: work_for_cpu Not tainted 3.4.0upstream-09208-gaf56e0a #1
>>> Call Trace:
>>> ?[<ffffffff81103eb7>] ? dump_page+0x97/0xf0
>>> ?[<ffffffff811050bd>] bad_page+0xad/0x100
>>> ?[<ffffffff811067a2>] get_page_from_freelist+0x712/0x850
>>> ?[<ffffffff812916d8>] ? __const_udelay+0x28/0x30
>>> ?[<ffffffff81107a82>] __alloc_pages_nodemask+0x162/0x900
>>> ?[<ffffffff810a2975>] ? dequeue_task_fair+0xa5/0x330
>>> ?[<ffffffff810367e2>] ? __switch_to+0x152/0x440
>>> ?[<ffffffff8107ee37>] ? lock_timer_base+0x37/0x70
>>> ?[<ffffffff8103c7ff>] dma_generic_alloc_coherent+0x10f/0x170
>>> ?[<ffffffff81062e7e>] gart_alloc_coherent+0xee/0x120
>>> ?[<ffffffff81137542>] dma_pool_alloc+0x102/0x2e0
>>> ?[<ffffffff8109f240>] ? try_to_wake_up+0x310/0x310
>>> ?[<ffffffff813f3dc7>] ehci_qh_alloc+0x47/0xf0
>>> ?[<ffffffff813f81e7>] ehci_pci_setup+0x367/0xea0
>>> ?[<ffffffff81389213>] ? device_pm_init+0x43/0x80
>>> ?[<ffffffff813d3065>] ? usb_alloc_dev+0x2d5/0x330
>>> ?[<ffffffff81002030>] ? do_one_initcall+0x30/0x170
>>> ?[<ffffffff813db6a9>] usb_add_hcd+0x1e9/0x7a0
>>> ?[<ffffffff813ea0fa>] usb_hcd_pci_probe+0x1ba/0x3a0
>>> ?[<ffffffff81088890>] ? cwq_dec_nr_in_flight+0x90/0x90
>>> ?[<ffffffff812ad3f2>] local_pci_probe+0x12/0x20
>>> ?[<ffffffff810888a3>] do_work_for_cpu+0x13/0x30
>>> ?[<ffffffff810906e6>] kthread+0x96/0xa0
>>> ?[<ffffffff815b61e4>] kernel_thread_helper+0x4/0x10
>>> ?[<ffffffff81090650>] ? kthread_freezable_should_stop+0x70/0x70
>>> ?[<ffffffff815b61e0>] ? gs_change+0x13/0x13
>>> Disabling lock debugging due to kernel taint
>>> BUG: Bad page state in process work_for_cpu ?pfn:cf801
>>>
>>> I haven't actually run a git bisection, but the last git commit
>>> that does something in the gart code looks to be this one:
..snip..
> Doing a git bisection points it to this one:
> is the first bad commit

I pulled todays linus's tree and if I revert the commit the bug disappears.
This is on an Dell T105 AMD box running native x86_64.
Any thoughts on what the bug might be?

> commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> Author: Marek Szyprowski <[email protected]>
> Date: ? Thu Dec 29 13:09:51 2011 +0100
>
> ? ?X86: integrate CMA with DMA-mapping subsystem
>
> ? ?This patch adds support for CMA to dma-mapping subsystem for x86
> ? ?architecture that uses common pci-dma/pci-nommu implementation. This
> ? ?allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
>
> ? ?Signed-off-by: Marek Szyprowski <[email protected]>
> ? ?Signed-off-by: Kyungmin Park <[email protected]>
> ? ?CC: Michal Nazarewicz <[email protected]>
> ? ?Acked-by: Arnd Bergmann <[email protected]>
>
> :040000 040000 be152c4e3a5641fbd6dfc2f8faf3e634f47bd94e
> 4e5424f0b11ff1fead974e6c4ea7341b046cc960 M ? ? ?arch
>
>
> [konrad@build linux]$ git bisect log
> git bisect start
> # good: [b5f4035adfffbcc6b478de5b8c44b618b3124aff] Merge tag
> 'stable/for-linus-3.5-rc0-tag' of
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
> git bisect good b5f4035adfffbcc6b478de5b8c44b618b3124aff
> # bad: [da89fb165e5e51a2ec1ff8a0ff6bc052d1068184] Merge tag
> 'tag-for-linus-3.5' of
> git://git.linaro.org/people/sumitsemwal/linux-dma-buf
> git bisect bad da89fb165e5e51a2ec1ff8a0ff6bc052d1068184
> # good: [ece78b7df734726e790dcab207f463401ff80440] Merge branch
> 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> git bisect good ece78b7df734726e790dcab207f463401ff80440
> # good: [58823de9d2f1265030d0d06cb03cc2a551994398] Merge tag
> 'hda-switcheroo' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> git bisect good 58823de9d2f1265030d0d06cb03cc2a551994398
> # bad: [0f51596bd39a5c928307ffcffc9ba07f90f42a8b] Merge branch
> 'for-next-arm-dma' into for-linus
> git bisect bad 0f51596bd39a5c928307ffcffc9ba07f90f42a8b
> # bad: [0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6] X86: integrate CMA
> with DMA-mapping subsystem
> git bisect bad 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
> # good: [6d4a49160de2c684fb59fa627bce80e200224331] mm: page_alloc:
> change fallbacks array handling
> git bisect good 6d4a49160de2c684fb59fa627bce80e200224331
> # good: [cfd3da1e49bb95c355c01c0f502d657deb3d34a4] mm: Serialize
> access to min_free_kbytes
> git bisect good cfd3da1e49bb95c355c01c0f502d657deb3d34a4
> # good: [49f223a9cd96c7293d7258ff88c2bdf83065f69c] mm: trigger page
> reclaim in alloc_contig_range() to stabilise watermarks
> git bisect good 49f223a9cd96c7293d7258ff88c2bdf83065f69c
> # good: [c64be2bb1c6eb43c838b2c6d57b074078be208dd] drivers: add
> Contiguous Memory Allocator
> git bisect good c64be2bb1c6eb43c838b2c6d57b074078be208dd
>
>>
>>> Was wondering if other people had seen something similar to this?
>>
>> Best regards
>> --
>> Marek Szyprowski
>> Samsung Poland R&D Center
>>
>>
>>


2012-06-06 12:20:34

by Marek Szyprowski

[permalink] [raw]
Subject: RE: Regression introduced by 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6 ("X86: integrate CMA with DMA-mapping subsystem" Re: Bug in BUG: Bad page state in process work_for_cpu pfn:cf800

Hi Konrad,

On Tuesday, June 05, 2012 7:04 PM Konrad Rzeszutek Wilk wrote:

> On Sat, Jun 2, 2012 at 7:36 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
> > On Thu, May 31, 2012 at 3:19 AM, Marek Szyprowski
> > <[email protected]> wrote:
> >> Hi Konrad,
> >>
> >> On Thursday, May 31, 2012 2:45 AM Konrad Rzeszutek Wilk wrote:
> >>
> >>> About two-three days ago I started getting this on one of the AMD
> >>> machines I run nighly bootup test (full bootup log attached):
> >>> [Note: This is baremetal]
> >>>
> >>> ehci_hcd 0000:00:02.1: reset hcc_params a086 caching frame 256/512/1024 park
> >>> BUG: Bad page state in process work_for_cpu ?pfn:cf800
> >>> page:ffffea0002d64000 count:-1 mapcount:0 ing: ? ? ? ? ?(null) index:0x0
> >>> page flags: 0x100000000000000()
> >>> Modules linked in:
> >>> Pid: 1207, comm: work_for_cpu Not tainted 3.4.0upstream-09208-gaf56e0a #1
> >>> Call Trace:
> >>> ?[<ffffffff81103eb7>] ? dump_page+0x97/0xf0
> >>> ?[<ffffffff811050bd>] bad_page+0xad/0x100
> >>> ?[<ffffffff811067a2>] get_page_from_freelist+0x712/0x850
> >>> ?[<ffffffff812916d8>] ? __const_udelay+0x28/0x30
> >>> ?[<ffffffff81107a82>] __alloc_pages_nodemask+0x162/0x900
> >>> ?[<ffffffff810a2975>] ? dequeue_task_fair+0xa5/0x330
> >>> ?[<ffffffff810367e2>] ? __switch_to+0x152/0x440
> >>> ?[<ffffffff8107ee37>] ? lock_timer_base+0x37/0x70
> >>> ?[<ffffffff8103c7ff>] dma_generic_alloc_coherent+0x10f/0x170
> >>> ?[<ffffffff81062e7e>] gart_alloc_coherent+0xee/0x120
> >>> ?[<ffffffff81137542>] dma_pool_alloc+0x102/0x2e0
> >>> ?[<ffffffff8109f240>] ? try_to_wake_up+0x310/0x310
> >>> ?[<ffffffff813f3dc7>] ehci_qh_alloc+0x47/0xf0
> >>> ?[<ffffffff813f81e7>] ehci_pci_setup+0x367/0xea0
> >>> ?[<ffffffff81389213>] ? device_pm_init+0x43/0x80
> >>> ?[<ffffffff813d3065>] ? usb_alloc_dev+0x2d5/0x330
> >>> ?[<ffffffff81002030>] ? do_one_initcall+0x30/0x170
> >>> ?[<ffffffff813db6a9>] usb_add_hcd+0x1e9/0x7a0
> >>> ?[<ffffffff813ea0fa>] usb_hcd_pci_probe+0x1ba/0x3a0
> >>> ?[<ffffffff81088890>] ? cwq_dec_nr_in_flight+0x90/0x90
> >>> ?[<ffffffff812ad3f2>] local_pci_probe+0x12/0x20
> >>> ?[<ffffffff810888a3>] do_work_for_cpu+0x13/0x30
> >>> ?[<ffffffff810906e6>] kthread+0x96/0xa0
> >>> ?[<ffffffff815b61e4>] kernel_thread_helper+0x4/0x10
> >>> ?[<ffffffff81090650>] ? kthread_freezable_should_stop+0x70/0x70
> >>> ?[<ffffffff815b61e0>] ? gs_change+0x13/0x13
> >>> Disabling lock debugging due to kernel taint
> >>> BUG: Bad page state in process work_for_cpu ?pfn:cf801
> >>>
> >>> I haven't actually run a git bisection, but the last git commit
> >>> that does something in the gart code looks to be this one:
> ..snip..
> > Doing a git bisection points it to this one:
> > is the first bad commit
>
> I pulled todays linus's tree and if I revert the commit the bug disappears.
> This is on an Dell T105 AMD box running native x86_64.
> Any thoughts on what the bug might be?

I've read that patch three times line by line and I really have no idea what might cause such
weird effect. When CMA is disabled this patch should not change anything in the code flow and
the called functions... I assume that the CMA has not been enabled in your test config?

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


2012-06-06 17:20:44

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: Regression introduced by 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6 ("X86: integrate CMA with DMA-mapping subsystem" Re: Bug in BUG: Bad page state in process work_for_cpu pfn:cf800

>> >>> ehci_hcd 0000:00:02.1: reset hcc_params a086 caching frame 256/512/1024 park
>> >>> BUG: Bad page state in process work_for_cpu ?pfn:cf800
>> >>> page:ffffea0002d64000 count:-1 mapcount:0 ing: ? ? ? ? ?(null) index:0x0
>> >>> page flags: 0x100000000000000()
>> >>> Modules linked in:
>> >>> Pid: 1207, comm: work_for_cpu Not tainted 3.4.0upstream-09208-gaf56e0a #1
>> >>> Call Trace:
>> >>> ?[<ffffffff81103eb7>] ? dump_page+0x97/0xf0
>> >>> ?[<ffffffff811050bd>] bad_page+0xad/0x100
>> >>> ?[<ffffffff811067a2>] get_page_from_freelist+0x712/0x850
>> >>> ?[<ffffffff812916d8>] ? __const_udelay+0x28/0x30
>> >>> ?[<ffffffff81107a82>] __alloc_pages_nodemask+0x162/0x900
>> >>> ?[<ffffffff810a2975>] ? dequeue_task_fair+0xa5/0x330
>> >>> ?[<ffffffff810367e2>] ? __switch_to+0x152/0x440
>> >>> ?[<ffffffff8107ee37>] ? lock_timer_base+0x37/0x70
>> >>> ?[<ffffffff8103c7ff>] dma_generic_alloc_coherent+0x10f/0x170
>> >>> ?[<ffffffff81062e7e>] gart_alloc_coherent+0xee/0x120
>> >>> ?[<ffffffff81137542>] dma_pool_alloc+0x102/0x2e0
>> >>> ?[<ffffffff8109f240>] ? try_to_wake_up+0x310/0x310
>> >>> ?[<ffffffff813f3dc7>] ehci_qh_alloc+0x47/0xf0
>> >>> ?[<ffffffff813f81e7>] ehci_pci_setup+0x367/0xea0
>> >>> ?[<ffffffff81389213>] ? device_pm_init+0x43/0x80
>> >>> ?[<ffffffff813d3065>] ? usb_alloc_dev+0x2d5/0x330
>> >>> ?[<ffffffff81002030>] ? do_one_initcall+0x30/0x170
>> >>> ?[<ffffffff813db6a9>] usb_add_hcd+0x1e9/0x7a0
>> >>> ?[<ffffffff813ea0fa>] usb_hcd_pci_probe+0x1ba/0x3a0
>> >>> ?[<ffffffff81088890>] ? cwq_dec_nr_in_flight+0x90/0x90
>> >>> ?[<ffffffff812ad3f2>] local_pci_probe+0x12/0x20
>> >>> ?[<ffffffff810888a3>] do_work_for_cpu+0x13/0x30
>> >>> ?[<ffffffff810906e6>] kthread+0x96/0xa0
>> >>> ?[<ffffffff815b61e4>] kernel_thread_helper+0x4/0x10
>> >>> ?[<ffffffff81090650>] ? kthread_freezable_should_stop+0x70/0x70
>> >>> ?[<ffffffff815b61e0>] ? gs_change+0x13/0x13
>> >>> Disabling lock debugging due to kernel taint
>> >>> BUG: Bad p
.. snip..
>>
>> I pulled todays linus's tree and if I revert the commit the bug disappears.
>> This is on an Dell T105 AMD box running native x86_64.
>> Any thoughts on what the bug might be?
>
> I've read that patch three times line by line and I really have no idea what might cause such
> weird effect. When CMA is disabled this patch should not change anything in the code flow and
> the called functions... I assume that the CMA has not been enabled in your test config?

Correct. There is no mention of CMA. Do you have any debug patches you
want me to try?

I am also attaching the .config file.


Attachments:
dot_config (77.93 kB)