2017-03-02 11:59:07

by Jürgen Groß

Subject: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:

[ 174.512861] switch: port 2(vif3.0) entered disabled state
[ 174.522735] BUG: sleeping function called from invalid context at
/home/build/linux-linus/mm/vmalloc.c:1441
[ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
[ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
4.10.0upstream-11073-g4977ab6-dirty #1
[ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS V17.0
03/14/2011
[ 174.525517] Call Trace:
[ 174.526217] show_stack+0x23/0x60
[ 174.526899] dump_stack+0x5b/0x88
[ 174.527562] ___might_sleep+0xde/0x130
[ 174.528208] __might_sleep+0x35/0xa0
[ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
[ 174.529463] ? __wake_up+0x40/0x50
[ 174.530089] remove_vm_area+0x20/0x90
[ 174.530724] __vunmap+0x1d/0xc0
[ 174.531346] ? delete_object_full+0x13/0x20
[ 174.531973] vfree+0x40/0x80
[ 174.532594] set_backend_state+0x18a/0xa90
[ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
[ 174.533850] ? kfree+0x5b/0xc0
[ 174.534476] ? xenbus_read+0x3d/0x50
[ 174.535101] ? xenbus_read+0x3d/0x50
[ 174.535718] ? xenbus_gather+0x31/0x90
[ 174.536332] ? ___might_sleep+0xf6/0x130
[ 174.536945] frontend_changed+0x6b/0xd0
[ 174.537565] xenbus_otherend_changed+0x7d/0x80
[ 174.538185] frontend_changed+0x12/0x20
[ 174.538803] xenwatch_thread+0x74/0x110
[ 174.539417] ? woken_wake_function+0x20/0x20
[ 174.540049] kthread+0xe5/0x120
[ 174.540663] ? xenbus_printf+0x50/0x50
[ 174.541278] ? __kthread_init_worker+0x40/0x40
[ 174.541898] ret_from_fork+0x21/0x2c
[ 174.548635] switch: port 2(vif3.0) entered disabled state

I believe calling vfree() when holding a spin_lock isn't a good idea.
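
As an illustration (hypothetical names only, not the actual xen-netback code), the splat above boils down to the generic pattern below: vfree() may sleep (the trace reaches remove_vm_area(), which calls might_sleep()), so calling it with a spinlock held trips the "sleeping function called from invalid context" check:

#include <linux/spinlock.h>
#include <linux/vmalloc.h>

struct demo_state {
	spinlock_t lock;
	void *vbuf;			/* allocated with vmalloc() */
};

static void demo_teardown_buggy(struct demo_state *s)
{
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);	/* atomic context from here on */
	vfree(s->vbuf);				/* BUG: vfree() may sleep */
	s->vbuf = NULL;
	spin_unlock_irqrestore(&s->lock, flags);
}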

Boris, this is the dumpdata failure:
FAILURE 4.10.0upstream-11073-g4977ab6-dirty(x86_64)
4.10.0upstream-11073-g4977ab6-dirty(i386): 2017-03-02 (tst007)


Juergen


2017-03-02 12:24:21

by Wei Liu

Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

On Thu, Mar 02, 2017 at 12:56:20PM +0100, Juergen Gross wrote:
> With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:
>
> [ 174.512861] switch: port 2(vif3.0) entered disabled state
> [ 174.522735] BUG: sleeping function called from invalid context at
> /home/build/linux-linus/mm/vmalloc.c:1441
> [ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
> [ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
> 4.10.0upstream-11073-g4977ab6-dirty #1
> [ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS V17.0
> 03/14/2011
> [ 174.525517] Call Trace:
> [ 174.526217] show_stack+0x23/0x60
> [ 174.526899] dump_stack+0x5b/0x88
> [ 174.527562] ___might_sleep+0xde/0x130
> [ 174.528208] __might_sleep+0x35/0xa0
> [ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
> [ 174.529463] ? __wake_up+0x40/0x50
> [ 174.530089] remove_vm_area+0x20/0x90
> [ 174.530724] __vunmap+0x1d/0xc0
> [ 174.531346] ? delete_object_full+0x13/0x20
> [ 174.531973] vfree+0x40/0x80
> [ 174.532594] set_backend_state+0x18a/0xa90
> [ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
> [ 174.533850] ? kfree+0x5b/0xc0
> [ 174.534476] ? xenbus_read+0x3d/0x50
> [ 174.535101] ? xenbus_read+0x3d/0x50
> [ 174.535718] ? xenbus_gather+0x31/0x90
> [ 174.536332] ? ___might_sleep+0xf6/0x130
> [ 174.536945] frontend_changed+0x6b/0xd0
> [ 174.537565] xenbus_otherend_changed+0x7d/0x80
> [ 174.538185] frontend_changed+0x12/0x20
> [ 174.538803] xenwatch_thread+0x74/0x110
> [ 174.539417] ? woken_wake_function+0x20/0x20
> [ 174.540049] kthread+0xe5/0x120
> [ 174.540663] ? xenbus_printf+0x50/0x50
> [ 174.541278] ? __kthread_init_worker+0x40/0x40
> [ 174.541898] ret_from_fork+0x21/0x2c
> [ 174.548635] switch: port 2(vif3.0) entered disabled state
>
> I believe calling vfree() when holding a spin_lock isn't a good idea.
>

Use vfree_atomic instead?
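
A minimal sketch of that suggestion (hypothetical names, not the real driver code): vfree_atomic(), added in 4.10, queues the area for deferred freeing instead of unmapping it inline, so it is safe to call with a spinlock held:

#include <linux/spinlock.h>
#include <linux/vmalloc.h>

struct demo_state {
	spinlock_t lock;
	void *vbuf;			/* allocated with vmalloc() */
};

static void demo_teardown_atomic(struct demo_state *s)
{
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	vfree_atomic(s->vbuf);		/* deferred free, does not sleep */
	s->vbuf = NULL;
	spin_unlock_irqrestore(&s->lock, flags);
}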

> Boris, this is the dumpdata failure:
> FAILURE 4.10.0upstream-11073-g4977ab6-dirty(x86_64)
> 4.10.0upstream-11073-g4977ab6-dirty(i386): 2017-03-02 (tst007)
>
>
> Juergen

2017-03-02 12:25:21

by Paul Durrant

Subject: RE: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

> -----Original Message-----
> From: Juergen Gross [mailto:[email protected]]
> Sent: 02 March 2017 12:13
> To: Wei Liu <[email protected]>
> Cc: Igor Druzhinin <[email protected]>; xen-devel <xen-
> [email protected]>; Linux Kernel Mailing List <linux-
> [email protected]>; [email protected]; Boris Ostrovsky
> <[email protected]>; David Miller <[email protected]>; Paul
> Durrant <[email protected]>
> Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus
> disconnect"
>
> On 02/03/17 13:06, Wei Liu wrote:
> > On Thu, Mar 02, 2017 at 12:56:20PM +0100, Juergen Gross wrote:
> >> With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:
> >>
> >> [ 174.512861] switch: port 2(vif3.0) entered disabled state
> >> [ 174.522735] BUG: sleeping function called from invalid context at
> >> /home/build/linux-linus/mm/vmalloc.c:1441
> >> [ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
> >> [ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
> >> 4.10.0upstream-11073-g4977ab6-dirty #1
> >> [ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS
> V17.0
> >> 03/14/2011
> >> [ 174.525517] Call Trace:
> >> [ 174.526217] show_stack+0x23/0x60
> >> [ 174.526899] dump_stack+0x5b/0x88
> >> [ 174.527562] ___might_sleep+0xde/0x130
> >> [ 174.528208] __might_sleep+0x35/0xa0
> >> [ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
> >> [ 174.529463] ? __wake_up+0x40/0x50
> >> [ 174.530089] remove_vm_area+0x20/0x90
> >> [ 174.530724] __vunmap+0x1d/0xc0
> >> [ 174.531346] ? delete_object_full+0x13/0x20
> >> [ 174.531973] vfree+0x40/0x80
> >> [ 174.532594] set_backend_state+0x18a/0xa90
> >> [ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
> >> [ 174.533850] ? kfree+0x5b/0xc0
> >> [ 174.534476] ? xenbus_read+0x3d/0x50
> >> [ 174.535101] ? xenbus_read+0x3d/0x50
> >> [ 174.535718] ? xenbus_gather+0x31/0x90
> >> [ 174.536332] ? ___might_sleep+0xf6/0x130
> >> [ 174.536945] frontend_changed+0x6b/0xd0
> >> [ 174.537565] xenbus_otherend_changed+0x7d/0x80
> >> [ 174.538185] frontend_changed+0x12/0x20
> >> [ 174.538803] xenwatch_thread+0x74/0x110
> >> [ 174.539417] ? woken_wake_function+0x20/0x20
> >> [ 174.540049] kthread+0xe5/0x120
> >> [ 174.540663] ? xenbus_printf+0x50/0x50
> >> [ 174.541278] ? __kthread_init_worker+0x40/0x40
> >> [ 174.541898] ret_from_fork+0x21/0x2c
> >> [ 174.548635] switch: port 2(vif3.0) entered disabled state
> >>
> >> I believe calling vfree() when holding a spin_lock isn't a good idea.
> >>
> >
> > Use vfree_atomic instead?
>
> Hmm, isn't this overkill here?
>
> You can just set a local variable with the address and do vfree() after
> releasing the lock.
>

Yep, that's what I was thinking. Patch coming shortly.

Paul

>
> Juergen

2017-03-02 13:05:41

by Jürgen Groß

Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

On 02/03/17 13:06, Wei Liu wrote:
> On Thu, Mar 02, 2017 at 12:56:20PM +0100, Juergen Gross wrote:
>> With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:
>>
>> [ 174.512861] switch: port 2(vif3.0) entered disabled state
>> [ 174.522735] BUG: sleeping function called from invalid context at
>> /home/build/linux-linus/mm/vmalloc.c:1441
>> [ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
>> [ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
>> 4.10.0upstream-11073-g4977ab6-dirty #1
>> [ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS V17.0
>> 03/14/2011
>> [ 174.525517] Call Trace:
>> [ 174.526217] show_stack+0x23/0x60
>> [ 174.526899] dump_stack+0x5b/0x88
>> [ 174.527562] ___might_sleep+0xde/0x130
>> [ 174.528208] __might_sleep+0x35/0xa0
>> [ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
>> [ 174.529463] ? __wake_up+0x40/0x50
>> [ 174.530089] remove_vm_area+0x20/0x90
>> [ 174.530724] __vunmap+0x1d/0xc0
>> [ 174.531346] ? delete_object_full+0x13/0x20
>> [ 174.531973] vfree+0x40/0x80
>> [ 174.532594] set_backend_state+0x18a/0xa90
>> [ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
>> [ 174.533850] ? kfree+0x5b/0xc0
>> [ 174.534476] ? xenbus_read+0x3d/0x50
>> [ 174.535101] ? xenbus_read+0x3d/0x50
>> [ 174.535718] ? xenbus_gather+0x31/0x90
>> [ 174.536332] ? ___might_sleep+0xf6/0x130
>> [ 174.536945] frontend_changed+0x6b/0xd0
>> [ 174.537565] xenbus_otherend_changed+0x7d/0x80
>> [ 174.538185] frontend_changed+0x12/0x20
>> [ 174.538803] xenwatch_thread+0x74/0x110
>> [ 174.539417] ? woken_wake_function+0x20/0x20
>> [ 174.540049] kthread+0xe5/0x120
>> [ 174.540663] ? xenbus_printf+0x50/0x50
>> [ 174.541278] ? __kthread_init_worker+0x40/0x40
>> [ 174.541898] ret_from_fork+0x21/0x2c
>> [ 174.548635] switch: port 2(vif3.0) entered disabled state
>>
>> I believe calling vfree() when holding a spin_lock isn't a good idea.
>>
>
> Use vfree_atomic instead?

Hmm, isn't this overkill here?

You can just set a local variable with the address and do vfree() after
releasing the lock.
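
A sketch of that approach (hypothetical names, not the real driver code): detach the pointer while the lock is held, then free it once the lock has been dropped:

#include <linux/spinlock.h>
#include <linux/vmalloc.h>

struct demo_state {
	spinlock_t lock;
	void *vbuf;			/* allocated with vmalloc() */
};

static void demo_teardown_deferred(struct demo_state *s)
{
	unsigned long flags;
	void *to_free;

	spin_lock_irqsave(&s->lock, flags);
	to_free = s->vbuf;
	s->vbuf = NULL;			/* detach under the lock */
	spin_unlock_irqrestore(&s->lock, flags);

	vfree(to_free);			/* safe now: no lock held */
}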


Juergen

2017-03-02 14:57:23

by Igor Druzhinin

Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

On 02/03/17 12:19, Paul Durrant wrote:
>> -----Original Message-----
>> From: Juergen Gross [mailto:[email protected]]
>> Sent: 02 March 2017 12:13
>> To: Wei Liu <[email protected]>
>> Cc: Igor Druzhinin <[email protected]>; xen-devel <xen-
>> [email protected]>; Linux Kernel Mailing List <linux-
>> [email protected]>; [email protected]; Boris Ostrovsky
>> <[email protected]>; David Miller <[email protected]>; Paul
>> Durrant <[email protected]>
>> Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus
>> disconnect"
>>
>> On 02/03/17 13:06, Wei Liu wrote:
>>> On Thu, Mar 02, 2017 at 12:56:20PM +0100, Juergen Gross wrote:
>>>> With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:
>>>>
>>>> [ 174.512861] switch: port 2(vif3.0) entered disabled state
>>>> [ 174.522735] BUG: sleeping function called from invalid context at
>>>> /home/build/linux-linus/mm/vmalloc.c:1441
>>>> [ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
>>>> [ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
>>>> 4.10.0upstream-11073-g4977ab6-dirty #1
>>>> [ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS
>> V17.0
>>>> 03/14/2011
>>>> [ 174.525517] Call Trace:
>>>> [ 174.526217] show_stack+0x23/0x60
>>>> [ 174.526899] dump_stack+0x5b/0x88
>>>> [ 174.527562] ___might_sleep+0xde/0x130
>>>> [ 174.528208] __might_sleep+0x35/0xa0
>>>> [ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
>>>> [ 174.529463] ? __wake_up+0x40/0x50
>>>> [ 174.530089] remove_vm_area+0x20/0x90
>>>> [ 174.530724] __vunmap+0x1d/0xc0
>>>> [ 174.531346] ? delete_object_full+0x13/0x20
>>>> [ 174.531973] vfree+0x40/0x80
>>>> [ 174.532594] set_backend_state+0x18a/0xa90
>>>> [ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
>>>> [ 174.533850] ? kfree+0x5b/0xc0
>>>> [ 174.534476] ? xenbus_read+0x3d/0x50
>>>> [ 174.535101] ? xenbus_read+0x3d/0x50
>>>> [ 174.535718] ? xenbus_gather+0x31/0x90
>>>> [ 174.536332] ? ___might_sleep+0xf6/0x130
>>>> [ 174.536945] frontend_changed+0x6b/0xd0
>>>> [ 174.537565] xenbus_otherend_changed+0x7d/0x80
>>>> [ 174.538185] frontend_changed+0x12/0x20
>>>> [ 174.538803] xenwatch_thread+0x74/0x110
>>>> [ 174.539417] ? woken_wake_function+0x20/0x20
>>>> [ 174.540049] kthread+0xe5/0x120
>>>> [ 174.540663] ? xenbus_printf+0x50/0x50
>>>> [ 174.541278] ? __kthread_init_worker+0x40/0x40
>>>> [ 174.541898] ret_from_fork+0x21/0x2c
>>>> [ 174.548635] switch: port 2(vif3.0) entered disabled state
>>>>
>>>> I believe calling vfree() when holding a spin_lock isn't a good idea.
>>>>
>>>
>>> Use vfree_atomic instead?
>>
>> Hmm, isn't this overkill here?
>>
>> You can just set a local variable with the address and do vfree() after
>> releasing the lock.
>>
>
> Yep, that's what I was thinking. Patch coming shortly.
>
> Paul

We have an internal patch, tested just recently, that avoids taking the
spinlocks altogether. As our testing revealed, calling vfree() inside the
spinlocked section is not the worst thing that can happen here. I switched
to RCU to protect against the memory being released while it is still in
use. I'll share the patch today.
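
A sketch of that direction (hypothetical names only, not the actual patch): readers dereference the buffer inside an RCU read-side section; the teardown path unpublishes the pointer, waits for in-flight readers to drain, and only then calls vfree() with no lock held:

#include <linux/rcupdate.h>
#include <linux/vmalloc.h>
#include <linux/printk.h>

struct demo_rcu_state {
	void __rcu *vbuf;		/* vmalloc()ed buffer */
};

static void demo_reader(struct demo_rcu_state *s)
{
	void *buf;

	rcu_read_lock();
	buf = rcu_dereference(s->vbuf);
	if (buf)
		pr_debug("buffer still live at %p\n", buf);
	rcu_read_unlock();
}

static void demo_teardown_rcu(struct demo_rcu_state *s)
{
	void *to_free = rcu_dereference_protected(s->vbuf, true);

	rcu_assign_pointer(s->vbuf, NULL);	/* unpublish */
	synchronize_rcu();			/* wait for in-flight readers */
	vfree(to_free);				/* no lock held here */
}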

Igor

>
>>
>> Juergen

2017-03-02 18:01:03

by Boris Ostrovsky

Subject: Re: BUG due to "xen-netback: protect resource cleaning on XenBus disconnect"

On 03/02/2017 06:56 AM, Juergen Gross wrote:
> With commits f16f1df65 and 9a6cdf52b we get in our Xen testing:
>
> [ 174.512861] switch: port 2(vif3.0) entered disabled state
> [ 174.522735] BUG: sleeping function called from invalid context at
> /home/build/linux-linus/mm/vmalloc.c:1441
> [ 174.523451] in_atomic(): 1, irqs_disabled(): 0, pid: 28, name: xenwatch
> [ 174.524131] CPU: 1 PID: 28 Comm: xenwatch Tainted: G W
> 4.10.0upstream-11073-g4977ab6-dirty #1
> [ 174.524819] Hardware name: MSI MS-7680/H61M-P23 (MS-7680), BIOS V17.0
> 03/14/2011
> [ 174.525517] Call Trace:
> [ 174.526217] show_stack+0x23/0x60
> [ 174.526899] dump_stack+0x5b/0x88
> [ 174.527562] ___might_sleep+0xde/0x130
> [ 174.528208] __might_sleep+0x35/0xa0
> [ 174.528840] ? _raw_spin_unlock_irqrestore+0x13/0x20
> [ 174.529463] ? __wake_up+0x40/0x50
> [ 174.530089] remove_vm_area+0x20/0x90
> [ 174.530724] __vunmap+0x1d/0xc0
> [ 174.531346] ? delete_object_full+0x13/0x20
> [ 174.531973] vfree+0x40/0x80
> [ 174.532594] set_backend_state+0x18a/0xa90
> [ 174.533221] ? dwc_scan_descriptors+0x24d/0x430
> [ 174.533850] ? kfree+0x5b/0xc0
> [ 174.534476] ? xenbus_read+0x3d/0x50
> [ 174.535101] ? xenbus_read+0x3d/0x50
> [ 174.535718] ? xenbus_gather+0x31/0x90
> [ 174.536332] ? ___might_sleep+0xf6/0x130
> [ 174.536945] frontend_changed+0x6b/0xd0
> [ 174.537565] xenbus_otherend_changed+0x7d/0x80
> [ 174.538185] frontend_changed+0x12/0x20
> [ 174.538803] xenwatch_thread+0x74/0x110
> [ 174.539417] ? woken_wake_function+0x20/0x20
> [ 174.540049] kthread+0xe5/0x120
> [ 174.540663] ? xenbus_printf+0x50/0x50
> [ 174.541278] ? __kthread_init_worker+0x40/0x40
> [ 174.541898] ret_from_fork+0x21/0x2c
> [ 174.548635] switch: port 2(vif3.0) entered disabled state
>
> I believe calling vfree() when holding a spin_lock isn't a good idea.
>
> Boris, this is the dumpdata failure:
> FAILURE 4.10.0upstream-11073-g4977ab6-dirty(x86_64)
> 4.10.0upstream-11073-g4977ab6-dirty(i386): 2017-03-02 (tst007)


That's not the cause of the test failure though --- it's "just" a warning.

The problem here was that 64- and 32-bit build trees got out of sync
(which is my fault, I switched the former to staging but forgot to do
the same for the latter). We have in the log:

libxl: error: libxl_create.c:564:libxl__domain_make: domain creation
fail: Operation not supported
libxl: error: libxl_create.c:931:initiate_domain_create: cannot make
domain: -3

I now have both trees use staging.

-boris