2022-05-02 23:29:54

by Daniel Harding

Subject: Re: [REGRESSION] lxc-stop hang on 5.17.x kernels

On 5/2/22 16:26, Jens Axboe wrote:
> On 5/2/22 7:17 AM, Daniel Harding wrote:
>> I use lxc-4.0.12 on Gentoo, built with io-uring support
>> (--enable-liburing), targeting liburing-2.1. My kernel config is a
>> very lightly modified version of Fedora's generic kernel config. After
>> moving from the 5.16.x series to the 5.17.x kernel series, I started
>> noticing frequent hangs in lxc-stop. It doesn't happen 100% of the
>> time, but definitely more than 50% of the time. Bisecting narrowed
>> down the issue to commit aa43477b040251f451db0d844073ac00a8ab66ee:
>> io_uring: poll rework. Testing indicates the problem is still present
>> in 5.18-rc5. Unfortunately I do not have the expertise with the
>> codebases of either lxc or io-uring to try to debug the problem
>> further on my own, but I can easily apply patches to any of the
>> involved components (lxc, liburing, kernel) and rebuild for testing or
>> validation. I am also happy to provide any further information that
>> would be helpful with reproducing or debugging the problem.
> Do you have a recipe to reproduce the hang? That would make it
> significantly easier to figure out.

I can reproduce it with just the following:

    sudo lxc-create --n lxc-test --template download --bdev dir --dir /var/lib/lxc/lxc-test/rootfs -- -d ubuntu -r bionic -a amd64
    sudo lxc-start -n lxc-test
    sudo lxc-stop -n lxc-test

The lxc-stop command never exits and the container continues running. 
If that isn't sufficient to reproduce, please let me know.

--
Regards,

Daniel Harding


2022-05-03 00:22:30

by Jens Axboe

Subject: Re: [REGRESSION] lxc-stop hang on 5.17.x kernels

On 5/2/22 7:36 AM, Daniel Harding wrote:
> On 5/2/22 16:26, Jens Axboe wrote:
>> On 5/2/22 7:17 AM, Daniel Harding wrote:
>>> I use lxc-4.0.12 on Gentoo, built with io-uring support
>>> (--enable-liburing), targeting liburing-2.1. My kernel config is a
>>> very lightly modified version of Fedora's generic kernel config. After
>>> moving from the 5.16.x series to the 5.17.x kernel series, I started
>>> noticing frequent hangs in lxc-stop. It doesn't happen 100% of the
>>> time, but definitely more than 50% of the time. Bisecting narrowed
>>> down the issue to commit aa43477b040251f451db0d844073ac00a8ab66ee:
>>> io_uring: poll rework. Testing indicates the problem is still present
>>> in 5.18-rc5. Unfortunately I do not have the expertise with the
>>> codebases of either lxc or io-uring to try to debug the problem
>>> further on my own, but I can easily apply patches to any of the
>>> involved components (lxc, liburing, kernel) and rebuild for testing or
>>> validation. I am also happy to provide any further information that
>>> would be helpful with reproducing or debugging the problem.
>> Do you have a recipe to reproduce the hang? That would make it
>> significantly easier to figure out.
>
> I can reproduce it with just the following:
>
> sudo lxc-create --n lxc-test --template download --bdev dir --dir /var/lib/lxc/lxc-test/rootfs -- -d ubuntu -r bionic -a amd64
> sudo lxc-start -n lxc-test
> sudo lxc-stop -n lxc-test
>
> The lxc-stop command never exits and the container continues running.
> If that isn't sufficient to reproduce, please let me know.

Thanks, that's useful! I'm at a conference this week and hence have a
limited amount of time to debug; hopefully Pavel has time to take a look
at this.

--
Jens Axboe

2022-05-03 01:33:28

by Jens Axboe

Subject: Re: [REGRESSION] lxc-stop hang on 5.17.x kernels

On 5/2/22 7:59 AM, Jens Axboe wrote:
> On 5/2/22 7:36 AM, Daniel Harding wrote:
>> On 5/2/22 16:26, Jens Axboe wrote:
>>> On 5/2/22 7:17 AM, Daniel Harding wrote:
>>>> I use lxc-4.0.12 on Gentoo, built with io-uring support
>>>> (--enable-liburing), targeting liburing-2.1. My kernel config is a
>>>> very lightly modified version of Fedora's generic kernel config. After
>>>> moving from the 5.16.x series to the 5.17.x kernel series, I started
>>>> noticing frequent hangs in lxc-stop. It doesn't happen 100% of the
>>>> time, but definitely more than 50% of the time. Bisecting narrowed
>>>> down the issue to commit aa43477b040251f451db0d844073ac00a8ab66ee:
>>>> io_uring: poll rework. Testing indicates the problem is still present
>>>> in 5.18-rc5. Unfortunately I do not have the expertise with the
>>>> codebases of either lxc or io-uring to try to debug the problem
>>>> further on my own, but I can easily apply patches to any of the
>>>> involved components (lxc, liburing, kernel) and rebuild for testing or
>>>> validation. I am also happy to provide any further information that
>>>> would be helpful with reproducing or debugging the problem.
>>> Do you have a recipe to reproduce the hang? That would make it
>>> significantly easier to figure out.
>>
>> I can reproduce it with just the following:
>>
>> sudo lxc-create --n lxc-test --template download --bdev dir --dir /var/lib/lxc/lxc-test/rootfs -- -d ubuntu -r bionic -a amd64
>> sudo lxc-start -n lxc-test
>> sudo lxc-stop -n lxc-test
>>
>> The lxc-stop command never exits and the container continues running.
>> If that isn't sufficient to reproduce, please let me know.
>
> Thanks, that's useful! I'm at a conference this week and hence have
> limited amount of time to debug, hopefully Pavel has time to take a look
> at this.

Didn't manage to reproduce. Can you try, on both the good and bad
kernel, to do:

# echo 1 > /sys/kernel/debug/tracing/events/io_uring/enable

run lxc-stop

# cp /sys/kernel/debug/tracing/trace ~/iou-trace

so we can see what's going on? Looking at the source, lxc is just using
plain POLL_ADD, so I'm guessing it's not getting a notification when it
expects to, or it's POLL_REMOVE not doing its job. If we have a trace
from both a working and broken kernel, that might shed some light on it.
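
For convenience, the three steps above could be wrapped in a small helper
along these lines. This is only a sketch, not a tested recipe: the function
name capture_iou_trace and its arguments are made up for illustration, it
assumes it is run as root with the tracing directory at the path given
above, and it additionally clears the trace buffer first so that only
events from this run end up in the copy.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative assumptions).
# Usage: capture_iou_trace TRACING_DIR OUTPUT_FILE COMMAND [ARGS...]
capture_iou_trace() {
    tracing_dir=$1
    out_file=$2
    shift 2

    # Writing to the trace file empties the buffer, so old events are dropped.
    : > "$tracing_dir/trace"

    # Enable all io_uring trace events.
    echo 1 > "$tracing_dir/events/io_uring/enable"

    # Run the command under test (for example lxc-stop) and keep its status.
    "$@"
    status=$?

    # Stop tracing, then save what was recorded.
    echo 0 > "$tracing_dir/events/io_uring/enable"
    cp "$tracing_dir/trace" "$out_file"

    return "$status"
}

# Example invocation (commented out; needs root and an existing container):
# capture_iou_trace /sys/kernel/debug/tracing ~/iou-trace \
#     timeout 60 lxc-stop -n lxc-test
```

On the bad kernel lxc-stop may never return, so wrapping it in timeout(1)
from coreutils, as in the commented example, lets the trace still get
copied after the hang. The 60 second limit is an arbitrary choice.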

--
Jens Axboe

2022-05-03 01:34:08

by Pavel Begunkov

Subject: Re: [REGRESSION] lxc-stop hang on 5.17.x kernels

On 5/2/22 18:00, Jens Axboe wrote:
> On 5/2/22 7:59 AM, Jens Axboe wrote:
>> On 5/2/22 7:36 AM, Daniel Harding wrote:
>>> On 5/2/22 16:26, Jens Axboe wrote:
>>>> On 5/2/22 7:17 AM, Daniel Harding wrote:
>>>>> I use lxc-4.0.12 on Gentoo, built with io-uring support
>>>>> (--enable-liburing), targeting liburing-2.1. My kernel config is a
>>>>> very lightly modified version of Fedora's generic kernel config. After
>>>>> moving from the 5.16.x series to the 5.17.x kernel series, I started
>>>>> noticing frequent hangs in lxc-stop. It doesn't happen 100% of the
>>>>> time, but definitely more than 50% of the time. Bisecting narrowed
>>>>> down the issue to commit aa43477b040251f451db0d844073ac00a8ab66ee:
>>>>> io_uring: poll rework. Testing indicates the problem is still present
>>>>> in 5.18-rc5. Unfortunately I do not have the expertise with the
>>>>> codebases of either lxc or io-uring to try to debug the problem
>>>>> further on my own, but I can easily apply patches to any of the
>>>>> involved components (lxc, liburing, kernel) and rebuild for testing or
>>>>> validation. I am also happy to provide any further information that
>>>>> would be helpful with reproducing or debugging the problem.
>>>> Do you have a recipe to reproduce the hang? That would make it
>>>> significantly easier to figure out.
>>>
>>> I can reproduce it with just the following:
>>>
>>> sudo lxc-create --n lxc-test --template download --bdev dir --dir /var/lib/lxc/lxc-test/rootfs -- -d ubuntu -r bionic -a amd64
>>> sudo lxc-start -n lxc-test
>>> sudo lxc-stop -n lxc-test
>>>
>>> The lxc-stop command never exits and the container continues running.
>>> If that isn't sufficient to reproduce, please let me know.
>>
>> Thanks, that's useful! I'm at a conference this week and hence have
>> limited amount of time to debug, hopefully Pavel has time to take a look
>> at this.
>
> Didn't manage to reproduce. Can you try, on both the good and bad
> kernel, to do:

Same here, it doesn't reproduce for me.


> # echo 1 > /sys/kernel/debug/tracing/events/io_uring/enable
>
> run lxc-stop
>
> # cp /sys/kernel/debug/tracing/trace ~/iou-trace
>
> so we can see what's going on? Looking at the source, lxc is just using
> plain POLL_ADD, so I'm guessing it's not getting a notification when it
> expects to, or it's POLL_REMOVE not doing its job. If we have a trace
> from both a working and broken kernel, that might shed some light on it.

--
Pavel Begunkov