2020-08-22 15:49:32

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
>>>>> Resending this one, as I've been carrying it privately since May. The
>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
>>>>> please consider this one for 5.9. Thanks!
>>>>
>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
>>>> missing something? It's on my queue to send to Linus once I get my
>>>> (late) ext4 primary pull request for 5.9.
>>>
>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
>>
>> Didn't see it in the queue that you just sent in, is it still queued up?
>
> It wasn't in the queue which I queued up because that was based on
> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
> that's totally on me.
>
> He has said that he's going to start ignoring pull requests that
> aren't fixes-only if this becomes a pattern, so while I can send him
> another pull request which will just have that one change, there are
> no guarantees he's going to take it at this late date.
>
> Sorry, when you sent me the commit saying that the changes that were
> needed were already upstream on August 3rd, I thought that meant that
> they were already in Linus's tree. I should have checked and noticed
> that, in fact, "ext4: flag as supporting buffered async reads"
> wasn't compiling against Linus's upstream tree, so I didn't realize
> this needed to be handled as a special case during the merge window.

Well, to be honest, this kind of sucks. I've been posting it since May,
and the ideal approach would have been to just ack it so I could have
carried it in my tree. That's what we did for btrfs and XFS, both of
which have it.

The required patches *were* upstreamed on August 3rd, which is why I
mentioned that. But yes, not in 5.8 or earlier, of course.

So I suggest that you either include it in your next pull request to
Linus, or that I put it in with your ack. Either is fine with me. I'd
consider this a "dropping the ball" kind of thing; it's not like the
patch hasn't been in linux-next or hasn't been ready for months. This
isn't some "oh I wrote this feature after the merge window" event. It'd
be a real shame to ship 5.9 with ext4 not supporting the more
efficient async buffered reads, imho, especially since the two other
major local file systems already have it.
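
For context, the ext4 patch in question boils down to setting the
FMODE_BUF_RASYNC flag at file open time, the same flag XFS and btrfs
already set, so that io_uring can complete buffered reads from the page
cache inline instead of punting them to an async worker. A rough sketch
of the shape of the change (the real ext4_file_open() does a lot more
setup, so this is illustrative only, not the literal patch):

        static int ext4_file_open(struct inode *inode, struct file *filp)
        {
                /* ... existing open-time setup elided ... */

                /* Advertise support for io_uring async buffered reads,
                 * alongside the existing FMODE_NOWAIT support.
                 */
                filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
                return dquot_file_open(inode, filp);
        }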

Let me know what you think.

--
Jens Axboe


2020-08-24 10:57:38

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 8/22/20 9:48 AM, Jens Axboe wrote:
> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
>>>>>> Resending this one, as I've been carrying it privately since May. The
>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
>>>>>> please consider this one for 5.9. Thanks!
>>>>>
>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
>>>>> missing something? It's on my queue to send to Linus once I get my
>>>>> (late) ext4 primary pull request for 5.9.
>>>>
>>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
>>>
>>> Didn't see it in the queue that just sent in, is it still queued up?
>>
>> It wasn't in the queue which I queued up because that was based on
>> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
>> that's totally on me.
>>
>> He has said that he's going to start ignoring pull requests that
>> aren't fixes only if this becomes a pattern, so while I can send him
>> another pull request which will just have that one change, there are
>> no guarantees he's going to take it at this late date.
>>
>> Sorry, when you sent me the commit saying that the changes that were
>> needed were already upstream on August 3rd, I thought that meant that
>> they were already in Linus's tree. I should have checked and noticed
>> that, in fact, "ext4: flag as supporting buffered async reads"
>> wasn't compiling against Linus's upstream tree, so I didn't realize
>> this needed to be handled as a special case during the merge window.
>
> Well to be honest, this kind of sucks. I've been posting it since May,
> and the ideal approach would have been to just ack it and I could have
> carried it in my tree. That's what we did for btrfs and XFS, both of
> which have it.
>
> The required patches *were* upstreamed on August 3rd, which is why I
> mentioned that. But yes, not in 5.8 or earlier, of course.
>
> So I suggest that you either include it for the next pull request for
> Linus, or that I put it in with your ack. Either is fine with me. I'd
> consider this a "dropping the ball" kind of thing, it's not like the
> patch hasn't been in linux-next or hasn't been ready for months. This
> isn't some "oh I wrote this feature after the merge window" event. It'd
> be a real shame to ship 5.9 and ext4 not have support for the more
> efficient async buffered reads, imho, especially since the two other
> major local file systems already have it.
>
> Let me know what you think.

Ted, can you make a call on this, please? It's now post -rc2. Let's
get this settled and included, one way or another.


--
Jens Axboe

2020-08-25 14:19:46

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 8/24/20 4:56 AM, Jens Axboe wrote:
> On 8/22/20 9:48 AM, Jens Axboe wrote:
>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
>>>>>>> Resending this one, as I've been carrying it privately since May. The
>>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
>>>>>>> please consider this one for 5.9. Thanks!
>>>>>>
>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
>>>>>> missing something? It's on my queue to send to Linus once I get my
>>>>>> (late) ext4 primary pull request for 5.9.
>>>>>
>>>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
>>>>
>>>> Didn't see it in the queue that just sent in, is it still queued up?
>>>
>>> It wasn't in the queue which I queued up because that was based on
>>> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
>>> that's totally on me.
>>>
>>> He has said that he's going to start ignoring pull requests that
>>> aren't fixes only if this becomes a pattern, so while I can send him
>>> another pull request which will just have that one change, there are
>>> no guarantees he's going to take it at this late date.
>>>
>>> Sorry, when you sent me the commit saying that the changes that were
>>> needed were already upstream on August 3rd, I thought that meant that
>>> they were already in Linus's tree. I should have checked and noticed
>>> that, in fact, "ext4: flag as supporting buffered async reads"
>>> wasn't compiling against Linus's upstream tree, so I didn't realize
>>> this needed to be handled as a special case during the merge window.
>>
>> Well to be honest, this kind of sucks. I've been posting it since May,
>> and the ideal approach would have been to just ack it and I could have
>> carried it in my tree. That's what we did for btrfs and XFS, both of
>> which have it.
>>
>> The required patches *were* upstreamed on August 3rd, which is why I
>> mentioned that. But yes, not in 5.8 or earlier, of course.
>>
>> So I suggest that you either include it for the next pull request for
>> Linus, or that I put it in with your ack. Either is fine with me. I'd
>> consider this a "dropping the ball" kind of thing, it's not like the
>> patch hasn't been in linux-next or hasn't been ready for months. This
>> isn't some "oh I wrote this feature after the merge window" event. It'd
>> be a real shame to ship 5.9 and ext4 not have support for the more
>> efficient async buffered reads, imho, especially since the two other
>> major local file systems already have it.
>>
>> Let me know what you think.
>
> Ted, can you make a call on this, please? It's now post -rc2. Let's
> get this settled and included, one way or another.

Daily ping on this one...

--
Jens Axboe

2020-08-27 01:55:11

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 8/25/20 8:18 AM, Jens Axboe wrote:
> On 8/24/20 4:56 AM, Jens Axboe wrote:
>> On 8/22/20 9:48 AM, Jens Axboe wrote:
>>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
>>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
>>>>>>>> Resending this one, as I've been carrying it privately since May. The
>>>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
>>>>>>>> please consider this one for 5.9. Thanks!
>>>>>>>
>>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
>>>>>>> missing something? It's on my queue to send to Linus once I get my
>>>>>>> (late) ext4 primary pull request for 5.9.
>>>>>>
>>>>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
>>>>>
>>>>> Didn't see it in the queue that just sent in, is it still queued up?
>>>>
>>>> It wasn't in the queue which I queued up because that was based on
>>>> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
>>>> that's totally on me.
>>>>
>>>> He has said that he's going to start ignoring pull requests that
>>>> aren't fixes only if this becomes a pattern, so while I can send him
>>>> another pull request which will just have that one change, there are
>>>> no guarantees he's going to take it at this late date.
>>>>
>>>> Sorry, when you sent me the commit saying that the changes that were
>>>> needed were already upstream on August 3rd, I thought that meant that
>>>> they were already in Linus's tree. I should have checked and noticed
>>>> that, in fact, "ext4: flag as supporting buffered async reads"
>>>> wasn't compiling against Linus's upstream tree, so I didn't realize
>>>> this needed to be handled as a special case during the merge window.
>>>
>>> Well to be honest, this kind of sucks. I've been posting it since May,
>>> and the ideal approach would have been to just ack it and I could have
>>> carried it in my tree. That's what we did for btrfs and XFS, both of
>>> which have it.
>>>
>>> The required patches *were* upstreamed on August 3rd, which is why I
>>> mentioned that. But yes, not in 5.8 or earlier, of course.
>>>
>>> So I suggest that you either include it for the next pull request for
>>> Linus, or that I put it in with your ack. Either is fine with me. I'd
>>> consider this a "dropping the ball" kind of thing, it's not like the
>>> patch hasn't been in linux-next or hasn't been ready for months. This
>>> isn't some "oh I wrote this feature after the merge window" event. It'd
>>> be a real shame to ship 5.9 and ext4 not have support for the more
>>> efficient async buffered reads, imho, especially since the two other
>>> major local file systems already have it.
>>>
>>> Let me know what you think.
>>
>> Ted, can you make a call on this, please? It's now post -rc2. Let's
>> get this settled and included, one way or another.
>
> Daily ping on this one...

And again. Ted, not sure how to make any progress with this, to be
honest, it's like pounding sand.

--
Jens Axboe

2020-09-04 00:11:00

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 8/26/20 7:54 PM, Jens Axboe wrote:
> On 8/25/20 8:18 AM, Jens Axboe wrote:
>> On 8/24/20 4:56 AM, Jens Axboe wrote:
>>> On 8/22/20 9:48 AM, Jens Axboe wrote:
>>>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
>>>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
>>>>>>>>> Resending this one, as I've been carrying it privately since May. The
>>>>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
>>>>>>>>> please consider this one for 5.9. Thanks!
>>>>>>>>
>>>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
>>>>>>>> missing something? It's on my queue to send to Linus once I get my
>>>>>>>> (late) ext4 primary pull request for 5.9.
>>>>>>>
>>>>>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
>>>>>>
>>>>>> Didn't see it in the queue that just sent in, is it still queued up?
>>>>>
>>>>> It wasn't in the queue which I queued up because that was based on
>>>>> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
>>>>> that's totally on me.
>>>>>
>>>>> He has said that he's going to start ignoring pull requests that
>>>>> aren't fixes only if this becomes a pattern, so while I can send him
>>>>> another pull request which will just have that one change, there are
>>>>> no guarantees he's going to take it at this late date.
>>>>>
>>>>> Sorry, when you sent me the commit saying that the changes that were
>>>>> needed were already upstream on August 3rd, I thought that meant that
>>>>> they were already in Linus's tree. I should have checked and noticed
>>>>> that, in fact, "ext4: flag as supporting buffered async reads"
>>>>> wasn't compiling against Linus's upstream tree, so I didn't realize
>>>>> this needed to be handled as a special case during the merge window.
>>>>
>>>> Well to be honest, this kind of sucks. I've been posting it since May,
>>>> and the ideal approach would have been to just ack it and I could have
>>>> carried it in my tree. That's what we did for btrfs and XFS, both of
>>>> which have it.
>>>>
>>>> The required patches *were* upstreamed on August 3rd, which is why I
>>>> mentioned that. But yes, not in 5.8 or earlier, of course.
>>>>
>>>> So I suggest that you either include it for the next pull request for
>>>> Linus, or that I put it in with your ack. Either is fine with me. I'd
>>>> consider this a "dropping the ball" kind of thing, it's not like the
>>>> patch hasn't been in linux-next or hasn't been ready for months. This
>>>> isn't some "oh I wrote this feature after the merge window" event. It'd
>>>> be a real shame to ship 5.9 and ext4 not have support for the more
>>>> efficient async buffered reads, imho, especially since the two other
>>>> major local file systems already have it.
>>>>
>>>> Let me know what you think.
>>>
>>> Ted, can you make a call on this, please? It's now post -rc2. Let's
>>> get this settled and included, one way or another.
>>
>> Daily ping on this one...
>
> And again. Ted, not sure how to make any progress with this, to be
> honest, it's like pounding sand.

And 8 days later...

--
Jens Axboe

2020-09-04 03:57:25

by Theodore Ts'o

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

Hi Jens,

Sorry, the past few months have been *crazy* insane. Between Linux
Plumbers Conference organizing, having to deal with some interns at
$WORK, performance reviews, etc., etc., it's been a perfect storm.
As a result, I was late getting the primary ext4 pull request to
Linus, and he's already yelled at me for a late pull request.

So I'm being super paranoid about asking Linus for anything,
even a one-time change, if it's not a bug fix. And what you are
asking for isn't a bug fix, but enabling a new feature.

Worse, right now, -rc1 and -rc2 are causing random crashes in my
gce-xfstests framework. Sometimes it happens before we've run even a
single test; sometimes it happens after we have successfully
completed all of the tests, and we're doing a shutdown of the VM under
test. Other times it happens in the middle of a test run. Given that
I'm seeing this at -rc1, which is before my late ext4 pull request to
Linus, it's probably not an ext4 related bug. But it also means that
I'm partially blind in terms of my kernel testing at the moment. So I
can't even tell Linus that I've run lots of tests and I'm 100%
confident your one-line change is 100% safe.

Next week after Labor Day, I should be completely done with the
performance review writeups that are on my plate, and I will hopefully
have a bit more time to try to debug these mysterious failures. Or
maybe someone else will be able to figure out what the heck is going
wrong, and perhaps -rc3 will make all of these failures go away.

I'm sorry you're frustrated; I'm frustrated too! But until I can find
the time to do a full bisect (v5.8 works just fine; v5.9-rc1 is
flaky as hell when booting my test config in a GCE VM), I'm not in a
position to send anything to Linus.

Sorry,

- Ted

[ 3326.130032] BUG: unable to handle page fault for address: ffff949e7c958e3f
[ 3326.137056] #PF: supervisor write access in kernel mode
[ 3326.142497] #PF: error_code(0x0002) - not-present page
[ 3326.147776] PGD 23fe01067 P4D 23fe01067 PUD 0
[ 3326.154384] Oops: 0002 [#1] SMP PTI
[ 3326.157995] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc2-xfstests #1886
[ 3326.165337] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 3326.174695] RIP: 0010:free_block+0x14b/0x260
[ 3326.180083] Code: af f1 0f b6 4d 1c 48 c1 ee 20 29 f0 d3 e8 0f b6 4d 1d 01 f0 49 8b 75 20 d3 e8 8d 4f ff 48 85 f6 41 89 4d 30 0f 84 f8 00 00 00 <88> 04 0e 41 8b 45 30 85 c0 0f 84 f9 fe ff ff 4d 8b 67 50 49 8d 57
[ 3326.199667] RSP: 0018:ffffb11c80003ab8 EFLAGS: 00010082
[ 3326.205015] RAX: 0000000000000027 RBX: ffffd11c7fa007b0 RCX: 00000000ffffffff
[ 3326.212282] RDX: fffffab945f25608 RSI: ffff949d7c958e40 RDI: 0000000000000000
[ 3326.219537] RBP: ffff949dd8041180 R08: fffffab9458bdf88 R09: 00000000000c201d
[ 3326.226791] R10: 0000000000000000 R11: 0000000000080000 R12: ffff949d7c9589c0
[ 3326.234151] R13: fffffab945f25600 R14: fffffab945f25608 R15: ffff949dd8042240
[ 3326.241512] FS: 0000000000000000(0000) GS:ffff949dd9000000(0000) knlGS:0000000000000000
[ 3326.249725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3326.255590] CR2: ffff949e7c958e3f CR3: 00000001cf5b4003 CR4: 00000000003706f0
[ 3326.262863] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3326.270214] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3326.277481] Call Trace:
[ 3326.280046] <IRQ>
[ 3326.282182] cache_flusharray+0x90/0x110
[ 3326.286239] ___cache_free+0x2e0/0x390
[ 3326.290130] kfree+0xc9/0x1d0
[ 3326.293222] kmem_freepages+0xa0/0xf0
[ 3326.297014] slabs_destroy+0x8e/0xd0
[ 3326.300717] cache_flusharray+0xa5/0x110
[ 3326.304761] ___cache_free+0x2e0/0x390
[ 3326.308634] kfree+0xc9/0x1d0
[ 3326.311722] kmem_freepages+0xa0/0xf0
[ 3326.315508] slabs_destroy+0x8e/0xd0
[ 3326.319235] cache_flusharray+0xa5/0x110
[ 3326.323305] ___cache_free+0x2e0/0x390
[ 3326.328915] kfree+0xc9/0x1d0
[ 3326.332028] kmem_freepages+0xa0/0xf0
[ 3326.336531] slabs_destroy+0x8e/0xd0
[ 3326.340251] cache_flusharray+0xa5/0x110
[ 3326.344318] ___cache_free+0x2e0/0x390
[ 3326.348209] ? rcu_do_batch+0x14e/0x780
[ 3326.352167] kmem_cache_free.part.0+0x34/0x110
[ 3326.356766] rcu_do_batch+0x19e/0x780
[ 3326.360569] rcu_core+0xef/0x190
[ 3326.363920] __do_softirq+0xc9/0x425
[ 3326.367644] asm_call_on_stack+0x12/0x20
[ 3326.371685] </IRQ>
[ 3326.373906] do_softirq_own_stack+0x4e/0x60
[ 3326.378583] irq_exit_rcu+0x96/0xd0
[ 3326.382198] sysvec_apic_timer_interrupt+0x43/0x90
[ 3326.387129] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 3326.392405] RIP: 0010:acpi_safe_halt+0x29/0x30
[ 3326.397056] Code: cc 0f 1f 44 00 00 65 48 8b 04 25 00 6e 01 00 48 8b 00 a8 08 74 01 c3 e8 25 96 5c ff e9 07 00 00 00 0f 00 2d b3 80 4f 00 fb f4 <fa> e9 e1 97 5c ff cc 0f 1f 44 00 00 41 56 49 89 f6 41 55 41 89 d5
[ 3326.415945] RSP: 0018:ffffffffade03e20 EFLAGS: 00000202
[ 3326.421293] RAX: 000000000199dedf RBX: 0000000000000001 RCX: 00000000ffffffff
[ 3326.428656] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffad5e433b
[ 3326.435910] RBP: ffff949dd7a29800 R08: 0000000000000001 R09: 0000000000000001
[ 3326.443169] R10: ffff949dd91ea724 R11: 00000000000008ee R12: ffff949dd7a29864
[ 3326.450420] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[ 3326.457690] ? acpi_safe_halt+0x1b/0x30
[ 3326.461647] acpi_idle_enter+0x1d0/0x260
[ 3326.465704] cpuidle_enter_state+0x6e/0x3a0
[ 3326.470019] cpuidle_enter+0x29/0x40
[ 3326.473726] cpuidle_idle_call+0xf8/0x160
[ 3326.477854] do_idle+0x72/0xc0
[ 3326.481036] cpu_startup_entry+0x19/0x20
[ 3326.485079] start_kernel+0x433/0x452
[ 3326.488899] secondary_startup_64+0xb6/0xc0
[ 3326.493204] CR2: ffff949e7c958e3f
[ 3326.496757] ---[ end trace 464b5b002bebb81c ]---



On Thu, Sep 03, 2020 at 06:10:19PM -0600, Jens Axboe wrote:
> On 8/26/20 7:54 PM, Jens Axboe wrote:
> > On 8/25/20 8:18 AM, Jens Axboe wrote:
> >> On 8/24/20 4:56 AM, Jens Axboe wrote:
> >>> On 8/22/20 9:48 AM, Jens Axboe wrote:
> >>>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
> >>>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
> >>>>>>>>> Resending this one, as I've been carrying it privately since May. The
> >>>>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
> >>>>>>>>> please consider this one for 5.9. Thanks!
> >>>>>>>>
> >>>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
> >>>>>>>> missing something? It's on my queue to send to Linus once I get my
> >>>>>>>> (late) ext4 primary pull request for 5.9.
> >>>>>>>
> >>>>>>> Right, it went in at the start of the merge window for 5.9. Thanks Ted!
> >>>>>>
> >>>>>> Didn't see it in the queue that just sent in, is it still queued up?
> >>>>>
> >>>>> It wasn't in the queue which I queued up because that was based on
> >>>>> 5.8-rc4. Linus was a bit grumpy (fairly so) because it was late, and
> >>>>> that's totally on me.
> >>>>>
> >>>>> He has said that he's going to start ignoring pull requests that
> >>>>> aren't fixes only if this becomes a pattern, so while I can send him
> >>>>> another pull request which will just have that one change, there are
> >>>>> no guarantees he's going to take it at this late date.
> >>>>>
> >>>>> Sorry, when you sent me the commit saying that the changes that were
> >>>>> needed were already upstream on August 3rd, I thought that meant that
> >>>>> they were already in Linus's tree. I should have checked and noticed
> >>>>> that, in fact, "ext4: flag as supporting buffered async reads"
> >>>>> wasn't compiling against Linus's upstream tree, so I didn't realize
> >>>>> this needed to be handled as a special case during the merge window.
> >>>>
> >>>> Well to be honest, this kind of sucks. I've been posting it since May,
> >>>> and the ideal approach would have been to just ack it and I could have
> >>>> carried it in my tree. That's what we did for btrfs and XFS, both of
> >>>> which have it.
> >>>>
> >>>> The required patches *were* upstreamed on August 3rd, which is why I
> >>>> mentioned that. But yes, not in 5.8 or earlier, of course.
> >>>>
> >>>> So I suggest that you either include it for the next pull request for
> >>>> Linus, or that I put it in with your ack. Either is fine with me. I'd
> >>>> consider this a "dropping the ball" kind of thing, it's not like the
> >>>> patch hasn't been in linux-next or hasn't been ready for months. This
> >>>> isn't some "oh I wrote this feature after the merge window" event. It'd
> >>>> be a real shame to ship 5.9 and ext4 not have support for the more
> >>>> efficient async buffered reads, imho, especially since the two other
> >>>> major local file systems already have it.
> >>>>
> >>>> Let me know what you think.
> >>>
> >>> Ted, can you make a call on this, please? It's now post -rc2. Let's
> >>> get this settled and included, one way or another.
> >>
> >> Daily ping on this one...
> >
> > And again. Ted, not sure how to make any progress with this, to be
> > honest, it's like pounding sand.
>
> And 8 days later...
>
> --
> Jens Axboe
>

2020-09-04 14:53:10

by Jens Axboe

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On 9/3/20 9:55 PM, Theodore Y. Ts'o wrote:
> Hi Jens,
>
> Sorry, the past few months have been *crazy* insane. Between Linux
> Plumbers Conference organizing, having to deal with some interns at
> $WORK, performance reviews, etc.., etc., it's been a perfect storm.
> As a result, I was late getting the primary ext4 pull request to
> Linus, and he's already yelled at me for a late pull request.
>
> As a result, I'm being super paranoid about asking Linus for anything,
> even a one-time change, if it's not a bug fix. And what you are
> asking for isn't a bug fix, but enabling a new feature.

That's fine; we blew past the 5.9 window long ago imho. As mentioned,
back in May we could have solved this by just adding your acked-by
or reviewed-by to the patch, like we did for xfs/btrfs. That removes the
ext4 tree dependency. Obviously that's no longer a concern for 5.10, but
I need to know which of two routes we're going down:

a) I'm queueing this up, in which case I'll still need that ack/review
b) you're queueing it up, in which case you can just grab it

I just don't want to end up in the same situation for 5.10, and then
we're pushing this out 2 releases at that point...

> Worse, right now, -rc1 and -rc2 is causing random crashes in my
> gce-xfstests framework. Sometimes it happens before we've run even a
> single xfstests; sometimes it happens after we have successfully
> completed all of the tests, and we're doing a shutdown of the VM under
> test. Other times it happens in the middle of a test run. Given that
> I'm seeing this at -rc1, which is before my late ext4 pull request to
> Linus, it's probably not an ext4 related bug. But it also means that
> I'm partially blind in terms of my kernel testing at the moment. So I
> can't even tell Linus that I've run lots of tests and I'm 100%
> confident your one-line change is 100% safe.
>
> Next week after Labor Day, I should be completely done with the
> performance review writeups that are on my plate, and I will hopefully
> have a bit more time to try to debug these mysterious failures. Or
> maybe someone else will be able to figure out what the heck is going
> wrong, and perhaps -rc3 will make all of these failures go away.
>
> I'm sorry you're frustrated; I'm frustrated too! But until I can find
> the time to do a full bisect (v5.8 works just fine. v5.9-rc1 is
> flakey as hell when booting my test config in a GCE VM), I'm not in a
> position to send anything to Linus.

That looks nasty, good luck! Hopefully the bisect will be helpful. FWIW,
in the testing I've done (with this patch) on ext4 and with XFS, I
haven't seen anything odd. That's using real hardware though, and qemu
on the laptop, no GCE.
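
For anyone who wants to poke at this path, the I/O pattern in question is
just an io_uring read against a file opened without O_DIRECT. A minimal
standalone example (assuming liburing is installed; this is only an
illustration of the buffered read path, not the actual test rig mentioned
above):

        /* gcc -Wall -o uring_bufread uring_bufread.c -luring */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <liburing.h>

        int main(int argc, char **argv)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                char buf[4096];
                int fd, ret;

                if (argc < 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }

                /* Buffered open: no O_DIRECT, so reads go through the page cache. */
                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                ret = io_uring_queue_init(8, &ring, 0);
                if (ret < 0) {
                        fprintf(stderr, "queue_init: %d\n", ret);
                        return 1;
                }

                /* Queue a single buffered read at offset 0 and wait for it. */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
                io_uring_submit(&ring);

                ret = io_uring_wait_cqe(&ring, &cqe);
                if (ret < 0) {
                        fprintf(stderr, "wait_cqe: %d\n", ret);
                        return 1;
                }
                printf("read returned %d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);

                io_uring_queue_exit(&ring);
                close(fd);
                return 0;
        }

With the file system setting FMODE_BUF_RASYNC (and the data already in the
page cache), a read like this can complete inline; without the flag,
io_uring has to punt the request to an async worker thread.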

--
Jens Axboe

2020-09-04 15:29:43

by Darrick J. Wong

Subject: Re: [PATCH] ext4: flag as supporting buffered async reads

On Fri, Sep 04, 2020 at 08:51:24AM -0600, Jens Axboe wrote:
> On 9/3/20 9:55 PM, Theodore Y. Ts'o wrote:
> > Hi Jens,
> >
> > Sorry, the past few months have been *crazy* insane. Between Linux
> > Plumbers Conference organizing, having to deal with some interns at
> > $WORK, performance reviews, etc.., etc., it's been a perfect storm.
> > As a result, I was late getting the primary ext4 pull request to
> > Linus, and he's already yelled at me for a late pull request.
> >
> > As a result, I'm being super paranoid about asking Linus for anything,
> > even a one-time change, if it's not a bug fix. And what you are
> > asking for isn't a bug fix, but enabling a new feature.
>
> That's fine, we blew past the 5.9 window long ago imho. As mentioned,
> back in May we could have solved this by just adding your acked-by
> or reviewed-by to the patch like we did for xfs/btrfs. That removes the
> ext4 tree dependency. Obviously that's no longer a concern for 5.10, but
> I need to know if we're going one of two routes:
>
> a) I'm queueing this up, in which case I'll still need that ack/review
> b) you're queueing it up, in which case you can just grab it
>
> I just don't want to end up in the same situation for 5.10, and then
> we're pushing this out 2 releases at that point...
>
> > Worse, right now, -rc1 and -rc2 is causing random crashes in my
> > gce-xfstests framework. Sometimes it happens before we've run even a
> > single xfstests; sometimes it happens after we have successfully
> > completed all of the tests, and we're doing a shutdown of the VM under
> > test. Other times it happens in the middle of a test run. Given that
> > I'm seeing this at -rc1, which is before my late ext4 pull request to
> > Linus, it's probably not an ext4 related bug. But it also means that
> > I'm partially blind in terms of my kernel testing at the moment. So I
> > can't even tell Linus that I've run lots of tests and I'm 100%
> > confident your one-line change is 100% safe.
> >
> > Next week after Labor Day, I should be completely done with the
> > performance review writeups that are on my plate, and I will hopefully
> > have a bit more time to try to debug these mysterious failures. Or
> > maybe someone else will be able to figure out what the heck is going
> > wrong, and perhaps -rc3 will make all of these failures go away.

FWIW I saw a strange crash in the xfs/iomap directio code that I could
never reproduce again, but none of my newly onlined ext4 fstests vms
have died, and I've been running a combination of 5.9-rc1 and 5.9-rc3 +
other xfs goodies. Good luck!

--D

> > I'm sorry you're frustrated; I'm frustrated too! But until I can find
> > the time to do a full bisect (v5.8 works just fine. v5.9-rc1 is
> > flakey as hell when booting my test config in a GCE VM), I'm not in a
> > position to send anything to Linus.
>
> That looks nasty, good luck! Hopefully the bisect will be helpful. FWIW,
> in the testing I've done (with this patch) on ext4 and with XFS, I
> haven't seen anything odd. That's using real hardware though, and qemu
> on the laptop, no GCE.
>
> --
> Jens Axboe
>

2020-09-15 04:46:24

by Theodore Ts'o

Subject: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> Worse, right now, -rc1 and -rc2 is causing random crashes in my
> gce-xfstests framework. Sometimes it happens before we've run even a
> single xfstests; sometimes it happens after we have successfully
> completed all of the tests, and we're doing a shutdown of the VM under
> test. Other times it happens in the middle of a test run. Given that
> I'm seeing this at -rc1, which is before my late ext4 pull request to
> Linus, it's probably not an ext4 related bug. But it also means that
> I'm partially blind in terms of my kernel testing at the moment. So I
> can't even tell Linus that I've run lots of tests and I'm 100%
> confident your one-line change is 100% safe.

I was finally able to bisect it down to the commit:

37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

(See below for [1] Bisect log.)

The previous commit allows the tests to run to completion. With
commit 37f4a24c2469 and later, the VM will get stuck in all 11 test
scenarios (4k blocks, 1k blocks, ext3 compat, ext4 w/ fscrypt,
nojournal mode, data=journal, bigalloc, etc.).

The symptom is that while running xfstests in a Google Compute Engine
(GCE) VM, the tests just hang. There are a number of tests where this
is more likely, but it's not unique to a single test.

In most cases there is nothing to see; the test just stops running
until the test framework times out after an hour (tests usually complete
in seconds, or at most a few tens of minutes in the worst case) and
kills the VM. In one case, I did get a report like this. (See below
for [2] stack trace from 37f4a24c2469.)

I attempted to revert the commit in question against -rc1 and -rc4;
that result can be found at the branches manual-revert-of-blk-mq-patch
and manual-revert-of-blk-mq-patch-rc4 at https://github.com/tytso/ext4.

I don't think I got the revert quite right; with the revert, most of
the test VMs successfully complete, but 2 out of the 11 fail, with a
different stack trace. (See below for [3] stack trace from my
attempted manual revert of 37f4a24c2469.) But it does seem to confirm
that the test VM hangs are primarily caused by commit
37f4a24c2469.

Does this make any sense as to what might be going on? I hope it does
for you, since I'm pretty confused.

Thanks,

- Ted


[1] Bisect log

git bisect start
# bad: [9123e3a74ec7b934a4a099e98af6a61c2f80bbf5] Linux 5.9-rc1
git bisect bad 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5
# good: [bcf876870b95592b52519ed4aafcf9d95999bc9c] Linux 5.8
git bisect good bcf876870b95592b52519ed4aafcf9d95999bc9c
# bad: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
git bisect bad 8186749621ed6b8fc42644c399e8c755a2b6f630
# bad: [2324d50d051ec0f14a548e78554fb02513d6dcef] Merge tag 'docs-5.9' of git://git.lwn.net/linux
git bisect bad 2324d50d051ec0f14a548e78554fb02513d6dcef
# bad: [92c59e126b21fd212195358a0d296e787e444087] Merge tag 'arm-defconfig-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 92c59e126b21fd212195358a0d296e787e444087
# bad: [cdc8fcb49905c0b67e355e027cb462ee168ffaa3] Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
git bisect bad cdc8fcb49905c0b67e355e027cb462ee168ffaa3
# good: [ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84
# bad: [d958e343bdc3de2643ce25225bed082dc222858d] block: blk-timeout: delete duplicated word
git bisect bad d958e343bdc3de2643ce25225bed082dc222858d
# bad: [53042f3cc411adc79811ba3cfbca5d7a42a7b806] ps3vram: stop using ->queuedata
git bisect bad 53042f3cc411adc79811ba3cfbca5d7a42a7b806
# good: [621c1f42945e76015c3a585e7a9fe6e71665eba0] block: move struct block_device to blk_types.h
git bisect good 621c1f42945e76015c3a585e7a9fe6e71665eba0
# good: [36a3df5a4574d5ddf59804fcd0c4e9654c514d9a] blk-mq: put driver tag when this request is completed
git bisect good 36a3df5a4574d5ddf59804fcd0c4e9654c514d9a
# good: [570e9b73b0af2e5381ca5343759779b8c1ed20e3] blk-mq: move blk_mq_get_driver_tag into blk-mq.c
git bisect good 570e9b73b0af2e5381ca5343759779b8c1ed20e3
# bad: [b5fc1e8bedf8ad2c6381e0df6331ad5686aca425] blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()
git bisect bad b5fc1e8bedf8ad2c6381e0df6331ad5686aca425
# bad: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related handling into blk_mq_get_driver_tag
git bisect bad 37f4a24c2469a10a4c16c641671bd766e276cf9f
# good: [723bf178f158abd1ce6069cb049581b3cb003aab] blk-mq: move blk_mq_put_driver_tag() into blk-mq.c
git bisect good 723bf178f158abd1ce6069cb049581b3cb003aab
# first bad commit: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related handling into blk_mq_get_driver_tag

========================================================================

[2] Lockup stack traces from commit 37f4a24c2469

[ 1842.777020] INFO: task fsstress:20611 blocked for more than 122 seconds.
[ 1842.784641] Not tainted 5.8.0-rc2-xfstests-00060-g37f4a24c2469 #2
[ 1842.791650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1842.799971] fsstress        D    0 20611 20601 0x00004000
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.799979] Call Trace:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.799993] __schedule+0x2cc/0x6e0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800002] ? sched_clock_cpu+0xc/0xb0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800009] ? sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800012] schedule+0x55/0xd0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800017] schedule_preempt_disabled+0x11/0x20
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800020] __mutex_lock+0x397/0x9b0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800025] ? mark_held_locks+0x47/0x70
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800031] ? kvm_sched_clock_read+0x14/0x30
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800034] ? sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800036] sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800042] ? __ia32_sys_tee+0xc0/0xc0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800047] iterate_supers+0x8c/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800051] ksys_sync+0x40/0xb0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800055] __do_sys_sync+0xa/0x10
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800059] do_syscall_64+0x56/0xa0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800065] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.800070] RIP: 0033:0x7f4f526c5287
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.805552] Code: Bad RIP value.
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.809028] RSP: 002b:00007ffd02de72e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.817482] RAX: ffffffffffffffda RBX: 000000000000001d RCX: 00007f4f526c5287
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.824896] RDX: 0000000000000000 RSI: 00000000107e85e6 RDI: 000000000000001d
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.834501] RBP: 0000556a76eed4e0 R08: 000000000000002a R09: 0000556a78579410
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.841927] R10: fffffffffffff146 R11: 0000000000000206 R12: 00000000000001f4
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.849331] R13: 0000000051eb851f R14: 00007ffd02de7360 R15: 0000556a76eda2f0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.856928]
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.856928] Showing all locks held in the system:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.863371] 1 lock held by khungtaskd/23:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.867539] #0: ffffffff8de69100 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.53+0x0/0x30
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.877502] 6 locks held by kworker/u4:1/77:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.883403] #0: ffff8a61183a5d48 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x1cf/0x590
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.895259] #1: ffffa607419cbe68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1cf/0x590
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.908549] #2: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: trylock_super+0x16/0x50
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.920206] #3: ffff8a60fc8c0ae8 (&sbi->s_writepages_rwsem){++++}-{0:0}, at: do_writepages+0x41/0xe0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.930026] #4: ffff8a6111fbf8d8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x1b2/0x690
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.941998] #5: ffff8a60fc35a610 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0xd4/0x640
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.951178] 4 locks held by kworker/u4:3/510:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.957297] #0: ffff8a610f3ce548 ((wq_completion)ext4-rsv-conversion#2){+.+.}-{0:0}, at: process_one_work+0x1cf/0x590
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.970201] #1: ffffa6074055be68 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}-{0:0}, at: process_one_work+0x1cf/0x590
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.982027] #2: ffff8a6111fbf8d8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x1b2/0x690
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1842.994472] #3: ffff8a60ff80f7d0 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0xd4/0x640
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.003697] 1 lock held by in:imklog/1887:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.009383] 5 locks held by rs:main Q:Reg/1888:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.014635] 3 locks held by fsstress/20602:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.020470] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x18b/0x1c0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.028764] #1: ffff8a6105e53848 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_buffered_write_iter+0x33/0x120
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.041101] #2: ffff8a6111fbf8d8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x1b2/0x690
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.050219] 2 locks held by fsstress/20604:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.056033] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.065637] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.076050] 1 lock held by fsstress/20605:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.080301] #0: ffff8a60dece9708 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_write_iter+0x50b/0x700
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.090782] 1 lock held by fsstress/20606:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.096535] #0: ffff8a60fe7c1670 (&pipe->mutex/1){+.+.}-{3:3}, at: do_splice+0x509/0x660
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.105309] 2 locks held by fsstress/20607:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.111362] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.121495] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.131856] 2 locks held by fsstress/20608:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.136214] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x18b/0x1c0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.145943] #1: ffff8a6111740488 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_write_iter+0x2c7/0x700
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.156514] 2 locks held by fsstress/20610:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.162313] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.172161] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.182439] 2 locks held by fsstress/20611:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.186975] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.196426] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.206798] 2 locks held by fsstress/20612:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.211146] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x18b/0x1c0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.219359] #1: ffff8a60fc24d6c8 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_write_iter+0x2c7/0x700
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.232693] 2 locks held by fsstress/20613:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.237043] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x18b/0x1c0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.245394] #1: ffff8a6100fa69c8 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_write_iter+0x50b/0x700
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.257573] 2 locks held by fsstress/20614:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.261909] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.271222] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.280216] 2 locks held by fsstress/20616:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.284594] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.294701] #1: ffff8a610bff3508 (&type->i_mutex_dir_key#3){++++}-{3:3}, at: path_openat+0x336/0x980
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.304245] 2 locks held by fsstress/20617:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.310081] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.321202] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.333613] 2 locks held by fsstress/20618:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.339777] #0: ffff8a60fc9a3488 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x18b/0x1c0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.348226] #1: ffff8a610bea0888 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_write_iter+0x50b/0x700
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.360211] 2 locks held by fsstress/20619:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.364558] #0: ffff8a60fc9a30e8 (&type->s_umount_key#36){++++}-{3:3}, at: iterate_supers+0x74/0xf0
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.373869] #1: ffff8a60fc9a3978 (&s->s_sync_lock){+.+.}-{3:3}, at: sync_inodes_sb+0x120/0x480
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.384165] 1 lock held by fsstress/20620:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.388424] #0: ffff8a61117b1188 (&bdev->bd_mutex){+.+.}-{3:3}, at: iterate_bdevs+0x96/0x13f
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.397134] 1 lock held by fsstress/20621:
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.402773] #0: ffff8a6100c8b8c8 (&sb->s_type->i_mutex_key#13){++++}-{3:3}, at: ext4_file_read_iter+0x11c/0x160
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.413537]
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.416751] =============================================
Sep 10 22:26:22 xfstests-ltm-20200910215247-aa kernel: [ 1843.416751]

========================================================================

[3] stack trace from my attempted manual revert of 37f4a24c2469

[ 2738.659707] BUG: unable to handle page fault for address: ffff8d7c2825e59e
[ 2738.666740] #PF: supervisor write access in kernel mode
[ 2738.672172] #PF: error_code(0x0002) - not-present page
[ 2738.677424] PGD 23fe01067 P4D 23fe01067 PUD 1dffff067 PMD 0
[ 2738.683217] Oops: 0002 [#1] SMP PTI
[ 2738.686822] CPU: 1 PID: 15400 Comm: kworker/1:0 Not tainted 5.9.0-rc1-xfstests-13682-g6a184ca6da62 #1
[ 2738.696238] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 2738.705571] Workqueue: events free_work
[ 2738.709638] RIP: 0010:free_block+0x157/0x270
[ 2738.714029] Code: b6 4c 24 1c 48 c1 ee 20 29 f0 d3 e8 41 0f b6 4c 24 1d 01 f0 49 8b 76 20 d3 e8 8d 4f ff 48 85 f6 41 89 4e 30 0f 84 01 01 00 00 <88> 04 0e 41 8b 46 30 85 c0 0f 84 ed fe ff ff 4d 8b 6f 50 4d 8d 4f
[ 2738.732911] RSP: 0018:ffff9131052d7ad0 EFLAGS: 00010046
[ 2738.738253] RAX: 000000000000002d RBX: ffffb130ffc00bd8 RCX: 00000000fffffffe
[ 2738.745492] RDX: ffffbc1444a09788 RSI: ffff8d7b2825e5a0 RDI: 00000000ffffffff
[ 2738.752733] RBP: ffffbc1444a09788 R08: ffffbc1444a09780 R09: ffff8d7bd80421c8
[ 2738.759973] R10: 0000000000000001 R11: 00000000ffffffff R12: ffff8d7bd80410c0
[ 2738.767379] R13: ffff8d7b2825e5a0 R14: ffffbc1444a09780 R15: ffff8d7bd8042180
[ 2738.774624] FS: 0000000000000000(0000) GS:ffff8d7bd9200000(0000) knlGS:0000000000000000
[ 2738.783005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2738.788867] CR2: ffff8d7c2825e59e CR3: 000000023e212004 CR4: 00000000001706e0
[ 2738.796125] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2738.803467] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2738.810798] Call Trace:
[ 2738.813359] ___cache_free+0x49a/0x6a0
[ 2738.817218] kfree+0xc9/0x1d0
[ 2738.820294] kmem_freepages+0xa0/0xf0
[ 2738.824063] slab_destroy+0x19/0x50
[ 2738.827660] slabs_destroy+0x6d/0x90
[ 2738.831345] ___cache_free+0x4b6/0x6a0
[ 2738.835200] kfree+0xc9/0x1d0
[ 2738.838275] kmem_freepages+0xa0/0xf0
[ 2738.842045] slab_destroy+0x19/0x50
[ 2738.845651] slabs_destroy+0x6d/0x90
[ 2738.849347] ___cache_free+0x4b6/0x6a0
[ 2738.853229] ? sched_clock_cpu+0xc/0xb0
[ 2738.857177] kfree+0xc9/0x1d0
[ 2738.860254] __vunmap+0x1e0/0x230
[ 2738.863680] free_work+0x21/0x30
[ 2738.867035] process_one_work+0x250/0x590
[ 2738.871158] ? worker_thread+0xcf/0x390
[ 2738.875119] ? process_one_work+0x590/0x590
[ 2738.880191] worker_thread+0x3d/0x390
[ 2738.883965] ? process_one_work+0x590/0x590
[ 2738.888256] kthread+0x136/0x150
[ 2738.891681] ? __kthread_queue_delayed_work+0x90/0x90
[ 2738.897105] ret_from_fork+0x22/0x30
[ 2738.900868] CR2: ffff8d7c2825e59e
[ 2738.904297] ---[ end trace 6b0889ea3014c11b ]---



2020-09-15 07:34:29

by Ming Lei

Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

Hi Theodore,

On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> > Worse, right now, -rc1 and -rc2 is causing random crashes in my
> > gce-xfstests framework. Sometimes it happens before we've run even a
> > single xfstests; sometimes it happens after we have successfully
> > completed all of the tests, and we're doing a shutdown of the VM under
> > test. Other times it happens in the middle of a test run. Given that
> > I'm seeing this at -rc1, which is before my late ext4 pull request to
> > Linus, it's probably not an ext4 related bug. But it also means that
> > I'm partially blind in terms of my kernel testing at the moment. So I
> > can't even tell Linus that I've run lots of tests and I'm 100%
> > confident your one-line change is 100% safe.
>
> I was finally able to bisect it down to the commit:
>
> 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

37f4a24c2469 has been reverted in:

4e2f62e566b5 Revert "blk-mq: put driver tag when this request is completed"

The patch was later fixed and re-committed as:

568f27006577 blk-mq: centralise related handling into blk_mq_get_driver_tag

So can you reproduce the issue by running a kernel built from commit 568f27006577?
If yes, is the issue fixed by reverting 568f27006577?

>
> (See below for [1] Bisect log.)
>
> The previous commit allows the tests to run to completion. With
> commit 37f4a24c2469 and later all 11 test scenarios (4k blocks, 1k
> blocks, ext3 compat, ext4 w/ fscrypt, nojournal mode, data=journal,
> bigalloc, etc.) the VM will get stuck.

Can you share the exact mount command line used to set up the environment,
and the exact xfstests test item?



Thanks,
Ming

2020-09-17 02:28:54

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Wed, Sep 16, 2020 at 04:20:26PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Sep 16, 2020 at 07:09:41AM +0800, Ming Lei wrote:
> > > The problem is it's a bit tricky to revert 568f27006577, since there
> > > is a merge conflict in blk_kick_flush(). I attempted to do the bisect
> > > manually here, but it's clearly not right since the kernel is not
> > > booting after the revert:
> > >
> > > https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870
> > >
> > > branch:
> > >
> > > https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577
> > >
> > > Can you send me a patch which correctly reverts 568f27006577? I can
> > > try it against -rc1 .. -rc4, whichever is most convenient.
> >
> > Please test the following revert patch against -rc4.
>
> Unfortunately the results of the revert are... weird.
>
> With -rc4, *all* of the VM's are failing --- reliably. With rc4 with
> the revert, *some* of the VM's are able to complete the tests, but
> over half are still locking up or failing with some kind of oops. So
> that seems to imply that there is some kind of timing issue going on,
> or maybe there are multiple bugs in play?

Obviously there is some other, more serious issue, since 568f27006577 is
completely reverted in your test and you still see the list corruption
issue.

So I'd suggest finding that bigger issue first. Once it is fixed, maybe
everything will become fine.

......

>
> v5.9-rc4 with the revert of 568f27006577: we're seeing a similar
> number of VM failures, but the failure signature is different.
> The most common failure is:
>
> [ 390.023691] ------------[ cut here ]------------
> [ 390.028614] list_del corruption, ffffe1c241b00408->next is LIST_POISON1 (dead000000000100)
> [ 390.037040] WARNING: CPU: 1 PID: 5948 at lib/list_debug.c:47 __list_del_entry_valid+0x4e/0x90
> [ 390.045684] CPU: 1 PID: 5948 Comm: umount Not tainted 5.9.0-rc4-xfstests-00001-g6fdef015feba #11
> [ 390.054581] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [ 390.063943] RIP: 0010:__list_del_entry_valid+0x4e/0x90
> [ 390.069731] Code: 2e 48 8b 32 48 39 fe 75 3a 48 8b 50 08 48 39 f2 75 48 b8 01 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 10 13 12 9b e8 30 2f 8c ff <0f> 0b 31 c0 c3 48 89 fe 48 c7 c7 48 13 12 9b e8 1c 2f 8c ff 0f 0b
> [ 390.088615] RSP: 0018:ffffae95c6ddba28 EFLAGS: 00010082
> [ 390.094079] RAX: 0000000000000000 RBX: ffffce95bfc007d0 RCX: 0000000000000027
> [ 390.101338] RDX: 0000000000000027 RSI: ffffa0c9d93d7dc0 RDI: ffffa0c9d93d7dc8
> [ 390.108659] RBP: ffffe1c241b00408 R08: 0000006ba6bff7dc R09: 0000000000000000
> [ 390.115925] R10: ffffa0c9d3f444c0 R11: 0000000000000046 R12: ffffa0c9d8041180
> [ 390.123186] R13: ffffa0c86c010e00 R14: ffffe1c241b00400 R15: ffffa0c9d8042240
> [ 390.130637] FS: 00007fb227580080(0000) GS:ffffa0c9d9200000(0000) knlGS:0000000000000000
> [ 390.138860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 390.144721] CR2: 00007ff72d2dfe74 CR3: 00000001cdbb8002 CR4: 00000000003706e0
> [ 390.152022] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 390.159314] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 390.166569] Call Trace:
> [ 390.169148] free_block+0xec/0x270
> [ 390.173100] ___cache_free+0x488/0x6b0
> [ 390.177062] kfree+0xc9/0x1d0
> [ 390.181202] kmem_freepages+0xa0/0xf0
> [ 390.185009] slab_destroy+0x19/0x50
> [ 390.188653] slabs_destroy+0x6d/0x90
> [ 390.192339] ___cache_free+0x4a3/0x6b0
> [ 390.196477] kfree+0xc9/0x1d0
> [ 390.199651] kmem_freepages+0xa0/0xf0
> [ 390.203459] slab_destroy+0x19/0x50
> [ 390.207060] slabs_destroy+0x6d/0x90
> [ 390.210784] ___cache_free+0x4a3/0x6b0
> [ 390.214672] ? lockdep_hardirqs_on_prepare+0xe7/0x180
> [ 390.219845] kfree+0xc9/0x1d0
> [ 390.222928] put_crypt_info+0xe3/0x100
> [ 390.226801] fscrypt_put_encryption_info+0x15/0x30
> [ 390.231721] ext4_clear_inode+0x80/0x90
> [ 390.235774] ext4_evict_inode+0x6d/0x630
> [ 390.239960] evict+0xd0/0x1a0
> [ 390.243049] dispose_list+0x51/0x80
> [ 390.246659] evict_inodes+0x15b/0x1b0
> [ 390.250526] generic_shutdown_super+0x37/0x100
> [ 390.255094] kill_block_super+0x21/0x50
> [ 390.259066] deactivate_locked_super+0x2f/0x70
> [ 390.263638] cleanup_mnt+0xb8/0x140
> [ 390.267248] task_work_run+0x73/0xc0
> [ 390.270953] exit_to_user_mode_prepare+0x197/0x1a0
> [ 390.277333] syscall_exit_to_user_mode+0x3c/0x210
> [ 390.282171] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 390.287348] RIP: 0033:0x7fb2279a6507
> [ 390.291128] Code: 19 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 19 0c 00 f7 d8 64 89 01 48
> [ 390.310018] RSP: 002b:00007ffd41391c08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> [ 390.317711] RAX: 0000000000000000 RBX: 000055b889e1f970 RCX: 00007fb2279a6507
> [ 390.324969] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b889e24f00
> [ 390.332300] RBP: 0000000000000000 R08: 000055b889e24f20 R09: 00007fb227a27e80
> [ 390.339543] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b889e24f00
> [ 390.346818] R13: 00007fb227acc1c4 R14: 000055b889e1fa68 R15: 000055b889e1fb80
> [ 390.354075] irq event stamp: 3176310
> [ 390.357762] hardirqs last enabled at (3176309): [<ffffffff9a090df2>] kfree+0x132/0x1d0
> [ 390.365891] hardirqs last disabled at (3176310): [<ffffffff9a090df9>] kfree+0x139/0x1d0
> [ 390.374021] softirqs last enabled at (3174992): [<ffffffff9ac00347>] __do_softirq+0x347/0x45f
> [ 390.382762] softirqs last disabled at (3174535): [<ffffffff9aa00f72>] asm_call_on_stack+0x12/0x20
> [ 390.391742] ---[ end trace 8fb872d4de3e00ed ]---

This looks more like a memory corruption issue; is there any helpful log
dumped when running the kernel with KASAN?

BTW, I have a kvm/qemu automated test which runs blktests/xfstests over
virtio-blk/virtio-scsi/loop/nvme with xfs/ext4 every two days, and I have not
seen such a failure recently, but the kernel config is based on RHEL8's config.


thanks,
Ming

2020-09-24 01:01:08

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 17, 2020 at 10:30:12AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote:
> >
> > Obviously there is other more serious issue, since 568f27006577 is
> > completely reverted in your test, and you still see list corruption
> > issue.
> >
> > So I'd suggest to find the big issue first. Once it is fixed, maybe
> > everything becomes fine.
> > ...
> > Looks it is more like a memory corruption issue, is there any helpful log
> > dumped when running kernel with kasan?
>
> Last night, I ran six VM's using -rc4 with and without KASAN; without
> Kasan, half of them hung. With KASAN enabled, all of the test VM's
> ran to completion.
>
> This strongly suggests whatever the problem is, it's timing related.
> I'll run a larger set of test runs to see if this pattern is confirmed
> today.
>
> > BTW, I have kvm/qumu auto test which runs blktest/xfstest over virtio-blk/virito-scsi/loop/nvme
> > with xfs/ext4 every two days, and not see such failure recently, but the kernel config is based
> > rhel8's config.
>
> Here is the configs I'm using, with and without KASAN. (With KASAN is
> enabled is sent as a diff to avoid running into LKML's e-mail size
> restrictrions.)
>

The list corruption issue can be reproduced on a kvm/qemu guest too when
running xfstests (ext4) generic/038.

However, the issue may stop reproducing when memory debug options are added
or removed, such as enabling KASAN.

When I enable PAGE_POISONING, a double free in the kmalloc-192 cache is captured:

[ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp ffff89ada7584300^M
[ 1198.326651] ------------[ cut here ]------------^M
[ 1198.327969] kernel BUG at mm/slab.c:2535!^M
[ 1198.329129] invalid opcode: 0000 [#1] SMP PTI^M
[ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102^M
[ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M
[ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M
[ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M
[ 1198.347331] RSP: 0018:ffff982e40710be8 EFLAGS: 00010046^M
[ 1198.349091] RAX: 0000000000000048 RBX: ffff89adb6441400 RCX: 0000000000000000^M
[ 1198.351839] RDX: 0000000000000000 RSI: ffff89adbaa97800 RDI: ffff89adbaa97800^M
[ 1198.354572] RBP: ffff89ada7584300 R08: 0000000000000417 R09: 0000000000000057^M
[ 1198.357150] R10: 0000000000000001 R11: ffff982e40710aa5 R12: ffff89adbaaae598^M
[ 1198.359067] R13: ffffe7bc819d6108 R14: ffffe7bc819d6100 R15: ffff89adb6442280^M
[ 1198.360975] FS: 0000000000000000(0000) GS:ffff89adbaa80000(0000) knlGS:0000000000000000^M
[ 1198.363202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[ 1198.365986] CR2: 000055f6a3811318 CR3: 000000017adca005 CR4: 0000000000770ee0^M
[ 1198.368679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
[ 1198.371386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
[ 1198.374203] PKRU: 55555554^M
[ 1198.375174] Call Trace:^M
[ 1198.376165] <IRQ>^M
[ 1198.376908] ___cache_free+0x56d/0x770^M
[ 1198.378355] ? kmem_freepages+0xa0/0xf0^M
[ 1198.379814] kfree+0x91/0x120^M
[ 1198.382121] kmem_freepages+0xa0/0xf0^M
[ 1198.383474] slab_destroy+0x9f/0x120^M
[ 1198.384779] slabs_destroy+0x6d/0x90^M
[ 1198.386110] ___cache_free+0x632/0x770^M
[ 1198.387547] ? kmem_freepages+0xa0/0xf0^M
[ 1198.389016] kfree+0x91/0x120^M
[ 1198.390160] kmem_freepages+0xa0/0xf0^M
[ 1198.391551] slab_destroy+0x9f/0x120^M
[ 1198.392964] slabs_destroy+0x6d/0x90^M
[ 1198.394439] ___cache_free+0x632/0x770^M
[ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M
[ 1198.397791] rcu_core+0x1eb/0x6b0^M
[ 1198.399829] ? ktime_get+0x37/0xa0^M
[ 1198.401343] __do_softirq+0xdf/0x2c5^M
[ 1198.403010] asm_call_on_stack+0x12/0x20^M
[ 1198.404847] </IRQ>^M
[ 1198.405799] do_softirq_own_stack+0x39/0x50^M
[ 1198.407621] irq_exit_rcu+0x97/0xa0^M
[ 1198.409127] sysvec_apic_timer_interrupt+0x2c/0x80^M
[ 1198.410608] asm_sysvec_apic_timer_interrupt+0x12/0x20^M
[ 1198.411883] RIP: 0010:default_idle+0x13/0x20^M
[ 1198.412994] Code: 89 44 24 20 48 83 c0 22 48 89 44 24 28 eb c7 e8 03 93 ff ff cc cc cc 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 11 ec 55 00 fb f4 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 48 8b 04^M
[ 1198.418791] RSP: 0018:ffff982e406abec8 EFLAGS: 00000246^M
[ 1198.420529] RAX: ffffffff9bca6ea0 RBX: 0000000000000001 RCX: ffff89aeafa69648^M
[ 1198.422884] RDX: ffff89adbaaaa9c0 RSI: 0000000000000000 RDI: 00000116f5b68899^M
[ 1198.425283] RBP: 0000000000000001 R08: 0000012531639bb4 R09: 0000000000000000^M
[ 1198.427804] R10: ffff89adb5e1f600 R11: 0000000000000000 R12: 0000000000000000^M
[ 1198.430532] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
[ 1198.433025] ? __sched_text_end+0x3/0x3^M
[ 1198.434264] ? __sched_text_end+0x3/0x3^M
[ 1198.435727] default_idle_call+0x3a/0xe0^M
[ 1198.437250] do_idle+0x1d8/0x1f0^M
[ 1198.438476] cpu_startup_entry+0x19/0x20^M
[ 1198.439973] start_secondary+0x10a/0x150^M
[ 1198.440912] secondary_startup_64+0xb6/0xc0^M
[ 1198.441924] Dumping ftrace buffer:^M
[ 1198.442872] (ftrace buffer empty)^M
[ 1198.443839] ---[ end trace a133037ccc3cf29e ]---^M




Thanks,
Ming

2020-09-24 14:37:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote:
>
> The list corruption issue can be reproduced on kvm/qumu guest too when
> running xfstests(ext4) generic/038.
>
> However, the issue may become not reproduced when adding or removing memory
> debug options, such as adding KASAN.

Can you reliably reproduce this crash? And if so, with what config and
what kernel version?

One of the reasons why I had gone silent on this bug is that I've been
trying many, many configurations, and configurations which reliably
reproduced on, say, -rc4 would not reproduce on -rc5, and then I would
get a completely different set of results on -rc6. So I've been
running a lot of different experiments to try to understand what
might be going on, since it seems pretty clear this must be a very
timing-sensitive failure.

I also found that the recurrence rate went down significantly if I enabled
KASAN, and while the problem didn't go away, I wasn't able to get a KASAN
failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a
*lot* of other debugging configs made the problem vanish in -rc4, but
that trick didn't work with -rc5 or -rc6.

Each time I discovered one of these things, I was about to post to the
e-mail thread, only to have a confirmation test run on a different
kernel version make the problem go away. In particular, your revert
helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5.....

HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
that one of the stack traces had virtio balloon in the trace, I
realized that when I switched the GCE VM type from e1-standard-2 to
n1-standard-2 (where e1 VM's are cheaper because they use
virtio-balloon to better manage host OS memory utilization), the problem
has become much, *much* rarer (and possibly has gone away, although
I'm going to want to run a lot more tests before I say that
conclusively) on my test setup. At the very least, using an n1 VM
(which doesn't have virtio-balloon enabled in the hypervisor) is
enough to unblock ext4 development.

Any chance your kvm/qemu configuration might have been using
virtio-balloon? Because other ext4 developers who have been using
kvm-xfstests have not had any problems....

> When I enable PAGE_POISONING, double free on kmalloc(192) is captured:
>
> [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp ffff89ada7584300^M
> [ 1198.326651] ------------[ cut here ]------------^M
> [ 1198.327969] kernel BUG at mm/slab.c:2535!^M
> [ 1198.329129] invalid opcode: 0000 [#1] SMP PTI^M
> [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102^M
> [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M
> [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M
> [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M
> [ 1198.347331] RSP: 0018:ffff982e40710be8 EFLAGS: 00010046^M
> [ 1198.349091] RAX: 0000000000000048 RBX: ffff89adb6441400 RCX: 0000000000000000^M
> [ 1198.351839] RDX: 0000000000000000 RSI: ffff89adbaa97800 RDI: ffff89adbaa97800^M
> [ 1198.354572] RBP: ffff89ada7584300 R08: 0000000000000417 R09: 0000000000000057^M
> [ 1198.357150] R10: 0000000000000001 R11: ffff982e40710aa5 R12: ffff89adbaaae598^M
> [ 1198.359067] R13: ffffe7bc819d6108 R14: ffffe7bc819d6100 R15: ffff89adb6442280^M
> [ 1198.360975] FS: 0000000000000000(0000) GS:ffff89adbaa80000(0000) knlGS:0000000000000000^M
> [ 1198.363202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 1198.365986] CR2: 000055f6a3811318 CR3: 000000017adca005 CR4: 0000000000770ee0^M
> [ 1198.368679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> [ 1198.371386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> [ 1198.374203] PKRU: 55555554^M
> [ 1198.375174] Call Trace:^M
> [ 1198.376165] <IRQ>^M
> [ 1198.376908] ___cache_free+0x56d/0x770^M
> [ 1198.378355] ? kmem_freepages+0xa0/0xf0^M
> [ 1198.379814] kfree+0x91/0x120^M
> [ 1198.382121] kmem_freepages+0xa0/0xf0^M
> [ 1198.383474] slab_destroy+0x9f/0x120^M
> [ 1198.384779] slabs_destroy+0x6d/0x90^M
> [ 1198.386110] ___cache_free+0x632/0x770^M
> [ 1198.387547] ? kmem_freepages+0xa0/0xf0^M
> [ 1198.389016] kfree+0x91/0x120^M
> [ 1198.390160] kmem_freepages+0xa0/0xf0^M
> [ 1198.391551] slab_destroy+0x9f/0x120^M
> [ 1198.392964] slabs_destroy+0x6d/0x90^M
> [ 1198.394439] ___cache_free+0x632/0x770^M
> [ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M
> [ 1198.397791] rcu_core+0x1eb/0x6b0^M
> [ 1198.399829] ? ktime_get+0x37/0xa0^M
> [ 1198.401343] __do_softirq+0xdf/0x2c5^M
> [ 1198.403010] asm_call_on_stack+0x12/0x20^M
> [ 1198.404847] </IRQ>^M
> [ 1198.405799] do_softirq_own_stack+0x39/0x50^M
> [ 1198.407621] irq_exit_rcu+0x97/0xa0^M
> [ 1198.409127] sysvec_apic_timer_interrupt+0x2c/0x80^M
> [ 1198.410608] asm_sysvec_apic_timer_interrupt+0x12/0x20^M
> [ 1198.411883] RIP: 0010:default_idle+0x13/0x20^M
> [ 1198.412994] Code: 89 44 24 20 48 83 c0 22 48 89 44 24 28 eb c7 e8 03 93 ff ff cc cc cc 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 11 ec 55 00 fb f4 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 48 8b 04^M

Hmm, so this looks like some kind of RCU involvement? Some kind
of object that had been scheduled to be freed via RCU, but then got
freed before the RCU cleanup happened?

The question then is what subsystem the object involved came from.

- Ted

2020-09-25 01:13:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> that one of the stack traces had virtio balloon in the trace, I
> realized that when I switched the GCE VM type from e1-standard-2 to
> n1-standard-2 (where e1 VM's are cheaper because they use
> virtio-balloon to better manage host OS memory utilization), problem
> has become, much, *much* rarer (and possibly has gone away, although
> I'm going to want to run a lot more tests before I say that
> conclusively) on my test setup. At the very least, using an n1 VM
> (which doesn't have virtio-balloon enabled in the hypervisor) is
> enough to unblock ext4 development.

.... and I spoke too soon. A number of runs using -rc6 are now
failing even with the n1-standard-2 VM, so virtio-balloon may not be an
indicator.

This is why debugging this is frustrating; it is very much a heisenbug
--- although 5.8 seems to work completely reliably, as does commits
before 37f4a24c2469. Anything after that point will show random
failures. :-(

- Ted



2020-09-25 01:16:33

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote:
> >
> > The list corruption issue can be reproduced on kvm/qumu guest too when
> > running xfstests(ext4) generic/038.
> >
> > However, the issue may become not reproduced when adding or removing memory
> > debug options, such as adding KASAN.
>
> Can you reliably reprodue this crash? And if so, with what config and
> what kernel version.

Yeah, it can be reproduced reliably by running xfstests (ext4)
generic/038 over virtio-scsi (test device) / virtio-blk (scratch device).

The kernel is -rc4; I have not tested -rc5 yet.

It is exactly the config you provided; I just enabled CDROM & ISOFS on top
of it for passing cloud-init data to the VM via a cdrom.

> One of the reasons why I had gone silent on this bug is that I've been
> trying many, many configurations and configurations which reliably
> reproduced on say, -rc4 would not reproduce on -rc5, and then I would
> get a completely different set of results on -rc6. So I've been
> trying to run a lot of different experiments to try to understand what
> might be going on, since it seems pretty clear this must be a very
> timing-sensitive failure.
>
> I also found that the re-occrance went down significantly if I enabled
> KASAN, and while it didn't go away, I wasn't able to get a KASAN
> failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a
> *lot* of other debugging configs made the problem vanish in -rc4, but
> that trick didn't work with -rc5 or -rc6.

The issue can be reproduced reliably in my environment after
disabling LOCKDEP only or after disabling KMEMLEAK only.

But after disabling DEBUG_OBJECTS, it becomes hard to trigger.

>
> Each time I discovered one of these things, I was about to post to the
> e-mail thread, only to have a confirmation test run on a different
> kernel version make the problem go away. In particular, your revert
> helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5.....
>
> HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> that one of the stack traces had virtio balloon in the trace, I
> realized that when I switched the GCE VM type from e1-standard-2 to
> n1-standard-2 (where e1 VM's are cheaper because they use
> virtio-balloon to better manage host OS memory utilization), problem
> has become, much, *much* rarer (and possibly has gone away, although
> I'm going to want to run a lot more tests before I say that
> conclusively) on my test setup. At the very least, using an n1 VM
> (which doesn't have virtio-balloon enabled in the hypervisor) is
> enough to unblock ext4 development.
>
> Any chance your kvm/qemu configuration might have been using
> virtio-ballon? Because other ext4 developers who have been using
> kvm-xftests have not had any problems....

I don't set up virtio-balloon; see the attached script for setting up the VM.

>
> > When I enable PAGE_POISONING, double free on kmalloc(192) is captured:
> >
> > [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp ffff89ada7584300^M
> > [ 1198.326651] ------------[ cut here ]------------^M
> > [ 1198.327969] kernel BUG at mm/slab.c:2535!^M
> > [ 1198.329129] invalid opcode: 0000 [#1] SMP PTI^M
> > [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102^M
> > [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M
> > [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M
> > [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M
> > [ 1198.347331] RSP: 0018:ffff982e40710be8 EFLAGS: 00010046^M
> > [ 1198.349091] RAX: 0000000000000048 RBX: ffff89adb6441400 RCX: 0000000000000000^M
> > [ 1198.351839] RDX: 0000000000000000 RSI: ffff89adbaa97800 RDI: ffff89adbaa97800^M
> > [ 1198.354572] RBP: ffff89ada7584300 R08: 0000000000000417 R09: 0000000000000057^M
> > [ 1198.357150] R10: 0000000000000001 R11: ffff982e40710aa5 R12: ffff89adbaaae598^M
> > [ 1198.359067] R13: ffffe7bc819d6108 R14: ffffe7bc819d6100 R15: ffff89adb6442280^M
> > [ 1198.360975] FS: 0000000000000000(0000) GS:ffff89adbaa80000(0000) knlGS:0000000000000000^M
> > [ 1198.363202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > [ 1198.365986] CR2: 000055f6a3811318 CR3: 000000017adca005 CR4: 0000000000770ee0^M
> > [ 1198.368679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> > [ 1198.371386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> > [ 1198.374203] PKRU: 55555554^M
> > [ 1198.375174] Call Trace:^M
> > [ 1198.376165] <IRQ>^M
> > [ 1198.376908] ___cache_free+0x56d/0x770^M
> > [ 1198.378355] ? kmem_freepages+0xa0/0xf0^M
> > [ 1198.379814] kfree+0x91/0x120^M
> > [ 1198.382121] kmem_freepages+0xa0/0xf0^M
> > [ 1198.383474] slab_destroy+0x9f/0x120^M
> > [ 1198.384779] slabs_destroy+0x6d/0x90^M
> > [ 1198.386110] ___cache_free+0x632/0x770^M
> > [ 1198.387547] ? kmem_freepages+0xa0/0xf0^M
> > [ 1198.389016] kfree+0x91/0x120^M
> > [ 1198.390160] kmem_freepages+0xa0/0xf0^M
> > [ 1198.391551] slab_destroy+0x9f/0x120^M
> > [ 1198.392964] slabs_destroy+0x6d/0x90^M
> > [ 1198.394439] ___cache_free+0x632/0x770^M
> > [ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M
> > [ 1198.397791] rcu_core+0x1eb/0x6b0^M
> > [ 1198.399829] ? ktime_get+0x37/0xa0^M
> > [ 1198.401343] __do_softirq+0xdf/0x2c5^M
> > [ 1198.403010] asm_call_on_stack+0x12/0x20^M
> > [ 1198.404847] </IRQ>^M
> > [ 1198.405799] do_softirq_own_stack+0x39/0x50^M
> > [ 1198.407621] irq_exit_rcu+0x97/0xa0^M
> > [ 1198.409127] sysvec_apic_timer_interrupt+0x2c/0x80^M
> > [ 1198.410608] asm_sysvec_apic_timer_interrupt+0x12/0x20^M
> > [ 1198.411883] RIP: 0010:default_idle+0x13/0x20^M
> > [ 1198.412994] Code: 89 44 24 20 48 83 c0 22 48 89 44 24 28 eb c7 e8 03 93 ff ff cc cc cc 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 11 ec 55 00 fb f4 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 48 8b 04^M
>
> Hmm, so that this looks like some kind of RCU involvement? Some kind
> of object that had been scheduled to be freed via RCU, but then got
> freed before RCU cleanup happened?
>
> The question then is what subsystem the object involved came from.

Right now I don't have time to investigate further. It shouldn't be
hard to locate where the kmalloc-192 allocation comes from, together with
the RCU involvement.

BTW, the double free is triggered with the following changes to your
config:

- disable lockdep
- disable kmemleak
- enable PAGE_POISONING

BTW, the same issue can be triggered when running the same test on XFS, so
the allocation might be a generic one, maybe in the VFS or SCSI code.


Thanks,
Ming


Attachments:
bb (2.23 kB)

2020-09-25 02:36:50

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 09:14:16AM +0800, Ming Lei wrote:
> On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote:
> > >
> > > The list corruption issue can be reproduced on kvm/qumu guest too when
> > > running xfstests(ext4) generic/038.
> > >
> > > However, the issue may become not reproduced when adding or removing memory
> > > debug options, such as adding KASAN.
> >
> > Can you reliably reprodue this crash? And if so, with what config and
> > what kernel version.
>
> Yeah, it can be reproduced reliably by running xfstests(ext4)
> generic/038 over virtio-scsi(test device)/virtio-blk(scratch device).
>
> The kernel is -rc4, and not test -rc5 yet.
>
> It is exactly the config you provided, and I just enabled CDROM & ISOFS
> against your config for passing cloud-init data via cdrom to VM.
>
> > One of the reasons why I had gone silent on this bug is that I've been
> > trying many, many configurations and configurations which reliably
> > reproduced on say, -rc4 would not reproduce on -rc5, and then I would
> > get a completely different set of results on -rc6. So I've been
> > trying to run a lot of different experiments to try to understand what
> > might be going on, since it seems pretty clear this must be a very
> > timing-sensitive failure.
> >
> > I also found that the re-occrance went down significantly if I enabled
> > KASAN, and while it didn't go away, I wasn't able to get a KASAN
> > failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a
> > *lot* of other debugging configs made the problem vanish in -rc4, but
> > that trick didn't work with -rc5 or -rc6.
>
> The issue can be reproduced reliably in my environment after
> disabling LOCKDEP only for disabling KMEMLEAK only.
>
> But after disabling DEBUG_OBJECTS, it becomes hard to trigger.
>
> >
> > Each time I discovered one of these things, I was about to post to the
> > e-mail thread, only to have a confirmation test run on a different
> > kernel version make the problem go away. In particular, your revert
> > helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5.....
> >
> > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > that one of the stack traces had virtio balloon in the trace, I
> > realized that when I switched the GCE VM type from e1-standard-2 to
> > n1-standard-2 (where e1 VM's are cheaper because they use
> > virtio-balloon to better manage host OS memory utilization), problem
> > has become, much, *much* rarer (and possibly has gone away, although
> > I'm going to want to run a lot more tests before I say that
> > conclusively) on my test setup. At the very least, using an n1 VM
> > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > enough to unblock ext4 development.
> >
> > Any chance your kvm/qemu configuration might have been using
> > virtio-ballon? Because other ext4 developers who have been using
> > kvm-xftests have not had any problems....
>
> I don't setup virtio-ballon, see the attached script for setting the VM.
>
> >
> > > When I enable PAGE_POISONING, double free on kmalloc(192) is captured:
> > >
> > > [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp ffff89ada7584300^M
> > > [ 1198.326651] ------------[ cut here ]------------^M
> > > [ 1198.327969] kernel BUG at mm/slab.c:2535!^M
> > > [ 1198.329129] invalid opcode: 0000 [#1] SMP PTI^M
> > > [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102^M
> > > [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M
> > > [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M
> > > [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M
> > > [ 1198.347331] RSP: 0018:ffff982e40710be8 EFLAGS: 00010046^M
> > > [ 1198.349091] RAX: 0000000000000048 RBX: ffff89adb6441400 RCX: 0000000000000000^M
> > > [ 1198.351839] RDX: 0000000000000000 RSI: ffff89adbaa97800 RDI: ffff89adbaa97800^M
> > > [ 1198.354572] RBP: ffff89ada7584300 R08: 0000000000000417 R09: 0000000000000057^M
> > > [ 1198.357150] R10: 0000000000000001 R11: ffff982e40710aa5 R12: ffff89adbaaae598^M
> > > [ 1198.359067] R13: ffffe7bc819d6108 R14: ffffe7bc819d6100 R15: ffff89adb6442280^M
> > > [ 1198.360975] FS: 0000000000000000(0000) GS:ffff89adbaa80000(0000) knlGS:0000000000000000^M
> > > [ 1198.363202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > > [ 1198.365986] CR2: 000055f6a3811318 CR3: 000000017adca005 CR4: 0000000000770ee0^M
> > > [ 1198.368679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> > > [ 1198.371386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> > > [ 1198.374203] PKRU: 55555554^M
> > > [ 1198.375174] Call Trace:^M
> > > [ 1198.376165] <IRQ>^M
> > > [ 1198.376908] ___cache_free+0x56d/0x770^M
> > > [ 1198.378355] ? kmem_freepages+0xa0/0xf0^M
> > > [ 1198.379814] kfree+0x91/0x120^M
> > > [ 1198.382121] kmem_freepages+0xa0/0xf0^M
> > > [ 1198.383474] slab_destroy+0x9f/0x120^M
> > > [ 1198.384779] slabs_destroy+0x6d/0x90^M
> > > [ 1198.386110] ___cache_free+0x632/0x770^M
> > > [ 1198.387547] ? kmem_freepages+0xa0/0xf0^M
> > > [ 1198.389016] kfree+0x91/0x120^M
> > > [ 1198.390160] kmem_freepages+0xa0/0xf0^M
> > > [ 1198.391551] slab_destroy+0x9f/0x120^M
> > > [ 1198.392964] slabs_destroy+0x6d/0x90^M
> > > [ 1198.394439] ___cache_free+0x632/0x770^M
> > > [ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M
> > > [ 1198.397791] rcu_core+0x1eb/0x6b0^M
> > > [ 1198.399829] ? ktime_get+0x37/0xa0^M
> > > [ 1198.401343] __do_softirq+0xdf/0x2c5^M
> > > [ 1198.403010] asm_call_on_stack+0x12/0x20^M
> > > [ 1198.404847] </IRQ>^M
> > > [ 1198.405799] do_softirq_own_stack+0x39/0x50^M
> > > [ 1198.407621] irq_exit_rcu+0x97/0xa0^M
> > > [ 1198.409127] sysvec_apic_timer_interrupt+0x2c/0x80^M
> > > [ 1198.410608] asm_sysvec_apic_timer_interrupt+0x12/0x20^M
> > > [ 1198.411883] RIP: 0010:default_idle+0x13/0x20^M
> > > [ 1198.412994] Code: 89 44 24 20 48 83 c0 22 48 89 44 24 28 eb c7 e8 03 93 ff ff cc cc cc 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 11 ec 55 00 fb f4 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 48 8b 04^M
> >
> > Hmm, so that this looks like some kind of RCU involvement? Some kind
> > of object that had been scheduled to be freed via RCU, but then got
> > freed before RCU cleanup happened?
> >
> > The question then is what subsystem the object involved came from.
>
> Right now, I don't get time to investigate further. It shouldn't be
> hard to locate where the 192 slab allocation is from, together with RCU
> involved.
>
> BTW, the double free is triggered by the following change on your
> config:
>
> - disable lockdep
> - disable kmemleak
> - enable PAGE_POISONING

Oops, the double free is actually captured by enabling CONFIG_DEBUG_SLAB
instead of PAGE_POISONING; it looks like it is triggered reliably too.

Thanks,
Ming

2020-09-25 07:32:52

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > that one of the stack traces had virtio balloon in the trace, I
> > realized that when I switched the GCE VM type from e1-standard-2 to
> > n1-standard-2 (where e1 VM's are cheaper because they use
> > virtio-balloon to better manage host OS memory utilization), problem
> > has become, much, *much* rarer (and possibly has gone away, although
> > I'm going to want to run a lot more tests before I say that
> > conclusively) on my test setup. At the very least, using an n1 VM
> > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > enough to unblock ext4 development.
>
> .... and I spoke too soon. A number of runs using -rc6 are now
> failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> indicator.
>
> This is why debugging this is frustrating; it is very much a heisenbug
> --- although 5.8 seems to work completely reliably, as does commits
> before 37f4a24c2469. Anything after that point will show random
> failures. :-(

It does not make sense to mention 37f4a24c2469, which was reverted in
4e2f62e566b5. The patch in 37f4a24c2469 was later fixed and re-committed
as 568f27006577.

However, I can _not_ reproduce the issue by running the same test on a
kernel built from 568f27006577 directly.

Also, you have confirmed that the issue is not fixed by reverting
568f27006577 against v5.9-rc4.

So it looks like the real issue (slab list corruption) was introduced
between 568f27006577 and v5.9-rc4.


thanks,
Ming

2020-09-25 16:36:26

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > that one of the stack traces had virtio balloon in the trace, I
> > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > I'm going to want to run a lot more tests before I say that
> > > > conclusively) on my test setup. At the very least, using an n1 VM
> > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > enough to unblock ext4 development.
> > >
> > > .... and I spoke too soon. A number of runs using -rc6 are now
> > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > > indicator.
> > >
> > > This is why debugging this is frustrating; it is very much a heisenbug
> > > --- although 5.8 seems to work completely reliably, as does commits
> > > before 37f4a24c2469. Anything after that point will show random
> > > failures. :-(
> >
> > It does not make sense to mention 37f4a24c2469, which is reverted in
> > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> > as 568f27006577.
> >
> > However, I can _not_ reproduce the issue by running the same test on
> > kernel built from 568f27006577 directly.
> >
> > Also you have confirmed that the issue can't be fixed after reverting
> > 568f27006577 against v5.9-rc4.
> >
> > Looks the real issue(slab list corruption) should be introduced between
> > 568f27006577 and v5.9-rc4.
>
> git bisect shows the first bad commit:
>
> [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> kmem_caches for all allocations
>
> And I have double checked that the above commit is really the first bad
> commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> kernel oops log in the following link:
>
> https://lore.kernel.org/lkml/[email protected]/

The failure signature is similar to
https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/

>
> And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> the following link:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> scratch device is virtio-blk.
>
>
> [1] git bisect log:
>
> git bisect start
> # good: [568f2700657794b8258e72983ba508793a658942] blk-mq: centralise related handling into blk_mq_get_driver_tag
> git bisect good 568f2700657794b8258e72983ba508793a658942
> # bad: [f4d51dffc6c01a9e94650d95ce0104964f8ae822] Linux 5.9-rc4
> git bisect bad f4d51dffc6c01a9e94650d95ce0104964f8ae822
> # good: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
> git bisect good 8186749621ed6b8fc42644c399e8c755a2b6f630
> # good: [60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e] Merge tag 'm68knommu-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
> git bisect good 60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e
> # bad: [57b077939287835b9396a1c3b40d35609cf2fcb8] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> git bisect bad 57b077939287835b9396a1c3b40d35609cf2fcb8
> # bad: [0f43283be7fec4a76cd4ed50dc37db30719bde05] Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
> git bisect bad 0f43283be7fec4a76cd4ed50dc37db30719bde05
> # good: [5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7] Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect good 5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7
> # good: [e3083c3f369700cd1eec3de93b8d8ec0918ff778] media: cafe-driver: use generic power management
> git bisect good e3083c3f369700cd1eec3de93b8d8ec0918ff778
> # bad: [42742d9bde2a8e11ec932cb5821f720a40a7c2a9] mm: thp: replace HTTP links with HTTPS ones
> git bisect bad 42742d9bde2a8e11ec932cb5821f720a40a7c2a9
> # bad: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations
> git bisect bad 10befea91b61c4e2c2d1df06a2e978d182fcf792
> # good: [cfbe1636c3585c1e032bfac512fb8be903fbc913] mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS"
> git bisect good cfbe1636c3585c1e032bfac512fb8be903fbc913
> # good: [0f190a7ab78878f9e6c6930fc0f5f92c1250b57d] mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io
> git bisect good 0f190a7ab78878f9e6c6930fc0f5f92c1250b57d
> # good: [286e04b8ed7a04279ae277f0f024430246ea5eec] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
> git bisect good 286e04b8ed7a04279ae277f0f024430246ea5eec
> # good: [9855609bde03e2472b99a95e869d29ee1e78a751] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
> git bisect good 9855609bde03e2472b99a95e869d29ee1e78a751
> # good: [272911a4ad18c48f8bc449a5db945a54987dd687] mm: memcg/slab: remove memcg_kmem_get_cache()
> git bisect good 272911a4ad18c48f8bc449a5db945a54987dd687
> # good: [15999eef7f25e2ea6a1c33f026166f472c5714e9] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
> git bisect good 15999eef7f25e2ea6a1c33f026166f472c5714e9
> # first bad commit: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations
>
>
>
> Thanks,
> Ming
>

2020-09-25 17:19:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
>
> git bisect shows the first bad commit:
>
> [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> kmem_caches for all allocations
>
> And I have double checked that the above commit is really the first bad
> commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> is LIST_POISON1 (dead000000000100)',

That commit doesn't revert cleanly, but I think that's purely because
we'd also need to revert

849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
(un)account_slab_page()")

too.

Can you verify that a

git revert 74d555bed5d0 849504809f86 10befea91b61

on top of current -git makes things work for you again?

I'm going to do an rc8 this release simply because we have another VM
issue that I hope to get fixed - but there we know what the problem
and the fix _is_, it just needs some care.

So if Roman (or somebody else) can see what's wrong and we can fix
this quickly, we don't need to go down the revert path, but ..

Linus

2020-09-25 17:23:12

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
> >
> > git bisect shows the first bad commit:
> >
> > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > kmem_caches for all allocations
> >
> > And I have double checked that the above commit is really the first bad
> > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > is LIST_POISON1 (dead000000000100)',
>
> Thet commit doesn't revert cleanly, but I think that's purely because
> we'd also need to revert
>
> 849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> (un)account_slab_page()")
>
> too.
>
> Can you verify that a
>
> git revert 74d555bed5d0 849504809f86 10befea91b61
>
> on top of current -git makes things work for you again?
>
> I'm going to do an rc8 this release simply because we have another VM
> issue that I hope to get fixed - but there we know what the problem
> and the fix _is_, it just needs some care.
>
> So if Roman (or somebody else) can see what's wrong and we can fix
> this quickly, we don't need to go down the revert path, but ..
>

I think I have a theory. The issue is happening due to the potential
infinite recursion:

[ 5060.124412] ___cache_free+0x488/0x6b0
*****Second recursion
[ 5060.128666] kfree+0xc9/0x1d0
[ 5060.131947] kmem_freepages+0xa0/0xf0
[ 5060.135746] slab_destroy+0x19/0x50
[ 5060.139577] slabs_destroy+0x6d/0x90
[ 5060.143379] ___cache_free+0x4a3/0x6b0
*****First recursion
[ 5060.147896] kfree+0xc9/0x1d0
[ 5060.151082] kmem_freepages+0xa0/0xf0
[ 5060.155121] slab_destroy+0x19/0x50
[ 5060.159028] slabs_destroy+0x6d/0x90
[ 5060.162920] ___cache_free+0x4a3/0x6b0
[ 5060.167097] kfree+0xc9/0x1d0

___cache_free() is calling cache_flusharray() to flush the local cpu
array_cache if the cache has more elements than the limit (ac->avail
>= ac->limit).

cache_flusharray() removes batchcount elements from the local cpu
array_cache and passes them to slabs_destroy() (if the node's shared cache
is also full).

Note that we have not updated the local cpu array_cache size yet when
slabs_destroy() is called, and slabs_destroy() can call kfree() through
unaccount_slab_page().

We are on the same CPU, and this recursive kfree() again sees
(ac->avail >= ac->limit), calls cache_flusharray() again, and recurses
indefinitely.
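
For illustration, here is a minimal userspace model of that recursion. This
is not kernel code -- the struct and function names only mirror mm/slab.c,
and the numbers are made up -- it just shows why a nested free that lands
back in the same cache keeps re-entering the flush path until something
breaks the cycle:

#include <stdbool.h>
#include <stdio.h>

struct array_cache {
        int avail, limit, batchcount;
        bool flushing;                  /* reentrancy guard for the model */
};

static void cache_free(struct array_cache *ac, int depth);

static void cache_flusharray(struct array_cache *ac, int depth)
{
        if (ac->flushing)               /* drop this check to see the endless recursion */
                return;
        ac->flushing = true;

        /* models slabs_destroy() -> unaccount_slab_page() -> kfree() of the
         * per-page obj_cgroups vector, which lands back in the same cache */
        cache_free(ac, depth + 1);

        ac->avail -= ac->batchcount;    /* only shrunk after the nested free */
        ac->flushing = false;
}

static void cache_free(struct array_cache *ac, int depth)
{
        printf("free at depth %d, avail=%d\n", depth, ac->avail);
        ac->avail++;
        if (ac->avail >= ac->limit)     /* still true during the nested free */
                cache_flusharray(ac, depth);
}

int main(void)
{
        struct array_cache ac = { .avail = 3, .limit = 4, .batchcount = 2 };

        cache_free(&ac, 0);             /* crosses the limit, triggers one flush */
        return 0;
}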

2020-09-25 17:25:07

by Roman Gushchin

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 09:47:43AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > > > that one of the stack traces had virtio balloon in the trace, I
> > > > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > > > I'm going to want to run a lot more tests before I say that
> > > > > > conclusively) on my test setup. At the very least, using an n1 VM
> > > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > > > enough to unblock ext4 development.
> > > > >
> > > > > .... and I spoke too soon. A number of runs using -rc6 are now
> > > > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > > > > indicator.
> > > > >
> > > > > This is why debugging this is frustrating; it is very much a heisenbug
> > > > > --- although 5.8 seems to work completely reliably, as does commits
> > > > > before 37f4a24c2469. Anything after that point will show random
> > > > > failures. :-(
> > > >
> > > > It does not make sense to mention 37f4a24c2469, which is reverted in
> > > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> > > > as 568f27006577.
> > > >
> > > > However, I can _not_ reproduce the issue by running the same test on
> > > > kernel built from 568f27006577 directly.
> > > >
> > > > Also you have confirmed that the issue can't be fixed after reverting
> > > > 568f27006577 against v5.9-rc4.
> > > >
> > > > Looks the real issue(slab list corruption) should be introduced between
> > > > 568f27006577 and v5.9-rc4.
> > >
> > > git bisect shows the first bad commit:
> > >
> > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> > > kernel oops log in the following link:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > The failure signature is similar to
> > https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/
> >
> > >
> > > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> > > the following link:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> > > scratch device is virtio-blk.
>
> Is it possible to check SLUB as well to confirm that the issue is only
> happening on SLAB?

Can you also, please, check whether passing cgroup.memory=nokmem as a boot
argument fixes the issue?

Thanks!

2020-09-25 17:36:29

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
> > >
> > > git bisect shows the first bad commit:
> > >
> > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > is LIST_POISON1 (dead000000000100)',
> >
> > Thet commit doesn't revert cleanly, but I think that's purely because
> > we'd also need to revert
> >
> > 849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > (un)account_slab_page()")
> >
> > too.
> >
> > Can you verify that a
> >
> > git revert 74d555bed5d0 849504809f86 10befea91b61
> >
> > on top of current -git makes things work for you again?
> >
> > I'm going to do an rc8 this release simply because we have another VM
> > issue that I hope to get fixed - but there we know what the problem
> > and the fix _is_, it just needs some care.
> >
> > So if Roman (or somebody else) can see what's wrong and we can fix
> > this quickly, we don't need to go down the revert path, but ..
> >
>
> I think I have a theory. The issue is happening due to the potential
> infinite recursion:
>
> [ 5060.124412] ___cache_free+0x488/0x6b0
> *****Second recursion
> [ 5060.128666] kfree+0xc9/0x1d0
> [ 5060.131947] kmem_freepages+0xa0/0xf0
> [ 5060.135746] slab_destroy+0x19/0x50
> [ 5060.139577] slabs_destroy+0x6d/0x90
> [ 5060.143379] ___cache_free+0x4a3/0x6b0
> *****First recursion
> [ 5060.147896] kfree+0xc9/0x1d0
> [ 5060.151082] kmem_freepages+0xa0/0xf0
> [ 5060.155121] slab_destroy+0x19/0x50
> [ 5060.159028] slabs_destroy+0x6d/0x90
> [ 5060.162920] ___cache_free+0x4a3/0x6b0
> [ 5060.167097] kfree+0xc9/0x1d0
>
> ___cache_free() is calling cache_flusharray() to flush the local cpu
> array_cache if the cache has more elements than the limit (ac->avail
> >= ac->limit).
>
> cache_flusharray() is removing batchcount number of element from local
> cpu array_cache and pass it slabs_destroy (if the node shared cache is
> also full).
>
> Note that we have not updated local cpu array_cache size yet and
> called slabs_destroy() which can call kfree() through
> unaccount_slab_page().
>
> We are on the same CPU and this recursive kfree again check the
> (ac->avail >= ac->limit) and call cache_flusharray() again and recurse
> indefinitely.

I can see two possible fixes. We can either do an async kfree of
page_obj_cgroups(page), or we can update the local cpu array_cache's
size before slabs_destroy().
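
For what it's worth, the second option can be sketched in the same kind of
userspace toy model (again, not kernel code; the names only mirror mm/slab.c
and the numbers are invented): once the local array is shrunk before the
slabs are destroyed, the nested kfree() no longer sees
ac->avail >= ac->limit and the recursion stops after one level:

#include <stdio.h>

struct array_cache {
        int avail, limit, batchcount;
};

static void cache_free(struct array_cache *ac, int depth);

static void cache_flusharray(struct array_cache *ac, int depth)
{
        /* shrink the local cache *before* destroying the slabs, so the
         * nested free below sees a consistent, below-limit ac->avail */
        ac->avail -= ac->batchcount;

        /* models slabs_destroy() -> unaccount_slab_page() -> kfree() */
        cache_free(ac, depth + 1);
}

static void cache_free(struct array_cache *ac, int depth)
{
        printf("free at depth %d, avail=%d\n", depth, ac->avail);
        ac->avail++;
        if (ac->avail >= ac->limit)
                cache_flusharray(ac, depth);
}

int main(void)
{
        struct array_cache ac = { .avail = 3, .limit = 4, .batchcount = 2 };

        cache_free(&ac, 0);
        return 0;
}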

Shakeel

2020-09-25 17:49:23

by Roman Gushchin

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
> > > >
> > > > git bisect shows the first bad commit:
> > > >
> > > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > > kmem_caches for all allocations
> > > >
> > > > And I have double checked that the above commit is really the first bad
> > > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > > is LIST_POISON1 (dead000000000100)',
> > >
> > > Thet commit doesn't revert cleanly, but I think that's purely because
> > > we'd also need to revert
> > >
> > > 849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > > 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > > (un)account_slab_page()")
> > >
> > > too.
> > >
> > > Can you verify that a
> > >
> > > git revert 74d555bed5d0 849504809f86 10befea91b61
> > >
> > > on top of current -git makes things work for you again?
> > >
> > > I'm going to do an rc8 this release simply because we have another VM
> > > issue that I hope to get fixed - but there we know what the problem
> > > and the fix _is_, it just needs some care.
> > >
> > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > this quickly, we don't need to go down the revert path, but ..
> > >
> >
> > I think I have a theory. The issue is happening due to the potential
> > infinite recursion:
> >
> > [ 5060.124412] ___cache_free+0x488/0x6b0
> > *****Second recursion
> > [ 5060.128666] kfree+0xc9/0x1d0
> > [ 5060.131947] kmem_freepages+0xa0/0xf0
> > [ 5060.135746] slab_destroy+0x19/0x50
> > [ 5060.139577] slabs_destroy+0x6d/0x90
> > [ 5060.143379] ___cache_free+0x4a3/0x6b0
> > *****First recursion
> > [ 5060.147896] kfree+0xc9/0x1d0
> > [ 5060.151082] kmem_freepages+0xa0/0xf0
> > [ 5060.155121] slab_destroy+0x19/0x50
> > [ 5060.159028] slabs_destroy+0x6d/0x90
> > [ 5060.162920] ___cache_free+0x4a3/0x6b0
> > [ 5060.167097] kfree+0xc9/0x1d0
> >
> > ___cache_free() is calling cache_flusharray() to flush the local cpu
> > array_cache if the cache has more elements than the limit (ac->avail
> > >= ac->limit).
> >
> > cache_flusharray() is removing batchcount number of element from local
> > cpu array_cache and pass it slabs_destroy (if the node shared cache is
> > also full).
> >
> > Note that we have not updated local cpu array_cache size yet and
> > called slabs_destroy() which can call kfree() through
> > unaccount_slab_page().
> >
> > We are on the same CPU and this recursive kfree again check the
> > (ac->avail >= ac->limit) and call cache_flusharray() again and recurse
> > indefinitely.

It's a cool theory! And it explains why we haven't seen it with SLUB.

>
> I can see two possible fixes. We can either do async kfree of
> page_obj_cgroups(page) or we can update the local cpu array_cache's
> size before slabs_destroy().

I wonder if something like this can fix the problem?
(completely untested).

--

diff --git a/mm/slab.c b/mm/slab.c
index 684ebe5b0c7a..c94b9ccfb803 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -186,6 +186,7 @@ struct array_cache {
unsigned int limit;
unsigned int batchcount;
unsigned int touched;
+ bool flushing;
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
@@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
ac->limit = limit;
ac->batchcount = batch;
ac->touched = 0;
+ ac->flushing = false;
}
}

@@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
int node = numa_mem_id();
LIST_HEAD(list);

+ if (ac->flushing)
+ return;
+
+ ac->flushing = true;
+
batchcount = ac->batchcount;

check_irq_off();
@@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
spin_unlock(&n->list_lock);
slabs_destroy(cachep, &list);
ac->avail -= batchcount;
+ ac->flushing = false;
memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
}

2020-09-25 18:00:15

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 10:48 AM Roman Gushchin <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <[email protected]> wrote:
> > > > >
> > > > > git bisect shows the first bad commit:
> > > > >
> > > > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > > > kmem_caches for all allocations
> > > > >
> > > > > And I have double checked that the above commit is really the first bad
> > > > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > > > is LIST_POISON1 (dead000000000100)',
> > > >
> > > > That commit doesn't revert cleanly, but I think that's purely because
> > > > we'd also need to revert
> > > >
> > > > 849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > > > 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > > > (un)account_slab_page()")
> > > >
> > > > too.
> > > >
> > > > Can you verify that a
> > > >
> > > > git revert 74d555bed5d0 849504809f86 10befea91b61
> > > >
> > > > on top of current -git makes things work for you again?
> > > >
> > > > I'm going to do an rc8 this release simply because we have another VM
> > > > issue that I hope to get fixed - but there we know what the problem
> > > > and the fix _are_, it just needs some care.
> > > >
> > > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > > this quickly, we don't need to go down the revert path, but ..
> > > >
> > >
> > > I think I have a theory. The issue is happening due to the potential
> > > infinite recursion:
> > >
> > > [ 5060.124412] ___cache_free+0x488/0x6b0
> > > *****Second recursion
> > > [ 5060.128666] kfree+0xc9/0x1d0
> > > [ 5060.131947] kmem_freepages+0xa0/0xf0
> > > [ 5060.135746] slab_destroy+0x19/0x50
> > > [ 5060.139577] slabs_destroy+0x6d/0x90
> > > [ 5060.143379] ___cache_free+0x4a3/0x6b0
> > > *****First recursion
> > > [ 5060.147896] kfree+0xc9/0x1d0
> > > [ 5060.151082] kmem_freepages+0xa0/0xf0
> > > [ 5060.155121] slab_destroy+0x19/0x50
> > > [ 5060.159028] slabs_destroy+0x6d/0x90
> > > [ 5060.162920] ___cache_free+0x4a3/0x6b0
> > > [ 5060.167097] kfree+0xc9/0x1d0
> > >
> > > ___cache_free() calls cache_flusharray() to flush the local cpu
> > > array_cache if the cache has reached its limit (ac->avail >=
> > > ac->limit).
> > >
> > > cache_flusharray() removes batchcount elements from the local cpu
> > > array_cache and passes them to slabs_destroy() (if the node's shared
> > > cache is also full).
> > >
> > > Note that we have not yet updated the local cpu array_cache's size
> > > when we call slabs_destroy(), which can call kfree() through
> > > unaccount_slab_page().
> > >
> > > We are on the same CPU, so this recursive kfree() again sees
> > > (ac->avail >= ac->limit), calls cache_flusharray() again, and
> > > recurses indefinitely.
>
> It's a cool theory! And it explains why we haven't seen it with SLUB.
>
> >
> > I can see two possible fixes. We can either do async kfree of
> > page_obj_cgroups(page) or we can update the local cpu array_cache's
> > size before slabs_destroy().
>
> I wonder if something like this can fix the problem?
> (completely untested).
>
> --
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 684ebe5b0c7a..c94b9ccfb803 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -186,6 +186,7 @@ struct array_cache {
> unsigned int limit;
> unsigned int batchcount;
> unsigned int touched;
> + bool flushing;
> void *entry[]; /*
> * Must have this definition in here for the proper
> * alignment of array_cache. Also simplifies accessing
> @@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
> ac->limit = limit;
> ac->batchcount = batch;
> ac->touched = 0;
> + ac->flushing = false;
> }
> }
>
> @@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> int node = numa_mem_id();
> LIST_HEAD(list);
>
> + if (ac->flushing)
> + return;
> +
> + ac->flushing = true;
> +
> batchcount = ac->batchcount;
>
> check_irq_off();
> @@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> spin_unlock(&n->list_lock);
> slabs_destroy(cachep, &list);
> ac->avail -= batchcount;
> + ac->flushing = false;
> memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> }
>

I don't think you can ignore the flushing. The __free_one() in
___cache_free() assumes there is space available.

BTW do_drain() also has the same issue.

Why not move slabs_destroy() after we update ac->avail and memmove()?
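
A toy version of that reordering, illustrative only (the names and
numbers are made up and only loosely mirror mm/slab.c): with ac->avail
updated and the surviving pointers shifted before the pages are
destroyed, the nested free finds an array_cache that already has room,
so it neither overflows the entry array nor flushes again.

#include <stdio.h>
#include <string.h>

struct toy_ac {
	unsigned int avail, limit, batchcount;
	void *entry[8];
};

static void toy_flusharray(struct toy_ac *ac);

static void toy_kfree(struct toy_ac *ac, void *objp)
{
	if (ac->avail >= ac->limit)
		toy_flusharray(ac);
	ac->entry[ac->avail++] = objp;
}

/* stand-in for unaccount_slab_page() -> kfree(page_obj_cgroups(page)) */
static void toy_slabs_destroy(struct toy_ac *ac)
{
	toy_kfree(ac, NULL);
}

static void toy_flusharray(struct toy_ac *ac)
{
	unsigned int batchcount = ac->batchcount;

	/* reordered: shrink the cache and shift the survivors first ... */
	ac->avail -= batchcount;
	memmove(ac->entry, &ac->entry[batchcount],
		sizeof(void *) * ac->avail);

	/* ... then destroy pages; the nested free now has a free slot */
	toy_slabs_destroy(ac);
}

int main(void)
{
	struct toy_ac ac = { .avail = 4, .limit = 4, .batchcount = 2 };

	toy_kfree(&ac, NULL);
	printf("avail = %u, no unbounded recursion\n", ac.avail);
	return 0;
}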

2020-09-25 20:41:15

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <[email protected]>
wrote:
>
[snip]
>
> I don't think you can ignore the flushing. The __free_one() in
> ___cache_free() assumes there is space available.
>
> BTW do_drain() also has the same issue.
>
> Why not move slabs_destroy() after we update ac->avail and memmove()?

Ming, can you please try the following patch?


From: Shakeel Butt <[email protected]>

[PATCH] mm: slab: fix potential infinite recursion in ___cache_free

With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it becomes possible to call kfree()
from slabs_destroy(). However, if slabs_destroy() is being called for
the array_cache of the local CPU, then this opens the potential scenario
of infinite recursion, because kfree() called from slabs_destroy() can
call slabs_destroy() again with the same array_cache of the local CPU.
Since the array_cache of the local CPU is not updated before calling
slabs_destroy(), it will try to free the same pages.

To fix the issue, simply update the cache before calling
slabs_destroy().

Signed-off-by: Shakeel Butt <[email protected]>
---
mm/slab.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 3160dff6fd76..f658e86ec8ce 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
kmem_cache_free(cachep->freelist_cache, freelist);
}

+/*
+ * Update the size of the caches before calling slabs_destroy as it may
+ * recursively call kfree.
+ */
static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
{
struct page *page, *n;
@@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
spin_lock(&n->list_lock);
free_block(cachep, ac->entry, ac->avail, node, &list);
spin_unlock(&n->list_lock);
- slabs_destroy(cachep, &list);
ac->avail = 0;
+ slabs_destroy(cachep, &list);
}

static void drain_cpu_caches(struct kmem_cache *cachep)
@@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
}
#endif
spin_unlock(&n->list_lock);
- slabs_destroy(cachep, &list);
ac->avail -= batchcount;
memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
+ slabs_destroy(cachep, &list);
}

/*
--
2.28.0.681.g6f77f65b4e-goog
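
For readability, after these two hunks the tails of do_drain() and
cache_flusharray() would read roughly as follows (reconstructed from the
diff context above, so the surrounding lines are approximate):

	/* do_drain() */
	spin_lock(&n->list_lock);
	free_block(cachep, ac->entry, ac->avail, node, &list);
	spin_unlock(&n->list_lock);
	ac->avail = 0;
	slabs_destroy(cachep, &list);
}

	/* cache_flusharray() */
	spin_unlock(&n->list_lock);
	ac->avail -= batchcount;
	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
	slabs_destroy(cachep, &list);
}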

2020-09-25 20:57:14

by Roman Gushchin

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <[email protected]>
> wrote:
> >
> [snip]
> >
> > I don't think you can ignore the flushing. The __free_one() in
> > ___cache_free() assumes there is space available.
> >
> > BTW do_drain() also has the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
>
> Ming, can you please try the following patch?
>
>
> From: Shakeel Butt <[email protected]>
>
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
>
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from the slabs_destroy(). However if slabs_destroy() is being called for
> the array_cache of the local CPU then this opens the potential scenario
> of infinite recursion because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
>
> To fix the issue, simply update the cache before calling
> slabs_destroy().
>
> Signed-off-by: Shakeel Butt <[email protected]>

I like the patch and I think it should fix the problem.

However, the description above should likely be adjusted a bit.
It seems that the problem is not necessarily caused by an infinite
recursion; it can be even simpler.

In cache_flusharray() we rely on the state of ac, which is described
by ac->avail. In particular, we rely on batchcount < ac->avail, as we
shift the remaining pointers down by batchcount slots with memmove().
But if slabs_destroy() is called before that and changes the state of
ac, it can lead to memory corruption.

Also, unconditionally resetting ac->avail to 0 in do_drain() after
calling slabs_destroy() seems to be wrong.
It explains the double-free BUGs we've seen in the stacktraces.
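
As a toy walk-through of that kind of stale-state problem (hypothetical
object ids, purely illustrative and only loosely modelled on mm/slab.c):
if a nested flush runs against the same array_cache before ac->avail and
ac->entry[] are updated, the first batch is handed to free_block() twice,
and the survivors the nested flush had already shifted down are silently
dropped by the outer epilogue.

#include <stdio.h>
#include <string.h>

#define NOBJ 5

static int freed[NOBJ];			/* times each object id was freed */

struct toy_ac {
	int avail, limit, batchcount;
	int entry[8];			/* object ids instead of pointers */
};

static void toy_flusharray(struct toy_ac *ac, int nested);

static void toy_kfree(struct toy_ac *ac, int obj)
{
	if (ac->avail >= ac->limit)
		toy_flusharray(ac, 1);
	ac->entry[ac->avail++] = obj;
}

static void toy_free_block(struct toy_ac *ac, int nr)
{
	for (int i = 0; i < nr; i++)
		freed[ac->entry[i]]++;
}

static void toy_flusharray(struct toy_ac *ac, int nested)
{
	int batchcount = ac->batchcount;

	toy_free_block(ac, batchcount);	/* hand the first batch back */

	/* old ordering: destroy pages (which may kfree and re-enter)
	 * before ac->avail and ac->entry[] are updated; the nested
	 * level is cut short here only to keep the demo small */
	if (!nested)
		toy_kfree(ac, 4);	/* kfree(page_obj_cgroups(page)) */

	ac->avail -= batchcount;
	memmove(ac->entry, &ac->entry[batchcount],
		sizeof(ac->entry[0]) * ac->avail);
}

int main(void)
{
	struct toy_ac ac = {
		.avail = 4, .limit = 4, .batchcount = 2,
		.entry = { 0, 1, 2, 3 },
	};

	toy_flusharray(&ac, 0);		/* what ___cache_free() would do */

	for (int i = 0; i < NOBJ; i++)
		if (freed[i] > 1)
			printf("object %d was freed %d times\n", i, freed[i]);
	return 0;
}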

Thanks!

> ---
> mm/slab.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> kmem_cache_free(cachep->freelist_cache, freelist);
> }
>
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
> static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
> {
> struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
> spin_lock(&n->list_lock);
> free_block(cachep, ac->entry, ac->avail, node, &list);
> spin_unlock(&n->list_lock);
> - slabs_destroy(cachep, &list);
> ac->avail = 0;
> + slabs_destroy(cachep, &list);
> }
>
> static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> }
> #endif
> spin_unlock(&n->list_lock);
> - slabs_destroy(cachep, &list);
> ac->avail -= batchcount;
> memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> + slabs_destroy(cachep, &list);
> }
>
> /*
> --
> 2.28.0.681.g6f77f65b4e-goog
>

2020-09-25 21:22:55

by Shakeel Butt

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 1:56 PM Roman Gushchin <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <[email protected]>
> > wrote:
> > >
> > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_one() in
> > > ___cache_free() assumes there is space available.
> > >
> > > BTW do_drain() also has the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> >
> > Ming, can you please try the following patch?
> >
> >
> > From: Shakeel Butt <[email protected]>
> >
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> >
> > With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from the slabs_destroy(). However if slabs_destroy() is being called for
> > the array_cache of the local CPU then this opens the potential scenario
> > of infinite recursion because kfree() called from slabs_destroy() can
> > call slabs_destroy() with the same array_cache of the local CPU. Since
> > the array_cache of the local CPU is not updated before calling
> > slabs_destroy(), it will try to free the same pages.
> >
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
>
> I like the patch and I think it should fix the problem.
>
> However the description above should be likely asjusted a bit.
> It seems that the problem is not necessary caused by an infinite recursion,
> it can be even simpler.
>
> In cache_flusharray() we rely on the state of ac, which is described
> by ac->avail. In particular we rely on batchcount < ac->avail,
> as we shift the batchcount number of pointers by memmove.
> But if slabs_destroy() is called before and leaded to a change of the
> ac state, it can lead to a memory corruption.
>
> Also, unconditionally resetting ac->avail to 0 in do_drain() after calling
> to slab_destroy() seems to be wrong.
> It explains double free BUGs we've seen in stacktraces.
>

Yes, you are right. Let's first get this patch tested and after
confirmation we can update the commit message.

2020-09-26 01:44:26

by Ming Lei

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <[email protected]>
> wrote:
> >
> [snip]
> >
> > I don't think you can ignore the flushing. The __free_one() in
> > ___cache_free() assumes there is space available.
> >
> > BTW do_drain() also has the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
>
> Ming, can you please try the following patch?
>
>
> From: Shakeel Butt <[email protected]>
>
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
>
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from the slabs_destroy(). However if slabs_destroy() is being called for
> the array_cache of the local CPU then this opens the potential scenario
> of infinite recursion because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
>
> To fix the issue, simply update the cache before calling
> slabs_destroy().
>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> mm/slab.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> kmem_cache_free(cachep->freelist_cache, freelist);
> }
>
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
> static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
> {
> struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
> spin_lock(&n->list_lock);
> free_block(cachep, ac->entry, ac->avail, node, &list);
> spin_unlock(&n->list_lock);
> - slabs_destroy(cachep, &list);
> ac->avail = 0;
> + slabs_destroy(cachep, &list);
> }
>
> static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> }
> #endif
> spin_unlock(&n->list_lock);
> - slabs_destroy(cachep, &list);
> ac->avail -= batchcount;
> memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> + slabs_destroy(cachep, &list);
> }

The issue can't be reproduced after applying this patch:

Tested-by: Ming Lei <[email protected]>

Thanks,
Ming

2020-09-26 06:43:42

by Roman Gushchin

[permalink] [raw]
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

On Sat, Sep 26, 2020 at 09:43:25AM +0800, Ming Lei wrote:
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <[email protected]>
> > wrote:
> > >
> > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_one() in
> > > ___cache_free() assumes there is space available.
> > >
> > > BTW do_drain() also has the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> >
> > Ming, can you please try the following patch?
> >
> >
> > From: Shakeel Butt <[email protected]>
> >
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> >
> > With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from the slabs_destroy(). However if slabs_destroy() is being called for
> > the array_cache of the local CPU then this opens the potential scenario
> > of infinite recursion because kfree() called from slabs_destroy() can
> > call slabs_destroy() with the same array_cache of the local CPU. Since
> > the array_cache of the local CPU is not updated before calling
> > slabs_destroy(), it will try to free the same pages.
> >
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
> > ---
> > mm/slab.c | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 3160dff6fd76..f658e86ec8ce 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> > kmem_cache_free(cachep->freelist_cache, freelist);
> > }
> >
> > +/*
> > + * Update the size of the caches before calling slabs_destroy as it may
> > + * recursively call kfree.
> > + */
> > static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
> > {
> > struct page *page, *n;
> > @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
> > spin_lock(&n->list_lock);
> > free_block(cachep, ac->entry, ac->avail, node, &list);
> > spin_unlock(&n->list_lock);
> > - slabs_destroy(cachep, &list);
> > ac->avail = 0;
> > + slabs_destroy(cachep, &list);
> > }
> >
> > static void drain_cpu_caches(struct kmem_cache *cachep)
> > @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> > }
> > #endif
> > spin_unlock(&n->list_lock);
> > - slabs_destroy(cachep, &list);
> > ac->avail -= batchcount;
> > memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> > + slabs_destroy(cachep, &list);
> > }
>
> The issue can't be reproduced after applying this patch:
>
> Tested-by: Ming Lei <[email protected]>

Perfect, thank you very much for the confirmation!

Shakeel, can you please resend the patch with the proper Fixes tag
and the updated commit log? Please feel free to add
Reviewed-by: Roman Gushchin <[email protected]>.

Thank you!

Roman