Hello,
Starting with kernel 5.15, over the past eight months we have had a
total of 12 kernel panics across a fleet of 1000 KVM/QEMU machines,
which look like this:
    kernel BUG at fs/ext4/inode.c:1914
Switching from kernel 4.14 to 5.15 triggered the problem almost
immediately, so we are confident that userland activity is more or
less the same and is not the root cause. The kernel function which
triggers the BUG is __ext4_journalled_writepage(). In 5.15 the code for
__ext4_journalled_writepage() in "fs/ext4/inode.c" is the same as the
current kernel "master". The line where the BUG is triggered is:
    struct buffer_head *page_bufs = page_buffers(page);
The definition of "page_buffers(page)" in "include/linux/buffer_head.h"
hasn't changed since 4.14, so no difference here. This is where the
actual "kernel BUG" event is triggered:
    /* If we *know* page->private refers to buffer_heads */
    #define page_buffers(page)                                      \
            ({                                                      \
                    BUG_ON(!PagePrivate(page));                     \
                    ((struct buffer_head *)page_private(page));     \
            })
    #define page_has_buffers(page)  PagePrivate(page)
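The BUG fires because page_buffers() is called without a prior
page_has_buffers() check. For context, a typical guarded caller looks
roughly like this (just a sketch, not the actual ext4 code):

    if (page_has_buffers(page)) {
            struct buffer_head *head = page_buffers(page);
            struct buffer_head *bh = head;

            do {
                    /* inspect each buffer_head attached to the page */
                    bh = bh->b_this_page;
            } while (bh != head);
    }

__ext4_journalled_writepage() calls page_buffers() unconditionally, so
it crashes if page->private has been cleared in the meantime.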
Initially I thought that the issue was the one already discussed here:
https://lore.kernel.org/all/[email protected]/
But that seems to be another (solved) problem, for which Theodore Ts'o
already made a quick fix that simply reports the rare occurrence and
continues. The commit is in 5.15 (and in the latest kernel), so it's
not helping our case:
https://github.com/torvalds/linux/commit/cc5095747edfb054ca2068d01af20be3fcc3634f
Back to the problem! 99% of the difference between 4.14 and the latest
kernel for __ext4_journalled_writepage() in "fs/ext4/inode.c" comes from
the following commit:
https://github.com/torvalds/linux/commit/5c48a7df91499e371ef725895b2e2d21a126e227
Is it safe for us to revert this patch on the latest 5.15 kernel, so
that we can confirm whether this resolves the issue?
Best regards.
--Ivan
Hello,
I forgot to mention that the ext4 file system is mounted with
"data=journal" and that the crash happens on servers which have more
than 20 GB of RAM and are I/O-busy.
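For reference, an fstab line equivalent to our setup looks like this
(the device and mount point here are illustrative):

    /dev/vda1  /data  ext4  defaults,data=journal  0  2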
> Back to the problem! 99% of the difference between 4.14 and the latest
> kernel for __ext4_journalled_writepage() in "fs/ext4/inode.c" comes
> from the following commit:
> https://github.com/torvalds/linux/commit/5c48a7df91499e371ef725895b2e2d21a126e227
>
> Is it safe for us to revert this patch on the latest 5.15 kernel, so
> that we can confirm whether this resolves the issue?
If we can't revert the patch, or if it doesn't make sense to, is there
anything else we can do to assist in debugging this rare kernel crash?
The machines are QEMU/KVM guests, but dumping the whole guest memory
would take a couple of minutes, so it's not viable.
Are there any debug statements we could add in
__ext4_journalled_writepage() in "fs/ext4/inode.c" that may give a hint
where the problem is?
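For example, would something along these lines make sense right before
the page_buffers() call (assuming inode and page are in scope there)?
This is only an untested sketch on our side, meant to report the page
state instead of crashing:

    /* untested debug sketch: report the page state instead of BUG_ON() */
    if (!page_has_buffers(page)) {
            pr_warn("ext4: journalled writepage without buffers: ino=%lu index=%lu flags=%#lx\n",
                    inode->i_ino, page->index, page->flags);
            dump_page(page, "ext4 journalled writepage");
    }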
Best regards.
--Ivan
On Wed, Nov 23, 2022 at 04:48:01PM +0200, Ivan Zahariev wrote:
> Hello,
>
> Starting with kernel 5.15, over the past eight months we have had a total
> of 12 kernel panics across a fleet of 1000 KVM/QEMU machines, which look
> like this:
>
>     kernel BUG at fs/ext4/inode.c:1914
>
> Switching from kernel 4.14 to 5.15 triggered the problem almost immediately,
> so we are confident that userland activity is more or less the same and is
> not the root cause. The kernel function which triggers the BUG
> is __ext4_journalled_writepage(). In 5.15 the code for
> __ext4_journalled_writepage() in "fs/ext4/inode.c" is the same as the
> current kernel "master". The line where the BUG is triggered is:
>
>     struct buffer_head *page_bufs = page_buffers(page);
...
>
> Back to the problem! 99% of the difference between 4.14 and the latest
> kernel for __ext4_journalled_writepage() in "fs/ext4/inode.c" comes from the
> following commit: https://github.com/torvalds/linux/commit/5c48a7df91499e371ef725895b2e2d21a126e227
>
> Is it safe for us to revert this patch on the latest 5.15 kernel, so that
> we can confirm whether this resolves the issue?
No, it's not safe to revert this patch; it fixes a potential
use-after-free bug identified by Syzkaller (and use-after-frees can
sometimes be leveraged into security vulnerabilities).
I have not tested this change yet, but looking at the code and the
change made in that patch, this *might* be a possible fix:
    size = i_size_read(inode);
-   if (page->mapping != mapping || page_offset(page) > size) {
+   if (page->mapping != mapping || page_offset(page) >= size) {
        /* The page got truncated from under us */
        ext4_journal_stop(handle);
        ret = 0;
        goto out;
    }
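To spell out the boundary case this changes (a worked example assuming
4 KiB pages): if i_size is exactly 8192, valid data lives only in the
pages at offsets 0 and 4096, so a page starting at offset 8192 holds no
valid data:

    old check: 8192 >  8192 -> false, page is still processed
    new check: 8192 >= 8192 -> true,  page is treated as truncated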
Is it fair to say that your workload is using data=journal and is
frequently truncating files that might have been recently modified
(hence triggering the race between truncate and journalled writepages)?
I wonder if you could come up with a more reliable reproducer so we
can test a particular patch.
Thanks,
- Ted
On 5.12.2022 г. 23:10, Theodore Ts'o wrote:
> Is it fair to say that your workload is using data=journal and is
> frequently truncating files that might have been recently modified
> (hence triggering the race between truncate and journalled writepages)?
The servers are hosting hundreds of users who run their own tasks, and
we have no control over, nor a way to closely observe, their usage
patterns, unless you can point us in a direction to debug this somehow.
"data=journal" is definitely in place for all servers.
> I wonder if you could come up with a more reliable reproducer so we
> can test a particular patch.
We already tried different parallel combinations of mmap()'ed reading,
direct and regular write(), drop_caches, sync(), etc., but we can't
trigger the panic.
If you have any suggestions for what we should try next as a
reproducer, please share them and we will implement and run it.
Did I understand correctly that a possible reproducer would be a loop of
heavy write() followed by truncate() of the same file? Should we
randomly sync() and/or "echo 3 > /proc/sys/vm/drop_caches" to increase
the chance of hitting the bug?
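If that is the right direction, we could run something like the
following loop (a rough sketch; the file name and sizes are arbitrary)
on a data=journal mount, with many instances in parallel and under
memory pressure:

    /* rough reproducer sketch: hammer one file with writes and truncates */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            int fd = open("victim", O_CREAT | O_RDWR, 0600);

            if (fd < 0)
                    return 1;
            memset(buf, 'x', sizeof(buf));
            for (;;) {
                    for (int i = 0; i < 1024; i++)
                            write(fd, buf, sizeof(buf)); /* dirty ~4 MiB */
                    fsync(fd);          /* push journalled pages out */
                    ftruncate(fd, 0);   /* race truncate vs. writeback */
                    lseek(fd, 0, SEEK_SET);
            }
    }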
Best regards.
--Ivan
On Mon 05-12-22 16:10:47, Theodore Ts'o wrote:
> On Wed, Nov 23, 2022 at 04:48:01PM +0200, Ivan Zahariev wrote:
> > Hello,
> >
> > Starting with kernel 5.15, over the past eight months we have had a total
> > of 12 kernel panics across a fleet of 1000 KVM/QEMU machines, which look
> > like this:
> >
> >     kernel BUG at fs/ext4/inode.c:1914
> >
> > Switching from kernel 4.14 to 5.15 triggered the problem almost immediately,
> > so we are confident that userland activity is more or less the same and is
> > not the root cause. The kernel function which triggers the BUG
> > is __ext4_journalled_writepage(). In 5.15 the code for
> > __ext4_journalled_writepage() in "fs/ext4/inode.c" is the same as the
> > current kernel "master". The line where the BUG is triggered is:
> >
> >     struct buffer_head *page_bufs = page_buffers(page);
> ...
> >
> > Back to the problem! 99% of the difference between 4.14 and the latest
> > kernel for __ext4_journalled_writepage() in "fs/ext4/inode.c" comes from the
> > following commit: https://github.com/torvalds/linux/commit/5c48a7df91499e371ef725895b2e2d21a126e227
> >
> > Is it safe for us to revert this patch on the latest 5.15 kernel, so that
> > we can confirm whether this resolves the issue?
>
> No, it's not safe to revert this patch; it fixes a potential
> use-after-free bug identified by Syzkaller (and use-after-frees can
> sometimes be leveraged into security vulnerabilities).
>
> I have not tested this change yet, but looking at the code and the
> change made in that patch, this *might* be a possible fix:
>
>     size = i_size_read(inode);
> -   if (page->mapping != mapping || page_offset(page) > size) {
> +   if (page->mapping != mapping || page_offset(page) >= size) {
>         /* The page got truncated from under us */
>         ext4_journal_stop(handle);
>         ret = 0;
>         goto out;
>     }
>
> Is it fair to say that your workload is using data=journal and is
> frequently truncating files that might have been recently modified
> (hence triggering the race between truncate and journalled writepages)?
>
> I wonder if you could come up with a more reliable reproducer so we
> can test a particular patch.
So after a bit of thought I agree that the commit 5c48a7df91499 ("ext4: fix
an use-after-free issue about data=journal writeback mode") is broken. The
problem is that when we unlock the page in __ext4_journalled_writepage(),
anybody else can come in, write out the page, and reclaim the page buffers
(due to memory pressure). Previously, bh references prevented the buffer
reclaim from happening, but now there's nothing to prevent it.
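Roughly, the interleaving I have in mind (the CPU labels are just
illustrative):

    CPU 1: __ext4_journalled_writepage()    CPU 2: writeback + reclaim
    ------------------------------------    --------------------------
    unlock_page(page)
                                            write the page out, clean it
                                            try_to_free_buffers(page)
                                              -> page->private cleared
    lock_page(page)
    mapping/size checks still pass
    page_buffers(page)
      -> BUG_ON(!PagePrivate(page))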
My rewrite of the data=journal writeback path fixes this problem as a
side-effect, but perhaps we need a quick fix for stable kernels?
Something like the attached patch?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 12.1.2023 г. 17:07, Jan Kara wrote:
> So after a bit of thought I agree that the commit 5c48a7df91499 ("ext4: fix
> an use-after-free issue about data=journal writeback mode") is broken. The
> problem is that when we unlock the page in __ext4_journalled_writepage(),
> anybody else can come in, write out the page, and reclaim the page buffers
> (due to memory pressure). Previously, bh references prevented the buffer
> reclaim from happening, but now there's nothing to prevent it.
>
> My rewrite of the data=journal writeback path fixes this problem as a
> side-effect, but perhaps we need a quick fix for stable kernels?
> Something like the attached patch?
>
> Honza
Do you consider this patch production ready?
Should we test it on real production machines, with peace of mind that
nothing can go wrong in regard to data loss or corruption?
--Ivan
On Wed 15-03-23 13:27:11, Ivan Zahariev wrote:
> On 12.1.2023 г. 17:07, Jan Kara wrote:
> > So after a bit of thought I agree that the commit 5c48a7df91499 ("ext4: fix
> > an use-after-free issue about data=journal writeback mode") is broken. The
> > problem is that when we unlock the page in __ext4_journalled_writepage(),
> > anybody else can come in, write out the page, and reclaim the page buffers
> > (due to memory pressure). Previously, bh references prevented the buffer
> > reclaim from happening, but now there's nothing to prevent it.
> >
> > My rewrite of the data=journal writeback path fixes this problem as a
> > side-effect, but perhaps we need a quick fix for stable kernels?
> > Something like the attached patch?
> >
> > Honza
>
> Do you consider this patch production ready?
Ah, the patch has likely fallen through the cracks: I waited for a
reply and then forgot about it, and Ted likely missed it inside the
thread. But yes, I consider the patch safe to test on production
machines - at least it has passed testing with fstests on my test VM
without any visible issues.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Mar 15, 2023 at 06:32:17PM +0100, Jan Kara wrote:
> On Wed 15-03-23 13:27:11, Ivan Zahariev wrote:
> > On 12.1.2023 г. 17:07, Jan Kara wrote:
> > > So after a bit of thought I agree that the commit 5c48a7df91499 ("ext4: fix
> > > an use-after-free issue about data=journal writeback mode") is broken. The
> > > problem is that when we unlock the page in __ext4_journalled_writepage(),
> > > anybody else can come in, write out the page, and reclaim the page buffers
> > > (due to memory pressure). Previously, bh references prevented the buffer
> > > reclaim from happening, but now there's nothing to prevent it.
> > >
> > > My rewrite of the data=journal writeback path fixes this problem as a
> > > side-effect, but perhaps we need a quick fix for stable kernels?
> > > Something like the attached patch?
> > >
> > > Honza
> >
> > Do you consider this patch production ready?
>
> Ah, the patch has likely fallen through the cracks: I waited for a
> reply and then forgot about it, and Ted likely missed it inside the
> thread. But yes, I consider the patch safe to test on production
> machines - at least it has passed testing with fstests on my test VM
> without any visible issues.
Yeah, sorry, I didn't see it since it was in an attachment rather than
sent with an explicit [PATCH] subject line.
And at this point, the data=journal writeback patches have landed in
the ext4/dev tree, and while we could try to see if we could land this
before the next merge window, I'm worried about merge or semantic
conflicts of having both patches in a tree at one time.
I guess we could send it to Linus, let it get backported into stable,
and then revert it during the merge window, ahead of applying the
data=journal cleanup patch series. But that seems a bit ugly. Or we
could ask for an exception from the stable kernel folks, after I do a
full set of xfstests runs on it. (Of course, I don't think anyone has
been able to create a reliable reproducer, so all we can do is to test
for regression failures.)
Jan, Greg, what do you think?
- Ted
Hi,
> On Wed, Mar 15, 2023 at 18:57, Theodore Ts'o wrote:
>
> Yeah, sorry, I didn't see it since it was in an attachment rather than
> sent with an explicit [PATCH] subject line.
>
> And at this point, the data=journal writeback patches have landed in
> the ext4/dev tree, and while we could try to see if we could land this
> before the next merge window, I'm worried about merge or semantic
> conflicts of having both patches in a tree at one time.
>
> I guess we could send it to Linus, let it get backported into stable,
> and then revert it during the merge window, ahead of applying the
> data=journal cleanup patch series. But that seems a bit ugly. Or we
> could ask for an exception from the stable kernel folks, after I do a
> full set of xfstests runs on it. (Of course, I don't think anyone has
> been able to create a reliable reproducer, so all we can do is to test
> for regression failures.)
>
> Jan, Greg, what do you think?
We've noticed this appearing for us as well now (on 5.15 with
data=journal) and I wanted to ask what the status here is. Did any fix
here make it into a stable kernel yet? If not, I suppose I can still
apply the patch posted above as a quick-fix until this (or another
solution) makes it into the stable tree?
Best,
Marcus
On Thu, May 11, 2023 at 11:21:27AM +0200, Marcus Hoffmann wrote:
> Hi,
>
> > On Wed, Mar 15, 2023 at 18:57, Theodore Ts'o wrote:
> >
> > Yeah, sorry, I didn't see it since it was in an attachment rather than
> > sent with an explicit [PATCH] subject line.
> >
> > And at this point, the data=journal writeback patches have landed in
> > the ext4/dev tree, and while we could try to see if we could land this
> > before the next merge window, I'm worried about merge or semantic
> > conflicts of having both patches in a tree at one time.
> >
> > I guess we could send it to Linus, let it get backported into stable,
> > and then revert it during the merge window, ahead of applying the
> > data=journal cleanup patch series. But that seems a bit ugly. Or we
> > could ask for an exception from the stable kernel folks, after I do a
> > full set of xfstests runs on it. (Of course, I don't think anyone has
> > been able to create a reliable reproducer, so all we can do is to test
> > for regression failures.)
> >
> > Jan, Greg, what do you think?
>
> We've noticed this appearing for us as well now (on 5.15 with
> data=journal) and I wanted to ask what the status here is. Did any fix
> here make it into a stable kernel yet? If not, I suppose I can still
> apply the patch posted above as a quick-fix until this (or another
> solution) makes it into the stable tree?
Any reason you can't just move to 6.1.y instead? What prevents that?
thanks,
greg k-h
On 12.05.23 14:19, Greg KH wrote:
> On Thu, May 11, 2023 at 11:21:27AM +0200, Marcus Hoffmann wrote:
>> Hi,
>>
>>> On Wed, Mar 15, 2023 at 18:57, Theodore Ts'o wrote:
>>>
>>> Yeah, sorry, I didn't see it since it was in an attachment rather than
>>> sent with an explicit [PATCH] subject line.
>>>
>>> And at this point, the data=journal writeback patches have landed in
>>> the ext4/dev tree, and while we could try to see if we could land this
>>> before the next merge window, I'm worried about merge or semantic
>>> conflicts of having both patches in a tree at one time.
>>>
>>> I guess we could send it to Linus, let it get backported into stable,
>>> and then revert it during the merge window, ahead of applying the
>>> data=journal cleanup patch series. But that seems a bit ugly. Or we
>>> could ask for an exception from the stable kernel folks, after I do a
>>> full set of xfstests runs on it. (Of course, I don't think anyone has
>>> been able to create a reliable reproducer, so all we can do is to test
>>> for regression failures.)
>>>
>>> Jan, Greg, what do you think?
>>
>> We've noticed this appearing for us as well now (on 5.15 with
>> data=journal) and I wanted to ask what the status here is. Did any fix
>> here make it into a stable kernel yet? If not, I suppose I can still
>> apply the patch posted above as a quick-fix until this (or another
>> solution) makes it into the stable tree?
>
> Any reason you can't just move to 6.1.y instead? What prevents that?
>
We will move to 6.1.y soon-ish (we are downstream from the rpi kernel
tree). Is this problem fixed there, though? I couldn't really find
anything related to it in the tree.
Best,
Marcus
On Fri, May 12, 2023 at 04:24:30PM +0200, Marcus Hoffmann wrote:
> On 12.05.23 14:19, Greg KH wrote:
> > On Thu, May 11, 2023 at 11:21:27AM +0200, Marcus Hoffmann wrote:
> > > Hi,
> > >
> > > > On Wed, Mar 15, 2023 at 18:57, Theodore Ts'o wrote:
> > > >
> > > > Yeah, sorry, I didn't see it since it was in an attachment rather than
> > > > sent with an explicit [PATCH] subject line.
> > > >
> > > > And at this point, the data=journal writeback patches have landed in
> > > > the ext4/dev tree, and while we could try to see if we could land this
> > > > before the next merge window, I'm worried about merge or semantic
> > > > conflicts of having both patches in a tree at one time.
> > > >
> > > > I guess we could send it to Linus, let it get backported into stable,
> > > > and then revert it during the merge window, ahead of applying the
> > > > data=journal cleanup patch series. But that seems a bit ugly. Or we
> > > > could ask for an exception from the stable kernel folks, after I do a
> > > > full set of xfstests runs on it. (Of course, I don't think anyone has
> > > > been able to create a reliable reproducer, so all we can do is to test
> > > > for regression failures.)
> > > >
> > > > Jan, Greg, what do you think?
> > >
> > > We've noticed this appearing for us as well now (on 5.15 with
> > > data=journal) and I wanted to ask what the status here is. Did any fix
> > > here make it into a stable kernel yet? If not, I suppose I can still
> > > apply the patch posted above as a quick-fix until this (or another
> > > solution) makes it into the stable tree?
> >
> > Any reason you can't just move to 6.1.y instead? What prevents that?
> >
>
> We will move to 6.1.y soon-ish (we are downstream from the rpi kernel
> tree). Is this problem fixed there, though? I couldn't really find
> anything related to it in the tree.
Test it and see!
And if you are downstream from the RPI kernel tree, my sympathies;
that's a tough place to be, given the speed at which it updates (i.e.
not at all...)
good luck!
greg k-h
On Fri 12-05-23 16:24:30, Marcus Hoffmann wrote:
> On 12.05.23 14:19, Greg KH wrote:
> > On Thu, May 11, 2023 at 11:21:27AM +0200, Marcus Hoffmann wrote:
> > > Hi,
> > >
> > > > On Wed, Mar 15, 2023 at 18:57, Theodore Ts'o wrote:
> > > >
> > > > Yeah, sorry, I didn't see it since it was in an attachment rather than
> > > > sent with an explicit [PATCH] subject line.
> > > >
> > > > And at this point, the data=journal writeback patches have landed in
> > > > the ext4/dev tree, and while we could try to see if we could land this
> > > > before the next merge window, I'm worried about merge or semantic
> > > > conflicts of having both patches in a tree at one time.
> > > >
> > > > I guess we could send it to Linus, let it get backported into stable,
> > > > and then revert it during the merge window, ahead of applying the
> > > > data=journal cleanup patch series. But that seems a bit ugly. Or we
> > > > could ask for an exception from the stable kernel folks, after I do a
> > > > full set of xfstests runs on it. (Of course, I don't think anyone has
> > > > been able to create a reliable reproducer, so all we can do is to test
> > > > for regression failures.)
> > > >
> > > > Jan, Greg, what do you think?
> > >
> > > We've noticed this appearing for us as well now (on 5.15 with
> > > data=journal) and I wanted to ask what the status here is. Did any fix
> > > here make it into a stable kernel yet? If not, I suppose I can still
> > > apply the patch posted above as a quick-fix until this (or another
> > > solution) makes it into the stable tree?
> >
> > Any reason you can't just move to 6.1.y instead? What prevents that?
> >
>
> We will move to 6.1.y soon-ish (we are downstream from the rpi kernel
> tree). Is this problem fixed there, though? I couldn't really find
> anything related to it in the tree.
Yeah. Due to various delays, the rewrite of data=journal mode was
merged only in 6.4-rc1, so updating to 6.1.y isn't going to fix the
problem.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello,
I also encountered that specific issue while migrating from 4.14 to 5.15
with data=journal. The proposed patch fixes the issue for me. The 6.1
branch seems to be affected as well. Would it be an option to have that
patch applied to stable branches?
Thanks,
Mathieu
Hello,
On Wed 20-09-23 11:40:24, Mathieu Othacehe wrote:
> I also encountered that specific issue while migrating from 4.14 to 5.15
> with data=journal. The proposed patch fixes the issue for me. The 6.1
> branch seems to be affected as well. Would it be an option to have that
> patch applied to stable branches?
Well, Greg is rather reluctant to merge patches into the stable tree
that were not included upstream, but with a good reason he could do
that. I guess the easiest thing is to try: take the backported fix and
officially submit it to [email protected] with an explanation of
why you are submitting a non-upstream patch - in particular, that the
upstream solution is a series of some 14 patches.
An alternative is to backport and test the upstream solution:
bd159398a2d2 ("jbd2: Don't refuse invalidation of already invalidated buffers")
d84c9ebdac1e ("ext4: Mark pages with journalled data dirty")
265e72efa99f ("ext4: Keep pages with journalled data dirty")
5e1bdea6391d ("ext4: Clear dirty bit from pages without data to write")
1f1a55f0bf06 ("ext4: Commit transaction before writing back pages in data=journal mode")
e360c6ed7274 ("ext4: Drop special handling of journalled data from ext4_sync_file()")
c000dfec7e88 ("ext4: Drop special handling of journalled data from extent shifting operations")
783ae448b7a2 ("ext4: Fix special handling of journalled data from extent zeroing")
56c2a0e3d90d ("ext4: Drop special handling of journalled data from ext4_evict_inode()")
7c375870fdc5 ("ext4: Drop special handling of journalled data from ext4_quota_on()")
951cafa6b80e ("ext4: Simplify handling of journalled data in ext4_bmap()")
ab382539adcb ("ext4: Update comment in mpage_prepare_extent_to_map()")
d0ab8368c175 ("Revert "ext4: Fix warnings when freezing filesystem with journaled data"")
1077b2d53ef5 ("ext4: fix fsync for non-directories")
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR