Hi,
while goning into hibernation right after resuming with
2.6.37-05668-g4162cf6 on S390 I get this one:
[ 270.317785] PM: Basic memory bitmaps created
[ 270.317789] PM: Syncing filesystems ...
[ 270.318854] Unable to handle kernel pointer dereference at virtual kernel address (null)
[ 270.318865] Oops: 0004 [#1] PREEMPT SMP
[ 270.318878] Modules linked in: binfmt_misc dm_multipath scsi_dh vmur [last unloaded: scsi_wait_scan]
[ 270.318900] CPU: 0 Not tainted 2.6.37-05668-g4162cf6 #19
[ 270.318907] Process flush-94:0 (pid: 1226, task: 000000003f0a0040, ksp: 000000003d75fc70)
[ 270.318916] Krnl PSW : 0704000180000000 0000000000206662 (jbd2_journal_file_inode+0xfe/0x148)
[ 270.318938] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[ 270.318948] Krnl GPRS: ffe62f366dac01a2 0000000000000000 0000000000000000 00000000007e4260
[ 270.318961] 00000000002065dc 6880000000000000 000000003f234d60 0000000000000005
[ 270.318971] 000000003f6e15a8 000000003d75fa88 000000003ed59228 000000003ed58c60
[ 270.319105] 0000000000000000 000000003f036488 00000000002065dc 000000003d75f9a0
[ 270.319123] Krnl Code: 0000000000206656: b9140022 lgfr %r2,%r2
[ 270.319133] 000000000020665a: 07f4 bcr 15,%r4
[ 270.319145] 000000000020665c: ec260023007c cgij %r2,0,6,2066a2
[ 270.319158] >0000000000206662: e3d0c0000024 stg %r13,0(%r12)
[ 270.319181] 0000000000206668: 4110c010 la %r1,16(%r12)
[ 270.319198] 000000000020666c: e320d0600004 lg %r2,96(%r13)
[ 270.319241] 0000000000206672: e31020080024 stg %r1,8(%r2)
[ 270.319260] 0000000000206678: e320c0100024 stg %r2,16(%r12)
[ 270.319281] Call Trace:
[ 270.319289] ([<00000000002065dc>] jbd2_journal_file_inode+0x78/0x148)
[ 270.319303] [<00000000001d0f66>] mpage_da_map_and_submit+0x2ea/0x440
[ 270.319319] [<00000000001d197a>] ext4_da_writepages+0x32e/0x744
[ 270.319334] [<000000000016fe64>] writeback_single_inode+0xd4/0x25c
[ 270.319391] [<0000000000170514>] writeback_sb_inodes+0xd0/0x160
[ 270.319396] [<0000000000171278>] writeback_inodes_wb+0x90/0x16c
[ 270.319408] [<00000000001715fa>] wb_writeback+0x2a6/0x480
[ 270.319414] [<00000000001718b6>] wb_do_writeback+0xe2/0x294
[ 270.319419] [<0000000000171b14>] bdi_writeback_thread+0xac/0x2fc
[ 270.319424] [<000000000007d192>] kthread+0xbe/0xc8
[ 270.319457] [<000000000047910a>] kernel_thread_starter+0x6/0xc
[ 270.319467] [<0000000000479104>] kernel_thread_starter+0x0/0xc
bisecting points to:
commit 8aefcd557d26d0023a36f9ec5afbf55e59f8f26b
Author: Theodore Ts'o <[email protected]>
Date: Mon Jan 10 12:29:43 2011 -0500
ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing. This allows us to further slim down the ext4_inode_info
structure.
Regards,
Sebastian
On Wed, Jan 12, 2011 at 02:48:58PM +0100, Sebastian Ott wrote:
> Hi,
>
> while goning into hibernation right after resuming with
> 2.6.37-05668-g4162cf6 on S390 I get this one:
Hmm... what are you hibernating to? A swap device? A file-backed
swap device?
Thanks,
- Ted
On Wed, 12 Jan 2011, Ted Ts'o wrote:
> On Wed, Jan 12, 2011 at 02:48:58PM +0100, Sebastian Ott wrote:
> > Hi,
> >
> > while goning into hibernation right after resuming with
> > 2.6.37-05668-g4162cf6 on S390 I get this one:
>
> Hmm... what are you hibernating to? A swap device? A file-backed
> swap device?
swap device.
Regards,
Sebastian
Since I don't have a machine set up to test hibernation easily at
hand, I'd really appreciate it if you could try this patch to
determine which inode had the NULL jinode --- and then once you get
the device and inode number, to use debugfs's "ncheck" command to map
the inode number to a pathname.
If you could do that, it would be a huge help.
Thanks, regards,
- Ted
P.S. Also, if you could try suspending once or twice, with different
programs running, to see if the inode number and pathname are constant
or vary, that would also be helpful.
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index d8b992e..7d6d7d7 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal)
static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
{
- if (ext4_handle_valid(handle))
+ if (ext4_handle_valid(handle)) {
+ if (unlikely(EXT4_I(inode)->jinode == NULL)) {
+ /* Should never happen */
+ ext4_msg(inode->i_sb, KERN_CRIT,
+ "inode #%lu has NULL jinode", inode->i_ino);
+ BUG();
+ }
return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+ }
return 0;
}
On Wed, 12 Jan 2011, Ted Ts'o wrote:
> Since I don't have a machine set up to test hibernation easily at
> hand, I'd really appreciate it if you could try this patch to
> determine which inode had the NULL jinode --- and then once you get
> the device and inode number, to use debugfs's "ncheck" command to map
> the inode number to a pathname.
>
> If you could do that, it would be a huge help.
ok, will do
>
> Thanks, regards,
>
> - Ted
>
> P.S. Also, if you could try suspending once or twice, with different
> programs running, to see if the inode number and pathname are constant
> or vary, that would also be helpful.
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index d8b992e..7d6d7d7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal)
>
> static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
> {
> - if (ext4_handle_valid(handle))
> + if (ext4_handle_valid(handle)) {
> + if (unlikely(EXT4_I(inode)->jinode == NULL)) {
> + /* Should never happen */
> + ext4_msg(inode->i_sb, KERN_CRIT,
> + "inode #%lu has NULL jinode", inode->i_ino);
> + BUG();
> + }
> return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
> + }
> return 0;
> }
>
>
Hi,
On Wed, 12 Jan 2011, Ted Ts'o wrote:
> Since I don't have a machine set up to test hibernation easily at
> hand, I'd really appreciate it if you could try this patch to
> determine which inode had the NULL jinode --- and then once you get
> the device and inode number, to use debugfs's "ncheck" command to map
> the inode number to a pathname.
>
> If you could do that, it would be a huge help.
I did about 5 re-tests with different loads, the result was always:
[ 249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode
[ 249.238732] ------------[ cut here ]------------
[ 249.238738] kernel BUG at fs/ext4/ext4_jbd2.h:260!
[ 249.238747] illegal operation: 0001 [#1] PREEMPT SMP
[ 249.238760] Modules linked in: binfmt_misc dm_multipath scsi_dh vmur [last unloaded: scsi_wait_scan]
[ 249.238786] CPU: 2 Not tainted 2.6.37-05668-g4162cf6-dirty #20
[ 249.238794] Process flush-94:0 (pid: 1244, task: 000000003f110040, ksp: 000000003dca3c70)
[ 249.238803] Krnl PSW : 0704000180000000 00000000001d1130 (mpage_da_map_and_submit+0x464/0x468)
[ 249.238824] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[ 249.238835] Krnl GPRS: 000000000000c65f 0000000000000000 0000000000000043 0000000000000000
[ 249.238848] 00000000004735c0 0000000000000000 000000003ee45478 0000000000000001
[ 249.238863] 0000000035544aa8 000000003dca3a88 0000000000000004 000000003d29d680
[ 249.238868] 000000003dca3b80 000000000049ef98 00000000001d112c 000000003dca39d8
[ 249.238886] Krnl Code: 00000000001d1120: c040001e01c2 larl %r4,5914a4
[ 249.238893] 00000000001d1126: c0e5000092ed brasl %r14,1e3700
[ 249.238899] 00000000001d112c: a7f40001 brc 15,1d112e
[ 249.238906] >00000000001d1130: a7f40000 brc 15,1d1130
[ 249.238913] 00000000001d1134: e3e0f0080024 stg %r14,8(%r15)
[ 249.238920] 00000000001d113a: c0f400000006 brcl 15,1d1146
[ 249.238942] 00000000001d1140: b9040000 lgr %r0,%r0
[ 249.238963] 00000000001d1144: 0de1 basr %r14,%r1
[ 249.238989] Call Trace:
[ 249.238998] ([<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468)
[ 249.239012] [<00000000001d19f2>] ext4_da_writepages+0x32e/0x744
[ 249.239028] [<000000000016fe64>] writeback_single_inode+0xd4/0x25c
[ 249.239044] [<0000000000170514>] writeback_sb_inodes+0xd0/0x160
[ 249.239050] [<0000000000171278>] writeback_inodes_wb+0x90/0x16c
[ 249.239055] [<00000000001715fa>] wb_writeback+0x2a6/0x480
[ 249.239060] [<00000000001718b6>] wb_do_writeback+0xe2/0x294
[ 249.239066] [<0000000000171b14>] bdi_writeback_thread+0xac/0x2fc
[ 249.239072] [<000000000007d192>] kthread+0xbe/0xc8
[ 249.239087] [<0000000000479182>] kernel_thread_starter+0x6/0xc
[ 249.239101] [<000000000047917c>] kernel_thread_starter+0x0/0xc
[ 249.239112] INFO: lockdep is turned off.
[ 249.239119] Last Breaking-Event-Address:
[ 249.239126] [<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468
# debugfs /dev/dasda1
debugfs 1.41.10 (10-Feb-2009)
debugfs: ncheck 1106
Inode Pathname
1106 /usr/bin/killall
debugfs: q
there are also a few messages like:
EXT4-fs error (device dasda1): ext4_lookup:1043: inode #130849: comm find: deleted inode referenced: 129148
I guess they appeared after one of the crashes during bisecting (can't
run fsck right now, since its my /)
Regards,
Sebastian
>
> Thanks, regards,
>
> - Ted
>
> P.S. Also, if you could try suspending once or twice, with different
> programs running, to see if the inode number and pathname are constant
> or vary, that would also be helpful.
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index d8b992e..7d6d7d7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal)
>
> static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
> {
> - if (ext4_handle_valid(handle))
> + if (ext4_handle_valid(handle)) {
> + if (unlikely(EXT4_I(inode)->jinode == NULL)) {
> + /* Should never happen */
> + ext4_msg(inode->i_sb, KERN_CRIT,
> + "inode #%lu has NULL jinode", inode->i_ino);
> + BUG();
> + }
> return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
> + }
> return 0;
> }
>
>
On Jan 12, 2011, at 1:49 PM, Sebastian Ott wrote:
>
> I did about 5 re-tests with different loads, the result was always:
>
> [ 249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode
>
> # debugfs /dev/dasda1
> debugfs 1.41.10 (10-Feb-2009)
> debugfs: ncheck 1106
> Inode Pathname
> 1106 /usr/bin/killall
> debugfs: q
>
That looks really bogus. /usr/bin/killall is a system binary, and there's no good reason that hibernation should be trying to write pages to that binary.
You said originally that the oops was happening "while going into hibernation right after resuming with...". So that means you did a successful suspend/resume, and then the second suspend caused the oops? It looks like somehow the pages were left marked as dirty, so the writeback daemons attempted writing back a page to an inode which was never opened read/write (and in fact as a text page for /usr/bin/killall, was mapped read/only). Given that ext4 initializes jinode only when the file is opened read/write, the fact that it is null, and the fact that it makes no sense that a program would be modifying /usr/bin/killall as part of a suspend/resume, it looks very much like we just unmasked a software suspend bug....
-- Ted
On Wed, Jan 12, 2011 at 07:44:17PM -0500, Theodore Tso wrote:
>
> You said originally that the oops was happening "while going into
> hibernation right after resuming with...". So that means you did a
> successful suspend/resume, and then the second suspend caused the
> oops? It looks like somehow the pages were left marked as dirty, so
> the writeback daemons attempted writing back a page to an inode
> which was never opened read/write (and in fact as a text page for
> /usr/bin/killall, was mapped read/only). Given that ext4
> initializes jinode only when the file is opened read/write, the fact
> that it is null, and the fact that it makes no sense that a program
> would be modifying /usr/bin/killall as part of a suspend/resume, it
> looks very much like we just unmasked a software suspend bug....
... and I think I've found the problem. In kernel/power/block_io.c,
in the function submit(), we see this:
if (bio_chain == NULL) {
submit_bio(bio_rw, bio);
wait_on_page_locked(page);
if (rw == READ)
bio_set_pages_dirty(bio); <====
bio_put(bio);
So when we read in pages from the software suspend device, we end up
marking the pages as dirty(!). I'm guessing this was caused by a copy
and paste from the only other caller of bio_set_pages_dirty(), which
is the direct I/O code, which needs this when we are writing from a
file into a user-provided buffer. But for restoring from a software
suspend case, this is as far as I can tell wholely inappropriate.
This causes needless writes, which is bad even before ext4 unmasked
the problem. I will send a patch under separate cover; could you give
it a try and see if it fixes your crash?
I will look into bulletproofing ext4 by adding checks for this case
and printing warning messages, but neverthe less, I think the root
cause is actually in the hibernation's bio code.
- Ted
Everything was sync'ed before the hibernation, so no pages could be
dirty. So this causes a lot of wasted I/O activity right after
resuming from hibernation.
Worse, it also causes pages from files that were opened read/only to
be marked writeble which makes them subject to writeback. This was
discovered when ext4 was changed to so that the jinode pointer was not
initialized unless the file was opened read/write, and this caused
things to blow up. But that just unmasked a problem, since the pages
belonging to the file in question should have never been marked dirty
in the first place. It increases the chances the text blocks for
executables like /usr/bin/killall will get corrupted when they are
needlessly written, and of course it means extra write cycles to the
SSD.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Sebastian Ott <[email protected]>
Cc: [email protected]
---
kernel/power/block_io.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
index 83bbc7c..108a4f3 100644
--- a/kernel/power/block_io.c
+++ b/kernel/power/block_io.c
@@ -49,8 +49,6 @@ static int submit(int rw, struct block_device *bdev, sector_t sector,
if (bio_chain == NULL) {
submit_bio(bio_rw, bio);
wait_on_page_locked(page);
- if (rw == READ)
- bio_set_pages_dirty(bio);
bio_put(bio);
} else {
if (rw == READ)
--
1.7.3.1
On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote:
> It looks like somehow the pages were left marked as dirty, so the
> writeback daemons attempted writing back a page to an inode which was
> never opened read/write (and in fact as a text page
> for /usr/bin/killall, was mapped read/only).
Just out of curiosity, when this happens, is LZO compression being used
with hibernation (default in 2.6.37) or has it been disabled?
--
Bojan
On Wed, 12 Jan 2011, Theodore Tso wrote:
> That looks really bogus. /usr/bin/killall is a system binary, and there's
> no good reason that hibernation should be trying to write pages to that
> binary.
>
> You said originally that the oops was happening "while going into
> hibernation right after resuming with...". So that means you did a
> successful suspend/resume, and then the second suspend caused the oops?
Yes. I basically did a
echo reboot > /sys/power/disk
for i in {1..5} ;do echo disk > /sys/power/state ;done
and it crashed very early in the second suspend.
> It looks like somehow the pages were left marked as dirty, so the
> writeback daemons attempted writing back a page to an inode which was
> never opened read/write (and in fact as a text page for
> /usr/bin/killall, was mapped read/only).
> Given that ext4 initializes jinode only when the file is opened
> read/write, the fact that it is null, and the fact that it makes no
> sense that a program would be modifying /usr/bin/killall as part of a
> suspend/resume, it looks very much like we just unmasked a software
> suspend bug....
Ah, ok. Thanks for the explanation!
Regards,
Sebastian
Hi,
On Thu, 13 Jan 2011, Bojan Smojver wrote:
> On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote:
> > It looks like somehow the pages were left marked as dirty, so the
> > writeback daemons attempted writing back a page to an inode which was
> > never opened read/write (and in fact as a text page
> > for /usr/bin/killall, was mapped read/only).
>
> Just out of curiosity, when this happens, is LZO compression being used
> with hibernation (default in 2.6.37) or has it been disabled?
LZO is being used - should I try with hibernate=nocompress?
Regards,
Sebastian
Hi,
On Thu, 13 Jan 2011, Theodore Ts'o wrote:
> Everything was sync'ed before the hibernation, so no pages could be
> dirty. So this causes a lot of wasted I/O activity right after
> resuming from hibernation.
>
> Worse, it also causes pages from files that were opened read/only to
> be marked writeble which makes them subject to writeback. This was
> discovered when ext4 was changed to so that the jinode pointer was not
> initialized unless the file was opened read/write, and this caused
> things to blow up. But that just unmasked a problem, since the pages
> belonging to the file in question should have never been marked dirty
> in the first place. It increases the chances the text blocks for
> executables like /usr/bin/killall will get corrupted when they are
> needlessly written, and of course it means extra write cycles to the
> SSD.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: Sebastian Ott <[email protected]>
> Cc: [email protected]
> ---
> kernel/power/block_io.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
> index 83bbc7c..108a4f3 100644
> --- a/kernel/power/block_io.c
> +++ b/kernel/power/block_io.c
> @@ -49,8 +49,6 @@ static int submit(int rw, struct block_device *bdev, sector_t sector,
> if (bio_chain == NULL) {
> submit_bio(bio_rw, bio);
> wait_on_page_locked(page);
> - if (rw == READ)
> - bio_set_pages_dirty(bio);
> bio_put(bio);
> } else {
> if (rw == READ)
> --
> 1.7.3.1
>
>
I did some test with this patch applied, but sadly it didn't help.
The testcase was reduced to one hibernation followed by a sync.
Regards,
Sebastian
On Thu, Jan 13, 2011 at 12:48:40PM +0100, Sebastian Ott wrote:
> On Wed, 12 Jan 2011, Theodore Tso wrote:
> > It looks like somehow the pages were left marked as dirty, so the
> > writeback daemons attempted writing back a page to an inode which was
> > never opened read/write (and in fact as a text page for
> > /usr/bin/killall, was mapped read/only).
> > Given that ext4 initializes jinode only when the file is opened
> > read/write, the fact that it is null, and the fact that it makes no
> > sense that a program would be modifying /usr/bin/killall as part of a
> > suspend/resume, it looks very much like we just unmasked a software
> > suspend bug....
> Ah, ok. Thanks for the explanation!
Eeeek... this seems to be an architecture specific bug that is only present
on s390.
The dirty bit for user space pages on all architectures but s390 are stored
into the PTE's. On s390 however they are stored into the storage key that
exists per _physical_ page.
So, what we should have done, when implementing suspend/resume on s390, is
to save the storage key for each page and write that to the suspend device
and upon resume restore the storage key contents for each physical page.
The code that would do that is missing... Hence _all_ pages of the resumed
image are dirty after they have been copied to their location.
*ouch*
Will fix.
On Thu, Jan 13, 2011 at 02:36:12PM +0100, Heiko Carstens wrote:
>
> Eeeek... this seems to be an architecture specific bug that is only present
> on s390.
> The dirty bit for user space pages on all architectures but s390 are stored
> into the PTE's. On s390 however they are stored into the storage key that
> exists per _physical_ page.
> So, what we should have done, when implementing suspend/resume on s390, is
> to save the storage key for each page and write that to the suspend device
> and upon resume restore the storage key contents for each physical page.
> The code that would do that is missing... Hence _all_ pages of the resumed
> image are dirty after they have been copied to their location.
> *ouch*
>
> Will fix.
Glad you found the root cause. If you don't think you can get this
fixed quickly, before -rc2 or -rc3, I can fairly quickly add some
checks to ext4 to detect this condition, issue a warning, and then
return an error code from the ->writepages() hook. (Which will then
promptly be ignored by the writeback code, since, hey, what are they
going to do with an error, but that's a discussion for another forum.)
Would that be helpful?
I'm still a bit concerned with the call to set the pages' PTE to be
dirty that I found in the hibernate code, but I accept the fact that
removing it doesn't solve the s390 crash. It still seems wrong to me,
and hopefully someone from linux-pm can look at that more closely.
- Ted
On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote:
> I'm still a bit concerned with the call to set the pages' PTE to be
> dirty that I found in the hibernate code, but I accept the fact that
> removing it doesn't solve the s390 crash. It still seems wrong to me,
> and hopefully someone from linux-pm can look at that more closely.
If I'm understanding things correctly, this should affect only the
situation when compression is not used. Otherwise, pages that are read
into by block I/O are decompressed first and copied into different
pages. No?
--
Bojan
On Fri, Jan 14, 2011 at 08:30:12AM +1100, Bojan Smojver wrote:
> On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote:
> > I'm still a bit concerned with the call to set the pages' PTE to be
> > dirty that I found in the hibernate code, but I accept the fact that
> > removing it doesn't solve the s390 crash. It still seems wrong to me,
> > and hopefully someone from linux-pm can look at that more closely.
>
> If I'm understanding things correctly, this should affect only the
> situation when compression is not used. Otherwise, pages that are read
> into by block I/O are decompressed first and copied into different
> pages. No?
When the pages get copied to their final resting place the dirty bits
of the corresponding physical pages get set automatically by the
hardware.
If there would be some code that would clear the dirty bit after the
copy operation then we would have some underindication and as a result
the possibility of data corruption, but we've never seen this.
However since s390 is the only architecture which has dirty bits for
physical pages I doubt that there is any such code present. So the
bug should happen independently of image compression.
What is missing is code that restores the original storage keys after
the pages have been copied to their final place.
I need to read the suspend/resume code and figure out how to fix this.