2009-06-14 01:54:00

by Leandro Lucarella

Subject: NILFS2 get stuck after bio_alloc() fail

Hi!

While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
that they sometimes got stuck in the D state, and the kernel printed the
following message:

NILFS: IO error writing segment

A friend gave me a hand, and after adding some printk()s we found out that
the problem seems to occur when the bio_alloc() calls inside
nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know how
that causes the processes to get stuck.

PS: Please CC me.

--
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)


2009-06-14 01:54:19

by Alberto Bertogli

Subject: Re: NILFS2 get stuck after bio_alloc() fail

On Sat, Jun 13, 2009 at 10:32:11PM -0300, Leandro Lucarella wrote:
> Hi!
>
> While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
> that they sometimes got stuck in the D state, and the kernel printed the
> following message:
>
> NILFS: IO error writing segment
>
> A friend gave me a hand, and after adding some printk()s we found out that
> the problem seems to occur when the bio_alloc() calls inside
> nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know how
> that causes the processes to get stuck.

By the way, those bio_alloc()s use GFP_NOWAIT, but it looks like they
could use at least GFP_NOIO or GFP_NOFS, since the caller can (and
sometimes does) sleep. The only caller is nilfs_submit_bh(), which calls
nilfs_submit_seg_bio(), which can sleep in wait_for_completion(). Is
there something I'm missing?

Thanks a lot,
Alberto

2009-06-14 03:45:29

by Ryusuke Konishi

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Hi,
On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote:
> Hi!
>
> While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
> that they sometimes got stuck in the D state, and the kernel printed the
> following message:
>
> NILFS: IO error writing segment
>
> A friend gave me a hand, and after adding some printk()s we found out that
> the problem seems to occur when the bio_alloc() calls inside
> nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know how
> that causes the processes to get stuck.
>
> PS: Please CC me.
> --
> Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/

Thank you for reporting this issue.

Could you get a stack dump of the stuck nilfs task?
If you have the magic SysRq feature enabled, it can be obtained as follows:

# echo t > /proc/sysrq-trigger

I will dig into how the process got stuck.

P.S.
nilfs.org and the nilfs mailing list are down due to a planned power
outage. They will be back on June 15 at 0:30 a.m.

A backup site (excluding the mailing list) is available at
http://nilfs.sourceforge.net/en/ during that period.

Regards,
Ryusuke Konishi

2009-06-14 06:30:54

by Ryusuke Konishi

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Hi!
On Sat, 13 Jun 2009 22:52:40 -0300, Alberto Bertogli wrote:
> On Sat, Jun 13, 2009 at 10:32:11PM -0300, Leandro Lucarella wrote:
>> Hi!
>>
>> While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
>> that they sometimes got stuck in the D state, and the kernel printed the
>> following message:
>>
>> NILFS: IO error writing segment
>>
>> A friend gave me a hand, and after adding some printk()s we found out
>> that the problem seems to occur when the bio_alloc() calls inside
>> nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know how
>> that causes the processes to get stuck.
>
> By the way, those bio_alloc()s use GFP_NOWAIT, but it looks like they
> could use at least GFP_NOIO or GFP_NOFS, since the caller can (and
> sometimes does) sleep. The only caller is nilfs_submit_bh(), which calls
> nilfs_submit_seg_bio(), which can sleep in wait_for_completion(). Is
> there something I'm missing?
>
> Thanks a lot,
> Alberto

The original GFP flag was GFP_NOIO, but it was replaced with GFP_NOWAIT
in a preliminary release in February 2008, because a user had experienced
a system memory shortage caused by the bio_alloc() call.

Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc(), reducing
the number of bio vectors on each failure, this fallback did not work
well.

I'm in two minds about whether I should change it back to GFP_NOIO, or
whether I should switch the gfp mask as follows:

Thanks,
Ryusuke Konishi

diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 1e68821..6b8f00a 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -306,6 +306,7 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
  * @sb: super block
  * @start: beginning disk block number of this BIO.
  * @nr_vecs: request size of page vector.
+ * @gfp_mask: gfp flags
  *
  * alloc_seg_bio() allocates a new BIO structure and initialize it.
  *
@@ -313,14 +314,14 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
  * On error, NULL is returned.
  */
 static struct bio *nilfs_alloc_seg_bio(struct super_block *sb, sector_t start,
-				       int nr_vecs)
+				       gfp_t gfp_mask, int nr_vecs)
 {
 	struct bio *bio;

-	bio = bio_alloc(GFP_NOWAIT, nr_vecs);
+	bio = bio_alloc(gfp_mask, nr_vecs);
 	if (bio == NULL) {
 		while (!bio && (nr_vecs >>= 1))
-			bio = bio_alloc(GFP_NOWAIT, nr_vecs);
+			bio = bio_alloc(gfp_mask, nr_vecs);
 	}
 	if (likely(bio)) {
 		bio->bi_bdev = sb->s_bdev;
@@ -353,9 +354,14 @@ static int nilfs_submit_bh(struct nilfs_write_info *wi, struct buffer_head *bh,
 repeat:
 	if (!wi->bio) {
 		wi->bio = nilfs_alloc_seg_bio(wi->sb, wi->blocknr + wi->end,
-					      wi->nr_vecs);
-		if (unlikely(!wi->bio))
-			return -ENOMEM;
+					      GFP_NOWAIT, wi->nr_vecs);
+		if (unlikely(!wi->bio)) {
+			wi->bio = nilfs_alloc_seg_bio(wi->sb,
+						      wi->blocknr + wi->end,
+						      GFP_NOIO, 1);
+			if (!wi->bio)
+				return -ENOMEM;
+		}
 	}

 	len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh));

2009-06-14 07:00:18

by Pekka Enberg

Subject: Re: NILFS2 get stuck after bio_alloc() fail

On Sun, Jun 14, 2009 at 9:30 AM, Ryusuke Konishi <[email protected]> wrote:
> The original GFP flag was GFP_NOIO, but it was replaced with GFP_NOWAIT
> in a preliminary release in February 2008, because a user had experienced
> a system memory shortage caused by the bio_alloc() call.
>
> Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc(), reducing
> the number of bio vectors on each failure, this fallback did not work
> well.
>
> I'm in two minds about whether I should change it back to GFP_NOIO, or
> whether I should switch the gfp mask as follows:

As far as I can tell, the only difference between GFP_NOIO and GFP_NOWAIT
here is that the former will trigger the mempool_alloc() page reclaim
path. But I am not sure I understand why switching to GFP_NOWAIT helped
with the memory shortage. What exactly was the problem there?

Pekka

2009-06-14 12:34:54

by Ryusuke Konishi

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Hi,
On Sun, 14 Jun 2009 10:00:06 +0300, Pekka Enberg <[email protected]> wrote:
> On Sun, Jun 14, 2009 at 9:30 AM, Ryusuke Konishi <[email protected]> wrote:
>> The original GFP flag was GFP_NOIO, but it was replaced with GFP_NOWAIT
>> in a preliminary release in February 2008, because a user had experienced
>> a system memory shortage caused by the bio_alloc() call.
>>
>> Even though nilfs_alloc_seg_bio() repeatedly calls bio_alloc(), reducing
>> the number of bio vectors on each failure, this fallback did not work
>> well.
>>
>> I'm in two minds about whether I should change it back to GFP_NOIO, or
>> whether I should switch the gfp mask as follows:
>
> As far as I can tell, the only difference between GFP_NOIO and GFP_NOWAIT
> here is that the former will trigger the mempool_alloc() page reclaim
> path. But I am not sure I understand why switching to GFP_NOWAIT helped
> with the memory shortage. What exactly was the problem there?
>
> Pekka

I will confirm the details later; I cannot dig into the records right
now due to a planned outage at the office. The version in which I
replaced GFP_NOIO with GFP_NOWAIT is considerably old, so the true
problem might have been somewhere else.

For reference, the retry loop in question is as follows. It allocates
a bio for log writing.

	bio = bio_alloc(GFP_NOWAIT, nr_vecs);
	if (bio == NULL) {
		while (!bio && (nr_vecs >>= 1))
			bio = bio_alloc(GFP_NOWAIT, nr_vecs);
	}

where nr_vecs is less than or equal to bio_get_nr_vecs(bdev).


Thanks,
Ryusuke Konishi

2009-06-14 15:34:10

by Leandro Lucarella

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Ryusuke Konishi, on June 14 at 12:45, wrote to me:
> Hi,
> On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote:
> > Hi!
> >
> > While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
> > that they sometimes got stuck in the D state, and the kernel printed
> > the following message:
> >
> > NILFS: IO error writing segment
> >
> > A friend gave me a hand, and after adding some printk()s we found out
> > that the problem seems to occur when the bio_alloc() calls inside
> > nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know
> > how that causes the processes to get stuck.
>
> Thank you for reporting this issue.
>
> Could you get stack dump of the stuck nilfs task?
> It is acquirable as follows if you enabled magic sysrq feature:
>
> # echo t > /proc/sysrq-trigger
>
> I will dig into the process how it got stuck.

Here is (what I thought was) the important stuff:

[...]
kdmflush S dc5abf5c 0 1018 2
dc5abf84 00000046 dc60d780 dc5abf5c c01ad12e dd4d6ed0 dd4d7148 e3504d6e
00003c16 dc8b2560 dc5abf7c c040e24b dd846da0 dc60d7cc dd4d6ed0 dc5abf8c
c040d628 dc5abfd0 c0131dbd dc7fe230 dd4d6ed0 dc5abfa8 dd4d6ed0 dd846da8
Call Trace:
[<c01ad12e>] ? bio_fs_destructor+0xe/0x10
[<c040e24b>] ? down_write+0xb/0x30
[<c040d628>] schedule+0x8/0x20
[<c0131dbd>] worker_thread+0x16d/0x1e0
[<debcba30>] ? dm_wq_work+0x0/0x120 [dm_mod]
[<c0135420>] ? autoremove_wake_function+0x0/0x50
[<c0131c50>] ? worker_thread+0x0/0x1e0
[<c0134fb3>] kthread+0x43/0x80
[<c0134f70>] ? kthread+0x0/0x80
[<c0103513>] kernel_thread_helper+0x7/0x14
[...]
loop0 S dcc7bce0 0 15884 2
d7671f48 00000046 c01ad116 dcc7bce0 dcc7bca0 d4686590 d4686808 b50316ce
000003b8 dc7010a0 c01b0d4f c01b0cf0 dcc7bcec 0c7f3000 00000000 d7671f50
c040d628 d7671fd0 de85391c 00000000 00000000 00000000 dcbbd108 dcbbd000
Call Trace:
[<c01ad116>] ? bio_free+0x46/0x50
[<c01b0d4f>] ? mpage_end_io_read+0x5f/0x70
[<c01b0cf0>] ? mpage_end_io_read+0x0/0x70
[<c040d628>] schedule+0x8/0x20
[<de85391c>] loop_thread+0x1cc/0x490 [loop]
[<de853590>] ? do_lo_send_aops+0x0/0x1c0 [loop]
[<c0135420>] ? autoremove_wake_function+0x0/0x50
[<de853750>] ? loop_thread+0x0/0x490 [loop]
[<c0134fb3>] kthread+0x43/0x80
[<c0134f70>] ? kthread+0x0/0x80
[<c0103513>] kernel_thread_helper+0x7/0x14
segctord D 00000001 0 15886 2
d3847ef4 00000046 c011cefb 00000001 00000001 dcf48fd0 dcf49248 c052b9d0
d50962e4 dc701720 d46871dc d46871e4 c23f180c c23f180c d3847f28 d3847efc
c040d628 d3847f20 c040ed3d c23f1810 dcf48fd0 d46871dc 00000000 c23f180c
Call Trace:
[<c011cefb>] ? dequeue_task_fair+0x27b/0x280
[<c040d628>] schedule+0x8/0x20
[<c040ed3d>] rwsem_down_failed_common+0x7d/0x180
[<c040ee5d>] rwsem_down_write_failed+0x1d/0x30
[<c040eeaa>] call_rwsem_down_write_failed+0x6/0x8
[<c040e25e>] ? down_write+0x1e/0x30
[<decb6299>] nilfs_transaction_lock+0x59/0x100 [nilfs2]
[<decb6d5c>] nilfs_segctor_thread+0xcc/0x2e0 [nilfs2]
[<decb6c80>] ? nilfs_construction_timeout+0x0/0x10 [nilfs2]
[<decb6c90>] ? nilfs_segctor_thread+0x0/0x2e0 [nilfs2]
[<c0134fb3>] kthread+0x43/0x80
[<c0134f70>] ? kthread+0x0/0x80
[<c0103513>] kernel_thread_helper+0x7/0x14
rm D d976bde0 0 16147 1
d976bdf0 00000086 003abc46 d976bde0 c013cc46 c18ad190 c18ad408 00000000
003abc46 dc789900 d976be38 d976bdf0 00000000 d976be30 d976be38 d976bdf8
c040d628 d976be00 c040d67a d976be08 c01668dd d976be24 c040dad7 c01668b0
Call Trace:
[<c013cc46>] ? getnstimeofday+0x56/0x110
[<c040d628>] schedule+0x8/0x20
[<c040d67a>] io_schedule+0x3a/0x70
[<c01668dd>] sync_page+0x2d/0x60
[<c040dad7>] __wait_on_bit+0x47/0x70
[<c01668b0>] ? sync_page+0x0/0x60
[<c0166b08>] wait_on_page_bit+0x98/0xb0
[<c0135470>] ? wake_bit_function+0x0/0x60
[<c016f3e4>] truncate_inode_pages_range+0x244/0x360
[<c01a448c>] ? __mark_inode_dirty+0x2c/0x160
[<decb756c>] ? nilfs_transaction_commit+0x9c/0x170 [nilfs2]
[<c040e27b>] ? down_read+0xb/0x20
[<c016f51a>] truncate_inode_pages+0x1a/0x20
[<deca3e9f>] nilfs_delete_inode+0x9f/0xd0 [nilfs2]
[<deca3e00>] ? nilfs_delete_inode+0x0/0xd0 [nilfs2]
[<c019c082>] generic_delete_inode+0x92/0x150
[<c019c1af>] generic_drop_inode+0x6f/0x1b0
[<c019b457>] iput+0x47/0x50
[<c0194763>] do_unlinkat+0xd3/0x160
[<c0197106>] ? vfs_readdir+0x66/0x90
[<c0196e00>] ? filldir64+0x0/0xf0
[<c01971c6>] ? sys_getdents64+0x96/0xb0
[<c0194913>] sys_unlinkat+0x23/0x50
[<c0102db5>] syscall_call+0x7/0xb
umount D d06bbe6c 0 16727 1
d06bbe7c 00000086 d06bbe58 d06bbe6c c013cc46 dc5ef350 dc5ef5c8 00000000
022bb380 dc6503a0 d06bbec4 d06bbe7c 00000000 d06bbebc d06bbec4 d06bbe84
c040d628 d06bbe8c c040d67a d06bbe94 c01668dd d06bbeb0 c040dad7 c01668b0
Call Trace:
[<c013cc46>] ? getnstimeofday+0x56/0x110
[<c040d628>] schedule+0x8/0x20
[<c040d67a>] io_schedule+0x3a/0x70
[<c01668dd>] sync_page+0x2d/0x60
[<c040dad7>] __wait_on_bit+0x47/0x70
[<c01668b0>] ? sync_page+0x0/0x60
[<c0166b08>] wait_on_page_bit+0x98/0xb0
[<c0135470>] ? wake_bit_function+0x0/0x60
[<c0167494>] wait_on_page_writeback_range+0xa4/0x110
[<c01675a0>] ? __filemap_fdatawrite_range+0x60/0x80
[<c0167534>] filemap_fdatawait+0x34/0x40
[<c016871b>] filemap_write_and_wait+0x3b/0x50
[<c01ae329>] sync_blockdev+0x19/0x20
[<c01a4365>] __sync_inodes+0x45/0x70
[<c01a439d>] sync_inodes+0xd/0x30
[<c01a70d7>] do_sync+0x17/0x70
[<c01a715d>] sys_sync+0xd/0x20
[<c0102db5>] syscall_call+0x7/0xb
[...]

'rm' is the "original" stuck process; 'umount' got stuck after that, when
I tried to unmount the nilfs (it was mounted on a loop device).


Here is the complete trace:
http://pastebin.lugmen.org.ar/4931

Thank you.

--
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
Don't take life to seriously, you won't get out alive

2009-06-14 18:02:57

by Ryusuke Konishi

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Hi Leandro,
On Sun, 14 Jun 2009 12:32:56 -0300, Leandro Lucarella wrote:
> Ryusuke Konishi, on June 14 at 12:45, wrote to me:
>> Hi,
>> On Sat, 13 Jun 2009 22:32:11 -0300, Leandro Lucarella wrote:
>> > Hi!
>> >
>> > While testing nilfs2 (on 2.6.30) with some "cp"s and "rm"s, I noticed
>> > that they sometimes got stuck in the D state, and the kernel printed
>> > the following message:
>> >
>> > NILFS: IO error writing segment
>> >
>> > A friend gave me a hand, and after adding some printk()s we found out
>> > that the problem seems to occur when the bio_alloc() calls inside
>> > nilfs_alloc_seg_bio() fail, making it return NULL; but we don't know
>> > how that causes the processes to get stuck.
>>
>> Thank you for reporting this issue.
>>
>> Could you get stack dump of the stuck nilfs task?
>> It is acquirable as follows if you enabled magic sysrq feature:
>>
>> # echo t > /proc/sysrq-trigger
>>
>> I will dig into the process how it got stuck.
>
> Here is (what I thought was) the important stuff:
<snip>

> 'rm' is the "original" stuck process; 'umount' got stuck after that, when
> I tried to unmount the nilfs (it was mounted on a loop device).
>
> Here is the complete trace:
> http://pastebin.lugmen.org.ar/4931

Thank you for your help.

According to your log, there seems to be a leak in the clearing of the
writeback flag on pages. I will review the error paths of the log
writer to narrow down the cause.

Regards,
Ryusuke Konishi

2009-06-14 18:15:30

by Leandro Lucarella

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Ryusuke Konishi, on June 15 at 03:02, wrote to me:
[snip]
> > Here is the complete trace:
> > http://pastebin.lugmen.org.ar/4931
>
> Thank you for your help.
>
> According to your log, there seems to be a leak in the clearing of the
> writeback flag on pages. I will review the error paths of the log
> writer to narrow down the cause.

Oh! I forgot to tell you that the cleanerd process was not running. When
I mounted the NILFS2 filesystem, the cleanerd daemon said:

nilfs_cleanerd[29534]: start
nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0
nilfs_cleanerd[29534]: shutdown

I thought I might have an old version of the utilities and that this
could be the cause (since the nilfs website was down, I couldn't check),
but I have now checked on the backup nilfs website that the latest
version is 2.0.12, which is what I have (from the Debian package), so
I don't know why cleanerd failed to start in the first place.

--
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)

2009-06-18 17:35:22

by Ryusuke Konishi

Subject: [PATCH] nilfs2: fix hang problem after bio_alloc() failed

Hi Leandro,
On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella wrote:
> Ryusuke Konishi, on June 15 at 03:02, wrote to me:
> [snip]
> > > Here is the complete trace:
> > > http://pastebin.lugmen.org.ar/4931
> >
> > Thank you for your help.
> >
> > According to your log, there seems to be a leak in the clearing of the
> > writeback flag on pages. I will review the error paths of the log
> > writer to narrow down the cause.

Here is a patch that hopefully fixes the hang problem after
bio_alloc() fails. It is composed of three separate patches, but I am
attaching them together here.

If you still see problems with it, please let me know.

> Oh! I forgot to tell you that the cleanerd process was not running. When
> I mounted the NILFS2 filesystem, the cleanerd daemon said:
>
> nilfs_cleanerd[29534]: start
> nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0
> nilfs_cleanerd[29534]: shutdown
>
> I thought I might have an old version of the utilities and that this
> could be the cause (since the nilfs website was down, I couldn't check),
> but I have now checked on the backup nilfs website that the latest
> version is 2.0.12, which is what I have (from the Debian package), so
> I don't know why cleanerd failed to start in the first place.

The message indicates that initialization of cleanerd failed.

I don't know for sure, but another memory allocation failure might
have caused it.

Thanks,
Ryusuke Konishi
---
Ryusuke Konishi (3):
      nilfs2: fix mis-conversion of error code by unlikely directive
      nilfs2: fix hang problem of log writer that follows on write failures
      nilfs2: remove incorrect warning in nilfs_dat_commit_start function

 fs/nilfs2/dat.c     |  9 ---------
 fs/nilfs2/segment.c | 28 +++++++---------------------
 2 files changed, 7 insertions(+), 30 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index bb8a581..e2646c3 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -149,15 +149,6 @@ void nilfs_dat_commit_start(struct inode *dat, struct nilfs_palloc_req *req,
 	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
 					     req->pr_entry_bh, kaddr);
 	entry->de_start = cpu_to_le64(nilfs_mdt_cno(dat));
-	if (entry->de_blocknr != cpu_to_le64(0) ||
-	    entry->de_end != cpu_to_le64(NILFS_CNO_MAX)) {
-		printk(KERN_CRIT
-		       "%s: vbn = %llu, start = %llu, end = %llu, pbn = %llu\n",
-		       __func__, (unsigned long long)req->pr_entry_nr,
-		       (unsigned long long)le64_to_cpu(entry->de_start),
-		       (unsigned long long)le64_to_cpu(entry->de_end),
-		       (unsigned long long)le64_to_cpu(entry->de_blocknr));
-	}
 	entry->de_blocknr = cpu_to_le64(blocknr);
 	kunmap_atomic(kaddr, KM_USER0);

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 22c7f65..e8f188b 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci,
 		err = nilfs_segbuf_write(segbuf, &wi);

 		res = nilfs_segbuf_wait(segbuf, &wi);
-		err = unlikely(err) ? : res;
+		err = unlikely(err) ? err : res;
 		if (unlikely(err))
 			return err;
 	}
 	return 0;
 }

-static int nilfs_page_has_uncleared_buffer(struct page *page)
-{
-	struct buffer_head *head, *bh;
-
-	head = bh = page_buffers(page);
-	do {
-		if (buffer_dirty(bh) && !list_empty(&bh->b_assoc_buffers))
-			return 1;
-		bh = bh->b_this_page;
-	} while (bh != head);
-	return 0;
-}
-
 static void __nilfs_end_page_io(struct page *page, int err)
 {
 	if (!err) {
@@ -1889,12 +1876,11 @@ static void nilfs_end_page_io(struct page *page, int err)
 	if (!page)
 		return;

-	if (buffer_nilfs_node(page_buffers(page)) &&
-	    nilfs_page_has_uncleared_buffer(page))
-		/* For b-tree node pages, this function may be called twice
-		   or more because they might be split in a segment.
-		   This check assures that cleanup has been done for all
-		   buffers in a split btnode page. */
+	if (buffer_nilfs_node(page_buffers(page)) && !PageWriteback(page))
+		/*
+		 * For b-tree node pages, this function may be called twice
+		 * or more because they might be split in a segment.
+		 */
 		return;

 	__nilfs_end_page_io(page, err);
@@ -1957,7 +1943,7 @@ static void nilfs_segctor_abort_write(struct nilfs_sc_info *sci,
 	}
 	if (bh->b_page != fs_page) {
 		nilfs_end_page_io(fs_page, err);
-		if (unlikely(fs_page == failed_page))
+		if (fs_page && fs_page == failed_page)
 			goto done;
 		fs_page = bh->b_page;
 	}
--
1.6.2

2009-06-21 11:08:51

by Andi Kleen

Subject: Re: [PATCH] nilfs2: fix hang problem after bio_alloc() failed

Ryusuke Konishi <[email protected]> writes:
>
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 22c7f65..e8f188b 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci,
> err = nilfs_segbuf_write(segbuf, &wi);
>
> res = nilfs_segbuf_wait(segbuf, &wi);
> - err = unlikely(err) ? : res;
> + err = unlikely(err) ? err : res;

It's very dubious that gcc does anything with unlikely here anyway;
it typically only works directly in conditions being tested.

> if (unlikely(err))
> return err;

Also, gcc generally considers the conditions guarding blocks that
return to be unlikely, so it's actually superfluous.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-21 12:34:26

by Ryusuke Konishi

Subject: Re: [NILFS users] [PATCH] nilfs2: fix hang problem after bio_alloc() failed

Hi Andi,
On Sun, 21 Jun 2009 13:08:43 +0200, Andi Kleen <[email protected]> wrote:
> Ryusuke Konishi <[email protected]> writes:
> >
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index 22c7f65..e8f188b 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -1846,26 +1846,13 @@ static int nilfs_segctor_write(struct nilfs_sc_info *sci,
> > err = nilfs_segbuf_write(segbuf, &wi);
> >
> > res = nilfs_segbuf_wait(segbuf, &wi);
> > - err = unlikely(err) ? : res;
> > + err = unlikely(err) ? err : res;
>
> It's very dubious that gcc does anything with unlikely here anyway;
> it typically only works directly in conditions being tested.

I got it. I'll rewrite the first bugfix so that it just removes the
unlikely directive.

- err = unlikely(err) ? : res;
+ err = err ? : res;


> > if (unlikely(err))
> > return err;
>
> Also, gcc generally considers the conditions guarding blocks that
> return to be unlikely, so it's actually superfluous.

Thanks, I didn't know that. I'll incorporate this as well.

> -Andi
> --
> [email protected] -- Speaking for myself only.

Regards,
Ryusuke Konishi

2009-06-23 04:40:52

by Ryusuke Konishi

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Hi Leandro,
On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella <[email protected]> wrote:
> Ryusuke Konishi, on June 15 at 03:02, wrote to me:
> [snip]
> > > Here is the complete trace:
> > > http://pastebin.lugmen.org.ar/4931
> >
> > Thank you for your help.
> >
> > According to your log, there seems to be a leak in the clearing of the
> > writeback flag on pages. I will review the error paths of the log
> > writer to narrow down the cause.
>
> Oh! I forgot to tell you that the cleanerd process was not running. When
> I mounted the NILFS2 filesystem, the cleanerd daemon said:
>
> nilfs_cleanerd[29534]: start
> nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0
> nilfs_cleanerd[29534]: shutdown
>
> I thought I might have an old version of the utilities and that this
> could be the cause (since the nilfs website was down, I couldn't check),
> but I have now checked on the backup nilfs website that the latest
> version is 2.0.12, which is what I have (from the Debian package), so
> I don't know why cleanerd failed to start in the first place.

Last weekend, several users gave me feedback about the same early
failure of the cleanerd process.

This problem turned out to be caused by an elementary bug in the nilfs
userland library: it did not canonicalize pathnames, which caused the
failure when, for example, the user added an extra slash to the mount
directory name.

We released nilfs2-utils-2.0.13 yesterday to resolve this mess.
If you haven't already, please try the updated package.

Thanks,
Ryusuke Konishi

2009-06-23 13:37:44

by Leandro Lucarella

Subject: Re: NILFS2 get stuck after bio_alloc() fail

Ryusuke Konishi, on June 23 at 13:40, wrote to me:
> Hi Leandro,
> On Sun, 14 Jun 2009 15:13:13 -0300, Leandro Lucarella <[email protected]> wrote:
> > Ryusuke Konishi, on June 15 at 03:02, wrote to me:
> > [snip]
> > > > Here is the complete trace:
> > > > http://pastebin.lugmen.org.ar/4931
> > >
> > > Thank you for your help.
> > >
> > > According to your log, there seems to be a leak in the clearing of
> > > the writeback flag on pages. I will review the error paths of the
> > > log writer to narrow down the cause.
> >
> > Oh! I forgot to tell you that the cleanerd process was not running.
> > When I mounted the NILFS2 filesystem, the cleanerd daemon said:
> >
> > nilfs_cleanerd[29534]: start
> > nilfs_cleanerd[29534]: cannot create cleanerd on /dev/loop0
> > nilfs_cleanerd[29534]: shutdown
> >
> > I thought I might have an old version of the utilities and that this
> > could be the cause (since the nilfs website was down, I couldn't
> > check), but I have now checked on the backup nilfs website that the
> > latest version is 2.0.12, which is what I have (from the Debian
> > package), so I don't know why cleanerd failed to start in the first
> > place.
>
> Last weekend, several users gave me feedback about the same early
> failure of the cleanerd process.
>
> This problem turned out to be caused by an elementary bug in the nilfs
> userland library: it did not canonicalize pathnames, which caused the
> failure when, for example, the user added an extra slash to the mount
> directory name.
>
> We released nilfs2-utils-2.0.13 yesterday to resolve this mess.
> If you haven't already, please try the updated package.

Ok, thanks for your help. I'll try the new nilfs-utils and the patch you
sent when I have some time, and report back with the results.

--
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)