2013-04-01 15:19:45

by Eric Sandeen

Subject: Re: EXT4 nodelalloc => back to stone age.

On 4/1/13 6:06 AM, Dmitry Monakhov wrote:

> 1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?

...

> 2) Why don't we have writepages for non delalloc case ?

...

I'd add:

3) Why do we have a "nodelalloc" mount option at all?

but then I thought:

Is it also this bad when using the ext4 driver to run an ext3 fs?

-Eric


2013-04-01 15:39:59

by Theodore Ts'o

Subject: Re: EXT4 nodelalloc => back to stone age.

On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
> I'd add:
>
> 3) Why do we have a "nodelalloc" mount option at all?
>
> but then I thought:
>
> Is it also this bad when using the ext4 driver to run an ext3 fs?

Yes, and there would be a similar performance problem if you are
using the ext3 file system driver, since ext3_*_writepage() also ends
up calling block_write_full_page() which will also result in the
writes happening with WRITE_SYNC.

The main reason why we keep nodelalloc at this point is bug-for-bug
compatibility with ext3 file systems --- basically, for users who are
using this as a workaround for the O_PONIES issue instead of fixing
their applications to use fsync() appropriately.

So another question is how much do we care about exact emulation of
ext3's behaviour for those distributions who wish to use ext4 file
system driver for ext2 and ext3 file systems?

One of the reasons for keeping nodelalloc mode was the argument that
removing it wouldn't really allow us to remove that much complexity
from ext4. But adding a nodelalloc-specific ext4_writepages would
result in adding a huge amount of complexity, and my first reaction is
that it's really not worth the code maintenance headache. Dmitry, is
there a reason why you are especially worried about the performance of
nodelalloc mode?

- Ted

2013-04-01 15:45:41

by Chris Mason

Subject: Re: EXT4 nodelalloc => back to stone age.

Quoting Eric Sandeen (2013-04-01 11:18:51)
> On 4/1/13 6:06 AM, Dmitry Monakhov wrote:
>
> > 1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?

Yes? The stuff we wait on should be WRITE_SYNC.

>
> ...
>
> > 2) Why don't we have writepages for non delalloc case ?
>
> ...
>
> I'd add:
>
> 3) Why do we have a "nodelalloc" mount option at all?
>
> but then I thought:
>
> Is it also this bad when using the ext4 driver to run an ext3 fs?

Quick comparison on a single iodrive:

Ext4 (defaults):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.95442 s, 549 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.45012 s, 740 MB/s

Ext4 (nodelalloc):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.97308 s, 361 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.76617 s, 608 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc

XFS gives 628, 733MB/s

Btrfs gives 659, 635MB/s -- since we're doing fsync, this includes all
the crcs for the data.

Ext3 mounted by ext4.ko: 291, 467MB/s
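
For reading the dd figures above: dd reports throughput in decimal
megabytes (1 MB = 1,000,000 bytes), so each MB/s value is just bytes
divided by elapsed seconds. A minimal sketch of that arithmetic follows;
the helper names dd_mb_per_s() and relative_throughput() are made up for
illustration and are not part of dd or the kernel.

```c
#include <assert.h>

/*
 * Throughput as dd prints it: decimal megabytes per second.
 * Hypothetical helper, not part of any tool mentioned in the thread.
 */
static double dd_mb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e6;
}

/*
 * Ratio of one throughput to a baseline, e.g. a nodelalloc run
 * relative to a run with default mount options.
 */
static double relative_throughput(double mb_s, double baseline_mb_s)
{
    assert(baseline_mb_s > 0.0);
    return mb_s / baseline_mb_s;
}
```

With the first-run numbers quoted above, dd_mb_per_s(1073741824, 1.95442)
lands at about 549 and dd_mb_per_s(1073741824, 2.97308) at about 361, so
the first nodelalloc run reaches roughly two thirds of the default run.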

-chris


2013-04-01 15:57:10

by Chris Mason

Subject: Re: EXT4 nodelalloc => back to stone age.

Quoting Chris Mason (2013-04-01 11:45:41)
> Quoting Eric Sandeen (2013-04-01 11:18:51)
> > On 4/1/13 6:06 AM, Dmitry Monakhov wrote:
> >
> > > 1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
>
> Yes? The stuff we wait on should be WRITE_SYNC.
>
> >
> > ...
> >
> > > 2) Why don't we have writepages for non delalloc case ?
> >
> > ...
> >
> > I'd add:
> >
> > 3) Why do we have a "nodelalloc" mount option at all?
> >
> > but then I thought:
> >
> > Is it also this bad when using the ext4 driver to run an ext3 fs?
>
> Quick comparison on a single iodrive:

On the theory that writepages is the problem, try echo 1 >
/sys/block/xxx/queue/rotational. With request merging on here in
nodelalloc mode:

dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53741 s, 423 MB/s

dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.37795 s, 779 MB/s

-chris

2013-04-01 16:00:33

by Eric Sandeen

Subject: Re: EXT4 nodelalloc => back to stone age.

On 4/1/13 10:39 AM, Theodore Ts'o wrote:
> On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
>> I'd add:
>>
>> 3) Why do we have a "nodelalloc" mount option at all?
>>
>> but then I thought:
>>
>> Is it also this bad when using the ext4 driver to run an ext3 fs?
>
> Yes, and there would be a similar performance problem if you are
> using the ext3 file system driver, since ext3_*_writepage() also ends
> up calling block_write_full_page() which will also result in the
> writes happening with WRITE_SYNC.

> The main reason why we keep nodelalloc at this point is bug-for-bug
> compatibility with ext3 file systems --- basically, for users who are
> using this as a workaround for the O_PONIES issue instead of fixing
> their applications to use fsync() appropriately.

Sorry for getting off the original thread here, but IMHO these are
2 different things:

nondelalloc behavior makes sense for ext3, but:
-o nodelalloc mount options don't make sense for ext4.

> So another question is how much do we care about exact emulation of
> ext3's behaviour for those distributions who wish to use ext4 file
> system driver for ext2 and ext3 file systems?
>
> One of the reasons for keeping nodelalloc mode was the argument
> that removing it wouldn't really allow us to remove that much
> complexity from ext4.

IMHO we should keep the mode for ext2/3, but lose the ext4 option.
It'd just be one less row in the ext4 test matrix.

-Eric

> But adding a nodelalloc-specific ext4_writepages
> would result in adding a huge amount of complexity, and my first
> reaction is that it's really not worth the code maintenance headache.
> Dmitry, is there a reason why you are especially worried about the
> performance of nodelalloc mode?
>
> - Ted
>


2013-04-01 16:34:47

by Zheng Liu

Subject: Re: EXT4 nodelalloc => back to stone age.

Hi Eric,

On 04/02/2013 12:00 AM, Eric Sandeen wrote:
> On 4/1/13 10:39 AM, Theodore Ts'o wrote:
>> On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
>>> I'd add:
>>>
>>> 3) Why do we have a "nodelalloc" mount option at all?
>>>
>>> but then I thought:
>>>
>>> Is it also this bad when using the ext4 driver to run an ext3 fs?
>>
>> Yes, and there would be a similar performance problem if you are
>> using the ext3 file system driver, since ext3_*_writepage() also ends
>> up calling block_write_full_page() which will also result in the
>> writes happening with WRITE_SYNC.
>
>> The main reason why we keep nodelalloc at this point is bug-for-bug
>> compatibility with ext3 file systems --- basically, for users who are
>> using this as a workaround for the O_PONIES issue instead of fixing
>> their applications to use fsync() appropriately.
>
> Sorry for getting off the original thread here, but IMHO these are
> 2 different things:
>
> nondelalloc behavior makes sense for ext3, but:
> -o nodelalloc mount options don't make sense for ext4.

nodelalloc makes sense to me. In our production system, we hit a latency
problem caused by the delalloc feature. The workload is a web app
that does append writes (approximately 5M/s) and waits for the flusher
to write them out. We observe that every 30 seconds the latency
reaches a high level (approximately 100-200ms or higher, versus the
normal 10-20ms). The reason is that when the flusher writes dirty pages
out, it takes the i_data_sem lock (write lock) and allocates blocks for
these dirty pages. Meanwhile, the app does append write(2)s that also
try to take the i_data_sem lock (read lock), so the app is delayed.
So I think nodelalloc is still useful for us.
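
The contention described above can be sketched with a toy reader/writer
lock. This is only a single-threaded state model under stated
assumptions: struct toy_rwsem and the toy_* helpers are hypothetical
names, not the kernel's rw_semaphore API, and real waiters would sleep
rather than get a "false" back.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer semaphore state; NOT the kernel's struct rw_semaphore. */
struct toy_rwsem {
    int  readers;   /* number of current read holders */
    bool writer;    /* true while a write holder exists */
};

/* A reader (here: an append write) gets in only if no writer holds the lock. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

/* A writer (here: the flusher allocating blocks) needs exclusive access. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

/* Release the write side; the caller must hold it. */
static void toy_up_write(struct toy_rwsem *sem)
{
    assert(sem->writer);
    sem->writer = false;
}

/* Release one read holder; the caller must hold a read lock. */
static void toy_up_read(struct toy_rwsem *sem)
{
    assert(sem->readers > 0);
    sem->readers--;
}
```

In this model, once toy_down_write_trylock() succeeds for the flusher,
every toy_down_read_trylock() from the writing app fails until
toy_up_write() is called, which corresponds to the stall seen while
blocks are being allocated.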

Regards,
- Zheng



2013-04-02 13:46:34

by Jan Kara

Subject: Re: EXT4 nodelalloc => back to stone age.

On Mon 01-04-13 15:06:18, Dmitry Monakhov wrote:
>
> I've mounted ext4 with -onodelalloc on my SSD (INTEL SSDSA2CW120G3,4PC10362).
> It shows numbers that are slower than an HDD produced 15 years ago:
> # mount $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
> # dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
> 1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
> # dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
> 1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
> blktrace shows horrible traces:

> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
Hum, not sure why you see all the events 4x. But that's not important I
guess.

> As one can see, data is written from two threads, dd and jbd2, on a
> per-page basis, and jbd2 submits pages with WRITE_SYNC, i.e. we write
> page-by-page synchronously :)
>
> Exact calltrace:
> journal_submit_inode_data_buffers
> wbc.sync_mode = WB_SYNC_ALL
> ->generic_writepages
> ->write_cache_pages
> ->ext4_writepage
> ->ext4_bio_write_page
> ->io_submit_add_bh
> ->io_submit_init
> io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
> ->ext4_io_submit(io);
>
> 1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
Actually WRITE_SYNC doesn't mean we write synchronously. We just tell the
IO scheduler that we are going to wait for the IO to complete soon, so it
prioritizes these writes over other async writes. We don't have to use
WRITE_SYNC, but really in this case we do pretty much what IO scheduler
people want - flag IO that's going to be waited upon.
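
To make the distinction concrete, here is a small userspace model of the
io_submit_init() expression quoted above. It is a sketch under stated
assumptions: the MODEL_* flag bits are stand-in values rather than the
kernel's real definitions, and pick_io_op() and is_sync_hint() are
hypothetical helper names. The point is only that the sync flag is a
hint attached to the request, not a blocking call.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's writeback sync modes; values are illustrative. */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Hypothetical flag bits, NOT the kernel's real values. */
#define MODEL_REQ_WRITE 0x1u  /* request is a write */
#define MODEL_REQ_SYNC  0x2u  /* hint: submitter will wait on this IO soon */

#define MODEL_WRITE      (MODEL_REQ_WRITE)
#define MODEL_WRITE_SYNC (MODEL_REQ_WRITE | MODEL_REQ_SYNC)

/* Mirrors the quoted expression: WB_SYNC_ALL selects the sync-hinted write. */
static unsigned int pick_io_op(enum wb_sync_mode sync_mode)
{
    assert(sync_mode == WB_SYNC_NONE || sync_mode == WB_SYNC_ALL);
    return sync_mode == WB_SYNC_ALL ? MODEL_WRITE_SYNC : MODEL_WRITE;
}

/* True when the request carries the "will be waited upon" hint. */
static bool is_sync_hint(unsigned int io_op)
{
    return (io_op & MODEL_REQ_SYNC) != 0;
}
```

Nothing in pick_io_op() waits for anything; the caller still submits the
request asynchronously and only later waits for completion. The hint
just lets the scheduler order it ahead of plain async writes.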

> Why is blk_finish_plug(&plug), which is called from generic_writepages(),
> not enough? As far as I can see this code was copy-pasted from XFS;
> DIO also tags bios with WRITE_SYNC. But what happens if the file
> is highly fragmented (or the block device is RAID0)? We will end up
> doing synchronous IO.
I see you are tracing the DM device. That may actually be somewhat
confusing since you are missing some actions like merges of requests and
dispatches to the underlying device.

> 2) Why don't we have writepages for non delalloc case ?
>
> I want to fix (2) by implementing writepages() for the non-delalloc case.
> Once this is done we may add a new flag WB_SYNC_NOALLOC so
> journal_submit_inode_data_buffers will use
> __filemap_fdatawrite_range(, , , WB_SYNC_ALL| WB_SYNC_NOALLOC)
> which will call optimized ->ext4_writepages()
So what would you expect from ->writepages() implementation?

Anyway the throughput you see looks bad. What kernel version are you using?
There's a possibility my recent changes to ext4_writepage() could have
slowed down something...

Honza
--
Jan Kara
SUSE Labs, CR