2009-10-26 18:13:13

by Jan Kara

Subject: [RFC] [PATCH] Avoid livelock for fsync

Hi,

on my way back from Kernel Summit, I've coded the attached patch which
implements livelock avoidance for write_cache_pages. We tag pages that
should be written at the beginning of write_cache_pages and then write
only tagged pages (see the patch for details). The patch is based on Nick's
idea.
The next thing I've aimed at with this patch is a simplification of
current writeback code. Basically, with this patch I think we can just rip
out all the range_cyclic and nr_to_write (or other "fairness logic"). The
rationale is the following:
What we want to achieve with fairness logic is that when a page is
dirtied, it gets written to disk within some reasonable time (like 30s or
so). We track dirty time on a per-inode basis only because keeping it
per-page is simply too expensive. So in this setting fairness between
inodes really does not make any sense - why should a page in a file be
penalized and written later only because there are lots of other dirty
pages in the file? It is enough to make sure that we don't write one file
indefinitely when there are new dirty pages continuously created - and my
patch achieves that.
So with my patch we can make write_cache_pages always write from
range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
after changing balance_dirty_pages() so that a throttled process does not
directly submit the IO (Fengguang has the patches for this), we can
completely remove the nr_to_write logic because nothing really uses it
anymore. Thus also the requeue_io logic should go away etc...
Fengguang, do you have the series somewhere publicly available? You had
plenty of changes there, and quite a few of them are not needed once the
above is done. So could you maybe separate out the balance_dirty_pages
change and I'd base my patch and further simplifications on top of that?
Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (1.87 kB)
0001-mm-Implement-writeback-livelock-avoidance-using-pag.patch (5.86 kB)

2009-10-27 03:39:46

by Nick Piggin

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> Hi,
>
> on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.
> The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
> What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or
> so). We track dirty time on a per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between
> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.
> So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...
> Fengguang, do you have the series somewhere publicly available? You had
> plenty of changes there, and quite a few of them are not needed once the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.

Like I said (and as we concluded when I last posted my tagging patch),
I think this idea should work fine, but there is perhaps a little bit of
overhead/complexity, so provided that we can get some numbers or show a
real improvement in behaviour or code simplification, I think we
could justify the patch.

I would be interested to know how it goes.

Thanks,
Nick

2009-10-27 09:17:59

by Jan Kara

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Tue 27-10-09 04:39:47, Nick Piggin wrote:
> On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> > on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> > The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> > What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> > so). We track dirty time on a per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> > So with my patch we can make write_cache_pages always write from
> > range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> > after changing balance_dirty_pages() so that a throttled process does not
> > directly submit the IO (Fengguang has the patches for this), we can
> > completely remove the nr_to_write logic because nothing really uses it
> > anymore. Thus also the requeue_io logic should go away etc...
> > Fengguang, do you have the series somewhere publicly available? You had
> > plenty of changes there, and quite a few of them are not needed once the
> > above is done. So could you maybe separate out the balance_dirty_pages
> > change and I'd base my patch and further simplifications on top of that?
> > Thanks.
>
> Like I said (and as we concluded when I last posted my tagging patch),
> I think this idea should work fine, but there is perhaps a little bit of
> overhead/complexity, so provided that we can get some numbers or show a
> real improvement in behaviour or code simplification, I think we
> could justify the patch.
Yes, after I rebase my patch on top of Fengguang's work, I'll also write
the cleanup patch so that we can really see how much simpler the code gets,
and we can test what advantages / disadvantages it brings. I'll keep you
updated.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-10-27 13:54:10

by Nikanth Karthikesan

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Monday 26 October 2009 23:43:14 Jan Kara wrote:
> Hi,
>
> on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.

As I understand, livelock can be caused only by dirtying new pages.

So theoretically, if a process can dirty pages faster than we can tag pages
for writeback, even now isn't there a chance for livelock? But if it is really
a very fast operation and livelock is not possible, why not hold the tree_lock
during the entire period of tagging the pages for writeback i.e., call
tag_pages_for_writeback() under mapping->tree_lock? Would it cause
deadlock/starvation or some other serious problems?

Thanks
Nikanth

2009-10-27 15:32:54

by Jan Kara

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Tue 27-10-09 19:26:14, Nikanth Karthikesan wrote:
> On Monday 26 October 2009 23:43:14 Jan Kara wrote:
> > Hi,
> >
> > on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
>
> As I understand, livelock can be caused only by dirtying new pages.
>
> So theoretically, if a process can dirty pages faster than we can tag pages
> for writeback, even now isn't there a chance for livelock? But if it is really
Yes, theoretically the livelock is still there, but in practice I don't
think it's triggerable (the amount of work needed to do either a write(2) or
a page fault is much higher than just looking up a page in the radix tree and
setting one bit there). If the file has lots of dirty pages, I believe the
user can create a few more while we are tagging, but not many...

> a very fast operation and livelock is not possible, why not hold the tree_lock
> during the entire period of tagging the pages for writeback i.e., call
> tag_pages_for_writeback() under mapping->tree_lock? Would it cause
> deadlock/starvation or some other serious problems?
I'm dropping tree_lock because I don't think I can hold it during
pagevec_lookup_tag. Even if that were worked around, if the file has lots of
dirty pages, it could take us long enough to tag all of them that it would
matter latency-wise for other users of the lock. So I'd leave the code as
is.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-10-28 21:47:58

by Andrew Morton

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Mon, 26 Oct 2009 19:13:14 +0100
Jan Kara <[email protected]> wrote:

> Hi,
>
> on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.
> The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
> What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or
> so). We track dirty time on a per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between
> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.
> So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...
> Fengguang, do you have the series somewhere publicly available? You had
> plenty of changes there, and quite a few of them are not needed once the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.
>

I need to think about this. Hard.

So I'll defer that and nitpick the implementation instead ;)

My MUA doesn't understand text/x-patch. Please use text/plain if you
must use attachments?


> +/**
> + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> + * @mapping: address space structure to write
> + * @start: starting page index
> + * @end: ending page index (inclusive)
> + *
> + * This function scans the page range from @start to @end and tags all pages
> + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> + * to identify pages eligible for writeback. This mechanism is used to avoid
> + * livelocking of writeback by a process steadily creating new dirty pages in
> + * the file (thus it is important for this function to be damn quick so that it
> + * can tag pages faster than a dirtying process can create them).
> + */
> +void tag_pages_for_writeback(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + struct pagevec pvec;
> + int nr_pages, i;
> + struct page *page;
> +
> + pagevec_init(&pvec, 0);
> + while (start <= end) {
> + nr_pages = pagevec_lookup_tag(&pvec, mapping, &start,
> + PAGECACHE_TAG_DIRTY,
> + min(end - start, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> + if (!nr_pages)
> + return;
> +
> + spin_lock_irq(&mapping->tree_lock);
> + for (i = 0; i < nr_pages; i++) {
> + page = pvec.pages[i];
> + /* Raced with someone freeing the page? */
> + if (page->mapping != mapping)
> + continue;
> + if (page->index > end)
> + break;
> + radix_tree_tag_set(&mapping->page_tree,
> + page_index(page), PAGECACHE_TAG_TOWRITE);
> + }
> + spin_unlock_irq(&mapping->tree_lock);
> + }
> +}
> +EXPORT_SYMBOL(tag_pages_for_writeback);

This is really inefficient. We do a full tree descent for each dirty
page.

It would be far more efficient to do a combined lookup and set
operation. Basically that's the same as pagevec_lookup_tag(), only we
set the PAGECACHE_TAG_TOWRITE on each page instead of taking a copy
into the pagevec.

Which makes one wonder: would such an operation require ->tree_lock?
pagevec_lookup_tag() just uses rcu_read_lock() - what do we need to do
to use lighter locking in the new
radix_tree_gang_lookup_tag_slot_then_set_a_flag()? Convert tag_set()
and tag_clear() to atomic ops, perhaps?


2009-11-02 03:34:31

by Nick Piggin

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

On Wed, Oct 28, 2009 at 02:47:31PM -0700, Andrew Morton wrote:
> On Mon, 26 Oct 2009 19:13:14 +0100
> Jan Kara <[email protected]> wrote:
>
> > Hi,
> >
> > on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> > The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> > What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> > so). We track dirty time on a per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> > So with my patch we can make write_cache_pages always write from
> > range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> > after changing balance_dirty_pages() so that a throttled process does not
> > directly submit the IO (Fengguang has the patches for this), we can
> > completely remove the nr_to_write logic because nothing really uses it
> > anymore. Thus also the requeue_io logic should go away etc...
> > Fengguang, do you have the series somewhere publicly available? You had
> > plenty of changes there, and quite a few of them are not needed once the
> > above is done. So could you maybe separate out the balance_dirty_pages
> > change and I'd base my patch and further simplifications on top of that?
> > Thanks.
> >
>
> I need to think about this. Hard.
>
> So I'll defer that and nitpick the implementation instead ;)
>
> My MUA doesn't understand text/x-patch. Please use text/plain if you
> must use attachments?
>
>
> > +/**
> > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> > + * @mapping: address space structure to write
> > + * @start: starting page index
> > + * @end: ending page index (inclusive)
> > + *
> > + * This function scans the page range from @start to @end and tags all pages
> > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> > + * to identify pages eligible for writeback. This mechanism is used to avoid
> > + * livelocking of writeback by a process steadily creating new dirty pages in
> > + * the file (thus it is important for this function to be damn quick so that it
> > + * can tag pages faster than a dirtying process can create them).
> > + */
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct pagevec pvec;
> > + int nr_pages, i;
> > + struct page *page;
> > +
> > + pagevec_init(&pvec, 0);
> > + while (start <= end) {
> > + nr_pages = pagevec_lookup_tag(&pvec, mapping, &start,
> > + PAGECACHE_TAG_DIRTY,
> > + min(end - start, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> > + if (!nr_pages)
> > + return;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + for (i = 0; i < nr_pages; i++) {
> > + page = pvec.pages[i];
> > + /* Raced with someone freeing the page? */
> > + if (page->mapping != mapping)
> > + continue;
> > + if (page->index > end)
> > + break;
> > + radix_tree_tag_set(&mapping->page_tree,
> > + page_index(page), PAGECACHE_TAG_TOWRITE);
> > + }
> > + spin_unlock_irq(&mapping->tree_lock);
> > + }
> > +}
> > +EXPORT_SYMBOL(tag_pages_for_writeback);
>
> This is really inefficient. We do a full tree descent for each dirty
> page.
>
> It would be far more efficient to do a combined lookup and set
> operation. Basically that's the same as pagevec_lookup_tag(), only we
> set the PAGECACHE_TAG_TOWRITE on each page instead of taking a copy
> into the pagevec.

I had a radix_tree_gang_set_if_tagged operation in my earlier
patchset, which should basically do this.


> Which makes one wonder: would such an operation require ->tree_lock?
> pagevec_lookup_tag() just uses rcu_read_lock() - what do we need to do
> to use lighter locking in the new
> radix_tree_gang_lookup_tag_slot_then_set_a_flag()? Convert tag_set()
> and tag_clear() to atomic ops, perhaps?

Well that, but the hard part is propagating the tag back to the root
in a coherent way (when other guys are setting and clearing tags in
other nodes). Also, if we have more than a couple of atomic bitops,
then the spinlock will win out in straight line performance (although
scalability could still be better with an unlocked version... but I
think the propagation is the hard part).

2009-11-03 13:14:40

by Fengguang Wu

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

Hi Jan,

Sorry for being late! - for some reason I only just got to see your email.

On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> Hi,
>
> on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.

Yes, tagging is a very fine-grained way to avoid livelock.
However, I doubt this patch can achieve the simplification goals
listed below..

> The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
> What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or

Right.

> so). We track dirty time on a per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between

Right.

> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.

This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
With current policy, it could be

sync 4MB of A
sync B
sync C
sync D
sync 4MB of A
sync 4MB of A
...

And you want to change to

sync A (all 4GB)
sync B
sync C
sync D

This means the writeback of B, C and D won't be able to start at 30s, but
will be delayed to 80s because of A. This is not entirely fair. IMHO the
writeback of big files should not delay small files too much.

> So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...

For the above reason I think we should think twice on removing
nr_to_write and requeue_io()..

> Fengguang, do you have the series somewhere publicly available? You had
> plenty of changes there, and quite a few of them are not needed once the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.

Sorry, I don't maintain a public git tree. However, it's a good idea to
break the big patchset down into smaller pieces, and submit/review them
bit by bit.

I'm on leave tomorrow and will do that after coming back.

Thanks,
Fengguang

2009-11-03 14:56:55

by Jan Kara

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

Hi Fengguang,

On Tue 03-11-09 21:14:34, Wu Fengguang wrote:
> Sorry for being late! - for some reason I only just got to see your email.
No problem :).

> On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> > Hi,
> >
> > on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
>
> Yes, tagging is a very fine-grained way to avoid livelock.
> However, I doubt this patch can achieve the simplification goals
> listed below..
>
> > The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> > What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
>
> Right.
>
> > so). We track dirty time on a per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
>
> Right.
>
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
>
> This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
> With current policy, it could be
>
> sync 4MB of A
> sync B
> sync C
> sync D
> sync 4MB of A
> sync 4MB of A
> ...
>
> And you want to change to
>
> sync A (all 4GB)
> sync B
> sync C
> sync D
>
> This means the writeback of B, C and D won't be able to start at 30s, but
> will be delayed to 80s because of A. This is not entirely fair. IMHO the
> writeback of big files should not delay small files too much.
Yes, I'm aware of this change. It's just that I'm not sure we really
care. There are a few reasons for this: What advantage does it bring that we
are "fair among files"? The user can only tell the difference if, after a crash,
files he wrote a long time ago are still not on disk. But we shouldn't
accumulate too much dirty data (like minutes of writeback) in caches
anyway... So the difference should not be too big. Also, how is the case
"one big and a few small files" different from the case "many small files",
where being fair among files does not bring anything?
It's just that I see some substantial code complexity and also a performance
impact (because of smaller chunks of sequential IO) in trying to be fair
among files, and I don't really see adequate advantages to that approach.
That's why I'm suggesting we should revisit the decision and possibly go in
a different direction.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-11-04 11:32:18

by Fengguang Wu

Subject: Re: [RFC] [PATCH] Avoid livelock for fsync

Jan,

On Tue, Nov 03, 2009 at 10:56:54PM +0800, Jan Kara wrote:
[snip]
> > > inodes really does not make any sense - why should a page in a file be
> > > penalized and written later only because there are lots of other dirty
> > > pages in the file? It is enough to make sure that we don't write one file
> > > indefinitely when there are new dirty pages continuously created - and my
> > > patch achieves that.
> >
> > This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
> > With current policy, it could be
> >
> > sync 4MB of A
> > sync B
> > sync C
> > sync D
> > sync 4MB of A
> > sync 4MB of A
> > ...
> >
> > And you want to change to
> >
> > sync A (all 4GB)
> > sync B
> > sync C
> > sync D
> >
> > This means the writeback of B,C,D won't be able to start at 30s, but
> > delayed to 80s because of A. This is not entirely fair. IMHO writeback
> > of big files shall not delay small files too much.
> Yes, I'm aware of this change. It's just that I'm not sure we really
> care. There are a few reasons for this: What advantage does it bring that we
> are "fair among files"? The user can only tell the difference if, after a crash,

I'm not all that sure, either. The perception is that big files normally
contain less valuable information per page than small files ;)

If we crash, it's much better to lose one single big file than to lose
all the (big and small) files.

Maybe nobody really cares about that - sync() has always worked file
after file (ignoring nr_to_write) and no one has complained.

> files he wrote a long time ago are still not on disk. But we shouldn't
> accumulate too much dirty data (like minutes of writeback) in caches
> anyway... So the difference should not be too big. Also, how is the case
> "one big and a few small files" different from the case "many small files",
> where being fair among files does not bring anything?
> It's just that I see some substantial code complexity and also a performance
> impact (because of smaller chunks of sequential IO) in trying to be fair
> among files, and I don't really see adequate advantages to that approach.
> That's why I'm suggesting we should revisit the decision and possibly go in
> a different direction.

Anyway, if this is not a big concern, nr_to_write could be removed.

Note that requeue_io() (or requeue_io_wait) still cannot be removed
because sometimes we have (temporary) problems writing back an inode.

Thanks,
Fengguang