On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
>
> Mikulas Patocka <[email protected]> wrote:
> > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > >
> > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > The bug happens when one process is doing sequential buffered writes
> > > > to a block device (or file) and another process is attempting to
> > > > execute sync(), fsync() or direct-IO on that device (or file). This
> > > > syncing process will wait indefinitely, until the first writing
> > > > process finishes.
> > > >
> > > > For example, run these two commands:
> > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > >
> > > > The bug is caused by sequential walking of address space in
> > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > process is constantly making dirty and writeback pages while these
> > > > functions run, the functions will wait on every new page, resulting
> > > > in indefinite wait.
I think the problem has been misidentified, or else I have misread the
code. See below. I hope I'm right, because I think the patches are pretty
heavy on complexity in these already complex paths...
It would help if you explicitly identify the exact livelock. I.e. give a
sequence of behaviour that leads to our progress rate falling to zero.
> > > Shouldn't happen. All the data-syncing functions should have an upper
> > > bound on the number of pages which they attempt to write. In the
> > > example above, we end up in here:
> > >
> > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
> > > start,
> > > loff_t end, int sync_mode)
> > > {
> > > int ret;
> > > struct writeback_control wbc = {
> > > .sync_mode = sync_mode,
> > > .nr_to_write = mapping->nrpages * 2, <<--
> > > .range_start = start,
> > > .range_end = end,
> > > };
> > >
> > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > > to write at most 2* the number of pages which are in cache for that
> > > inode.
> >
> > See write_cache_pages:
> >
> > if (wbc->sync_mode != WB_SYNC_NONE)
> > wait_on_page_writeback(page); (1)
> > if (PageWriteback(page) ||
> > !clear_page_dirty_for_io(page)) {
> > unlock_page(page); (2)
> > continue;
> > }
> > ret = (*writepage)(page, wbc, data);
> > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > unlock_page(page);
> > ret = 0;
> > }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> > done = 1;
> >
> > --- so if it goes by points (1) and (2), the counter is not decremented,
> > yet the function waits for the page. If there is constant stream of
> > writeback pages being generated, it waits on each of them --- that is,
> > forever.
*What* is, forever? Data integrity syncs should have pages operated on
in-order, until we get to the end of the range. Circular writeback could
go through again, possibly, but no more than once.
> > I have seen livelock in this function. For you that example with
> > two dd's, one buffered write and the other directIO read doesn't work?
> > For me it livelocks here.
> >
> > wait_on_page_writeback_range is another example where the livelock
> > happened, there is no protection at all against starvation.
>
> um, OK. So someone else is initiating IO for this inode and this
> thread *never* gets to initiate any writeback. That's a bit of a
> surprise.
>
> How do we fix that? Maybe decrement nr_to_write for these pages as
> well?
What's the actual problem, though? nr_to_write should not be used for
data integrity operations, and it should not be critical for other
writeout. Upper layers should be able to deal with it rather than
have us lying to them.
> > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous
> > thing to me.
> >
> > Imagine this case: You have two pages with indices 4 and 5 dirty in a
> > file. You call fsync(). It sets nr_to_write to 4.
> >
> > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> >
> > The fsync() process goes to write_cache_pages, writes the first 4 dirty
> > pages and exits because it goes over the limit.
> >
> > result --- you violate fsync() semantics, pages that were dirty before
> > call to fsync() are not written when fsync() exits.
Wow, that's really nasty. Sad we still have known data integrity problems
in such core functions.
> yup, that's pretty much unfixable, really, unless new locks are added
> which block threads which are writing to unrelated sections of the
> file, and that could hurt some workloads quite a lot, I expect.
Why is it unfixable? Just ignore nr_to_write, and write out everything
properly, I would have thought.
Some things may go a tad slower, but those are going to be the things
that are using fsync, in which cases they are going to hurt much more
from the loss of data integrity than a slowdown.
Unfortunately because we have played fast and loose for so long, they
expect this behaviour, were tested and optimised with it, and systems
designed and deployed with it, and will notice performance regressions
if we start trying to do things properly. This is one of my main
arguments for doing things correctly up-front, even if it means a
massive slowdown in some real or imagined workload: at least then we
will hear about complaints and be able to try to improve them rather
than setting ourselves up for failure later.
/rant
Anyway, in this case, I don't think there would be really big problems.
Also, I think there is a reasonable optimisation that might improve it
(2nd last point, in attached patch).
OK, so after glancing at the code... wow, it seems like there are a lot
of bugs in there.
On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <[email protected]> wrote:
> > yup, that's pretty much unfixable, really, unless new locks are added
> > which block threads which are writing to unrelated sections of the
> > file, and that could hurt some workloads quite a lot, I expect.
>
> Why is it unfixable? Just ignore nr_to_write, and write out everything
> properly, I would have thought.
That can cause fsync to wait arbitrarily long if some other process is
writing the file. This happens.
On Friday 03 October 2008 12:32, Nick Piggin wrote:
> On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
> >
> > Mikulas Patocka <[email protected]> wrote:
> > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > > >
> > > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > > The bug happens when one process is doing sequential buffered
> > > > > writes to a block device (or file) and another process is
> > > > > attempting to execute sync(), fsync() or direct-IO on that device
> > > > > (or file). This syncing process will wait indefinitely, until the
> > > > > first writing process finishes.
> > > > >
> > > > > For example, run these two commands:
> > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > > >
> > > > > The bug is caused by sequential walking of address space in
> > > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > > process is constantly making dirty and writeback pages while these
> > > > > functions run, the functions will wait on every new page, resulting
> > > > > in indefinite wait.
>
> I think the problem has been misidentified, or else I have misread the
> code. See below. I hope I'm right, because I think the patches are pretty
> heavy on complexity in these already complex paths...
>
> It would help if you explicitly identify the exact livelock. I.e. give a
> sequence of behaviour that leads to our progress rate falling to zero.
>
> > > > Shouldn't happen. All the data-syncing functions should have an upper
> > > > bound on the number of pages which they attempt to write. In the
> > > > example above, we end up in here:
> > > >
> > > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
> > > > start,
> > > > loff_t end, int sync_mode)
> > > > {
> > > > int ret;
> > > > struct writeback_control wbc = {
> > > > .sync_mode = sync_mode,
> > > > .nr_to_write = mapping->nrpages * 2, <<--
> > > > .range_start = start,
> > > > .range_end = end,
> > > > };
> > > >
> > > > so generic_file_direct_write()'s filemap_write_and_wait() will
> > > > attempt to write at most 2* the number of pages which are in cache
> > > > for that inode.
> > >
> > > See write_cache_pages:
> > >
> > > if (wbc->sync_mode != WB_SYNC_NONE)
> > > wait_on_page_writeback(page); (1)
> > > if (PageWriteback(page) ||
> > > !clear_page_dirty_for_io(page)) {
> > > unlock_page(page); (2)
> > > continue;
> > > }
> > > ret = (*writepage)(page, wbc, data);
> > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > > unlock_page(page);
> > > ret = 0;
> > > }
> > > if (ret || (--(wbc->nr_to_write) <= 0))
> > > done = 1;
> > >
> > > --- so if it goes by points (1) and (2), the counter is not
> > > decremented, yet the function waits for the page. If there is constant
> > > stream of writeback pages being generated, it waits on each of them ---
> > > that is, forever.
>
> *What* is, forever? Data integrity syncs should have pages operated on
> in-order, until we get to the end of the range. Circular writeback could
> go through again, possibly, but no more than once.
OK, I have been able to reproduce it somewhat. It is not a livelock,
but what is happening is that direct IO read basically does an fsync
on the file before performing the IO. The fsync gets stuck behind the
dd that is dirtying the pages, and ends up following behind it and
doing all its IO for it.
The following patch avoids the issue for direct IO, by using the range
syncs rather than trying to sync the whole file.
The underlying problem I guess is unchanged. Is it really a problem,
though? The way I'd love to solve it is actually by adding another bit
or two to the pagecache radix tree, that can be used to transiently tag
the tree for future operations. That way we could record the dirty and
writeback pages up front, and then only bother with operating on them.
That's *if* it really is a problem. I don't have much pity for someone
doing buffered IO and direct IO to the same pages of the same file :)
On Friday 03 October 2008 12:40, Andrew Morton wrote:
> On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <[email protected]>
wrote:
> > > yup, that's pretty much unfixable, really, unless new locks are added
> > > which block threads which are writing to unrelated sections of the
> > > file, and that could hurt some workloads quite a lot, I expect.
> >
> > Why is it unfixable? Just ignore nr_to_write, and write out everything
> > properly, I would have thought.
>
> That can cause fsync to wait arbitrarily long if some other process is
> writing the file.
It can be fixed without touching non-fsync paths (see my next email for
the way to fix it without touching fastpaths).
> This happens.
What does such a thing? It would have been nicer to ask them to not do
that then, or get them to use range syncs or something. Now that's much
harder because we've accepted the crappy workaround for so long.
It's far far worse to just ignore data integrity of fsync because of the
problem. Should at least have returned an error from fsync in that case,
or make it interruptible or something.
On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin <[email protected]> wrote:
> On Friday 03 October 2008 12:40, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <[email protected]>
> wrote:
> > > > yup, that's pretty much unfixable, really, unless new locks are added
> > > > which block threads which are writing to unrelated sections of the
> > > > file, and that could hurt some workloads quite a lot, I expect.
> > >
> > > Why is it unfixable? Just ignore nr_to_write, and write out everything
> > > properly, I would have thought.
> >
> > That can cause fsync to wait arbitrarily long if some other process is
> > writing the file.
>
> It can be fixed without touching non-fsync paths (see my next email for
> the way to fix it without touching fastpaths).
>
yup.
>
> > This happens.
>
> What does such a thing?
I forget. People do all sorts of weird stuff.
> It would have been nicer to ask them to not do
> that then, or get them to use range syncs or something. Now that's much
> harder because we've accepted the crappy workaround for so long.
>
> It's far far worse to just ignore data integrity of fsync because of the
> problem. Should at least have returned an error from fsync in that case,
> or make it interruptible or something.
If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000.
Somehow.
I expect there's no solution which avoids blocking the writers at some
stage.
On Friday 03 October 2008 13:14, Andrew Morton wrote:
> On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin <[email protected]>
wrote:
> > On Friday 03 October 2008 12:40, Andrew Morton wrote:
> > > That can cause fsync to wait arbitrarily long if some other process is
> > > writing the file.
> >
> > It can be fixed without touching non-fsync paths (see my next email for
> > the way to fix it without touching fastpaths).
>
> yup.
>
> > > This happens.
> >
> > What does such a thing?
>
> I forget. People do all sorts of weird stuff.
Damn people...
I guess they also do non-weird stuff like expecting fsync to really fsync.
> > It would have been nicer to ask them to not do
> > that then, or get them to use range syncs or something. Now that's much
> > harder because we've accepted the crappy workaround for so long.
> >
> > It's far far worse to just ignore data integrity of fsync because of the
> > problem. Should at least have returned an error from fsync in that case,
> > or make it interruptible or something.
>
> If a file has one dirty page at offset 1000000000000000 then someone
> does an fsync() and someone else gets in first and starts madly writing
> pages at offset 0, we want to write that page at 1000000000000000.
> Somehow.
>
> I expect there's no solution which avoids blocking the writers at some
> stage.
See my other email. Something roughly like this would do the trick
(hey, it actually boots and runs and does fix the problem too).
It's ugly because we don't have quite the right radix tree operations
yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
lookups). But the theory is to up-front tag the pages that we need to
get to disk.
Completely no impact or slowdown to any writers (although it does add
8 bytes of tags to the radix tree node... but doesn't increase memory
footprint as such due to slab).
On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <[email protected]> wrote:
> > I expect there's no solution which avoids blocking the writers at some
> > stage.
>
> See my other email. Something roughly like this would do the trick
> (hey, it actually boots and runs and does fix the problem too).
It needs exclusion to protect all those temp tags. Is do_fsync()'s
i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
the callers of this stuff are running under that lock.
> It's ugly because we don't have quite the right radix tree operations
> yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> lookups). But the theory is to up-front tag the pages that we need to
> get to disk.
Perhaps some callback-calling radix tree walker.
> Completely no impact or slowdown to any writers (although it does add
> 8 bytes of tags to the radix tree node... but doesn't increase memory
> footprint as such due to slab).
Can we reduce the amount of copy-n-pasting here?
On Friday 03 October 2008 13:56, Andrew Morton wrote:
> On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <[email protected]>
wrote:
> > > I expect there's no solution which avoids blocking the writers at some
> > > stage.
> >
> > See my other email. Something roughly like this would do the trick
> > (hey, it actually boots and runs and does fix the problem too).
>
> It needs exclusion to protect all those temp tags. Is do_fsync()'s
> i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
> the callers of this stuff are running under that lock.
Yeah... it does need a lock, which I brushed under the carpet :P
I was going to just say use i_mutex, but then we really would start
impacting on other fastpaths (eg writers).
Possibly a new mutex in the address_space? That way we can say
"anybody who holds this mutex is allowed to use the tag for anything"
and it doesn't have to be fsync specific (whether that would be of
any use to anything else, I don't know).
> > It's ugly because we don't have quite the right radix tree operations
> > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> > lookups). But the theory is to up-front tag the pages that we need to
> > get to disk.
>
> Perhaps some callback-calling radix tree walker.
Possibly, yes. That would make it fairly general. I'll have a look...
> > Completely no impact or slowdown to any writers (although it does add
> > 8 bytes of tags to the radix tree node... but doesn't increase memory
> > footprint as such due to slab).
>
> Can we reduce the amount of copy-n-pasting here?
Yeah... I went to break the sync/async cases into two, but it looks like
it may not have been worthwhile. Just another branch might be the best
way to go.
As far as the c&p in setting the FSYNC tag, yes that should all go away
if the radix-tree is up to scratch. Basically:
radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);
should be able to replace the whole thing, and we'd hold the tree_lock, so
we would not have to take the page lock etc. Basically it would be much
nicer... even somewhere close to a viable solution.
On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin <[email protected]> wrote:
> On Friday 03 October 2008 13:56, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <[email protected]>
> wrote:
> > > > I expect there's no solution which avoids blocking the writers at some
> > > > stage.
> > >
> > > See my other email. Something roughly like this would do the trick
> > > (hey, it actually boots and runs and does fix the problem too).
> >
> > It needs exclusion to protect all those temp tags. Is do_fsync()'s
> > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
> > the callers of this stuff are running under that lock.
>
> Yeah... it does need a lock, which I brushed under the carpet :P
> I was going to just say use i_mutex, but then we really would start
> impacting on other fastpaths (eg writers).
>
> Possibly a new mutex in the address_space?
That's another, umm 24 bytes minimum in the address_space (and inode).
That's fairly ouch, which is why Mikulas did that hokey bit-based
thing.
> That way we can say
> "anybody who holds this mutex is allowed to use the tag for anything"
> and it doesn't have to be fsync specific (whether that would be of
> any use to anything else, I don't know).
>
>
> > > It's ugly because we don't have quite the right radix tree operations
> > > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> > > lookups). But the theory is to up-front tag the pages that we need to
> > > get to disk.
> >
> > Perhaps some callback-calling radix tree walker.
>
> Possibly, yes. That would make it fairly general. I'll have a look...
>
>
> > > Completely no impact or slowdown to any writers (although it does add
> > > 8 bytes of tags to the radix tree node... but doesn't increase memory
> > > footprint as such due to slab).
> >
> > Can we reduce the amount of copy-n-pasting here?
>
> Yeah... I went to break the sync/async cases into two, but it looks like
> it may not have been worthwhile. Just another branch might be the best
> way to go.
Yup. Could add another do-this flag in the writeback_control, perhaps.
Or even a function pointer.
> As far as the c&p in setting the FSYNC tag, yes that should all go away
> if the radix-tree is up to scratch. Basically:
>
> radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);
>
> should be able to replace the whole thing, and we'd hold the tree_lock, so
> we would not have to take the page lock etc. Basically it would be much
> nicer... even somewhere close to a viable solution.
On Friday 03 October 2008 14:17, Andrew Morton wrote:
> On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin <[email protected]>
> > Possibly a new mutex in the address_space?
>
> That's another, umm 24 bytes minimum in the address_space (and inode).
> That's fairly ouch, which is why Mikulas did that hokey bit-based
> thing.
Well yeah, it would be a bit-based mutex in mapping->flags with
hashed waitqueues. Like Mikulas's.
> > Yeah... I went to break the sync/async cases into two, but it looks like
> > it may not have been worthwhile. Just another branch might be the best
> > way to go.
>
> Yup. Could add another do-this flag in the writeback_control, perhaps.
> Or even a function pointer.
Yeah... possibly we could just _always_ do the PAGECACHE_TAG_FSYNC thing
if mode != WB_SYNC_NONE. I think if we had the infrastructure there to
do it all, it should always be something we want to do for data integrity
writeout.
> > *What* is, forever? Data integrity syncs should have pages operated on
> > in-order, until we get to the end of the range. Circular writeback could
> > go through again, possibly, but no more than once.
>
> OK, I have been able to reproduce it somewhat. It is not a livelock,
> but what is happening is that direct IO read basically does an fsync
> on the file before performing the IO. The fsync gets stuck behind the
> dd that is dirtying the pages, and ends up following behind it and
> doing all its IO for it.
>
> The following patch avoids the issue for direct IO, by using the range
> syncs rather than trying to sync the whole file.
>
> The underlying problem I guess is unchanged. Is it really a problem,
> though? The way I'd love to solve it is actually by adding another bit
> or two to the pagecache radix tree, that can be used to transiently tag
> the tree for future operations. That way we could record the dirty and
> writeback pages up front, and then only bother with operating on them.
>
> That's *if* it really is a problem. I don't have much pity for someone
> doing buffered IO and direct IO to the same pages of the same file :)
LVM does (that is where the bug was discovered). Basically, it scans all
the block devices with direct IO and if someone else does buffered IO on
any device simultaneously, it locks up.
That fsync-vs-write livelock is quite improbable (why would some
application do it?) --- although it could be used as a DoS attack,
producing an unkillable process.
But there is another possible real-world problem --- sync-vs-write ---
i.e. admin plugs in two disks and copies data from one to the other.
Meanwhile, some unrelated server process executes sync(). The server goes
into coma until the copy finishes.
Mikulas
On Thu, 2 Oct 2008, Andrew Morton wrote:
> On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <[email protected]> wrote:
>
> > > I expect there's no solution which avoids blocking the writers at some
> > > stage.
> >
> > See my other email. Something roughly like this would do the trick
> > (hey, it actually boots and runs and does fix the problem too).
>
> It needs exclusion to protect all those temp tags. Is do_fsync()'s
> i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
> the callers of this stuff are running under that lock.
That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really
called under i_mutex (see do_fsync).
So the possible solutions are:
1. Add jiffies when the page was dirtied and written back to struct page
+ no impact on locking and concurrency
- increases the structure by 8 bytes
2. Stop the writers when the starvation happens (what I did)
+ doesn't do any locking if the livelock doesn't happen
- locks writers when the livelock happens (I think it's not really serious
--- because very few people complained about the livelock, very few people
will see performance degradation from blocking the writers).
3. Add another bit to radix tree (what Nick did)
+ doesn't ever block writers
- unconditionally takes the lock on fsync path and serializes concurrent
syncs/fsyncs. Probably low overhead too ... or I don't know, is there any
possible situation when more processes execute sync() in parallel and user
would see degradations if those syncs were serialized?
Mikulas
On Friday 03 October 2008 21:43, Mikulas Patocka wrote:
> On Thu, 2 Oct 2008, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <[email protected]>
wrote:
> > > > I expect there's no solution which avoids blocking the writers at
> > > > some stage.
> > >
> > > See my other email. Something roughly like this would do the trick
> > > (hey, it actually boots and runs and does fix the problem too).
> >
> > It needs exclusion to protect all those temp tags. Is do_fsync()'s
> > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
> > the callers of this stuff are running under that lock.
>
> That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really
> called under i_mutex (see do_fsync).
>
> So the possible solutions are:
>
> 1. Add jiffies when the page was dirtied and written back to struct page
> + no impact on locking and concurrency
> - increases the structure by 8 bytes
This one is not practical.
> 2. Stop the writers when the starvation happens (what I did)
> + doesn't do any locking if the livelock doesn't happen
> - locks writers when the livelock happens (I think it's not really serious
> --- because very few people complained about the livelock, very few people
> will see performance degradation from blocking the writers).
Maybe it is because not much actually does sequential writes to a massive
file or block device while trying to fsync it as well? I don't know. You
could still have cases where fsync takes much longer than expected but it
is still not long enough for a user to report it as a "livelock" bug.
> 3. Add another bit to radix tree (what Nick did)
> + doesn't ever block writers
> - unconditionally takes the lock on fsync path and serializes concurrent
> syncs/fsyncs. Probably low overhead too ... or I don't know, is there any
> possible situation when more processes execute sync() in parallel and user
> would see degradations if those syncs were serialized?
On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
> > > *What* is, forever? Data integrity syncs should have pages operated on
> > > in-order, until we get to the end of the range. Circular writeback
> > > could go through again, possibly, but no more than once.
> >
> > OK, I have been able to reproduce it somewhat. It is not a livelock,
> > but what is happening is that direct IO read basically does an fsync
> > on the file before performing the IO. The fsync gets stuck behind the
> > dd that is dirtying the pages, and ends up following behind it and
> > doing all its IO for it.
> >
> > The following patch avoids the issue for direct IO, by using the range
> > syncs rather than trying to sync the whole file.
> >
> > The underlying problem I guess is unchanged. Is it really a problem,
> > though? The way I'd love to solve it is actually by adding another bit
> > or two to the pagecache radix tree, that can be used to transiently tag
> > the tree for future operations. That way we could record the dirty and
> > writeback pages up front, and then only bother with operating on them.
> >
> > That's *if* it really is a problem. I don't have much pity for someone
> > doing buffered IO and direct IO to the same pages of the same file :)
>
> LVM does (that is where the bug was discovered). Basically, it scans all
> the block devices with direct IO and if someone else does buffered IO on
> any device simultaneously, it locks up.
Scans all block devices with direct IO? Hmm, why, I wonder? Should
really consider using buffered (posix_fadvise to readahead/dropbehind).
> That fsync-vs-write livelock is quite improbable (why would some
> application do it?) --- although it could be used as a DoS attack,
> producing an unkillable process.
I'm not sure.
> > LVM does (that is where the bug was discovered). Basically, it scans all
> > the block devices with direct IO and if someone else does buffered IO on
> > any device simultaneously, it locks up.
>
> Scans all block devices with direct IO? Hmm, why, I wonder? Should
> really consider using buffered (posix_fadvise to readahead/dropbehind).
LVM must not allocate any memory when doing IO because it suspends the
block device and memory allocation could trigger writeback on the
suspended device and deadlock.
So it preallocates heap and stack, mlockall()s itself and does direct IO.
Well, there are still two allocations on direct IO path --- one in
__blockdev_direct_IO and the other in dio_bio_alloc. If you have an idea
how to get rid of them (especially that kzalloc(sizeof(*dio), GFP_KERNEL),
that bio allocation would be quite easy to avoid), it would be beneficial.
Mikulas
> > So the possible solutions are:
> >
> > 1. Add jiffies when the page was dirtied and written back to struct page
> > + no impact on locking and concurrency
> > - increases the structure by 8 bytes
>
> This one is not practical.
>
>
> > 2. Stop the writers when the starvation happens (what I did)
> > + doesn't do any locking if the livelock doesn't happen
> > - locks writers when the livelock happens (I think it's not really serious
> > --- because very few people complained about the livelock, very few people
> > will see performance degradation from blocking the writers).
>
> Maybe it is because not much actually does sequential writes to a massive
> file or block device while trying to fsync it as well? I don't know. You
> could still have cases where fsync takes much longer than expected but it
> is still not long enough for a user to report it as a "livelock" bug.
At most twice the time it would normally take (one loop of writeback queue
until it detects the livelock and the other loop until it drains all the
new pages that were created during the first loop).
While with solution (3) it would take only once for the whole writeback
queue.
Mikulas
> > 3. Add another bit to radix tree (what Nick did)
> > + doesn't ever block writers
> > - unconditionally takes the lock on fsync path and serializes concurrent
> > syncs/fsyncs. Probably low overhead too ... or I don't know, is there any
> > possible situation when more processes execute sync() in parallel and user
> > would see degradations if those syncs were serialized?
On Fri, Oct 03, 2008 at 10:31:03PM +1000, Nick Piggin wrote:
> On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
> > LVM does (that is where the bug was discovered). Basically, it scans all
> > the block devices with direct IO and if someone else does buffered IO on
> > any device simultaneously, it locks up.
> Scans all block devices with direct IO? Hmm, why, I wonder? Should
> really consider using buffered (posix_fadvise to readahead/dropbehind).
Scan in the sense of it reading a few sectors from each potential LVM
device to check whether it contains an LVM label.
Alasdair
--
[email protected]
On Fri, Oct 03, 2008 at 09:50:17AM -0400, Mikulas Patocka wrote:
> > > LVM does (that is where the bug was discovered). Basically, it scans all
> > > the block devices with direct IO and if someone else does buffered IO on
> > > any device simultaneously, it locks up.
> > Scans all block devices with direct IO? Hmm, why, I wonder? Should
> > really consider using buffered (posix_fadvise to readahead/dropbehind).
> LVM must not allocate any memory when doing IO because it suspends the
> block device and memory allocation could trigger writeback on the
> suspended device and deadlock.
> So it preallocates heap and stack, mlockall()s itself and does direct IO.
True, but unrelated to the scanning, which LVM performs *prior* to
entering such a state.
We use direct IO while scanning because it's essential all nodes in a
cluster see the same updated version of the data after any node updated
it.
Alasdair
--
[email protected]
On Fri, 3 Oct 2008, Nick Piggin wrote:
>> *What* is, forever? Data integrity syncs should have pages operated on
>> in-order, until we get to the end of the range. Circular writeback could
>> go through again, possibly, but no more than once.
>
> OK, I have been able to reproduce it somewhat. It is not a livelock,
> but what is happening is that direct IO read basically does an fsync
> on the file before performing the IO. The fsync gets stuck behind the
> dd that is dirtying the pages, and ends up following behind it and
> doing all its IO for it.
>
> The following patch avoids the issue for direct IO, by using the range
> syncs rather than trying to sync the whole file.
>
> The underlying problem I guess is unchanged. Is it really a problem,
> though? The way I'd love to solve it is actually by adding another bit
> or two to the pagecache radix tree, that can be used to transiently tag
> the tree for future operations. That way we could record the dirty and
> writeback pages up front, and then only bother with operating on them.
>
> That's *if* it really is a problem. I don't have much pity for someone
> doing buffered IO and direct IO to the same pages of the same file :)
I've seen lots of discussions here about different options in syncing; in
this case a fix is to do an fsync of a range. I've also seen discussions of
how the kernel filesystem code can do ordered writes without having to
wait for them by using barriers. Is this capability exported to
userspace? If so, could you point me at documentation for it?
David Lang
On Fri, 3 Oct 2008, [email protected] wrote:
> On Fri, 3 Oct 2008, Nick Piggin wrote:
>
> > > *What* is, forever? Data integrity syncs should have pages operated on
> > > in-order, until we get to the end of the range. Circular writeback could
> > > go through again, possibly, but no more than once.
> >
> > OK, I have been able to reproduce it somewhat. It is not a livelock,
> > but what is happening is that direct IO read basically does an fsync
> > on the file before performing the IO. The fsync gets stuck behind the
> > dd that is dirtying the pages, and ends up following behind it and
> > doing all its IO for it.
> >
> > The following patch avoids the issue for direct IO, by using the range
> > syncs rather than trying to sync the whole file.
> >
> > The underlying problem I guess is unchanged. Is it really a problem,
> > though? The way I'd love to solve it is actually by adding another bit
> > or two to the pagecache radix tree, that can be used to transiently tag
> > the tree for future operations. That way we could record the dirty and
> > writeback pages up front, and then only bother with operating on them.
> >
> > That's *if* it really is a problem. I don't have much pity for someone
> > doing buffered IO and direct IO to the same pages of the same file :)
>
> I've seen lots of discussions here about different options in syncing. in this
> case a fix is to do a fsync of a range.
It fixes the bug in concurrent direct read+buffered write, but won't fix the
bug with concurrent sync+buffered write.
> I've also seen discussions of how the
> kernel filesystem code can do ordered writes without having to wait for them
> with the use of barriers, is this capability exported to userspace? if so,
> could you point me at documentation for it?
It isn't. And it is good that it isn't --- the more complicated the API, the
more maintenance work.
Mikulas
> David Lang
>
On Sun, 5 Oct 2008, Mikulas Patocka wrote:
> On Fri, 3 Oct 2008, [email protected] wrote:
>
>> I've also seen discussions of how the
>> kernel filesystem code can do ordered writes without having to wait for them
>> with the use of barriers, is this capability exported to userspace? if so,
>> could you point me at documentation for it?
>
> It isn't. And it is good that it isn't --- the more complicated API, the
> more maintenance work.
I can understand that most software would not want to deal with
complications like this, but for things that have requirements similar to
journaling filesystems (databases for example) it would seem that there
would be advantages to exposing these capabilities.
David Lang
On Sun, 5 Oct 2008, [email protected] wrote:
> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
>
> > On Fri, 3 Oct 2008, [email protected] wrote:
> >
> > > I've also seen discussions of how the
> > > kernel filesystem code can do ordered writes without having to wait for
> > > them
> > > with the use of barriers, is this capability exported to userspace? if so,
> > > could you point me at documentation for it?
> >
> > It isn't. And it is good that it isn't --- the more complicated API, the
> > more maintenance work.
>
> I can understand that most software would not want to deal with complications
> like this, but for things that have requirements similar to journaling
> filesystems (databases for example) it would seem that there would be
> advantages to exposing these capabilities.
>
> David Lang
If you invent a new interface that allows submitting several ordered IOs
from userspace, it will require excessive maintenance overhead over a long
period of time. So it is only justified if the performance improvement is
excessive as well.
It should not be like "here you improve 10% performance on some synthetic
benchmark in one application that was rewritten to support the new
interface" and then create a few more security vulnerabilities (because of
the complexity of the interface) and damage overall Linux progress,
because everyone is catching bugs in the new interface and checking it for
correctness.
Mikulas
On Sun, 5 Oct 2008, Mikulas Patocka wrote:
> On Sun, 5 Oct 2008, [email protected] wrote:
>
>> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
>>
>>> On Fri, 3 Oct 2008, [email protected] wrote:
>>>
>>>> I've also seen discussions of how the
>>>> kernel filesystem code can do ordered writes without having to wait for
>>>> them
>>>> with the use of barriers, is this capability exported to userspace? if so,
>>>> could you point me at documentation for it?
>>>
>>> It isn't. And it is good that it isn't --- the more complicated API, the
>>> more maintenance work.
>>
>> I can understand that most software would not want to deal with complications
>> like this, but for things thta have requirements similar to journaling
>> filesystems (databases for example) it would seem that there would be
>> advantages to exposing this capabilities.
>>
>> David Lang
>
> If you invent new interface that allows submitting several ordered IOs
> from userspace, it will require excessive maintenance overhead over long
> period of time. So it should be only justified, if the performance
> improvement is excessive as well.
>
> It should not be like "here you improve 10% performance on some synthetic
> benchmark in one application that was rewritten to support the new
> interface" and then create a few more security vulnerabilities (because of
> the complexity of the interface) and damage overall Linux progress,
> because everyone is catching bugs in the new interface and checking it for
> correctness.
the same benchmarks that show that it's far better for the in-kernel
filesystem code to use write barriers should apply for FUSE filesystems.
this isn't a matter of a few % in performance; if an application is
sync-limited in a way that can be converted to write-ordered, the potential
is for the application to speed up by many times.
programs that maintain indexes or caches of data that lives in other files
will be able to do write data && barrier && write index && fsync, and double
their performance vs write data && fsync && write index && fsync.
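The two orderings above can be sketched as follows; `barrier()` is a purely hypothetical primitive (no such userspace syscall exists) and is stubbed out here only to contrast the number of blocking waits in each pattern:

```python
import os

def barrier():
    # Hypothetical ordering primitive -- no such syscall exists. It would
    # order the writes on either side of it without blocking the caller.
    pass

def update_with_fsync(data_fd, index_fd, data, index):
    # Today's pattern: two blocking waits, one per ordering point.
    os.write(data_fd, data)
    os.fsync(data_fd)           # wait #1: data must be durable first
    os.write(index_fd, index)
    os.fsync(index_fd)          # wait #2: then the index

def update_with_barrier(data_fd, index_fd, data, index):
    # Proposed pattern: a single blocking wait at the end.
    os.write(data_fd, data)
    barrier()                   # orders data before index, without waiting
    os.write(index_fd, index)
    os.fsync(index_fd)          # the only wait
```

Halving the number of waits on the synchronous path is where the claimed doubling of performance would come from.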
databases can potentially do even better: today they need to fsync data to
disk before they can update their journal to indicate that the data has
been written. with a barrier they could order the writes so that the write
to the journal doesn't happen until the writes of the data complete; they
would never need to call an fsync at all (when emptying the journal).
for systems without solid-state drives or battery-backed caches, the
ability to eliminate fsyncs by being able to rely on the order of the
writes is a huge benefit.
David Lang
> > If you invent new interface that allows submitting several ordered IOs
> > from userspace, it will require excessive maintenance overhead over long
> > period of time. So it should be only justified, if the performance
> > improvement is excessive as well.
> >
> > It should not be like "here you improve 10% performance on some synthetic
> > benchmark in one application that was rewritten to support the new
> > interface" and then create a few more security vulnerabilities (because of
> > the complexity of the interface) and damage overall Linux progress,
> > because everyone is catching bugs in the new interface and checking it for
> > correctness.
>
> the same benchmarks that show that it's far better for the in-kernel
> filesystem code to use write barriers should apply for FUSE filesystems.
FUSE is slow by design, and it is used in cases where performance isn't
crucial.
> this isn't a matter of a few % in performance, if an application is
> sync-limited in a way that can be converted to write-ordered the potential is
> for the application to speed up by many times.
>
> programs that maintain indexes or caches of data that lives in other files
> will be able to write data && barrier && write index && fsync and double their
> performance vs write data && fsync && write index && fsync
They can do: write data with O_SYNC; write another piece of data with
O_SYNC.
And the only difference from barriers is the waiting time after the first
O_SYNC before the second I/O is submitted (such delay wouldn't happen with
barriers).
And now that I/O delay is in milliseconds while process wakeup time is tens
of microseconds, it doesn't look like eliminating the wakeup time would
gain more than a few percent.
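The O_SYNC sequence described above might look like this minimal sketch (paths and payloads are illustrative; Linux semantics of `O_SYNC` assumed):

```python
import os

def ordered_writes_osync(path_a, path_b, first, second):
    # Each O_SYNC write returns only once its data is on stable storage,
    # so the second write cannot reach the disk before the first does.
    # Ordering is guaranteed, at the price of a full I/O wait between the
    # two writes -- exactly the delay a barrier would avoid.
    fd = os.open(path_a, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        os.write(fd, first)        # blocks until durable
    finally:
        os.close(fd)
    fd = os.open(path_b, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        os.write(fd, second)       # issued only after the first completed
    finally:
        os.close(fd)
```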
> databases can potentially do even better, today they need to fsync data to
> disk before they can update their journal to indicate that the data has been
> written, with a barrier they could order the writes so that the write to the
> journal doesn't happen until the writes of the data. they would never need to
> call an fsync at all (when emptying the journal)
Good databases can pack several user transactions into one fsync() write.
If the database server is properly engineered, it accumulates all user
transactions committed so far into one chunk, writes that chunk with one
fsync() call and then reports successful commit to the clients.
So if you increase fsync() latency, it should have no effect on
transactional throughput --- only on the latency of transactions. Similarly,
if you decrease fsync() latency, it won't increase the number of processed
transactions.
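The group-commit scheme described above can be sketched as a simple batching step (a toy model, not a real database log):

```python
import os

def group_commit(log_fd, transactions):
    # Accumulate every transaction committed so far into one chunk and
    # make the whole batch durable with a single fsync(); only then are
    # all the batched commits acknowledged to clients. A slower fsync()
    # therefore raises commit latency but not throughput: more
    # transactions simply accumulate into the next batch.
    chunk = b"".join(transactions)
    os.write(log_fd, chunk)
    os.fsync(log_fd)               # one wait, amortized over the batch
    return len(transactions)       # number of commits now safe to ack
```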
Certainly, there are primitive embedded database libraries that fsync()
after each transaction, but they don't have good performance anyway.
> for systems without solid-state drives or battery-backed caches, the ability
> to eliminate fsyncs by being able to rely on the order of the writes is a huge
> benefit.
I may ask --- where are the applications that are actually hurt by slow
fsync() latency? Databases are not among them; they batch transactions.
If you want to improve things, you can try:
* implement O_DSYNC (like O_SYNC, but doesn't update inode mtime)
* implement range_fsync and range_fdatasync (sync on a file range --- the
kernel already has support for that; you can just add a syscall)
* turn on the FUA bit for O_DSYNC writes; that eliminates the need to flush
the drive cache on each O_DSYNC call
--- these are definitely less invasive than new I/O submitting interface.
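Part of the data-only semantics sketched above is already reachable through fdatasync(2); a hedged illustration (path and payload are made up):

```python
import os

def write_and_sync_data(path, data):
    # fdatasync() flushes the file's data and only the metadata needed to
    # retrieve it (such as file size), skipping pure bookkeeping updates
    # like mtime -- close to the data-only behaviour described above for
    # O_DSYNC writes, but as an explicit call rather than an open() flag.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```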
Mikulas
> David Lang
>
On Tue, 7 Oct 2008, Mikulas Patocka wrote:
>>> If you invent new interface that allows submitting several ordered IOs
>>> from userspace, it will require excessive maintenance overhead over long
>>> period of time. So it should be only justified, if the performance
>>> improvement is excessive as well.
>>>
>>> It should not be like "here you improve 10% performance on some synthetic
>>> benchmark in one application that was rewritten to support the new
>>> interface" and then create a few more security vulnerabilities (because of
>>> the complexity of the interface) and damage overall Linux progress,
>>> because everyone is catching bugs in the new interface and checking it for
>>> correctness.
>>
>> the same benchmarks that show that it's far better for the in-kernel
>> filesystem code to use write barriers should apply for FUSE filesystems.
>
> FUSE is slow by design, and it is used in cases where performance isn't
> crucial.
FUSE is slow, but I don't believe that it's a design goal for it to be
slow; it's a limitation of the implementation, so things that could speed
it up would be a good thing.
>> this isn't a matter of a few % in performance, if an application is
>> sync-limited in a way that can be converted to write-ordered the potential is
>> for the application to speed up by many times.
>>
>> programs that maintain indexes or caches of data that lives in other files
>> will be able to write data && barrier && write index && fsync and double their
>> performance vs write data && fsync && write index && fsync
>
> They can do: write data with O_SYNC; write another piece of data with
> O_SYNC.
>
> And the only difference from barriers is the waiting time after the first
> O_SYNC before the second I/O is submitted (such delay wouldn't happen with
> barriers).
>
> And now I/O delay is in milliseconds and process wakeup time is tens of
> microseconds, it doesn't look like eliminating process wakeup time would
> do more than few percents.
each sync write needs to wait for a disk rotation (and a seek if you are
writing to different files). if you only do two writes you save one disk
rotation; if you do five writes you save four disk rotations.
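The rotational arithmetic behind this claim, assuming a 7200 RPM disk (an illustrative figure, not one from the thread):

```python
RPM = 7200
ROTATION_MS = 60_000 / RPM          # one revolution: ~8.33 ms

def sync_chain_ms(n_writes):
    # n dependent O_SYNC writes: each one waits for the platter to come
    # around again before the next can be issued.
    return n_writes * ROTATION_MS

def barrier_chain_ms(n_writes):
    # With ordering barriers, only the final fsync() pays the wait.
    return ROTATION_MS
```

Two writes save one rotation, five writes save four, so the gap grows linearly with the length of the dependent chain.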
>> databases can potentially do even better, today they need to fsync data to
>> disk before they can update their journal to indicate that the data has been
>> written, with a barrier they could order the writes so that the write to the
>> journal doesn't happen until the writes of the data. they would never need to
>> call an fsync at all (when emptying the journal)
>
> Good databases can pack several user transactions into one fsync() write.
> If the database server is properly engineered, it accumulates all user
> transactions committed so far into one chunk, writes that chunk with one
> fsync() call and then reports successful commit to the clients.
if there are multiple users doing transactions at the same time they will
benefit from overlapping the fsyncs, but each user session cannot complete
its transaction until the fsync completes.
> So if you increase fsync() latency, it should have no effect on the
> transactional throughput --- only on latency of transactions. Similarly,
> if you decrease fsync() latency, it won't increase number of processed
> transactions.
only if you have all your transactions happening in parallel. in the real
world programs sometimes need to wait for one transaction to complete so
that they can do the next one.
> Certainly, there are primitive embedded database libraries that fsync()
> after each transaction, but they don't have good performance anyway.
>
>> for systems without solid-state drives or battery-backed caches, the ability
>> to eliminate fsyncs by being able to rely on the order of the writes is a huge
>> benifit.
>
> I may ask --- where are the applications that require extra slow fsync()
> latency? Databases are not that, they batch transactions.
>
> If you want to improve things, you can try:
> * implement O_DSYNC (like O_SYNC, but doesn't update inode mtime)
> * implement range_fsync and range_fdatasync (sync on file range --- the
> kernel has already support for that, you can just add a syscall)
> * turn on FUA bit for O_DSYNC writes, that eliminates the need to flush
> drive cache in O_DSYNC call
>
> --- these are definitely less invasive than new I/O submitting interface.
but all of these require that the application stop and wait for each
separate write to take place before proceeding to the next step.
if this doesn't matter, then why the big push to have the in-kernel
filesystems start using barriers? I understood that this resulted in large
performance increases in the places where they are used, just from being
able to avoid draining the entire request queue, and you are saying
that the applications would not only need to wait for the queue to flush,
but for the disk to acknowledge the write.
syncs are slow, in some cases _very_ slow.
David Lang