LinuxLists.cc - [GIT PULL] io_uring changes for 5.9-rc1

2020-08-02 21:42:37

Subject: [GIT PULL] io_uring changes for 5.9-rc1

Hi Linus,

Lots of cleanups in here, hardening the code and/or making it easier to
read and fixing buts, but a core feature/change too adding support for
real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for the
fast path. In detail:

- Cleanup how memory accounting is done on ring setup/free (Bijan)

- sq array offset calculation fixup (Dmitry)

- Consistently handle blocking off O_DIRECT submission path (me)

- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries from
task_work, like we handle poll based retry. (me)

- IO completion optimizations (me)

- Fix race with accounting and ring fd install (me)

- Support EPOLLEXCLUSIVE (Jiufei)

- Get rid of the io_kiocb unionizing, made possible by shrinking other
bits (Pavel)

- Completion side cleanups (Pavel)

- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

- Request environment grabbing cleanups (Pavel)

- File and socket read/write cleanups (Pavel)

- Improve kiocb_set_rw_flags() (Pavel)

- Tons of fixes and cleanups (Pavel)

- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)

This will throw a few merge conflicts. One is due to the IOCB_NOIO
addition that happened late in 5.8-rc, the other is due to a change in
for-5.9/block. Both are trivial to fixup, I'm attaching my merge
resolution when I pulled it in locally.

Please pull!

The following changes since commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5:

io_uring: fix lockup in io_fail_links() (2020-07-24 12:51:33 -0600)

are available in the Git repository at:

git://git.kernel.dk/linux-block.git tags/for-5.9/io_uring-20200802

for you to fetch changes up to fa15bafb71fd7a4d6018dae87cfaf890fd4ab47f:

io_uring: flip if handling after io_setup_async_rw (2020-08-01 11:02:57 -0600)

----------------------------------------------------------------
for-5.9/io_uring-20200802

----------------------------------------------------------------
Bijan Mottahedeh (4):
io_uring: add wrappers for memory accounting
io_uring: rename ctx->account_mem field
io_uring: report pinned memory usage
io_uring: separate reporting of ring pages from registered pages

Dan Carpenter (1):
io_uring: fix a use after free in io_async_task_func()

Dmitry Vyukov (1):
io_uring: fix sq array offset calculation

Jens Axboe (31):
block: provide plug based way of signaling forced no-wait semantics
io_uring: always plug for any number of IOs
io_uring: catch -EIO from buffered issue request failure
io_uring: re-issue block requests that failed because of resources
mm: allow read-ahead with IOCB_NOWAIT set
mm: abstract out wake_page_match() from wake_page_function()
mm: add support for async page locking
mm: support async buffered reads in generic_file_buffered_read()
fs: add FMODE_BUF_RASYNC
block: flag block devices as supporting IOCB_WAITQ
xfs: flag files as supporting buffered async reads
btrfs: flag files as supporting buffered async reads
mm: add kiocb_wait_page_queue_init() helper
io_uring: support true async buffered reads, if file provides it
Merge branch 'async-buffered.8' into for-5.9/io_uring
io_uring: provide generic io_req_complete() helper
io_uring: add 'io_comp_state' to struct io_submit_state
io_uring: pass down completion state on the issue side
io_uring: pass in completion state to appropriate issue side handlers
io_uring: enable READ/WRITE to use deferred completions
io_uring: use task_work for links if possible
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: clean up io_kill_linked_timeout() locking
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: abstract out task work running
io_uring: use new io_req_task_work_add() helper throughout
io_uring: only call kfree() for a non-zero pointer
io_uring: get rid of __req_need_defer()
io_uring: remove dead 'ctx' argument and move forward declaration
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: don't touch 'ctx' after installing file descriptor

Jiufei Xue (2):
io_uring: change the poll type to be 32-bits
io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior

Pavel Begunkov (90):
io_uring: remove setting REQ_F_MUST_PUNT in rw
io_uring: remove REQ_F_MUST_PUNT
io_uring: set @poll->file after @poll init
io_uring: kill NULL checks for submit state
io_uring: fix NULL-mm for linked reqs
io-wq: compact io-wq flags numbers
io-wq: return next work from ->do_work() directly
io_uring: fix req->work corruption
io_uring: fix punting req w/o grabbed env
io_uring: fix feeding io-wq with uninit reqs
io_uring: don't mark link's head for_async
io_uring: fix missing io_grab_files()
io_uring: fix refs underflow in io_iopoll_queue()
io_uring: remove inflight batching in free_many()
io_uring: dismantle req early and remove need_iter
io_uring: batch-free linked requests as well
io_uring: cosmetic changes for batch free
io_uring: kill REQ_F_LINK_NEXT
io_uring: clean up req->result setting by rw
io_uring: do task_work_run() during iopoll
io_uring: fix iopoll -EAGAIN handling
io_uring: fix missing wake_up io_rw_reissue()
io_uring: deduplicate freeing linked timeouts
io_uring: replace find_next() out param with ret
io_uring: kill REQ_F_TIMEOUT
io_uring: kill REQ_F_TIMEOUT_NOSEQ
io_uring: fix potential use after free on fallback request free
io_uring: don't pass def into io_req_work_grab_env
io_uring: do init work in grab_env()
io_uring: factor out grab_env() from defer_prep()
io_uring: do grab_env() just before punting
io_uring: don't fail iopoll requeue without ->mm
io_uring: fix NULL mm in io_poll_task_func()
io_uring: simplify io_async_task_func()
io_uring: optimise io_req_find_next() fast check
io_uring: fix missing ->mm on exit
io_uring: fix mis-refcounting linked timeouts
io_uring: keep queue_sqe()'s fail path separately
io_uring: fix lost cqe->flags
io_uring: don't delay iopoll'ed req completion
io_uring: fix stopping iopoll'ing too early
io_uring: briefly loose locks while reaping events
io_uring: partially inline io_iopoll_getevents()
io_uring: remove nr_events arg from iopoll_check()
io_uring: don't burn CPU for iopoll on exit
io_uring: rename sr->msg into umsg
io_uring: use more specific type in rcv/snd msg cp
io_uring: extract io_sendmsg_copy_hdr()
io_uring: replace rw->task_work with rq->task_work
io_uring: simplify io_req_map_rw()
io_uring: add a helper for async rw iovec prep
io_uring: follow **iovec idiom in io_import_iovec
io_uring: share completion list w/ per-op space
io_uring: rename ctx->poll into ctx->iopoll
io_uring: use inflight_entry list for iopoll'ing
io_uring: use completion list for CQ overflow
io_uring: add req->timeout.list
io_uring: remove init for unused list
io_uring: use non-intrusive list for defer
io_uring: remove sequence from io_kiocb
io_uring: place cflags into completion data
io_uring: inline io_req_work_grab_env()
io_uring: remove empty cleanup of OP_OPEN* reqs
io_uring: alloc ->io in io_req_defer_prep()
io_uring/io-wq: move RLIMIT_FSIZE to io-wq
io_uring: simplify file ref tracking in submission state
io_uring: indent left {send,recv}[msg]()
io_uring: remove extra checks in send/recv
io_uring: don't forget cflags in io_recv()
io_uring: free selected-bufs if error'ed
io_uring: move BUFFER_SELECT check into *recv[msg]
io_uring: extract io_put_kbuf() helper
io_uring: don't open-code recv kbuf managment
io_uring: don't miscount pinned memory
io_uring: return locked and pinned page accounting
tasks: add put_task_struct_many()
io_uring: batch put_task_struct()
io_uring: don't do opcode prep twice
io_uring: deduplicate io_grab_files() calls
io_uring: mark ->work uninitialised after cleanup
io_uring: fix missing io_queue_linked_timeout()
io-wq: update hash bits
io_uring: de-unionise io_kiocb
io_uring: deduplicate __io_complete_rw()
io_uring: fix racy overflow count reporting
io_uring: fix stalled deferred requests
io_uring: consolidate *_check_overflow accounting
io_uring: get rid of atomic FAA for cq_timeouts
fs: optimise kiocb_set_rw_flags()
io_uring: flip if handling after io_setup_async_rw

Randy Dunlap (1):
io_uring: fix function args for !CONFIG_NET

Xiaoguang Wang (1):
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works

block/blk-core.c | 6 +
fs/block_dev.c | 2 +-
fs/btrfs/file.c | 2 +-
fs/io-wq.c | 14 +-
fs/io-wq.h | 11 +-
fs/io_uring.c | 2588 +++++++++++++++++++++++------------------
fs/xfs/xfs_file.c | 2 +-
include/linux/blkdev.h | 1 +
include/linux/fs.h | 26 +-
include/linux/pagemap.h | 75 ++
include/linux/sched/task.h | 6 +
include/uapi/linux/io_uring.h | 4 +-
mm/filemap.c | 110 +-
tools/io_uring/liburing.h | 6 +-
14 files changed, 1658 insertions(+), 1195 deletions(-)

--
Jens Axboe

Attachments:

merge.txt (3.38 kB)

2020-08-03 20:49:47

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <[email protected]> wrote:
>
> Lots of cleanups in here, hardening the code and/or making it easier to
> read and fixing buts, but a core feature/change too adding support for
> real async buffered reads. With the latter in place, we just need
> buffered write async support and we're done relying on kthreads for the
> fast path. In detail:

That async buffered reads handling the the page locking flag is a
mess, and I'm really happy I committed my page locking scalability
change early, so that the conflicts there caught it.

Re-using the page bit waitqueue types and exporting them?

That part is fine, I guess, particularly since it came from the
wait_bit_key thing and have a smell of being generic.

Taking a random part of wake_page_function(), and calling it
"wake_page_match()" even though that's not at all that it does?

Not ok.

Adding random kiocb helper functions to a core header file, when they
are only used in one place, and when they only make sense in that one
place?

Not ok.

When the function is called "wake_page_match()", you'd expect it
matches the wake page information, wouldn't it?

Yeah, it did that. And then it also checked whether the bit we're
waiting had been set again, because everybody ostensibly wanted that.
Except they don't any more, and that's not what the name really
implied anyway.

And kiocb_wait_page_queue_init() has absolutely zero business being in
<linux/filemap.h>. There are absolutely no valid uses of that thing
outside of the one place that calls it.

I tried to fix up the things I could.

That said, like a lot of io_uring code, this is some seriously opaque
code. You say you've done a lot of cleanups, but I'm not convinced
those cleanups are in any way offsetting adding yet another union (how
many bugs did the last one add?) and a magic flag of "use this part of
the union" now.

And I don't know what loads you use for testing that thing, or what
happens when the "lock_page_async()" case actually fails to lock, and
just calls back the io_async_buf_func() wakeup function when the page
has unlocked...

That function doesn't actually lock the page either, but does the task
work. I hope that work then knows to do the right thing, but it's
really opaque and hard to follow.

Anyway, I'm not entirely happy with doing these kinds of changes in
the merge resolution, but the alternative was to not do the pull at
all, and require you to do a lot of cleanups before I would pull it.
Maybe I should have done that.

So this is a slightly grumpy email about how io_uring is (a) still
making me very nervous about a very lackadaisical approach to things,
and having the codepaths so obscure that I'm not convinced it's not
horribly buggy. And (b) I fixed things up without really being able to
test them. I tested that the _normal_ paths still seem to work fine,
but..

I really think that whole thing needs a lot of comments, particularly
around the whole io_rw_should_retry() area.

A bit and legible comment about how it will be caught by the
generic_file_buffered_read() page locking code, how the two cases
differ (it might get caught by the "I'm just waiting for it to be
unlocked", but it could *also* get caught by the "lock page now"
case), and how it continues the request.

As it is, it bounces between the generic code and very io_uring
specific code in strange and not easy to follow ways.

I've pushed out my merge of this thing, but you might also want to
take a look at commit 2a9127fcf229 ("mm: rewrite
wait_on_page_bit_common() logic"). I particular, the comment about how
there's no point in even testing the page bit any more when you get
woken up.

I left that

if (test_bit(key->bit_nr, &key->page->flags))
return -1;

logic in io_async_buf_func() (but it's not in "wake_page_match()" any
more), but I suspect it's bogus and pointless, for the same reason
that it isn't done for normal page waits now. Maybe it's better to
just queue the actual work regardless, it will then be caught in the
_real_ lock_page() or whatever it ends up doing - and if it only
really wants to see the "uptodate" bit being set, and was just waiting
for IO to finish, then it never really cared about the page lock bit
at all, it just wanted to be notified about IO being done.

So this was a really long email to tell you - again - that I'm not
happy with how fragile io_uring is, and how the code seems to be
almost intentionally written to *be* fragile. Complex and hard to
understand, and as a result it has had a fairly high rate of fairly
nasty bugs.

I'm hoping this isn't going to be yet another case of "nasty bugs
because of complexity and a total disregard for explaining what is
going on".

Linus

2020-08-03 20:56:14

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
<[email protected]> wrote:
>
> I've pushed out my merge of this thing [..]

It seems I'm not the only one unhappy with the pull request.

For some reason I also don't see pr-tracker-bot being all happy and
excited about it. I wonder why.

Linus

2020-08-03 21:11:46

by Konstantin Ryabitsev

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I've pushed out my merge of this thing [..]
>
> It seems I'm not the only one unhappy with the pull request.
>
> For some reason I also don't see pr-tracker-bot being all happy and
> excited about it. I wonder why.

My guess it's because the body consists of two text/plain MIME-parts and
Python returned the merge.txt part first, where we didn't find what we
were looking for.

I'll see if I can teach it to walk all text/plain parts looking for
magic git pull strings instead of giving up after the first one.

-K

2020-08-03 22:30:51

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 2:48 PM, Linus Torvalds wrote:
> On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <[email protected]> wrote:
>>
>> Lots of cleanups in here, hardening the code and/or making it easier to
>> read and fixing buts, but a core feature/change too adding support for
>> real async buffered reads. With the latter in place, we just need
>> buffered write async support and we're done relying on kthreads for the
>> fast path. In detail:
>
> That async buffered reads handling the the page locking flag is a
> mess, and I'm really happy I committed my page locking scalability
> change early, so that the conflicts there caught it.
>
> Re-using the page bit waitqueue types and exporting them?
>
> That part is fine, I guess, particularly since it came from the
> wait_bit_key thing and have a smell of being generic.
>
> Taking a random part of wake_page_function(), and calling it
> "wake_page_match()" even though that's not at all that it does?
>
> Not ok.

OK, I actually thought it was kind of nice and better than having that
code duplicated in two spots.

> Adding random kiocb helper functions to a core header file, when they
> are only used in one place, and when they only make sense in that one
> place?
>
> Not ok.

I'll move that into io_uring instead.

> When the function is called "wake_page_match()", you'd expect it
> matches the wake page information, wouldn't it?
>
> Yeah, it did that. And then it also checked whether the bit we're
> waiting had been set again, because everybody ostensibly wanted that.
> Except they don't any more, and that's not what the name really
> implied anyway.
>
> And kiocb_wait_page_queue_init() has absolutely zero business being in
> <linux/filemap.h>. There are absolutely no valid uses of that thing
> outside of the one place that calls it.
>
> I tried to fix up the things I could.

Thanks! As mentioned, I'll prep a cleanup patch that moves the
kiocb_wait_page_queue_init() out of there.

> That said, like a lot of io_uring code, this is some seriously opaque
> code. You say you've done a lot of cleanups, but I'm not convinced
> those cleanups are in any way offsetting adding yet another union (how
> many bugs did the last one add?) and a magic flag of "use this part of
> the union" now.

I had to think a bit about what you are referring to here, but I guess
it's the iocb part. And yes, it's not ideal, but until we support polled
IO with buffered IO, then there's no overlap in the use case. And I
don't see us ever doing that.

> And I don't know what loads you use for testing that thing, or what
> happens when the "lock_page_async()" case actually fails to lock, and
> just calls back the io_async_buf_func() wakeup function when the page
> has unlocked...
>
> That function doesn't actually lock the page either, but does the task
> work. I hope that work then knows to do the right thing, but it's
> really opaque and hard to follow.

The task work retries the whole thing, so it'll go through the normal
page cache read path again with all that that entails. We only really
ever use the callback to tell us when it's a good idea to retry again,
there's no other retained state there at all.

I didn't realize that part wasn't straight forward, so I'll add some
comments as well explaining how that code flow works.

It's seen a good amount of testing, both from myself and also from
others. The postgres IO rewrite has been putting it through its paces,
and outside of a few initial issues months ago, it's been rock solid.

> Anyway, I'm not entirely happy with doing these kinds of changes in
> the merge resolution, but the alternative was to not do the pull at
> all, and require you to do a lot of cleanups before I would pull it.
> Maybe I should have done that.
>
> So this is a slightly grumpy email about how io_uring is (a) still
> making me very nervous about a very lackadaisical approach to things,
> and having the codepaths so obscure that I'm not convinced it's not
> horribly buggy. And (b) I fixed things up without really being able to
> test them. I tested that the _normal_ paths still seem to work fine,
> but..

I need to do a better job at commenting these parts, obviously. And
while nothing is perfect, and we're definitely perfect yet, the general
trend is definitely strongly towards getting rid of odd states through
flags and unifying more of the code, and tons of fixes/cleanups that
make things easier to read and verify...

> I really think that whole thing needs a lot of comments, particularly
> around the whole io_rw_should_retry() area.
>
> A bit and legible comment about how it will be caught by the
> generic_file_buffered_read() page locking code, how the two cases
> differ (it might get caught by the "I'm just waiting for it to be
> unlocked", but it could *also* get caught by the "lock page now"
> case), and how it continues the request.

Noted, I'll write that up.

> As it is, it bounces between the generic code and very io_uring
> specific code in strange and not easy to follow ways.
>
> I've pushed out my merge of this thing, but you might also want to
> take a look at commit 2a9127fcf229 ("mm: rewrite
> wait_on_page_bit_common() logic"). I particular, the comment about how
> there's no point in even testing the page bit any more when you get
> woken up.
>
> I left that
>
> if (test_bit(key->bit_nr, &key->page->flags))
> return -1;
>
> logic in io_async_buf_func() (but it's not in "wake_page_match()" any
> more), but I suspect it's bogus and pointless, for the same reason
> that it isn't done for normal page waits now. Maybe it's better to
> just queue the actual work regardless, it will then be caught in the
> _real_ lock_page() or whatever it ends up doing - and if it only
> really wants to see the "uptodate" bit being set, and was just waiting
> for IO to finish, then it never really cared about the page lock bit
> at all, it just wanted to be notified about IO being done.

I just did notice your rewrite commit, and I'll adjust accordingly and
test it with that too.

> So this was a really long email to tell you - again - that I'm not
> happy with how fragile io_uring is, and how the code seems to be
> almost intentionally written to *be* fragile. Complex and hard to
> understand, and as a result it has had a fairly high rate of fairly
> nasty bugs.
>
> I'm hoping this isn't going to be yet another case of "nasty bugs
> because of complexity and a total disregard for explaining what is
> going on".

Outside of the review from Johannes, lots of other people did look over
the async buffered bits, and Andrew as well said it look good to him. So
while the task_work retry apparently isn't as obvious as I had hoped,
it's definitely not fragile or intentionally trying to be obtuse.

I'll make a few adjustments based on your feedback, and add a patch with
some comments as well. Hopefully that'll make the end result easier to
follow.

--
Jens Axboe

2020-08-03 22:32:06

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 3:10 PM, Konstantin Ryabitsev wrote:
> On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> I've pushed out my merge of this thing [..]
>>
>> It seems I'm not the only one unhappy with the pull request.
>>
>> For some reason I also don't see pr-tracker-bot being all happy and
>> excited about it. I wonder why.
>
> My guess it's because the body consists of two text/plain MIME-parts and
> Python returned the merge.txt part first, where we didn't find what we
> were looking for.
>
> I'll see if I can teach it to walk all text/plain parts looking for
> magic git pull strings instead of giving up after the first one.

Thanks, I was a bit puzzled on that one too, and this time it definitely
wasn't because the tag wasn't there.

In terms of attachments, I'm usually a fan of inlining, but seemed cleaner
to me to attach the merge resolution as there's already a ton of other
stuff in that email.

--
Jens Axboe

2020-08-03 23:19:20

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 4:30 PM, Jens Axboe wrote:
>> Adding random kiocb helper functions to a core header file, when they
>> are only used in one place, and when they only make sense in that one
>> place?
>>
>> Not ok.
>
> I'll move that into io_uring instead.

I see that you handled most of the complaints already, so thanks for
that. I've run some basic testing with master and it works for me,
running some more testing on production too.

I took a look at the rewrite you queued up, and made a matching change
on the io_uring side:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

and also queued a documentation patch for the retry logic and the
callback handler:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8

For the latter, let me know if you're happy with the explanation, or if
you want other parts documented more thoroughly too. I'll make a pass
through the file in any case once I've flushed out the other branches
for this merge window in.

--
Jens Axboe

2020-08-03 23:32:55

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 5:18 PM, Jens Axboe wrote:
> On 8/3/20 4:30 PM, Jens Axboe wrote:
>>> Adding random kiocb helper functions to a core header file, when they
>>> are only used in one place, and when they only make sense in that one
>>> place?
>>>
>>> Not ok.
>>
>> I'll move that into io_uring instead.
>
> I see that you handled most of the complaints already, so thanks for
> that. I've run some basic testing with master and it works for me,
> running some more testing on production too.
>
> I took a look at the rewrite you queued up, and made a matching change
> on the io_uring side:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

Updated to honor exclusive return value as well:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=f6fd3784c9f7d3309507fdb6dcc818f54467bf3e

--
Jens Axboe

2020-08-03 23:35:04

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <[email protected]> wrote:
>
>
> I took a look at the rewrite you queued up, and made a matching change
> on the io_uring side:

Oh, no, you made it worse.

Now you're tying your odd wakeup routine to entirely irrelevant things
that can't even happen to you.

That io_async_buf_func() will never be called for any entry that isn't
your own, so testing

wait->flags & WQ_FLAG_EXCLUSIVE

is completely pointless, because you never set that flag. And
similarly, for you to then do

wait->flags |= WQ_FLAG_WOKEN;

is equally pointless, because the only thing that cares and looks at
that wait entry is you, and you don't care about the WOKEN flag.

So that patch shows a fundamental misunderstanding of how the
waitqueues actually work.

Which is kind of my _point_. The io_uring code that hooked into the
page wait queues really looks like complete cut-and-paste voodoo
programming.

It needs comments. It's hard to follow. Even somebody like me, who
actually knows how the page wait queues really work, have a really
hard time following how io_uring initializing a wait-queue entry and
pointing to it in the io ctx then interacts with the (later) generic
file reading path, and how it then calls back at unlock time to the
io_uring callback _if_ the page was locked.

And that patch you point to makes me 100% sure you don't quite
understand the code either.

So when you do

/*
* Only test the bit if it's an exclusive wait, as we know the
* bit is cleared for non-exclusive waits. Also see mm/filemap.c
*/
if ((wait->flags & WQ_FLAG_EXCLUSIVE) &&
test_and_set_bit(key->bit_nr, &key->page->flags))
return -1;

the first test guarantees that the second test is never done, which is
good, because if it *had* been done, you'd have taken the lock and
nothing you have actually expects that.

So the fix is to just remove those lines entirely. If somebody
unlocked the page you care about, and did a wakeup on that page and
bit, then you know you should start the async worker. Noi amount of
testing bits matters at all.

And similarly, the

wait->flags |= WQ_FLAG_WOKEN;

is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait
entry is _your_ wait entry. It's not the wait entry of some normal
page locker - those use wake_page_function().

Now *if* you had workers that actually expected to be woken up with
the page lock already held, and owning it, then that kind of
WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But
that's not what you have.

> and also queued a documentation patch for the retry logic and the
> callback handler:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8

Better. Although I find the first comment a bit misleading.

You say

/* Invoked from our "page is now unlocked" handler when someone ..

but that's not really the case. The function gets called by whoever
unlocks the page after you've registered that page wait entry through
lock_page_async().

So there's no "our handler" anywhere, which I find misleading and
confusing in the comment.

Linus

2020-08-03 23:44:42

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 5:34 PM, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <[email protected]> wrote:
>>
>>
>> I took a look at the rewrite you queued up, and made a matching change
>> on the io_uring side:
>
> Oh, no, you made it worse.
>
> Now you're tying your odd wakeup routine to entirely irrelevant things
> that can't even happen to you.
>
> That io_async_buf_func() will never be called for any entry that isn't
> your own, so testing
>
> wait->flags & WQ_FLAG_EXCLUSIVE
>
> is completely pointless, because you never set that flag. And
> similarly, for you to then do
>
> wait->flags |= WQ_FLAG_WOKEN;
>
> is equally pointless, because the only thing that cares and looks at
> that wait entry is you, and you don't care about the WOKEN flag.
>
> So that patch shows a fundamental misunderstanding of how the
> waitqueues actually work.
>
> Which is kind of my _point_. The io_uring code that hooked into the
> page wait queues really looks like complete cut-and-paste voodoo
> programming.
>
> It needs comments. It's hard to follow. Even somebody like me, who
> actually knows how the page wait queues really work, have a really
> hard time following how io_uring initializing a wait-queue entry and
> pointing to it in the io ctx then interacts with the (later) generic
> file reading path, and how it then calls back at unlock time to the
> io_uring callback _if_ the page was locked.
>
> And that patch you point to makes me 100% sure you don't quite
> understand the code either.
>
> So when you do
>
> /*
> * Only test the bit if it's an exclusive wait, as we know the
> * bit is cleared for non-exclusive waits. Also see mm/filemap.c
> */
> if ((wait->flags & WQ_FLAG_EXCLUSIVE) &&
> test_and_set_bit(key->bit_nr, &key->page->flags))
> return -1;
>
> the first test guarantees that the second test is never done, which is
> good, because if it *had* been done, you'd have taken the lock and
> nothing you have actually expects that.
>
> So the fix is to just remove those lines entirely. If somebody
> unlocked the page you care about, and did a wakeup on that page and
> bit, then you know you should start the async worker. Noi amount of
> testing bits matters at all.
>
> And similarly, the
>
> wait->flags |= WQ_FLAG_WOKEN;
>
> is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait
> entry is _your_ wait entry. It's not the wait entry of some normal
> page locker - those use wake_page_function().
>
> Now *if* you had workers that actually expected to be woken up with
> the page lock already held, and owning it, then that kind of
> WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But
> that's not what you have.

Yes, looks like I was a bit too trigger happy without grokking the whole
thing, and got it mixed up with the broader more generic waitqueue
cases. Thanks for clueing me in, I've updated the patch so the use case
is inline with only what io_uring is doing here.

>> and also queued a documentation patch for the retry logic and the
>> callback handler:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8
>
> Better. Although I find the first comment a bit misleading.
>
> You say
>
> /* Invoked from our "page is now unlocked" handler when someone ..
>
> but that's not really the case. The function gets called by whoever
> unlocks the page after you've registered that page wait entry through
> lock_page_async().
>
> So there's no "our handler" anywhere, which I find misleading and
> confusing in the comment.

The 'handler' refers to the io_uring waitqueue callback, but I should
probably spell that out. I'll adjust it.

--
Jens Axboe

2020-08-03 23:50:24

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <[email protected]> wrote:
>
> Updated to honor exclusive return value as well:

See my previous email, You're just adding code that makes no sense,
because your wait entry fundamentally isn't an exclusive one.

So all that code is a no-op and only makes it more confusing to read.

Your wakeup handler has _nothing_ to do with the generic
wake_page_function(). There is _zero_ overlap. Your wakeup handler
gets called only for the wait entries _you_ created.

Trying to use the wakeup logic from wake_page_function() makes no
sense, because the rules for wake_page_function() are entirely
different. Yes, they are called for the same thing (somebody unlocked
a page and is waking up waiters), but it's using a completely
different sleeping logic.

See? When wake_page_function() does that

wait->flags |= WQ_FLAG_WOKEN;

and does something different (and returns different values) depending
on whether WQ_FLAG_EXCLUSIVE was set, that is all because
wait_on_page_bit_common() entry set yo that wait entry (on its stack)
with those exact rules in mind.

So the wakeup function is 1:1 tied to the code that registers the wait
entry. wait_on_page_bit_common() has one set of rules, that are then
honored by the wakeup function it uses. But those rules have _zero_
impact on your use. You can have - and you *do* have - different sets
of rules.

For example, none of your wakeups are ever exclusive. All you do is
make a work runnable - that doesn't mean that other people shouldn't
do other things when they get a "page was unlocked" wakeup
notification.

Also, for you "list_del_init()" is fine, because you never do the
unlocked "list_empty_careful()" on that wait entry. All the waitqueue
operations run under the queue head lock.

So what I think you _should_ do is just something like this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2a3af95be4ca..1e243f99643b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct
wait_queue_entry *wait, unsigned mode,
if (!wake_page_match(wpq, key))
return 0;

- /* Stop waking things up if the page is locked again */
- if (test_bit(key->bit_nr, &key->page->flags))
- return -1;
-
+ /*
+ * Somebody unlocked the page. Unqueue the wait entry
+ * and run the task_work
+ */
list_del_init(&wait->entry);

init_task_work(&req->task_work, io_req_task_submit);

because that matches what you're actually doing.

There's no reason to stop waking up others because the page is locked,
because you don't know what others want.

And there's never any reason for the exclusive thing, b3ecause none of
what you do guarantees that you take exclusive ownership of the page
lock. Running the work *may* end up doing a "lock_page()", but you
don't actually guarantee that.

Linus

2020-08-03 23:58:57

by Jens Axboe

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On 8/3/20 5:49 PM, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <[email protected]> wrote:
>>
>> Updated to honor exclusive return value as well:
>
> See my previous email, You're just adding code that makes no sense,
> because your wait entry fundamentally isn't an exclusive one.

Right, I get that now, it's just dead code for my use case. It was sent
out before your previous email.

> So all that code is a no-op and only makes it more confusing to read.
>
> Your wakeup handler has _nothing_ to do with the generic
> wake_page_function(). There is _zero_ overlap. Your wakeup handler
> gets called only for the wait entries _you_ created.
>
> Trying to use the wakeup logic from wake_page_function() makes no
> sense, because the rules for wake_page_function() are entirely
> different. Yes, they are called for the same thing (somebody unlocked
> a page and is waking up waiters), but it's using a completely
> different sleeping logic.
>
> See? When wake_page_function() does that
>
> wait->flags |= WQ_FLAG_WOKEN;
>
> and does something different (and returns different values) depending
> on whether WQ_FLAG_EXCLUSIVE was set, that is all because
> wait_on_page_bit_common() entry set yo that wait entry (on its stack)
> with those exact rules in mind.
>
> So the wakeup function is 1:1 tied to the code that registers the wait
> entry. wait_on_page_bit_common() has one set of rules, that are then
> honored by the wakeup function it uses. But those rules have _zero_
> impact on your use. You can have - and you *do* have - different sets
> of rules.
>
> For example, none of your wakeups are ever exclusive. All you do is
> make a work runnable - that doesn't mean that other people shouldn't
> do other things when they get a "page was unlocked" wakeup
> notification.
>
> Also, for you "list_del_init()" is fine, because you never do the
> unlocked "list_empty_careful()" on that wait entry. All the waitqueue
> operations run under the queue head lock.
>
> So what I think you _should_ do is just something like this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2a3af95be4ca..1e243f99643b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct
> wait_queue_entry *wait, unsigned mode,
> if (!wake_page_match(wpq, key))
> return 0;
>
> - /* Stop waking things up if the page is locked again */
> - if (test_bit(key->bit_nr, &key->page->flags))
> - return -1;
> -
> + /*
> + * Somebody unlocked the page. Unqueue the wait entry
> + * and run the task_work
> + */
> list_del_init(&wait->entry);
>
> init_task_work(&req->task_work, io_req_task_submit);
>
> because that matches what you're actually doing.
>
> There's no reason to stop waking up others because the page is locked,
> because you don't know what others want.
>
> And there's never any reason for the exclusive thing, b3ecause none of
> what you do guarantees that you take exclusive ownership of the page
> lock. Running the work *may* end up doing a "lock_page()", but you
> don't actually guarantee that.

What I ended up with after the last email was just removing the test
bit:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32

and I clarified the comments on the io_async_buf_func() to add more
hints on how everything is triggered instead of just a vague "handler"
reference:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455

--
Jens Axboe

2020-08-04 00:13:35

by Linus Torvalds

[permalink] [raw]

Subject: Re: [GIT PULL] io_uring changes for 5.9-rc1

On Mon, Aug 3, 2020 at 4:56 PM Jens Axboe <[email protected]> wrote:
>
> What I ended up with after the last email was just removing the test
> bit:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32
>
> and I clarified the comments on the io_async_buf_func() to add more
> hints on how everything is triggered instead of just a vague "handler"
> reference:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455

These both look sensible to me now.

Linus