2022-11-01 04:54:24

by Khazhy Kumykov

Subject: Why don't we always grab bfqd->lock for bio_bfqq?

I'm investigating a NULL deref crash in bfq_add_bfqq_busy(), wherein
bfqq->woken_list_node is hashed, but bfqq->waker_bfqq is NULL - which
seems inconsistent per my reading of the code.

I see that bfq_allow_bio_merge() both accesses and modifies
bfqd->bio_bfqq without holding bfqd->lock, which strikes me as odd.
The calls there to bfq_setup_cooperator() and bfq_merge_bfqqs()
seem wrong to me. In particular, I suspect the call to
bfq_merge_bfqqs() can cause the inconsistency seen above, since it's
the only place I've found that modifies bfqq->waker_bfqq without
bfqd->lock.

But I'm curious in general - what's special about bio_bfqq? Should we
grab bfqd->lock when touching it? For example, bfq_request_merge()
also accesses bio_bfqq without grabbing the lock, and there we
traverse bfqq->sort_list - that strikes me as odd as well, though I'm
not fully familiar with the locking conventions here. Especially
since bfqqs can merge, so bio_bfqq is shared, this lockless access
seems wrong.


2022-11-01 05:12:53

by Khazhy Kumykov

Subject: Re: Why don't we always grab bfqd->lock for bio_bfqq?

On Mon, Oct 31, 2022 at 9:37 PM Khazhy Kumykov <[email protected]> wrote:
>
> I'm investigating a NULL deref crash in bfq_add_bfqq_busy(), wherein
> bfqq->woken_list_node is hashed, but bfqq->waker_bfqq is NULL - which
> seems inconsistent per my reading of the code.
>
> I see that bfq_allow_bio_merge() both accesses and modifies
> bfqd->bio_bfqq without holding bfqd->lock, which strikes me as odd.
> The calls there to bfq_setup_cooperator() and bfq_merge_bfqqs()
> seem wrong to me. In particular, I suspect the call to
> bfq_merge_bfqqs() can cause the inconsistency seen above, since it's
> the only place I've found that modifies bfqq->waker_bfqq without
> bfqd->lock.
>
> But I'm curious in general - what's special about bio_bfqq? Should we
> grab bfqd->lock when touching it? For example, bfq_request_merge()
> also accesses bio_bfqq without grabbing the lock, and there we
> traverse bfqq->sort_list - that strikes me as odd as well, though I'm
> not fully familiar with the locking conventions here. Especially
> since bfqqs can merge, so bio_bfqq is shared, this lockless access
> seems wrong.

One note on this front: it does look like in *some* paths we call
blk_mq_sched_allow_merge()/bfq_allow_bio_merge() with the lock held,
and in some paths we do not - e.g. blk_mq_sched_try_merge() gets
called directly by the schedulers (and bfq calls it under the lock).

However, blk_attempt_bio_merge() also calls blk_mq_sched_allow_merge(),
and it's called by blk-mq directly on the submission path
(blk_bio_list_merge <- blk_mq_sched_bio_merge <-
blk_mq_attempt_bio_merge <- blk_mq_get_new_requests <-
blk_mq_submit_bio), so in that path we call bfq_allow_bio_merge()
without bfqd->lock held.

I can also see that bfq_request_merge() gets called under bfqd->lock,
since the only path to ->request_merge() is through
blk_mq_sched_try_merge(), which is called by the schedulers. If I'm
understanding this correctly and these functions are intended to be
called under the lock, perhaps it'd be appropriate to add some
lockdep_assert_held() annotations?

Khazhy

2022-11-01 23:41:13

by Khazhismel Kumykov

Subject: NULL deref crash in bfq_add_bfqq_busy

Still poking around here, I think my previous emails were leading me
down a dead end, at least w.r.t. unsafe locking. The one code path I
identified below (blk_bio_list_merge() via __blk_mq_sched_bio_merge())
won't actually get hit since we'll instead call bfq_bio_merge().

The crashes I'm seeing have been in the read side (stacks below)

[160595.656560] bfq_add_bfqq_busy+0x110/0x1ec
[160595.661142] bfq_add_request+0x6bc/0x980
[160595.666602] bfq_insert_request+0x8ec/0x1240
[160595.671762] bfq_insert_requests+0x58/0x9c
[160595.676420] blk_mq_sched_insert_request+0x11c/0x198
[160595.682107] blk_mq_submit_bio+0x270/0x62c
[160595.686759] __submit_bio_noacct_mq+0xec/0x178
[160595.691926] submit_bio+0x120/0x184
[160595.695990] ext4_mpage_readpages+0x77c/0x7c8
[160595.701026] ext4_readpage+0x60/0xb0
[160595.705158] filemap_read_page+0x54/0x114
[160595.711961] filemap_fault+0x228/0x5f4
[160595.716272] do_read_fault+0xe0/0x1f0
[160595.720487] do_fault+0x40/0x1c8

or

[28497.344552] bfq_add_bfqq_busy+0x110/0x1ec
[28497.348787] bfq_add_request+0x6bc/0x980
[28497.352845] bfq_insert_request+0x8ec/0x1240
[28497.357265] bfq_insert_requests+0x58/0x9c
[28497.361503] blk_mq_sched_insert_requests+0x8c/0x180
[28497.366647] blk_mq_flush_plug_list+0x15c/0x1e0
[28497.371376] blk_flush_plug_list+0xf0/0x124
[28497.375877] blk_finish_plug+0x34/0x48
[28497.379846] read_pages+0x21c/0x288
[28497.383510] page_cache_ra_unbounded+0x1f0/0x24c
[28497.388346] do_page_cache_ra+0x48/0x54
[28497.392388] do_sync_mmap_readahead+0x190/0x1e0
[28497.397150] filemap_fault+0x150/0x5f4
[28497.401111] do_read_fault+0xe0/0x1f0
[28497.404948] do_fault+0x40/0x1c8

In the crashes I'm looking at, it looks like the inconsistent
bfq_queue is oom_bfqq (where waker_bfqq is NULL, but woken_list_node
is hashed)

I'm looking at bfq_init_rq(), where we set bfqq->waker_bfqq
(potentially to NULL) and only ever hlist_add_head(), never
hlist_del_init(). The logic here seems to assume that
bfq_get_bfqq_handle_split() returns a freshly allocated bfq_queue.

However, it can also return oom_bfqq, which may already be on a
woken_list - then we proceed to set waker_bfqq without deleting
woken_list_node, which can result in the inconsistency seen in the
crash. wdyt?

	bfqq = bfq_split_bfqq(bic, bfqq);
	split = true;

	if (!bfqq) {
		bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
						 true, is_sync,
						 NULL); /* can also return oom_bfqq */
		bfqq->waker_bfqq = old_bfqq->waker_bfqq; /* can be set to NULL */
		bfqq->tentative_waker_bfqq = NULL;


An aside on the async case, since I'd like to understand this code
better.
---

The call to bfq_split_bfqq() doesn't make sense to me: for
bfq_get_bfqq_handle_split() et al. we pass the is_sync parameter, but
bfq_split_bfqq() only ever clears bfqq for sync=true. My understanding
is that this is intended to work as follows: we decide to split, we
call bfq_split_bfqq(), which sets bic->bfqq[is_sync] to NULL, then
bfq_get_bfqq_handle_split() sees that it's NULL and allocates a brand
new queue. But for async, bfq_split_bfqq() will return NULL while
bic->bfqq[0] was never cleared, so bfq_get_bfqq_handle_split() will
see that and return the already existing bfqq. We'll then modify a
bfqq that is not freshly allocated.

Should bfq_split_bfqq() here have an is_sync param, and clear the
corresponding bic->bfqq?
---




On Mon, Oct 31, 2022 at 10:05 PM Khazhy Kumykov <[email protected]> wrote:
>
> On Mon, Oct 31, 2022 at 9:37 PM Khazhy Kumykov <[email protected]> wrote:
> >
> > I'm investigating a NULL deref crash in bfq_add_bfqq_busy(), wherein
> > bfqq->woken_list_node is hashed, but bfqq->waker_bfqq is NULL - which
> > seems inconsistent per my reading of the code.

