Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751463AbdCCFOo (ORCPT ); Fri, 3 Mar 2017 00:14:44 -0500 Received: from mx2.suse.de ([195.135.220.15]:53747 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750975AbdCCFOi (ORCPT ); Fri, 3 Mar 2017 00:14:38 -0500 X-Amavis-Alert: BAD HEADER SECTION, Header field occurs more than once: "Cc" occurs 7 times From: NeilBrown To: Jens Axboe Date: Fri, 03 Mar 2017 16:14:15 +1100 Cc: LKML Cc: Jinpu Wang Cc: Lars Ellenberg Cc: Kent Overstreet Cc: Pavel Machek Cc: Mike Snitzer Cc: Mikulas Patocka Subject: [PATCH] blk: improve order of bio handling in generic_make_request() Message-ID: <87h93blz6g.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6386 Lines: 152 --=-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable [ Hi Jens, you might have seen assorted email threads recently about deadlocks, particular in dm-snap or md/raid1/10. Also about the excess of rescuer threads. I think a big part of the problem is my ancient improvement to generic_make_request to queue bios and handle them in a strict FIFO order. As described below, that can cause problems which individual block devices cannot fix themselves without punting to various threads. This patch does not fix everything, but provides a basis that drives can build on to create dead-lock free solutions without excess threads. If you accept this, I will look into improving at least md and bio_alloc_set() to be less dependant on rescuer threads. Thanks, NeilBrown ] To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, it we can be certain that all previously submited request for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequenc= es. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it comp= letes, any bios for a different device are placed on the front of the main queue, = followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same = level. This, by itself, it not enough to remove the deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempt= ed. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Tested-by: Jinpu Wang Inspired-by: Lars Ellenberg Signed-off-by: NeilBrown =2D-- block/blk-core.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index b9e857f4afe8..ef55f210dd7c 100644 =2D-- a/block/blk-core.c +++ b/block/blk-core.c @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q =3D bdev_get_queue(bio->bi_bdev); =20 if (likely(blk_queue_enter(q, false) =3D=3D 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + bio_list_init(&hold); + bio_list_merge(&hold, &bio_list_on_stack); + bio_list_init(&bio_list_on_stack); ret =3D q->make_request_fn(q, bio); =20 blk_queue_exit(q); =20 + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio =3D bio_list_pop(&bio_list_on_stack)) !=3D NULL) + if (q =3D=3D bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); + bio =3D bio_list_pop(current->bio_list); } else { struct bio *bio_next =3D bio_list_pop(current->bio_list); =2D-=20 2.11.0 --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEG8Yp69OQ2HB7X0l6Oeye3VZigbkFAli4+6cACgkQOeye3VZi gbnigA//dF7kPS2stFvem+M7Gowvxe/gJajD4q7Ge8EHohtpjK3WnyeIsXMcLE9M F6dFPiQvNRXeR7eB7hN82t7NETkF+GcILz4iiIKqGz971vOse0yMNNVKWNblatEP Rt3LH+xNuBd6X9u/tFT79plBUrw9C+Ykb8sm2mJk84Bi8iNS2L67csk9TxDjgajc Hte3ehN5wQIOt60M5Ya0duDf+lhZvSJ3FNs2xq4VacKFSFx/zQeVvIpmzUvekIJ2 +wERS6NiFRY72iWCPiujqz+oOS2+gKZ2C+eujZq5L7YnAERL30s+xa10wwmZnuLu 8GUEwZPJj2ow2Fg8kqQtmkUQ4YVn1wFyHSj+I0spGfzu7JotViZxO+HzgQlWmmma IC/8xAmIVRJ9rP6U808JJIGcQbOYslf7OcEmaahipV+bTs9sjVGhCv8+FJANUtoj orFkvVY3cnfVWE0Suk4bioNRUrkyTfvv++bFOXluWNVaZttIzQix3e491HG/2Cs8 lcAcNvdOybse8Ue58bbAyRI/HFzu3k9TgQijJhWgd0GHOnfCn/QjiUDPat6n1spE ag4ubufD9YK6CCRvJzTm0oQ1gYJzmAKFUrLcaQEleIie1JJXOLlE4w9jxTW2cGiZ 9NawoR+8+2YR2gI75BIIxX+7qwgY7V4Qfic6ceOb+oOgZcJyw60= =Fdw/ -----END PGP SIGNATURE----- --=-=-=--