Date: Tue, 26 Nov 2019 19:10:20 +0100
Message-Id: <20191126181020.17593-1-jannh@google.com>
Subject: [PATCH] io-wq: fix handling of NUMA node IDs
From: Jann Horn
To: Jens Axboe, jannh@google.com
Cc: io-uring@vger.kernel.org, Andrew Morton, Michal Hocko, linux-kernel@vger.kernel.org

There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:

- If the identifiers of the online nodes do not form a single contiguous
  block starting at zero, wq->wqes will be too small, and out-of-bounds (OOB)
  memory accesses will occur, e.g. in the loop in io_wq_create().
- If a node comes online between the call to num_online_nodes() and the
  for_each_node() loop in io_wq_create(), an OOB write will occur.
- If a node comes online between io_wq_create() and io_wq_enqueue(), a
  lookup is performed for an element that doesn't exist, and an OOB read
  will probably occur.
Fix it by:

- using nr_node_ids instead of num_online_nodes() for the allocation size;
  nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
  highest node ID that could possibly come online at some point, even if
  those nodes' identifiers are not a contiguous block
- creating workers for all possible NUMA nodes, not just all online ones

This is basically what the normal workqueue code also does, as far as I
can tell.

Signed-off-by: Jann Horn
---

Notes:
    Compile-tested only.

    While I think I probably got this stuff right, it might be good if
    someone more familiar with the NUMA logic could give an opinion on
    this.

    An alternative might be to only allocate workers for online nodes, but
    then we'd have to either put together logic to create more workers on
    demand or punt requests on newly-onlined nodes over to older nodes.
    Neither of those seems very nice to me.

 fs/io-wq.c | 80 +++++++++++++++++++++++-------------------------------
 1 file changed, 34 insertions(+), 46 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 465f1a1eb52c..f27071e5cae2 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -105,7 +105,6 @@ struct io_wqe {
 struct io_wq {
 	struct io_wqe **wqes;
 	unsigned long state;
-	unsigned nr_wqes;
 
 	get_work_fn *get_work;
 	put_work_fn *put_work;
@@ -632,21 +631,22 @@ static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index)
 static int io_wq_manager(void *data)
 {
 	struct io_wq *wq = data;
-	int i;
+	int workers_to_create = num_possible_nodes();
+	int node;
 
 	/* create fixed workers */
-	refcount_set(&wq->refs, wq->nr_wqes);
-	for (i = 0; i < wq->nr_wqes; i++) {
-		if (create_io_worker(wq, wq->wqes[i], IO_WQ_ACCT_BOUND))
-			continue;
-		goto err;
+	refcount_set(&wq->refs, workers_to_create);
+	for_each_node(node) {
+		if (!create_io_worker(wq, wq->wqes[node], IO_WQ_ACCT_BOUND))
+			goto err;
+		workers_to_create--;
 	}
 
 	complete(&wq->done);
 
 	while (!kthread_should_stop()) {
-		for (i = 0; i < wq->nr_wqes; i++) {
-			struct io_wqe *wqe = wq->wqes[i];
+		for_each_node(node) {
+			struct io_wqe *wqe = wq->wqes[node];
 			bool fork_worker[2] = { false, false };
 
 			spin_lock_irq(&wqe->lock);
@@ -668,7 +668,7 @@ static int io_wq_manager(void *data)
 err:
 	set_bit(IO_WQ_BIT_ERROR, &wq->state);
 	set_bit(IO_WQ_BIT_EXIT, &wq->state);
-	if (refcount_sub_and_test(wq->nr_wqes - i, &wq->refs))
+	if (refcount_sub_and_test(workers_to_create, &wq->refs))
 		complete(&wq->done);
 	return 0;
 }
@@ -776,7 +776,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
 
 void io_wq_cancel_all(struct io_wq *wq)
 {
-	int i;
+	int node;
 
 	set_bit(IO_WQ_BIT_CANCEL, &wq->state);
 
@@ -785,8 +785,8 @@ void io_wq_cancel_all(struct io_wq *wq)
 	 * to a worker and the worker putting itself on the busy_list
 	 */
 	rcu_read_lock();
-	for (i = 0; i < wq->nr_wqes; i++) {
-		struct io_wqe *wqe = wq->wqes[i];
+	for_each_node(node) {
+		struct io_wqe *wqe = wq->wqes[node];
 
 		io_wq_for_each_worker(wqe, io_wqe_worker_send_sig, NULL);
 	}
@@ -859,10 +859,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 				  void *data)
 {
 	enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
-	int i;
+	int node;
 
-	for (i = 0; i < wq->nr_wqes; i++) {
-		struct io_wqe *wqe = wq->wqes[i];
+	for_each_node(node) {
+		struct io_wqe *wqe = wq->wqes[node];
 
 		ret = io_wqe_cancel_cb_work(wqe, cancel, data);
 		if (ret != IO_WQ_CANCEL_NOTFOUND)
@@ -936,10 +936,10 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe,
 enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork)
 {
 	enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
-	int i;
+	int node;
 
-	for (i = 0; i < wq->nr_wqes; i++) {
-		struct io_wqe *wqe = wq->wqes[i];
+	for_each_node(node) {
+		struct io_wqe *wqe = wq->wqes[node];
 
 		ret = io_wqe_cancel_work(wqe, cwork);
 		if (ret != IO_WQ_CANCEL_NOTFOUND)
@@ -970,10 +970,10 @@ static void io_wq_flush_func(struct io_wq_work **workptr)
 void io_wq_flush(struct io_wq *wq)
 {
 	struct io_wq_flush_data data;
-	int i;
+	int node;
 
-	for (i = 0; i < wq->nr_wqes; i++) {
-		struct io_wqe *wqe = wq->wqes[i];
+	for_each_node(node) {
+		struct io_wqe *wqe = wq->wqes[node];
 
 		init_completion(&data.done);
 		INIT_IO_WORK(&data.work, io_wq_flush_func);
@@ -985,15 +985,14 @@ void io_wq_flush(struct io_wq *wq)
 
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
-	int ret = -ENOMEM, i, node;
+	int ret = -ENOMEM, node;
 	struct io_wq *wq;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
 		return ERR_PTR(-ENOMEM);
 
-	wq->nr_wqes = num_online_nodes();
-	wq->wqes = kcalloc(wq->nr_wqes, sizeof(struct io_wqe *), GFP_KERNEL);
+	wq->wqes = kcalloc(nr_node_ids, sizeof(struct io_wqe *), GFP_KERNEL);
 	if (!wq->wqes) {
 		kfree(wq);
 		return ERR_PTR(-ENOMEM);
@@ -1006,14 +1005,13 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->user = data->user;
 	wq->creds = data->creds;
 
-	i = 0;
-	for_each_online_node(node) {
+	for_each_node(node) {
 		struct io_wqe *wqe;
 
 		wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, node);
 		if (!wqe)
-			break;
-		wq->wqes[i] = wqe;
+			goto err;
+		wq->wqes[node] = wqe;
 		wqe->node = node;
 		wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
 		atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
@@ -1029,15 +1027,10 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
 		INIT_HLIST_NULLS_HEAD(&wqe->busy_list, 1);
 		INIT_LIST_HEAD(&wqe->all_list);
-
-		i++;
 	}
 
 	init_completion(&wq->done);
 
-	if (i != wq->nr_wqes)
-		goto err;
-
 	/* caller must have already done mmgrab() on this mm */
 	wq->mm = data->mm;
 
@@ -1056,8 +1049,8 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		ret = PTR_ERR(wq->manager);
 		complete(&wq->done);
 err:
-	for (i = 0; i < wq->nr_wqes; i++)
-		kfree(wq->wqes[i]);
+	for_each_node(node)
+		kfree(wq->wqes[node]);
 	kfree(wq->wqes);
 	kfree(wq);
 	return ERR_PTR(ret);
@@ -1071,26 +1064,21 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 
 void io_wq_destroy(struct io_wq *wq)
 {
-	int i;
+	int node;
 
 	set_bit(IO_WQ_BIT_EXIT, &wq->state);
 	if (wq->manager)
 		kthread_stop(wq->manager);
 
 	rcu_read_lock();
-	for (i = 0; i < wq->nr_wqes; i++) {
-		struct io_wqe *wqe = wq->wqes[i];
-
-		if (!wqe)
-			continue;
-		io_wq_for_each_worker(wqe, io_wq_worker_wake, NULL);
-	}
+	for_each_node(node)
+		io_wq_for_each_worker(wq->wqes[node], io_wq_worker_wake, NULL);
 	rcu_read_unlock();
 
 	wait_for_completion(&wq->done);
 
-	for (i = 0; i < wq->nr_wqes; i++)
-		kfree(wq->wqes[i]);
+	for_each_node(node)
+		kfree(wq->wqes[node]);
 	kfree(wq->wqes);
 	kfree(wq);
 }
-- 
2.24.0.432.g9d3f5f5b63-goog