From: Ming Yang
To: <42.hyeyoo@gmail.com>
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH] slub: fix slab fragmentation
Date: Tue, 2 Apr 2024 11:10:25 +0800
Message-ID: <20240402031025.1097-1-yangming73@huawei.com>
X-Mailer: git-send-email 2.32.0.windows.1
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
When one NUMA node runs out of memory while many processes are still
booting, slabinfo shows severe slab fragmentation. An excerpt:

# name          <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-512             84309     380800      1024           32              8 : tunables 0 0 0 : slabdata  11900  11900      0
kmalloc-256             65869     365408       512           32              4 : tunables 0 0 0 : slabdata  11419  11419      0

365408 "kmalloc-256" objects are allocated but only 65869 of them are in
use, while 380800 "kmalloc-512" objects are allocated but only 84309 of
them are in use.

The problem occurs in the following scenario:
1. Multiple NUMA nodes, e.g. four nodes.
2. One of the nodes runs out of memory.
3. Code paths that allocate many slab objects from specific nodes, such
   as alloc_fair_sched_group().

The fragmentation arises for the following reason: in ___slab_alloc() a
new slab is requested via get_partial(). If the 'node' argument is
specified but that node has neither partial slabs nor free buddy pages,
no slab can be obtained there, and the allocator falls back to the buddy
system. Since the specified node has no buddy pages left, a new slab may
be allocated directly from the buddy system of another node, regardless
of whether that node still has free partial slabs. The result is slab
fragmentation.

The key point about the allocation flow above is that the slab should
first be taken from the partial lists of the other nodes, not directly
from their buddy systems.

This commit proposes a new slab allocation flow:
1. Try to get a slab via get_partial() (the first step at the
   new_objects label).
2. If no slab is obtained and 'node' is specified, try to allocate a new
   slab from the specified node only, instead of from all nodes.
3. If no slab can be allocated from the specified node, try the partial
   lists of the other nodes.
4. If step 3 also fails, allocate a new slab from the buddy system of
   any node.

Signed-off-by: Ming Yang
Signed-off-by: Liang Zhang
Signed-off-by: Zhigang Wang
Reviewed-by: Shixin Liu
---
This patch can be tested and verified by the following steps:
1. First, exhaust the memory of node0:
   echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   (the value 1000 depends on your memory size)
2. Second, start 10000 processes (again depending on your memory) that
   call the setsid() system call, since setsid() is likely to call
   alloc_fair_sched_group().
3. Last, check slabinfo: cat /proc/slabinfo

Hardware info:
Memory: 8 GiB
CPUs (total): 120
NUMA nodes: 4

Test C code example:

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *p = malloc(1024);

	(void)p;
	setsid();
	while (1)
		;
}

 mm/slub.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7..3eb2e7d386 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 
 	slub_put_cpu_ptr(s->cpu_slab);
+	if (node != NUMA_NO_NODE) {
+		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
+		if (slab)
+			goto slab_alloced;
+
+		slab = get_any_partial(s, &pc);
+		if (slab)
+			goto slab_alloced;
+	}
 	slab = new_slab(s, gfpflags, node);
+
+slab_alloced:
 	c = slub_get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!slab)) {
-- 
2.33.0