Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
From: Wen Yang <wenyang@linux.alibaba.com>
To: Andrew Morton
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
    Xunlei Pang, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Sun, 16 Feb 2020 12:15:54 +0800
References: <20200201031502.92218-1-wenyang@linux.alibaba.com>
 <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>

On 2020/2/13 6:52 AM, Andrew Morton wrote:
> On Sat, 1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
>
>> The lock protecting the node partial list is taken when counting the free
>> objects resident in that list. This introduces locking contention when the
>> page(s) is moved between CPU and node partial lists in the allocation path
>> on another CPU. So reading "/proc/slabinfo" can possibly block the slab
>> allocation on another CPU for a while, 200ms in extreme cases. If the
>> slab object is to carry a network packet targeting a far-end disk array,
>> it causes a block IO jitter issue.
>>
>> This fixes the block IO jitter issue by caching the total inuse objects in
>> the node in advance. The value is retrieved without taking the node partial
>> list lock on reading "/proc/slabinfo".
>>
>> ...
>>
>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>  
>>  static void discard_slab(struct kmem_cache *s, struct page *page)
>>  {
>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>> +	int inuse = page->objects;
>> +
>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>
> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>
> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> and save a function argument.
>
> If yes then why?
>

Thanks for your comments. We are happy to improve this patch based on
your suggestions.

When a user reads /proc/slabinfo, the kernel obtains the active_objs
value by traversing all slabs and executing the following code snippet:

static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(page, &n->partial, slab_list)
		x += get_count(page);
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}

This may cause performance issues.

Christoph suggested: "you could cache the value in the userspace
application? Why is this value read continually?" However, reading
/proc/slabinfo is initiated by user programs. As a cloud provider, we
cannot control user behavior. If a user program inadvertently executes
cat /proc/slabinfo, it may affect other user programs.

As Christoph said: "The count is not needed for any operations. Just
for the slabinfo output. The value has no operational value for the
allocator itself. So why use extra logic to track it in potentially
performance critical paths?"

Could we therefore show an approximate value of active_objs in
/proc/slabinfo? This is based on the following observations:

1. In the discard_slab() function, page->inuse is equal to page->objects;
2. In the allocate_slab() function, page->inuse is also equal to
   page->objects (with one exception: for kmem_cache_node, page->inuse
   equals 1);
3. page->inuse only changes while objects are being allocated or freed
   (this is the performance-critical path emphasized by Christoph).

When users query the global slabinfo information, we may therefore use
total_objects to approximate active_objs.
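
For context, the count_free() callback that get_slabinfo() passes into
count_partial() is (roughly, in the kernel this patch is against) just
the per-page free count:

static int count_free(struct page *page)
{
	/* objects in this slab page that are not currently allocated */
	return page->objects - page->inuse;
}

So what the change below stops computing under n->list_lock is exactly
the number of free objects sitting on the node partial lists, and that
is the only precision the approximation gives up.
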
Based on this, the modified patch would be as follows:

diff --git a/mm/slub.c b/mm/slub.c
index a0b335d..ef0e6ac 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5900,17 +5900,15 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 {
 	unsigned long nr_slabs = 0;
 	unsigned long nr_objs = 0;
-	unsigned long nr_free = 0;
 	int node;
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
 	}
 
-	sinfo->active_objs = nr_objs - nr_free;
+	sinfo->active_objs = nr_objs;
 	sinfo->num_objs = nr_objs;
 	sinfo->active_slabs = nr_slabs;
 	sinfo->num_slabs = nr_slabs;

In addition, when a user really needs the precise active object count
of a particular slab cache, it can be queried through an interface
similar to the following, which avoids traversing all the slabs:

# cat /sys/kernel/slab/kmalloc-512/total_objects
1472 N0=1472
# cat /sys/kernel/slab/kmalloc-512/objects
1311 N0=1311

or

# cat /sys/kernel/slab/kmalloc-8k/total_objects
60 N0=60
# cat /sys/kernel/slab/kmalloc-8k/objects
60 N0=60

Best wishes,
Wen

>>  	free_slab(s, page);
>>  }
>>