From: Yi Tao <escape@linux.alibaba.com>
To: gregkh@linuxfoundation.org, tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, mcgrof@kernel.org, keescook@chromium.org, yzaikin@google.com
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-fsdevel@vger.kernel.org, shanpeic@linux.alibaba.com
Subject: [RFC PATCH 0/2] support cgroup pool in v1
Date: Wed, 8 Sep 2021 20:15:11 +0800

When containers are started with high concurrency, a cgroup has to be
created for each container and the container's processes attached to it
in order to control its use of system resources. The kernel protects the
consistency of cgroup data with the global cgroup_mutex, which causes a
high long-tail latency for cgroup operations during concurrent startup.
For example, the long-tail latency of creating a cgroup under each
subsystem is 900ms when starting 400 containers, which becomes a
performance bottleneck. The latency has two components: the time spent
in the critical section protected by cgroup_mutex, and the
sleep/scheduling time, which grows as CPU overhead increases.

To solve this long-tail latency problem, we designed a cgroup pool. The
pool creates a certain number of cgroups in advance, so that when a user
creates a cgroup through the mkdir system call, a clean cgroup can be
obtained from the pool quickly. The pool draws on the idea of cgroup
rename: by creating and renaming cgroups ahead of time, it shrinks the
critical section of cgroup creation and uses a spinlock rather than
cgroup_mutex, which reduces scheduling overhead on the one hand and
eases contention with attaching processes on the other.
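As a rough sketch of the intended usage from a container runtime's point
of view (an illustration only: the per-parent "pool_size" file, its
location and the hierarchy path below are assumptions based on the
pool_size interface added later in this series, not a definitive ABI):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed v1 hierarchy path; adjust to the actual mount point. */
#define PARENT "/sys/fs/cgroup/cpu/containers"

int main(void)
{
	char path[256];
	int fd;

	/* Ask the kernel to pre-create a pool of cgroups under the
	 * parent through the (assumed) pool_size file. */
	snprintf(path, sizeof(path), "%s/pool_size", PARENT);
	fd = open(path, O_WRONLY);
	if (fd >= 0) {
		if (write(fd, "400", 3) < 0)
			perror("write pool_size");
		close(fd);
	}

	/* Subsequent mkdir()s can then take the fast path and pick a
	 * pre-created cgroup from the pool. */
	for (int i = 0; i < 400; i++) {
		snprintf(path, sizeof(path), "%s/ctr-%d", PARENT, i);
		if (mkdir(path, 0755) < 0 && errno != EEXIST)
			perror("mkdir");
	}
	return 0;
}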
The core idea of the implementation is a hidden kernfs tree. Cgroups are
implemented on top of the kernfs file system, and users manipulate a
cgroup through its kernfs files, so we can create a cgroup in advance
and place it in a hidden kernfs tree where users cannot operate on it.
When a user needs a new cgroup, we move the pre-created one to its
intended location. Because this only removes a node from one kernfs tree
and inserts it into another, it does not touch the rest of the cgroup's
data or the related subsystems, so the operation is very fast and does
not need to hold cgroup_mutex.

This gets rid of the cgroup_mutex limitation and shrinks the critical
section, but kernfs_rwsem still protects the kernfs data structures, so
the sleep/scheduling time remains. To avoid kernfs_rwsem, we introduce a
pinned state for kernfs nodes: when a node is pinned, the lock that
protects its data is switched from kernfs_rwsem to a lock that can be
set per node. In the cgroup pool scenario, the parent cgroup has its own
spinlock; when the pool is enabled, the kernfs nodes of all cgroups
under that parent are pinned, and creating, deleting and moving these
nodes is protected by the parent's spinlock, so data consistency is not
a problem.

Once the pool is open, creating a cgroup takes the fast path and obtains
one from the pool; deleting a cgroup still takes the slow path. When the
resources in the pool run low, a delayed task is triggered and the pool
is replenished after a period of time; this avoids competing with
ongoing cgroup creation and hurting its performance. If the pool is
exhausted and has not been replenished in time, cgroup creation falls
back to the slow path, so users need to choose an appropriate pool size
and replenishment delay.
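To make the fast path / slow path split concrete, below is a minimal
user-space model of the pool behaviour described above (plain C with a
pthread spinlock standing in for the per-parent spinlock, and
create_slow() standing in for a full creation under cgroup_mutex; it is
a sketch of the idea, not the kernel implementation):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct object {
	int id;
	struct object *next;
};

static struct object *pool;		/* pre-created, hidden objects */
static pthread_spinlock_t pool_lock;	/* models the per-parent spinlock */

/* Slow path: stands in for a full cgroup creation under cgroup_mutex. */
static struct object *create_slow(int id)
{
	struct object *o = malloc(sizeof(*o));

	o->id = id;
	o->next = NULL;
	return o;
}

/* Stands in for the delayed task that replenishes the pool. */
static void pool_refill(int count)
{
	for (int i = 0; i < count; i++) {
		struct object *o = create_slow(-1);

		pthread_spin_lock(&pool_lock);
		o->next = pool;
		pool = o;
		pthread_spin_unlock(&pool_lock);
	}
}

/* Fast path: pop a pre-created object and "rename" it; fall back to the
 * slow path only when the pool is exhausted. */
static struct object *create_fast(int id)
{
	struct object *o;

	pthread_spin_lock(&pool_lock);
	o = pool;
	if (o)
		pool = o->next;
	pthread_spin_unlock(&pool_lock);

	if (!o)
		return create_slow(id);
	o->id = id;
	return o;
}

int main(void)
{
	pthread_spin_init(&pool_lock, PTHREAD_PROCESS_PRIVATE);
	pool_refill(4);

	for (int i = 0; i < 6; i++) {
		struct object *o = create_fast(i);

		printf("cgroup %d created via %s path\n", o->id,
		       i < 4 ? "fast" : "slow");
		free(o);
	}
	return 0;
}

The point of the split is that the common case only takes a spinlock for
a short list operation, while the expensive work happens either ahead of
time or in the deferred refill, away from the creation hot path.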
What we did in the patches:

1. Add a pinned flag for kernfs nodes, so that a node can get rid of
   kernfs_rwsem and choose to be protected by another lock.
2. Add a pool_size interface used to open and close the cgroup pool.
3. Add an extra kernfs tree used to hide the cgroups held in the pool.
4. Add a spinlock to protect the kernfs nodes of the cgroups in the pool.

Yi Tao (2):
  add pinned flags for kernfs node
  support cgroup pool in v1

 fs/kernfs/dir.c             |  74 ++++++++++++++++-------
 include/linux/cgroup-defs.h |  16 +++++
 include/linux/cgroup.h      |   2 +
 include/linux/kernfs.h      |  14 +++++
 kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 113 ++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c             |   8 +++
 7 files changed, 345 insertions(+), 21 deletions(-)

--
1.8.3.1