From: Balbir Singh
Date: Fri, 4 May 2018 14:59:22 +1000
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg
To: "Eric W. Biederman"
Cc: Michal Hocko, Kirill Tkhai, "akpm@linux-foundation.org", Peter Zijlstra,
    Oleg Nesterov, Al Viro, Ingo Molnar, "Paul E. McKenney", Kees Cook,
    Rik van Riel, Thomas Gleixner, "Kirill A. Shutemov",
    marcos.souza.org@gmail.com, hoeun.ryu@gmail.com,
    pasha.tatashin@oracle.com, gs051095@gmail.com, dhowells@redhat.com,
    Mike Rapoport, "linux-kernel@vger.kernel.org"
In-Reply-To: <87d0ycu62t.fsf@xmission.com>
References: <152473763015.29458.1131542311542381803.stgit@localhost.localdomain>
    <20180426130700.GP17484@dhcp22.suse.cz> <87efj2q6sq.fsf@xmission.com>
    <20180426192818.GX17484@dhcp22.suse.cz> <20180427070848.GA17484@dhcp22.suse.cz>
    <87r2n01q58.fsf@xmission.com> <87o9hz2sw3.fsf@xmission.com>
    <87h8nr2sa3.fsf_-_@xmission.com> <20180503095952.70bffde1@balbir.ozlabs.ibm.com>
    <87d0ycu62t.fsf@xmission.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, May 4, 2018 at 1:11 AM, Eric W. Biederman wrote:
> Balbir Singh writes:
>
>> On Tue, 01 May 2018 12:35:16 -0500
>> ebiederm@xmission.com (Eric W.
>> Biederman) wrote:
>>
>>> Recently it was reported that mm_update_next_owner could get into
>>> cases where it was executing its fallback for_each_process part of
>>> the loop and thus taking up a lot of time.
>>>
>>> To deal with this, replace mm->owner with mm->memcg. This just reduces
>>> the complexity of everything. As much as possible I have maintained
>>> the current semantics. There are two significant exceptions. During
>>> fork the memcg of the process calling fork is charged rather than
>>> init_css_set. During memory cgroup migration the charges are migrated
>>> not if the process is the owner of the mm, but if the process being
>>> migrated has the same memory cgroup as the mm.
>>>
>>> I believe it was a bug if init_css_set is charged for memory activity
>>> during fork, and the old behavior was simply a consequence of the new
>>> task not having tsk->cgroup initialized to its proper cgroup.
>>
>> That does sound like a bug, I guess we've not seen it because we did
>> not track any slab allocations initially.
>
>>> During cgroup migration only thread group leaders are allowed to
>>> migrate, which means in practice there should only be one. Linux
>>> tasks created with CLONE_VM are the only exception, but the common
>>> cases are already ruled out. Processes created with vfork have a
>>> suspended parent and can do nothing but call exec, so they should never
>>> show up. Threads of the same cgroup are not the thread group leader,
>>> so they also should not show up. That leaves the old LinuxThreads
>>> library, which is probably out of use by now, and someone doing
>>> something very creative with cgroups, rolling their own threads with
>>> CLONE_VM. So in practice I don't think the difference in charge
>>> migration will affect anyone.
>>>
>>> To ensure that mm->memcg is updated appropriately I have implemented
>>> cgroup "attach" and "fork" methods. This ensures that at those
>>> points the mm pointed to by the task has the appropriate memory cgroup.
>>>
>>> For simplicity, instead of introducing a new mm lock, I simply use
>>> an exchange on the pointer where mm->memcg is updated to get
>>> atomic updates.
>>>
>>> Looking at the history, effectively this change is a revert. The
>>> reason given for adding mm->owner is so that multiple cgroups can be
>>> attached to the same mm. In the last 8 years a second user of
>>> mm->owner has not appeared. A feature that has never been used, makes
>>> the code more complicated, and has horrible worst-case performance
>>> should go.
>>
>> The idea was to track the mm to the right cgroup; we did find that
>> the mm could be confused as belonging to two cgroups. tsk->cgroup is
>> not sufficient, and when the tgid left, we needed an owner to track
>> where the current allocations were. But this is from 8-year-old
>> history, I don't have my notes anymore :)
>
> I was referring to the change 8ish years ago where mm->memcg was
> replaced with mm->owner. Semantically the two concepts should both
> be perfectly capable of resolving which cgroup the mm belongs to.

Yep, agreed.

>>> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
>>> +{
>>> +	struct cgroup_subsys_state *css;
>>> +	struct task_struct *tsk;
>>> +
>>> +	cgroup_taskset_for_each(tsk, css, tset) {
>>> +		struct mem_cgroup *new = mem_cgroup_from_css(css);
>>> +		css_get(css);
>>> +		task_update_memcg(tsk, new);
>>
>> I'd have to go back and check, and I think your comment refers to this,
>> but we don't expect non-tgid tasks to show up here? My concern is I
>> can't find the guarantee that task_update_memcg(tsk, new) is not
>>
>> 1. Duplicated for each thread in the process or attached to the mm
>> 2. Updating mm->memcg to point to different places, so the one
>>    that sticks is the one that updated things last
>
> For cgroupv2, which only operates on processes, we have such a guarantee.
>
> There is no such guarantee for cgroupv1. But it would take someone
> being crazy to try this.
>
> We can add a guarantee to can_attach that we move all of the threads in
> a process, and we probably should. However, having mm->memcg is more
> important structurally than what crazy we let in. So let's make this
> change first as safely as we can; then we don't lose important
> data-structure simplifications if it turns out we have to revert a
> change to make the memcg per-process in cgroupv1.
>
> There are some serious issues with the interactions between the memory
> control group and the concept of thread group leader that show up when
> you consider a zombie thread group leader that has called cgroup_exit.
> So I am not anxious to stir concepts like thread_group_leader into
> new code unless there is a very good reason.
>
> We don't expect crazy, but the code allows it, and I have not changed
> that.

OK, let's stick with this.

Acked-by: Balbir Singh

Balbir Singh