Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp3952948yba; Tue, 7 May 2019 09:36:38 -0700 (PDT) X-Google-Smtp-Source: APXvYqwojCCNibB0fD+ftJbznOKVYAnRE6UhnrsV3K39Mo7n5oifSWil3baqS90eAQTU9j5xlGMN X-Received: by 2002:a17:902:2a07:: with SMTP id i7mr36328788plb.125.1557246998181; Tue, 07 May 2019 09:36:38 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1557246998; cv=none; d=google.com; s=arc-20160816; b=SFkK3+eTTjpn42DpjwE7+FyySIFjd7nXu/R07K6jqFWzW/hw4XKuGW5PZjyEPtQRk8 RzW1DxGMI1PnnRWUciRLXdkWGjD4/LijLVqNwExCMLl09hA9A2YUaRR9sv+WbbSkMaW/ xm+pv/VbbuanOKwr8eo9ArWv6Unx5UtjlYgfnqW5uEQPk5oO+WiHxUBhVheGiBg+aiwJ 1lxy2v7DcsFvPybs7s3IXQ2egrUzTyzhQflBM3H1WbdNlbaOLn45RZILpWiWOe2fKeRp 3fzIdUWdn81/2f0dSv1AilBxph+KOO0Uh5bv2v42S/lou4ggAu70U4iI0ZfA1N7Hh420 TzYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=uaftBA0UoN/VSNTOzbOYMh9Vy7eqJmh/fNyQY1SouuI=; b=XuUGKqxIYf4UXWDVYKnAarFpTvCQwQfIq/dAbkavxGvC/SjSLZ8GbKlM+PtbhjMAhv FnGbsCt9PJdP9hHnCF68JbP4g+W8CjMgPnbOfSCMUDzdZ3bx1pdB6pFxkVYTOi+1w1/F rkNo4Y+sFFB808f6xFur51UvMW9C19Q+Mce81XImjBSeWbZKkHsXe9QpWAzXq4tAinup 6GQN7xXw+zjwE+Q/JyiZf2UDdoqo8k30mS//JlPu9aVSkTlYfda0jsqo1FLZlGIG0RQW nwsB4Il0aEVgYuCnRgpd0ol4OxA656j3FLslMzmCrqhd14BLl795KUZ61pgxnp6M0yQL aR2Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 96si21561137ple.289.2019.05.07.09.36.21; Tue, 07 May 2019 09:36:38 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726797AbfEGQf2 (ORCPT + 99 others); Tue, 7 May 2019 12:35:28 -0400 Received: from mail-ot1-f68.google.com ([209.85.210.68]:46372 "EHLO mail-ot1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726197AbfEGQf0 (ORCPT ); Tue, 7 May 2019 12:35:26 -0400 Received: by mail-ot1-f68.google.com with SMTP id v17so7463353otp.13 for ; Tue, 07 May 2019 09:35:26 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=uaftBA0UoN/VSNTOzbOYMh9Vy7eqJmh/fNyQY1SouuI=; b=HHlnjvhdoye2vz5q1lQACLZrDkm7QQuQOjBz3KMcIwZTqWnMCKikdeMrkHO+ydpz+E ClJJpf7/yT7TZ8pFmX24wVuB7+wdborzWhWRbMeowBv26ozeAhZJKhYPEjdzvz68LdMq +/buurUDzrEhM5pGEta4Yd7fLyzmum99u9cDf1VkmpbAzFfO6ovN+CsJfl6m4BlE7Lyo VGQkC0Ih7+e98+9oxVesrWLdx9NQX6nLu/MJv1YFKG9vURsGlC476q/5ngqBGExABjwJ tZh9B+GTLlb0ftl6hmfTQzghnH0WpPkcd4TbU3BsuonPLpmi/tEkkvqntOzxcxYbcBo7 UOCg== X-Gm-Message-State: APjAAAVpWs3IepaN4HGo2kEoa1yojvZAnbFWfPDThcrisj87U/E1s6JV 3gatM2hSo7v4rfDvEczbMB4= X-Received: by 2002:a9d:6c51:: with SMTP id g17mr13323687otq.171.1557246925490; Tue, 07 May 2019 09:35:25 -0700 (PDT) Received: from sultan-box.localdomain ([107.193.118.89]) by smtp.gmail.com with ESMTPSA id p4sm3559359oti.70.2019.05.07.09.35.22 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 07 May 2019 09:35:24 -0700 (PDT) Date: Tue, 7 May 2019 09:35:20 -0700 From: Sultan Alsawaf To: Oleg Nesterov Cc: Christian Brauner , Daniel Colascione , Suren Baghdasaryan , Steven Rostedt , Tim Murray , Michal Hocko , Greg Kroah-Hartman , Arve =?iso-8859-1?B?SGr4bm5lduVn?= , Todd Kjos , Martijn Coenen , Ingo Molnar , Peter Zijlstra , LKML , "open list:ANDROID DRIVERS" , linux-mm , kernel-team , Andy Lutomirski , "Serge E. Hallyn" , Kees Cook , Joel Fernandes Subject: Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android Message-ID: <20190507163520.GA1131@sultan-box.localdomain> References: <20190317114238.ab6tvvovpkpozld5@brauner.io> <20190318002949.mqknisgt7cmjmt7n@brauner.io> <20190318235052.GA65315@google.com> <20190319221415.baov7x6zoz7hvsno@brauner.io> <20190319231020.tdcttojlbmx57gke@brauner.io> <20190320015249.GC129907@google.com> <20190507021622.GA27300@sultan-box.localdomain> <20190507153154.GA5750@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190507153154.GA5750@redhat.com> User-Agent: Mutt/1.11.4 (2019-03-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote: > I am not going to comment the intent, but to be honest I am skeptical too. The general sentiment has been that this is a really bad idea, but I'm just a frustrated Android user who wants his phone to not require mountains of zRAM only to still manage memory poorly. Until I can go out and buy a non-Pixel phone that uses PSI to make these decisions (and does a good job of it), I'm going to stick to my hacky little driver for my personal devices. Many others who like to hack their Android devices to make them last longer will probably find value in this as well, since there are millions of people who use devices that'll never seen any of PSI ported to their ancient 3.x kernels. And yes, I know this would never be accepted to upstream in a million years. I mostly wanted some code review and guidance, since mm code is pretty tricky :) > On 05/06, Sultan Alsawaf wrote: > > > > +static unsigned long find_victims(struct victim_info *varr, int *vindex, > > + int vmaxlen, int min_adj, int max_adj) > > +{ > > + unsigned long pages_found = 0; > > + int old_vindex = *vindex; > > + struct task_struct *tsk; > > + > > + for_each_process(tsk) { > > + struct task_struct *vtsk; > > + unsigned long tasksize; > > + short oom_score_adj; > > + > > + /* Make sure there's space left in the victim array */ > > + if (*vindex == vmaxlen) > > + break; > > + > > + /* Don't kill current, kthreads, init, or duplicates */ > > + if (same_thread_group(tsk, current) || > > + tsk->flags & PF_KTHREAD || > > + is_global_init(tsk) || > > + vtsk_is_duplicate(varr, *vindex, tsk)) > > + continue; > > + > > + vtsk = find_lock_task_mm(tsk); > > Did you test this patch with lockdep enabled? > > If I read the patch correctly, lockdep should complain. vtsk_is_duplicate() > ensures that we do not take the same ->alloc_lock twice or more, but lockdep > can't know this. Yeah, lockdep is fine with this, at least on 4.4. > > +static void scan_and_kill(unsigned long pages_needed) > > +{ > > + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq); > > + struct victim_info victims[MAX_VICTIMS]; > > + int i, nr_to_kill = 0, nr_victims = 0; > > + unsigned long pages_found = 0; > > + atomic_t victim_count; > > + > > + /* > > + * Hold the tasklist lock so tasks don't disappear while scanning. This > > + * is preferred to holding an RCU read lock so that the list of tasks > > + * is guaranteed to be up to date. Keep preemption disabled until the > > + * SIGKILLs are sent so the victim kill process isn't interrupted. > > + */ > > + read_lock(&tasklist_lock); > > + preempt_disable(); > > read_lock() disables preemption, every task_lock() too, so this looks > unnecessary. Good point. > > + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) { > > + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS, > > + adj_prio[i], adj_prio[i - 1]); > > + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS) > > + break; > > + } > > + > > + /* > > + * Calculate the number of tasks that need to be killed and quickly > > + * release the references to those that'll live. > > + */ > > + for (i = 0, pages_found = 0; i < nr_victims; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + /* The victims' mm lock is taken in find_victims; release it */ > > + if (pages_found >= pages_needed) { > > + task_unlock(vtsk); > > + continue; > > + } > > + > > + /* > > + * Grab a reference to the victim so it doesn't disappear after > > + * the tasklist lock is released. > > + */ > > + get_task_struct(vtsk); > > The comment doesn't look correct. the victim can't dissapear until task_unlock() > below, it can't pass exit_mm(). I was always unsure about this and decided to hold a reference to the task_struct to be safe. Thanks for clearing that up. > > + pages_found += victim->size; > > + nr_to_kill++; > > + } > > + read_unlock(&tasklist_lock); > > + > > + /* Kill the victims */ > > + victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill); > > + for (i = 0; i < nr_to_kill; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm, > > + vtsk->signal->oom_score_adj, > > + victim->size << (PAGE_SHIFT - 10)); > > + > > + /* Configure the victim's mm to notify us when it's freed */ > > + vtsk->mm->slmk_waitq = &victim_waitq; > > + vtsk->mm->slmk_counter = &victim_count; > > + > > + /* Accelerate the victim's death by forcing the kill signal */ > > + do_send_sig_info(SIGKILL, SIG_INFO_TYPE, vtsk, true); > ^^^^ > this should be PIDTYPE_TGID Thanks, I didn't realize the last argument to do_send_sig_info changed in newer kernels. The compiler didn't complain, so it went over my head. > > + > > + /* Finally release the victim's mm lock */ > > + task_unlock(vtsk); > > + } > > + preempt_enable_no_resched(); > > See above. And I don't understand how can _no_resched() really help... Yeah, good point. > > + > > + /* Try to speed up the death process now that we can schedule again */ > > + for (i = 0; i < nr_to_kill; i++) { > > + struct task_struct *vtsk = victims[i].tsk; > > + > > + /* Increase the victim's priority to make it die faster */ > > + set_user_nice(vtsk, MIN_NICE); > > + > > + /* Allow the victim to run on any CPU */ > > + set_cpus_allowed_ptr(vtsk, cpu_all_mask); > > + > > + /* Finally release the victim reference acquired earlier */ > > + put_task_struct(vtsk); > > + } > > + > > + /* Wait until all the victims die */ > > + wait_event(victim_waitq, !atomic_read(&victim_count)); > > Can't we avoid the new slmk_waitq/slmk_counter members in mm_struct? > > I mean, can't we export victim_waitq and victim_count and, say, set/test > MMF_OOM_VICTIM. In fact I think you should try to re-use mark_oom_victim() > at least. This makes the patch less portable across different kernel versions, which is kind of one of its major goals. Thanks for the code review, Oleg. Sultan