Subject: Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
From: Tetsuo Handa
To: David Rientjes, Andrew Morton
Cc: Michal Hocko, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Sat, 21 Jul 2018 05:43:15 +0900

On 2018/07/21 5:14, David Rientjes wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
>  	if (unlikely(mm_is_oom_victim(mm))) {
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration. Taking mm->mmap_sem for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_sem is
> -		 * dropped.
> -		 *
>  		 * Nothing can be holding mm->mmap_sem here and the above call
>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>  		 * __oom_reap_task_mm() will not block.
>  		 *
> +		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
>  		 * This needs to be done before calling munlock_vma_pages_all(),
>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> -		 * reliably test it.
> +		 * reliably test for it. If the oom reaper races with
> +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> +		 * a pmd is zapped, for example, after follow_page_mask() has
> +		 * checked pmd_none().
>  		 */
>  		mutex_lock(&oom_lock);
>  		__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);

I don't like holding oom_lock for the full teardown of an mm, because an
OOM victim's mm might have multiple TB of memory, which could take a long
time. Of course, forcing whoever triggered __mmput() to pay the full cost
is not nice either, since it can even be a /proc/$pid/ reader, can't it?

>
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		/*
> +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> +		 * guarantee that the oom reaper will not run on this mm again
> +		 * after mmap_sem is dropped.
> +		 */
>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}

> -#define MAX_OOM_REAP_RETRIES 10
>  static void oom_reap_task(struct task_struct *tsk)
>  {
> -	int attempts = 0;
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>
> -	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> -		schedule_timeout_idle(HZ/10);
> +	/*
> +	 * If this mm has either been fully unmapped, or the oom reaper has
> +	 * given up on it, nothing left to do except drop the refcount.
> +	 */
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +		goto drop;
>
> -	if (attempts <= MAX_OOM_REAP_RETRIES ||
> -	    test_bit(MMF_OOM_SKIP, &mm->flags))
> -		goto done;
> +	/*
> +	 * If this mm has already been reaped, doing so again will not likely
> +	 * free additional memory.
> +	 */
> +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
> +		oom_reap_task_mm(tsk, mm);

This is still wrong. If exit_mmap() is preempted immediately after
set_bit(MMF_UNSTABLE, &mm->flags) in __oom_reap_task_mm(), oom_reap_task()
can give up before any memory has been reclaimed.
test_bit(MMF_UNSTABLE, &mm->flags) has to be done under oom_lock
serialization, and I don't like holding oom_lock while calling
__oom_reap_task_mm(mm).

>
> -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -		task_pid_nr(tsk), tsk->comm);
> -	debug_show_all_locks();
> +	if (time_after_eq(jiffies, mm->oom_free_expire)) {
> +		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +				task_pid_nr(tsk), tsk->comm);
> +			debug_show_all_locks();
>
> -done:
> -	tsk->oom_reaper_list = NULL;
> +			/*
> +			 * Reaping has failed for the timeout period, so give up
> +			 * and allow additional processes to be oom killed.
> +			 */
> +			set_bit(MMF_OOM_SKIP, &mm->flags);
> +		}
> +		goto drop;
> +	}

> @@ -645,25 +657,60 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>
> +/*
> + * Millisecs to wait for an oom mm to free memory before selecting another
> + * victim.
> + */
> +static u64 oom_free_timeout_ms = 1000;
>  static void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
> +
> +	if (!expire)
> +		expire++;
> +	/*
> +	 * Set the reap timeout; if it's already set, the mm is enqueued and
> +	 * this tsk can be ignored.
> +	 */
> +	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
>  		return;

We don't need this if we do it from mark_oom_victim(), as my series does.

>
>  	get_task_struct(tsk);
>
>  	spin_lock(&oom_reaper_lock);
> -	tsk->oom_reaper_list = oom_reaper_list;
> -	oom_reaper_list = tsk;
> +	list_add(&tsk->oom_reap_list, &oom_reaper_list);
>  	spin_unlock(&oom_reaper_lock);
>  	trace_wake_reaper(tsk->pid);
>  	wake_up(&oom_reaper_wait);
>  }
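
To illustrate the MMF_UNSTABLE objection raised above, here is a minimal
single-threaded userspace model that replays the interleaving step by
step. It is only a sketch, not kernel code: struct mm_model, exit_begin(),
exit_finish(), reaper_unlocked() and reaper_locked() are hypothetical
names, the lock is modelled as a plain bool rather than a real mutex, and
"reaping" just zeroes a counter. It assumes, as the exit_mmap() hunk
shows, that the exit path sets the flag and frees memory while holding
the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct mm_model {
	bool unstable;      /* models MMF_UNSTABLE */
	bool lock_held;     /* models oom_lock being held */
	long mapped_pages;  /* memory that has not been freed yet */
};

enum reaper_result { REAPER_SKIPPED, REAPER_REAPED, REAPER_MUST_WAIT };

/* Exit path, step 1: take the lock and set the flag; nothing is freed yet. */
static void exit_begin(struct mm_model *mm)
{
	assert(!mm->lock_held);
	mm->lock_held = true;
	mm->unstable = true;
	/* Modelled preemption point: flag is visible, memory is still mapped. */
}

/* Exit path, step 2: free the memory and drop the lock. */
static void exit_finish(struct mm_model *mm)
{
	assert(mm->lock_held);
	mm->mapped_pages = 0;
	mm->lock_held = false;
}

/* Reaper that tests the flag without taking the lock. */
static enum reaper_result reaper_unlocked(struct mm_model *mm)
{
	if (mm->unstable)
		return REAPER_SKIPPED;	/* can skip while pages remain mapped */
	mm->unstable = true;
	mm->mapped_pages = 0;
	return REAPER_REAPED;
}

/* Reaper that tests the flag only while it owns the lock. */
static enum reaper_result reaper_locked(struct mm_model *mm)
{
	enum reaper_result res;

	if (mm->lock_held)
		return REAPER_MUST_WAIT;	/* a real mutex_lock() would block here */
	mm->lock_held = true;
	if (mm->unstable) {
		res = REAPER_SKIPPED;	/* flag set implies the other reap finished */
	} else {
		mm->unstable = true;
		mm->mapped_pages = 0;
		res = REAPER_REAPED;
	}
	mm->lock_held = false;
	return res;
}
```

Calling exit_begin() and then reaper_unlocked() reproduces the window in
this model: the reaper skips the mm while mapped_pages is still nonzero.
reaper_locked() cannot observe that state, but only because the exit path
holds the lock for the whole reap, which is the cost objected to above.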