From: "Eric W. Biederman" <ebiederm@xmission.com>
To: Miaohe Lin
Cc: Ying Huang, ...
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration
Date: Fri, 03 Jun 2022 11:28:58 -0500
Message-ID: <87ilph90dx.fsf@email.froward.int.ebiederm.org>
In-Reply-To: (Miaohe Lin's message of "Thu, 2 Jun 2022 17:22:07 +0800")
References: <20220530113016.16663-1-linmiaohe@huawei.com>
 <20220530113016.16663-2-linmiaohe@huawei.com>
 <87bkvdfzvm.fsf@email.froward.int.ebiederm.org>
 <87y1yga1r2.fsf@email.froward.int.ebiederm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
X-Mailing-List: linux-kernel@vger.kernel.org

Miaohe Lin writes:

> On 2022/6/1 22:37, Eric W. Biederman wrote:
>> Miaohe Lin writes:
>>
>>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>>> Miaohe Lin writes:
>>> snip
>>>>>
>>>>> "
>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>>> are done because it suspects the permissions checks are not safe unless
>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>>> association does not change on us while we are working [1]. But extended
>>>>> rcu read lock does not add much value. Because after permission checking
>>>>> the permission may still be changed. There's no much difference. So it's
>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>>> time.
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/87sjhzun47.fsf@xmission.com/
>>>>> "
>>>>
>>>> It doesn't make sense to me.
>>>>
>>>> I don't see any sleeping functions called from find_mm_struct or
>>>> kernel_migrate_pages in the area of the code protected by get_task_struct.
>>>> So at a very basic level I don't see a justification for dirtying a
>>>> cache line twice with get_task_struct and put_task_struct to reduce
>>>> rcu_read_lock hold times.
>>>>
>>>> I would contend that a reasonable cleanup based upon the current state
>>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>>
>>> If so, security_task_movememory will be called inside the rcu lock. It
>>> might call sleeping functions, e.g. smack_log(). I think it's not a good
>>> idea.
>>
>> In general the security functions are not allowed to sleep.
>> The audit mechanism typically preallocates memory so it does
>> not need to sleep.  From a quick skim through the code I don't
>> see smack_log sleeping.
>>
>> Certainly the security hooks are not supposed to be inserted into the
>> code in such a way that they prevent reasonable implementation decisions.
>>
>> Which is to say that if there is a good (non-security-hook) reason for
>> supporting sleeping, deal with it.  Otherwise the security hook has a
>> bug and needs to get fixed/removed.
>
> I see. Many thanks for the explanation.
>
>>
>>>> that a reference to task_struct does not need to be taken.  That has
>>>> the potential to reduce contention and reduce lock hold times.
>>>>
>>>> The code is missing a big fat comment with the assertion that it is ok
>>>> if the permission checks are racy because the race is small, and the
>>>> worst case thing that happens is the page is migrated to another
>>>> numa node.
>>>>
>>>> Given that get_task_mm takes task_lock, the cost of dirtying the
>>>> cache line is already being paid.  Perhaps not extending task_lock hold
>>>> times a little bit is justified, but I haven't seen that case made.
>>>>
>>>> This seems like code that is called little enough that it would be
>>>> better for it to be correct, and not need big fat comments explaining
>>>> why it doesn't matter that the code is deliberately buggy.
>>>
>>> Agree. A big fat comment will make the code hard to follow.
>>
>> No.
>>
>> The code is impossible to follow currently.
>>
>> The code either requires a comment pointing out that it is deliberately
>> racy, or the code needs to be fixed.
>>
>> Clever and subtle code always requires a comment, if for no other
>> reason than to alert the reader that something atypical is going on.
>
> Yes, clever and subtle code requires a comment, but other code might not.
>
>>
>>>> In short it does not make sense to me to justify a patch on performance
>>>> grounds when it appears that extending the rcu_read_lock hold time and
>>>> not touching the task reference count would stop dirtying a cache line
>>>> and likely have more impact.
>>>
>>> IMHO, the incremented task refcount should make the code work correctly.
>>> And extending the rcu_read_lock over get_task_mm will break things
>>> because sleeping functions might be called while holding the rcu lock.
>>
>> Which sleeping functions?
>>
>> I can't find any.  In particular smack_task_movememory calls
>> smk_curacc_on_task, which is the same function called by
>> security_task_getpgid.  security_task_getpgid is called
>> under rcu_read_lock, so smack won't sleep.
>
> Sorry, I didn't take a close look at the smack_log code, so I thought it
> could sleep.
>
>>
>>> Does the patch itself make sense to you? Should I rephrase the commit
>>> log further? I'm afraid I didn't get your point correctly.
>>
>> The patch makes no sense to me because I don't see it doing anything
>> worth doing.
>>
>> get/put_task_struct both dirty a cache line and are expensive, especially
>> when contended.
>> Dirtying a cache line that is contended is pretty much the most expensive
>> native cpu operation.  In pathological scenarios I have seen it take up
>> to 1s.  Realistically in a cache-cold scenario (which is not as bad as a
>> contended scenario) you are looking at 100ns or more just to execute
>> get_task_struct/put_task_struct.  That is the kind of cost it would be
>> nice to avoid altogether (even if the pathological case never comes up).
>>
>> So I see two pieces of code where we could use the cheap operation
>> rcu_read_lock/rcu_read_unlock to remove the expensive operation
>> get_task_struct/put_task_struct.  Instead I see people removing
>> rcu_read_lock/rcu_read_unlock.
>>
>> That makes no sense.  Especially as your implied reason for making this
>> change is to make the code have better performance.  Improving
>> performance is the reason for making critical sections smaller, isn't it?
>>
>
> I think you're right. We should extend the rcu_read_lock over get_task_mm
> so we can remove the expensive get_task_struct/put_task_struct operations
> and thus avoid a possible cacheline penalty. But is the extended rcu lock
> enough to ensure the task reference is stable when calling the security
> check functions and performing the cpuset node validation? Or is this race
> just OK and can it be left alone with a comment?

Extending the rcu_read_lock is just about removing the expense of
get_task_struct/put_task_struct.  It can be handled separately from the
other issues at play in this code.

The rcu_read_lock guarantees that the task does not go away.  The
rcu_read_lock does not guarantee that task->mm does not change.

A separate but related issue is whether the task_lock in get_task_mm
should be extended to cover the security checks so that everything checks
against the same mm.  Possibly the code could be refactored or reordered
so that everything is guaranteed to check against the same mm.

If the checks are not made to guarantee they are all checking against the
same mm, the code needs a comment to say that there is a tiny race.  The
comment should say we don't care about that race because the worst that
can happen is that a page is migrated to a different numa node, and so we
don't care.

Eric
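
To make the suggestion concrete, here is a rough, untested sketch of what
find_mm_struct() in mm/migrate.c could look like with the rcu_read_lock
extended over the checks and get_task_mm(), and the
get_task_struct/put_task_struct pair dropped.  The exact ordering, and the
assumption that none of the callees sleep, would need to be verified; this is
an illustration of the idea being discussed, not a finished patch.

/*
 * Untested sketch: keep rcu_read_lock() held across the permission
 * checks and get_task_mm() so the task cannot be freed, instead of
 * pinning it with get_task_struct()/put_task_struct().  Assumes that
 * neither security_task_movememory() nor cpuset_mems_allowed() sleeps,
 * as argued above.
 */
static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
{
	struct task_struct *task;
	struct mm_struct *mm;

	/* Migrating the caller's own pages needs no permission check. */
	if (!pid) {
		mmget(current->mm);
		*mem_nodes = cpuset_mems_allowed(current);
		return current->mm;
	}

	rcu_read_lock();
	task = find_task_by_vpid(pid);
	if (!task) {
		mm = ERR_PTR(-ESRCH);
		goto out;
	}

	/*
	 * The checks below are racy with the target changing its
	 * credentials or its mm afterwards.  The race is tiny and the
	 * worst that can happen is that pages are migrated to a
	 * different NUMA node, so it is tolerated.
	 */
	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
		mm = ERR_PTR(-EPERM);
		goto out;
	}

	mm = ERR_PTR(security_task_movememory(task));
	if (IS_ERR(mm))
		goto out;

	*mem_nodes = cpuset_mems_allowed(task);
	mm = get_task_mm(task);		/* takes task_lock(), pins the mm */
out:
	rcu_read_unlock();
	return mm;
}

Compared with the current code this removes one atomic get/put on
task->usage in exchange for a slightly longer (but still non-sleeping)
rcu read-side critical section, which is the trade-off argued for above.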