Received: by 2002:ab2:5182:0:b0:1f4:61d5:3ad4 with SMTP id x2csp23634lqi; Fri, 5 Apr 2024 08:47:01 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCV1kjb3gQVX2inDUfHGJaPkmd6TcTU7Hp/49hBWtpRflnZZGK1bzLtMJk59jtIrefXg/Tlw2SgeTCShMmuRG1E060HGT0WO5dbF+kxEcQ== X-Google-Smtp-Source: AGHT+IH20mZuDCDotaqHtXYmClAUUIrltm+8XcgYO24u4UgITE4KTqfmc0AclUy7y7AMmXKoe6/o X-Received: by 2002:a05:6e02:1384:b0:369:95dc:e4ed with SMTP id d4-20020a056e02138400b0036995dce4edmr1674751ilo.15.1712332021729; Fri, 05 Apr 2024 08:47:01 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1712332021; cv=pass; d=google.com; s=arc-20160816; b=NYpDZpzKPMO1Awnb1Ga177XXspfksNUdCvTeZ7Bt6APHo7B0i42gPAj9HFgwwJCmYL 1RPsTC4cuZNFajQSjb/XPSe9BGKqrcSSuFl/LSfc/HE2YMvNPibybUZTCu3le90xCzPM Ln3jzQdbWIUH6hRtrhl7og3+rI7t1eDUo42PqslxSkld2ScRMQBQlB1c2wCRkMbx0rKs YLXdtQTRzinpZVUxreTKVqLlhy7XVoSpMswxbWifjASLspgZv4BaMU2DWITX7uh898Qv xkEAmOKFLxOkCqFnlBJyZmTv3MYh1AF0WF2s5iJL+4JPj6+6Zk3YhGJNU/M/AGHeTr1b px3g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:list-unsubscribe:list-subscribe :list-id:precedence:dkim-signature; bh=xrx4EfBLw5fhhjyFFx6WbCnhxWaGw01fsd0qggFmp/8=; fh=1MuEeGBNo0dRWmz5sz/wuvdIQg4xg95f2sclRxrbHFA=; b=KGUitLlMXXaYYHelW/LdB21RxY4pLWnkEHgrcfJwuO8bDZZxVsyHU0O6m0UPaVDDt3 lZATA1fWvyYDKP1PBilVdzgfS4od/uY6/jGDjpWQ/q+fCSmxQGVvLnTLeWD8FOx1puZK Aq99G8RFSGVnVVRmNZFwVFhNtj5Ul/EZkAyIMrXME/cBld532oKiQ9ffNH7HwsQ7A8Yd z5yfDOgVx2GunS+OY/rvx9Q4hBxTrGlztuJPFsJJSx422JA8GZq23o7y44NJlO0j4+hl UrLOdvZi2ti7dUzIO9SRLX098UEFQliTxz005tzGZ3I5vJZdwFute7f0q7Wo3uF01XTS UOzw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TdT1INvk; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-133283-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-133283-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id o26-20020a65615a000000b005dc4f9ceccdsi1550452pgv.609.2024.04.05.08.47.01 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 05 Apr 2024 08:47:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-133283-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TdT1INvk; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-133283-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-133283-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id DDAE328182D for ; Fri, 5 Apr 2024 15:47:00 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 7D5FE16EC0B; Fri, 5 Apr 2024 15:46:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="TdT1INvk" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F275516FF3D for ; Fri, 5 Apr 2024 15:46:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712332008; cv=none; b=sIvRUWCl0vEwihKL5bOTkTzLV3Y1xkS0M9c4qTqkrIpLElo4QOK/UViiS/mvilMGGys2bSuvzAFx0Y8LMxWaITqCJsL3prLabvgVEA7XCEXuhvH29BopqWMfbSt5aSHAdIBixs1eu0D+qvXPjBYvULmtdygBSmd5fL1c1WNSaLk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712332008; c=relaxed/simple; bh=+sYP6mSovvQVh5eaZ4COqHW52CnVHRQXjI7rvGpQ3zI=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=ju1EaIItrjQZ6ArA1HV48YQURFyaDqLRFbbADtmQCI9GytErXSG5QD0sgR7K6a5IsTecQ5DpKub4mHv+Xe7DGgrot5NVi1mfndmW9W88I0aGvSdfKv7JJR+SVT0qrTgX8Pba4fe6sj1q+RD9Wa/CAvBPi2N2oJRY5hKwIAHweSA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=TdT1INvk; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1712332005; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xrx4EfBLw5fhhjyFFx6WbCnhxWaGw01fsd0qggFmp/8=; b=TdT1INvkO4/bMUbU67X7GlWaEYmvtAYGGbpbN2k28F5n3xSkG5eFqOeKScn7fff7GL5R7n HGbEWwpwPvzIl7MpAdLCFyykJ8wYBigSHpQexY+FOzgVZsNfKd3j/yNarJCv6XV8ElIJS1 JxtBIHj1Be/pZhk6B21884Znpe4Uj5o= Received: from mail-lf1-f71.google.com (mail-lf1-f71.google.com [209.85.167.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-281-ILwd87wEN2ONWmU8rZcAVQ-1; Fri, 05 Apr 2024 11:46:43 -0400 X-MC-Unique: ILwd87wEN2ONWmU8rZcAVQ-1 Received: by mail-lf1-f71.google.com with SMTP id 2adb3069b0e04-516c8697daaso1843216e87.2 for ; Fri, 05 Apr 2024 08:46:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712332002; x=1712936802; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xrx4EfBLw5fhhjyFFx6WbCnhxWaGw01fsd0qggFmp/8=; b=xA4XNAEAjhEbAjn/XsMPBkw25/lfgBuaxqQfCF3tlrgTufkMZv4Ey6Jzo3N5g2aw82 tknewp5IZqYSPgcmUBxA3Jg2jBOZsG0uaMxx74Y8FLvOBVt//VvknDCfqy+DUsnbiIyi eSD7PvLwONCWRXiC9EjVkbdaWIiSk97T/rbBEBLNhL62UEV1XQHQimwAaluZNBDSWDKI xdo3cufLPrjir60lDtlIEL3w+KETe0K8TDFBgI9V+9btqDDW19o4EV/3FlqndvMgkX5h 3h2zDVgZKnfa1sP7FKJfBndJ77ekdCM65E8B43R7i8/ufVU2IaMmChavErMQgv3Yts94 gYeQ== X-Forwarded-Encrypted: i=1; AJvYcCWodYAKB4YJghvygk31ewS5F7tGDNy1yIGDxowf2z0TjVGpV5t1dPHY8NLaGsryQzDKivni22Nn4MdetquPUbn2UddH7YDTNg6L1ueF X-Gm-Message-State: AOJu0Yw27Adje8+2fouG0rfLgX8E2hQaMNV0W7lJvm5rbsAJHg8ONjvZ jzPNS4Hhu5eKf9sxzl4q9y/P9M1aQGK7jgtWU08ggv9dbGOkENiCn6EkCM3Fjon4QkK4sVX46iB No43gpT9aIQGcb2X4I4NovVgIMIC1Iee+0bcxgjMD6ajyNXGms1vsIAacvjdgjiIKJb5cUQV5Wl 6WVmAGrJs2E83DrNfwMM7wvC8mtEH243Jn2C+o X-Received: by 2002:a19:8c45:0:b0:516:cdde:aaa1 with SMTP id i5-20020a198c45000000b00516cddeaaa1mr1281275lfj.18.1712332002037; Fri, 05 Apr 2024 08:46:42 -0700 (PDT) X-Received: by 2002:a19:8c45:0:b0:516:cdde:aaa1 with SMTP id i5-20020a198c45000000b00516cddeaaa1mr1281263lfj.18.1712332001553; Fri, 05 Apr 2024 08:46:41 -0700 (PDT) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <20240322-hid-bpf-sleepable-v5-0-179c7b59eaaa@kernel.org> <20240322-hid-bpf-sleepable-v5-1-179c7b59eaaa@kernel.org> In-Reply-To: From: Benjamin Tissoires Date: Fri, 5 Apr 2024 17:46:29 +0200 Message-ID: Subject: Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers To: Alexei Starovoitov Cc: Benjamin Tissoires , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , Mykola Lysenko , Shuah Khan , bpf , LKML , "open list:KERNEL SELFTEST FRAMEWORK" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Apr 4, 2024 at 8:29=E2=80=AFPM Alexei Starovoitov wrote: > > On Thu, Apr 4, 2024 at 10:56=E2=80=AFAM Benjamin Tissoires > wrote: > > > > On Thu, Apr 4, 2024 at 6:41=E2=80=AFPM Alexei Starovoitov > > wrote: > > > > > > On Thu, Apr 4, 2024 at 8:27=E2=80=AFAM Benjamin Tissoires > > > wrote: > > > > > > > > > > > > > > > > > > So we need something like: > > > > > > > > > > struct bpf_hrtimer { > > > > > union { > > > > > struct hrtimer timer; > > > > > + struct work_struct work; > > > > > }; > > > > > struct bpf_map *map; > > > > > struct bpf_prog *prog; > > > > > void __rcu *callback_fn; > > > > > void *value; > > > > > union { > > > > > > > > Are you sure we need an union here? If we get to call kfree_rcu() w= e > > > > need to have both struct rcu_head and sync_work, not one or the oth= er. > > > > > > why? with an extra flag it's one or the other. > > > In bpf_timer_cancel_and_free() > > > if (flag & SLEEPABLE) { > > > schedule_work() to cancel_work_sync + kfree_rcu > > > } else { > > > hrtimer_cancel > > > kfree_rcu > > > } > > > > I thought kfree_rcu required struct rcu_head, and given that we need > > to initialize sync_work it will be poisoned... > > yes. It needs rcu_head. > But where do you see a conflict? > INIT_WORK + schedule_work() will use that space, > then cancel_work_sync() will wait on a different work_struct, > then kfree_rcu() will reuse that space. Yeah, sorry, I haven't realized that the memory used by kfree_rcu wasn't initialized. > > In case of hrtimers none of the work_structs will be used. > > > > > > > > > > > struct rcu_head rcu; > > > > > + struct work_struct sync_work; > > > > > }; > > > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPAB= LE > > > > > > > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init() > > > > (like in my v2 or v3 IIRC). But that means that once a timer is > > > > initialized it needs to be of one or the other type (especially tru= e > > > > with the first union in this struct). > > > > > > yes. That's an idea. > > > The code to support wq vs timer seems to be diverging more > > > than what we expected initially. > > > It seems cleaner to set it as init time and enforce in > > > other helpers. > > > > OK, works for me. > > > > > > > > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer) > > > too far. > > > It's already at 112 bytes and some people use bpf_timer per flow. > > > So potentially millions of such timers. > > > Adding extra sizeof(struct work_struct)=3D32 * 2 that won't be > > > used is too much. > > > Note that sizeof(struct hrtimer)=3D64, so unions make everything > > > fit nicely. > > > > Maybe we should do > > union { > > struct hrtimer timer; > > struct { > > struct work_struct work; > > struct work_struct sync_work; > > } > > } > > It's also ok, but sharing rcu_head and work_struct seems > cleaner, since it highlights that they're exclusive. > > > (not nice to read but at least we don't change the size at the beginnin= g) > > > > > > > > > So should we reject during run time bpf_timer_set_callback() for > > > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for > > > > those? (and the invert in the other case). > > > > > > yes. > > > > > > > This version of the patch allows for one timer to be used as softIR= Q > > > > or WQ, depending on the timer_set_callback that is used. But it mig= ht > > > > be simpler for the kfree_rcu race to define the bpf_timer to be of = one > > > > kind, so we are sure to call the correct kfree method. > > > > > > I think one or another simplifies the code and makes it easier > > > to think through combinations. > > > > > > I'm still contemplating adding new "struct bpf_wq" and new kfuncs > > > to completely separate wq vs timer. > > > The code reuse seems to be relatively small. > > > > There is some code reuse in the verifier, but it can be factored out I = think. > > > > Though the biggest reuse might be in the map portion of bpf_timer, > > which I haven't looked much TBH. > > Right. It's all the 'case BPF_TIMER:' in various places. > New 'struct bpf_wq' would need another entry in btf_field_type. > But that should be a straightforward addition. > > > > > > We can potentially factor out internals of bpf_timer_* into smaller > > > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs. > > > > Yeah, also, given that we are going to enforce delay =3D=3D 0 for > > sleepable timers (wq), the user api would be much cleaner if we can > > have a dedicated bpf_wq (and it would make the flags of bpf_timer_init > > easier to deal with). > > It seems so. > Kinda hard to judge one way or the other without looking at > the final code, but it seems separation is worth attempting, at least. > > Also if we ever do hrtimer+wq we probably will be using > 'struct delayed_work' instead of rolling our own > 'struct hrtimer' + 'struct work_struct' combo. > > It seems wq logic already made such a combination special enough > and thought through the races, so we better just follow that path. > In that case it might be yet another 'struct bpf_delayed_wq' > and another set of kfuncs. > Considering that cancel_work() and cancel_delayed_work() > are separate api in the kernel. > Multiplexing all of them under bpf_timer_cancel() > seems wrong. > In the past we were somewhat limited in terms of helpers. > We tried not to add them unless absolutely necessary because > of uapi considerations. > Now with kfuncs we can add/tweak/remove them at will. > > > > > > > > > One more thing. > > > bpf_timer_cancel() api turned out to be troublesome. > > > Not only it cancels the timer, but drops callback too. > > > It was a surprising behavior for people familiar with > > > kernel timer api-s. > > > We should not repeat this mistake with wq. > > > > > > We can try to fix bpf_timer_cancel() too. > > > If we drop drop_prog_refcnt() from it it shouldn't affect > > > existing bpf_timer users who are forced to do: > > > bpf_timer_cancel() > > > bpf_timer_set_callback() > > > bpf_timer_start() > > > all the time. > > > If/when bpf_timer_cancel() stops dropping the callback > > > such bpf prog won't be affected. So low chance of breaking any prog. > > > wdyt? > > > > > > > How would a program know set_callback() is not required after a > > cancel() because the kernel kept it around? It seems that it's going > > to be hard for them to know that (unless by trying first a start()), > > and it will add more code. > > > > timer_cancel() would be hard to change but we can always do the change > > and add a new kfunc timer_cancel_no_drop() which would clearly allow > > that works too. > > > for new programs to know that set_callback() is not required to be > > called. In a few kernel releases we could remove it and say that > > timer_cancel() is the same (and replaced by a #define) > > #define won't work, since mechanics of detecting and calling > helpers vs kfuncs is quite different. > > > Anyway, the more I think of it, the more I think the best API would be > > a dedicated wq API. It's probably going to need a little bit more > > work, but it'll be more or less this work plus the new bpf_wq type in > > the map. > > It seems to me as well. > > Thanks for brainstorming. > Alright, as of today (and I'm about to be AFK for the weekend), I got your changes in and working (I think). I'll review the series on Monday and send it back so we have a baseline to compare it to with bpf_wq. Cheers, Benjamin