Received: by 2002:a05:7412:f690:b0:e2:908c:2ebd with SMTP id ej16csp1251205rdb; Fri, 20 Oct 2023 12:46:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IG/aKsQtw+IqRGuHcBghc9PMzJqgnGJvkNduKjwx7JqW7EPD4VdvNNMDC8do9OFVVX9Gigq X-Received: by 2002:a05:6a00:14c9:b0:68f:bbc3:df44 with SMTP id w9-20020a056a0014c900b0068fbbc3df44mr3374449pfu.0.1697831204796; Fri, 20 Oct 2023 12:46:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697831204; cv=none; d=google.com; s=arc-20160816; b=bYMZclBDBb0G587HAHaO3h8GSjkraHGgudHBTZMPeZ2/h7IEeLFRVTz8pKvMp9fD1Z dRpIVGRbxi1nojqfMQK9eMnsvETR7tySIK2ui5pV+uCgxcNl1Lam5wJmle3orhLelC75 tQ35Iqn5TF+CeDbvDR9MklbPGQATqnhrHbMHvUeZ37GTYRVMyEoFup5tCoHNYegbRFSZ lgqzHfFEZS5bLRe9Sg7so30BZPIsLOn1CDTNCnX16o36dFSNIYb662O/zUJAXYfDBEPQ ZXsG7N4VDZgg+LGydMc+4AUXFR4bSv7b/PZrz1vQCAqgtCs5+AQmTRBhGsqyCpUo+aox tkOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:references:in-reply-to:subject:cc:to:from :dkim-signature; bh=Wh+Dv419ISZ8MwJJpSPHQzg0ccuVUnxDPZZwa2eAnEY=; fh=qk53yMnWmsWfrmzkTe5d80ixRWgQ5jHxqKMWw9sOx5s=; b=nCCzxN9dDZ1O2y1KsrO6m+JheVGFJFVuoJ4MFtzH5J0wJalY8yIPjR/H34/Rydqz0y tcG7v3Ef10tlHAcs6kzXui7BRZRM1cNGfeinudr8EygAE1Uvd7RPuJ9/AcvbaoNUf81Z cbfHm2ZchAqP9ElYdEfaxJwUlQZBCUrsbxdeMtJTwB398Vjh6TLZvVG3Xc4PVcWnM84+ lwtlzr4BWz327UpQLV8W5UMCFjn32FN7EZgMslJ+I+ldRV8fy2Ql5q+BD5DFNnUhbtcf wIVSQhhIfZCKZ0y5/AxbO0vIRO9LBe2HyxceehlgTKzjo98EjT0BJVrPCa84KXHvkHNg psUw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Sdep1+c0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from pete.vger.email (pete.vger.email. [2620:137:e000::3:6]) by mx.google.com with ESMTPS id i23-20020a633c57000000b00565eccbc52fsi2589832pgn.303.2023.10.20.12.46.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 20 Oct 2023 12:46:44 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) client-ip=2620:137:e000::3:6; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Sdep1+c0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 520E58076D07; Fri, 20 Oct 2023 12:45:46 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229723AbjJTTpd (ORCPT + 99 others); Fri, 20 Oct 2023 15:45:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38282 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229555AbjJTTpb (ORCPT ); Fri, 20 Oct 2023 15:45:31 -0400 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 21DF0D52 for ; Fri, 20 Oct 2023 12:45:29 -0700 (PDT) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1DAA0C433C7; Fri, 20 Oct 2023 19:45:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1697831128; bh=w20LFURbBMrbFbfKlNvLx4G7zOkHbNQac0n02X5DHnM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=Sdep1+c0KL9zzmvUTGS83P2istPzUonpp11T0DdFI6SHHLScZ6DhvKrFS0UYUGmBK RBFaPWCWbtAbkjjEOKgJvx3/fQK2NsiundaEis3NV03NyrGHx02epDH2wXMc0D47PV wJu8fvKyo5+o3OfTTvPnfzV83Yn/vwmFot7G0csBVMdYyigTdhSAm3+w42adoGJhDU OiHR+y1wPGo4oRs5l8E0M0+AS6lYRQKdFMGV7ofyLKZ7taZ6Q4D8aOEc4/KlaDybWG B90L+ABPi4UmmvHH11Xm4q6sqBArpWxet8fyS4VjNljLIAQP4M6qbPuhHr1mGh8ZSj HF6mErwfFxZzg== From: =?utf-8?B?QmrDtnJuIFTDtnBlbA==?= To: Anup Patel Cc: Palmer Dabbelt , Paul Walmsley , Thomas Gleixner , Marc Zyngier , Rob Herring , Krzysztof Kozlowski , Frank Rowand , Conor Dooley , Atish Patra , Andrew Jones , Sunil V L , Saravana Kannan , Anup Patel , linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, devicetree@vger.kernel.org Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support In-Reply-To: References: <20231003044403.1974628-1-apatel@ventanamicro.com> <87o7gu7mo9.fsf@all.your.base.are.belong.to.us> <87h6mlbryy.fsf@all.your.base.are.belong.to.us> <87o7gtpdb4.fsf@all.your.base.are.belong.to.us> <87v8b1i72s.fsf@all.your.base.are.belong.to.us> Date: Fri, 20 Oct 2023 21:45:25 +0200 Message-ID: <875y31cc2y.fsf@all.your.base.are.belong.to.us> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.2 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Fri, 20 Oct 2023 12:45:46 -0700 (PDT) Anup Patel writes: > On Fri, Oct 20, 2023 at 10:07=E2=80=AFPM Bj=C3=B6rn T=C3=B6pel wrote: >> >> Anup Patel writes: >> >> > On Fri, Oct 20, 2023 at 8:10=E2=80=AFPM Bj=C3=B6rn T=C3=B6pel wrote: >> >> >> >> Anup Patel writes: >> >> >> >> > On Fri, Oct 20, 2023 at 2:17=E2=80=AFPM Bj=C3=B6rn T=C3=B6pel wrote: >> >> >> >> >> >> Thanks for the quick reply! >> >> >> >> >> >> Anup Patel writes: >> >> >> >> >> >> > On Thu, Oct 19, 2023 at 7:13=E2=80=AFPM Bj=C3=B6rn T=C3=B6pel wrote: >> >> >> >> >> >> >> >> Hi Anup, >> >> >> >> >> >> >> >> Anup Patel writes: >> >> >> >> >> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V in= ternational >> >> >> >> > process. The latest ratified AIA specifcation can be found at: >> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/risc= v-interrupts-1.0.pdf >> >> >> >> > >> >> >> >> > At a high-level, the AIA specification adds three things: >> >> >> >> > 1) AIA CSRs >> >> >> >> > - Improved local interrupt support >> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) >> >> >> >> > - Per-HART MSI controller >> >> >> >> > - Support MSI virtualization >> >> >> >> > - Support IPI along with virtualization >> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) >> >> >> >> > - Wired interrupt controller >> >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MS= I generator) >> >> >> >> > - In Direct-mode, injects external interrupts directly int= o HARTs >> >> >> >> >> >> >> >> Thanks for working on the AIA support! I had a look at the seri= es, and >> >> >> >> have some concerns about interrupt ID abstraction. >> >> >> >> >> >> >> >> A bit of background, for readers not familiar with the AIA deta= ils. >> >> >> >> >> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, = and >> >> >> >> each MSI is dedicated to a certain hart. The series takes the a= pproach >> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") global= ly. >> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in = the >> >> >> >> slice only *one* msi-irq is acutally used. >> >> >> >> >> >> >> >> This scheme makes affinity changes more robust, because the int= errupt >> >> >> >> sources on "other" harts are pre-allocated. On the other hand it >> >> >> >> requires to propagate irq masking to other harts via IPIs (this= is >> >> >> >> mostly done up setup/tear down). It's also wasteful, because ms= i-irqs >> >> >> >> are hogged, and cannot be used. >> >> >> >> >> >> >> >> Contemporary storage/networking drivers usually uses queues per= core >> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi= -irqs. >> >> >> >> If we instead used a scheme where "msi-irq =3D=3D lnx-irq", ins= tead of >> >> >> >> "lnq-irq =3D {hart 0;msi-irq x , ... hart N;msi-irq x}", there = would be >> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage devi= ce >> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the = current >> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5. >> >> >> >> >> >> >> >> On the plus side: >> >> >> >> * Changing interrupts affinity will never fail, because the int= errupts >> >> >> >> on each hart is pre-allocated. >> >> >> >> >> >> >> >> On the negative side: >> >> >> >> * Wasteful interrupt usage, and a system can potientially "run = out" of >> >> >> >> interrupts. Especially for many core systems. >> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's= no >> >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IM= SIC >> >> >> >> >> >> >> >> Summary: >> >> >> >> The current series caps the number of global interrupts to maxi= mum >> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, w= ould be >> >> >> >> to expose 2047 * #harts unique MSIs. >> >> >> >> >> >> >> >> I think this could simplify/remove(?) the locking as well. >> >> >> > >> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues: >> >> >> > 1) The irq_set_affinity() does not work for MSIs because each >> >> >> > IRQ is not tied to a particular HART. This means we can't >> >> >> > balance the IRQ processing load among HARTs. >> >> >> >> >> >> Yes, you can balance. In your code, each *active* MSI is still >> >> >> bound/active to a specific hard together with the affinity mask. I= n an >> >> >> 1-1 model you would still need to track the affinity mask, but the >> >> >> irq_set_affinity() would be different. It would try to allocate a = new >> >> >> MSI from the target CPU, and then switch to having that MSI active. >> >> >> >> >> >> That's what x86 does AFAIU, which is also constrained by the # of >> >> >> available MSIs. >> >> >> >> >> >> The downside, as I pointed out, is that the set affinity action can >> >> >> fail for a certain target CPU. >> >> > >> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for >> >> > RISC-V AIA, one HART does not have access to other HARTs >> >> > MSI enable/disable bits so the approach will also involve IPI. >> >> >> >> Correct, but the current series does a broadcast to all cores, where = the >> >> 1-1 approach is at most an IPI to a single core. >> >> >> >> 128+c machines are getting more common, and you have devices that you >> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, wh= en >> >> dealing with a per-core activity is a pretty noisy neighbor. >> > >> > Broadcast IPI in the current approach is only done upon MSI mask/unmask >> > operation. It is not done upon set_affinity() of interrupt handling. >> >> I'm aware. We're on the same page here. >> >> >> >> >> This could be fixed in the existing 1-n approach, by not require to s= ync >> >> the cores that are not handling the MSI in question. "Lazy disable" >> > >> > Incorrect. The approach you are suggesting involves an IPI upon every >> > irq_set_affinity(). This is because a HART can only enable it's own >> > MSI ID so when an IRQ is moved to from HART A to HART B with >> > a different ID X on HART B then we will need an IPI in irq_set_affinit= () >> > to enable ID X on HART B. >> >> Yes, the 1-1 approach will require an IPI to one target cpu on affinity >> changes, and similar on mask/unmask. >> >> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI >> broadcast to all cores on mask/unmask (not so nice). >> >> >> >> My concern is interrupts become a scarce resource with this >> >> >> implementation, but maybe my view is incorrect. I've seen bare-met= al >> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but ma= ybe >> >> >> that is considered "a lot of interrupts". >> >> >> >> >> >> As long as we don't get into scenarios where we're running out of >> >> >> interrupts, due to the software design. >> >> >> >> >> > >> >> > The current approach is simpler and ensures irq_set_affinity >> >> > always works. The limit of max 2047 IDs is sufficient for many >> >> > systems (if not all). >> >> >> >> Let me give you another view. On a 128c system each core has ~16 uniq= ue >> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048 >> >> network queue pairs for each PF. >> > >> > Clearly, this example is a hypothetical and represents a poorly >> > designed platform. >> > >> > Having just 16 IDs per-Core is a very poor design choice. In fact, the >> > Server SoC spec mandates a minimum 255 IDs. >> >> You are misreading. A 128c system with 2047 MSIs per-core, will only >> have 16 *per-core unique* (2047/128) interrupts with the current series. >> >> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c >> system with the maximum amount of MSIs possible in the spec, you'll end >> up with 16 *unique* interrupts per core. > > -ENOPARSE > > I don't see how this applies to the current approach because we treat > MSI ID space as global across cores so if a system has 2047 MSIs > per-core then we have 2047 MSIs across all cores. Ok, I'll try again! :-) Let's assume that each core in the 128c system has some per-core resources, say a two NIC queue pairs, and a storage queue pair. This will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace. If each core does this it'll be 6*128 MSI sources of the global namespace. The maximum number of "privates" MSI sources a core can utilize is 16. I'm trying (it's does seem to go that well ;-)) to point out that it's only 16 unique sources per core. For, say, a 256 core system it would be 8. 2047 MSI sources in a system is not much. Say that I want to spin up 24 NIC queues with one MSI each on each core on my 128c system. That's not possible with this series, while with an 1-1 system it wouldn't be an issue. Clearer, or still weird? > >> >> > Regarding NICs which support a large number of queues, the driver >> > will typically enable only one queue per-core and set the affinity to >> > separate cores. We have user-space data plane applications based >> > on DPDK which are capable of using a large number of NIC queues >> > but these applications are polling based and don't use MSIs. >> >> That's one sample point, and clearly not the only one. There are *many* >> different usage models. Just because you *assign* MSI, doesn't mean they >> are firing all the time. >> >> I can show you a couple of networking setups where this is clearly not >> enough. Each core has a large number of QoS queues, and each queue would >> very much like to have a dedicated MSI. >> >> >> > When we encounter a system requiring a large number of MSIs, >> >> > we can either: >> >> > 1) Extend the AIA spec to support greater than 2047 IDs >> >> > 2) Re-think the approach in the IMSIC driver >> >> > >> >> > The choice between #1 and #2 above depends on the >> >> > guarantees we want for irq_set_affinity(). >> >> >> >> The irq_set_affinity() behavior is better with this series, but I thi= nk >> >> the other downsides: number of available interrupt sources, and IPI >> >> broadcast are worse. >> > >> > The IPI overhead in the approach you are suggesting will be >> > even bad compared to the IPI overhead of the current approach >> > because we will end-up doing IPI upon every irq_set_affinity() >> > in the suggested approach compared to doing IPI upon every >> > mask/unmask in the current approach. >> >> Again, very workload dependent. >> >> This series does IPI broadcast on masking/unmasking, which means that >> cores that don't care get interrupted because, say, a network queue-pair >> is setup on another core. >> >> Some workloads never change the irq affinity. > > There are various events which irq affinity such as irq balance, > CPU hotplug, system suspend, etc. > > Also, the 1-1 approach does IPI upon set_affinity, mask and > unmask whereas the 1-n approach does IPI only upon mask > and unmask. An important distinction; When you say IPI on mask/unmask it is a broadcast IPI to *all* cores, which is pretty instrusive. The 1-1 variant does an IPI to a *one* target core. >> I'm just pointing out that there are pro/cons with both variants. >> >> > The biggest advantage of the current approach is a reliable >> > irq_set_affinity() which is a very valuable thing to have. >> >> ...and I'm arguing that we're paying a big price for that. >> >> > ARM systems easily support a large number of LPIs per-core. >> > For example, GIC-700 supports 56000 LPIs per-core. >> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-= GIC-700/Features) >> >> Yeah, but this is not the GIC. This is something that looks more like >> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0 >> spec, and many cores. > > Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying > is that there are systems with large number per-core interrupt IDs > for handling MSIs. Yes, and while that is nice, it's not what IMSIC is. Now, back to the weekend for real! ;-) (https://xkcd.com/386/) Bj=C3=B6rn