Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error
Date: Thu, 20 Apr 2023 18:01:15 +0200
From: Pierre Gondois
To: Yang Shi, Ionela Voinescu
Cc: viresh.kumar@linaro.org, scott@os.amperecomputing.com, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, "Rafael J. Wysocki"
In-Reply-To: <1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com>
References: <20230328193846.8757-1-yang@os.amperecomputing.com> <4bda6b02-cc50-fa47-c9b6-acda4cf201a8@arm.com> <195c95b2-f47c-f3d0-5663-97dd4c929ea4@arm.com> <3e239024-91d8-ea06-25a4-631496576319@os.amperecomputing.com> <7b57e680-0ba3-0b8b-851e-7cc369050386@os.amperecomputing.com> <1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com>

>> You say that the cause of this is congestion in the interconnect. I
>> don't see a way to check that right now.
>> However your trace is on CPU0, so maybe all the other cores were shut
>> down in your test. If this is the case, do you think congestion could
>> happen with only one CPU?
>
> No, other CPUs were not shut down in my test. I just ran "yes" on all
> cores except CPU 0, then ran the frequency-reading script. Since all
> the other cores are busy, the script should always run on CPU 0.
>
> Since the counters, memory and other devices are on the interconnect,
> the congestion may be caused by plenty of factors IIUC.

+Ionela

Ionela pointed me to the following patch-set, which seems related:
https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/

One thing that I believe we didn't check, and that Ionela pointed out,
is whether we are accessing the AMUs of the present CPU or of a remote
CPU. In the latter case there would be IPIs and possible delays in
waking up/accessing the remote CPU.
>
>> Just 2 other comments:
>> a-
>> It might be interesting to change the order in which the cpc registers
>> are read, just to see if it has an impact, but even if it has, I'm not
>> sure how this could be exploitable.
>> Just in case, I mean doing that, but I think that b. might be better
>> to try.
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..479b55006020 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                 }
>>         }
>>
>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, reference_reg, &reference);
>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>>         /*
>>
>> b-
>> In the trace that you shared, the cpc_read() calls in the first
>> cppc_get_perf_ctrs() call seem to always take a bit more time than in
>> the second cppc_get_perf_ctrs() call.
>> Would it be possible to collect traces similar to the above with 3 or
>> 4 calls to cppc_get_perf_ctrs() instead of 2? It would allow checking
>> whether accessing the cpc registers takes more time in the first call
>> than in the following calls, due to cache misses or other reasons.
>
> Cache miss? The counters should not be cached and reading the counters
> should not hit the cache IIUC.

Yes you are correct, what I said is completely wrong.

>
>> Ideally statistics on the result would be best, or if you have a
>> trace.dat to share containing a trace with multiple
>> cppc_cpufreq_get_rate() calls.
>
> Tried option b, I managed to get a histogram:
>
> @hist_first_ns[cat]:
> [4K, 8K)          112321 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)            212 |                                                    |
> [16K, 32K)            25 |                                                    |
> [32K, 64K)            59 |                                                    |
> [64K, 128K)            6 |                                                    |
> [128K, 256K)           9 |                                                    |
>
> @hist_second_ns[cat]:
> [2K, 4K)          112590 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)               4 |                                                    |
> [8K, 16K)              0 |                                                    |
> [16K, 32K)            15 |                                                    |
> [32K, 64K)            18 |                                                    |
> [64K, 128K)            1 |                                                    |
> [128K, 256K)           4 |                                                    |
>
> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
> script can't tell the second, the third and the fourth apart, so all of
> them are shown as "second". Anyway it seems fine. We can tell the first
> read took longer than the later ones most of the time.
>
> And a typical func_graph trace shows:
>
> # tracer: function_graph
> #
> #     TIME        CPU  DURATION                  FUNCTION CALLS
> #      |          |     |   |                     |   |   |   |
>  4447.171333 |     0)               |  cppc_cpufreq_get_rate [cppc_cpufreq]() {
>  4447.171334 |     0)               |    cpufreq_cpu_get() {
>  4447.171334 |     0)   1.060 us    |      _raw_read_lock_irqsave();
>  4447.171336 |     0)   0.560 us    |      _raw_read_unlock_irqrestore();
>  4447.171337 |     0)   3.480 us    |    }
>  4447.171338 |     0)   0.400 us    |    cpufreq_cpu_put();
>  4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171339 |     0)   0.720 us    |      cpc_read.isra.0();
>  4447.171341 |     0)   0.700 us    |      cpc_read.isra.0();
>  4447.171342 |     0)   0.380 us    |      cpc_read.isra.0();
>  4447.171342 |     0)   0.600 us    |      cpc_read.isra.0();
>  4447.171343 |     0)   4.900 us    |    }
>  4447.171344 |     0)               |    __delay() {
>  4447.171344 |     0)   0.540 us    |      arch_timer_evtstrm_available();
>  4447.171346 |     0)   2.420 us    |    }
>  4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171347 |     0)   0.540 us    |      cpc_read.isra.0();
>  4447.171348 |     0)   0.520 us    |      cpc_read.isra.0();
>  4447.171349 |     0)   0.400 us    |      cpc_read.isra.0();
>  4447.171350 |     0)   0.440 us    |      cpc_read.isra.0();
>  4447.171350 |     0)   3.660 us    |    }
>  4447.171351 |     0)               |    __delay() {
>  4447.171351 |     0)   0.400 us    |      arch_timer_evtstrm_available();
>  4447.171353 |     0)   2.400 us    |    }
>  4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171354 |     0)   0.540 us    |      cpc_read.isra.0();
>  4447.171355 |     0)   0.540 us    |      cpc_read.isra.0();
>  4447.171356 |     0)   0.380 us    |      cpc_read.isra.0();
>  4447.171356 |     0)   0.420 us    |      cpc_read.isra.0();
>  4447.171357 |     0)   3.640 us    |    }
>  4447.171357 |     0)               |    __delay() {
>  4447.171358 |     0)   0.380 us    |      arch_timer_evtstrm_available();
>  4447.171360 |     0)   2.380 us    |    }
>  4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>  4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>  4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>  4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>  4447.171364 |     0)   3.640 us    |    }
>  4447.171364 |     0)   0.520 us    |    cppc_cpufreq_perf_to_khz [cppc_cpufreq]();
>  4447.171365 |     0) + 34.240 us   |  }
>
> It also shows the first reading typically took longer than the later
> ones. The second, the third and the fourth actually took almost the
> same time.
>
> I also tried to read perf_fb_ctrs_t0 twice (so 3 reads in total: 2 for
> t0, 1 for t1, with a 2us delay between each read), but I didn't see a
> noticeable improvement. 4 reads (2 for t0, 2 for t1) does show some
> noticeable improvement.

Thanks for the new data.

>>
>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..6370f2f0bdad 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -853,6 +853,20 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>>
>>         udelay(2); /* 2usec delay between sampling */
>>
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a third call. */
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a fourth call. */
>>         ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>         if (ret)
>>                 return ret;
>
>>>
>>> The above trace data shows some cpc reads took a little bit longer
>>> than usual. I suspected it was caused by interconnect congestion.
>>>
>>> So it looks like IRQs are the major contributing factor of the high
>>> errors (4xxxxxx KHz), and interconnect congestion is the major
>>> contributing factor of the low errors (3xxxxxx KHz).
>>>
>>> So I did the below test:
>>> 1. Disable IRQ: the high errors were gone (> 3700000KHz), but low
>>>    errors were still seen.
>>> 2. 10us delay: the high errors were still seen.
>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>
>>> I think the test result also supports the tracing data.
>>>
>>> I also got some confusion about calling cppc_cpufreq_get_rate() with
>>> irq disabled. Rafael thought the 10ms delay is too long because the
>>> function may be called with irq disabled. But a deeper look at the
>>> function shows it should *NOT* be called with irq disabled at all.
>>>
>>> First, if the pcc channel is used, a cpc read may take over 100ms,
>>> which is way larger than the proposed 10ms delay.
>>> Second, reading from the pcc channel needs to take a semaphore, so it
>>> may sleep. But sleeping with IRQs disabled is not allowed.
>>
>> Yes right, however the semaphore is not taken in between the sequence
>> of cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>> should be acceptable:
>
> Yeah, we should be able to find a smaller irq disable section.
>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..105a7e2ffffa 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>         struct cppc_pcc_data *pcc_ss_data = NULL;
>>         u64 delivered, reference, ref_perf, ctr_wrap_time;
>>         int ret = 0, regs_in_pcc = 0;
>> +       unsigned long flags;
>>
>>         if (!cpc_desc) {
>>                 pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                 }
>>         }
>>
>> +       local_irq_save(flags);
>> +
>>         cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, reference_reg, &reference);
>>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>> +       local_irq_restore(flags);
>> +
>
> cpc_read_ffh() would return -EPERM if irq is disabled.
>
> So, the irq disabling must happen for mmio only in cpc_read(), for
> example:

I thought the issue was that irqs could happen in between the
cpc_read() calls; the patch below would not cover that. If the
frequency is more accurate with this patch, then I don't understand
something.

Just to check: the core/perf counters are AMUs and the other CPPC
registers are mmio, right? Is it possible to know which CPPC registers
are implemented on your platform? Also, is it possible to know which
platform you are using?
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..f3c92d844074 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct cpc_register_resource *reg_res, u64 *val)
>         void __iomem *vaddr = NULL;
>         int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>         struct cpc_reg *reg = &reg_res->cpc_entry.reg;
> +       unsigned long flags;
>
>         if (reg_res->type == ACPI_TYPE_INTEGER) {
>                 *val = reg_res->cpc_entry.int_value;
> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct cpc_register_resource *reg_res, u64 *val)
>                 return acpi_os_read_memory((acpi_physical_address)reg->address,
>                                 val, reg->bit_width);
>
> +       local_irq_save(flags);
>         switch (reg->bit_width) {
>         case 8:
>                 *val = readb_relaxed(vaddr);
> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct cpc_register_resource *reg_res, u64 *val)
>                 *val = readq_relaxed(vaddr);
>                 break;
>         default:
> +               local_irq_restore(flags);
>                 pr_debug("Error: Cannot read %u bit width from PCC for ss: %d\n",
>                          reg->bit_width, pcc_ss_id);
>                 return -EFAULT;
>         }
> +       local_irq_restore(flags);
>
>         return 0;
>  }
>
>>         /*
>>          * Per spec, if the ctr_wrap_time optional register is
>>          * unsupported, then the performance counters are assumed to
>>          * never wrap during the lifetime of
>>
>>> Third, if the counters are implemented by an AMU, cpc_read_ffh()
>>> needs to send an IPI, so it requires IRQs enabled.
>>
>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>> disabled; it should not prevent it from sending IPIs, no?
>
> It can't work with irq disabled.
> The comment in counters_read_on_cpu() says "Abort call on counterless
> CPU or when interrupts are disabled - can lead to deadlock in smp sync
> call."
>
> And it just returns -EPERM and raises a warning if irq is disabled.

Ok, right.

Regards,
Pierre