2017-06-14 05:59:17

by Zheng, Lv

[permalink] [raw]
Subject: [PATCH 1/3] ACPI: EC: Fix an EC event IRQ storming issue

The EC event IRQ (SCI_EVT) can only be handled by submitting QR_EC. As the
EC driver handles SCI_EVT in a workqueue, after SCI_EVT is flagged and
before QR_EC is submitted, there is a period risking IRQ storming. EC IRQ
must be masked for this period but linux EC driver never does so.

No end user notices the IRQ storming and no developer fixes this known
issue because:
1. the EC IRQ is always edge triggered GPE, and
2. the kernel can execute no-op EC IRQ handler very fast.
For edge triggered EC GPE platforms, it is only reported of post-resume EC
event lost issues, there won't be an IRQ storming. For level triggered EC
GPE platforms, fortunately the kernel is always fast enough to execute such
a no-op EC IRQ handler so that the IRQ handler won't be accumulated to
starve the task contexts, causing a real IRQ storming.

But the IRQ storming actually can still happen when:
1. the EC IRQ performs like level triggered GPE, and
2. the kernel EC debugging log is turned on but the console is slow enough.
There are more and more platforms using EC GPE as wake GPE where the EC GPE
is likely designed as level triggered. Then when EC debugging log is
enabled, the EC IRQ handler is no longer a no-op but dumps IRQ status to
the consoles. If the consoles are slow enough, the EC IRQs can arrive much
faster than executing the handler. Finally the accumulated EC event IRQ
handlers starve the task contexts, causing the IRQ storming to occur, and
the kernel hangs can be observed during boot/resume.

See link #1 for reference, however the bug link can only be accessed by
priviledged Intel users.

This patch fixes this issue by masking EC IRQ for this period:
1. begins when there is an SCI_EVT IRQ pending, and
2. ends when there is a QR_EC completed (SCI_EVT acknowledged).

Link: https://jira01.devtools.intel.com/browse/LCK-4004 [#1]
Tested-by: Wang Wendy <[email protected]>
Tested-by: Feng Chenzhou <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
---
drivers/acpi/ec.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index c24235d..30d7f82 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -459,8 +459,10 @@ static bool acpi_ec_submit_flushable_request(struct acpi_ec *ec)

static void acpi_ec_submit_query(struct acpi_ec *ec)
{
- if (acpi_ec_event_enabled(ec) &&
- !test_and_set_bit(EC_FLAGS_QUERY_PENDING, &ec->flags)) {
+ acpi_ec_set_storm(ec, EC_FLAGS_COMMAND_STORM);
+ if (!acpi_ec_event_enabled(ec))
+ return;
+ if (!test_and_set_bit(EC_FLAGS_QUERY_PENDING, &ec->flags)) {
ec_dbg_evt("Command(%s) submitted/blocked",
acpi_ec_cmd_string(ACPI_EC_COMMAND_QUERY));
ec->nr_pending_queries++;
@@ -470,11 +472,10 @@ static void acpi_ec_submit_query(struct acpi_ec *ec)

static void acpi_ec_complete_query(struct acpi_ec *ec)
{
- if (test_bit(EC_FLAGS_QUERY_PENDING, &ec->flags)) {
- clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ if (test_and_clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags))
ec_dbg_evt("Command(%s) unblocked",
acpi_ec_cmd_string(ACPI_EC_COMMAND_QUERY));
- }
+ acpi_ec_clear_storm(ec, EC_FLAGS_COMMAND_STORM);
}

static inline void __acpi_ec_enable_event(struct acpi_ec *ec)
--
2.7.4


2017-06-14 05:59:24

by Zheng, Lv

[permalink] [raw]
Subject: [PATCH 2/3] ACPI: EC: Fix EC command visibility for dynamic debug

acpi_ec_cmd_string() currently is only enabled for "DEBUG" macro, but users
trend to use CONFIG_DYNAMIC_DEBUG and enable ec.c pr_debug() print-outs by
"dyndbg='file ec.c +p'". In this use case, all command names are turned
into UNDEF and the log is confusing. This affects bugzilla triage work.

This patch fixes this issue by enabling acpi_ec_cmd_string() for
CONFIG_DYNAMIC_DEBUG.

Tested-by: Wang Wendy <[email protected]>
Tested-by: Feng Chenzhou <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
---
drivers/acpi/ec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index 30d7f82..f3ff591 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -316,7 +316,7 @@ static inline void acpi_ec_write_data(struct acpi_ec *ec, u8 data)
ec->timestamp = jiffies;
}

-#ifdef DEBUG
+#if defined(DEBUG) || defined(CONFIG_DYNAMIC_DEBUG)
static const char *acpi_ec_cmd_string(u8 cmd)
{
switch (cmd) {
--
2.7.4

2017-06-14 05:59:31

by Zheng, Lv

[permalink] [raw]
Subject: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior

According to the bug report, though the busy polling mode can make noirq
stages executed faster, it causes abnormal fan blowing in noirq stages.

This patch prepares an option so that the automatic busy polling mode
switching for noirq stages can be enabled by who wants to tune it, not all
users.
Noticed that the new global option cannot be changed during noirq stages.
There is no need to lock its value changes to sync with polling mode
settings switches.

For reporters and testers in the thread, as there are too many reporters
on the bug link, this patch only picks names from most active commenters.
Sorry for the neglet.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=191181
Reported-by: Tatsuyuki Ishi <[email protected]>
Reported-by: Claudio Sacerdoti Coen <[email protected]>
Tested-by: Nicolo' <[email protected]>
Reported-by: Jens Axboe <[email protected]>
Tested-by: Gjorgji Jankovski <[email protected]>
Tested-by: Damjan Georgievski <[email protected]>
Tested-by: Fernando Chaves <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
---
drivers/acpi/ec.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index f3ff591..de5dde6 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -151,6 +151,10 @@ static bool ec_freeze_events __read_mostly = true;
module_param(ec_freeze_events, bool, 0644);
MODULE_PARM_DESC(ec_freeze_events, "Disabling event handling during suspend/resume");

+static bool ec_tune_noirq __read_mostly = false;
+module_param(ec_tune_noirq, bool, 0644);
+MODULE_PARM_DESC(ec_tune_noirq, "Automatic enabling busy polling to tune noirq stages faster");
+
struct acpi_ec_query_handler {
struct list_head node;
acpi_ec_query_func func;
@@ -979,9 +983,11 @@ static void acpi_ec_enter_noirq(struct acpi_ec *ec)
unsigned long flags;

spin_lock_irqsave(&ec->lock, flags);
- ec->busy_polling = true;
- ec->polling_guard = 0;
- ec_log_drv("interrupt blocked");
+ if (ec_tune_noirq) {
+ ec->busy_polling = true;
+ ec->polling_guard = 0;
+ ec_log_drv("interrupt blocked");
+ }
spin_unlock_irqrestore(&ec->lock, flags);
}

@@ -990,9 +996,11 @@ static void acpi_ec_leave_noirq(struct acpi_ec *ec)
unsigned long flags;

spin_lock_irqsave(&ec->lock, flags);
- ec->busy_polling = ec_busy_polling;
- ec->polling_guard = ec_polling_guard;
- ec_log_drv("interrupt unblocked");
+ if (ec_tune_noirq) {
+ ec->busy_polling = ec_busy_polling;
+ ec->polling_guard = ec_polling_guard;
+ ec_log_drv("interrupt unblocked");
+ }
spin_unlock_irqrestore(&ec->lock, flags);
}

--
2.7.4

2017-06-28 21:30:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/3] ACPI: EC: Fix an EC event IRQ storming issue

On Wednesday, June 14, 2017 01:59:09 PM Lv Zheng wrote:
> The EC event IRQ (SCI_EVT) can only be handled by submitting QR_EC. As the
> EC driver handles SCI_EVT in a workqueue, after SCI_EVT is flagged and
> before QR_EC is submitted, there is a period risking IRQ storming. EC IRQ
> must be masked for this period but linux EC driver never does so.
>
> No end user notices the IRQ storming and no developer fixes this known
> issue because:
> 1. the EC IRQ is always edge triggered GPE, and
> 2. the kernel can execute no-op EC IRQ handler very fast.
> For edge triggered EC GPE platforms, it is only reported of post-resume EC
> event lost issues, there won't be an IRQ storming. For level triggered EC
> GPE platforms, fortunately the kernel is always fast enough to execute such
> a no-op EC IRQ handler so that the IRQ handler won't be accumulated to
> starve the task contexts, causing a real IRQ storming.
>
> But the IRQ storming actually can still happen when:
> 1. the EC IRQ performs like level triggered GPE, and
> 2. the kernel EC debugging log is turned on but the console is slow enough.
> There are more and more platforms using EC GPE as wake GPE where the EC GPE
> is likely designed as level triggered. Then when EC debugging log is
> enabled, the EC IRQ handler is no longer a no-op but dumps IRQ status to
> the consoles. If the consoles are slow enough, the EC IRQs can arrive much
> faster than executing the handler. Finally the accumulated EC event IRQ
> handlers starve the task contexts, causing the IRQ storming to occur, and
> the kernel hangs can be observed during boot/resume.
>
> See link #1 for reference, however the bug link can only be accessed by
> priviledged Intel users.
>
> This patch fixes this issue by masking EC IRQ for this period:
> 1. begins when there is an SCI_EVT IRQ pending, and
> 2. ends when there is a QR_EC completed (SCI_EVT acknowledged).
>
> Link: https://jira01.devtools.intel.com/browse/LCK-4004 [#1]
> Tested-by: Wang Wendy <[email protected]>
> Tested-by: Feng Chenzhou <[email protected]>
> Signed-off-by: Lv Zheng <[email protected]>

I've applied this and the [2/3], but I'm not sure about the [3/3].

I'll reply to that patch.

Thanks,
Rafael

2017-06-28 21:38:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior

On Wednesday, June 14, 2017 01:59:24 PM Lv Zheng wrote:
> According to the bug report, though the busy polling mode can make noirq
> stages executed faster, it causes abnormal fan blowing in noirq stages.
>
> This patch prepares an option so that the automatic busy polling mode
> switching for noirq stages can be enabled by who wants to tune it, not all
> users.
> Noticed that the new global option cannot be changed during noirq stages.
> There is no need to lock its value changes to sync with polling mode
> settings switches.
>
> For reporters and testers in the thread, as there are too many reporters
> on the bug link, this patch only picks names from most active commenters.
> Sorry for the neglet.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=191181
> Reported-by: Tatsuyuki Ishi <[email protected]>
> Reported-by: Claudio Sacerdoti Coen <[email protected]>
> Tested-by: Nicolo' <[email protected]>
> Reported-by: Jens Axboe <[email protected]>
> Tested-by: Gjorgji Jankovski <[email protected]>
> Tested-by: Damjan Georgievski <[email protected]>
> Tested-by: Fernando Chaves <[email protected]>
> Signed-off-by: Lv Zheng <[email protected]>

First of all, this seems to be a fix for commit c3a696b6e8f8 (ACPI / EC: Use busy polling
mode when GPE is not enabled), so there should be a Fixes: tag pointing to that
one.

Moreover, if that is just a performance optimization and not a matter of correctness,
why don't we simply drop acpi_ec_enter/leave_noirq() entirely?

What is going to break if we do that?

Thanks,
Rafael

2017-07-03 05:15:39

by Zheng, Lv

[permalink] [raw]
Subject: RE: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior

Hi, Rafael

> From: Rafael J. Wysocki [mailto:[email protected]]
> Subject: Re: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior
>
> On Wednesday, June 14, 2017 01:59:24 PM Lv Zheng wrote:
> > According to the bug report, though the busy polling mode can make noirq
> > stages executed faster, it causes abnormal fan blowing in noirq stages.
> >
> > This patch prepares an option so that the automatic busy polling mode
> > switching for noirq stages can be enabled by who wants to tune it, not all
> > users.
> > Noticed that the new global option cannot be changed during noirq stages.
> > There is no need to lock its value changes to sync with polling mode
> > settings switches.
> >
> > For reporters and testers in the thread, as there are too many reporters
> > on the bug link, this patch only picks names from most active commenters.
> > Sorry for the neglet.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=191181
> > Reported-by: Tatsuyuki Ishi <[email protected]>
> > Reported-by: Claudio Sacerdoti Coen <[email protected]>
> > Tested-by: Nicolo' <[email protected]>
> > Reported-by: Jens Axboe <[email protected]>
> > Tested-by: Gjorgji Jankovski <[email protected]>
> > Tested-by: Damjan Georgievski <[email protected]>
> > Tested-by: Fernando Chaves <[email protected]>
> > Signed-off-by: Lv Zheng <[email protected]>
>
> First of all, this seems to be a fix for commit c3a696b6e8f8 (ACPI / EC: Use busy polling
> mode when GPE is not enabled), so there should be a Fixes: tag pointing to that
> one.
>
> Moreover, if that is just a performance optimization and not a matter of correctness,
> why don't we simply drop acpi_ec_enter/leave_noirq() entirely?
>
> What is going to break if we do that?

Let me Cc Yu for justification.
I just added busy poll support for suspend/boot according to the root cause reported by him.
He should know the end user requirements better than me.

Thanks and best regards
Lv

2017-07-06 08:30:24

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior

Hi Lv,
On Mon, Jul 03, 2017 at 01:15:26PM +0800, Zheng, Lv wrote:
> Hi, Rafael
>
> > From: Rafael J. Wysocki [mailto:[email protected]]
> > Subject: Re: [PATCH 3/3] ACPI: EC: Change EC noirq tuning to be an optional behavior
> >
> > On Wednesday, June 14, 2017 01:59:24 PM Lv Zheng wrote:
> > > According to the bug report, though the busy polling mode can make noirq
> > > stages executed faster, it causes abnormal fan blowing in noirq stages.
> > >
> > > This patch prepares an option so that the automatic busy polling mode
> > > switching for noirq stages can be enabled by who wants to tune it, not all
> > > users.
> > > Noticed that the new global option cannot be changed during noirq stages.
> > > There is no need to lock its value changes to sync with polling mode
> > > settings switches.
> > >
> > > For reporters and testers in the thread, as there are too many reporters
> > > on the bug link, this patch only picks names from most active commenters.
> > > Sorry for the neglet.
> > >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=191181
> > > Reported-by: Tatsuyuki Ishi <[email protected]>
> > > Reported-by: Claudio Sacerdoti Coen <[email protected]>
> > > Tested-by: Nicolo' <[email protected]>
> > > Reported-by: Jens Axboe <[email protected]>
> > > Tested-by: Gjorgji Jankovski <[email protected]>
> > > Tested-by: Damjan Georgievski <[email protected]>
> > > Tested-by: Fernando Chaves <[email protected]>
> > > Signed-off-by: Lv Zheng <[email protected]>
> >
> > First of all, this seems to be a fix for commit c3a696b6e8f8 (ACPI / EC: Use busy polling
> > mode when GPE is not enabled), so there should be a Fixes: tag pointing to that
> > one.
> >
> > Moreover, if that is just a performance optimization and not a matter of correctness,
> > why don't we simply drop acpi_ec_enter/leave_noirq() entirely?
> >
> > What is going to break if we do that?
>
> Let me Cc Yu for justification.
> I just added busy poll support for suspend/boot according to the root cause reported by him.
> He should know the end user requirements better than me.
>
I remembered the original issue was reported against the slowness during suspend/resume
due to the ec is running with GPE disabled, and falls into schedule_timeout() loop.

If the busy polling mode is controlled by the boot options rather than
acpi_ec_enter/leave_noirq(), then user should be noticed of the result
that the cpu usage might be always higer if they enable the busy polling
mode.

Thanks,
Yu
> Thanks and best regards
> Lv