Hi all,
If ACPI receives an ejection request for an ACPI container, the kernel
emits a KOBJ_CHANGE uevent when it finds online child devices
below the ACPI container.
Based on the description of kernel patch caa73ea15, user space
is expected to offline all devices below the container and the
container itself. Then user space can finalize the removal of
the container with the help of its ACPI device object's eject
attribute in sysfs.
That means the kernel relies on user space to perform the offline
and ejection jobs for the ACPI container and its child devices. The
discussion is here:
https://lkml.org/lkml/2013/11/28/520
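To make the kernel's expectation concrete, the steps it leaves to user
space look roughly like the following minimal shell sketch. DEVPATH is
assumed to hold the device path delivered with the container's change
uevent (what the attached rule reads as $env{DEVPATH}); nothing else is
implied about the exact sysfs layout on a given platform.

  # DEVPATH: assumed to come from the container's KOBJ_CHANGE uevent
  DEV="/sys$DEVPATH"
  # offline every child device registered below the container
  for node in "$DEV"/firmware_node/*/physical_node*/online; do
      [ "$(cat "$node")" -eq 1 ] && echo 0 > "$node"
  done
  # offline the container itself, then finalize the removal through the
  # eject attribute of its ACPI device object
  echo 0 > "$DEV/online"
  echo 1 > "$DEV/firmware_node/eject"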
The mail thread didn't explain why user space is responsible for
the whole container offlining. Is it possible to do that transparently
from the kernel? What's the difference between offlining memory and
processors, which happens without any cleanup, and a container, which
does essentially the same thing except it happens all at once?
- After a couple of years, can we make the container hot-remove
  process transparent?
- Besides a udev rule, is there any other mechanism to trigger
  automatic offline/ejection?
The attached patch is a udev rule that performs the
offline/ejection jobs in user space. I want to send it to systemd.
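For anyone who wants to try it, a rough test sketch (the container
sysfs path in the last command is hypothetical; the real path shows up
in the uevent):

  # install the rule and reload udev
  cp rules/80-acpi-container-hotremove.rules /etc/udev/rules.d/
  udevadm control --reload
  # watch the change uevent the kernel emits for the container subsystem
  udevadm monitor --kernel --property --subsystem-match=container
  # inspect the attributes the rule relies on (online, firmware_node/eject)
  udevadm info --attribute-walk --path=/sys/devices/LNXSYSTM:00/ACPI0004:00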
Thanks a lot!
Joey Lee
From 6c95d4858e0e7c280e490491c1a00c1b5226a029 Mon Sep 17 00:00:00 2001
From: "Lee, Chun-Yi" <[email protected]>
Date: Mon, 26 Jun 2017 11:40:03 +0800
Subject: [PATCH] rules: handle the change event of ACPI container
Currently the kernel's ACPI code emits a KOBJ_CHANGE uevent when there
are online child devices below the ACPI container.
Based on the description of kernel patch caa73ea15, user space
is expected to offline all devices below the container and the
container itself. Then user space can finalize the removal of
the container with the help of its ACPI device object's eject
attribute in sysfs.
This udev rule can serve as a default user space handler to meet
the kernel's expectations. The rule walks through the sysfs tree
to trigger the offline of each child device and then ejects the
container.
The ACPI_CONTAINER_EJECT environment variable can be used to
turn off the container ejection logic if the ejection
will be triggered by other means, e.g. the BIOS or another user space
application.
Reference: https://lkml.org/lkml/2013/11/28/520
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: "Lee, Chun-Yi" <[email protected]>
---
rules/80-acpi-container-hotremove.rules | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
create mode 100644 rules/80-acpi-container-hotremove.rules
diff --git a/rules/80-acpi-container-hotremove.rules b/rules/80-acpi-container-hotremove.rules
new file mode 100644
index 000000000..ef4ceb5fb
--- /dev/null
+++ b/rules/80-acpi-container-hotremove.rules
@@ -0,0 +1,16 @@
+# do not edit this file, it will be overwritten on update
+
+SUBSYSTEM=="container", ACTION=="change", DEVPATH=="*/ACPI0004:??", ENV{ACPI_CONTAINER_EJECT}="1", \
+RUN+="/bin/sh -c ' \
+if [ $(cat /sys/$env{DEVPATH}/online) -eq 1 ]; then \
+ find -L /sys/$env{DEVPATH}/firmware_node/*/physical_node* -maxdepth 1 -name online | \
+ while read line; do \
+ if [ $(cat $line) -eq 1 ]; then \
+ /bin/echo 0 > $line; \
+ fi \
+ done; \
+ /bin/echo 0 > /sys/$env{DEVPATH}/online; \
+  if [ $env{ACPI_CONTAINER_EJECT} -eq 1 ] && [ $(cat /sys/$env{DEVPATH}/online) -eq 0 ]; then \
+ /bin/echo 1 > /sys/$env{DEVPATH}/firmware_node/eject; \
+ fi \
+fi'"
--
2.12.0
On Mon 26-06-17 14:26:57, Joey Lee wrote:
> Hi all,
>
> If ACPI received ejection request for a ACPI container, kernel
> emits KOBJ_CHANGE uevent when it found online children devices
> below the acpi container.
>
> Base on the description of caa73ea15 kernel patch, user space
> is expected to offline all devices below the container and the
> container itself. Then, user space can finalize the removal of
> the container with the help of its ACPI device object's eject
> attribute in sysfs.
>
> That means that kernel relies on users space to peform the offline
> and ejection jobs to acpi container and children devices. The
> discussion is here:
> https://lkml.org/lkml/2013/11/28/520
>
> The mail loop didn't explain why the userspace is responsible for
> the whole container offlining. Is it possible to do that transparently
> from the kernel? What's the difference between offlining memory and
> processors which happends without any cleanup and container which
> does essentially the same except it happens at once?
>
> - After a couple of years, can we let the container hot-remove
> process transparently?
> - Except udev rule, does there have any other mechanism to trigger
> auto offline/ejection?
I would be also interested whether the kernel can simply send an udev event
to all devices in the container.
--
Michal Hocko
SUSE Labs
On 06/26/2017 02:26 AM, joeyli wrote:
> Hi all,
>
> If ACPI received ejection request for a ACPI container, kernel
> emits KOBJ_CHANGE uevent when it found online children devices
> below the acpi container.
>
> Base on the description of caa73ea15 kernel patch, user space
> is expected to offline all devices below the container and the
> container itself. Then, user space can finalize the removal of
> the container with the help of its ACPI device object's eject
> attribute in sysfs.
>
> That means that kernel relies on users space to peform the offline
> and ejection jobs to acpi container and children devices. The
> discussion is here:
> https://lkml.org/lkml/2013/11/28/520
>
> The mail loop didn't explain why the userspace is responsible for
> the whole container offlining. Is it possible to do that transparently
> from the kernel? What's the difference between offlining memory and
> processors which happends without any cleanup and container which
> does essentially the same except it happens at once?
We don't know which devices are mounted on the container device. I think
the devices mounted on a container differ between vendors' servers.
If a memory device is mounted on the container, memory offline easily fails.
Other devices may have other concerns. So the udev rule you
wrote below does not work correctly.
I think we need to adapt the offline processing for each device. So currently
user space is responsible for the whole container offlining.
Thanks,
Yasuaki Ishimatsu
>
> - After a couple of years, can we let the container hot-remove
> process transparently?
> - Except udev rule, does there have any other mechanism to trigger
> auto offline/ejection?
>
> The attached patch is a udev rule that it's used to perform the
> offlien/ejection jobs on user space. I want to send it to systemd.
>
> Thanks a lot!
> Joey Lee
>
> From 6c95d4858e0e7c280e490491c1a00c1b5226a029 Mon Sep 17 00:00:00 2001
> From: "Lee, Chun-Yi" <[email protected]>
> Date: Mon, 26 Jun 2017 11:40:03 +0800
> Subject: [PATCH] rules: handle the change event of ACPI container
>
> Currently the ACPI in kernel emits KOBJ_CHANGE uevent when there
> have online children devices below the acpi container.
>
> Base on the description of caa73ea15 kernel patch, user space
> is expected to offline all devices below the container and the
> container itself. Then, user space can finalize the removal of
> the container with the help of its ACPI device object's eject
> attribute in sysfs.
>
> This udev rule can be a default user space application to meet
> kernel's expectations. This rule walks through the sysfs tree
> to trigger the offline of each child device then ejects the
> container.
>
> The ACPI_CONTAINER_EJECT environoment variable can be used to
> turn off the the ejection logic of container if the ejection
> will be triggered by other ways, e.g. BIOS or other user space
> application.
>
> Reference: https://lkml.org/lkml/2013/11/28/520
> Cc: Yasuaki Ishimatsu <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: "Lee, Chun-Yi" <[email protected]>
> ---
> rules/80-acpi-container-hotremove.rules | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
> create mode 100644 rules/80-acpi-container-hotremove.rules
>
> diff --git a/rules/80-acpi-container-hotremove.rules b/rules/80-acpi-container-hotremove.rules
> new file mode 100644
> index 000000000..ef4ceb5fb
> --- /dev/null
> +++ b/rules/80-acpi-container-hotremove.rules
> @@ -0,0 +1,16 @@
> +# do not edit this file, it will be overwritten on update
> +
> +SUBSYSTEM=="container", ACTION=="change", DEVPATH=="*/ACPI0004:??", ENV{ACPI_CONTAINER_EJECT}="1"\
> +RUN+="/bin/sh -c ' \
> +if [ $(cat /sys/$env{DEVPATH}/online) -eq 1 ]; then \
> + find -L /sys/$env{DEVPATH}/firmware_node/*/physical_node* -maxdepth 1 -name online | \
> + while read line; do \
> + if [ $(cat $line) -eq 1 ]; then \
> + /bin/echo 0 > $line; \
> + fi \
> + done; \
> + /bin/echo 0 > /sys/$env{DEVPATH}/online; \
> + if [$env{ACPI_CONTAINER_EJECT} -eq 1] && [ $(cat /sys/$env{DEVPATH}/online) -eq 0 ]; then \
> + /bin/echo 1 > /sys/$env{DEVPATH}/firmware_node/eject; \
> + fi \
> +fi
>
Hi YASUAKI,
Thanks for your response.
On Wed, Jun 28, 2017 at 03:53:16PM -0400, YASUAKI ISHIMATSU wrote:
>
> On 06/26/2017 02:26 AM, joeyli wrote:
> > Hi all,
> >
> > If ACPI received ejection request for a ACPI container, kernel
> > emits KOBJ_CHANGE uevent when it found online children devices
> > below the acpi container.
> >
> > Base on the description of caa73ea15 kernel patch, user space
> > is expected to offline all devices below the container and the
> > container itself. Then, user space can finalize the removal of
> > the container with the help of its ACPI device object's eject
> > attribute in sysfs.
> >
> > That means that kernel relies on users space to peform the offline
> > and ejection jobs to acpi container and children devices. The
> > discussion is here:
> > https://lkml.org/lkml/2013/11/28/520
> >
> > The mail loop didn't explain why the userspace is responsible for
> > the whole container offlining. Is it possible to do that transparently
> > from the kernel? What's the difference between offlining memory and
> > processors which happends without any cleanup and container which
> > does essentially the same except it happens at once?
>
> We don't know what devices mount on the container device. I think
> devices mount on the container device are different each vendor's server.
>
> If memory device mounts on the container, memory offline easily fails.
> Other devices may have other concerns. So the following udev rule you
> write does not work correctly.
>
IMHO, if memory hot-remove (offline/ejection) has a problem, then we
should report the issue and fix it in the mm subsystem. Michal Hocko works
hard on this. I think the CPU and IO subsystems are the same.
The current kernel cannot complete the container hot-remove job without
user space's involvement. So I sent the udev rule as an example of how to
respond to the change uevents.
> I think we need to change offline processing for each device. So currently
> the userspace is responsible for the whole container offlining.
>
It depends on what the expectation of the device offline function
in the kernel is. If a subsystem supports offline in the kernel, then it
should not affect running user space applications. Otherwise the issue
should be fixed.
Thanks a lot!
Joey Lee
On Mon 26-06-17 10:59:07, Michal Hocko wrote:
> On Mon 26-06-17 14:26:57, Joey Lee wrote:
> > Hi all,
> >
> > If ACPI received ejection request for a ACPI container, kernel
> > emits KOBJ_CHANGE uevent when it found online children devices
> > below the acpi container.
> >
> > Base on the description of caa73ea15 kernel patch, user space
> > is expected to offline all devices below the container and the
> > container itself. Then, user space can finalize the removal of
> > the container with the help of its ACPI device object's eject
> > attribute in sysfs.
> >
> > That means that kernel relies on users space to peform the offline
> > and ejection jobs to acpi container and children devices. The
> > discussion is here:
> > https://lkml.org/lkml/2013/11/28/520
> >
> > The mail loop didn't explain why the userspace is responsible for
> > the whole container offlining. Is it possible to do that transparently
> > from the kernel? What's the difference between offlining memory and
> > processors which happends without any cleanup and container which
> > does essentially the same except it happens at once?
> >
> > - After a couple of years, can we let the container hot-remove
> > process transparently?
> > - Except udev rule, does there have any other mechanism to trigger
> > auto offline/ejection?
>
> I would be also interested whether the kernel can simply send an udev event
> to all devices in the container.
Any opinion on this?
--
Michal Hocko
SUSE Labs
Hi Michal,
Sorry for my delayed reply.
On Tue, Jul 11, 2017 at 10:25:32AM +0200, Michal Hocko wrote:
> On Mon 26-06-17 10:59:07, Michal Hocko wrote:
> > On Mon 26-06-17 14:26:57, Joey Lee wrote:
> > > Hi all,
> > >
> > > If ACPI received ejection request for a ACPI container, kernel
> > > emits KOBJ_CHANGE uevent when it found online children devices
> > > below the acpi container.
> > >
> > > Base on the description of caa73ea15 kernel patch, user space
> > > is expected to offline all devices below the container and the
> > > container itself. Then, user space can finalize the removal of
> > > the container with the help of its ACPI device object's eject
> > > attribute in sysfs.
> > >
> > > That means that kernel relies on users space to peform the offline
> > > and ejection jobs to acpi container and children devices. The
> > > discussion is here:
> > > https://lkml.org/lkml/2013/11/28/520
> > >
> > > The mail loop didn't explain why the userspace is responsible for
> > > the whole container offlining. Is it possible to do that transparently
> > > from the kernel? What's the difference between offlining memory and
> > > processors which happends without any cleanup and container which
> > > does essentially the same except it happens at once?
> > >
> > > - After a couple of years, can we let the container hot-remove
> > > process transparently?
> > > - Except udev rule, does there have any other mechanism to trigger
> > > auto offline/ejection?
> >
> > I would be also interested whether the kernel can simply send an udev event
> > to all devices in the container.
>
> Any opinion on this?
If the BIOS emits an ejection event for an ACPI0004 container, someone needs
to handle the offline/eject jobs for the container, either the kernel or user
space.
Sending uevents only to the individual child devices can simplify the udev
rule, but it also means that the kernel needs to offline/eject the container
after all child devices are offlined. Maybe an ejection flag could be added
to the ACPI0004 object to indicate the container state. But if userland
doesn't do its job, then when to reset the flag becomes a problem.
Thanks a lot!
Joey Lee
On Thu 13-07-17 14:58:06, Joey Lee wrote:
> Hi Michal,
>
> Sorry for my delay.
>
> On Tue, Jul 11, 2017 at 10:25:32AM +0200, Michal Hocko wrote:
> > On Mon 26-06-17 10:59:07, Michal Hocko wrote:
> > > On Mon 26-06-17 14:26:57, Joey Lee wrote:
> > > > Hi all,
> > > >
> > > > If ACPI received ejection request for a ACPI container, kernel
> > > > emits KOBJ_CHANGE uevent when it found online children devices
> > > > below the acpi container.
> > > >
> > > > Base on the description of caa73ea15 kernel patch, user space
> > > > is expected to offline all devices below the container and the
> > > > container itself. Then, user space can finalize the removal of
> > > > the container with the help of its ACPI device object's eject
> > > > attribute in sysfs.
> > > >
> > > > That means that kernel relies on users space to peform the offline
> > > > and ejection jobs to acpi container and children devices. The
> > > > discussion is here:
> > > > https://lkml.org/lkml/2013/11/28/520
> > > >
> > > > The mail loop didn't explain why the userspace is responsible for
> > > > the whole container offlining. Is it possible to do that transparently
> > > > from the kernel? What's the difference between offlining memory and
> > > > processors which happends without any cleanup and container which
> > > > does essentially the same except it happens at once?
> > > >
> > > > - After a couple of years, can we let the container hot-remove
> > > > process transparently?
> > > > - Except udev rule, does there have any other mechanism to trigger
> > > > auto offline/ejection?
> > >
> > > I would be also interested whether the kernel can simply send an udev event
> > > to all devices in the container.
> >
> > Any opinion on this?
>
> If BIOS emits ejection event for a ACPI0004 container, someone needs
> to handle the offline/eject jobs of container. Either kernel or user
> space.
>
> Only sending uevent to individual child device can simplify udev rule,
> but it also means that the kernel needs to offline/eject container
> after all children devices are offlined.
Why cannot the kernel send this eject command to the BIOS if the whole
container is offline? If it is not, then the kernel would send EBUSY to
the BIOS and the BIOS would have to retry after some timeout. Or is the
problem that currently implemented BIOS firmwares do not implement this
retry?
--
Michal Hocko
SUSE Labs
On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> On Thu 13-07-17 14:58:06, Joey Lee wrote:
> > Hi Michal,
> >
> > Sorry for my delay.
> >
> > On Tue, Jul 11, 2017 at 10:25:32AM +0200, Michal Hocko wrote:
> > > On Mon 26-06-17 10:59:07, Michal Hocko wrote:
> > > > On Mon 26-06-17 14:26:57, Joey Lee wrote:
> > > > > Hi all,
> > > > >
> > > > > If ACPI received ejection request for a ACPI container, kernel
> > > > > emits KOBJ_CHANGE uevent when it found online children devices
> > > > > below the acpi container.
> > > > >
> > > > > Base on the description of caa73ea15 kernel patch, user space
> > > > > is expected to offline all devices below the container and the
> > > > > container itself. Then, user space can finalize the removal of
> > > > > the container with the help of its ACPI device object's eject
> > > > > attribute in sysfs.
> > > > >
> > > > > That means that kernel relies on users space to peform the offline
> > > > > and ejection jobs to acpi container and children devices. The
> > > > > discussion is here:
> > > > > https://lkml.org/lkml/2013/11/28/520
> > > > >
> > > > > The mail loop didn't explain why the userspace is responsible for
> > > > > the whole container offlining. Is it possible to do that transparently
> > > > > from the kernel? What's the difference between offlining memory and
> > > > > processors which happends without any cleanup and container which
> > > > > does essentially the same except it happens at once?
> > > > >
> > > > > - After a couple of years, can we let the container hot-remove
> > > > > process transparently?
> > > > > - Except udev rule, does there have any other mechanism to trigger
> > > > > auto offline/ejection?
> > > >
> > > > I would be also interested whether the kernel can simply send an udev event
> > > > to all devices in the container.
> > >
> > > Any opinion on this?
> >
> > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > to handle the offline/eject jobs of container. Either kernel or user
> > space.
> >
> > Only sending uevent to individual child device can simplify udev rule,
> > but it also means that the kernel needs to offline/eject container
> > after all children devices are offlined.
>
> Why cannot kernel send this eject command to the BIOS if the whole
> container is offline? If it is not then the kernel would send EBUSY to
The current kernel container hot-remove process is:
BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
Then the kernel just calls _OST to expose the state to the BIOS, and the
process stops there. The kernel doesn't wait for userland to offline each
child device. Either the BIOS or userland needs to trigger the container
ejection.
> container is offline? If it is not then the kernel would send EBUSY to
> the BIOS and BIOS would have to retry after some timeout. Or is it a
The d429e5c122 patch is merged to mainline. So the kernel will send
DEVICE_BUSY to the BIOS after it emits the uevent to userland. The BIOS can
choose to apply the retry approach until the OS explicitly returns a process
failure or the BIOS times out.
> problem that currently implemented BIOS firmwares do not implement this
> retry?
Yes, we should consider the behavior of old BIOSes. An old BIOS doesn't
retry/resend the ejection event, so the kernel or userland needs to take over
the retry job. Obviously userland has run the retry since the caa73ea15 patch
was merged.
IMHO there are two different expectations from user space applications.
Applications like a DVD player or burner expect the kernel to
inform user space of the ejection, so the application can do its cleanup
and re-trigger the ejection from userland.
But some other applications, like databases, don't want their service to be
stopped when the devices are offlined/ejected. The hot-remove should be done
by the kernel transparently.
We need a way to fulfill both situations.
Thanks a lot!
Joey Lee
On Thu 13-07-17 20:45:21, Joey Lee wrote:
> On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> > On Thu 13-07-17 14:58:06, Joey Lee wrote:
[...]
> > > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > > to handle the offline/eject jobs of container. Either kernel or user
> > > space.
> > >
> > > Only sending uevent to individual child device can simplify udev rule,
> > > but it also means that the kernel needs to offline/eject container
> > > after all children devices are offlined.
> >
> > Why cannot kernel send this eject command to the BIOS if the whole
> > container is offline? If it is not then the kernel would send EBUSY to
>
> Current kernel container hot-remove process:
>
> BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
>
> Then, kernel just calls _OST to expose state to BIOS, then process is
> stopped. Kernel doesn't wait there for userland to offline each child
> devices. Either BIOS or userland needs to trigger the container
> ejection.
>
> > container is offline? If it is not then the kernel would send EBUSY to
> > the BIOS and BIOS would have to retry after some timeout. Or is it a
>
> The d429e5c122 patch is merged to mainline. So kernel will send
> DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
> to apply the retry approach until OS returns process failure exactly or
> BIOS timeout.
>
> > problem that currently implemented BIOS firmwares do not implement this
> > retry?
>
> Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
> retry/resend the ejection event. So kernel or userland need to take the
> retry job. Obviously userland runs the retry since the caa73ea15 patch
> is merged.
>
> IMHO there have two different expectation from user space application.
>
> Applications like DVD player or Burner expect that kernel should
> info userspace for the ejection, then application can do their cleaning
> job and re-trigger ejection from userland.
I am not sure I understand the DVD example because I do not see how it
fits into the container and online/offline scenario.
> But, some other applications like database don't want that their service
> be stopped when the devices offline/eject. The hot-remove sholud be done by
> kernel transparently.
>
> We need a way for fill two situations.
Hmm, so can we trigger the eject from the _kernel_ when the last child
is offlined?
--
Michal Hocko
SUSE Labs
On Fri, Jul 14, 2017 at 10:37:13AM +0200, Michal Hocko wrote:
> On Thu 13-07-17 20:45:21, Joey Lee wrote:
> > On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> > > On Thu 13-07-17 14:58:06, Joey Lee wrote:
> [...]
> > > > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > > > to handle the offline/eject jobs of container. Either kernel or user
> > > > space.
> > > >
> > > > Only sending uevent to individual child device can simplify udev rule,
> > > > but it also means that the kernel needs to offline/eject container
> > > > after all children devices are offlined.
> > >
> > > Why cannot kernel send this eject command to the BIOS if the whole
> > > container is offline? If it is not then the kernel would send EBUSY to
> >
> > Current kernel container hot-remove process:
> >
> > BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
> >
> > Then, kernel just calls _OST to expose state to BIOS, then process is
> > stopped. Kernel doesn't wait there for userland to offline each child
> > devices. Either BIOS or userland needs to trigger the container
> > ejection.
> >
> > > container is offline? If it is not then the kernel would send EBUSY to
> > > the BIOS and BIOS would have to retry after some timeout. Or is it a
> >
> > The d429e5c122 patch is merged to mainline. So kernel will send
> > DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
> > to apply the retry approach until OS returns process failure exactly or
> > BIOS timeout.
> >
> > > problem that currently implemented BIOS firmwares do not implement this
> > > retry?
> >
> > Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
> > retry/resend the ejection event. So kernel or userland need to take the
> > retry job. Obviously userland runs the retry since the caa73ea15 patch
> > is merged.
> >
> > IMHO there have two different expectation from user space application.
> >
> > Applications like DVD player or Burner expect that kernel should
> > info userspace for the ejection, then application can do their cleaning
> > job and re-trigger ejection from userland.
>
> I am not sure I understand the DVD example because I do not see how it
> fits into the container and online/offline scenario.
>
At least Yasuaki requested similar behavior for containers in 2013.
It's similar to the DVD player case: a user space application needs
to do something and then trigger the offline of the children and the
ejection of the container.
Based on Yasuaki's explanation, the reason he requested the
userland ejection approach is that he hit a memory hot-remove problem
in 2013. Maybe his problem is already fixed by your patches in the current
mainline.
Hi Yasuaki, could you please check whether your memory hot-remove problem
is fixed in the mainline kernel?
If Yasuaki's issue is already fixed, then we should consider letting the
kernel do the container hot-remove transparently.
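As a rough sketch of how the memory part could be re-tested on a
mainline kernel through the standard memory hotplug sysfs interface
(the memory block number below is only a placeholder):

  BLOCK=/sys/devices/system/memory/memory32   # hypothetical block number
  cat "$BLOCK/removable"          # 1 means the block should be removable
  echo offline > "$BLOCK/state"   # fails (e.g. EBUSY) if pages cannot be migrated
  cat "$BLOCK/state"              # reads "offline" on success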
> > But, some other applications like database don't want that their service
> > be stopped when the devices offline/eject. The hot-remove sholud be done by
> > kernel transparently.
> >
> > We need a way for fill two situations.
>
> Hmm, so can we trigger the eject from the _kernel_ when the last child
> is offlined?
The kernel needs to remember that the container is in an _EJECTION_ state
in which it should wait for all children to be offlined. Then the kernel
checks the container offline state whenever an individual device is offlined.
If the kernel finds the container offlined (meaning that all children are
offlined) and the container is in the ejection state, then the kernel runs
the ejection jobs (removing objects and calling _EJ0).
To achieve this, I think the container object needs an _EJECTION_
flag. It helps the kernel remember the state that was set by the BIOS's
ejection event.
This approach has some problems: if userland doesn't finish its offline
jobs, or doesn't do anything at all, when should the kernel clear the
ejection flag and report failure via _OST to the BIOS?
And a new BIOS has a timeout mechanism, but currently there is
no way for the BIOS to tell the kernel that it has given up. It's hard to
sync the kernel container's ejection flag with the BIOS.
Of course the best outcome is that Yasuaki's problem got fixed and the kernel
hot-removes the container transparently (again). Then we don't need
to worry about how to maintain an ejection state in the kernel.
Thanks a lot!
Joey Lee
On Fri 14-07-17 22:44:14, Joey Lee wrote:
> On Fri, Jul 14, 2017 at 10:37:13AM +0200, Michal Hocko wrote:
> > On Thu 13-07-17 20:45:21, Joey Lee wrote:
> > > On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> > > > On Thu 13-07-17 14:58:06, Joey Lee wrote:
> > [...]
> > > > > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > > > > to handle the offline/eject jobs of container. Either kernel or user
> > > > > space.
> > > > >
> > > > > Only sending uevent to individual child device can simplify udev rule,
> > > > > but it also means that the kernel needs to offline/eject container
> > > > > after all children devices are offlined.
> > > >
> > > > Why cannot kernel send this eject command to the BIOS if the whole
> > > > container is offline? If it is not then the kernel would send EBUSY to
> > >
> > > Current kernel container hot-remove process:
> > >
> > > BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
> > >
> > > Then, kernel just calls _OST to expose state to BIOS, then process is
> > > stopped. Kernel doesn't wait there for userland to offline each child
> > > devices. Either BIOS or userland needs to trigger the container
> > > ejection.
> > >
> > > > container is offline? If it is not then the kernel would send EBUSY to
> > > > the BIOS and BIOS would have to retry after some timeout. Or is it a
> > >
> > > The d429e5c122 patch is merged to mainline. So kernel will send
> > > DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
> > > to apply the retry approach until OS returns process failure exactly or
> > > BIOS timeout.
> > >
> > > > problem that currently implemented BIOS firmwares do not implement this
> > > > retry?
> > >
> > > Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
> > > retry/resend the ejection event. So kernel or userland need to take the
> > > retry job. Obviously userland runs the retry since the caa73ea15 patch
> > > is merged.
> > >
> > > IMHO there have two different expectation from user space application.
> > >
> > > Applications like DVD player or Burner expect that kernel should
> > > info userspace for the ejection, then application can do their cleaning
> > > job and re-trigger ejection from userland.
> >
> > I am not sure I understand the DVD example because I do not see how it
> > fits into the container and online/offline scenario.
> >
>
> At least Yasuaki raised similar behavior for container in 2013.
> It's similar to the DVD player case, user space application needs
> to do something then trigger children offline and ejection of
> container.
The problem I have with this expectation is that userspace will never
have a good atomic view of the whole container. So it can only try to
eject and then hope that nobody has onlined part of the container.
If you emit offline events to userspace, the cleanup can be done, and
after the last component goes offline the eject can be done
atomically.
[...]
> > Hmm, so can we trigger the eject from the _kernel_ when the last child
> > is offlined?
>
> Kernel needs to remember that the container is under a _EJECTION_ state
> that it should waits all children be offlined. Then kernel checks the
> container offline state when each individual device is offlined. If
> kernel found a container offlined (means that all children are offlined),
> and the container is under ejection state, then kernel runs ejection
> jobs (removing objects and calls _EJ0).
yes, that is what I meant.
> To achieve this, I think that the container object needs a _EJECTION_
> flag. It helps kernel to remember the state that it set by BIOS's
> ejection event.
yes something like that.
> This approach has some problems: If userland doesn't finish his offline
> jobs or userland doesn't do anything, when should kernel clears the
> ejection flag and responses failure by _OST to BIOS?
I do not see how that is any different from the current approach. You
still cannot do the eject if some component is online, and we rely on
userspace to do the offline.
> And, for new BIOS that it has time out mechanism. Currently there have
> no way for BIOS to tell kernel that it gives up. It's hard to sync the
> kernel container's ejection flag with BIOS.
I am not sure I understand. The kernel/BIOS synchronization happens on
the up/down calls between the platform and the kernel...
--
Michal Hocko
SUSE Labs
On Mon, Jul 17, 2017 at 11:05:25AM +0200, Michal Hocko wrote:
> On Fri 14-07-17 22:44:14, Joey Lee wrote:
> > On Fri, Jul 14, 2017 at 10:37:13AM +0200, Michal Hocko wrote:
> > > On Thu 13-07-17 20:45:21, Joey Lee wrote:
> > > > On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> > > > > On Thu 13-07-17 14:58:06, Joey Lee wrote:
> > > [...]
> > > > > > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > > > > > to handle the offline/eject jobs of container. Either kernel or user
> > > > > > space.
> > > > > >
> > > > > > Only sending uevent to individual child device can simplify udev rule,
> > > > > > but it also means that the kernel needs to offline/eject container
> > > > > > after all children devices are offlined.
> > > > >
> > > > > Why cannot kernel send this eject command to the BIOS if the whole
> > > > > container is offline? If it is not then the kernel would send EBUSY to
> > > >
> > > > Current kernel container hot-remove process:
> > > >
> > > > BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
> > > >
> > > > Then, kernel just calls _OST to expose state to BIOS, then process is
> > > > stopped. Kernel doesn't wait there for userland to offline each child
> > > > devices. Either BIOS or userland needs to trigger the container
> > > > ejection.
> > > >
> > > > > container is offline? If it is not then the kernel would send EBUSY to
> > > > > the BIOS and BIOS would have to retry after some timeout. Or is it a
> > > >
> > > > The d429e5c122 patch is merged to mainline. So kernel will send
> > > > DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
> > > > to apply the retry approach until OS returns process failure exactly or
> > > > BIOS timeout.
> > > >
> > > > > problem that currently implemented BIOS firmwares do not implement this
> > > > > retry?
> > > >
> > > > Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
> > > > retry/resend the ejection event. So kernel or userland need to take the
> > > > retry job. Obviously userland runs the retry since the caa73ea15 patch
> > > > is merged.
> > > >
> > > > IMHO there have two different expectation from user space application.
> > > >
> > > > Applications like DVD player or Burner expect that kernel should
> > > > info userspace for the ejection, then application can do their cleaning
> > > > job and re-trigger ejection from userland.
> > >
> > > I am not sure I understand the DVD example because I do not see how it
> > > fits into the container and online/offline scenario.
> > >
> >
> > At least Yasuaki raised similar behavior for container in 2013.
> > It's similar to the DVD player case, user space application needs
> > to do something then trigger children offline and ejection of
> > container.
>
> The problem I have with this expectation is that userspace will never
> have a good atomic view of the whole container. So it can only try to
I agree!
Even if a userspace application can handle part of the offline jobs, it's
still possible that other kernel/userland components are using the
resources in the container.
> eject and then hope that nobody has onlined part of the container.
> If you emit offline event to the userspace the cleanup can be done and
> after the last component goes offline then the eject can be done
> atomically.
The thing we haven't aligned on is how the kernel maintains the flag
for the ejection state on the container.
>
> [...]
> > > Hmm, so can we trigger the eject from the _kernel_ when the last child
> > > is offlined?
> >
> > Kernel needs to remember that the container is under a _EJECTION_ state
> > that it should waits all children be offlined. Then kernel checks the
> > container offline state when each individual device is offlined. If
> > kernel found a container offlined (means that all children are offlined),
> > and the container is under ejection state, then kernel runs ejection
> > jobs (removing objects and calls _EJ0).
>
> yes, that is what I meant.
>
> > To achieve this, I think that the container object needs a _EJECTION_
> > flag. It helps kernel to remember the state that it set by BIOS's
> > ejection event.
>
> yes something like that.
>
Good!
> > This approach has some problems: If userland doesn't finish his offline
> > jobs or userland doesn't do anything, when should kernel clears the
> > ejection flag and responses failure by _OST to BIOS?
>
> I do not see how is that any different from the current approach. You
> still cannot do the eject if some component is online and we rely on the
> userspace to do the offline.
I see. I agree with you that it's no different from the current approach.
But I'm still concerned about how to maintain the ejection state flag in the
kernel.
>
> > And, for new BIOS that it has time out mechanism. Currently there have
> > no way for BIOS to tell kernel that it gives up. It's hard to sync the
> > kernel container's ejection flag with BIOS.
>
> I am not sure I understand. The kernel/BIOS synchronization happens on
> the up/down calls between the platform and the kernel...
No! The container hot-remove process does not synchronize. A new BIOS
can retry until it times out, so the kernel doesn't need to keep the
ejection state of the container:
retry BIOS Kernel Userspace
|-----SCI------>|----uevent---->|
| | +----+
|<--_OST(Busy)--| | |
|----+ | | clean up
| | | | |
| Wait |<---offline----+<---+
| Wait |<---offline----+<---+
| Wait |<---offline----+<---+
| Wait |<---offline----+<---+
| | | |
| | | |
|<---+ | |
|---retry SCI-->|----+ |
| | | |
| | check all |
| | offlined |
| | | |
| +<---+ |
| +----+ |
| | | |
| | ejection |
| | | |
|<----_EJ0------+<---+ |
| | |
| | |
Time out
But an old BIOS, or a non-retrying BIOS, doesn't resend the
SCI to the kernel, so the kernel will not run the ejection and _EJ0:
old BIOS Kernel Userspace
|-----SCI------>|----uevent---->|
| | +----+
|<--_OST(Busy)--| | |
|----+ | | clean up
| | | | |
| failure |<---offline----+<---+
| | |<---offline----+<---+
|<---+ |<---offline----+<---+
Stop |<---offline----+<---+
| |
| |
No retry SCI!--->|----+ |
| | |
| check all |
| offlined?? |
| | |
+<---+ |
+----+ |
| | |
| ejection?? |
| | |
<----_EJ0??----+<---+ |
| |
So, the kernel needs to keep an ejection state flag that
indicates that a BIOS SCI has triggered the container ejection:
BIOS Kernel Userspace
|-----SCI------>|----+event---->|
| | | |
| | set [eject] |
| | flag |
| | | |
| |<---+ |
| |----uevent---->|
| | +----+
|<--_OST(Busy)--| | |
|----+ | | clean up
| | | | |
| failure |<---offline----+<---+
| | |<---offline----+<---+
|<---+ |<---offline----+<---+
Stop |<---offline----+<---+
+----+ |
| | |
| if container |
| is in [eject] |
| state |
| | |
|<---+ |
|----+ |
| | |
| check all |
| offlined |
| | |
+<---+ |
+----+ |
| | |
| ejection?? |
| | | |
|<----_EJ0 ----+<---+ |
| | |
Based on the above figure, if userspace doesn't do anything, or only
performs part of the offline jobs, then the container's [eject]
state will always remain _SET_, and the kernel will always check
the latest child offline state whenever a child is offlined
by userspace.
On the other hand, we would apply the same
_eject_ flag approach to a retrying BIOS. If the OS takes
too long on the offline/ejection jobs, the retrying BIOS eventually
times out. There is no way for the OS to be aware of the timeout,
so the _eject_ flag stays set forever.
Thanks a lot!
Joey Lee
Hi Yasuaki,
On Fri, Jul 14, 2017 at 10:44:14PM +0800, joeyli wrote:
> On Fri, Jul 14, 2017 at 10:37:13AM +0200, Michal Hocko wrote:
> > On Thu 13-07-17 20:45:21, Joey Lee wrote:
> > > On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
> > > > On Thu 13-07-17 14:58:06, Joey Lee wrote:
> > [...]
> > > > > If BIOS emits ejection event for a ACPI0004 container, someone needs
> > > > > to handle the offline/eject jobs of container. Either kernel or user
> > > > > space.
> > > > >
> > > > > Only sending uevent to individual child device can simplify udev rule,
> > > > > but it also means that the kernel needs to offline/eject container
> > > > > after all children devices are offlined.
> > > >
> > > > Why cannot kernel send this eject command to the BIOS if the whole
> > > > container is offline? If it is not then the kernel would send EBUSY to
> > >
> > > Current kernel container hot-remove process:
> > >
> > > BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
> > >
> > > Then, kernel just calls _OST to expose state to BIOS, then process is
> > > stopped. Kernel doesn't wait there for userland to offline each child
> > > devices. Either BIOS or userland needs to trigger the container
> > > ejection.
> > >
> > > > container is offline? If it is not then the kernel would send EBUSY to
> > > > the BIOS and BIOS would have to retry after some timeout. Or is it a
> > >
> > > The d429e5c122 patch is merged to mainline. So kernel will send
> > > DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
> > > to apply the retry approach until OS returns process failure exactly or
> > > BIOS timeout.
> > >
> > > > problem that currently implemented BIOS firmwares do not implement this
> > > > retry?
> > >
> > > Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
> > > retry/resend the ejection event. So kernel or userland need to take the
> > > retry job. Obviously userland runs the retry since the caa73ea15 patch
> > > is merged.
> > >
> > > IMHO there have two different expectation from user space application.
> > >
> > > Applications like DVD player or Burner expect that kernel should
> > > info userspace for the ejection, then application can do their cleaning
> > > job and re-trigger ejection from userland.
> >
> > I am not sure I understand the DVD example because I do not see how it
> > fits into the container and online/offline scenario.
> >
>
> At least Yasuaki raised similar behavior for container in 2013.
> It's similar to the DVD player case, user space application needs
> to do something then trigger children offline and ejection of
> container.
>
> Base on Yasuaki's explanation, the reason of that he requested the
> userland ejection approach is that he got memory hot-remove problem
> in 2013. Maybe his problem is already fixed by your patches in current
> mainline.
>
> Hi Yasuaki, could you please check that your memory hot-remove problem
> is fixed on mainline kernel?
>
> If Yasuaki's issue is already fixed, then we should consider to let
> kernel does the container hot-remove transparently.
Could you please help check whether your memory hot-remove problem from 2013
is fixed in the mainline kernel?
Thanks a lot!
Joey Lee
On Wed 19-07-17 17:09:10, Joey Lee wrote:
> On Mon, Jul 17, 2017 at 11:05:25AM +0200, Michal Hocko wrote:
[...]
> > The problem I have with this expectation is that userspace will never
> > have a good atomic view of the whole container. So it can only try to
>
> I agreed!
>
> Even a userspace application can handle part of offline jobs. It's
> still possible that other kernel/userland compenents are using the
> resource in container.
>
> > eject and then hope that nobody has onlined part of the container.
> > If you emit offline event to the userspace the cleanup can be done and
> > after the last component goes offline then the eject can be done
> > atomically.
>
> The thing that we didn't align is how does kernel maintains the flag
> of ejection state on container.
Why can it not be an attribute of the container? The flag would be set
when the eject operation is requested and cleared when either the
operation is successful (all parts offline and the eject operation acked
by the BIOS) or it is terminated.
[...]
> Base on the above figure, if userspace didn't do anything or it
> just performs part of offline jobs. Then the container's [eject]
> state will be always _SET_ there, and kernel will always check
> the the latest child offline state when any child be offlined
> by userspace.
What is the problem with that? The eject is simply in progress until all
of it is done. Or maybe I just misunderstood.
>
> On the other hand, for retry BIOS, we will apply the same
> _eject_ flag approach on retry BIOS. If the OS performs
> offline/ejection jobs too long then the retry BIOS is finally
> time out. There doesn't have way for OS to aware the timeout.
Doesn't the BIOS notify the OS that the eject has timed out?
> So the _eject_ flag is set forever.
>
> Thanks a lot!
> Joey Lee
--
Michal Hocko
SUSE Labs
On Mon, Jul 24, 2017 at 10:57:02AM +0200, Michal Hocko wrote:
> On Wed 19-07-17 17:09:10, Joey Lee wrote:
> > On Mon, Jul 17, 2017 at 11:05:25AM +0200, Michal Hocko wrote:
> [...]
> > > The problem I have with this expectation is that userspace will never
> > > have a good atomic view of the whole container. So it can only try to
> >
> > I agreed!
> >
> > Even a userspace application can handle part of offline jobs. It's
> > still possible that other kernel/userland compenents are using the
> > resource in container.
> >
> > > eject and then hope that nobody has onlined part of the container.
> > > If you emit offline event to the userspace the cleanup can be done and
> > > after the last component goes offline then the eject can be done
> > > atomically.
> >
> > The thing that we didn't align is how does kernel maintains the flag
> > of ejection state on container.
>
> Why it cannot be an attribute of the container? The flag would be set
> when the eject operation is requested and cleared when either the
> operation is successful (all parts offline and eject operation acked
> by the BIOS) or it is terminated.
>
For the success case, yes, we can clear the flag when the container's _EJ0
succeeds. But for the failure case, we don't know when the operation is
terminated.
> [...]
> > Base on the above figure, if userspace didn't do anything or it
> > just performs part of offline jobs. Then the container's [eject]
> > state will be always _SET_ there, and kernel will always check
> > the the latest child offline state when any child be offlined
> > by userspace.
>
> What is a problem about that? The eject is simply in progress until all
> is set. Or maybe I just misunderstood.
>
I agree, but that only covers the success case. For the failure case, the
kernel cannot wait forever. Can it?
> >
> > On the other hand, for retry BIOS, we will apply the same
> > _eject_ flag approach on retry BIOS. If the OS performs
> > offline/ejection jobs too long then the retry BIOS is finally
> > time out. There doesn't have way for OS to aware the timeout.
>
> Doesn't BIOS notify the OS that the eject has timed out?
>
No, there is no interface to notify the OS of a BIOS timeout.
Thanks a lot!
Joey Lee
On Mon 24-07-17 17:29:21, Joey Lee wrote:
> On Mon, Jul 24, 2017 at 10:57:02AM +0200, Michal Hocko wrote:
> > On Wed 19-07-17 17:09:10, Joey Lee wrote:
> > > On Mon, Jul 17, 2017 at 11:05:25AM +0200, Michal Hocko wrote:
> > [...]
> > > > The problem I have with this expectation is that userspace will never
> > > > have a good atomic view of the whole container. So it can only try to
> > >
> > > I agreed!
> > >
> > > Even a userspace application can handle part of offline jobs. It's
> > > still possible that other kernel/userland compenents are using the
> > > resource in container.
> > >
> > > > eject and then hope that nobody has onlined part of the container.
> > > > If you emit offline event to the userspace the cleanup can be done and
> > > > after the last component goes offline then the eject can be done
> > > > atomically.
> > >
> > > The thing that we didn't align is how does kernel maintains the flag
> > > of ejection state on container.
> >
> > Why it cannot be an attribute of the container? The flag would be set
> > when the eject operation is requested and cleared when either the
> > operation is successful (all parts offline and eject operation acked
> > by the BIOS) or it is terminated.
> >
>
> For the success case, yes, we can clear the flag when the _EJ0 of container
> is success. But for the fail case, we don't know when the operation is
> terminated.
Hmm, this is rather strange. What is the BIOS state in the meantime?
Let's say it doesn't retry. Does it wait for the OS forever?
> > [...]
> > > Base on the above figure, if userspace didn't do anything or it
> > > just performs part of offline jobs. Then the container's [eject]
> > > state will be always _SET_ there, and kernel will always check
> > > the the latest child offline state when any child be offlined
> > > by userspace.
> >
> > What is a problem about that? The eject is simply in progress until all
> > is set. Or maybe I just misunderstood.
> >
>
> I agree, but it's only for success case. For fail case, kernel can not
> wait forever. Can we?
Well, this won't consume any additional resources so I wouldn't be all
that worried. Maybe we can reset the flag as soon as somebody tries to
online some part of the container?
--
Michal Hocko
SUSE Labs
Hi Michal,
Sorry for my delayed reply...
On Tue, Jul 25, 2017 at 02:48:37PM +0200, Michal Hocko wrote:
> On Mon 24-07-17 17:29:21, Joey Lee wrote:
> > On Mon, Jul 24, 2017 at 10:57:02AM +0200, Michal Hocko wrote:
> > > On Wed 19-07-17 17:09:10, Joey Lee wrote:
> > > > On Mon, Jul 17, 2017 at 11:05:25AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > The problem I have with this expectation is that userspace will never
> > > > > have a good atomic view of the whole container. So it can only try to
> > > >
> > > > I agreed!
> > > >
> > > > Even a userspace application can handle part of offline jobs. It's
> > > > still possible that other kernel/userland compenents are using the
> > > > resource in container.
> > > >
> > > > > eject and then hope that nobody has onlined part of the container.
> > > > > If you emit offline event to the userspace the cleanup can be done and
> > > > > after the last component goes offline then the eject can be done
> > > > > atomically.
> > > >
> > > > The thing that we didn't align is how does kernel maintains the flag
> > > > of ejection state on container.
> > >
> > > Why it cannot be an attribute of the container? The flag would be set
> > > when the eject operation is requested and cleared when either the
> > > operation is successful (all parts offline and eject operation acked
> > > by the BIOS) or it is terminated.
> > >
> >
> > For the success case, yes, we can clear the flag when the _EJ0 of container
> > is success. But for the fail case, we don't know when the operation is
> > terminated.
>
> Hmm, this is rather strange. What is the BIOS state in the meantime?
> Let's say it doesn't retry. Does it wait for the OS for ever?
>
Unfortunately the ACPI spec doesn't describe the details of BIOS behavior for
container hot-removal.
IMHO, if the BIOS doesn't retry, it should at least maintain a timer
to handle an OS-layer timeout and then reset the hardware (turn off the
progress light or something else...).
An old BIOS just treats the ejection event as a button event. The BIOS
emits the 0x103 ejection event to the OS after the user presses a button or
uses the UI. Then the BIOS hopes that the OS (either kernel or userland)
finishes all the jobs, calls _EJ0 to turn off power, and calls _OST to return
the state to the BIOS.
If the ejection event from the BIOS doesn't trigger anything in the upper OS
layer, an old BIOS cannot handle the situation unless it has a timer.
> > > [...]
> > > > Base on the above figure, if userspace didn't do anything or it
> > > > just performs part of offline jobs. Then the container's [eject]
> > > > state will be always _SET_ there, and kernel will always check
> > > > the the latest child offline state when any child be offlined
> > > > by userspace.
> > >
> > > What is a problem about that? The eject is simply in progress until all
> > > is set. Or maybe I just misunderstood.
> > >
> >
> > I agree, but it's only for success case. For fail case, kernel can not
> > wait forever. Can we?
>
> Well, this won't consume any additional resources so I wouldn't be all
> that worried. Maybe we can reset the flag as soon as somebody tries to
> online some part of the container?
>
So, the behavior is:
Kernel receives the ejection event and sets the _Eject_ flag on the container object
-> Kernel sends offline events to all child devices
-> User space performs cleanup jobs and offlines each child device
-> Kernel detects that all children are offlined
-> Kernel removes the objects and powers off the container (_EJ0)
If anyone onlines one of the child devices while we are waiting for
userland to offline all children, then the _Eject_ flag will be cleared
and the ejection process will be interrupted. In this situation, the
administrator needs to trigger the ejection event again. Do you think that
race hurts anything?
Thanks a lot!
Joey Lee
Hi Joey,
On 07/23/2017 05:18 AM, joeyli wrote:
> Hi Yasuaki,
>
> On Fri, Jul 14, 2017 at 10:44:14PM +0800, joeyli wrote:
>> On Fri, Jul 14, 2017 at 10:37:13AM +0200, Michal Hocko wrote:
>>> On Thu 13-07-17 20:45:21, Joey Lee wrote:
>>>> On Thu, Jul 13, 2017 at 09:06:19AM +0200, Michal Hocko wrote:
>>>>> On Thu 13-07-17 14:58:06, Joey Lee wrote:
>>> [...]
>>>>>> If BIOS emits ejection event for a ACPI0004 container, someone needs
>>>>>> to handle the offline/eject jobs of container. Either kernel or user
>>>>>> space.
>>>>>>
>>>>>> Only sending uevent to individual child device can simplify udev rule,
>>>>>> but it also means that the kernel needs to offline/eject container
>>>>>> after all children devices are offlined.
>>>>>
>>>>> Why cannot kernel send this eject command to the BIOS if the whole
>>>>> container is offline? If it is not then the kernel would send EBUSY to
>>>>
>>>> Current kernel container hot-remove process:
>>>>
>>>> BIOS -> SCI event -> Kernel ACPI -> uevent -> userland
>>>>
>>>> Then, kernel just calls _OST to expose state to BIOS, then process is
>>>> stopped. Kernel doesn't wait there for userland to offline each child
>>>> devices. Either BIOS or userland needs to trigger the container
>>>> ejection.
>>>>
>>>>> container is offline? If it is not then the kernel would send EBUSY to
>>>>> the BIOS and BIOS would have to retry after some timeout. Or is it a
>>>>
>>>> The d429e5c122 patch is merged to mainline. So kernel will send
>>>> DEVICE_BUSY to BIOS after it emits uevent to userland. BIOS can choice
>>>> to apply the retry approach until OS returns process failure exactly or
>>>> BIOS timeout.
>>>>
>>>>> problem that currently implemented BIOS firmwares do not implement this
>>>>> retry?
>>>>
>>>> Yes, we should consider the behavior of old BIOS. Old BIOS doesn't
>>>> retry/resend the ejection event. So kernel or userland need to take the
>>>> retry job. Obviously userland runs the retry since the caa73ea15 patch
>>>> is merged.
>>>>
>>>> IMHO there have two different expectation from user space application.
>>>>
>>>> Applications like DVD player or Burner expect that kernel should
>>>> info userspace for the ejection, then application can do their cleaning
>>>> job and re-trigger ejection from userland.
>>>
>>> I am not sure I understand the DVD example because I do not see how it
>>> fits into the container and online/offline scenario.
>>>
>>
>> At least Yasuaki raised similar behavior for container in 2013.
>> It's similar to the DVD player case, user space application needs
>> to do something then trigger children offline and ejection of
>> container.
>>
>> Base on Yasuaki's explanation, the reason of that he requested the
>> userland ejection approach is that he got memory hot-remove problem
>> in 2013. Maybe his problem is already fixed by your patches in current
>> mainline.
>>
>> Hi Yasuaki, could you please check that your memory hot-remove problem
>> is fixed on mainline kernel?
I cannot remember what I mentioned in 2013. Could you tell me the URL of the lkml archive?
Thanks,
Yasuaki Ishimatsu
>>
>> If Yasuaki's issue is already fixed, then we should consider to let
>> kernel does the container hot-remove transparently.
>
> Could you please help to check that your memory hot-remove problem in 2013
> is fixed on mainline kernel?
>
> Thanks a lot!
> Joey Lee
>
Hi YASUAKI,
On Tue, Aug 01, 2017 at 03:21:38PM -0400, YASUAKI ISHIMATSU wrote:
> Hi Joey,
>
> On 07/23/2017 05:18 AM, joeyli wrote:
[...snip]
> >>>
> >>
> >> At least Yasuaki raised similar behavior for container in 2013.
> >> It's similar to the DVD player case, user space application needs
> >> to do something then trigger children offline and ejection of
> >> container.
> >>
> >> Base on Yasuaki's explanation, the reason of that he requested the
> >> userland ejection approach is that he got memory hot-remove problem
> >> in 2013. Maybe his problem is already fixed by your patches in current
> >> mainline.
> >>
> >> Hi Yasuaki, could you please check that your memory hot-remove problem
> >> is fixed on mainline kernel?
>
> I cannot remember what I mentioned in 2013. Could you tell me url of lkml archive.
>
Here: https://lkml.org/lkml/2013/11/28/520
In that mail you mentioned two problems, but there are no details about
those issues and no root causes. So we still don't know why the kernel
has to rely on user space to complete the hot-remove process for a container.
Thanks a lot!
Joey Lee
On Mon 31-07-17 15:38:45, Joey Lee wrote:
> Hi Michal,
>
> Sorry for my delay...
>
> On Tue, Jul 25, 2017 at 02:48:37PM +0200, Michal Hocko wrote:
> > On Mon 24-07-17 17:29:21, Joey Lee wrote:
[...]
> > > For the success case, yes, we can clear the flag when the _EJ0 of container
> > > is success. But for the fail case, we don't know when the operation is
> > > terminated.
> >
> > Hmm, this is rather strange. What is the BIOS state in the meantime?
> > Let's say it doesn't retry. Does it wait for the OS for ever?
> >
>
> Unfortunately ACPI spec doesn't mention the detail of BIOS behavior for
> container hot-removing.
>
> IMHO, if the BIOS doesn't retry, at least it should maintains a timer
> to handle the OS layer time out then BIOS resets hardware(turns off
> progress light or something else...).
>
> The old BIOS just treats the ejection event as a button event. BIOS
> emits 0x103 ejection event to OS after user presses a button or UI.
> Then BIOS hopes that OS(either kernel or userland) finishs all jobs,
> calls _EJ0 to turn off power, and calls _OST to return state to BIOS.
>
> If the ejection event from BIOS doesn't trigger anything in upper OS
> layer, old BIOS can not against this situation unless it has a timer.
Right, but I would consider that a BIOS problem. It is simply not
feasible to expect that the OS will react in an instant, especially when
we are talking about resources like memory, which take time proportional
to their size to tear down properly.
> > > > [...]
> > > > > Base on the above figure, if userspace didn't do anything or it
> > > > > just performs part of offline jobs. Then the container's [eject]
> > > > > state will be always _SET_ there, and kernel will always check
> > > > > the the latest child offline state when any child be offlined
> > > > > by userspace.
> > > >
> > > > What is a problem about that? The eject is simply in progress until all
> > > > is set. Or maybe I just misunderstood.
> > > >
> > >
> > > I agree, but it's only for success case. For fail case, kernel can not
> > > wait forever. Can we?
> >
> > Well, this won't consume any additional resources so I wouldn't be all
> > that worried. Maybe we can reset the flag as soon as somebody tries to
> > online some part of the container?
> >
>
> So, the behavior is:
>
> Kernel received ejection event, set _Eject_ flag on container object
> -> Kernel sends offline events to all children devices
> -> User space performs cleaning jobs and offlines each child device
> -> Kernel detects all children offlined
> -> Kernel removes objects and calls power off(_EJ0)
Yes this is what I've had in mind. It is the "kernel detects..." part
which is not implemented now and that requires us to do the explicit
eject from userspace, correct?
> If anyone onlined one of the children devices in the term of waiting
> userland offlines all children, then the _Eject_ flag will be clean
> and ejection process will be interrupted. In this situation, administrator
> needs to trigger ejection event again.
yes
> Do you think that the race hurts anything?
What kind of race?
--
Michal Hocko
SUSE Labs
On Wed, Aug 02, 2017 at 11:01:43AM +0200, Michal Hocko wrote:
> On Mon 31-07-17 15:38:45, Joey Lee wrote:
> > Hi Michal,
> >
> > Sorry for my delay...
> >
> > On Tue, Jul 25, 2017 at 02:48:37PM +0200, Michal Hocko wrote:
> > > On Mon 24-07-17 17:29:21, Joey Lee wrote:
> [...]
> > > > For the success case, yes, we can clear the flag when the _EJ0 of container
> > > > is success. But for the fail case, we don't know when the operation is
> > > > terminated.
> > >
> > > Hmm, this is rather strange. What is the BIOS state in the meantime?
> > > Let's say it doesn't retry. Does it wait for the OS for ever?
> > >
> >
> > Unfortunately ACPI spec doesn't mention the detail of BIOS behavior for
> > container hot-removing.
> >
> > IMHO, if the BIOS doesn't retry, at least it should maintains a timer
> > to handle the OS layer time out then BIOS resets hardware(turns off
> > progress light or something else...).
> >
> > The old BIOS just treats the ejection event as a button event. BIOS
> > emits 0x103 ejection event to OS after user presses a button or UI.
> > Then BIOS hopes that OS(either kernel or userland) finishs all jobs,
> > calls _EJ0 to turn off power, and calls _OST to return state to BIOS.
> >
> > If the ejection event from BIOS doesn't trigger anything in upper OS
> > layer, old BIOS can not against this situation unless it has a timer.
>
> Right, but I would consider that a BIOS problem. It is simply not
> feasible to expect that the OS will react in an instant, especially when
> we are talking about resources like memory, which take time proportional
> to their size to tear down properly.
>
I agree with you that the old BIOS implementation is not enough to
handle this situation from the OS layer. But those old BIOSes have
already shipped, so we still need to consider how to work with them.
> > > > > [...]
> > > > > > Base on the above figure, if userspace didn't do anything or it
> > > > > > just performs part of offline jobs. Then the container's [eject]
> > > > > > state will be always _SET_ there, and kernel will always check
> > > > > > the the latest child offline state when any child be offlined
> > > > > > by userspace.
> > > > >
> > > > > What is a problem about that? The eject is simply in progress until all
> > > > > is set. Or maybe I just misunderstood.
> > > > >
> > > >
> > > > I agree, but it's only for success case. For fail case, kernel can not
> > > > wait forever. Can we?
> > >
> > > Well, this won't consume any additional resources so I wouldn't be all
> > > that worried. Maybe we can reset the flag as soon as somebody tries to
> > > online some part of the container?
> > >
> >
> > So, the behavior is:
> >
> > Kernel received ejection event, set _Eject_ flag on container object
> > -> Kernel sends offline events to all children devices
> > -> User space performs cleaning jobs and offlines each child device
> > -> Kernel detects all children offlined
> > -> Kernel removes objects and calls power off(_EJ0)
>
> Yes this is what I've had in mind. It is the "kernel detects..." part
> which is not implemented now and that requires us to do the explicit
> eject from userspace, correct?
>
Yes, the _Eject_ flag and _detects_ part are not implemented now.
In this approach, kernel still relies on user space to trigger the
offline. The ejection process is still not transparent to user space.
Is it what you want?
> > If anyone onlined one of the children devices in the term of waiting
> > userland offlines all children, then the _Eject_ flag will be clean
> > and ejection process will be interrupted. In this situation, administrator
> > needs to trigger ejection event again.
>
> yes
>
> > Do you think that the race hurts anything?
>
> What kind of race?
User space sets a child online before all children are offlined; then
the _Eject_ flag is cleared and the ejection process is interrupted.
Thanks
Joey Lee
On Thu 03-08-17 17:22:37, Joey Lee wrote:
> On Wed, Aug 02, 2017 at 11:01:43AM +0200, Michal Hocko wrote:
> > On Mon 31-07-17 15:38:45, Joey Lee wrote:
[...]
> > > So, the behavior is:
> > >
> > > Kernel received ejection event, set _Eject_ flag on container object
> > > -> Kernel sends offline events to all children devices
> > > -> User space performs cleaning jobs and offlines each child device
> > > -> Kernel detects all children offlined
> > > -> Kernel removes objects and calls power off(_EJ0)
> >
> > Yes this is what I've had in mind. It is the "kernel detects..." part
> > which is not implemented now and that requires us to do the explicit
> > eject from userspace, correct?
> >
>
> Yes, the _Eject_ flag and _detects_ part are not implemented now.
>
> In this approach, kernel still relies on user space to trigger the
> offline. The ejection process is still not transparent to user space.
> Is it what you want?
But as long as there is no auto-offlining then there is no other choice
no? Besides that userspace even shouldn't care about the fact that the
eject is in progress. That is a BIOS->OS deal AFAIU. All the userspace
cares about is the proper cleanup of the resources and that happens at
the offline time.
> > > If anyone onlined one of the children devices in the term of waiting
> > > userland offlines all children, then the _Eject_ flag will be clean
> > > and ejection process will be interrupted. In this situation, administrator
> > > needs to trigger ejection event again.
> >
> > yes
> >
> > > Do you think that the race hurts anything?
> >
> > What kind of race?
>
> User space sets a child online before all children are offlined; then
> the _Eject_ flag is cleared and the ejection process is interrupted.
Is this really a race, though? The kernel will always have the full
picture, and if userspace wants to online some part then the eject cannot
succeed. This is something that a userspace-driven eject cannot possibly
handle.
--
Michal Hocko
SUSE Labs
On Thu, Aug 03, 2017 at 11:31:53AM +0200, Michal Hocko wrote:
> On Thu 03-08-17 17:22:37, Joey Lee wrote:
> > On Wed, Aug 02, 2017 at 11:01:43AM +0200, Michal Hocko wrote:
> > > On Mon 31-07-17 15:38:45, Joey Lee wrote:
> [...]
> > > > So, the behavior is:
> > > >
> > > > Kernel received ejection event, set _Eject_ flag on container object
> > > > -> Kernel sends offline events to all children devices
> > > > -> User space performs cleaning jobs and offlines each child device
> > > > -> Kernel detects all children offlined
> > > > -> Kernel removes objects and calls power off(_EJ0)
> > >
> > > Yes this is what I've had in mind. It is the "kernel detects..." part
> > > which is not implemented now and that requires us to do the explicit
> > > eject from userspace, correct?
> > >
> >
> > Yes, the _Eject_ flag and _detects_ part are not implemented now.
> >
> > In this approach, kernel still relies on user space to trigger the
> > offline. The ejection process is still not transparent to user space.
> > Is it what you want?
>
> But as long as there is no auto-offlining then there is no other choice
> no? Besides that userspace even shouldn't care about the fact that the
If Yasuaki's problem is already fixed in mainline, then the auto-offlining
will be possible.
> eject is in progress. That is a BIOS->OS deal AFAIU. All the userspace
> cares about is the proper cleanup of the resources and that happens at
> the offline time.
>
I agree! User space doesn't need to know the details of the kobject
cleanup and ejection stages.
> > > > If anyone onlined one of the children devices in the term of waiting
> > > > userland offlines all children, then the _Eject_ flag will be clean
> > > > and ejection process will be interrupted. In this situation, administrator
> > > > needs to trigger ejection event again.
> > >
> > > yes
> > >
> > > > Do you think that the race hurts anything?
> > >
> > > What kind of race?
> >
> > User space sets a child online before all children are offlined; then
> > the _Eject_ flag is cleared and the ejection process is interrupted.
>
> Is this really a race, though? The kernel will always have the full
> picture, and if userspace wants to online some part then the eject cannot
> succeed. This is something that a userspace-driven eject cannot possibly
> handle.
Then I agree.
I am waiting for Yasuaki's response, and I would like to know Rafael's
and Yasuaki's opinions on the _Eject_ flag approach.
Thanks a lot!
Joey Lee
On Thu 03-08-17 17:52:57, Joey Lee wrote:
> On Thu, Aug 03, 2017 at 11:31:53AM +0200, Michal Hocko wrote:
> > On Thu 03-08-17 17:22:37, Joey Lee wrote:
> > > On Wed, Aug 02, 2017 at 11:01:43AM +0200, Michal Hocko wrote:
> > > > On Mon 31-07-17 15:38:45, Joey Lee wrote:
> > [...]
> > > > > So, the behavior is:
> > > > >
> > > > > Kernel received ejection event, set _Eject_ flag on container object
> > > > > -> Kernel sends offline events to all children devices
> > > > > -> User space performs cleaning jobs and offlines each child device
> > > > > -> Kernel detects all children offlined
> > > > > -> Kernel removes objects and calls power off(_EJ0)
> > > >
> > > > Yes this is what I've had in mind. It is the "kernel detects..." part
> > > > which is not implemented now and that requires us to do the explicit
> > > > eject from userspace, correct?
> > > >
> > >
> > > Yes, the _Eject_ flag and _detects_ part are not implemented now.
> > >
> > > In this approach, kernel still relies on user space to trigger the
> > > offline. The ejection process is still not transparent to user space.
> > > Is it what you want?
> >
> > But as long as there is no auto-offlining then there is no other choice
> > no? Besides that userspace even shouldn't care about the fact that the
>
> If Yasuaki's problem is already fixed in mainline, then the auto-offlining
> will be possible.
The kernel alone cannot do the memory offline in general; there might be
resources which need an explicit userspace action. But that is not the
important part. The eject process should be pretty much independent of
who is doing the offline. The only thing that matters is that the kernel
ejects _after_ all resources are offline. This is already the case, so
the only question we still need to settle is how the offline is done on
a container which has multiple resources. I still maintain my opinion
that all associated resources should be notified for offline by the
kernel, rather than relying on userspace to somehow find those resources
and offline them manually.
--
Michal Hocko
SUSE Labs
On 08/02/2017 01:49 AM, joeyli wrote:
> Hi YASUAKI,
>
> On Tue, Aug 01, 2017 at 03:21:38PM -0400, YASUAKI ISHIMATSU wrote:
>> Hi Joey,
>>
>> On 07/23/2017 05:18 AM, joeyli wrote:
> [...snip]
>>>>>
>>>>
>>>> At least Yasuaki raised similar behavior for container in 2013.
>>>> It's similar to the DVD player case, user space application needs
>>>> to do something then trigger children offline and ejection of
>>>> container.
>>>>
>>>> Base on Yasuaki's explanation, the reason of that he requested the
>>>> userland ejection approach is that he got memory hot-remove problem
>>>> in 2013. Maybe his problem is already fixed by your patches in current
>>>> mainline.
>>>>
>>>> Hi Yasuaki, could you please check that your memory hot-remove problem
>>>> is fixed on mainline kernel?
>>
>> I cannot remember what I mentioned in 2013. Could you tell me url of lkml archive.
>>
>
> Here: https://lkml.org/lkml/2013/11/28/520
Thank you for the URL. In that mail I described the following problems:
> 1. easily fail
> My container device has CPU device and Memory device, and maximum size of
> memory is 3Tbyte. In my environment, hot removing container device fails
> on offlining memory if memory is used by application.
> I think if offlininig memory, we must retly to offline memory several
> times.
I think the issue still remains: if a process keeps accessing the memory,
offlining it easily fails with EBUSY.
Thanks,
Yasuaki Ishimatsu
>
> In that mail you mentioned two problems, but there are no details about
> those issues and no root causes. So we still don't know why the kernel
> has to rely on user space to complete the hot-remove process for a container.
>
> Thanks a lot!
> Joey Lee
>
On Thu 03-08-17 11:37:37, YASUAKI ISHIMATSU wrote:
>
>
> On 08/02/2017 01:49 AM, joeyli wrote:
> > Hi YASUAKI,
> >
> > On Tue, Aug 01, 2017 at 03:21:38PM -0400, YASUAKI ISHIMATSU wrote:
> >> Hi Joey,
> >>
> >> On 07/23/2017 05:18 AM, joeyli wrote:
> > [...snip]
> >>>>>
> >>>>
> >>>> At least Yasuaki raised similar behavior for container in 2013.
> >>>> It's similar to the DVD player case, user space application needs
> >>>> to do something then trigger children offline and ejection of
> >>>> container.
> >>>>
> >>>> Base on Yasuaki's explanation, the reason of that he requested the
> >>>> userland ejection approach is that he got memory hot-remove problem
> >>>> in 2013. Maybe his problem is already fixed by your patches in current
> >>>> mainline.
> >>>>
> >>>> Hi Yasuaki, could you please check that your memory hot-remove problem
> >>>> is fixed on mainline kernel?
> >>
> >> I cannot remember what I mentioned in 2013. Could you tell me url of lkml archive.
> >>
> >
>
> > Here: https://lkml.org/lkml/2013/11/28/520
>
> Thank you for the URL. In that mail I described the following problems:
>
> > 1. easily fail
> > My container device has CPU device and Memory device, and maximum size of
> > memory is 3Tbyte. In my environment, hot removing container device fails
> > on offlining memory if memory is used by application.
> > I think if offlininig memory, we must retly to offline memory several
> > times.
>
> I think the issue still remains: if a process keeps accessing the memory,
> offlining it easily fails with EBUSY.
Yes, and the only way to deal with it is to retry the offline. In order
to do that, userspace has to be notified to try again.
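A minimal sketch of what that userspace retry could look like for one
memory block (only an illustration; the sysfs path, retry count and delay
are arbitrary, this is not an existing tool):

/*
 * Sketch: retry offlining one memory block from userspace when the
 * kernel returns EBUSY because the pages are still in use.
 *
 * Usage: ./offline-retry /sys/devices/system/memory/memory42/online
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int tries;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <sysfs online file>\n", argv[0]);
                return 1;
        }

        for (tries = 0; tries < 10; tries++) {
                int fd = open(argv[1], O_WRONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* writing "0" asks the kernel to offline the block */
                if (write(fd, "0", 1) == 1) {
                        close(fd);
                        printf("%s offlined\n", argv[1]);
                        return 0;
                }
                close(fd);
                if (errno != EBUSY) {
                        perror("write");
                        return 1;
                }
                /* still busy, give the pages a chance to be freed/migrated */
                sleep(1);
        }
        fprintf(stderr, "%s still busy, giving up\n", argv[1]);
        return 1;
}

The same loop applies whether it is driven from a udev rule or from a
daemon reacting to the container's change uevent.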
--
Michal Hocko
SUSE Labs
On Fri, Aug 04, 2017 at 05:06:19PM +0200, Michal Hocko wrote:
> On Thu 03-08-17 11:37:37, YASUAKI ISHIMATSU wrote:
> >
> >
> > On 08/02/2017 01:49 AM, joeyli wrote:
> > > Hi YASUAKI,
> > >
> > > On Tue, Aug 01, 2017 at 03:21:38PM -0400, YASUAKI ISHIMATSU wrote:
> > >> Hi Joey,
> > >>
> > >> On 07/23/2017 05:18 AM, joeyli wrote:
> > > [...snip]
> > >>>>>
> > >>>>
> > >>>> At least Yasuaki raised similar behavior for container in 2013.
> > >>>> It's similar to the DVD player case, user space application needs
> > >>>> to do something then trigger children offline and ejection of
> > >>>> container.
> > >>>>
> > >>>> Base on Yasuaki's explanation, the reason of that he requested the
> > >>>> userland ejection approach is that he got memory hot-remove problem
> > >>>> in 2013. Maybe his problem is already fixed by your patches in current
> > >>>> mainline.
> > >>>>
> > >>>> Hi Yasuaki, could you please check that your memory hot-remove problem
> > >>>> is fixed on mainline kernel?
> > >>
> > >> I cannot remember what I mentioned in 2013. Could you tell me url of lkml archive.
> > >>
> > >
> >
> > > Here: https://lkml.org/lkml/2013/11/28/520
> >
> > Thank you for the URL. In that mail I described the following problems:
> >
> > > 1. easily fail
> > > My container device has CPU device and Memory device, and maximum size of
> > > memory is 3Tbyte. In my environment, hot removing container device fails
> > > on offlining memory if memory is used by application.
> > > I think if offlininig memory, we must retly to offline memory several
> > > times.
> >
> > I think the issue still remains: if a process keeps accessing the memory,
> > offlining it easily fails with EBUSY.
>
> Yes, and the only way to deal with it is to retry the offline. In order
> to do that, userspace has to be notified to try again.
It looks like the kernel still needs to rely on userspace to trigger the
offline of each child device.
My plan is to implement an _EJECTING_ flag in the acpi_device struct.
It indicates that the ACPI device is in the middle of an ejection. The
flag will change in the following situations:
- EJECTING flag 0 -> 1
  - after acpi_scan_hot_remove() sends KOBJ_CHANGE to all children
- EJECTING flag 1 -> 0
  - after acpi_scan_hot_remove() calls _EJ0
  - when any child device is set online again
To keep this ordering, any work that changes the EJECTING flag must be
pushed to the kacpi_hotplug_wq work queue.
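To make the intended transitions easier to follow, here is a rough
standalone model in plain userspace C. It is only a sketch of the idea,
not kernel code: the structures and helper names are invented for the
example, and the serialization through kacpi_hotplug_wq is not modeled.

#include <stdbool.h>
#include <stdio.h>

#define NR_CHILDREN 3

struct child {
        bool online;
};

struct container {
        bool ejecting;                      /* the proposed EJECTING flag */
        struct child children[NR_CHILDREN];
};

/* BIOS ejection event: EJECTING 0 -> 1, children get KOBJ_CHANGE uevents. */
static void eject_requested(struct container *c)
{
        c->ejecting = true;
        printf("EJECTING 0 -> 1, ask userspace to offline %d children\n",
               NR_CHILDREN);
}

/* A child went offline: once all are offline and EJECTING is set, eject. */
static void child_offlined(struct container *c, int idx)
{
        int i;

        c->children[idx].online = false;
        for (i = 0; i < NR_CHILDREN; i++)
                if (c->children[i].online)
                        return;
        if (c->ejecting) {
                printf("all children offline, call _EJ0, EJECTING 1 -> 0\n");
                c->ejecting = false;
        }
}

/* A child was set online again: abort the pending ejection. */
static void child_onlined(struct container *c, int idx)
{
        c->children[idx].online = true;
        if (c->ejecting) {
                printf("child %d onlined, EJECTING 1 -> 0, ejection aborted\n",
                       idx);
                c->ejecting = false;
        }
}

int main(void)
{
        struct container c = {
                .children = { { true }, { true }, { true } },
        };

        eject_requested(&c);
        child_offlined(&c, 0);
        child_onlined(&c, 0);           /* interrupts the first ejection */
        eject_requested(&c);            /* administrator triggers it again */
        child_offlined(&c, 0);
        child_offlined(&c, 1);
        child_offlined(&c, 2);          /* last offline triggers the eject */
        return 0;
}

In the real patch the flag would live in struct acpi_device and these
checks would run from the ACPI hotplug work queue.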
Thanks a lot!
Joey Lee