From: Li Li <[email protected]>
As there isn't an atomic operation to freeze the main thread and the binder
interface together, the main thread can initiate a new binder transaction
while the binder interface is already frozen. This race results in a
failed binder transaction, which unexpectedly crashes the app.
This patch enables a post-freeze rollback mechanism by checking whether
there is any new pending binder transaction waiting for a response. It also
treats the response transaction like a oneway transaction so that the
response can successfully reach the frozen process.
Changes in v2:
1. Improve commit msg, add "Fixes" tag
2. Add missing "_ilocked" suffix to binder_txns_pending()
3. Document bit assignment of struct binder_frozen_status_info in binder.h
Li Li (1):
binder: fix freeze race
drivers/android/binder.c | 34 +++++++++++++++++++++++++----
drivers/android/binder_internal.h | 2 ++
include/uapi/linux/android/binder.h | 7 ++++++
3 files changed, 39 insertions(+), 4 deletions(-)
--
2.33.0.309.g3052b89438-goog
From: Li Li <[email protected]>
Currently cgroup freezer is used to freeze the application threads, and
BINDER_FREEZE is used to freeze the corresponding binder interface.
There's already a mechanism in ioctl(BINDER_FREEZE) to wait for any
existing transactions to drain out before actually freezing the binder
interface.
But freezing an app requires two steps: freezing the binder interface with
ioctl(BINDER_FREEZE) and then freezing the application main threads with
cgroupfs. This is not an atomic operation, so the following race might
happen.
1) Binder interface is frozen by ioctl(BINDER_FREEZE);
2) Main thread A initiates a new sync binder transaction to process B;
3) Main thread A is frozen by "echo 1 > cgroup.freeze";
4) The response from process B reaches the frozen thread, which will
unexpectedly fail.
This patch provides a mechanism to check if there's any new pending
transaction happening between ioctl(BINDER_FREEZE) and freezing the
main thread. If there's any, the main thread freezing operation can
be rolled back to finish the pending transaction.
Furthermore, the response might reach the binder driver before the
rollback actually happens. That would still cause a failed transaction.
Since the other process doesn't wait for a reply to its own response,
the response transaction failure can be fixed by treating the response
transaction like a oneway/async one, allowing it to reach the frozen
thread. It will be consumed when the thread is unfrozen later.
Fixes: 432ff1e91694 ("binder: BINDER_FREEZE ioctl")
Signed-off-by: Li Li <[email protected]>
---
drivers/android/binder.c | 34 +++++++++++++++++++++++++----
drivers/android/binder_internal.h | 2 ++
include/uapi/linux/android/binder.h | 7 ++++++
3 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index d9030cb6b1e4..eaffdf5f692c 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -3038,9 +3038,8 @@ static void binder_transaction(struct binder_proc *proc,
if (reply) {
binder_enqueue_thread_work(thread, tcomplete);
binder_inner_proc_lock(target_proc);
- if (target_thread->is_dead || target_proc->is_frozen) {
- return_error = target_thread->is_dead ?
- BR_DEAD_REPLY : BR_FROZEN_REPLY;
+ if (target_thread->is_dead) {
+ return_error = BR_DEAD_REPLY;
binder_inner_proc_unlock(target_proc);
goto err_dead_proc_or_thread;
}
@@ -4648,6 +4647,22 @@ static int binder_ioctl_get_node_debug_info(struct binder_proc *proc,
return 0;
}
+static int binder_txns_pending_ilocked(struct binder_proc *proc)
+{
+ struct rb_node *n;
+ struct binder_thread *thread;
+
+ if (proc->outstanding_txns > 0)
+ return 1;
+
+ for (n = rb_first(&proc->threads); n; n = rb_next(n)) {
+ thread = rb_entry(n, struct binder_thread, rb_node);
+ if (thread->transaction_stack)
+ return 1;
+ }
+ return 0;
+}
+
static int binder_ioctl_freeze(struct binder_freeze_info *info,
struct binder_proc *target_proc)
{
@@ -4682,6 +4697,14 @@ static int binder_ioctl_freeze(struct binder_freeze_info *info,
if (!ret && target_proc->outstanding_txns)
ret = -EAGAIN;
+ /* Also check pending transactions that wait for reply */
+ if (ret >= 0) {
+ binder_inner_proc_lock(target_proc);
+ if (binder_txns_pending_ilocked(target_proc))
+ ret = -EAGAIN;
+ binder_inner_proc_unlock(target_proc);
+ }
+
if (ret < 0) {
binder_inner_proc_lock(target_proc);
target_proc->is_frozen = false;
@@ -4696,6 +4719,7 @@ static int binder_ioctl_get_freezer_info(
{
struct binder_proc *target_proc;
bool found = false;
+ int txns_pending;
info->sync_recv = 0;
info->async_recv = 0;
@@ -4705,7 +4729,9 @@ static int binder_ioctl_get_freezer_info(
if (target_proc->pid == info->pid) {
found = true;
binder_inner_proc_lock(target_proc);
- info->sync_recv |= target_proc->sync_recv;
+ txns_pending = binder_txns_pending_ilocked(target_proc);
+ info->sync_recv |= target_proc->sync_recv |
+ (txns_pending << 1);
info->async_recv |= target_proc->async_recv;
binder_inner_proc_unlock(target_proc);
}
diff --git a/drivers/android/binder_internal.h b/drivers/android/binder_internal.h
index 810c0b84d3f8..402c4d4362a8 100644
--- a/drivers/android/binder_internal.h
+++ b/drivers/android/binder_internal.h
@@ -378,6 +378,8 @@ struct binder_ref {
* binder transactions
* (protected by @inner_lock)
* @sync_recv: process received sync transactions since last frozen
+ * bit 0: received sync transaction after being frozen
+ * bit 1: new pending sync transaction during freezing
* (protected by @inner_lock)
* @async_recv: process received async transactions since last frozen
* (protected by @inner_lock)
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 20e435fe657a..3246f2c74696 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -225,7 +225,14 @@ struct binder_freeze_info {
struct binder_frozen_status_info {
__u32 pid;
+
+ /* process received sync transactions since last frozen
+ * bit 0: received sync transaction after being frozen
+ * bit 1: new pending sync transaction during freezing
+ */
__u32 sync_recv;
+
+ /* process received async transactions since last frozen */
__u32 async_recv;
};
--
2.33.0.309.g3052b89438-goog
On Thu, Sep 09, 2021 at 08:53:16PM -0700, Li Li wrote:
> struct binder_frozen_status_info {
> __u32 pid;
> +
> + /* process received sync transactions since last frozen
> + * bit 0: received sync transaction after being frozen
> + * bit 1: new pending sync transaction during freezing
> + */
> __u32 sync_recv;
You just changed a user/kernel api here, did you just break existing
userspace applications? If not, how did that not happen?
thanks,
greg k-h
On Thu, Sep 9, 2021 at 10:38 PM Greg KH <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 08:53:16PM -0700, Li Li wrote:
> > struct binder_frozen_status_info {
> > __u32 pid;
> > +
> > + /* process received sync transactions since last frozen
> > + * bit 0: received sync transaction after being frozen
> > + * bit 1: new pending sync transaction during freezing
> > + */
> > __u32 sync_recv;
>
> You just changed a user/kernel api here, did you just break existing
> userspace applications? If not, how did that not happen?
>
That's a good question. This design does keep backward compatibility.
The existing userspace applications call ioctl(BINDER_GET_FROZEN_INFO)
to check if there's sync or async binder transactions sent to a frozen
process.
If an existing userspace app runs on a new kernel, a sync binder call
still sets bit 0 of sync_recv (as it's a bool variable) so the ioctl
will return the expected value (TRUE). The app simply doesn't check bit
1, so it cannot tell whether there was a race. This matches what happens
on an old kernel, which never sets bit 1 at all.
With bit 1 of sync_recv, a new userspace app can tell 1) whether a sync
binder transaction happened while the process was frozen, same as
before; and 2) whether a sync binder transaction was pending exactly
when the race occurred, which is new information for the rollback decision.
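To make the bit assignment concrete, here is a minimal sketch of how a new
userspace caller might decode sync_recv after ioctl(BINDER_GET_FROZEN_INFO).
The macro and function names are hypothetical (the patch documents the bits
in a comment but defines no macros for them), and the value is passed in
directly rather than read from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the two bits documented in binder.h by this patch. */
#define SYNC_RECV_AFTER_FROZEN  (1u << 0) /* sync txn received after being frozen */
#define SYNC_RECV_TXNS_PENDING  (1u << 1) /* new pending sync txn during freezing */

/* True if a sync transaction reached the process after it was frozen. */
static bool sync_txn_received_while_frozen(uint32_t sync_recv)
{
	return (sync_recv & SYNC_RECV_AFTER_FROZEN) != 0;
}

/* True if the freeze raced with a transaction still waiting for a reply. */
static bool freeze_raced_with_pending_txn(uint32_t sync_recv)
{
	return (sync_recv & SYNC_RECV_TXNS_PENDING) != 0;
}
```

A caller that only cares about the old behavior keeps looking at bit 0; a
caller that wants to decide on a rollback additionally looks at bit 1.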
> thanks,
>
> greg k-h
Thanks,
Li
On Thu, Sep 09, 2021 at 11:17:42PM -0700, Li Li wrote:
> On Thu, Sep 9, 2021 at 10:38 PM Greg KH <[email protected]> wrote:
> >
> > On Thu, Sep 09, 2021 at 08:53:16PM -0700, Li Li wrote:
> > > struct binder_frozen_status_info {
> > > __u32 pid;
> > > +
> > > + /* process received sync transactions since last frozen
> > > + * bit 0: received sync transaction after being frozen
> > > + * bit 1: new pending sync transaction during freezing
> > > + */
> > > __u32 sync_recv;
> >
> > You just changed a user/kernel api here, did you just break existing
> > userspace applications? If not, how did that not happen?
> >
>
> That's a good question. This design does keep backward compatibility.
>
> The existing userspace applications call ioctl(BINDER_GET_FROZEN_INFO)
> to check if there's sync or async binder transactions sent to a frozen
> process.
>
> If the existing userspace app runs on a new kernel, a sync binder call
> still sets bit 0 of sync_recv (as it's a bool variable) so the ioctl
> will return the expected value (TRUE). The app just doesn't check bit
> 1 intentionally so it doesn't have the ability to tell if there's a
> race - this behavior is aligned with what happens on an old kernel as
> the old kernel doesn't have bit 1 set at all.
>
> The bit 1 of sync_recv enables new userspace app the ability to tell
> 1) there's a sync binder transaction happened when being frozen - same
> as before; and 2) if that sync binder transaction happens exactly when
> there's a race - a new information for rollback decision.
Ah, can you add that to the changelog text to make it more obvious?
thanks,
greg k-h
On Fri, Sep 10, 2021 at 12:15 AM Greg KH <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 11:17:42PM -0700, Li Li wrote:
> > On Thu, Sep 9, 2021 at 10:38 PM Greg KH <[email protected]> wrote:
> > >
> > > On Thu, Sep 09, 2021 at 08:53:16PM -0700, Li Li wrote:
> > > > struct binder_frozen_status_info {
> > > > __u32 pid;
> > > > +
> > > > + /* process received sync transactions since last frozen
> > > > + * bit 0: received sync transaction after being frozen
> > > > + * bit 1: new pending sync transaction during freezing
> > > > + */
> > > > __u32 sync_recv;
> > >
> > > You just changed a user/kernel api here, did you just break existing
> > > userspace applications? If not, how did that not happen?
> > >
> >
> > That's a good question. This design does keep backward compatibility.
> >
> > The existing userspace applications call ioctl(BINDER_GET_FROZEN_INFO)
> > to check if there's sync or async binder transactions sent to a frozen
> > process.
> >
> > If the existing userspace app runs on a new kernel, a sync binder call
> > still sets bit 0 of sync_recv (as it's a bool variable) so the ioctl
> > will return the expected value (TRUE). The app just doesn't check bit
> > 1 intentionally so it doesn't have the ability to tell if there's a
> > race - this behavior is aligned with what happens on an old kernel as
> > the old kernel doesn't have bit 1 set at all.
> >
> > The bit 1 of sync_recv enables new userspace app the ability to tell
> > 1) there's a sync binder transaction happened when being frozen - same
> > as before; and 2) if that sync binder transaction happens exactly when
> > there's a race - a new information for rollback decision.
>
> Ah, can you add that to the changelog text to make it more obvious?
>
Sure, added that to V3, plus other minor improvements listed in the cover
letter. Please let me know if there's anything else I should continue
to improve.
https://lore.kernel.org/lkml/[email protected]/
BTW, I had a stress test running, repeatedly freezing and unfreezing a
couple of apps every second while they initiated new binder
transactions in a loop. The overnight stress test over the past
weekend showed positive results. Without this kernel patch, the reply
transaction fails within tens of iterations. With this kernel patch
and the corresponding user space fix (rescheduling the freeze op to
the next second in case the race happens), the stress test ran for
24 hours without a single failure.
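The user space side of the rollback can be sketched roughly as below. This
is only an illustration of the control flow, not the actual Android fix:
the freezer_ops hooks, try_freeze_app() and the fake_* helpers are all
hypothetical, and the order in which the rollback undoes the two freeze
steps is an assumption. Real code would back the hooks with
ioctl(BINDER_FREEZE), a write to cgroup.freeze and
ioctl(BINDER_GET_FROZEN_INFO).

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNC_RECV_TXNS_PENDING (1u << 1) /* bit 1, per the patch's comment */

/* Hypothetical hooks; each returns 0 on success. */
struct freezer_ops {
	int (*binder_freeze)(int pid, bool enable);
	int (*cgroup_freeze)(int pid, bool enable);
	int (*get_sync_recv)(int pid, uint32_t *sync_recv);
};

enum freeze_result { FREEZE_OK, FREEZE_RETRY_LATER };

/* Freeze binder, then the threads, then roll back if the race was hit. */
static enum freeze_result try_freeze_app(const struct freezer_ops *ops, int pid)
{
	uint32_t sync_recv = 0;

	/* This sketch treats any failure (e.g. -EAGAIN) as "retry later". */
	if (ops->binder_freeze(pid, true) != 0)
		return FREEZE_RETRY_LATER;

	if (ops->cgroup_freeze(pid, true) != 0) {
		ops->binder_freeze(pid, false);
		return FREEZE_RETRY_LATER;
	}

	/* A pending transaction slipped in between the two steps: undo both. */
	if (ops->get_sync_recv(pid, &sync_recv) != 0 ||
	    (sync_recv & SYNC_RECV_TXNS_PENDING)) {
		ops->cgroup_freeze(pid, false);
		ops->binder_freeze(pid, false);
		return FREEZE_RETRY_LATER;
	}
	return FREEZE_OK;
}

/* In-memory stand-ins so the flow can be exercised without a binder device. */
static uint32_t fake_sync_recv;
static bool fake_binder_frozen, fake_cgroup_frozen;

static int fake_binder_freeze(int pid, bool enable)
{
	(void)pid;
	fake_binder_frozen = enable;
	return 0;
}

static int fake_cgroup_freeze(int pid, bool enable)
{
	(void)pid;
	fake_cgroup_frozen = enable;
	return 0;
}

static int fake_get_sync_recv(int pid, uint32_t *sync_recv)
{
	(void)pid;
	*sync_recv = fake_sync_recv;
	return 0;
}

static const struct freezer_ops fake_ops = {
	.binder_freeze = fake_binder_freeze,
	.cgroup_freeze = fake_cgroup_freeze,
	.get_sync_recv = fake_get_sync_recv,
};
```

On FREEZE_RETRY_LATER the caller would reschedule the freeze, for example
one second later as in the stress test above.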
Thanks,
Li