2024-01-18 18:20:51

by Zach O'Keefe

Subject: [PATCH] mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again

(struct dirty_throttle_control *)->thresh is an unsigned long, but is
passed as the u32 divisor argument to div_u64(). On architectures where
unsigned long is 64 bits, the argument will be implicitly truncated.

Use div64_u64() instead of div_u64() so that the value used in the "is
this a safe division" check is the same as the divisor.

Also, remove redundant cast of the numerator to u64, as that should
happen implicitly.

This would be difficult to exploit in memcg domain, given the
ratio-based arithmetic domain_dirty_limits() uses, but is much easier in
global writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using
e.g. vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32).

Fixes: f6789593d5ce ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
Cc: Maxim Patlasov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Zach O'Keefe <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cd4e4ae77c40a..02147b61712bc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1638,7 +1638,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 */
 	dtc->wb_thresh = __wb_calc_thresh(dtc);
 	dtc->wb_bg_thresh = dtc->thresh ?
-		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
+		div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

 	/*
 	 * In order to avoid the stacked BDI deadlock we need
--
2.43.0.429.g432eaa2c6b-goog
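
The truncation described in the commit message can be modelled in userspace. This is a minimal sketch, not kernel code: truncated_divisor() and guard_is_bypassed() are hypothetical helpers that mimic passing a 64-bit unsigned long to the u32 divisor parameter of div_u64().

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the call site: div_u64() takes a u32 divisor,
 * so a 64-bit argument is implicitly narrowed to its low 32 bits.
 */
static uint32_t truncated_divisor(uint64_t thresh)
{
	return (uint32_t)thresh;
}

/*
 * Returns 1 when the "thresh != 0" guard passes on the full 64-bit value
 * while the divisor that would actually be used is 0.
 */
static int guard_is_bypassed(uint64_t thresh)
{
	return thresh != 0 && truncated_divisor(thresh) == 0;
}
```

With thresh == 1ULL << 32 the guard passes on the 64-bit value, yet the narrowed divisor is 0; any thresh whose low 32 bits are zero behaves the same way.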



2024-04-17 11:10:13

by Jan Kara

Subject: Re: [PATCH] mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again

On Thu 18-01-24 10:19:53, Zach O'Keefe wrote:
> (struct dirty_throttle_control *)->thresh is an unsigned long, but is
> passed as the u32 divisor argument to div_u64(). On architectures where
> unsigned long is 64 bits, the argument will be implicitly truncated.
>
> Use div64_u64() instead of div_u64() so that the value used in the "is
> this a safe division" check is the same as the divisor.
>
> Also, remove redundant cast of the numerator to u64, as that should
> happen implicitly.
>
> This would be difficult to exploit in memcg domain, given the
> ratio-based arithmetic domain_dirty_limits() uses, but is much easier in
> global writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using
> e.g. vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32)
>
> Fixes: f6789593d5ce ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
> Cc: Maxim Patlasov <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Zach O'Keefe <[email protected]>

I've come across this change today and it is broken in several ways:

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index cd4e4ae77c40a..02147b61712bc 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1638,7 +1638,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
> */
> dtc->wb_thresh = __wb_calc_thresh(dtc);
> dtc->wb_bg_thresh = dtc->thresh ?
> - div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> + div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

Firstly, the removed (u64) cast from the multiplication will introduce a
multiplication overflow on 32-bit archs if wb_thresh * bg_thresh >= 1<<32
(which is actually common - the default settings with 4GB of RAM will
trigger this). Secondly, the div64_u64() is unnecessarily expensive on
32-bit archs. We have div64_ul() in case we want to be safe & cheap.
Thirdly, if thresholds are larger than 1<<32 pages, then dirty balancing is
going to blow up in many other spectacular ways - consider only the
multiplication on this line - it will not necessarily fit into u64 anymore.
The whole dirty limiting code is interspersed with assumptions that limits
are actually within u32 and we do our calculations in unsigned longs to
avoid worrying about overflows (with occasional typing to u64 to make it
more interesting because people expected those entities to overflow 32 bits
even on 32-bit archs). Which is lame I agree but so far people don't seem
to be setting limits to 16TB or more. And I'm not really worried about
security here since this is global-root-only tunable and that has much
better ways to DoS the system.

So overall I'm all for cleaning up this code but in a sensible way please.
E.g. for these overflow issues at least do it one function at a time so
that we can sensibly review it.

Andrew, can you please revert this patch until we have a better fix? So far
it does more harm than good... Thanks!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
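
The first objection, that dropping the (u64) cast makes the multiplication wrap on 32-bit architectures, can be checked with a small userspace sketch. The type ulong32 and both helper names are assumptions made for illustration; ulong32 models a 32-bit unsigned long, and the values in the usage note are arbitrary page counts, not measured defaults.

```c
#include <stdint.h>

/* Hypothetical model of unsigned long on a 32-bit architecture. */
typedef uint32_t ulong32;

/* Multiplication is performed in 32 bits and wraps modulo 2^32. */
static uint64_t product_without_cast(ulong32 wb_thresh, ulong32 bg_thresh)
{
	ulong32 wrapped = wb_thresh * bg_thresh;

	return wrapped;
}

/* Widening one operand first makes the multiplication happen in 64 bits. */
static uint64_t product_with_cast(ulong32 wb_thresh, ulong32 bg_thresh)
{
	return (uint64_t)wb_thresh * bg_thresh;
}
```

For example, 100000 * 50000 is 5000000000, which exceeds 1 << 32, so the uncast product wraps to 705032704 while the cast version keeps the full value.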

2024-04-17 19:34:28

by Zach O'Keefe

Subject: Re: [PATCH] mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again

On Wed, Apr 17, 2024 at 4:10 AM Jan Kara <[email protected]> wrote:
>
> On Thu 18-01-24 10:19:53, Zach O'Keefe wrote:
> > (struct dirty_throttle_control *)->thresh is an unsigned long, but is
> > passed as the u32 divisor argument to div_u64(). On architectures where
> > unsigned long is 64 bits, the argument will be implicitly truncated.
> >
> > Use div64_u64() instead of div_u64() so that the value used in the "is
> > this a safe division" check is the same as the divisor.
> >
> > Also, remove redundant cast of the numerator to u64, as that should
> > happen implicitly.
> >
> > This would be difficult to exploit in memcg domain, given the
> > ratio-based arithmetic domain_dirty_limits() uses, but is much easier in
> > global writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using
> > e.g. vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32)
> >
> > Fixes: f6789593d5ce ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
> > Cc: Maxim Patlasov <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Zach O'Keefe <[email protected]>
>
> I've come across this change today and it is broken in several ways:

Thanks for picking up on this, Jan.

> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index cd4e4ae77c40a..02147b61712bc 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1638,7 +1638,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
> > */
> > dtc->wb_thresh = __wb_calc_thresh(dtc);
> > dtc->wb_bg_thresh = dtc->thresh ?
> > - div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> > + div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
>
> Firstly, the removed (u64) cast from the multiplication will introduce a
> multiplication overflow on 32-bit archs if wb_thresh * bg_thresh >= 1<<32
> (which is actually common - the default settings with 4GB of RAM will
> trigger this). [..]

True, and embarrassing given I was looking at this code with a 32-bit
focus. Well spotted.

> [..] Secondly, the div64_u64() is unnecessarily expensive on
> 32-bit archs. We have div64_ul() in case we want to be safe & cheap.

A last-minute change vs just casting the initial "dtc->thresh ?"
check. It did look expensive, but figured its existence implied it
should be used. I must have missed div64_ul().

> Thirdly, if thresholds are larger than 1<<32 pages, then dirty balancing is
> going to blow up in many other spectacular ways - consider only the
> multiplication on this line - it will not necessarily fit into u64 anymore.
> The whole dirty limiting code is interspersed with assumptions that limits
> are actually within u32 and we do our calculations in unsigned longs to
> avoid worrying about overflows (with occasional typing to u64 to make it
> more interesting because people expected those entities to overflow 32 bits
> even on 32-bit archs). Which is lame I agree but so far people don't seem
> to be setting limits to 16TB or more. And I'm not really worried about
> security here since this is global-root-only tunable and that has much
> better ways to DoS the system.
>
> So overall I'm all for cleaning up this code but in a sensible way please.
> E.g. for these overflow issues at least do it one function at a time so
> that we can sensibly review it.
>
> Andrew, can you please revert this patch until we have a better fix? So far
> it does more harm than good... Thanks!

Shall we just roll-forward with a suitable fix? I think all the
original code actually "needed" was to cast the ternary predicate,
like:

---8<---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fba324e1a010..ca1bfc0c9bdd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1637,8 +1637,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 * at some rate <= (write_bw / 2) for bringing down wb_dirty.
 	 */
 	dtc->wb_thresh = __wb_calc_thresh(dtc);
-	dtc->wb_bg_thresh = dtc->thresh ?
-		div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
+	dtc->wb_bg_thresh = (u32)dtc->thresh ?
+		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

 	/*
 	 * In order to avoid the stacked BDI deadlock we need
---8<---

Thanks, and apologies for the inconvenience.

Zach

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
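
The behaviour of the predicate-cast variant proposed in this message can be sketched in userspace as follows. bg_thresh_sketch() is a hypothetical helper, not the kernel function; it assumes a 64-bit unsigned long and mimics div_u64() by narrowing the divisor to 32 bits.

```c
#include <stdint.h>

/*
 * Hypothetical model of:
 *   (u32)thresh ? div_u64((u64)wb_thresh * bg_thresh, thresh) : 0
 * The guard and the division both see only the low 32 bits of thresh.
 */
static uint64_t bg_thresh_sketch(uint64_t wb_thresh, uint64_t bg_thresh,
				 uint64_t thresh)
{
	uint32_t divisor = (uint32_t)thresh;

	return divisor ? (wb_thresh * bg_thresh) / divisor : 0;
}
```

In this model the division by zero is gone, but a thresh above UINT_MAX still produces a result computed from the truncated divisor: with thresh == (1ULL << 32) + 1 the divisor is 1, so the sketch returns wb_thresh * bg_thresh unchanged.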

2024-04-18 11:04:56

by Jan Kara

Subject: Re: [PATCH] mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again

On Wed 17-04-24 12:33:39, Zach O'Keefe wrote:
> On Wed, Apr 17, 2024 at 4:10 AM Jan Kara <[email protected]> wrote:
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index cd4e4ae77c40a..02147b61712bc 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -1638,7 +1638,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
> > > */
> > > dtc->wb_thresh = __wb_calc_thresh(dtc);
> > > dtc->wb_bg_thresh = dtc->thresh ?
> > > - div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> > > + div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
..
> > Thirdly, if thresholds are larger than 1<<32 pages, then dirty balancing is
> > going to blow up in many other spectacular ways - consider only the
> > multiplication on this line - it will not necessarily fit into u64 anymore.
> > The whole dirty limiting code is interspersed with assumptions that limits
> > are actually within u32 and we do our calculations in unsigned longs to
> > avoid worrying about overflows (with occasional typing to u64 to make it
> > more interesting because people expected those entities to overflow 32 bits
> > even on 32-bit archs). Which is lame I agree but so far people don't seem
> > to be setting limits to 16TB or more. And I'm not really worried about
> > security here since this is global-root-only tunable and that has much
> > better ways to DoS the system.
> >
> > So overall I'm all for cleaning up this code but in a sensible way please.
> > E.g. for these overflow issues at least do it one function at a time so
> > that we can sensibly review it.
> >
> > Andrew, can you please revert this patch until we have a better fix? So far
> > it does more harm than good... Thanks!
>
> Shall we just roll-forward with a suitable fix? I think all the
> original code actually "needed" was to cast the ternary predicate,
> like:
>
> ---8<---
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index fba324e1a010..ca1bfc0c9bdd 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1637,8 +1637,8 @@ static inline void wb_dirty_limits(struct
> dirty_throttle_control *dtc)
> * at some rate <= (write_bw / 2) for bringing down wb_dirty.
> */
> dtc->wb_thresh = __wb_calc_thresh(dtc);
> - dtc->wb_bg_thresh = dtc->thresh ?
> - div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> + dtc->wb_bg_thresh = (u32)dtc->thresh ?
> + div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

Well, this would fix the division by 0 but when you read the code you
really start wondering what's going on :) And as I wrote above when
thresholds pass UINT_MAX, the dirty limiting code breaks down anyway so I
don't think the machine will be more usable after your fix. Would you be up
for a challenge to modify mm/page-writeback.c so that such huge limits
cannot be set instead? That would be actually a useful fix...

Honza

>
> /*
> * In order to avoid the stacked BDI deadlock we need
> ---8<---
>
> Thanks, and apologize for the inconvenience
>
> Zach
>
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
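
One way to read the suggestion above is as a range check applied when the limit is set. The following is only a sketch under stated assumptions: SKETCH_PAGE_SIZE (4096 bytes), bytes_to_pages_round_up() and validate_dirty_bytes() are hypothetical names, and this is not the code in mm/page-writeback.c.

```c
#include <stdint.h>

/* Assumed page size for the sketch; real systems may differ. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Convert a byte limit to pages, rounding up. */
static uint64_t bytes_to_pages_round_up(uint64_t bytes)
{
	return bytes / SKETCH_PAGE_SIZE + (bytes % SKETCH_PAGE_SIZE != 0);
}

/*
 * Returns 0 if the limit fits in 32 bits worth of pages, -1 otherwise,
 * so that later threshold arithmetic can assume values within u32.
 */
static int validate_dirty_bytes(uint64_t bytes)
{
	if (bytes_to_pages_round_up(bytes) > UINT32_MAX)
		return -1;
	return 0;
}
```

With 4096-byte pages, 1 << 32 pages corresponds to 1 << 44 bytes, which is the 16TB figure mentioned earlier in the thread; the sketch rejects that value and accepts anything up to UINT32_MAX pages.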

2024-04-19 18:05:22

by Zach O'Keefe

Subject: Re: [PATCH] mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again

On Thu, Apr 18, 2024 at 4:04 AM Jan Kara <[email protected]> wrote:
>
> On Wed 17-04-24 12:33:39, Zach O'Keefe wrote:
> > On Wed, Apr 17, 2024 at 4:10 AM Jan Kara <[email protected]> wrote:
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index cd4e4ae77c40a..02147b61712bc 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -1638,7 +1638,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
> > > > */
> > > > dtc->wb_thresh = __wb_calc_thresh(dtc);
> > > > dtc->wb_bg_thresh = dtc->thresh ?
> > > > - div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> > > > + div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> ...
> > > Thirdly, if thresholds are larger than 1<<32 pages, then dirty balancing is
> > > going to blow up in many other spectacular ways - consider only the
> > > multiplication on this line - it will not necessarily fit into u64 anymore.
> > > The whole dirty limiting code is interspersed with assumptions that limits
> > > are actually within u32 and we do our calculations in unsigned longs to
> > > avoid worrying about overflows (with occasional typing to u64 to make it
> > > more interesting because people expected those entities to overflow 32 bits
> > > even on 32-bit archs). Which is lame I agree but so far people don't seem
> > > to be setting limits to 16TB or more. And I'm not really worried about
> > > security here since this is global-root-only tunable and that has much
> > > better ways to DoS the system.
> > >
> > > So overall I'm all for cleaning up this code but in a sensible way please.
> > > E.g. for these overflow issues at least do it one function at a time so
> > > that we can sensibly review it.
> > >
> > > Andrew, can you please revert this patch until we have a better fix? So far
> > > it does more harm than good... Thanks!
> >
> > Shall we just roll-forward with a suitable fix? I think all the
> > original code actually "needed" was to cast the ternary predicate,
> > like:
> >
> > ---8<---
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index fba324e1a010..ca1bfc0c9bdd 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1637,8 +1637,8 @@ static inline void wb_dirty_limits(struct
> > dirty_throttle_control *dtc)
> > * at some rate <= (write_bw / 2) for bringing down wb_dirty.
> > */
> > dtc->wb_thresh = __wb_calc_thresh(dtc);
> > - dtc->wb_bg_thresh = dtc->thresh ?
> > - div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
> > + dtc->wb_bg_thresh = (u32)dtc->thresh ?
> > + div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
>
> Well, this would fix the division by 0 but when you read the code you
> really start wondering what's going on :) [..]

Ya, this was definitely a local fix in an area of code I know very
little about. I stumbled across it in a rather contrived way -- made
easier by internal patches -- and felt its existence still warranted a
local fix.

> [..] And as I wrote above when
> thresholds pass UINT_MAX, the dirty limiting code breaks down anyway so I
> don't think the machine will be more usable after your fix. Would you be up
> for a challenge to modify mm/page-writeback.c so that such huge limits
> cannot be set instead? That would be actually a useful fix...

:) I can't say my schedule affords me much time to take on any
significant unplanned work. Perhaps as a Friday afternoon exercise
I'll come back to scope this out, driven by some sense of
responsibility garnered from starting down this path; but ... my TODO
list is long.

Have a great rest of your day / weekend,
Zach

> Honza
>
> >
> > /*
> > * In order to avoid the stacked BDI deadlock we need
> > ---8<---
> >
> > Thanks, and apologize for the inconvenience
> >
> > Zach
> >
> > > Honza
> > > --
> > > Jan Kara <[email protected]>
> > > SUSE Labs, CR
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR