2019-01-27 06:55:31

by Heiner Litz

Subject: [PATCH] lightnvm: pblk: fix race condition on GC

This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update is
performed, the sectors are ignored by the GC logic and the line is freed
before all sectors are moved. When the L2P is finally updated, it contains
a mapping to a freed line, and subsequent reads of the corresponding LBAs fail.
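
Roughly, the interleaving is as follows (simplified sketch):

  writer: data write completes on the device; the L2P update is still
          pending, so the L2P entries point into the write buffer
  GC:     picks the line; the L2P-vs-line cross-check fails for the
          racing sectors, so they are skipped as invalid
  GC:     moves the remaining valid sectors and frees the line
  writer: pblk_update_map_dev() now maps the LBAs into the freed line
  reader: the lookup resolves to the freed line and the read fails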

Note that looking up the L2P and checking the ppa in the write buffer need
to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.
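
For reference, pblk_lookup_l2p_rand() takes trans_lock internally, which
is why the patch open-codes the lookup under the lock instead of calling
the helper. A rough paraphrase of the helper (pblk-core.c, debug checks
omitted):

void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas,
			  u64 *lba_list, int nr_secs)
{
	int i;

	spin_lock(&pblk->trans_lock);
	for (i = 0; i < nr_secs; i++) {
		if (lba_list[i] != ADDR_EMPTY)
			ppas[i] = pblk_trans_map_get(pblk, lba_list[i]);
	}
	spin_unlock(&pblk->trans_lock);
	/* The lock is dropped here, so a write-buffer check done by the
	 * caller afterwards races against the L2P update - exactly the
	 * window this patch closes. */
}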

Signed-off-by: Heiner Litz <[email protected]>
---
drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 3789185144da..7c556b2218e4 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -529,13 +529,35 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
 	int valid_secs = 0;
 	int i;
 
-	pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
-
+	spin_lock(&pblk->trans_lock);
 	for (i = 0; i < nr_secs; i++) {
 		if (lba_list[i] == ADDR_EMPTY)
 			continue;
 
+		ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
 		ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
+
+		/* Obtain ppa from cache if the sector has been synced to the
+		 * device but the L2P has not been updated yet */
+		if (pblk_addr_in_cache(ppa_list_l2p[i])) {
+			struct pblk_rb *rb = &pblk->rwb;
+			struct pblk_rb_entry *entry;
+			struct pblk_w_ctx *w_ctx;
+			u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
+
+#ifdef CONFIG_NVM_PBLK_DEBUG
+			/* Ensure that the access will not cause an overflow */
+			BUG_ON(pos >= rb->nr_entries);
+#endif
+
+			entry = &rb->entries[pos];
+			w_ctx = &entry->w_ctx;
+			if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
+				rqd->ppa_list[valid_secs++] = ppa_gc;
+				continue;
+			}
+		}
+
 		if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
 			paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
 			continue;
@@ -543,6 +565,7 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
 
 		rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
 	}
+	spin_unlock(&pblk->trans_lock);
 
 #ifdef CONFIG_NVM_PBLK_DEBUG
 	atomic_long_add(valid_secs, &pblk->inflight_reads);
--
2.17.1



2019-01-29 08:13:34

by Javier González

Subject: Re: [PATCH] lightnvm: pblk: fix race condition on GC

> On 27 Jan 2019, at 07.54, Heiner Litz <[email protected]> wrote:
>
> This patch fixes a race condition where a write is mapped to the last
> sectors of a line. The write is synced to the device but the L2P is not
> updated yet. When the line is garbage collected before the L2P update is
> performed, the sectors are ignored by the GC logic and the line is freed
> before all sectors are moved. When the L2P is finally updated, it contains
> a mapping to a freed line, and subsequent reads of the corresponding LBAs fail.

Hi Heiner,

This has been an interesting issue to debug - good catch!

>
> Note that looking up the L2P and checking the ppa in the write buffer need
> to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.
>
> Signed-off-by: Heiner Litz <[email protected]>
> ---
> drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> index 3789185144da..7c556b2218e4 100644
> --- a/drivers/lightnvm/pblk-read.c
> +++ b/drivers/lightnvm/pblk-read.c
> @@ -529,13 +529,35 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
>  	int valid_secs = 0;
>  	int i;
>  
> -	pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
> -
> +	spin_lock(&pblk->trans_lock);
>  	for (i = 0; i < nr_secs; i++) {
>  		if (lba_list[i] == ADDR_EMPTY)
>  			continue;
>  
> +		ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
>  		ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
> +
> +		/* Obtain ppa from cache if the sector has been synced to the
> +		 * device but the L2P has not been updated yet */
> +		if (pblk_addr_in_cache(ppa_list_l2p[i])) {
> +			struct pblk_rb *rb = &pblk->rwb;
> +			struct pblk_rb_entry *entry;
> +			struct pblk_w_ctx *w_ctx;
> +			u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
> +
> +#ifdef CONFIG_NVM_PBLK_DEBUG
> +			/* Ensure that the access will not cause an overflow */
> +			BUG_ON(pos >= rb->nr_entries);
> +#endif
> +
> +			entry = &rb->entries[pos];
> +			w_ctx = &entry->w_ctx;
> +			if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
> +				rqd->ppa_list[valid_secs++] = ppa_gc;
> +				continue;
> +			}
> +		}
> +
>  		if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
>  			paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
>  			continue;
> @@ -543,6 +565,7 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
>  
>  		rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
>  	}
> +	spin_unlock(&pblk->trans_lock);
>  
>  #ifdef CONFIG_NVM_PBLK_DEBUG
>  	atomic_long_add(valid_secs, &pblk->inflight_reads);
> --
> 2.17.1


Here is a suggestion: Why not add an atomic counter to the line tracking
the sectors that still need to be synced to the L2P table, and then
loosely wait (i.e., check and sleep / schedule) until the counter reaches
0 in pblk_line_close_ws()? This way you guarantee that the line does not
close - and therefore never reaches the GC lists - before all the L2P
entries for that line point to the media. Any other form of
synchronization that puts the burden on pblk_line_close_ws() would also
work for me.

In essence, I would rather pay the price on a per-line basis than hold
the trans_lock longer for each I/O.
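
Something along these lines, as an untested sketch (the field and the
helper names are made up for illustration, they are not existing pblk
code):

/* Illustrative sketch only - not existing pblk code */
struct pblk_line {
	/* ... existing fields ... */
	atomic_t sec_to_update;	/* L2P entries not yet pointing to media */
};

/* Write path: account the sector when the device write is issued */
static inline void pblk_line_l2p_pending(struct pblk_line *line)
{
	atomic_inc(&line->sec_to_update);
}

/* L2P update path: drop the count once the entry points to media */
static inline void pblk_line_l2p_updated(struct pblk_line *line)
{
	atomic_dec(&line->sec_to_update);
}

/* Called from pblk_line_close_ws(): loosely wait (check and sleep)
 * until every L2P entry for the line points to the media. */
static void pblk_line_l2p_wait(struct pblk_line *line)
{
	while (atomic_read(&line->sec_to_update))
		schedule_timeout_uninterruptible(msecs_to_jiffies(1));
}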

Thoughts?

Javier



2019-01-29 18:41:46

by Javier González

Subject: Re: [PATCH] lightnvm: pblk: fix race condition on GC


> On 29 Jan 2019, at 17.36, Heiner Litz <[email protected]> wrote:
>
> Javier,
>
> On Tue, Jan 29, 2019 at 12:13 AM Javier González <[email protected]> wrote:
> > On 27 Jan 2019, at 07.54, Heiner Litz <[email protected]> wrote:
> >
> > This patch fixes a race condition where a write is mapped to the last
> > sectors of a line. The write is synced to the device but the L2P is not
> > updated yet. When the line is garbage collected before the L2P update is
> > performed, the sectors are ignored by the GC logic and the line is freed
> > before all sectors are moved. When the L2P is finally updated, it contains
> > a mapping to a freed line, and subsequent reads of the corresponding LBAs fail.
>
> Hi Heiner,
>
> This has been an interesting issue to debug - good catch!
>
> Felt more like a marathon than a catch ;)

Hehehe, I know. It is a good marathon then :)

>
>
> >
> > Note that looking up the L2P and checking the ppa in the write buffer need
> > to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.
> >
> > Signed-off-by: Heiner Litz <[email protected]>
> > ---
> > drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
> > 1 file changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> > index 3789185144da..7c556b2218e4 100644
> > --- a/drivers/lightnvm/pblk-read.c
> > +++ b/drivers/lightnvm/pblk-read.c
> > @@ -529,13 +529,35 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> >  	int valid_secs = 0;
> >  	int i;
> >  
> > -	pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
> > -
> > +	spin_lock(&pblk->trans_lock);
> >  	for (i = 0; i < nr_secs; i++) {
> >  		if (lba_list[i] == ADDR_EMPTY)
> >  			continue;
> >  
> > +		ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
> >  		ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
> > +
> > +		/* Obtain ppa from cache if the sector has been synced to the
> > +		 * device but the L2P has not been updated yet */
> > +		if (pblk_addr_in_cache(ppa_list_l2p[i])) {
> > +			struct pblk_rb *rb = &pblk->rwb;
> > +			struct pblk_rb_entry *entry;
> > +			struct pblk_w_ctx *w_ctx;
> > +			u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
> > +
> > +#ifdef CONFIG_NVM_PBLK_DEBUG
> > +			/* Ensure that the access will not cause an overflow */
> > +			BUG_ON(pos >= rb->nr_entries);
> > +#endif
> > +
> > +			entry = &rb->entries[pos];
> > +			w_ctx = &entry->w_ctx;
> > +			if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
> > +				rqd->ppa_list[valid_secs++] = ppa_gc;
> > +				continue;
> > +			}
> > +		}
> > +
> >  		if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
> >  			paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
> >  			continue;
> > @@ -543,6 +565,7 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> >  
> >  		rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
> >  	}
> > +	spin_unlock(&pblk->trans_lock);
> >  
> > #ifdef CONFIG_NVM_PBLK_DEBUG
> >  	atomic_long_add(valid_secs, &pblk->inflight_reads);
> > --
> > 2.17.1
>
>
> Here is a suggestion: Why not add an atomic counter to the line tracking
> the sectors that still need to be synced to the L2P table, and then
> loosely wait (i.e., check and sleep / schedule) until the counter reaches
> 0 in pblk_line_close_ws()? This way you guarantee that the line does not
> close - and therefore never reaches the GC lists - before all the L2P
> entries for that line point to the media. Any other form of
> synchronization that puts the burden on pblk_line_close_ws() would also
> work for me.
>
> In essence, I would rather pay the price on a per-line basis than hold
> the trans_lock longer for each I/O.
>
> The patch only adds two well-predictable branches to the loop, so I think
> the impact would be minimal, but I still think your approach is cleaner.
> I suggest checking the proposed sync counter whenever selecting a GC
> candidate. I'll send out a V2.
>

Sounds good! Thanks!
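
For concreteness, the candidate check could look roughly like this
(sketch based on pblk_gc_get_victim_line(); sec_to_update is the
hypothetical counter from above, pending your V2):

static struct pblk_line *pblk_gc_get_victim_line(struct pblk *pblk,
						 struct list_head *group_list)
{
	struct pblk_line *line, *victim = NULL;
	unsigned int line_vsc, victim_vsc = ~0x0;

	list_for_each_entry(line, group_list, list) {
		/* Skip lines whose L2P entries are still in flight */
		if (atomic_read(&line->sec_to_update))
			continue;

		line_vsc = le32_to_cpu(*line->vsc);
		if (line_vsc < victim_vsc) {
			victim = line;
			victim_vsc = line_vsc;
		}
	}

	/* May return NULL if every line still has pending updates */
	return victim;
}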

Javier

