[permalink] [raw]

Subject: Re: [PATCH v3 05/11] xprtrdma: Do not wait if ib_post_send() fails

> On Mar 10, 2016, at 5:25 AM, Sagi Grimberg <[email protected]> wrote:
>
>
>>> Moving the QP into error state right after with rdma_disconnect
>>> you are not sure that none of the subset of the invalidations
>>> that _were_ posted completed and you get the corresponding MRs
>>> in a bogus state...
>>
>> Moving the QP to error state and then draining the CQs means
>> that all LOCAL_INV WRs that managed to get posted will get
>> completed or flushed. That's already handled today.
>>
>> It's the WRs that didn't get posted that I'm worried about
>> in this patch.
>>
>> Are there RDMA consumers in the kernel that use that third
>> argument to recover when LOCAL_INV WRs cannot be posted?
>
> None :)
>
>>>> I suppose I could reset these MRs instead (that is,
>>>> pass them to ib_dereg_mr).
>>>
>>> Or, just wait for a completion for those that were posted
>>> and then all the MRs are in a consistent state.
>>
>> When a LOCAL_INV completes with IB_WC_SUCCESS, the associated
>> MR is in a known state (ie, invalid).
>>
>> The WRs that flush mean the associated MRs are not in a known
>> state. Sometimes the MR state is different than the hardware
>> state, for example. Trying to do anything with one of these
>> inconsistent MRs results in IB_WC_BIND_MW_ERR until the thing
>> is deregistered.
>
> Correct.
>
>> The xprtrdma completion handlers mark the MR associated with
>> a flushed LOCAL_INV WR "stale". They all have to be reset with
>> ib_dereg_mr to guarantee they are usable again. Have a look at
>> __frwr_recovery_worker().
>
> Yes, I'm aware of that.
>
>> And, xprtrdma waits for only the last LOCAL_INV in the chain to
>> complete. If that one isn't posted, then fr_done is never woken
>> up. In that case, frwr_op_unmap_sync() would wait forever.
>
> Ah.. so the (missing) completions is the problem, now I get
> it.
>
>> If I understand you I think the correct solution is for
>> frwr_op_unmap_sync() to regroup and reset the MRs associated
>> with the LOCAL_INV WRs that were never posted, using the same
>> mechanism as __frwr_recovery_worker() .
>
> Yea, I'd recycle all the MRs instead of having non-trivial logic
> to try and figure out MR states...

We have to keep that logic, since a spurious disconnect
will result in flushed LOCAL_INV requests too. In fact
that's the by far more likely source of inconsistent MRs.

>> It's already 4.5-rc7, a little late for a significant rework
>> of this patch, so maybe I should drop it?
>
> Perhaps... Although you can make it incremental because the current
> patch doesn't seem to break anything, just not solving the complete
> problem...

I'm preparing to extend the frwr_queue_recovery mechanism
in v4.7 to deal with other cases, and that new code could
be used here to fence MRs, rather than forcing a disconnect.

I'd like to leave 05/11 in place for v4.6.

Anna, can you add Sagi's Reviewed-by tags to the other
patches in this series, as he posted earlier this week?

--
Chuck Lever

2016-03-10 17:01:57

by Anna Schumaker

[permalink] [raw]

Subject: Re: [PATCH v3 05/11] xprtrdma: Do not wait if ib_post_send() fails

On 03/10/2016 11:40 AM, Chuck Lever wrote:
>
>> On Mar 10, 2016, at 5:25 AM, Sagi Grimberg <[email protected]> wrote:
>>
>>
>>>> Moving the QP into error state right after with rdma_disconnect
>>>> you are not sure that none of the subset of the invalidations
>>>> that _were_ posted completed and you get the corresponding MRs
>>>> in a bogus state...
>>>
>>> Moving the QP to error state and then draining the CQs means
>>> that all LOCAL_INV WRs that managed to get posted will get
>>> completed or flushed. That's already handled today.
>>>
>>> It's the WRs that didn't get posted that I'm worried about
>>> in this patch.
>>>
>>> Are there RDMA consumers in the kernel that use that third
>>> argument to recover when LOCAL_INV WRs cannot be posted?
>>
>> None :)
>>
>>>>> I suppose I could reset these MRs instead (that is,
>>>>> pass them to ib_dereg_mr).
>>>>
>>>> Or, just wait for a completion for those that were posted
>>>> and then all the MRs are in a consistent state.
>>>
>>> When a LOCAL_INV completes with IB_WC_SUCCESS, the associated
>>> MR is in a known state (ie, invalid).
>>>
>>> The WRs that flush mean the associated MRs are not in a known
>>> state. Sometimes the MR state is different than the hardware
>>> state, for example. Trying to do anything with one of these
>>> inconsistent MRs results in IB_WC_BIND_MW_ERR until the thing
>>> is deregistered.
>>
>> Correct.
>>
>>> The xprtrdma completion handlers mark the MR associated with
>>> a flushed LOCAL_INV WR "stale". They all have to be reset with
>>> ib_dereg_mr to guarantee they are usable again. Have a look at
>>> __frwr_recovery_worker().
>>
>> Yes, I'm aware of that.
>>
>>> And, xprtrdma waits for only the last LOCAL_INV in the chain to
>>> complete. If that one isn't posted, then fr_done is never woken
>>> up. In that case, frwr_op_unmap_sync() would wait forever.
>>
>> Ah.. so the (missing) completions is the problem, now I get
>> it.
>>
>>> If I understand you I think the correct solution is for
>>> frwr_op_unmap_sync() to regroup and reset the MRs associated
>>> with the LOCAL_INV WRs that were never posted, using the same
>>> mechanism as __frwr_recovery_worker() .
>>
>> Yea, I'd recycle all the MRs instead of having non-trivial logic
>> to try and figure out MR states...
>
> We have to keep that logic, since a spurious disconnect
> will result in flushed LOCAL_INV requests too. In fact
> that's the by far more likely source of inconsistent MRs.
>
>
>>> It's already 4.5-rc7, a little late for a significant rework
>>> of this patch, so maybe I should drop it?
>>
>> Perhaps... Although you can make it incremental because the current
>> patch doesn't seem to break anything, just not solving the complete
>> problem...
>
> I'm preparing to extend the frwr_queue_recovery mechanism
> in v4.7 to deal with other cases, and that new code could
> be used here to fence MRs, rather than forcing a disconnect.
>
> I'd like to leave 05/11 in place for v4.6.
>
> Anna, can you add Sagi's Reviewed-by tags to the other
> patches in this series, as he posted earlier this week?

Yeah, I can do that. I'll leave in the patch, and send everything to Trond later this afternoon or tomorrow!

Anna
>
>
> --
> Chuck Lever
>
>
>