by Dai Ngo

[permalink] [raw]

Subject: Re: [PATCH v2 0/2] enhance NFSv4.2 SSC to delay unmount source's export.

On 4/7/21 5:58 PM, Olga Kornievskaia wrote:
> On Wed, Apr 7, 2021 at 6:50 PM <[email protected]> wrote:
>>
>> On 4/7/21 2:40 PM, Olga Kornievskaia wrote:
>>> On Wed, Apr 7, 2021 at 4:16 PM <[email protected]> wrote:
>>>> On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
>>>>> On Wed, Apr 7, 2021 at 1:13 PM <[email protected]> wrote:
>>>>>> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
>>>>>>> On Tue, Apr 6, 2021 at 9:23 PM <[email protected]> wrote:
>>>>>>>> On 4/6/21 6:12 PM, [email protected] wrote:
>>>>>>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>>>>>>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently the source's export is mounted and unmounted on every
>>>>>>>>>>>>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>>>>>>>>>>>>> for each copy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch series is an enhancement to allow the export to remain
>>>>>>>>>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>>>>>>>>>>>>> export is not being used for the configured time it will be
>>>>>>>>>>>>>>>> unmounted
>>>>>>>>>>>>>>>> by a delayed task. If it's used again then its expiration time is
>>>>>>>>>>>>>>>> extended for another period.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since mount and unmount are no longer done on each copy request,
>>>>>>>>>>>>>>>> this overhead is no longer used to decide whether the copy should
>>>>>>>>>>>>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>>>>>>>>>>>>> to determine sync or async copy is now used for this decision.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Dai
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>>>>>>>>>> Hi Olga-
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
>>>>>>>>>>>>>>> request.
>>>>>>>>>>>>>>> Have you had a chance to review Dai's patches?
>>>>>>>>>>>>>> Hi Chuck,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>> until tomorrow evening to do so?
>>>>>>>>>>>>> Next couple of days will be fine. Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>> I also assumed there would be a v2 given that kbot complained about
>>>>>>>>>>>> the NFSD patch.
>>>>>>>>>>> This is the v2 (see Subject: )
>>>>>>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>>>>>>>>>> the originals. Again I'll test/review the v2 by the end of the day
>>>>>>>>>> tomorrow!
>>>>>>>>>>
>>>>>>>>>> Actually a question for Dai: have you done performance tests with your
>>>>>>>>>> patches and can show that small copies still perform? Can you please
>>>>>>>>>> post your numbers with the patch series? When we posted the original
>>>>>>>>>> patch set we did provide performance numbers to support the choices we
>>>>>>>>>> made (ie, not hurting performance of small copies).
>>>>>>>>> Currently the source and destination export was mounted with default
>>>>>>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
>>>>>>>>> to decide whether to do inter-server copy or generic copy.
>>>>>>>>>
>>>>>>>>> I ran performance tests on my test VMs, with and without the patch,
>>>>>>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
>>>>>>>>> each test 5 times and took the average. I include the results of 'cp'
>>>>>>>>> for reference:
>>>>>>>>>
>>>>>>>>> size cp with patch without patch
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>> 1048576 0.031 0.032 (generic) 0.029 (generic)
>>>>>>>>> 1049490 0.032 0.042 (inter-server) 0.037 (generic)
>>>>>>>>> 2048000 0.051 0.047 (inter-server) 0.053 (generic)
>>>>>>>>> 7341056 0.157 0.074 (inter-server) 0.185 (inter-server)
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>> Sorry, times are in seconds.
>>>>>>> Thank you for the numbers. #2 case is what I'm worried about.
>>>>>> Regarding performance numbers, the patch does better than the original
>>>>>> code in #3 and #4 and worse then original code in #1 and #2. #4 run
>>>>>> shows the benefit of the patch when doing inter-copy. The #2 case can
>>>>>> be mitigated by using a configurable threshold. In general, I think it's
>>>>>> more important to have good performance on large files than small files
>>>>>> when using inter-server copy. Note that the original code does not
>>>>>> do well with small files either as shown above.
>>>>> I think the current approach tries to be very conservative to achieve
>>>>> the goal of never being worse than the cp. I'm not sure what you mean
>>>>> that current code doesn't do well with small files. For small files it
>>>>> falls back to the generic copy.
>>>> In this table, the only advantage the current code has over 'cp' is
>>>> run 1 which I don't know why. The rest is slower than 'cp'. I don't
>>>> have the size of the copy where the inter-copy in the current code
>>>> starts showing better performance yet, but even at ~7MB it is still
>>>> slower than 'cp'. So for any size that is smaller than 7MB+, the
>>>> inter-server copy will be slower than 'cp'. Compare that with the
>>>> patch, the benefit of inter-server copy starts at ~2MB.
>>> I went back to Jorge Mora's perf numbers we posted. You are right, we
>>> did report perf degradation for any copies smaller than 16MB for when
>>> we didn't cap the copy size to be at least 14*rsize (I think the
>>> assumption was that rsize=1M and making it 14M). I'm just uneasy to
>>> open it to even smaller sizes. I think we should explicitly change it
>>> to 16MB instead of removing the restriction. Again I think the policy
>>> we want it to do no worse than cp.
>> I can make this a module's configurable parameter and default it to 16MB.
>> However, why 16MB while the measurement I did shows inter-copy starts
>> performing better than 'cp' at ~2MB? Your previous measurement might no
>> longer valid for the latest code with the patch. Can you verify the patch
>> performs than cp even with 2048000 bytes copy?
> 16MB came from hardware measurements which are a lot more realistic
> than VM measurements. Can you please get hardware measurements? I
> haven't gotten around to the hardware testing. Because I think the
> solution needs to be changed (ie the whole mount needs to be saved and
> not just part of the mount). Once we have that solution, we can
> measure performance and see what numbers make sense for the client
> side. Having it as a module configuration parameter is fine with me.
> Updating it to a new set of hardware based numbers is ok with me too.

I'm also ok with a module configuration parameter. I don't have the
hardware for testing, I will use the default value of 16MB for now.
Since it's configurable anyone can adjust to the desire value. I will
send a new patch for this.

>
>>>>>>> I don't believe the code works. In my 1st test doing "nfstest_ssc
>>>>>>> --runtest inter01" and then doing it again. What I see from inspecting
>>>>>>> the traces is that indeed unmount doesn't happen but for the 2nd copy
>>>>>>> the mount happens again.
>>>>>>>
>>>>>>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
>>>>>>> client .68. The first run of "inter01" places a copy in frame 364.
>>>>>>> frame 367 has the beginning of the "mount" between .114 and .110. then
>>>>>>> read happens. then a copy offload callback happens. No unmount happens
>>>>>>> as expected. inter01 continues with its verification and clean up. By
>>>>>>> frame 768 the test is done. I'm waiting a bit. So there is a heatbeat
>>>>>>> between the .114 and .110 in frame 769. Then the next run of the
>>>>>>> "inter01", COPY is placed in frame 1110. The next thing that happens
>>>>>>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
>>>>>>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
>>>>>>> however it works provides no savings.
>>>>>> The saving are EXCHANGE_ID, CREATE_SESSION, RECLAIM COMPLETE,
>>>>>> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
>>>>>> The saving is reflected in the number of #4 test run above.
>>>>> Can't we do better than that? Since you are keeping a list of umounts,
>>>>> can't they be searched before doing the vfs_mount() and instead get
>>>>> the mount structure from the list (and not call the vfs_mount at all)
>>>>> and remove it from the umount list (wouldn't that save all the calls)?
>>>> I thought about this. My problem here is that we don't have much to key
>>>> on to search the list. The only thing in the COPY argument can be used
>>>> for this purpose is the list of IP addresses of the source server.
>>>> I think that is not enough, there can be multiple exports from the
>>>> same server, how do we find the right one? it can get complicated.
>>>> I'm happy to consider any suggestion you have for this.
>>> I believe an IP address is exactly what's needed for keying. Each of
>>> those "mounts" to the same server is shared (that's normal behaviour
>>> for the client) -- meaning there is just 1 "mount" for a given IP
>>> (there is a single nfs4_client structure).
>> ok, I can give this a try. However, I think this only reduces the
>> number of RPCs which are bunch of GETATTRs so I don't think this will
>> help the performance number significantly.
> They are RPCs and meta data operations that go into the file system.I
> would think they affect performance...
>
>>>> I think the patch is an improvement, in performance for copying large
>>>> files (if you consider 2MB file is large) and for removing the bug
>>>> of computing overhead in __nfs4_copy_file_range. Note that we can
>>>> always improve it and not necessary doing it all at once.
>>> I don't think saving 3 RPCs out of 14 is a good enough improvement
>>> when it can be made to save them all (unless you can convince me that
>>> we can't save all 14).
>> I don't understand your numbers here. Without the patch, there are
>> 14 RPCs for mount and 2 RPCs for unmount and overhead of creating
>> and tearing down TCP connection each time. With the patch, there are
>> 8 RPCs for the mount (PUTROOTFH and GETATTRs) and 0 for unmount and
>> no create and tear down TCP connection. The RPCs that the patch save
>> are mostly heavy-weight RPC requests as the test showed the patch
>> outperforms the current code even at 2MB.
> I counted mount operations savings. I agreed with you that we saved
> the umount. I'm saying that given that you are keeping track of the
> mounts structures, those 8 RPCs shouldn't happen.
>
> I think my approach didn't keep track of the mount and just delayed
> the execution of the cleanup function so it was not possible to "save
> the mount rpcs". You code takes it further but not far enough.

ok, let hold off on this patch and I will look into the IP address
approach and also see if we can completely remove all RPCs related to
mount.

>
> As for the savings as I said I'd like to see the performance numbers
> based on the hardware setup not VMs. And yes the existing hardware
> numbers are based on the old code and need to be re-run on the
> existing code. I will try to do that tomorrow.

yes, it's useful data point.

>
> I think at this point you need to have one of the maintainers weigh in
> on whether or not this intermediate step of saying a few rpcs is worth
> checking in and then get the saving of the whole mount. It's really up
> to them. In my opinion, I think if we are trying to save mount
> operations, we should do it all the way.
>
> Furthermore, I still really would like Bruce or Chuck to weigh in on
> the use of the semaphore. We are holding onto a semaphore while doing
> vfs_kern_mount(), I thought it's frowned upon to hold any locks while
> doing network operations. What if vfs_kern_mount() is trying to reach
> an unresponsive server, then all other unmount are blocked, right? I
> still don't understand why we need a semaphore. I think a spin lock
> that protects access to the list of umounts is sufficient. I probably
> should be commenting on the actual patch here.

Let hold off on this until I submit a new patch.

Thanks,
-Dai

>
>
>> -Dai
>>
>>>> -Dai
>>>>
>>>>>> Note that the overhead of the copy in the current code includes mount
>>>>>> *and* unmount. However the threshold computed in __nfs4_copy_file_range
>>>>>> includes only the guesstimated mount overhead and not the unmount
>>>>>> overhead so it not correct.
>>>>>>
>>>>>> -Dai
>>>>>>
>>>>>>
>>>>>>> Honestly I don't understand the whole need of a semaphore and all.
>>>>>> The semaphore is to prevent the export to be unmounted while it's
>>>>>> being used.
>>>>>>
>>>>>> -Dai
>>>>>>
>>>>>>> My
>>>>>>> approach that I tried before was to create a delayed work item but I
>>>>>>> don't recall why I dropped it.
>>>>>>> https://urldefense.com/v3/__https://patchwork.kernel.org/project/linux-nfs/patch/[email protected]/__;!!GqivPVa7Brio!Jl5Wq7nrFUsaUQjgLJoSuV-cDlvbPaav3x8nXQcRhAdxjVEoWvK24sNgoE82Zg$
>>>>>>>
>>>>>>>
>>>>>>>> -Dai
>>>>>>>>
>>>>>>>>> Note that without the patch, the threshold to do inter-server
>>>>>>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
>>>>>>>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
>>>>>>>>> threshold to decide to sync/async for intra-copy.
>>>>>>>>>
>>>>>>>>>> While I agree that delaying the unmount on the server is beneficial
>>>>>>>>>> I'm not so sure that dropping the client restriction is wise because
>>>>>>>>>> the small (singular) copy would suffer the setup cost of the initial
>>>>>>>>>> mount.
>>>>>>>>> Right, but only the 1st copy. The export remains to be mounted for
>>>>>>>>> 15 mins so subsequent small copies do not incur the mount and unmount
>>>>>>>>> overhead.
>>>>>>>>>
>>>>>>>>> I think ideally we want the server to do inter-copy only if it's faster
>>>>>>>>> than the generic copy. We can probably come up with a number after some
>>>>>>>>> testing and that number can not be based on the rsize as it is now since
>>>>>>>>> the rsize can be changed by mount option. This can be a fixed number,
>>>>>>>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
>>>>>>>>> to any other options.
>>>>>>>>>
>>>>>>>>>> Just my initial thoughts...
>>>>>>>>> Thanks,
>>>>>>>>> -Dai
>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>

2021-04-08 15:27:01

by Chuck Lever

[permalink] [raw]

Subject: Re: [PATCH v2 0/2] enhance NFSv4.2 SSC to delay unmount source's export.

> On Apr 7, 2021, at 8:58 PM, Olga Kornievskaia <[email protected]> wrote:
>
> Furthermore, I still really would like Bruce or Chuck to weigh in on
> the use of the semaphore.

The semaphore caught my eye too.

The usual trick to prevent a mount from disappearing is to
bump a reference count until it is safe for the mount to
possibly go away. Dai, if you can make this work with just
an extra reference count, I would prefer that.

> We are holding onto a semaphore while doing
> vfs_kern_mount(), I thought it's frowned upon to hold any locks while
> doing network operations. What if vfs_kern_mount() is trying to reach
> an unresponsive server, then all other unmount are blocked, right? I
> still don't understand why we need a semaphore. I think a spin lock
> that protects access to the list of umounts is sufficient. I probably
> should be commenting on the actual patch here.

--
Chuck Lever