* [RFC] After server stops nfslock service, client can still get a lock
From: Mi Jinlong @ 2009-11-17  9:47 UTC
  To: Trond.Myklebust, NFSv3 list, J. Bruce Fields

While testing NLM, I found a bug:
after the server stops its nfslock service, a client can still get a lock successfully.

Test process:

  Step 1: the client opens an NFS file.
  Step 2: the client takes a lock on it with fcntl().
  Step 3: the client releases the lock with fcntl().
  Step 4: the server stops its nfslock service.
  Step 5: the client takes a lock with fcntl() again.

At step 5 the lock request should fail, but it succeeds.
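
The client side of the test looks roughly like this (a minimal, untested
sketch; /mnt/nfs/testfile is only an example path on an NFS mount):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Set or clear a whole-file lock; type is F_WRLCK or F_UNLCK. */
static int set_lock(int fd, short type)
{
        struct flock fl;

        memset(&fl, 0, sizeof(fl));
        fl.l_type = type;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;                           /* 0 means the whole file */
        return fcntl(fd, F_SETLK, &fl);
}

int main(void)
{
        int fd = open("/mnt/nfs/testfile", O_RDWR);     /* step 1 */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (set_lock(fd, F_WRLCK) < 0)                  /* step 2 */
                perror("first lock");
        if (set_lock(fd, F_UNLCK) < 0)                  /* step 3 */
                perror("unlock");

        printf("stop nfslock on the server, then press Enter\n");
        getchar();                                      /* step 4 happens here */

        if (set_lock(fd, F_WRLCK) < 0)                  /* step 5 */
                perror("second lock");                  /* expected result */
        else
                printf("second lock succeeded\n");      /* observed result */

        close(fd);
        return 0;
}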

Reason:
  When the server stops its nfslock service, the client's host struct
  is not unmonitored on the server. When the client locks again, that
  host struct is reused but is not monitored again, so the lock
  request at step 5 succeeds.
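
My understanding of the server-side logic, as a much simplified model
(this is not the real lockd/statd code; the names below are only for
illustration):

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the server's per-client NSM handle. */
struct nsm_handle {
        bool sm_monitored;      /* set after a successful SM_MON call */
};

static bool statd_running = true;

/* Roughly what the server does before granting a lock to this client. */
static int monitor_client(struct nsm_handle *nsm)
{
        if (nsm->sm_monitored)
                return 0;       /* already monitored: statd is not contacted */
        if (!statd_running)
                return -1;      /* SM_MON would fail, so refuse the lock */
        nsm->sm_monitored = true;
        return 0;
}

int main(void)
{
        struct nsm_handle client = { .sm_monitored = false };

        /* step 2: first lock, statd still running */
        printf("step 2: %s\n", monitor_client(&client) ? "refused" : "granted");

        /* step 3 releases the lock but does not clear sm_monitored */
        /* step 4: nfslock (statd) is stopped on the server */
        statd_running = false;

        /* step 5: the reused handle is still flagged as monitored */
        printf("step 5: %s\n", monitor_client(&client) ? "refused" : "granted");
        return 0;
}

If the host struct were unmonitored (the flag cleared) when nfslock
stops, the request at step 5 would be refused instead.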

Questions:
  1. Should the server unmonitor the client's host struct when it
     stops the nfslock service?

  2. Should rpc.statd tell the kernel its status (when it starts and
     stops) by sending an SM_NOTIFY?

-- 
Regards
Mi Jinlong




* Re: [RFC] After server stops nfslock service, client can still get a lock
From: Chuck Lever @ 2009-11-17 15:34 UTC
  To: Mi Jinlong; +Cc: Trond.Myklebust, NFSv3 list, J. Bruce Fields


On Nov 17, 2009, at 4:47 AM, Mi Jinlong wrote:

> While testing NLM, I found a bug:
> after the server stops its nfslock service, a client can still get a
> lock successfully.
>
> Test process:
>
>  Step 1: the client opens an NFS file.
>  Step 2: the client takes a lock on it with fcntl().
>  Step 3: the client releases the lock with fcntl().
>  Step 4: the server stops its nfslock service.
>  Step 5: the client takes a lock with fcntl() again.
>
> At step 5 the lock request should fail, but it succeeds.
>
> Reason:
>  When the server stops its nfslock service, the client's host struct
>  is not unmonitored on the server. When the client locks again, that
>  host struct is reused but is not monitored again, so the lock
>  request at step 5 succeeds.

Effectively, the client is still monitored, since it is still in  
statd's monitored list.  Shutting down statd does not remove it from  
the monitor list.  If the local host reboots, sm-notify will still  
send the remote an SM_NOTIFY request, which is correct.

Additionally, new clients attempting to lock files when statd is down  
will fail, which is correct if statd is not available.

Conversely, if a monitored remote reboots, there is no way to notify  
the local lockd of the reboot, since statd normally relays the  
SM_NOTIFY to lockd, but isn't running.  That might be a problem.

However, shutting down statd during normal operation is not a normal  
or supported thing to do.

> Questions:
>  1. Should the server unmonitor the client's host struct when it
>     stops the nfslock service?
>
>  2. Should rpc.statd tell the kernel its status (when it starts and
>     stops) by sending an SM_NOTIFY?

There are a number of other coordination issues around statd start-up  
and shut down.  The server's grace period, for instance, is not  
synchronized with sending reboot notifications.  So, we do recognize  
this is a general problem.

In this case, however, I would expect indeterminate behavior if statd  
is shut down during normal operation, and that's exactly what we get.   
I'm not sure it's even reasonable to support this use case.  Why would  
someone shut down statd and expect reliable NFSv2/v3 locking  
behavior?  In other words, with due respect, what problem would we  
solve by fixing this, other than making your test case work?

Out of curiosity, what happens if you try this on a Solaris server?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: [RFC] After server stops nfslock service, client can still get a lock
From: Mi Jinlong @ 2009-11-18  9:50 UTC
  To: Chuck Lever; +Cc: Trond.Myklebust, NFSv3 list, J. Bruce Fields

Hi

Chuck Lever:
> 
> On Nov 17, 2009, at 4:47 AM, Mi Jinlong wrote:
> 
>> While testing NLM, I found a bug:
>> after the server stops its nfslock service, a client can still get a
>> lock successfully.
>>
>> Test process:
>>
>>  Step 1: the client opens an NFS file.
>>  Step 2: the client takes a lock on it with fcntl().
>>  Step 3: the client releases the lock with fcntl().
>>  Step 4: the server stops its nfslock service.
>>  Step 5: the client takes a lock with fcntl() again.
>>
>> At step 5 the lock request should fail, but it succeeds.
>>
>> Reason:
>>  When the server stops its nfslock service, the client's host struct
>>  is not unmonitored on the server. When the client locks again, that
>>  host struct is reused but is not monitored again, so the lock
>>  request at step 5 succeeds.
> 
> Effectively, the client is still monitored, since it is still in statd's
> monitored list.  Shutting down statd does not remove it from the monitor
> list.  If the local host reboots, sm-notify will still send the remote
> an SM_NOTIFY request, which is correct.
> 
> Additionally, new clients attempting to lock files when statd is down
> will fail, which is correct if statd is not available.
> 
> Conversely, if a monitored remote reboots, there is no way to notify the
> local lockd of the reboot, since statd normally relays the SM_NOTIFY to
> lockd, but isn't running.  That might be a problem.

  Yes, that seems to be a problem.

  I have not confirmed it, so I would like your opinion.

> 
> However, shutting down statd during normal operation is not a normal or
> supported thing to do.
> 
>> Questions:
>>  1. Should the server unmonitor the client's host struct when it
>>     stops the nfslock service?
>>
>>  2. Should rpc.statd tell the kernel its status (when it starts and
>>     stops) by sending an SM_NOTIFY?
> 
> There are a number of other coordination issues around statd start-up
> and shut down.  The server's grace period, for instance, is not
> synchronized with sending reboot notifications.  So, we do recognize
> this is a general problem.
> 
> In this case, however, I would expect indeterminate behavior if statd is
> shut down during normal operation, and that's exactly what we get.  I'm
> not sure it's even reasonable to support this use case.  Why would
> someone shut down statd and expect reliable NFSv2/v3 locking behavior? 
> In other words, with due respect, what problem would we solve by fixing
> this, other than making your test case work?

  When the server's nfslock service is stopped, the client sometimes
  gets the lock and sometimes does not, which is confusing.

> 
> Out of curiosity, what happens if you try this on a Solaris server?

  I am new to Solaris.
  When Solaris's nlockmgr is stopped, the client immediately fails to get a lock.

thanks,
Mi Jinlong



* Re: [RFC] After server stops nfslock service, client can still get a lock
From: Chuck Lever @ 2009-11-18 17:03 UTC
  To: Mi Jinlong; +Cc: Trond.Myklebust, NFSv3 list, J. Bruce Fields


On Nov 18, 2009, at 4:50 AM, Mi Jinlong wrote:

> Hi
>
> Chuck Lever:
>>
>> Effectively, the client is still monitored, since it is still in
>> statd's monitored list.  Shutting down statd does not remove it from
>> the monitor list.  If the local host reboots, sm-notify will still
>> send the remote an SM_NOTIFY request, which is correct.
>>
>> Additionally, new clients attempting to lock files when statd is down
>> will fail, which is correct if statd is not available.
>>
>> Conversely, if a monitored remote reboots, there is no way to notify
>> the local lockd of the reboot, since statd normally relays the
>> SM_NOTIFY to lockd, but isn't running.  That might be a problem.
>
>  Yes, that seems to be a problem.
>
>  I have not confirmed it, so I would like your opinion.

Currently, there isn't a high degree of coordination between lockd and  
statd.  This is to maintain good scalability when serving NFS lock  
requests.  You offered a couple of alternatives for improving this  
specific situation, but my opinion is that there are larger, more  
general coordination issues here, and that what you observed is  
expected behavior for the current design.

This still seems to me like a case of "Patient: Doctor, it hurts when  
I do that." "Doctor: Well, then, don't do that."  In other words, we  
assume that "service nfslock stop" won't be used under normal  
operating conditions, and we know that NLM will misbehave if you stop  
statd during normal operation.

>> However, shutting down statd during normal operation is not a normal
>> or supported thing to do.
>>
>> There are a number of other coordination issues around statd start-up
>> and shut down.  The server's grace period, for instance, is not
>> synchronized with sending reboot notifications.  So, we do recognize
>> this is a general problem.
>>
>> In this case, however, I would expect indeterminate behavior if statd
>> is shut down during normal operation, and that's exactly what we get.
>> I'm not sure it's even reasonable to support this use case.  Why would
>> someone shut down statd and expect reliable NFSv2/v3 locking behavior?
>> In other words, with due respect, what problem would we solve by
>> fixing this, other than making your test case work?
>
>  When the server's nfslock service is stopped, the client sometimes
>  gets the lock and sometimes does not, which is confusing.

On Linux, the user space "nfslock" service is actually nothing more  
than statd.  Linux's NLM service is handled in the kernel, and is  
started and stopped when either a) there are NFS mounts, or b) NFSD is  
started.  The kernel's NLM service has nothing to do with "service  
nfslock start" any more.  I think there used to be a user space NLM  
implementation.

>> Out of curiosity, what happens if you try this on a Solaris server?
>
>  I am new to Solaris.
>  When Solaris's nlockmgr is stopped, the client immediately fails to get a lock.

I should have been more clear: if you stop Solaris' user space NSM  
daemon, can you lock files consistently?  My bet is that Solaris will  
demonstrate a similar degree of inconsistent behavior if you try NFSv2/ 
v3 locking while starting and stopping its NSM service daemon.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: [RFC] After server stops nfslock service, client can still get a lock
From: Mi Jinlong @ 2009-11-19  9:48 UTC
  To: Chuck Lever; +Cc: Trond.Myklebust, NFSv3 list, J. Bruce Fields

Hi

Chuck Lever:
> I should have been more clear: if you stop Solaris' user space NSM
> daemon, can you lock files consistently?  My bet is that Solaris will
> demonstrate a similar degree of inconsistent behavior if you try
> NFSv2/v3 locking while starting and stopping its NSM service daemon.

  ^_^ 

  You are right: when I stop Solaris's NSM, the client can still get a lock.
  Maybe it is the same as on Linux.

-- 
Regards
Mi Jinlong



* Re: [RFC] After server stops nfslock service, client can still get a lock
From: Chuck Lever @ 2009-11-19 15:41 UTC
  To: Mi Jinlong; +Cc: Trond.Myklebust, NFSv3 list, J. Bruce Fields


On Nov 19, 2009, at 4:48 AM, Mi Jinlong wrote:

>  ^_^
>
>  You are right: when I stop Solaris's NSM, the client can still get a lock.
>  Maybe it is the same as on Linux.

Thanks for trying this.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




