[Lustre-discuss] Lustre clients failing, and can't reconnect
    Brock Palen 
    brockp at umich.edu
       
    Thu Sep  4 19:58:34 PDT 2008
    
    
  
I am having clients lose their connection to the MDS.  Messages on  
the clients look like this:
Sep  4 19:51:30 nyx-login2 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc44e800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
Sep  4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
The clients keep trying to reconnect, spitting out "mds_connect operation failed with -16" each time.  They never recover.
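For anyone reading along, the -16 looks like a negated Linux errno value. A quick way to look up its symbolic name, sketched in Python (the helper name errno_name is made up for illustration and is not part of Lustre):

```python
import errno
import os

def errno_name(code):
    """Return the symbolic errno name for a kernel-style negative return code."""
    # Kernel messages report errors as negative numbers, so take the absolute value.
    return errno.errorcode.get(abs(code), "UNKNOWN")

# Print the symbolic name and the platform's description for error 16.
print(errno_name(-16), "-", os.strerror(16))
```

On Linux errno 16 is EBUSY, which would be consistent with the server side reporting that it is still busy with this client.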
On the MDS, all I see is:
Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp to 0x000001037c109000; still busy with 3 active RPCs
I get this same RPC message for many different hosts.
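To get a feel for how many hosts are affected, the MDS log could be tallied by client NID. Below is a rough sketch in Python, assuming the message format shown above with the whole message on one line and the client written as uuid@nid; the regular expression and the tally_refusals helper are my own illustration, not a Lustre tool:

```python
import re
from collections import Counter

# Matches the "refuse reconnection" message once it is on a single line.
# The pattern is an assumption based only on the one message quoted above.
REFUSE_RE = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)@(?P<nid>\S+) to \S+; "
    r"still\s+busy with (?P<rpcs>\d+) active RPCs"
)

def tally_refusals(lines):
    """Count refused reconnections per client NID."""
    counts = Counter()
    for line in lines:
        m = REFUSE_RE.search(line)
        if m:
            counts[m.group("nid")] += 1
    return counts

# Sample input: the MDS message from this post, joined onto one line.
sample = [
    "Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000: "
    "refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp "
    "to 0x000001037c109000; still busy with 3 active RPCs",
]
print(tally_refusals(sample))
```

Feeding it the lines of the MDS syslog instead of the sample list would show which client NIDs are being refused most often.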
Clients and servers are all using TCP.
Is this enough information?
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
    
    