[Lustre-discuss] Re-activating a partial lustre disk
    Ms. Megan Larko 
    dobsonunit at gmail.com
       
    Tue Sep  2 14:05:35 PDT 2008
    
    
  
Hello,
Getting back to some hardware which experienced a failure, I followed
instructions per Andreas Dilger and mounted the object storage targets
(crew4-OST0001, crew4-OST0003 and crew4-OST0004; the physical hardware
for targets crew4-OST0000 and crew4-OST0002 has failed).   Then I
went to the MGS and mounted the crew4-MDT0000.   Next I used lctl dl
(for device list, I'm assuming) and explicitly deactivated the device
ID numbers associated with crew4-OST0000 and crew4-OST0002.   There
were no errors on either the OSS computer hosting the OSTs or the MGS
hosting the MDT.
I attempted to mount the /crew4 lustre disk read-only on a client.
The mount attempt timed out.   On the MGS, the recovery status is
indicated as follows:
>>cat /proc/fs/lustre/mds/crew4-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940
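
For anyone who wants to check this condition from a script, here is a minimal sketch. It assumes the recovery_status file keeps the plain "key: value" layout shown above. The helper names parse_recovery_status and is_stalled are hypothetical (they are not part of Lustre), and the sample text is copied from the output above rather than read from /proc.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from a recovery_status dump into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def is_stalled(fields):
    """True when the countdown reads zero but the target still reports RECOVERING."""
    return (fields.get("status") == "RECOVERING"
            and fields.get("time remaining") == "0")


# Sample copied from the output quoted in this message.
SAMPLE = """\
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940
"""

if __name__ == "__main__":
    fields = parse_recovery_status(SAMPLE)
    connected, expected = fields["connected_clients"].split("/")
    print("status:", fields["status"])
    print("clients connected:", connected, "of", expected)
    print("stalled:", is_stalled(fields))
```

On a live server the same text could be read from the /proc path in the cat command above instead of SAMPLE.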
There are zero seconds left but the status is still "RECOVERING".  A
tail of MGS /var/log/messages indicates:
Sep  2 14:29:07 mds1 kernel: Lustre: setting import crew4-OST0000_UUID INACTIVE by administrator request
Sep  2 14:29:37 mds1 kernel: Lustre: setting import crew4-OST0002_UUID INACTIVE by administrator request
Sep  2 15:09:54 mds1 ntpd[2857]: no servers reachable
Sep  2 15:28:00 mds1 kernel: Lustre: 3373:0:(ldlm_lib.c:1114:target_start_recovery_timer()) crew4-MDT0000: starting recovery timer (2500s)
Sep  2 15:28:00 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:786:target_handle_connect()) crew4-MDT0000: denying connection for new client 172.18.0.11@o2ib (1076f71f-3b0c-025c-586f-3f2649955011): 2 clients in recovery for 2500s
Sep  2 15:28:00 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81006533dc00 x2169503/t0 o38-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -16/0
Sep  2 15:32:10 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:786:target_handle_connect()) crew4-MDT0000: denying connection for new client 172.18.0.11@o2ib (1076f71f-3b0c-025c-586f-3f2649955011): 2 clients in recovery for 2250s
Sep  2 15:32:10 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81002cc5b400 x2169552/t0 o38-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -16/0
Sep  2 15:44:04 mds1 ntpd[2857]: synchronized to 10.0.1.97, stratum 3
Sep  2 16:09:40 mds1 kernel: LustreError: 0:0:(ldlm_lib.c:1072:target_recovery_expired()) crew4-MDT0000: recovery timed out, aborting
So the recovery timed out after more than one hour.
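
As a cross-check on the timestamps in the log excerpt, the sketch below does the arithmetic. The timer was started at 15:28:00 with 2500 s, which puts its expiry at 16:09:40, the same moment the "recovery timed out, aborting" line was logged. The second denial at 15:32:10 reports 2250 s, which is 250 s after the timer started. The year 2008 is an assumption taken from the date of this message (the syslog lines carry no year), and the function names are hypothetical. The total wait counted from when the MDT was mounted may be longer than the timer window.

```python
from datetime import datetime, timedelta

# Values copied from the /var/log/messages excerpt in this message.
# Assumption: year 2008, taken from the message date (syslog omits the year).
TIMER_START = datetime(2008, 9, 2, 15, 28, 0)    # "starting recovery timer (2500s)"
TIMER_SECONDS = 2500
SECOND_DENIAL = datetime(2008, 9, 2, 15, 32, 10)  # "... in recovery for 2250s"
ABORT_LOGGED = datetime(2008, 9, 2, 16, 9, 40)    # "recovery timed out, aborting"


def expected_expiry(start, seconds):
    """Wall-clock time at which a timer of `seconds` started at `start` runs out."""
    return start + timedelta(seconds=seconds)


def seconds_left(start, seconds, now):
    """Seconds remaining on the timer at wall-clock time `now`."""
    return seconds - int((now - start).total_seconds())


if __name__ == "__main__":
    print("expected expiry:", expected_expiry(TIMER_START, TIMER_SECONDS))
    print("abort logged at:", ABORT_LOGGED)
    print("seconds left at second denial:",
          seconds_left(TIMER_START, TIMER_SECONDS, SECOND_DENIAL))
```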
Will crew4-MDT0000 on the MGS never recover because two of its OSTs
are missing, even though they have been deactivated?
If so, is there a way to mount the crew4 lustre disk with its
remaining parts for recovery?
Any and all suggestions are genuinely appreciated.
megan
    
    