[Lustre-discuss] yet another lustre error

Brock Palen brockp at umich.edu
Mon Mar 10 06:46:24 PDT 2008


On Mar 9, 2008, at 10:01 PM, Aaron Knister wrote:

> Hi! I have a few questions for you-
>
> 1. How many nodes was his job running on?

Around 64 serial jobs, all accessing the same directory (but not the same files).

> 2. What version of lustre and linux kernel are you running on your  
> servers/clients?

Lustre servers:
2.6.9-55.0.9.EL_lustre.1.6.4.1smp

Clients:
2.6.9-67.0.1.ELsmp


> 3. What ethernet module are you using on the servers/clients?

Most use tg3; some use e1000.

>
> I honestly am not sure what the RPC errors mean but I've had  
> similar issues caused by ethernet-level errors.

Over the weekend the MDS/MGS went into an unhealthy state, which forced a
reboot and fsck. When it came back up, the directory was accessible
again and the jobs started working again.

>
> -Aaron
>
> On Mar 7, 2008, at 6:45 PM, Brock Palen wrote:
>
>> On a file system that's been up for only 57 days, I have:
>>
>> 505 lustre-log.   dumps.
>>
>> The problem at hand is that a user has many jobs that are now
>> hung trying to create a directory from his PBS script.  On the
>> clients I see:
>>
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.184@tcp. The mds_connect operation failed with -16
>> LustreError: Skipped 2 previous similar messages
>>
>> On every client his jobs are on.
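
A side note on the -16 above: Lustre reports kernel-style negative errno
values, and errno 16 on Linux is EBUSY, which lines up with the "still busy"
wording in the MDS log further down. A minimal sketch for decoding such
return codes with Python's standard errno module (the helper name decode_rc
is hypothetical, made up for this illustration):

```python
import errno
import os

def decode_rc(rc):
    """Map a kernel-style negative return code to (symbolic name, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric errno values to names such as "EBUSY".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

name, message = decode_rc(-16)
print(name, "-", message)
```

The exact message text comes from the local C library, so it may differ
between systems; the symbolic name is the part worth searching for.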
>>
>> In the most recent /tmp/lustre-log.  on the MDS/MGS I see this  
>> message:
>>
>> @@@ processing error (-16)  req@000001001af9a600 x12808293/t0
>> o38->32633f05-02c6-50a5-b496-047150f1fe81@NET_0x200000aa4003e_UUID:-1
>> lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>> ldlm_lib.c
>> target_handle_reconnect
>> nobackup-MDT0000: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting
>> ldlm_lib.c
>> target_handle_connect
>> nobackup-MDT0000: refuse reconnection from
>> 34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp to
>> 0x00000100069a7000; still busy with 2 active RPCs
>> ldlm_lib.c
>> target_send_reply_msg
>> @@@ processing error (-16)  req@0000010019159a00 x11199816/t0
>> o38->34b4fbea-200b-1f7c-dac0-516b8ce786fc@NET_0x200000aa4006f_UUID:-1
>> lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
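
With 505 dumps to go through, it may help to tally which clients the MDS is
refusing. The sketch below assumes the dump has already been converted to
plain text and that the message has the shape quoted above; the function
name busy_clients and the regular expression are my own, not anything
shipped with Lustre.

```python
import re
from collections import Counter

# Matches the "refuse reconnection" message quoted above. The list archive
# shows "@" rewritten as " at ", so both spellings are accepted (an
# assumption about the archive's mangling, not about Lustre itself).
REFUSE_RE = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)(?:@| at )(?P<nid>\S+) "
    r"to 0x[0-9a-f]+; still busy with (?P<active>\d+) active RPCs"
)

def busy_clients(log_text):
    """Count refused reconnections per client NID in a text log dump."""
    # Collapse all whitespace so a message wrapped at a space still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in REFUSE_RE.finditer(flat):
        counts[match.group("nid")] += 1
    return counts

sample = (
    "nobackup-MDT0000: refuse reconnection from\n"
    "34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp to\n"
    "0x00000100069a7000; still busy with 2 active RPCs"
)
print(busy_clients(sample))
```

A client NID that shows up over and over is a reasonable first place to
look for a stuck request.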
>>
>>
>> I see messages about active RPCs in other logs.  What would
>> this mean?  Is something stuck someplace?
>>
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org



