[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Charles Taylor
taylor at hpc.ufl.edu
Wed Mar 5 10:30:44 PST 2008
SDR on the IB side. Our storage is RAID Inc. Falcon 3s, host
attached via 4Gb QLogic FC HBAs.
http://www.raidinc.com/falcon_III.php
Regards,
Charlie
On Mar 5, 2008, at 1:09 PM, Aaron Knister wrote:
> Are you running DDR or SDR IB? Also what hardware are you using for
> your storage?
>
> On Mar 5, 2008, at 11:34 AM, Charles Taylor wrote:
>
>> Well, go figure. We are running...
>>
>> Lustre: 1.6.4.2 on clients and servers
>> Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
>> Platform: X86_64 (opteron 275s, mostly)
>> Interconnect: IB, Ethernet
>> IB Stack: OFED 1.2
>>
>> We already posted our procedure for patching the kernel, building
>> OFED, and building lustre so I don't think I'll go into that
>> again. Like I said, we just brought a new file system online.
>> Everything looked fine at first with just a few clients mounted.
>> Once we mounted all 408 (or so), we started getting all kinds of
>> "transport endpoint failures" and the MGSs and OSTs were evicting
>> clients left and right. We looked for network problems and could
>> not find any of any substance. Once we increased the obd/lustre/
>> system timeout setting as previously discussed (see the sketch at
>> the end of this post), the errors vanished. This was consistent
>> with our experience with 1.6.3 as
>> well. That file system has been online since early December.
>> Both file systems appear to be working well.
>>
>> I'm not sure what to make of it. Perhaps we are just masking
>> another problem. Perhaps there are some other, related values
>> that need to be tuned. We've done the best we could but I'm sure
>> there is still much about Lustre we don't know. We'll try to get
>> someone out to the next class but until then, we're on our own, so to
>> speak.
>>
>> Charlie Taylor
>> UF HPC Center
>>
>>>>
>>>> Just so you guys know, 1000 seconds for the obd_timeout is
>>>> very, very large! As you could probably guess, we have some
>>>> very, very big Lustre installations and to the best of my
>>>> knowledge none of them are using anywhere near that. AFAIK
>>>> (and perhaps a Sun engineer with closer experience to some of
>>>> these very large clusters might correct me) the largest value
>>>> that the largest clusters are using is in the neighbourhood
>>>> of 300s. There has to be some other problem at play here if
>>>> you need 1000s.
>>>
>>> I can confirm that at a recent large installation with several
>>> thousand clients, the default of 100 is in effect.
>>>
>>>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
>
>
>
>
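The timeout discussed above is the single obd/lustre timeout tunable. The
sketch below shows one way to inspect and raise it; it assumes the Lustre
1.6-era /proc path (/proc/sys/lustre/timeout) and the
"lctl conf_param <fsname>.sys.timeout=<seconds>" syntax of that release,
and the filesystem name "lustre1" is a hypothetical placeholder, so check
both against your own installation before using it.

#!/usr/bin/env python
# Sketch only: inspect and raise the Lustre obd timeout. The /proc path
# and lctl syntax are assumptions based on the 1.6-era releases in this
# thread; verify them against your installed version.

import subprocess

TIMEOUT_PROC = "/proc/sys/lustre/timeout"  # per-node, non-persistent knob


def read_timeout():
    """Return the current obd timeout (seconds) on this node."""
    with open(TIMEOUT_PROC) as f:
        return int(f.read().strip())


def set_timeout(seconds):
    """Raise the obd timeout on this node only; lost at reboot/remount."""
    with open(TIMEOUT_PROC, "w") as f:
        f.write(str(seconds))


def set_timeout_persistent(fsname, seconds):
    """Persistent, filesystem-wide change; meant to be run on the MGS node.

    'fsname' (e.g. the hypothetical "lustre1") is your filesystem name.
    """
    subprocess.check_call(
        ["lctl", "conf_param", "%s.sys.timeout=%d" % (fsname, seconds)])


if __name__ == "__main__":
    print("current obd timeout: %ds" % read_timeout())

The /proc write matches a quick per-client experiment like the one
described in the thread; the conf_param route applies to every node of the
filesystem and is how one would settle on a value such as the 100s default
or the roughly 300s reported for the large sites mentioned above.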