[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Charles Taylor
taylor at hpc.ufl.edu
Wed Mar 5 10:30:44 PST 2008
SDR on the IB side. Our storage is RAID Inc. Falcon 3s, host
attached via 4Gb QLogic FC HBAs.
http://www.raidinc.com/falcon_III.php
Regards,
Charlie
On Mar 5, 2008, at 1:09 PM, Aaron Knister wrote:
> Are you running DDR or SDR IB? Also what hardware are you using for
> your storage?
>
> On Mar 5, 2008, at 11:34 AM, Charles Taylor wrote:
>
>> Well, go figure. We are running...
>>
>> Lustre: 1.6.4.2 on clients and servers
>> Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
>> Platform: X86_64 (opteron 275s, mostly)
>> Interconnect: IB, Ethernet
>> IB Stack: OFED 1.2
>>
>> We already posted our procedure for patching the kernel, building
>> OFED, and building lustre so I don't think I'll go into that
>> again. Like I said, we just brought a new file system online.
>> Everything looked fine at first with just a few clients mounted.
>> Once we mounted all 408 (or so), we started getting all kinds of
>> "transport endpoint failures" and the MGSs and OSTs were evicting
>> clients left and right. We looked for network problems and could
>> not find any of any substance. Once we increased the obd/lustre/
>> system timeout setting as previously discussed (see the sketch at
>> the end of this post), the errors vanished. This was consistent
>> with our experience with 1.6.3 as
>> well. That file system has been online since early December.
>> Both file systems appear to be working well.
>>
>> I'm not sure what to make of it. Perhaps we are just masking
>> another problem. Perhaps there are some other, related values
>> that need to be tuned. We've done the best we could but I'm sure
>> there is still much about Lustre we don't know. We'll try to get
>> someone out to the next class but until then, we're on our own, so to
>> speak.
>>
>> Charlie Taylor
>> UF HPC Center
>>
>>>>
>>>> Just so you guys know, 1000 seconds for the obd_timeout is
>>>> very, very large! As you could probably guess, we have some
>>>> very, very big Lustre installations and to the best of my
>>>> knowledge none of them are using anywhere near that. AFAIK
>>>> (and perhaps a Sun engineer with closer experience to some of
>>>> these very large clusters might correct me) the largest value
>>>> that the largest clusters are using is in the neighbourhood
>>>> of 300s. There has to be some other problem at play here if
>>>> you need 1000s.
>>>
>>> I can confirm that at a recent large installation with several
>>> thousand clients, the default of 100 is in effect.
>>>
>>>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
>
>
>
>
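The timeout discussed above is the single obd/lustre timeout tunable. The
sketch below shows one way to inspect and raise it; it assumes the Lustre
1.6-era /proc path (/proc/sys/lustre/timeout) and the
"lctl conf_param <fsname>.sys.timeout=<seconds>" syntax of that release,
and the filesystem name "lustre1" is a hypothetical placeholder, so check
both against your own installation before using it.

#!/usr/bin/env python
# Sketch only: inspect and raise the Lustre obd timeout. The /proc path
# and lctl syntax are assumptions based on the 1.6-era releases in this
# thread; verify them against your installed version.

import subprocess

TIMEOUT_PROC = "/proc/sys/lustre/timeout"  # per-node, non-persistent knob


def read_timeout():
    """Return the current obd timeout (seconds) on this node."""
    with open(TIMEOUT_PROC) as f:
        return int(f.read().strip())


def set_timeout(seconds):
    """Raise the obd timeout on this node only; lost at reboot/remount."""
    with open(TIMEOUT_PROC, "w") as f:
        f.write(str(seconds))


def set_timeout_persistent(fsname, seconds):
    """Persistent, filesystem-wide change; meant to be run on the MGS node.

    'fsname' (e.g. the hypothetical "lustre1") is your filesystem name.
    """
    subprocess.check_call(
        ["lctl", "conf_param", "%s.sys.timeout=%d" % (fsname, seconds)])


if __name__ == "__main__":
    print("current obd timeout: %ds" % read_timeout())

The /proc write matches a quick per-client experiment like the one
described in the thread; the conf_param route applies to every node of the
filesystem and is how one would settle on a value such as the 100s default
or the roughly 300s reported for the large sites mentioned above.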