[Lustre-discuss] osc_brw_redo_request error on clients

Bob Ball ball at umich.edu
Wed Feb 9 16:33:03 PST 2011


Maybe the clients should mount the file system with the "localflock"
option?  Please check the manual for details, but I think this is the
same problem we had a while back, where a dynamic link was failing.

bob

On 2/9/2011 7:24 PM, James Robnett wrote:
>> Normally I've had no problems but recently I have multiple clients
>> reporting the following error:
>>
>> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
>> for recoverable error  req@ffff8101ae084000 x1358858531428366/t60136289752
>> o4->lustre-OST0004_UUID@192.168.1.12@o2ib:6/4 lens 448/608 e 0 to 1 dl
>> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>>
>> which in turn appears to cause a premature EOF in our user software.
>>
>> There are no corresponding errors on the servers.
>     The above is not true.  There are apparently corresponding errors of
> the form:
>
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 964d53e2
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent
> [2384461824-2385510399]
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: Skipped 43 previous
> similar messages
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 964d53e2, server csum now 964d53e2
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 180cd9bd
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) Skipped 63 previous similar
> messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent
> [4355784704-4356833279]
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: Skipped 63 previous
> similar messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 180cd9bd, server csum now 180cd9bd
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) Skipped 63 previous similar
> messages
>
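
Since each 168-f message stands for many skipped ones, it may help to
tally which client and which extents the OSS is complaining about.  The
following is only a sketch: it assumes each syslog message has been
rejoined onto a single line (the excerpts above are wrapped by the mail
client), treats the "12345-" prefix as a literal, and the function name
tally_bad_checksums is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the "168-f ... BAD WRITE CHECKSUM" message on one unwrapped line.
# Accepts "@" as well as the " at " spelling that list archives substitute.
BAD_CSUM = re.compile(
    r"LustreError: 168-f: (?P<ost>\S+): BAD WRITE CHECKSUM: .*? from "
    r"12345-(?P<nid>\S+?)(?:@| at )(?P<net>\S+) .*?"
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]"
)


def tally_bad_checksums(lines):
    """Group the reported byte extents by (OST name, client NID)."""
    hits = defaultdict(list)
    for line in lines:
        m = BAD_CSUM.search(line)
        if m:
            key = (m["ost"], f'{m["nid"]}@{m["net"]}')
            hits[key].append((int(m["start"]), int(m["end"])))
    return dict(hits)


# The two 168-f messages quoted in this thread, rejoined onto single lines.
SAMPLE = [
    "Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000: "
    "BAD WRITE CHECKSUM: changed in transit before arrival at OST from "
    "12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent "
    "[2384461824-2385510399]",
    "Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000: "
    "BAD WRITE CHECKSUM: changed in transit before arrival at OST from "
    "12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent "
    "[4355784704-4356833279]",
]

if __name__ == "__main__":
    for (ost, client), extents in tally_bad_checksums(SAMPLE).items():
        for start, end in extents:
            print(f"{ost} {client} bytes {start}-{end} ({end - start + 1} bytes)")
```

On the two messages above, both extents come from the same client and
the same object, and each spans 1048576 bytes.
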
>     The other OSS shows similar errors.  We are doing mmap I/O, and a
> search suggests these errors are related to mmap I/O.
>
>     I'm open to suggestions.  In the meantime, the userspace code can be
> switched from mmap to regular file I/O via an rc file, so we'll try that
> and see if it at least makes the errors go away.
>
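
For anyone who wants to reproduce the comparison outside the real
application, the switch described above might look roughly like the
sketch below.  It is not James's code: the use_mmap flag merely stands in
for whatever the rc file setting is called, write_block and demo are
hypothetical names, and os.pwrite assumes a Unix-like client.  It only
shows the two write paths side by side and makes no claim about which
one avoids the checksum errors.

```python
import mmap
import os
import tempfile


def write_block(path, offset, data, use_mmap):
    """Write data at offset through a shared mapping or through pwrite()."""
    if use_mmap:
        with open(path, "r+b") as f:
            # Length 0 maps the whole file, which must already be large
            # enough to hold offset + len(data).
            with mmap.mmap(f.fileno(), 0) as m:
                m[offset:offset + len(data)] = data
                m.flush()  # push the dirty pages back to the file
    else:
        fd = os.open(path, os.O_WRONLY)
        try:
            written = 0
            # pwrite() may write fewer bytes than requested, so loop.
            while written < len(data):
                written += os.pwrite(fd, data[written:], offset + written)
        finally:
            os.close(fd)


def demo():
    """Run both write paths on scratch files and return their contents."""
    results = []
    for use_mmap in (True, False):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(b"\0" * 4096)  # preallocate so the mapping is valid
        try:
            write_block(tmp.name, 1024, b"payload", use_mmap)
            with open(tmp.name, "rb") as f:
                results.append(f.read())
        finally:
            os.unlink(tmp.name)
    return results


if __name__ == "__main__":
    via_mmap, via_pwrite = demo()
    print("identical contents:", via_mmap == via_pwrite)
```

Pointing the scratch file at a directory on the Lustre mount and
watching the OSS logs while each path runs would be one way to check
whether the errors follow the mmap path.
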
> James


