[Lustre-discuss] Starting a new MGS/MDS
    Ms. Megan Larko 
    dobsonunit at gmail.com
       
    Thu Sep  4 13:54:33 PDT 2008
    
    
  
Hi,
I have a new MGS/MDS that I would like to start. It runs the same
CentOS 5 (kernel 2.6.18-53.1.13.el5) and lustre-1.6.4.3smp as my other
boxes. Initially it had an IP address that was already in use elsewhere
in our group. I changed it using the tunefs.lustre command below for
the new MDT.
[root@mds2 ~]# tunefs.lustre --erase-params --writeconf --mgsnode=ic-mds2@o2ib /dev/sdd1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
   Read previous values:
Target:     crew8-MDTffff
Index:      unassigned
Lustre FS:  crew8
Mount type: ldiskfs
Flags:      0x71
              (MDT needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.9@o2ib
   Permanent disk data:
Target:     crew8-MDTffff
Index:      unassigned
Lustre FS:  crew8
Mount type: ldiskfs
Flags:      0x171
              (MDT needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.16@o2ib
Writing CONFIGS/mountdata
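
The Flags values in this output are bitmasks, and the names in parentheses are the bits that are set. The following is a minimal sketch of that decoding. The bit values are an assumption based on the LDD_F_* definitions in lustre_disk.h for the 1.6 series and should be checked against your own source tree; the function name decode_ldd_flags is made up for this example. Under that mapping, neither 0x71 nor 0x171 includes the MGS bit, which agrees with the names tunefs.lustre printed (MDT only).

```shell
#!/usr/bin/env bash
# Decode the "Flags:" bitmask printed by tunefs.lustre into flag names.
# ASSUMPTION: bit values follow the LDD_F_* definitions in lustre_disk.h
# (Lustre 1.6 series). Verify against your own tree before relying on them.
decode_ldd_flags() {
    local flags=$(( $1 ))   # accepts hex input such as 0x71
    local names=()
    (( flags & 0x0001 )) && names+=("MDT")
    (( flags & 0x0002 )) && names+=("OST")
    (( flags & 0x0004 )) && names+=("MGS")
    (( flags & 0x0010 )) && names+=("needs_index")
    (( flags & 0x0020 )) && names+=("first_time")
    (( flags & 0x0040 )) && names+=("update")
    (( flags & 0x0080 )) && names+=("rewrite_ldd")
    (( flags & 0x0100 )) && names+=("writeconf")
    # Join the collected names with single spaces.
    echo "${names[*]}"
}

decode_ldd_flags 0x71    # prints: MDT needs_index first_time update
decode_ldd_flags 0x171   # prints: MDT needs_index first_time update writeconf
```

The only difference between the two values is bit 0x100, which is the writeconf flag requested on the command line.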
Next I tried to mount this new MDT:
[root@mds2 ~]# mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT0000
mount.lustre: mount /dev/sdd1 at /srv/lustre/mds/crew8-MDT0000 failed: Input/output error
Is the MGS running?
Ummm, yes, I thought the MGS was running.
[root@mds2 ~]# tail /var/log/messages
Sep  4 16:28:08 mds2 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Sep  4 16:28:13 mds2 kernel: LustreError: 3526:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1220560088, 5s ago)  req@ffff81042f109000 x3/t0 o250->MGS@MGC172.18.0.16@o2ib_0:26 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/-22
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:954:server_register_target()) registration with the MGS failed (-5)
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:1054:server_start_targets()) Required registration failed for crew8-MDTffff: -5
Sep  4 16:28:13 mds2 kernel: LustreError: 15f-b: Communication error with the MGS.  Is the MGS running?
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:1570:server_fill_super()) Unable to start targets: -5
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:1368:server_put_super()) no obd crew8-MDTffff
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:119:server_deregister_mount()) crew8-MDTffff not registered
Sep  4 16:28:13 mds2 kernel: Lustre: server umount crew8-MDTffff complete
Sep  4 16:28:13 mds2 kernel: LustreError: 3797:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount  (-5)
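
The negative numbers in these messages follow the kernel convention of negated errno values, so -5 is the same "Input/output error" that mount.lustre reported. As a rough aid for reading such logs, here is a small lookup sketch. The function name errno_name is hypothetical and the table covers only a few standard Linux codes.

```shell
#!/usr/bin/env bash
# Map a (possibly negated) errno value from a log line to its symbolic name.
# Only a handful of standard Linux codes are listed; extend as needed.
errno_name() {
    case "${1#-}" in          # strip a leading minus sign, if any
        2)   echo "ENOENT (No such file or directory)" ;;
        5)   echo "EIO (Input/output error)" ;;
        19)  echo "ENODEV (No such device)" ;;
        22)  echo "EINVAL (Invalid argument)" ;;
        110) echo "ETIMEDOUT (Connection timed out)" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

errno_name -5     # prints: EIO (Input/output error)
errno_name -22    # prints: EINVAL (Invalid argument)
```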
The o2ib network is up. It is pingable both from the shell and from
lctl. I can reach it from the node itself and from other computers on
this local subnet.
[root@mds2 ~]# lctl
lctl > ping 172.18.0.16@o2ib
12345-0@lo
12345-172.18.0.16@o2ib
lctl > ping 172.18.0.15@o2ib
12345-0@lo
12345-172.18.0.15@o2ib
lctl > quit
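
For scripting this check across several nodes, the ping reply can be tested for the expected NID. This is only a sketch: it assumes the reply format shown above (one "12345-<nid>" line per NID), and the helper name nid_in_ping_reply is invented for the example. A matching reply shows that LNET on the peer answered; it does not by itself show that an MGS service is running on that NID.

```shell
#!/usr/bin/env bash
# Succeed if the given NID appears in `lctl ping` output read from stdin.
# ASSUMPTION: each NID in the reply is printed on its own line as "12345-<nid>".
nid_in_ping_reply() {
    local want="12345-$1"
    # -F: fixed string, -x: match the whole line, -q: no output, status only
    grep -Fxq -- "$want"
}

# Example using the reply captured above instead of a live lctl call.
printf '%s\n' '12345-0@lo' '12345-172.18.0.16@o2ib' \
    | nid_in_ping_reply 172.18.0.16@o2ib && echo "LNET answered on 172.18.0.16@o2ib"
```

On a live node the same helper could be fed from `lctl ping 172.18.0.16@o2ib`.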
On this net there are no firewalls, since the computers use only
non-routable IP addresses. So there is no firewall issue that I am
aware of.
[root@mds2 ~]# iptables -L
-bash: iptables: command not found
The only oddity I have found is that the modules on my working MGS/MDS
show higher use counts than the modules on my new MGS/MDT.
Correctly functioning MGS/MDT:
[root@mds1 ~]# lsmod | grep mgs
mgs                   181512  1
mgc                    86744  2 mgs
ptlrpc                659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
obdclass              542200  13 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
lvfs                   84712  12 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds1 ~]# lsmod | grep osc
osc                   172136  11
ptlrpc                659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
obdclass              542200  13 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
lvfs                   84712  12 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds1 ~]# lsmod | grep lnet
lnet                  255656  4 lustre,ko2iblnd,ptlrpc,obdclass
libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
Failing MGS/MDT:
[root@mds2 ~]# lsmod | grep mgs
mgs                   181512  0
mgc                    86744  1 mgs
ptlrpc                659512  8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
obdclass              542200  10 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
lvfs                   84712  12 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds2 ~]# lsmod | grep osc
osc                   172136  0
ptlrpc                659512  8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
obdclass              542200  10 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
lvfs                   84712  12 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds2 ~]# lsmod | grep lnet
lnet                  255656  4 lustre,ko2iblnd,ptlrpc,obdclass
libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
The failing MGS/MDT shows a use count of 0 for mgs, not 1 as on the
working MGS/MDT. The osc module shows 11 on the working node and 0 on
the non-working one. The lnet counts are the same, as are most of the
other module comparisons. Am I missing something at the mgs/mgc/osc
module level? Or do those counts just indicate that the modules are
actually in use on my good MGS/MDT?
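
Comparing the listings by eye is error-prone, so here is a sketch that reports every module whose use count (the third lsmod column) differs between two saved captures. The function name diff_use_counts and the file names are hypothetical. It prints the module name, then the count from the first file, then the count from the second.

```shell
#!/usr/bin/env bash
# Print modules whose use count differs between two saved lsmod captures.
# Usage: diff_use_counts good_lsmod.txt bad_lsmod.txt   (file names are hypothetical)
diff_use_counts() {
    awk '
        # Skip the lsmod header and any line that has no count column.
        NF < 3 || $1 == "Module" { next }
        # First file: remember the use count for each module.
        NR == FNR { good[$1] = $3; next }
        # Second file: report each differing module once.
        ($1 in good) && good[$1] != $3 && !seen[$1]++ {
            print $1, good[$1], $3
        }
    ' "$1" "$2"
}
```

Fed the two listings above (saved with something like `lsmod > good_lsmod.txt` on each node), this would flag mgs, mgc, obdclass, and osc, while ptlrpc, lvfs, libcfs, and lnet match.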
Even leaving IB cabling aside (I am working on the MGS/MDS itself), why
can I not mount a new MDT? Why do I see the message "Is the MGS
running?" when I am on the MGS/MDS itself?
I also get the same result if I attempt to mount an OST on an OSS that
points to this new MGS/MDT. The OST will not even mount locally on the
OSS without successful communication with its associated MGS/MDT.
Any and all suggestions are gratefully appreciated.
megan