[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Sébastien Buisson
sebastien.buisson at bull.net
Tue May 19 08:55:21 PDT 2009
Hi,
We took a slightly different approach to deal with IB QoS in Lustre.
We decided to assign a specific service-id to Lustre: in ofa-kernel we
added a new value in the rdma_port_space enum, that we called
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
because we wanted the service-id to be a ko2iblnd module parameter, so
we added some stuff in o2iblnd_modparams.c for instance).
The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":
qos-ulps
default : 0
any, service-id=0x.....: 3
end-qos-ulps
The major drawback of this solution is that the modification we made in
the ofa-kernel is not OpenFabrics Alliance compliant, because the
portspace list is defined in the IB standard.
Cheers,
Sebastien.
Jim Garlick wrote:
> On Mon, May 18, 2009 at 12:04:37PM +0200, Daniel Kobras wrote:
>> Hi!
>>
>> Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
>> LND allowed to #define a service level, but I couldn't find a similar
>> facility in o2ib. Is there a different way to apply QoS rules?
>>
>> Thanks,
>>
>> Daniel.
>
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core opterons,
> and ran MPI collective stress tests on all cores.
>
> Here are some notes they put together - see the "QoS Policy file" section.
>
> Jim
> ____________________________________
> QoS configuration on Infiniband
>
> May 18, 2009
>
> Albert Chu
> chu11 at llnl.gov
>
> Overview
> --------
> Quality of Service (QoS) is provided in Infiniband as a means of giving
> certain applications on the fabric some guarantees/minimum requirements.
>
> Definitions
> -----------
>
> Virtual Lanes (VLs): Infiniband supports up to 15 (numbered 0-14)
> Virtual Lanes (VLs) for traffic. The virtual lanes support
> independent virtual transmit/receive buffers for each port on the
> fabric.
>
> Service Level (SL): A number (0-15) that can be assigned to any
> Infiniband packet. The meaning/purpose of an SL is not predefined.
> It's up to the user to determine.
>
> Basic QoS Implementation in Infiniband
> --------------------------------------
>
> There are three basic parts to QoS in Infiniband.
>
> 1) Assign/configure protocols/tool/applications to use appropriate
> SLs.
>
> Normally, you assign different SLs to different protocols,
> applications, etc. (e.g. MPI, Lustre). This allows each
> protocol/application to be given unique QoS requirements.
>
> 2) Configure SL2VL mapping
>
> Map SLs to VLs. For example, SL0->VL0, SL1->VL1, etc.
>
> 3) Configure VL Arbitration
>
> Determines VL transmission rules based on a set of prioritization
> rules.
>
> It is the responsibility of administrators/users to use and configure
> the SLs/VLs properly. By themselves, VLs and SLs do nothing/mean
> nothing in the Infiniband card.
>
> SL2VL Mapping Configuration
> ---------------------------
>
> This is pretty basic. You assign a SL to a VL. It's a direct one to
> one mapping. i.e. SL1->VL1, SL2->VL2
>
> Normally, you map SLX -> VLX. If you do otherwise, you're starting to
> do something pretty crazy.
>
> VL Arbitration Configuration
> ----------------------------
>
> This is not so basic. There are three components to VL Arbitration
> configuration, the High-Priority Table, the Low-Priority Table, and
> the Limit of High Priority.
>
> High/Low VL Arbitration Tables
> ------------------------------
>
> High & Low Priority VL Arbitration Tables are lists of pairs, each a VL
> number (0-14) and a weighting value (0-255). The weighting value
> indicates the number of 64 byte units that can be transmitted from
> that VL when it is that VL's turn to transmit. A weight of 0 means no
> data can be transferred. Counters are rounded up as needed for
> packets (e.g. a weight of 1 means a packet > 64 bytes can still be
> sent). The High Priority VL Arbitration Table holds the weights for
> "high priority" data while the Low Priority VL Arbitration Table holds
> the weights for "low priority" data (the usefulness will make more sense
> after you read "Limit of High Priority" below).
>
> Note that 64*255 =~ 16K, which is a small number for many institutions.
> I think it is easiest to think of the weights as ratios for percentage
> bandwidth if the network is completely flooded with data from all
> protocols/applications.
>
> For example:
>
> A) VL0 Weight = 255, VL1 Weight = 255
>
> 50% bandwidth for VL0 and VL1 each.
>
> B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255
>
> 33% bandwidth for VL0, VL1, and VL2 each.
>
> C) VL0 Weight = 200, VL1 Weight = 100
>
> 66% bandwidth for VL0, 33% bandwidth for VL1.
>
> D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100
>
> 50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.
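The ratios in examples A-D can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes a saturated link where every listed VL always has data queued, and the function name `bandwidth_shares` is made up for illustration.

```python
def bandwidth_shares(weights):
    """Return each VL's fraction of the bandwidth, given {vl: weight}.

    Assumes every listed VL always has data to send (saturated link),
    so the share is simply weight / sum of all weights.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one VL needs a non-zero weight")
    return {vl: weight / total for vl, weight in weights.items()}


if __name__ == "__main__":
    examples = {
        "A": {0: 255, 1: 255},
        "B": {0: 255, 1: 255, 2: 255},
        "C": {0: 200, 1: 100},
        "D": {0: 200, 1: 100, 2: 100},
    }
    for label, table in examples.items():
        shares = bandwidth_shares(table)
        print(label, {f"VL{vl}": f"{share:.0%}" for vl, share in shares.items()})
```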
>
> Limit of High Priority
> ----------------------
>
> Indicates how much high-priority data (from the High VL Arbitration
> Table) can be sent without an opportunity to send a low priority
> packet (from the Low VL Arbitration Table). Increments are in units of
> 4K bytes, with two special values: 0 = one packet, 255 = unlimited
> data.
>
> 4K*254 =~ 1M, which again is a small number for many institutions. The
> most likely numbers to consider using are:
>
> 0 - one packet
> 254 - max high limit data w/o being unlimited
> 255 - unlimited data
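To make the units concrete, here is a small Python sketch that turns a Limit of High Priority value into words. It follows the description above (4K-byte increments, 0 and 255 as special values). The helper name `describe_high_limit` is hypothetical.

```python
HIGH_LIMIT_UNIT = 4096  # bytes per increment, per the notes above


def describe_high_limit(value):
    """Translate a Limit of High Priority value (0-255) into plain words."""
    if not 0 <= value <= 255:
        raise ValueError("high limit must be in the range 0..255")
    if value == 0:
        return "one packet"      # special value
    if value == 255:
        return "unlimited"       # special value
    return f"{value * HIGH_LIMIT_UNIT} bytes"


if __name__ == "__main__":
    for v in (0, 1, 254, 255):
        print(v, "->", describe_high_limit(v))
```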
>
> VL Arbitration Examples
> -----------------------
>
> When you combine the High/Low VL Arbitration tables with the Limit of
> High Priority, you can create some interesting QoS behavior.
>
> Example 1:
>
> (Following example is borrowed from the "Quality and Service in OFED
> 3.1" presentation listed below.)
>
> High-Limit: 0
> VL-Arb-High: VL2 Weight = 1
> VL-Arb-Low: VL0 Weight = 200, VL1 Weight = 50
>
> Effectively, anytime any data on VL2 is available, send at most one
> packet from VL2 before sending data from VL0 or VL1. If no VL2 data
> is available, VL0 gets 80% bandwidth, VL1 gets 20% of bandwidth.
>
> Idea:
>
> (Assume Lustre Meta Data Servers and Lustre OSTs are on the same
> fabric)
>
> MPI -> SL0 -> VL0
> Lustre OST Data -> SL1 -> VL1
> Lustre Meta Data -> SL2 -> VL2
>
> In this example, Lustre meta data traffic is assumed to be low, but
> with the high priority it is serviced faster, which theoretically allows
> for better Lustre interaction. When there is no Lustre meta data traffic
> on the fabric, MPI is given the majority share of bandwidth because it is
> more timing sensitive.
>
> Example 2:
>
> High-Limit: 254
> VL-Arb-High: VL0 Weight = 255
> VL-Arb-Low: VL1 Weight = 1
>
> Effectively, whenever there is data on VL0, always send it before VL1.
> But do not allow VL0 to starve VL1. Let VL1 send *something* once in
> a while.
>
> Idea:
>
> MPI -> SL0 -> VL0
> Lustre -> SL1 -> VL1
>
> So MPI always gets priority over Lustre, but cannot starve it out.
> The High-Limit of 254 means a low priority packet must be sent once in
> a while. This could be important if Lustre "pings" are done to keep
> some services alive.
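A rough way to see how little Example 2 leaves for the low-priority VL is to estimate its worst-case share. The Python sketch below assumes the high-priority VL is always busy and that the low-priority side gets exactly one packet per high-limit window. The 2048-byte packet size is an assumed value for illustration only, and `low_priority_floor` is a made-up name.

```python
def low_priority_floor(high_limit, packet_bytes):
    """Estimate the minimum bandwidth fraction left for low-priority data.

    Assumption: after high_limit * 4096 bytes of high-priority data,
    exactly one low-priority packet of packet_bytes is sent.
    Only meaningful for 1..254 (0 and 255 are special values).
    """
    if not 1 <= high_limit <= 254:
        raise ValueError("use a high limit in 1..254 for this estimate")
    high_bytes = high_limit * 4096
    return packet_bytes / (high_bytes + packet_bytes)


if __name__ == "__main__":
    # 2048 bytes is an assumed packet size, not taken from the notes.
    share = low_priority_floor(254, 2048)
    print(f"low-priority floor: {share:.3%}")
```

With these assumptions the low-priority VL is left with roughly 0.2% of the link under full MPI load, which matches the intent of "send *something*" without giving Lustre any real bandwidth guarantee.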
>
> Configuring for OpenSM
> ----------------------
>
> Currently configured in /var/cache/opensm/opensm.opts (later to be in
> /etc/opensm/opensm.conf).
>
> #
> # QoS OPTIONS
> #
> qos TRUE
>
> qos_policy_file /var/cache/opensm/qos-policy.conf
>
> # QoS default options
> qos_max_vls 2
> qos_high_limit 254
> qos_vlarb_high 0:255
> qos_vlarb_low 1:1
> qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
>
> qos_ca_max_vls 2
> qos_ca_high_limit 254
> qos_ca_vlarb_high 0:255
> qos_ca_vlarb_low 1:1
> qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
>
> # achu: VL2 not used, need to give non-null input to buggy opensm
> qos_swe_max_vls 2
> qos_swe_high_limit 255
> qos_swe_vlarb_high 0:225,1:25
> qos_swe_vlarb_low 2:1
> qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
>
> Notes/Comments:
>
> There are default QoS options, and specific QoS options
> for channel adapters, switches, etc. They allow you to configure
> for different port-types across the fabric.
>
> The "max_vls" entries can be ignored.
>
> The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
> self-explanatory. The "vlarb_high"/"vlarb_low" entries take input
> as <VL>:<Weight> pairs.
>
> In the above example, Channel Adapters have:
>
> VL0 Weight = 255 -> For MPI
>
> VL1 Weight = 1 -> For Lustre
>
> Idea: With the High Limit of 254, MPI always gets priority, but cannot
> starve Lustre.
>
> In the above example, Switches have:
>
> VL0 Weight = 225 -> For MPI
> VL1 Weight = 25 -> For Lustre
>
> Idea: Across the entire cluster, MPI, Lustre, etc. are going on from
> different jobs/tasks. We don't want MPI to starve out other traffic
> so we give it a nice chunk of bandwidth but not all bandwidth (in this
> example 90% for MPI, 10% for Lustre).
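The 90%/10% split can be derived directly from the `qos_swe_vlarb_high` value. The Python sketch below parses a `<VL>:<Weight>` list as it appears in the options above and converts it to shares, again assuming a saturated link. The function names are hypothetical and this is not how OpenSM itself parses the file.

```python
def parse_vlarb(value):
    """Parse a string such as '0:225,1:25' into a list of (vl, weight)."""
    entries = []
    for item in value.split(","):
        vl_text, weight_text = item.strip().split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError(f"VL out of range: {vl}")
        if not 0 <= weight <= 255:
            raise ValueError(f"weight out of range: {weight}")
        entries.append((vl, weight))
    return entries


def vlarb_shares(entries):
    """Sum the weights per VL and return each VL's fraction of the total."""
    per_vl = {}
    for vl, weight in entries:
        per_vl[vl] = per_vl.get(vl, 0) + weight
    total = sum(per_vl.values())
    return {vl: weight / total for vl, weight in per_vl.items()}


if __name__ == "__main__":
    shares = vlarb_shares(parse_vlarb("0:225,1:25"))
    for vl, share in shares.items():
        print(f"VL{vl}: {share:.0%}")
```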
>
> SLs to VLs are mapped by listing the VLs for each SL in increasing
> order. In the above example, SL0 -> VL0 and SL1 -> VL1. An input of
> 15 is used for any SL you don't care about.
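As an illustration of how the positional `qos_sl2vl` list reads, the Python sketch below expands it into an explicit SL -> VL table and picks out the SLs that are actually in use (those not set to 15). The helper names are made up for this example.

```python
def parse_sl2vl(value):
    """Expand a 16-entry 'qos_sl2vl' list into a {sl: vl} dictionary.

    Position N in the list is the VL assigned to SL N.
    """
    vls = [int(v) for v in value.split(",")]
    if len(vls) != 16:
        raise ValueError(f"expected 16 entries, got {len(vls)}")
    for vl in vls:
        if not 0 <= vl <= 15:
            raise ValueError(f"VL out of range: {vl}")
    return dict(enumerate(vls))


def used_sls(mapping):
    """Return the SLs whose VL is not 15 (15 marks 'don't care' above)."""
    return [sl for sl, vl in mapping.items() if vl != 15]


if __name__ == "__main__":
    table = parse_sl2vl("0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15")
    for sl in used_sls(table):
        print(f"SL{sl} -> VL{table[sl]}")
```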
>
> Assigning SLs
> -------------
>
> The configuration of QoS is now over, but we still need to make
> protocols/applications use the appropriate SL.
>
> Some tools allow you to pick an SL when you run them.
>
> e.g.
>
>> mpirun -sl 0
>
> However, it may not be easy to force/change users/applications to use
> different SLs. The easiest way to configure SLs is through the OpenSM
> QoS policy file.
>
> QoS Policy File
> ---------------
>
> Depending on OpenSM version, this file is in
> /var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.
>
> The following is a short summary of the options I think are needed for
> our environment. See "QoS Management in OpenSM" for the full set of
> options.
>
> Format:
>
> qos-ulps
> <user level protocol>, <options> : <SL level>
> end-qos-ulps
>
> <user level protocol> = IPoIB, SDP, SRP, iSER
>
> <options> = port-num, pkey, service-id, target-port-guid
> (Note: options depends on which user level protocol is selected)
>
> <SL level> = SL level 0-15.
>
> Example:
>
> qos-ulps
> default : 0
> any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
> end-qos-ulps
>
> Idea:
>
> Everything (most notably MPI) defaults to SL0. Any traffic whose
> destination port GUID is one of those listed gets SL1.
>
> If the target-port-guid's list of GUIDs are Lustre Routers, that would
> indicate Lustre data gets SL=1. In combination with the VL
> Arbitration and SL2VL Mapping configuration listed above, hopefully it
> can be seen how MPI gets priority over Lustre, but does not starve it
> out.
>
> Note that files with target-port-guids must be kept up to date if
> GUIDs change. You can determine GUIDs via /usr/sbin/ibstat.
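Since the GUID list has to be kept up to date by hand, one option is to generate the stanza from a list of GUIDs. The Python sketch below reproduces only the layout of the example above (a default SL plus `any, target-port-guid ... : <SL>` rules). The function name and argument layout are assumptions for illustration, and the output should still be checked against the OpenSM documentation for your version.

```python
def render_qos_ulps(default_sl, guid_rules):
    """Render a qos-ulps stanza in the layout of the example above.

    guid_rules is a list of (sl, [guid, ...]) tuples, GUIDs as integers.
    """
    def check_sl(sl):
        if not 0 <= sl <= 15:
            raise ValueError(f"SL out of range: {sl}")

    check_sl(default_sl)
    lines = ["qos-ulps", f"default : {default_sl}"]
    for sl, guids in guid_rules:
        check_sl(sl)
        # Format each GUID as a zero-padded 64-bit hex value.
        guid_list = ",".join(f"0x{guid:016x}" for guid in guids)
        lines.append(f"any, target-port-guid {guid_list} : {sl}")
    lines.append("end-qos-ulps")
    return "\n".join(lines)


if __name__ == "__main__":
    # GUIDs taken from the example above (e.g. Lustre routers).
    routers = [0x0002C9030002879D, 0x0002C90300028765]
    print(render_qos_ulps(0, [(1, routers)]))
```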
>
> Verifying Configuration
> -----------------------
>
> The tool smpquery can be used to verify that VL Arbitration tables and
> SL2VL tables have been configured in cards/switches properly.
>
> # > /usr/sbin/smpquery sl2vl 346
> # SL2VL table: Lid 346
> # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in 0, out 0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|
>
> # > /usr/sbin/smpquery vlarb 346
> # VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
> # Low priority VL Arbitration Table:
> VL : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> # High priority VL Arbitration Table:
> VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
>
> The high limit can be determined by issuing portinfo queries via
> /usr/sbin/smpquery.
>
> # > /usr/sbin/smpquery portinfo 346 | grep Limit
> VLHighLimit:.....................0
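The check against the intended configuration can be scripted. The Python sketch below parses one `ports:` row of the sl2vl output shown above and compares it with the expected `qos_sl2vl` list. It works on captured text only and assumes the row layout matches the sample. It does not run smpquery itself, and the function name is hypothetical.

```python
def parse_sl2vl_row(line):
    """Parse a 'ports: in X, out Y: | 0| 1|15|...|' row into a list of VLs."""
    # Everything after the first '|' holds the 16 cells.
    _, _, cells = line.partition("|")
    values = [int(cell) for cell in cells.split("|") if cell.strip()]
    if len(values) != 16:
        raise ValueError(f"expected 16 SL columns, got {len(values)}")
    return values


if __name__ == "__main__":
    row = "ports: in 0, out 0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|"
    expected = [int(v) for v in
                "0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15".split(",")]
    actual = parse_sl2vl_row(row)
    print("match" if actual == expected else f"mismatch: {actual}")
```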
>
> Random Configuration Notes
> --------------------------
>
> SLs are most often assigned during Infiniband Queue Pair (QP) creation
> time. So, if you change your QoS settings, any tools/applications
> (including Lustre) that are currently running and have already created
> QPs may not have absorbed the newest QoS policy. The appropriate
> tools/applications should be restarted.
>
> Not all Infiniband adapters support VLs. Those that do may not
> support all 15 VLs. You can determine what your system supports by
> issuing portinfo queries via /usr/sbin/smpquery.
>
> References
> ----------
>
> QoS Management in OpenSM
>
> (this is a link to the Git Tree - hopefully the URL is always legit)
>
> http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD
>
> Quality and Service in OFED 3.1 - Liran Liss
>
> http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt
>
> QoS support in OFED
>
> (this is a link to the Git Tree - the URL is on the ofed_1_4 branch,
> so it probably will change at some point)
>
> http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4