[Lustre-discuss] [PATCH] Avoid Lustre failure on temporary failure
Alexey Lyashkov
alexey_lyashkov at xyratex.com
Tue Sep 2 06:05:53 PDT 2014
We don't need that many concurrent sends to a single peer, except for LNet routers.
As for the other limits:
the number of RPCs in flight is 1 for MDC<>MDT links,
and no more than 32 per OST, while the server side is limited to 512 OST_IO threads.
As for credits: the number of credits used in the LNet calculation should depend on the buffers posted for incoming processing, and that number of buffers should in turn depend on measured performance, e.g. the number of RPCs processed in a given time.
That would avoid over-buffering everywhere, but it opens the question of how credits are distributed across the cluster.
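
To make the over-buffering point concrete, here is a rough back-of-the-envelope model (all numbers below are hypothetical, not real Lustre defaults or tunables): with a static per-peer credit count the server has to post buffers for the worst case, while sizing from the measured RPC rate and service time (Little's law) needs far less.

    #include <stdio.h>

    /*
     * Toy model of the over-buffering argument above.  All numbers are
     * hypothetical and only illustrate the calculation, not actual
     * Lustre defaults or tunables.
     */
    int main(void)
    {
            int clients       = 1000;   /* peers talking to one server   */
            int peer_credits  = 8;      /* static per-peer credit limit  */
            int rpc_size_kb   = 1024;   /* 1M bulk RPC                   */

            int rpc_rate      = 20000;  /* RPCs/s the server can process */
            double service_ms = 5.0;    /* average service time per RPC  */

            /* Static credits: buffers sized for the worst case, every
             * peer using its full credit allowance at once. */
            long static_buf_mb =
                    (long)clients * peer_credits * rpc_size_kb / 1024;

            /* Throughput-driven sizing: enough buffers to cover the RPCs
             * actually in service at any instant (Little's law). */
            long needed_bufs   = (long)(rpc_rate * service_ms / 1000.0 + 0.5);
            long needed_buf_mb = needed_bufs * rpc_size_kb / 1024;

            printf("static credit sizing : %ld MB of posted buffers\n",
                   static_buf_mb);
            printf("throughput sizing    : %ld MB (%ld buffers in service)\n",
                   needed_buf_mb, needed_bufs);
            return 0;
    }

With these made-up numbers the static sizing posts 8000 MB of buffers while the throughput-driven sizing needs about 100 MB, which is the gap I mean by over-buffering.
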
On Sep 2, 2014, at 4:40 PM, Zhen, Liang <liang.zhen at intel.com> wrote:
> Precisely, "credit" means concurrent sends (of ko2iblnd messages) to a
> single peer; it is not the number of in-flight Lustre RPCs. I understand
> the memory issue here, and by enabling map_on_demand, ko2iblnd will
> create an FMR for bulk IO with many fragments (for example, 32+
> fragments, i.e. 128K+), and only let small IOs use the current mapping
> to avoid the overhead of creating an FMR; then we have at most 32
> fragments and the QP size is only 1/8 of what it is now.
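
(To spell out the 1/8 figure above, a quick sketch; the names below are made up for illustration, only the numbers follow from the message:)

    #include <stdio.h>

    /* Sketch of the send-queue arithmetic; names are illustrative,
     * not the real ko2iblnd macros or defaults. */
    #define PAGE_SIZE_KB    4
    #define BULK_SIZE_KB    1024                          /* 1M Lustre bulk */
    #define FRAGS_NO_FMR    (BULK_SIZE_KB / PAGE_SIZE_KB) /* 256 fragments  */
    #define FRAGS_WITH_FMR  32                            /* map_on_demand  */
    #define PEER_CREDITS    8                             /* hypothetical   */

    int main(void)
    {
            int wr_no_fmr   = PEER_CREDITS * FRAGS_NO_FMR;   /* 2048 WRs */
            int wr_with_fmr = PEER_CREDITS * FRAGS_WITH_FMR; /*  256 WRs */

            printf("max_send_wr without FMR: %d\n", wr_no_fmr);
            printf("max_send_wr with FMR   : %d (1/%d of the original)\n",
                   wr_with_fmr, wr_no_fmr / wr_with_fmr);
            return 0;
    }
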
>
> Regards
> Liang
>
> On 9/2/14, 6:09 PM, "Alexey Lyashkov" <alexey_lyashkov at xyratex.com> wrote:
>
>> Credits for Lustre? Does that work? Right now it's a strange number
>> with no relation to the real network structure, and it produces
>> over-buffering issues on the server side.
>>
>> On Sep 2, 2014, at 12:22 PM, Zhen, Liang <liang.zhen at intel.com> wrote:
>>
>>> Yes, I think this is the potential issue with this patch: for each 1M
>>> of data Lustre has 256 fragments (256 pages) on a 4K-pagesize system,
>>> which means we can have up to (credits x 256) outstanding work requests
>>> per connection, so decreasing max_send_wr may hit ib_post_send()
>>> failures under heavy workload.
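
(As an aside, the caller-side guard this failure mode would call for, i.e. the "second part" Dave asks about further down, might look roughly like the sketch below. kiblnd_post_or_defer() and the resend queue are hypothetical; this only illustrates the idea and is not part of the patch.)

    #include <linux/list.h>
    #include <rdma/ib_verbs.h>

    /*
     * Hypothetical sketch only -- not part of the patch under discussion.
     * If the QP was created with fewer send WRs than requested,
     * ib_post_send() can fail (typically -ENOMEM when the send queue is
     * full); instead of treating that as a fatal connection error, the
     * caller could park the transmit and re-post it once earlier work
     * requests complete.
     */
    static int kiblnd_post_or_defer(struct ib_qp *qp, struct ib_send_wr *wr,
                                    struct list_head *resend_queue,
                                    struct list_head *tx_link)
    {
            struct ib_send_wr *bad_wr;
            int rc;

            rc = ib_post_send(qp, wr, &bad_wr);
            if (rc == -ENOMEM) {
                    /* Send queue full: defer instead of dropping the
                     * connection; the completion handler would re-post
                     * from resend_queue later. */
                    list_add_tail(tx_link, resend_queue);
                    return 0;
            }
            return rc;
    }
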
>>>
>>> I understand this may be a problem for the low-level stack, which has
>>> to allocate a big chunk of space and can hit memory allocation
>>> failures. The solution is to enable map_on_demand and use FMR; however,
>>> enabling this on some nodes will prevent them from joining the cluster
>>> if other nodes don't have map_on_demand. We already have a patch for
>>> this which is pending review, please check LU-3322.
>>>
>>> Thanks
>>> Liang
>>>
>>> From: David McMillen <mcmillen at cray.com>
>>> Date: Sunday, August 31, 2014 at 6:48 PM
>>> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>,
>>>     Eli Cohen <eli at dev.mellanox.co.il>
>>> Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on temporary
>>> failure
>>>
>>> Has this been tested with a significant I/O load? We had tried a
>>> similar approach but ran into subsequent errors and connection drops
>>> when ib_post_send() failed. The code assumes that the original
>>> init_qp_attr->cap.max_send_wr request succeeded. Is there a second
>>> part to this patch?
>>>
>>> Dave
>>>
>>> On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen <eli at dev.mellanox.co.il> wrote:
>>>
>>>> Lustre code tries to create a QP with max_send_wr which depends on a
>>>> module parameter. The device capabilities do provide the maximum
>>>> number of send work requests that the device supports, but the actual
>>>> number of work requests that can be supported in a specific case
>>>> depends on other characteristics of the work queue, the transport
>>>> type, etc. This is in compliance with the IB spec:
>>>>
>>>> 11.2.1.2 QUERY HCA
>>>> Description:
>>>> Returns the attributes for the specified HCA.
>>>> The maximum values defined in this section are guaranteed
>>>> not-to-exceed values. It is possible for an implementation to allocate
>>>> some HCA resources from the same space. In that case, the maximum
>>>> values returned are not guaranteed for all of those resources
>>>> simultaneously.
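
(An aside to illustrate the point: the query returns an upper bound only, so even a request at or below the reported maximum can fail. The sketch below uses ib_query_device() as it existed in kernels of that era; the helper itself is hypothetical.)

    #include <linux/printk.h>
    #include <rdma/ib_verbs.h>

    /*
     * Illustrative sketch (not from the patch): the device-reported
     * maximum is a not-to-exceed value, so a QP request at or below
     * attr.max_qp_wr can still fail depending on WQE size, transport
     * type, etc. -- which is exactly why the patch falls back by halving
     * max_send_wr.
     */
    static int show_wr_limit(struct ib_device *ibdev)
    {
            struct ib_device_attr attr;
            int rc;

            rc = ib_query_device(ibdev, &attr);
            if (rc)
                    return rc;

            pr_info("%s: max_qp_wr=%d (upper bound, not a guarantee)\n",
                    ibdev->name, attr.max_qp_wr);
            return 0;
    }
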
>>>>
>>>> This patch tries to decrease the number of requested work requests to
>>>> a level that can be supported by the HCA. This prevents unnecessary
>>>> failures.
>>>>
>>>> Signed-off-by: Eli Cohen <eli at mellanox.com>
>>>> ---
>>>> lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>>>> 1 file changed, 18 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
>>>> index 4061db00cba2..ef1c6e07cb45 100644
>>>> --- a/lnet/klnds/o2iblnd/o2iblnd.c
>>>> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
>>>> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>          int cpt;
>>>>          int rc;
>>>>          int i;
>>>> +        int orig_wr;
>>>>
>>>>          LASSERT(net != NULL);
>>>>          LASSERT(!in_interrupt());
>>>> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>>>>
>>>>          conn->ibc_sched = sched;
>>>>
>>>> -        rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>> -        if (rc != 0) {
>>>> -                CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>> -                       rc, init_qp_attr->cap.max_send_wr,
>>>> -                       init_qp_attr->cap.max_recv_wr);
>>>> -                goto failed_2;
>>>> -        }
>>>> +        orig_wr = init_qp_attr->cap.max_send_wr;
>>>> +        do {
>>>> +                rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
>>>> +                if (!rc || init_qp_attr->cap.max_send_wr < 16)
>>>> +                        break;
>>>> +
>>>> +                init_qp_attr->cap.max_send_wr /= 2;
>>>> +        } while (rc);
>>>> +        if (rc != 0) {
>>>> +                CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
>>>> +                       rc, init_qp_attr->cap.max_send_wr,
>>>> +                       init_qp_attr->cap.max_recv_wr);
>>>> +                goto failed_2;
>>>> +        }
>>>> +        if (orig_wr != init_qp_attr->cap.max_send_wr)
>>>> +                pr_info("original send wr %d, created with %d\n",
>>>> +                        orig_wr, init_qp_attr->cap.max_send_wr);
>>>>
>>>>          LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>