[Patch 10/12] tabled: retry initial CLD session open etc.

From: Pete Zaitcev <zaitcev@redhat.com>
To: Jeff Garzik <jeff@garzik.org>
Cc: Project Hail List <hail-devel@vger.kernel.org>
Subject: [Patch 10/12] tabled: retry initial CLD session open etc.
Date: Sat, 17 Apr 2010 22:43:23 -0600
Message-ID: <20100417224323.14c73a41@redhat.com>

This was an error in the conversion to ncld. In the cldc code, we
kick the state machine and the natural retries do the rest, so any
failures occur there. But in ncld the original kick can fail too.

Five retries give the CLD server time to reboot. If it is still down
after that, clients refuse to start. This may or may not be a good idea.
We may yet make the retries infinite, but for now it's better if
builds terminate somehow in case of unexpected problems.

Also, is_dead should not be cleared if a retry timeout is scheduled.

Signed-off-by: Pete Zaitcev <zaitcev@redhat.com>

---
 server/cldu.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

commit a922bf8ae3137d1b8adc83b52c816f6334dd7291
Author: Master <zaitcev@lembas.zaitcev.lan>
Date:   Sat Apr 17 20:38:12 2010 -0600

    The CLD client fixes from Chunk (keep is_dead is most important).
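
For readers without the tabled tree at hand, the bounded retry in
cld_begin() can be sketched in isolation roughly as below. This is only
an illustration under assumptions: struct host_ring, ring_next(),
open_with_retries() and fail_n_times() are hypothetical names, not
tabled or ncld code. The callback merely copies the return convention
that cldu_set_cldc() appears to use in the diff (zero on success), and
the wrap-around in ring_next() is a guess at what cldu_nextactive()
does. Note that with the "++retry_cnt == 5" test the loop makes at most
five attempts in total.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CLD host list; not tabled's struct cld_session. */
struct host_ring {
	int nhosts;	/* number of configured hosts */
	int active;	/* index of the host tried most recently */
};

/* Hypothetical open callback: 0 on success, nonzero on failure. */
typedef int (*open_fn)(struct host_ring *ring, int host, void *ctx);

/* Assumed rotation policy: move to the next host, wrapping around. */
static int ring_next(const struct host_ring *ring)
{
	return (ring->active + 1) % ring->nhosts;
}

/*
 * Try to open a session at most max_tries times, moving to the next
 * host after each failure. Returns 0 on success, -1 once the budget
 * of attempts is spent.
 */
static int open_with_retries(struct host_ring *ring, open_fn try_open,
			     void *ctx, int max_tries)
{
	int host = 0;
	int tries = 0;

	assert(ring != NULL && ring->nhosts > 0 && max_tries > 0);

	for (;;) {
		ring->active = host;
		if (try_open(ring, host, ctx) == 0)
			return 0;
		if (++tries == max_tries)
			return -1;
		host = ring_next(ring);
	}
}

/* Demonstration callback: fails until the counter in ctx reaches zero. */
static int fail_n_times(struct host_ring *ring, int host, void *ctx)
{
	int *remaining = ctx;

	(void)ring;
	(void)host;
	if (*remaining > 0) {
		(*remaining)--;
		return -1;
	}
	return 0;
}
```

Calling open_with_retries(&ring, fail_n_times, &count, 5) with a
three-host ring and count set to 2 succeeds on the third host; with
count set to 5 it gives up, which corresponds to the "goto err_net"
path in the patch.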

diff --git a/server/cldu.c b/server/cldu.c
index a10b8fe..e247f45 100644
--- a/server/cldu.c
+++ b/server/cldu.c
@@ -157,14 +157,16 @@ static void cldu_tm_rescan(int fd, short events, void *userdata)
 		applog(LOG_DEBUG, "Rescanning for Chunks in %s", sp->xfname);

 	if (sp->is_dead) {
-		ncld_sess_close(sp->nsp);
-		sp->nsp = NULL;
-		sp->is_dead = false;
+		if (sp->nsp) {
+			ncld_sess_close(sp->nsp);
+			sp->nsp = NULL;
+		}
 		newactive = cldu_nextactive(sp);
 		if (cldu_set_cldc(sp, newactive)) {
 			evtimer_add(&sp->tm_rescan, &cldu_rescan_delay);
 			return;
 		}
+		sp->is_dead = false;
 	}

 	scan_chunks(sp);
@@ -589,6 +591,7 @@ int cld_begin(const char *thishost, const char *thisgroup, int verbose)
 {
 	static struct cld_session *sp = &ses;
 	struct timespec tm;
+	int newactive;
 	int retry_cnt;

 	cldu_hail_log.verbose = verbose;
@@ -635,9 +638,15 @@ int cld_begin(const char *thishost, const char *thisgroup, int verbose)
 	 * -- Actually, it only works when recovering from CLD failure.
 	 *    Thereafter, any slave CLD redirects us to the master.
 	 */
-	if (cldu_set_cldc(sp, 0)) {
+	newactive = 0;
+	retry_cnt = 0;
+	for (;;) {
+		if (!cldu_set_cldc(sp, newactive))
+			break;
 		/* Already logged error */
-		goto err_net;
+		if (++retry_cnt == 5)
+			goto err_net;
+		newactive = cldu_nextactive(sp);
 	}

 	retry_cnt = 0;
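
The is_dead ordering in the first hunk can be shown with a small
stand-alone model. Again this is a sketch under assumptions, not the
tabled code: struct sess, sess_reopen() and rescan_tick() are
hypothetical, and the counters exist only so the behaviour can be
observed. The idea, as far as the diff shows, is that when the reopen
fails and the timer is re-armed, the flag must stay set so that the
next timer tick tries the reopen again instead of falling through to a
scan with no session.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down session state; is_dead and nsp follow the patch. */
struct sess {
	void *nsp;		/* open session handle, NULL when closed */
	bool is_dead;		/* set when the session was lost */
	int reopen_failures;	/* demo knob: how many reopen attempts fail */
	int rescans;		/* counts scans that actually ran */
	int timers_armed;	/* counts retry timers scheduled */
};

static int dummy_handle;	/* placeholder for a real session object */

/* Stand-in for cldu_set_cldc(): 0 on success, nonzero on failure. */
static int sess_reopen(struct sess *sp)
{
	if (sp->reopen_failures > 0) {
		sp->reopen_failures--;
		return -1;
	}
	sp->nsp = &dummy_handle;
	return 0;
}

/* Stand-in for the rescan timer callback. */
static void rescan_tick(struct sess *sp)
{
	assert(sp != NULL);

	if (sp->is_dead) {
		if (sp->nsp)		/* close only if a handle exists */
			sp->nsp = NULL;
		if (sess_reopen(sp)) {
			sp->timers_armed++;	/* retry later, still dead */
			return;
		}
		sp->is_dead = false;	/* clear only after a successful reopen */
	}
	sp->rescans++;
}
```

With reopen_failures set to 1 and is_dead set, the first rescan_tick()
arms a timer and leaves is_dead set; the second one reopens, clears the
flag and runs the scan.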
