good show

Thu Jul 29 11:29:40 PDT 2004

On Wed, Jul 28, 2004 at 11:55:32PM -0400, Bill Vermillion wrote:
> > "Secondary MX" is the common name for a combination of techniques
> > intended to reduce mail delivery failures.
> 
> > To have this really be reliable, you need (at least) two things, in
> > each of two categories:
> 
> > You need for the master DNS zone for your domain to be served
> > from at least 2 machines, and preferably 3 or 4, *on different
> > backbones and uplink providers*. This way, mail will never
> > bounce with "can't resolve domain", which is a soft bounce (the
> > sending SMTP server will usually retry for up to 5 days).
> 
> It all depends upon where your machines are and just how reliable
> you must be.  Five 9's is easily doable - six 9's [ approximately
> 30 seconds downtime per year ] gets to be a bit expensive.

A nice table is at 

http://www.eventhelix.com/RealtimeMantra/FaultHandling/reliability_availability_basics.htm

Along with some other info on the topic.

And remember, these are usually *system* reliability numbers, not
component.  Engineering high system availability with common
components is the Holy Grail.
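
For anyone who wants to check the "nines" figures quoted here, this is a
minimal Python sketch of the arithmetic.  The function names are mine, not
from the linked page, and the series/parallel formulas assume independent
component failures, which real networks only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; ignores leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year at N nines (e.g. 5 means 99.999%)."""
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


def series(*availabilities: float) -> float:
    """System availability when every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """System availability when any one redundant component is enough.

    Assumes failures are independent (an idealization).
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {downtime_seconds_per_year(n):10.1f} s/year")
    # Hypothetical example: two 99.9% routers in parallel, then one
    # 99.99% uplink in series with the pair.
    print(f"example system: {series(parallel(0.999, 0.999), 0.9999):.6f}")
```

The series case is why system numbers are harder than component numbers:
every extra box that must be up multiplies the availability down.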

> Five 9's is about 5 minutes per year.  I was averaging that for the
> past 2 years until 2AM Monday morning when a Cisco 7120 decided to
> get finicky.  We lost a total of about 3 hours connection time
> from 2AM to 6AM when I configured a machine to act as a router.
> That's 4 hours total outage since March 2000.   Some of that lost
> time was bringing the Cisco back up and then watching it fall over
> again - while I was on the phone to tech support - in Australia.

:-)

> > You need at least one extra machine to actually *receive mail*
> > for your domain. These machines must have public, static IP
> > addresses, and properly administered SMTP mail systems.
> > You configure them in your DNS zone as additional MX records,
> > with higher numbers in their MX records (and therefore lower
> > priority).
> 
> > If a sending system tries to get mail to you, and for some reason
> > cannot contact your primary MX server, it will try your secondaries in
> > descending priority (ascending numerical) order.  Hopefully, *one* of
> > them will be accessible.  As usual, the optimal situation is to have
> > your secondaries in physically separate locations, on different
> > backbones, just like your DNS servers.
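
The fallback order described above can be sketched in a few lines of
Python.  This is only an illustration: the hostnames and the
`is_reachable` callback are hypothetical, no DNS or SMTP is actually
performed, and a real MTA also randomizes among records of equal
preference, which this sketch does not.

```python
from typing import Callable, Iterable, List, Optional, Tuple

MXRecord = Tuple[int, str]  # (preference number, hostname)


def delivery_order(records: Iterable[MXRecord]) -> List[str]:
    """Hosts in the order a sender tries them: lowest number first."""
    return [host for _, host in sorted(records)]


def first_reachable(
    records: Iterable[MXRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first host that accepts a connection, or None.

    None means every MX was down; the sender would queue and retry.
    """
    for host in delivery_order(records):
        if is_reachable(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical zone data: one primary and two secondaries.
    zone = [
        (20, "mx2.example.net"),
        (10, "mail.example.com"),
        (30, "mx3.example.org"),
    ]
    print(delivery_order(zone))
    # Pretend the primary is unreachable.
    print(first_reachable(zone, lambda host: host != "mail.example.com"))
```
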
> 
> Optimal can be expensive. And it depends on your needs and
> it depends on your backbone.  I said I now totalled about 4
> hours of downtime in 4.5 years.  The backbone I'm connected to has
> essentially no downtime.  There have been moments when they were
> reconfiguring - and I had advance notice - and while they said
> they expected the network to be unavailable for up to 5 minutes, I
> never saw that much.  And that last notice like that was two years
> ago.   There are more phone companies in that building than I can
> count.  And the last time I was in the carrier side the line
> of Lucent Ascend devices stretched for many many feet.  I made a
> rough estimate of 30,000 dial in connections at that time - and
> they've probably added more since then.

I'll bet.

> I'm on a 40Gb/s global backbone - Level 3. Fibre comes into the
> building from three separate locations. The battery room has
> almost as much square footage as my house. Those will keep
> everything running for 6 to 8 hours. And the ONLY reason those are
> there is in case the diesel generator doesn't start. The diesel
> turns on in seconds after any power failure. It has a 6000 gallon
> tank and puts out 1,250,000 watts - Caterpillar unit.

I'll tell you what I told Mark:

Sure, the AT&T 5E in the 6th subbasement of WTC2 kept running until
almost 1600 on 9/11, but it didn't matter much, did it?

> If you are a huge company - then having secondaries and DNS in
> separate locations - may be a requisite. But depending on your
> needs and what you use for a backbone, a separate backbone may
> not be necessary.   

Oh sure.  But I was speaking pedagogically; not telling John what *he*
needed.

> > These secondary servers are configured to accept the mail
> > for your domain, but not try local delivery -- they then
> > attempt the delivery to your primary server themselves, for
> > however long your secondaries are configured to try -- which,
> > hopefully, you're in control of.
> 
> I handle secondary MX for a colo client with a flock of domains.
> His machines get swamped at times - and I have been inside his [he
> has given me access] and he's woefully short on memory.   So he can
> get his mail server bogged down and things will come over to my
> secondary MX machines until his machines start breathing normally
> again.
> 
> If you don't control your secondaries you should at least be able
> to specify how long you want things to be held.   A decent provider
> should do that for you.  However I've seen some places just totally
> nuke the queues on a daily basis.  Those are usually smaller ISPs.

Yikes.  If they do, IMHO, they're *not* providing secondary MX service.

They're playing games.

> > Worst case, if your machine is running but your link has suffered
> > backhoe fade, you might be able to sneakernet the mail spool from a
> > secondary to the primary for delivery.
> 
> Only if they are close by and if you don't have a huge amount of
> mail to deliver.

Or you have a cablemodem at home, and a CD burner.

> > The highest bandwidth data transport known to mankind is a FedEx plane
> > full of DVD-ROM's.  (This used to be a station wagon full of magtape,
> > when someone at Duke coined it about Usenet; I've clearly updated.)
> 
> The problem with that is that it has a huge bandwidth but a very
> poor temporal timeframe.  And when the station wagon full of mag
> tapes analogy was made most backbones were in the 56K range as I
> recall.

Yeah, the lagtime is horrible, but the bandwidth is *still* high.  :-)

> 25,000 units and $5,000,000,000 later the new unit - CRS-1 [Carrier
> Router System] is the one that will replace those.  Bandwidth needs
> have grown faster than anyone had imagined.

Yeah.  Good ghod...

> While a plane full of DVD-ROMs may have a higher aggregate
> bandwidth the time consumed burning those DVDs is one thing.
> And if you had to ship it overnight, those should be ready to go no
> later than 8PM for 10AM delivery.   
> 
> That's 14 hours, or 50400 seconds. In that time frame the CRS-1
> will be able to move over 4.6 exabits, or about 580 petabytes.
> 
> 580 petabytes in the standard terminology is about 580 quadrillion
> bytes.
> 
> So it might be a close race between the plane and the data
> providing you already have the DVDs made.   :-)

Nice to know you're still on the ball, Bill.
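
For the record, here is a rough Python sketch of the unit conversion
behind those numbers, using only the 14-hour window and the 4.6 exabit
figure quoted above.  The 4.7 GB per-disc capacity (single-layer DVD)
and the one-million-disc shipment are my assumptions for illustration,
and burn time is ignored entirely.

```python
EXA = 10 ** 18
PETA = 10 ** 15

WINDOW_SECONDS = 14 * 3600   # 8PM pickup to 10AM delivery
DVD_BYTES = 4.7e9            # assumption: single-layer disc


def bits_to_bytes(bits: float) -> float:
    """Convert a bit count to bytes (8 bits per byte)."""
    return bits / 8


def effective_bandwidth_bps(payload_bytes: float, transit_seconds: float) -> float:
    """Average bits per second for a payload physically moved in a given time."""
    return payload_bytes * 8 / transit_seconds


if __name__ == "__main__":
    moved_bytes = bits_to_bytes(4.6 * EXA)
    print(f"router, 14 h: {moved_bytes / PETA:.0f} petabytes")
    print(f"discs to match: {moved_bytes / DVD_BYTES:,.0f}")
    # Hypothetical shipment of one million discs on the same schedule.
    plane_bps = effective_bandwidth_bps(1_000_000 * DVD_BYTES, WINDOW_SECONDS)
    print(f"plane with 1M discs: {plane_bps / 1e9:.0f} Gbit/s")
```
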

Cheers,
-- jra
-- 
Jay R. Ashworth                                                jra at baylink.com
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274

	"You know: I'm a fan of photosynthesis as much as the next guy,
	but if God merely wanted us to smell the flowers, he wouldn't 
	have invented a 3GHz microprocessor and a 3D graphics board."
					-- Luke Girardi