Monday, September 21, 2009

Now that you built it, how do you keep it running and measure it?

I had a great discussion with fellow MVP’er and genius Kent Weare. The discussion grew out of a very simple question:

“Internally, we are starting to measure our critical system’s uptime availability. These “critical” systems would include applications like Exchange, SAP, BizTalk, our Work order management system and the network of course. Are there any standards that your organizations follow?”

I would say the majority of standards are represented in one way or another via the Microsoft Operations Framework (MOF), which also has its ties with the Information Technology Infrastructure Library (ITIL).


Those two frameworks provide a heck of a lot of reading, but at the end of the day they are simply guidelines. The trick is to figure out how to apply them to your business.


Here are some general guidelines:


  1. Determine SLAs from the business for each class of applications (e.g. email, DB, integration)
    • Meeting the SLA is the criterion for what constitutes uptime/downtime
    • Get consensus on whether maintenance windows constitute downtime
  2. Find out from the business what level of uptime is required for each
  3. Determine a weighting factor for each required class (e.g. DB is weighted at 80 points, integration at 15 points, light DB at 5 points)
    • Weighting factors help determine overall uptime: if light DB is 5 points out of 100, then even if it is down 50% of the time while everything else stays up, the division still has a weighted uptime of 97.5%.
  4. Get current volumes and the end-to-end architecture, both for the present and for the present + 6 months
  5. Estimate the cost of meeting the uptime in 1 based on points 2-4
  6. Be prepared to redo points 1-5 based on the freak-out over the number in 5 :)
  7. Typical organizations start with uptimes around 98.9% to 99.9% (and NO, 98.9 does not constitute 2 9’s ;) )
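
To make the arithmetic behind guidelines 3 and 7 concrete, here is a minimal Python sketch. The weights are the ones from the example above; the function names, the one-year measurement period, and the way "nines" are counted (leading 9 digits of the percentage) are my own assumptions for illustration, not part of MOF or ITIL.

```python
# Hypothetical weights, taken from the example in guideline 3.
WEIGHTS = {"db": 80, "integration": 15, "light_db": 5}


def weighted_uptime(weights, uptimes):
    """Weighted average of per-class uptime fractions (0.0 to 1.0)."""
    total = sum(weights.values())
    return sum(weights[name] * uptimes[name] for name in weights) / total


def allowed_downtime_hours(uptime_pct, period_hours=365 * 24):
    """Downtime budget implied by an uptime percentage over a period."""
    return (1 - uptime_pct / 100) * period_hours


def count_nines(uptime_pct):
    """Count the leading 9 digits of a percentage, e.g. 99.9 -> 3."""
    # Zero-pad to two integer digits so 9.9 is not mistaken for 99.
    digits = f"{uptime_pct:09.6f}".replace(".", "")
    return len(digits) - len(digits.lstrip("9"))


if __name__ == "__main__":
    # Light DB is down half the time, everything else is always up.
    uptimes = {"db": 1.0, "integration": 1.0, "light_db": 0.5}
    print(f"weighted uptime: {weighted_uptime(WEIGHTS, uptimes):.1%}")

    for pct in (98.9, 99.9):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% = {count_nines(pct)} nine(s), "
              f"about {hours:.1f} hours of downtime per year")
```

The gap between the two ends of the typical starting range is bigger than it looks: under these assumptions 98.9% allows roughly 96 hours of downtime a year, while 99.9% allows under 9.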


Things to remember

  1. This model has to be reviewed with every single project and at a minimum every 3 months due to growth and bursts in regular business
  2. Remember to follow the entire length of the transaction to determine uptime
    • these may include 3rd party providers unless specifically noted
    • these do include network, power generators etc. as they can tie into a DR plan (Disaster Recovery)
  3. Factor in year to year growth
    • Because of organic growth, servers spec’d out 2 years ago may not be sufficient to handle the current load if your cluster fails
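
Points 2 and 3 can be sketched in a few lines of Python as well. This is only a rough model under stated assumptions: every hop in the transaction fails independently, growth compounds at a fixed annual rate, and all cluster nodes have the same capacity. The function names and the sample numbers are made up for illustration.

```python
import math


def end_to_end_availability(hop_availabilities):
    """Serial chain: the transaction works only if every hop is up.

    Assumes hops fail independently, so availabilities multiply.
    """
    return math.prod(hop_availabilities)


def projected_load(current_load, annual_growth, years):
    """Load after compounding a fixed annual growth rate."""
    return current_load * (1 + annual_growth) ** years


def survives_node_failure(load, node_capacity, node_count):
    """True if the remaining nodes can carry the load after one node fails."""
    return load <= node_capacity * (node_count - 1)


if __name__ == "__main__":
    # Hypothetical chain: network, integration server, 3rd party provider.
    chain = [0.999, 0.999, 0.995]
    print(f"end-to-end availability: {end_to_end_availability(chain):.4%}")

    # A two-node cluster sized two years ago for 1000 requests/sec,
    # with 25% growth per year (both numbers invented).
    load_then = 1000
    load_now = projected_load(load_then, annual_growth=0.25, years=2)
    print("failover OK two years ago:",
          survives_node_failure(load_then, node_capacity=1200, node_count=2))
    print("failover OK today:",
          survives_node_failure(load_now, node_capacity=1200, node_count=2))
```

Note that the chain as a whole is less available than its weakest hop, which is why the whole length of the transaction has to be followed when measuring uptime.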