Difference between revisions of "OPS635-lab-nagios"

From CDOT Wiki
m (Updating lab number for fall 2019)
m (Adding more detail to the lab. Not done yet.)
Line 2: Line 2:
 
=OPS635 Lab 1: Nagios Installation and Configuration=
 
=OPS635 Lab 1: Nagios Installation and Configuration=
 
==Overview==
 
==Overview==
In an enterprise environment, a production server must be staged before deployment. Any upgrade to the production servers must be tested in a testing environment and signed off by the change manager(s) before deploying to the production environment. In this lab, you will install and configure the Nagios monitoring framework on a VM running on your testing environment before deploying it to the production environment.
+
In an enterprise environment, a production server must be staged before deployment. Any upgrade to the production servers must be tested in a testing environment and signed off by the change manager(s) before deploying to the production environment. In this lab, you will install and configure the Nagios monitoring framework on a VM running on your testing environment before deploying it to the production environment.  You will use many of the common definitions encountered in a typical nagios installation.
==Investigation 1: Manual Nagios Installation==
+
==Investigation 1: Minimal Nagios Resources==
 
Clone your existing VM. Call the new VM nagios.<yourdomain>.ops, and provide it a static address of your choice.
 
Clone your existing VM. Call the new VM nagios.<yourdomain>.ops, and provide it a static address of your choice.
 
* Add the necessary records for this machine to your DNS server.
 
* Add the necessary records for this machine to your DNS server.
 
* Install and configure Nagios on this machine.
 
* Install and configure Nagios on this machine.
* Configure your Nagios to monitor the following host(s)/service(s):
+
* Configure your Nagios to also use any definitions you include in a file called lab1.cfg.
** Seneca host(s)/Service(s):
+
* Using the lab1.cfg file, create definitions to get your nagios installation to monitor the following hosts/services:
**scs.senecac.on.ca
+
** Create a host definition to make the nagios machine monitor itself (using a non-loopback address).  It should use the check_ping command every ten minutes to make sure it is active.
**ict.senecacollege.ca
+
** Create a service definition to make the nagios machine monitor it's own web-service (using the non-loopback address).  It should use the check_http command every 30 minutes, re-checking every 10 minutes if the initial check fails. 
*If either of these services go into a hard-fail state, nagios should send you an email.
+
** Create a timeperiod definition, and set it to only include the days and times you are in OPS635.  Modify the definitions in lab1.cfg to only run during this time.
==Investigation 2: Scripted Nagios Installation==
+
* Make sure the webservice running on your nagios machine is accessible from your host machine.
* Clone your existing VM again. Call the new VM nagiosclone.<yourdomain>.ops, and provide it a static address of your choice.
+
* Access the nagios web console and confirm that these checks are working before continuing.
* It is not necessary to add this machine to your DNS service.
+
==Investigation 2: Nagios Notifications==
* Write a script to automate the nagios configuration process you used in Investigation 1
+
* Turn flap detection off for the checks you created in investigation 1.
* Test the script to ensure it will automatically configure a newly installed machine as a nagios server.
+
* Modify the lab1.cfg file to include a contact named after yourself, using your email address in your domain. Set its notification periods to use the same timeperiod you created in investigation 1.
 +
* Create a second contact called senioradmin, using the email account for root@<yourdomain>.ops.
 +
* Set the notification interval for the host and service you created in investigation 1 to five minutes. This is unreasonably short for most installations, but in this lab we want to get multiple notifications in a very short time line so that we can be sure they are working.
 +
* If either of these services go into a hard-fail state, nagios should now send you an email.
 +
* Manipulate you machine to cause these checks to fail (e.g. set your firewall to block the ping traffic), and make sure you receive the email before continuing.
 +
* Fix your machine so the checks are passing again.
 +
* Add a hostescalation and a serviceescalation so that if you don't fix the issue before you are notified three times, the notification will instead be sent to the senior admin.
 +
* Cause the checks to fail again, and wait for the notification to be sent to root.
 +
==Investigation 3: Nagios Custom Commands==
 +
* Create a script plugin called check_apache that will use systemctl to check the state of your httpd service.  If the service is running, return 0.  If it is inactive, return 1.  If it is failed, return 2.  For any other result return 3.
 +
* Create a command definition called check_apache_status that will call the check_apache plugin.
 +
* Create a new service definition that will use the new command to check the status of your apache service every two minutes, going into a hard-fail state on the third failed check.
 +
* Create an event handler script to restart apache if it is inactive.  Use the nagios macros to make sure it only tries to restart apache on the second failed check (that is, before it goes into a hard-fail state).
 +
* Add notifications similar to those for your other checks (you should be notified if the service goes into a hard-fail state, and the senior admin should be notified if you don't fix it).
 +
==Investigation 4: Nagios Remote Commands==
 +
* Under Construction
 +
* Clone your existing VM again. Call the new VM nagiosclone.<yourdomain>.ops, provide it a static address of your choice, and add it to your DNS server.
 +
* Install NRPE on nagiosclone.
 
==Submission==
 
==Submission==
 
Demonstrate the your script working on a newly installed VM, and upload it to blackboard.
 
Demonstrate the your script working on a newly installed VM, and upload it to blackboard.

Revision as of 20:06, 8 January 2020

OPS635 Lab 1: Nagios Installation and Configuration

Overview

In an enterprise environment, a production server must be staged before deployment. Any upgrade to the production servers must be tested in a testing environment and signed off by the change manager(s) before deploying to the production environment. In this lab, you will install and configure the Nagios monitoring framework on a VM in your testing environment before deploying it to the production environment. You will use many of the common definitions encountered in a typical Nagios installation.

Investigation 1: Minimal Nagios Resources

Clone your existing VM. Call the new VM nagios.<yourdomain>.ops, and provide it a static address of your choice.

  • Add the necessary records for this machine to your DNS server.
  • Install and configure Nagios on this machine.
  • Configure your Nagios to also use any definitions you include in a file called lab1.cfg.
  • Using the lab1.cfg file, create definitions to get your Nagios installation to monitor the following hosts/services:
    • Create a host definition to make the Nagios machine monitor itself (using a non-loopback address). It should use the check_ping command every ten minutes to make sure it is active.
    • Create a service definition to make the Nagios machine monitor its own web service (using the non-loopback address). It should use the check_http command every 30 minutes, re-checking every 10 minutes if the initial check fails.
    • Create a timeperiod definition, and set it to only include the days and times you are in OPS635. Modify the definitions in lab1.cfg to only run during this time.
  • Make sure the web service running on your Nagios machine is accessible from your host machine.
  • Access the Nagios web console and confirm that these checks are working before continuing.
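
The definitions asked for above could look roughly like the following lab1.cfg fragment. This is only a sketch, not a model answer: the host name nagios.example.ops, the address 192.168.10.5, the timeperiod name ops635-class, and the class days and times are all made-up placeholders, and the check_ping and check_http commands are assumed to come from the sample commands.cfg shipped with Nagios. The cfg_file path in the first comment is also an assumption and depends on how Nagios was installed.

```cfg
# In the main nagios.cfg, a line such as the following loads this file
# (the path is an assumption; use wherever you saved lab1.cfg):
#   cfg_file=/etc/nagios/objects/lab1.cfg

# Hypothetical class schedule; replace with your own OPS635 days and times.
define timeperiod {
    timeperiod_name     ops635-class
    alias               OPS635 class hours
    tuesday             09:50-11:35
    thursday            13:30-15:15
}

# The Nagios machine monitoring itself through its non-loopback address.
define host {
    host_name           nagios.example.ops    ; placeholder for nagios.<yourdomain>.ops
    alias               Nagios server
    address             192.168.10.5          ; placeholder static address
    check_command       check_ping!100.0,20%!500.0,60%
    check_interval      10                    ; minutes, with the default interval_length of 60
    max_check_attempts  3
    check_period        ops635-class
}

# The web service on the same machine.
define service {
    host_name           nagios.example.ops
    service_description HTTP
    check_command       check_http
    check_interval      30                    ; normal check every 30 minutes
    retry_interval      10                    ; re-check every 10 minutes after a failure
    max_check_attempts  3
    check_period        ops635-class
}
```

Running the Nagios binary with the -v option against the main configuration file (for example nagios -v /etc/nagios/nagios.cfg, again assuming that path) checks the definitions for errors before you restart the service.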

Investigation 2: Nagios Notifications

  • Turn flap detection off for the checks you created in investigation 1.
  • Modify the lab1.cfg file to include a contact named after yourself, using your email address in your domain. Set its notification periods to use the same timeperiod you created in investigation 1.
  • Create a second contact called senioradmin, using the email account for root@<yourdomain>.ops.
  • Set the notification interval for the host and service you created in investigation 1 to five minutes. This is unreasonably short for most installations, but in this lab we want to get multiple notifications in a very short time so that we can be sure they are working.
  • If either of these checks goes into a hard-fail state, Nagios should now send you an email.
  • Manipulate your machine to cause these checks to fail (e.g. set your firewall to block the ping traffic), and make sure you receive the email before continuing.
  • Fix your machine so the checks are passing again.
  • Add a hostescalation and a serviceescalation so that if you don't fix the issue before you are notified three times, the notification will instead be sent to the senior admin.
  • Cause the checks to fail again, and wait for the notification to be sent to root.
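
One possible shape for the notification definitions is sketched below. The contact name jdoe, the addresses under example.ops, the host name, and the timeperiod name ops635-class are placeholders, and the notify-host-by-email and notify-service-by-email commands are assumed to be the ones defined in the sample commands.cfg. The escalations start at the fourth notification, which is one reading of "before you are notified three times".

```cfg
# Directives to add to the existing host and service definitions
# (shown as comments because those objects are already defined):
#   flap_detection_enabled  0
#   contacts                jdoe
#   notification_interval   5
#   notification_period     ops635-class

define contact {
    contact_name                    jdoe                  ; placeholder: use your own name
    alias                           Jane Doe
    email                           jdoe@example.ops      ; placeholder address in your domain
    host_notification_period        ops635-class
    service_notification_period     ops635-class
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

define contact {
    contact_name                    senioradmin
    alias                           Senior administrator
    email                           root@example.ops      ; root@<yourdomain>.ops
    host_notification_period        ops635-class
    service_notification_period     ops635-class
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

# From the fourth notification onward, notify senioradmin instead.
define hostescalation {
    host_name               nagios.example.ops
    contacts                senioradmin
    first_notification      4
    last_notification       0       ; 0 means the escalation never ends
    notification_interval   5
}

define serviceescalation {
    host_name               nagios.example.ops
    service_description     HTTP
    contacts                senioradmin
    first_notification      4
    last_notification       0
    notification_interval   5
}
```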

Investigation 3: Nagios Custom Commands

  • Create a script plugin called check_apache that will use systemctl to check the state of your httpd service. If the service is running, return 0. If it is inactive, return 1. If it is failed, return 2. For any other result return 3.
  • Create a command definition called check_apache_status that will call the check_apache plugin.
  • Create a new service definition that will use the new command to check the status of your apache service every two minutes, going into a hard-fail state on the third failed check.
  • Create an event handler script to restart apache if it is inactive. Use the Nagios macros to make sure it only tries to restart apache on the second failed check (that is, before it goes into a hard-fail state).
  • Add notifications similar to those for your other checks (you should be notified if the service goes into a hard-fail state, and the senior admin should be notified if you don't fix it).
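
As a starting point for the check_apache plugin described above, here is a minimal sketch in POSIX shell. The return codes follow the lab text. The helper name state_to_code, the UNIT variable, and the wording of the status line are illustrative choices rather than anything Nagios requires. The script assumes that systemctl is-active prints the unit state (active, inactive, failed, and so on) on standard output.

```shell
#!/bin/sh
# check_apache: sketch of a Nagios plugin that reports the state of httpd.
# Exit codes follow the lab text: 0 running, 1 inactive, 2 failed, 3 anything else.

UNIT="httpd"   # unit name taken from the lab text; adjust if yours differs

# Map a state string printed by "systemctl is-active" to a plugin exit code.
state_to_code() {
    case "$1" in
        active)   echo 0 ;;
        inactive) echo 1 ;;
        failed)   echo 2 ;;
        *)        echo 3 ;;
    esac
}

# Query systemd, print a one-line status for the web console, return the code.
check_apache() {
    state=$(systemctl is-active "$UNIT" 2>/dev/null)
    code=$(state_to_code "$state")
    echo "APACHE state: ${state:-unknown}"
    return "$code"
}

# Run the check only when this file is invoked under the plugin's name,
# so the functions can also be sourced and exercised by hand.
case "${0##*/}" in
    check_apache) check_apache; exit $? ;;
esac
```

If the script is saved as check_apache in the plugin directory that the $USER1$ macro points to, the check_apache_status command definition would typically use $USER1$/check_apache as its command_line. Running the script by hand and then printing $? is a quick way to confirm each return code before wiring it into Nagios.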

Investigation 4: Nagios Remote Commands

  • Under Construction
  • Clone your existing VM again. Call the new VM nagiosclone.<yourdomain>.ops, provide it a static address of your choice, and add it to your DNS server.
  • Install NRPE on nagiosclone.

Submission

Demonstrate your script working on a newly installed VM, and upload it to Blackboard.