Difference between revisions of "OPS635-lab-nagios"

m (Updating lab number for fall 2019)
m (Protected "OPS635-lab-nagios": OER transfer ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite)))
 
(11 intermediate revisions by one other user not shown)

Latest revision as of 20:40, 12 June 2023

OPS635 Lab: Nagios Installation and Configuration

Overview

In an enterprise environment, a production server must be staged before deployment. Any upgrade to the production servers must be tested in a testing environment and signed off by the change manager(s) before it is deployed to the production environment. In this lab, you will install and configure the Nagios monitoring framework on a VM in your testing environment before deploying it to the production environment. You will use many of the common definitions encountered in a typical Nagios installation.

Investigation 1: Minimal Nagios Resources

Clone your existing VM. Call the new VM nagios.<yourdomain>.ops, and provide it a static address of your choice.

  • Add the necessary records for this machine to your DNS server.
  • Install and configure Nagios on this machine.
  • Configure your Nagios to also use any definitions you include in a file called lab.cfg.
  • Using the lab.cfg file, create definitions to get your nagios installation to monitor the following hosts/services:
    • Create a host definition to make the nagios machine monitor itself (using a non-loopback address). It should use the check_ping command every ten minutes to make sure it is active.
    • Create a service definition to make the nagios machine monitor whether it can connect to its own ssh service (using the non-loopback address). It should use the pre-written check_ssh command every 30 minutes, re-checking every 10 minutes if the initial check fails.
    • Create a timeperiod definition, and set it to only include the days and times you are in OPS635. Modify the definitions in lab.cfg to only run during this time.
  • Make sure the webservice running on your nagios machine is accessible from your host machine.
  • Access the nagios web console and confirm that these checks are working before continuing.
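
As a minimal sketch of the definitions described above, the shell function below writes a sample lab.cfg. It assumes the stock check_ping and check_ssh command definitions from the Nagios sample commands.cfg are loaded. The function name write_lab_cfg, the timeperiod name ops635-class, the class days and times, and the service description SSH are placeholders, not values given by the lab. Contacts are left out because the lab adds them in Investigation 2, so Nagios may warn about their absence when it verifies the configuration.

```shell
#!/bin/sh
# Hypothetical helper: write timeperiod, host and service definitions to $1.
# $1 = output file, $2 = host name, $3 = non-loopback address.
write_lab_cfg() {
    cat > "$1" <<EOF
# Placeholder schedule: replace with your own OPS635 class times.
define timeperiod {
    timeperiod_name     ops635-class
    alias               OPS635 class hours
    tuesday             09:50-11:35
    thursday            13:30-15:15
}

# The Nagios machine monitoring itself, pinged every ten minutes.
define host {
    host_name           $2
    alias               Nagios server
    address             $3
    check_command       check_ping!100.0,20%!500.0,60%
    check_interval      10
    retry_interval      1
    max_check_attempts  3
    check_period        ops635-class
}

# SSH check every 30 minutes, re-checked every 10 minutes after a failure.
define service {
    host_name           $2
    service_description SSH
    check_command       check_ssh
    check_interval      30
    retry_interval      10
    max_check_attempts  3
    check_period        ops635-class
}
EOF
}
```

For example, write_lab_cfg /usr/local/nagios/etc/objects/lab.cfg nagios.example.ops 192.168.10.5 would create the file (the host name, address and path here are illustrative). Nagios reads the file once a line such as cfg_file=/usr/local/nagios/etc/objects/lab.cfg is added to nagios.cfg, where the exact path depends on how Nagios was installed. Running nagios -v followed by the path to nagios.cfg checks the result before you restart the service.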

Investigation 2: Nagios Notifications

  • Turn flap detection off for the checks you created in investigation 1.
  • Modify the lab.cfg file to include a contact named after yourself, using your email address in your domain. Set its notification periods to use the same timeperiod you created in investigation 1.
  • Create a second contact called senioradmin, using the email account for root@<yourdomain>.ops.
  • Set the notification interval for the host and service you created in investigation 1 to five minutes. This is unreasonably short for most installations, but in this lab we want to receive several notifications in a short time so that we can be sure they are working.
  • If either of these checks goes into a hard-fail state, nagios should now send you an email. Note that you should already have configured the email server on your host to accept email for your domain.
  • Manipulate your machine to cause these checks to fail (e.g. set your firewall to block ssh traffic), and make sure you receive the email before continuing.
  • Fix your machine so the checks are passing again.
  • Add a hostescalation and a serviceescalation so that if you don't fix the issue before you are notified three times, the notification will instead be sent to the senior admin.
  • Cause the checks to fail again, and wait for the notification to be sent to root.
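
The contacts and escalations described above could look roughly like the output of the shell function below. This is a sketch under several assumptions: the notify-host-by-email and notify-service-by-email commands come from the Nagios sample commands.cfg, the service from Investigation 1 is described as SSH, and the function name write_notify_cfg and its arguments are hypothetical. Because an escalation replaces the normal contacts while it applies, first_notification 4 hands the fourth and later notifications to senioradmin.

```shell
#!/bin/sh
# Hypothetical helper: write contact and escalation definitions to $1.
# $1 = output file, $2 = your user name, $3 = your domain,
# $4 = host name from Investigation 1, $5 = timeperiod name.
write_notify_cfg() {
    cat > "$1" <<EOF
# Add these to the existing host and service definitions (not redefined here):
#   flap_detection_enabled  0
#   notification_interval   5
#   notification_period     $5
#   contacts                $2

define contact {
    contact_name                    $2
    email                           $2@$3
    host_notification_period        $5
    service_notification_period     $5
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

define contact {
    contact_name                    senioradmin
    email                           root@$3
    host_notification_period        $5
    service_notification_period     $5
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

# From the fourth notification onward, notify senioradmin instead.
define hostescalation {
    host_name               $4
    contacts                senioradmin
    first_notification      4
    last_notification       0
    notification_interval   5
}

define serviceescalation {
    host_name               $4
    service_description     SSH
    contacts                senioradmin
    first_notification      4
    last_notification       0
    notification_interval   5
}
EOF
}
```

The generated text can be appended to lab.cfg or kept in a separate file named by its own cfg_file line. A last_notification value of 0 keeps the escalation active for as long as the problem lasts.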

Investigation 3: Nagios Custom Commands

  • Create a script plugin called check_sshd that will use systemctl to check the state of your sshd service. If the service is running, return 0. If it is inactive, return 1. If it is failed, return 2. For any other result return 3.
  • Create a command definition called check_sshd_status that will call the check_sshd plugin.
  • Create a new service definition that will use the new command to check the status of your sshd service every two minutes, going into a hard-fail state on the third failed check.
  • Create an event handler script to restart sshd if it is inactive. Use the nagios macros to make sure it only tries to restart the service on the second failed check (that is, before it goes into a hard-fail state).
  • Add notifications similar to those for your other checks (you should be notified if the service goes into a hard-fail state, and the senior admin should be notified if you don't fix it).
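
One possible shape for the plugin and the event handler is sketched below. It relies on the Nagios convention that plugin return codes 0, 1, 2 and 3 mean OK, WARNING, CRITICAL and UNKNOWN, so an inactive sshd shows up as WARNING. The function names, the handler name restart_sshd, and the sudo rule the handler relies on are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: in practice the plugin and the event handler would be two
# separate scripts; they share one listing here for brevity.

# Map the output of "systemctl is-active" to the return codes the lab asks for.
state_to_code() {
    case "$1" in
        active)   echo 0 ;;   # running       -> OK
        inactive) echo 1 ;;   # stopped       -> WARNING
        failed)   echo 2 ;;   # failed        -> CRITICAL
        *)        echo 3 ;;   # anything else -> UNKNOWN
    esac
}

# Body of the check_sshd plugin: print one status line, return the code.
check_sshd() {
    state=$(systemctl is-active sshd 2>/dev/null)
    echo "sshd is ${state:-unknown}"
    return "$(state_to_code "$state")"
}

# Event handler decision: act only on the second failed check of an
# inactive service, i.e. WARNING state, SOFT state type, attempt 2.
should_restart() {
    [ "$1" = "WARNING" ] && [ "$2" = "SOFT" ] && [ "$3" = "2" ]
}

# Body of the event handler; the arguments are the values of the
# $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros.
restart_sshd_handler() {
    if should_restart "$1" "$2" "$3"; then
        # Assumes the nagios user is allowed to run this through sudo.
        sudo systemctl restart sshd
    fi
}

# Hypothetical object definitions to go with the scripts (Nagios config):
#   define command {
#       command_name  check_sshd_status
#       command_line  $USER1$/check_sshd
#   }
#   define command {
#       command_name  restart_sshd
#       command_line  $USER1$/restart_sshd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
#   }
#   In the service definition: check_command check_sshd_status,
#   check_interval 2, max_check_attempts 3, event_handler restart_sshd
```

Saved as the plugin, the file would end with a call to check_sshd followed by exit $?; saved as the handler, it would end with restart_sshd_handler "$1" "$2" "$3". With max_check_attempts set to 3, attempts 1 and 2 are soft states and attempt 3 is the hard state, which is why the handler looks for attempt 2.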

Investigation 4: Nagios Remote Commands

  • Clone your existing VM again. Call the new VM nagiosnrpe.<yourdomain>.ops, provide it a static address of your choice, and add it to your DNS server.
  • Install NRPE on nagiosnrpe.
  • Make sure to modify the NRPE configuration on nagiosnrpe to allow your nagios server to contact it.
  • Copy your check_sshd plugin to nagiosnrpe, making sure the user account for nrpe can run it. Note that you will have to work this out with sudo and SELinux.
  • Add a command to your nrpe configuration to allow remote execution of check_sshd.
  • Start and enable the service, and allow traffic to it through your firewall.
  • Back on your nagios server, add a new host definition for nagiosnrpe, and add a service that uses nrpe to run the check_sshd plugin on nagiosnrpe.
  • Ensure that the check runs correctly, then do something to intentionally make it fail (e.g. stop the sshd service), and confirm that the failure is recorded too.
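
The two sides of the NRPE setup can be sketched with the shell functions below: one writes the settings needed on nagiosnrpe, the other writes the objects needed on the Nagios server. The function names, the service description, and every argument value are hypothetical. The sketch also assumes that the check_nrpe plugin is installed on the server, that the check_ping command and the 24x7 timeperiod from the Nagios sample configuration are loaded, and that check_nrpe is not already defined as a command (drop that definition if it is).

```shell
#!/bin/sh
# Hypothetical helper for nagiosnrpe: write the NRPE settings to $1.
# $1 = output file, $2 = address of the Nagios server, $3 = plugin path.
write_nrpe_cfg() {
    cat > "$1" <<EOF
# Hosts allowed to talk to the NRPE daemon.
allowed_hosts=127.0.0.1,$2
# Remote command name mapped to the local plugin.
command[check_sshd]=$3
EOF
}

# Hypothetical helper for the Nagios server: write the objects to $1.
# $1 = output file, $2 = host name of nagiosnrpe, $3 = its address.
write_nrpe_objects() {
    cat > "$1" <<EOF
# \$ARG1\$ carries the remote command name, here check_sshd.
define command {
    command_name        check_nrpe
    command_line        \$USER1\$/check_nrpe -H \$HOSTADDRESS\$ -c \$ARG1\$
}

define host {
    host_name           $2
    alias               NRPE client
    address             $3
    check_command       check_ping!100.0,20%!500.0,60%
    max_check_attempts  3
    check_period        24x7
}

define service {
    host_name           $2
    service_description Remote sshd
    check_command       check_nrpe!check_sshd
    check_interval      2
    max_check_attempts  3
    check_period        24x7
}
EOF
}
```

The same two NRPE lines can instead be edited directly in nrpe.cfg. Assuming the service is named nrpe and the firewall is firewalld, systemctl enable --now nrpe starts the daemon, and firewall-cmd --permanent --add-port=5666/tcp followed by firewall-cmd --reload opens NRPE's default port.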

Submission

Upload your lab.cfg, the nagios configuration from nagiosnrpe, your check_sshd plugin, and your event handler to blackboard.

Completing The Lab

You have now gained experience using common elements of nagios to monitor machines in your network. You have configured hosts that should be monitored, identified services to monitor on them, created contacts and notifications so that administrators will be notified when things go wrong (and senior admins can be notified if they don't get fixed), and used nrpe to allow checks to be performed remotely. You have also written simple checks to customize what you want monitored, and event handlers so that nagios can try to repair simple issues for you. There is still more to learn (host and service groups and dependencies will make your configuration much more efficient), but there is only so much room in the course. With what we have covered, you have the basic building blocks to monitor your network.