Episode 15, Monitoring

Here are the show notes for episode 15.

Make sure to send us feedback so we can make the show even better.



Links:
Network Monitoring
Nagios
MRTG
Cacti
SNMP
NET-SNMP


RYOS, Episode 15 - Monitoring



Announcer: Run Your Own Server Podcast, for March the 24th, 2007.

[intro music]

In this episode, monitoring, why would you want to monitor, why you
need to monitor, what are some tools you can use to monitor, what
is SNMP, a moment of sack, and rat weaving.

[music]

Thud: OK, let's get started. Seg, what do you want to monitor on your
network?

Seg: The things I need to monitor are anything that is critical,
anything that is production. If I am running a website, I need to
monitor that website. If I am running a mail server, I need to
monitor a mail server. I want to monitor anything that is important
to me; whether it is a web server process, to know that Apache or
IIS or whatever, is still listening, still serving pages; whether
it is a mail server that is still taking in and putting out mail,
handling mail properly; or service resources, such as hard drive
space, RAM, CPU, things like that. Those are the important things
that I need to monitor on my network.

Thud: When I do network monitoring that's exactly what I am monitoring. I
am basically monitoring for - is the mail server's port 25 up and
running? Is it accepting mail? One of the critical things for me,
especially for high load situations, is the CPU load. That is
something that is very important. There is also disk usage for file
servers, there's just tons and tons of stuff that you can monitor.

In this show, we have kind of thrown together two separate things,
one is network monitoring and one is server monitoring, because you
do your server monitoring over the network. One of the things that
you really do need to monitor, as far as the network is concerned,
is not just -- is that switch or router up or down -- but also the
network traffic. Because network traffic can point you to potential
problems.

If you have a mail server that is pretty consistent throughout the
day, and has a certain pattern, day to day, throughout a week, and
all of a sudden it starts getting way more mail than it usually
does, that's something that you can see in the network traffic
monitoring. You can see - is it really mail or is it specific
attacks that are coming in for it.

Or on an FTP server, for example. If you normally see pretty spiky
traffic as a few people download here and there, throughout the
day, then all of a sudden, you see it peg at 100 megs a second
and keep on going, you probably have a server that has been hacked.
So monitoring the network traffic is also very, very important.

Seg: Although I know a lot of people would do this. One of the things
that is an easy check for me, to make sure that my boxes haven't
been popped, that my backups are happening on schedule and that
they are successful, is I monitor my bandwidth usage. Every X
number of hours when my backups go on, I see a spike, and looking
at the graphs that my hoster provides me, it is very easy. I see
that spike at regular intervals. I know these backups are happening
when they need to.

I know what kind of normal traffic I should expect over my network.
If that were to go up or go down, if the flat line pegs at the top
for a long time, obviously something is wrong. That is a very easy
way to keep an eye on a number of things, just by getting this one
piece of information. That is fantastic.

In corporate environments, in hosting facility environments, that's
the whole goal of network and server monitoring; is to take a lot
of information and present it in a way that someone can look at it
and know everything that is going on very quickly, and say, "OK,
this light is green, so that means that these 30 systems are
running. This light is orange, that means that this is wrong or
something else I need to investigate. As long as this light is
green, I know that all of them are running just fine."

That's the kind of goal that people who sell network monitoring or
server monitoring, monitoring in general, that's what they aspire
for.

Thud: OK. That's actually a good segue into our next topic. What are
the tools that we can use to do network or system monitoring?

Seg: The two tools I am most familiar with are Nagios and SNMP. I don't
know a whole lot about either one of them, but those are the things
that I have used up to date. Before that, I went around and had
bio-mechanical implants put into me so that I could check, always know
the status of all the servers. Very, very Borg-like, but it works.
We got to about four or five thousand, I was like, "This is too
much. I can't find a date anymore. We got to do something else."
They said, "OK, we will go to Nagios and SNMP."

[laughter]

Cutting that out...

Thud: I am leaving that in.

[laughter]

Seg: Nagios is a system that can do a lot of different things, including
SNMP polling. Nagios has a web front-end that network
administrators can use to check on the status of the network. If
you have got a group of servers, you can use Nagios to scan them
all. Nagios can give you a nice web update that says, "Hey, here's
some problems." And you can do a lot of pretty cool administrative
stuff through the web interface.

SNMP is one of the many tools that Nagios can use. Nagios can go
out and do TCP/UDP port checks - say check port 80 and make sure
that it's open. Just make sure that the connection isn't refused.
If it is refused, the web server is down, we have a fire, we got to
go around and fix it.
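
Editor's sketch: the port check Seg describes can be approximated in a
few lines of Python. This is not how Nagios implements its checks; it
only shows the idea of attempting a TCP connection and treating a
refusal or timeout as "down". The function name port_is_open and the
three-second timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A call such as port_is_open("www.example.com", 80) only tells you that
something accepted the connection; it does not prove the web server is
returning good pages.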

It can use SNMP for resource monitoring. SNMP is a gigantic complex
thing. You can use SNMP for a lot of different things, under the
umbrella of network monitoring, and that's really the only thing I
have ever used it for. Just go out and tell me what the RAM usage
is, what the CPU usage is, disk space, then what the top values
could be, do a comparison, tell me if there is a problem. If it is
five percent free I want to know about it, if it is 20 percent
free, it is OK, since it is a 20 gig partition, so I don't care.
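
Editor's sketch: the comparison Seg describes (get the current value,
get the maximum, compare, and flag a problem) looks roughly like this.
To stay self-contained it reads the local filesystem with Python's
shutil.disk_usage instead of asking an SNMP agent, and the function
names and the five percent default threshold are assumptions taken from
his example, not settings from any real tool.

```python
import shutil


def percent_free(path: str) -> float:
    """Return free space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def needs_attention(free_pct: float, threshold: float = 5.0) -> bool:
    """Flag a problem when free space is at or below the threshold."""
    return free_pct <= threshold


if __name__ == "__main__":
    free = percent_free(".")
    status = "PROBLEM" if needs_attention(free) else "OK"
    print(f"{free:.1f}% free: {status}")
```

The right threshold depends on the size of the partition, which is the
point Seg is making: the same percentage can mean very different amounts
of remaining space.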

Nagios just uses SNMP and port checks and a lot of other stuff,
uses all those tools to go out and check on your network and your
servers and report back and say what's good and what's bad.

Thud: Yeah, Nagios is very, very flexible because most of the monitoring
scripts themselves are written in Perl. But you can really write
them in anything. It does have SNMP checking, of course. We'll
actually go into a little bit more detail about SNMP a little later
on.

But for now I just want to say that Nagios is extremely flexible.
They even have their own daemon that can run on a server you need
to monitor so that you can run Nagios checks on the server you're
monitoring and have it report back to the Nagios server.

It's very very flexible and free like a lot of very good software
is.

Some of the other tools that we can use to monitor stuff, with
which I've had experience in the past for the network traffic
monitoring, are MRTG and Cacti. They can actually do more of the
network monitoring, but both of them are basically the same idea.

You monitor traffic from your switch ports or from your server
network interfaces. You can basically do trending with it. You get
a graph and it shows you, at this time this server had this much
traffic incoming and this much traffic outgoing.

It can be used to do SNMP mapping as well. Those tools are really
good tools and fairly easy to set up to do network monitoring as
far as trending goes, which is kind of what we want to get into
next.

When I talk about trending what I mean is simply not really
monitoring in the sense of something's happening so there is an
alert sent. But it's more monitoring, and just getting a feel of
what your servers normally do.

Seg, what you were talking about with the monitoring your traffic
spikes to know that your backup actually fired off is kind of the
same idea. If I'm watching a mail server, for example I can see the
trend that in the middle of the day, around noon, I get lots and
lots of email. It kind of ramps up from midnight to noon and then
tapers off to midnight, when I'm getting very little mail.

And it does that every day, Monday through Friday. Saturday it does
peak in the middle, but not nearly as much. I'm only getting 10
percent of the mail that I normally do.

If I can watch that, week after week and get used to that, when I
go back through and I'm reviewing my logs and I look at that graph
and I see that on Saturday we got four times more mail than we
normally do, that can be a trigger for me to look into something.
Not something I would necessarily want to get alerted for, woken
up and alerted at three o'clock in the morning, but definitely
something I need to go investigate.
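
Editor's sketch: one simple way to turn that kind of trend into a flag
for later review is to compare today's count against the average of the
same weekday in previous weeks. The function name, the factor of three,
and the sample numbers are all made up for illustration.

```python
from statistics import mean
from typing import Sequence


def is_unusual(history: Sequence[float], today: float, factor: float = 3.0) -> bool:
    """True if today's count is at least `factor` times the historical mean."""
    if not history:
        return False  # No baseline yet, so nothing to compare against.
    baseline = mean(history)
    return baseline > 0 and today >= factor * baseline


# Hypothetical message counts for the last four Saturdays.
previous_saturdays = [100, 120, 90, 110]
print(is_unusual(previous_saturdays, 420))  # four times the usual volume
print(is_unusual(previous_saturdays, 130))  # within the normal range
```

A check like this belongs in a daily report rather than a pager alert,
which matches the distinction Thud draws between trending and alerting.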

Seg: Trending is like higher level monitoring. If you're monitoring just
that the port is open, that's one thing. And getting .

Thud: Really, anything that you can monitor is something that you should
be monitoring. Anything could point out a potential problem. If you
find out that you're not monitoring swap space usage on a system,
you may not realize that you are over using your swap, or that
you're running out of RAM for a lot of your processes.

Monitoring everything on a system and the network as well is a good
way of spotting potential problems before they become big huge
fires that you have to worry about putting out.

Seg: Right, and the core of monitoring comes down to the whole fact that
a good Sys Admin spends a lot of his or her time on the server or
checking things out, upgrading, patching and doing good
housekeeping. But, no matter how good a Sys Admin is, they can't be
there all the time.

In the case where one or a small group of Sys Admins is responsible
for a large number of servers, you can't be in all those places at
the same time. So you need to set up monitoring to do the work for
you.

Of course it's always good to avoid hand hacking. Always write
programs to .

Thud: One of the other things that trending is really helpful in is
trying to figure out when your maintenance windows are. I need to
take down the entire mail system for maintenance, and I know they
work Monday through Fridays. So Saturday is probably a good time to
do it.

If I can go back and look at my mail traffic, and see the trending
that no, actually on Saturdays we have a lot of salespeople that
send out emails, maybe Sunday is going to be a better day for me.
That can kind of confirm whether or not the maintenance window is
really the lowest point in traffic so that you can do it at the
proper time and not disrupt anybody else.
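
Editor's sketch: once you have per-day totals from your trending data,
confirming the quietest day is a one-line lookup. The function name and
the message counts below are hypothetical.

```python
from typing import Mapping


def quietest_day(messages_by_day: Mapping[str, int]) -> str:
    """Return the day with the lowest total message count."""
    return min(messages_by_day, key=lambda day: messages_by_day[day])


# Made-up weekly totals where Saturday is busier than expected.
weekly_totals = {
    "Mon": 5200, "Tue": 5100, "Wed": 5300, "Thu": 5000,
    "Fri": 4800, "Sat": 900, "Sun": 150,
}
print(quietest_day(weekly_totals))
```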

OK, so let's get into a little bit more detail on SNMP. To kind of
start at the basic level for SNMP, SNMP is a protocol that involves
agents. In the UNIX world, the most common one is Net-SNMP.

The way you basically set it up is you have an SNMP agent on your
server that you want to monitor. The monitoring server, the one
that is actually doing the monitoring can do it in two ways.

The monitoring server can query the SNMP agent and say, "What is
your current processor utilization or your current available disk
space?" or that kind of stuff. The agent can report back, "This is
my current state."

The problem with that is that the monitoring server has to make the
request. That's normally the way graphing goes. You make the
request every five minutes and figure out an average and all that
stuff.
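
Editor's sketch: graphing tools of the kind discussed here typically
poll an ever-increasing traffic counter and turn two successive readings
into an average rate for the interval. The function below is a minimal
illustration of that arithmetic, not MRTG's or Cacti's actual code; the
32-bit counter width and the name bits_per_second are assumptions.

```python
def bits_per_second(prev_octets: int, curr_octets: int, seconds: float,
                    counter_bits: int = 32) -> float:
    """Average rate between two readings of a wrapping octet counter."""
    if seconds <= 0:
        raise ValueError("polling interval must be positive")
    # The modulo handles a single counter wrap between the two polls.
    delta = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta * 8 / seconds


# Two polls five minutes (300 seconds) apart.
print(bits_per_second(1000, 4000, 300))
```

Because the result is an average over the whole interval, a short burst
between two polls is flattened out, which is part of the problem Thud
raises next.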

But what happens if something horrible happens in between the five
minutes that you are scheduled to check? That's the other direction
in which SNMP can work. The SNMP agent itself can do what's called
a trap, which is basically sending an SNMP trap message back to the
monitoring server, whether it's the monitoring server itself or something
specifically set up to capture the traps. Those happen more or less
instantly.
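
Editor's sketch: the difference from polling is that the receiver just
waits for a message to arrive. SNMP traps travel over UDP, conventionally
to port 162, so the receiving side is at heart a UDP listener. The code
below only records who sent a datagram and when; it does not decode the
SNMP message, which a real receiver such as Net-SNMP's snmptrapd would
do. The function names are hypothetical, and port 0 (let the operating
system pick a free port) is used because binding to 162 normally needs
root privileges.

```python
import socket
from datetime import datetime, timezone


def open_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a UDP socket; port 0 asks the OS for any free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def wait_for_notification(sock: socket.socket, timeout: float = 5.0):
    """Block until one datagram arrives, or return None on timeout."""
    sock.settimeout(timeout)
    try:
        payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
    except socket.timeout:
        return None
    # Report the sender, the payload size, and the arrival time (UTC).
    return sender_ip, len(payload), datetime.now(timezone.utc)
```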

Seg: One example is that you can set up a trap to act on reboot. Now if
you are monitoring service ports every two to five minutes you are
saying, let's say every four minutes, Nagios go out and check
server one. Make sure that Apache is still running.

Nagios says, "Fine, I'll do that every four minutes. That's fine."

Well, you're going to have a delay of four minutes. And if that
server crashes or if it goes down, if it was rebooted you need to
know about that as soon as possible. Sometimes waiting those four
minutes, or that three and half minutes or those two minutes is
going to make an impact on your business.

You can configure your SNMP to send a trap when it shuts down. You
can also send the traps on many other things. But that's one real
world example of that. You tell SNMP, when you're going to die,
send out a trap to this IP address, which is configured to take
that trap and tell me about it.

That way you get a trap before you get any of the service
alarms. It's going to take one to four minutes before Nagios says,
"Hey, Apache's not running," or any of the other services aren't
running. You're going to know as soon as it happens with minimal
network delay. As soon as it happens, and you're going to have that
information fast and you're going to be able to work on it faster.

And I've used that in real world situations, it's been a big help,
since I work on Solaris a lot being able to get that as soon as it
happens... so SNMP is going down, this box is rebooting, I don't
have any other words for it, but I know the trap came through, it's
in the process of crashing.

So I'll log into the console on the box and watch what happens,
because getting that kind of information that early in the crash is
extremely vital. There's a lot of information you can get from
that, that you wouldn't be able to get from the logs, and so you're
able to get on right away, as it goes down and as it comes back up,
and this is very important.

Of course you can also configure SNMP to send traps when it starts
back up, you set a cold start trap: I started back up and it must be
a reboot, so send it off to the monitoring server and he can deal
with it.

Traps are also very important on boxes that reboot very fast. You
can have boxes that are very small, very... maybe only have one or
two processors, not a lot of RAM, but they're very quick in their
reboot process so if they can reboot in under four minutes or eight
minutes or however long it takes for Nagios to say this is a red
alert, you might never know about it unless you had a trap system
set up to tell you about it.

Thud: Well one of the other tricks for SNMP is on Windows boxes. If you
get, not the default SNMP agent, but some third party SNMP agents,
actually most of them, you can actually get those agents to send
traps for events in the Windows event log system.

That's something that's actually going to come in handy in one of
our later episodes when we talk about logs in more detail. It's a
very easy way to get events out of a Windows box into your
monitoring system if you're monitoring mostly Unix systems and you
have one or two Windows boxes that you need to monitor as well.
SNMP is... like I said it's a standard protocol so it's pretty
ubiquitous as far as the operating systems go.

Seg: And a trap is going to be much more efficient than watching the
logs. If you've got a remote logging server where your Windows
system logs to that log server, that's a whole lot of information,
potentially a whole lot of information that it needs to parse
through before it finds something that it can tell you about.

But in the case of a trap, it only happens when it actually
happens, and so there's not a lot of parsing, there's no overhead,
as soon as it happens you get the trap, you know what's going on.
And you're also removing the whole facility from those two
possibilities.

If you're logging to a log server you're going to have more
chances for something to break. You could have the log server go
down, there could be a lag, the processes that do the parsing could
go down, but if it's a trap it's straightforward. You know exactly
when it happens. Like this trap happened, here's what the trap was,
let's act on it.

Thud: You can probably tell SNMP has a lot of features and can be very
complicated. We're actually planning on going into a lot more
detail by doing just a standalone SNMP show, but for right now it's
just good to know that you can use it to monitor not only your
network but your systems as well.

[music]

Thud: This episode's with Seg. We're going to talk about private networks
and monitoring. The idea being that if you have a public network,
web servers, email servers, ftp servers... whatever, it's usually a
good idea and gives you a lot more flexibility in locking your
systems down if you have a separate network that is not routable,
it's usually one of the non-routable networks, 10., 172.16,
192.168... that is used specifically for logging onto your boxes to
manage them and to do system monitoring on.
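
Editor's sketch: the non-routable ranges Thud lists are the three
private IPv4 blocks from RFC 1918 (10.0.0.0/8, 172.16.0.0/12 and
192.168.0.0/16). A quick way to confirm that a management address really
falls inside one of them is shown below; the function name is_rfc1918 is
an assumption made for illustration.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """True if the address falls inside one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_NETWORKS)
```

Note that the 172.16 block only runs through 172.31, so an address like
172.32.0.1 is a public address even though it looks similar.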

You can also use it for backup where you can have a backup on a
completely different network. So Seg, what are some of the good
points about separating out a network like this?

Seg: I think the best possible use is for separating... having separate
networks... is guarding your systems from public intrusion, into
them... if you... the best example I've seen is you have a
collection of web servers, say, you have five web servers, they're
all using Apache and PHP and some SQL on a database server. But
none of these web heads have public IP addresses, they all have
private, so you have 192s or 10.s or whatever. They are isolated
from the public network by these private IP addresses.

And so the idea is that you monitor those private ones, you keep
all the monitoring traffic on the internal network, none of those
packets ever go out onto the Internet, they never touch a public IP
at all. So it's much more secure, it's also going to be more stable
and much faster.

And then to get traffic to those boxes you set up VIPs, or some
kind of NATing. You do some trick routing so that when packets
come in on the public switches, they are routed to those private IP
addresses and when they are coming back they're routed out
properly. A lot of ways to do that... it can get really
complicated, but just to talk about the examples of a network, a
monitoring network... I think that's the best possible example I've
seen so far.

If you're going to separate off your backups there's another great
example of that. By separating your backups you'll be able to
monitor bandwidth on those a lot easier, you're also going to be
able to do things that are kind of neat like doing a gig switch for
your backups, but 100 megs for your publics. Maybe you don't need a
gig switch for your publics. Maybe it'll cost more for every gig
switch that you use or every gig port that you use.

But you need to get the data off your system for a backup much
faster so you just spring for the extra bucks for a backup switch
that is a gig-E. And also by doing that, you're completely
isolating that traffic. If it's going on a completely different
port, it's on an isolated network that's only for backups, it's
much more secure, much more stable and it's going to be obviously a
lot faster.

[music]

Narrator: For show notes or other details, please visit our web site at
runyourownserver.org. If you would like to send us feedback, or
have questions you would like us to answer on the show, please
visit our forums at forums.runyourownserver.org. The intro music,
"I Like Caffeine" is by Tom Cody. This song, "Down the Road", is by
Rob Castle. Please visit our web site for links to their web site.
This podcast is covered under a Creative Commons License. Please
visit our web site for more details.