Episode 9, Backups
Make sure to send us feedback so we can make the show even better.
Links:
DisasterRecovery
Netbackup
Amanda
BackupPC
Thud's backup script
RYOS, Episode 9 - Backups
Thud: The RunYourOwnServer
podcast for September 8th, 2006.
Thud: In this episode, backups. Why would you want
to back stuff up? Tapes or disks? Local or remote?
Commercial, open source, or home-grown? And a
moment of seg.
This episode's reverse sponsor is BackupPC.
BackupPC is a high-performance, enterprise-grade
system for backing up Linux and Windows PCs and
laptops to a server's disk. BackupPC is highly
configurable and easy to install and maintain. You
can find more details at
backuppc.sourceforge.net.
OK, let's get started. Gek, why do you want to back
stuff up?
Gek: There are a
lot of different reasons that somebody might want
to back something up. Some of the obvious ones are:
If you are running a server, you're most likely
doing it because you have data that you need access
to or that you are trying to create, so you back
that stuff up so you don't lose it. The whole point
of having a server, in most cases, is to have data
on it. And although sometimes you may have multiple
web-heads pointing to the same database, you are
going to want to at least back up one of the
web-heads, so that if a web-head goes down you have
a copy of the website.
There are other reasons too. You could want to
back up data because you don't want to store it on
disk anymore, you want to just keep it on tape, and
should you ever need it down the road, you have it
on that tape. You might back it up on CDs too,
that's pretty common. That's about all I can think
of. Do you have anything that you can think of?
Thud: Well, really backing up
stuff protects against the two most dangerous
things in the computer industry. The first one
being bad hard drives. Even if you have RAID so
that you can survive a single hard drive failure,
I've seen situations where multiple drives go down,
or 50% of your drives go down, or even more, and
your data is all gone.
The other most dangerous thing in the computer
industry is stupid admins. Right now, I'm running
about 50/50. I've had data loss about 50% of the
time due to bad disks and about 50% of the time due
to a misplaced rm. I was in the middle of typing
something, forgot where I was, and ended up wiping
out a directory I didn't really mean to. It's
happened to just about every admin that I know. The
best thing to do: if you back stuff up all the time
and you do something like that, you don't have to
feel too bad, because you think, "Yeah, I did it in
the wrong place, I lost the data, but I can
recover. I can get it back pretty quickly."
OK, so there's different kinds of backup media.
There is copying data to another disk, so there is
hard drive backup media, the other one is CD or
DVD, which is optical, and the third one is tape.
Gek, what's your experience with those three?
Gek: Most of the clients
that I work with just back up to tape. There are
some where, like for databases it's necessary, they
don't want to take their database down, and most
databases have a way of dumping a copy of their
contents to disk. So what we do is, we dump to
disk, and then we back up the dump instead of
backing up the actual live database. Typically in
order to back up a database, you have to bring it
down, and most people don't want to do that.
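A minimal sketch of the dump-then-back-up approach Gek describes. It uses Python's built-in sqlite3 module purely as a stand-in; a real database server would use its own dump tool (mysqldump or pg_dump, for example). The function names and file names here are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def dump_database(live_db: Path, dump_file: Path) -> int:
    """Write a SQL text dump of live_db and return the statement count."""
    conn = sqlite3.connect(live_db)
    try:
        count = 0
        with dump_file.open("w", encoding="utf-8") as out:
            # iterdump() yields the SQL statements needed to recreate
            # the database, read through an ordinary connection.
            for statement in conn.iterdump():
                out.write(statement + "\n")
                count += 1
        return count
    finally:
        conn.close()


def restore_database(dump_file: Path, new_db: Path) -> None:
    """Rebuild a database from a dump written by dump_database()."""
    conn = sqlite3.connect(new_db)
    try:
        conn.executescript(dump_file.read_text(encoding="utf-8"))
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical demo: create a tiny database, then dump it to disk.
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "live.db"
        conn = sqlite3.connect(live)
        conn.execute("CREATE TABLE notes (body TEXT)")
        conn.execute("INSERT INTO notes VALUES ('hello')")
        conn.commit()
        conn.close()
        print(dump_database(live, Path(tmp) / "dump.sql"), "statements dumped")
```

The dump file, not the live database file, is what the regular backup job would then pick up.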
My ideal solution is actually a combination of the
two. What I really like to see is, you have some
kind of massive storage device, a storage array or
a SAN or something where you can dump files, and
then you back that up. I honestly believe that if
you have to go to a tape, then you're already in a
whole lot of trouble. Tapes go bad, they are slow
to restore. It's much faster if you can pull things
straight off of a disk, and you can always lose
tapes. Hard drives are typically mounted in
machines or arrays, and it's a harder thing to just
walk away with. But tapes, if you've got a massive
tape library that's got 400 tapes in it, if people
are swapping tapes in and out of the library
constantly, then it's real easy for a tape to get
misplaced and now you don't have that backup.
I prefer backing up to disk at least for your hot
backup. You want to have a copy of everything on
some other disk array and then back that up as you
need to, to make archives of things. How have you
seen things done, usually?
Thud: Well, I've seen it done
a lot of different ways. In my personal experience,
I try not to rely on tape. It tends to be slow
unless you have a massive infrastructure -- high
speed tapes, high speed backup server, high speed
network that the rest of the servers are connected
to so you can back up as quickly as possible. It's
real expensive to make tape fast, and the only
legitimate reason that I can think of for using
tape is if you need to move large amounts of data
off-site in a small medium. I think that they're
approaching one terabyte on a tape. I think right
now they're around 400, I have seen a couple of ads
for 500 gigs on a single tape.
It just makes it easy to get a lot of data
off-site. It's not something that, in my
experience, you can really rely on. Some of the
commercial products can actually make copies of
tapes, and I've even seen that fail. Like I said,
it's expensive to do fast, it's a lot cheaper just
to get a whole bunch of hard drives and back it up
there. Even if you back it up to the hard drive
initially, and then from there to tape so you have
different backup media, or from the secondary hard
drive to optical like CDs or DVDs.
I really like the idea of having backups on disk
because they are fast to access and it's more
readily available. I've seen times when we suddenly
realized we had to do a restore and the tape was
just shipped off-site. Now we have to wait for the
tape to be delivered at the storage facility and
then turned around the next day to be sent back;
now you are waiting two days for a restore that
should take ten minutes.
Gek: Another thing with
larger storage devices, like EMC filers. You can
get advanced features, where they'll make versions
of documents, so as things change, you actually get
kind of an archival backup. That alleviates some of
the concern that people have backing up to disk,
where, if you're just copying stuff to disk, you
don't have different versions of documents and you
can't go back. If you have to go back a month, then
that means you're going to tape. With some of the
more advanced file servers, you can actually just
roll back a file or roll back a directory and
recover things.
Thud: Okay, let's talk a
little about the difference between local and
remote storage. Local, of course, is local to the
system. So, basically, just backing up to another
hard drive in the system, or to another directory
in the system, for that matter. Remote being on a
remote server that's either in the same facility,
or it could be halfway around the world. Of those
two options, gek, which one do you prefer?
Gek: It depends. A lot of
customers that I work with need offsite storage. If
the building that they host their stuff in should
crumble, they need the ability to get their stuff
back - not necessarily back up and running, like a
failover site, but they do need access to the data.
For them, it makes sense to have something stored
offsite, and that usually means a different zip
code. If you do something like that, it's very,
very expensive, and it is the kind of thing that
you're not going to do it just because you think
it's a good idea. If you need that, it will be
obvious.
I like local storage for most situations. Really,
it is the most cost effective of the two. If you
need to store something offsite, you could always
use somebody like Iron Mountain to come pick up
backup tapes, or even spare hard drives, and keep
it offsite without setting up a remote site to do
backups to.
Thud: Yeah, I kind of agree
with that. Generally, like some of the home-brewed
backup groups I have for my personal servers, I do
an incremental backup every day to a local
directory, and once a week I do a full backup.
Because my servers don't change that often, usually
every couple of weeks I'll go download those and
remove them from the server. Just so I have them
here at home, so if I need them I can restore it.
The important thing for me is, I am trying to
protect against a drive failure, but the data isn't
that hard to replicate.
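The daily-incremental, weekly-full routine Thud describes could be sketched roughly as follows. This is not Thud's script: the function names, the choice of Sunday for the full backup, and the use of file modification times to decide what changed are all assumptions made for illustration.

```python
import os
import tarfile
import tempfile
import time
from datetime import date
from pathlib import Path


def is_full_day(day: date) -> bool:
    """Assumption: the weekly full backup runs on Sundays."""
    return day.weekday() == 6


def files_changed_since(source: Path, since: float) -> list[Path]:
    """Return regular files under source modified after the timestamp."""
    return sorted(
        p for p in source.rglob("*")
        if p.is_file() and p.stat().st_mtime > since
    )


def run_backup(source: Path, backup_dir: Path, last_run: float, full: bool) -> Path:
    """Write a full or incremental tar.gz into a local directory."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    kind = "full" if full else "incr"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{source.name}-{kind}-{stamp}.tar.gz"
    # A full backup takes everything; an incremental only takes files
    # modified since the previous run.
    cutoff = 0.0 if full else last_run
    with tarfile.open(archive, "w:gz") as tar:
        for path in files_changed_since(source, cutoff):
            tar.add(path, arcname=str(path.relative_to(source)))
    return archive


if __name__ == "__main__":
    # Hypothetical demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "site"
        src.mkdir()
        (src / "index.html").write_text("<h1>hi</h1>")
        yesterday = time.time() - 86400
        result = run_backup(src, Path(tmp) / "backups", yesterday,
                            full=is_full_day(date.today()))
        print("wrote", result.name)
```

A modification-time check like this does not notice deleted files, so a restore means unpacking the last full archive and then each incremental in order.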
If I were in a commercial environment, I would want
backups as often as possible. I would want them
local to the drive or local to the server, so that
I can restore quickly. I would also want them off
on another server somewhere, whether that's a
backup server with tapes or whether it's just
another NFS server that's mounted. I'd also want it
on tapes, so that I could easily move it around. If
I were making millions of dollars, of course I'd
want it offsite. I'd want a second site, where the
data gets sent to.
There are a lot of options when it comes to local
and remote storage. It just boils down to how
important is the data, how easy is it to replicate,
and how much you are going to lose if you're down
while you're trying to rebuild the data.
Now let's talk about commercial backup
applications. You have a lot more experience in
this arena than I do, gek, so why don't you tell us
a little bit about it.
Gek: My main experience has
been with Veritas: Backup Exec and NetBackup.
They're both pretty good products. NetBackup is a
much better product than Backup Exec.
I've found that, for my own stuff and when I've
been in charge of making the decision as to what
gets used for backups, it's just easier to use tar
or rsync and some scripts, because it allows you a
lot more flexibility. For instance, if you want to
run encrypted backups on NetBackup, you're talking
about buying the backup server, which is really
expensive software, then you have to buy an
encryption license for each client that you want to
do that on. It can rack up the dollars really
quickly.
If you wanted to do it with just open source and
regular tools, you could use tar and GPG, and you
could password protect or use a key to encrypt
backups. It costs you nothing; the software is
free, it's just a little bit of labor to set it up.
I actually used that in a situation where I was the
admin for a bunch of boxes.
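As a rough sketch of the tar-plus-GPG approach, the snippet below builds a compressed archive with Python's tarfile module and then hands it to the gpg command-line tool. It assumes gpg is installed and that a public key for the recipient is already in the keyring; the function names and the recipient address are hypothetical. For a passphrase instead of a key, gpg's --symmetric mode is the usual alternative.

```python
import subprocess
import tarfile
import tempfile
from pathlib import Path


def make_archive(source: Path, archive: Path) -> Path:
    """Create a gzip-compressed tar archive of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    return archive


def gpg_encrypt_command(archive: Path, recipient: str) -> list[str]:
    """Build the gpg command that encrypts archive to recipient's key."""
    encrypted = archive.with_name(archive.name + ".gpg")
    return [
        "gpg", "--batch", "--yes",
        "--encrypt", "--recipient", recipient,
        "--output", str(encrypted),
        str(archive),
    ]


def encrypt_archive(archive: Path, recipient: str) -> Path:
    """Run gpg, then delete the unencrypted archive."""
    subprocess.run(gpg_encrypt_command(archive, recipient), check=True)
    archive.unlink()
    return archive.with_name(archive.name + ".gpg")


if __name__ == "__main__":
    # Hypothetical demo: archive a throwaway directory and show the
    # gpg command that would be run (gpg itself is not invoked here).
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "etc"
        src.mkdir()
        (src / "hosts").write_text("127.0.0.1 localhost\n")
        arc = make_archive(src, Path(tmp) / "etc.tar.gz")
        print(" ".join(gpg_encrypt_command(arc, "backup@example.org")))
```

Encrypting to a public key means the server doing the backup never needs the secret needed to read it back.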
The commercial products are great if you're working
on a large scale. If you've got hundreds and
hundreds of servers, scripts are probably not the
best way to go; you definitely want to look at a
product like NetBackup. There are others out there;
I haven't worked with many of them. I've worked
with ARCserve a long time ago, I think I've seen
BrightStor. But, honestly, I've never seen a
product put in front of me that I liked better than
NetBackup when it comes to the commercial side of
the house.
Thud, what do you use typically if you're going to
try and go the open source route?
Thud: There are a couple of
projects out there. The most common one, and
probably best known, is AMANDA. I have used it in
the past. It's very flexible and easy to set up.
The one problem I had with it is that if you have a
tape library that's never been used with AMANDA
before, you're not going to get it to work. It
seems that the tape library controls are different
enough that they pretty much have to write it
specifically for it. There are some tape library
tools that are built in, to Linux for example, that
assist with that, but they always have issues. In
our particular case, and we were trying it, I don't
know, probably five years ago, the issue was that
it would eject the tape, command the robot to take
the tape and put it into a certain slot and it was
always off by one slot. So, if we told it to put it
in slot three, it would put it in slot four, even
if there was already a tape in it. So that caused a
lot of issues, but for doing small server backups,
it's not that big a deal. It seems to work great
for that, if you've got manual tapes or things like
that.
The other one that I used, and this was probably
the last time that I used a project backup
application, is called BackupPC. You can find it on
SourceForge, just search for BackupPC. It's a nice
little application that is designed for backing up
Windows and Linux servers. You can back up through
SCP or NFS or on the Windows servers with samba
shares.
It has a very interesting way of storing the
backups. Basically, if you do a full backup and
then it does incremental backups, the way that it
works is that every incremental backup has,
depending on how you do it, a hard link or a soft
link on the Linux server back to the files that
haven't changed. So, what you have is a directory
that is only the size of an incremental, but if you
go to restore from that, you can restore a full
backup from it. It simply links back to the things
that haven't changed. They don't take up any extra
space. If you have a one gig file that's on there
on every full, but it doesn't change throughout the
week when you're doing your incrementals, you only
have to store the one gig and then the six
additional hard links for it. That came in really
handy, because normally when you have to do a
restore, you have to do the full and then all the
individual incrementals after that. This way, you
can just do a full restore off your last
incremental.
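The hard-link idea can be illustrated with a small sketch. This is not BackupPC's code, and its real storage layout is more involved; the snapshot function and directory names below are hypothetical. Each snapshot directory looks like a full backup, but files unchanged since the previous snapshot are hard links to the copy already on disk. rsync's --link-dest option works along similar lines.

```python
import filecmp
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot(source: Path, dest: Path, previous: Optional[Path] = None) -> None:
    """Copy source into dest, hard-linking files unchanged since previous.

    Sketch only: symlinks and special files are not handled.
    """
    dest.mkdir(parents=True, exist_ok=True)
    for path in sorted(source.rglob("*")):
        rel = path.relative_to(source)
        target = dest / rel
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(path, old, shallow=False):
            # Unchanged: add another name for the data already stored.
            os.link(old, target)
        else:
            # New or changed: store a fresh copy.
            shutil.copy2(path, target)


if __name__ == "__main__":
    # Hypothetical demo: two snapshots of a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "data"
        src.mkdir()
        (src / "big.bin").write_bytes(b"x" * 1024)
        snapshot(src, Path(tmp) / "monday")
        snapshot(src, Path(tmp) / "tuesday", previous=Path(tmp) / "monday")
        links = os.stat(Path(tmp) / "tuesday" / "big.bin").st_nlink
        print("big.bin link count:", links)
```

Restoring from any snapshot directory is a plain copy, because every file is present there, either as its own data or as a link.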
It also has a nice web interface, so you can see
what's backed up and handle some of the
scheduling. It was a pretty
well thought out system. As it turned out, it works
OK if you're doing multiple servers; but for me
personally, because I only have a handful of
personal servers, I want the stuff backed up, but I
don't really care about it that much. It was just
overkill for what I do.
That kind of brings us into the next section which
is homegrown solutions, which is more or less what
I use for all of my stuff, using things like tar,
gzip, rsync, SCP, and a variety of other tools,
just to script together something that meets my
needs and works exactly the way that I want. Gek,
do you have anything that you've built like that?
Gek: Actually, I have a
couple of things I've used and one idea that I'm
kind of working on at the moment. My backups right
now, I just do a very simple rsync from my server
to a USB drive that I disconnect once the backups
are done.
But in the past, when I was running servers out on
the internet, and some at my dad's office, and I
had more servers here, I would really just use tar
and GPG to encrypt them. Then I would have scripts
that automatically went and copied all of the stuff
that was out on the internet, at my dad's office,
or at my hosting provider. I'd copy the backups to
my machine, and then my machine, I'd copy to my
dad's office, so that the stuff was in different
locations. And even if somebody came across one of
the files, they wouldn't be able to get to it,
because it was encrypted. Stuff like that is good
just for quick and dirty, but if you've got data
you really care about, I would not suggest, even if
it is encrypted, taking it somewhere that is
public, or could become public if it was breached.
One of the ideas I want to try and play around
with, is with rsync you have the ability to build a
list of files that it would have copied over, and I
was going to try and make some sort of incremental
backup using that list. How about you, thud, what
do you usually use?
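One way the rsync-list idea might look, as a sketch under stated assumptions: rsync is installed, its --dry-run and --out-format=%n options are used to print the names it would transfer, and directory entries (which end in a slash) are skipped. The function names and paths are hypothetical, and this is not a finished incremental backup tool.

```python
import subprocess
import tarfile
import tempfile
from pathlib import Path


def rsync_dry_run_command(source: str, dest: str) -> list[str]:
    """Build an rsync command that only lists what it would copy."""
    return ["rsync", "--archive", "--dry-run", "--out-format=%n", source, dest]


def parse_changed_files(output: str) -> list[str]:
    """Keep file names from rsync's output, dropping directory entries."""
    return [line for line in output.splitlines()
            if line and not line.endswith("/")]


def changed_files(source: str, dest: str) -> list[str]:
    """Ask rsync which files differ between source and dest."""
    result = subprocess.run(rsync_dry_run_command(source, dest),
                            check=True, capture_output=True, text=True)
    return parse_changed_files(result.stdout)


def archive_listed_files(source_root: Path, names: list[str], archive: Path) -> Path:
    """Pack only the listed files into an incremental tar.gz."""
    with tarfile.open(archive, "w:gz") as tar:
        for name in names:
            tar.add(source_root / name, arcname=name)
    return archive


if __name__ == "__main__":
    # Hypothetical demo that skips the rsync call and uses a fixed list.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "www"
        root.mkdir()
        (root / "index.html").write_text("<h1>hi</h1>")
        names = parse_changed_files("./\nindex.html\n")
        out = archive_listed_files(root, names, Path(tmp) / "incr.tar.gz")
        print(tarfile.open(out).getnames())
```

The destination passed to rsync would be the last full copy, so the list contains exactly what has changed since then.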
Thud: Well, I have a script
that I've kind of been working on over the years
and adding features onto it. I think the last one I
added was the ability to do encrypted backups, if I
wanted to. I'm actually kind of debating about
whether or not it's ready for public consumption. I
mean, I use it and I've never had any issues with
it. It's not really a complete backup system, which
kind of bothers me. It basically just creates a
directory with full backups and then incrementals,
and you have to have some other mechanism for
getting it off of the server. That's one of the
reasons why I'm not - I really don't want to
release it to the public, because people may think
it's the end-all, be-all.
Basically what I'll do instead of making it
available for the general public, if you're
interested in seeing it or want some more detail
about it, just send us an email at
Podcast at RunYourOwnServer.org and ask for my backup
script, and I'll send you a copy of it. It's not
very long. It doesn't need very many tools. It was
originally written for OpenBSD, but I know it works
on Linux and FreeBSD, but you just have to make
sure, for example, if you want to do the
encryption, you have to have GPG installed and
configured. But yeah, if you're interested, just
shoot us an email and I'll do it that way. It's not
something I really want showing up in Google. It's
not something I want to have to support later on.
Thud: This week
since we're on backups and we've mentioned it a
couple of times, let's go into a little bit more
detail about encrypted backups. Gek, why do you
think encrypted backups are important?
Gek: Well, if you follow the
news, I don't think you could have missed the
stories about the government losing backups of
sensitive data. If you've got data that you care
about and you don't want anybody to get it, you
need to protect your backups just as well as you do
your data and that's the bottom line.
If anybody has their backups, that might even be
be more valuable to them than what's live, because
now they've got a history of what's happened. And
really when you think about it, that's a ton more
data than just the way things are currently. If you
can see how things have been, you could even learn
things like people's upgrade cycles: How frequently
do they upgrade? When do they patch? There's a lot
of information besides just the actual data that
can be gleaned from somebody stealing somebody's
backup tapes.
And really it isn't that hard to get the data off
the tape. Once you have the tape, you don't
necessarily have to have the software even all the
time to read the data off of it. All I can say is
that if you care about the data enough to encrypt
it when it's not backed up, then it definitely
needs to be encrypted when you back it up. What
about you, thud?
Thud: Yeah, I'd have to
agree. The other thing to think about is that with
a lot of backup solutions, especially at the
enterprise level, a single tape holds, say, 400
gigs. If it has a bunch of incrementals on it, you
now have data from a lot of different servers on
that tape. So getting hold of
a tape is much more valuable than trying to hack
in, because if you hack in you might get one
server. If you get a tape, there might be ten or
fifteen servers on there. So, you know, it's very
important to encrypt it.
I'm even willing to go to the point of saying that
if you're in a commercial atmosphere, where you've
got any kind of data that is making you money,
whether you think it's personal information or not,
it should be encrypted. Every commercial package I
know of supports encryption. Most of the open
source do on some level. Or you can write your own.
With tar, gzip, GPG, and a shell like bash, you've
got a backup system that can encrypt the data.
It's very important to get all the data encrypted.
It's so easy, I just don't know why everybody
doesn't do it. Take the extra time to do the
backups properly, so you don't have to worry about
it when a tape goes missing.
Gek: Yeah, you have to ask
yourself: Is it going to cost you more to have to
pay to have the data encrypted now? Or is it going
to cost you more when you have to recover from the
data having been stolen?
[music, "Down the Road" by Rob Coslo]
Thud: For show notes, or
other details, please visit our website at
RunYourOwnServer.org.
[music continues]
Thud: If you would like to send us feedback, or
have a question you would like answered on the
show, please send an email to Podcast at
RunYourOwnServer.org.
[music continues]
Thud: The intro music, "I
Like Caffeine," is by Tom Cody. This song, "Down
the Road," is by Rob Coslo. Please visit our
website for links to their websites.
Thud: This podcast is covered under a Creative
Commons license. Please visit our website for more
details.
Transcription by CastingWords

