Tar and Compress
===============================================================================
Overview: Files sometimes get to be huge and take up too much disk space.
This tutorial shows how to pack or compress these files, and how to
bundle together all files in a directory so that they can be treated
as one file. Many tips and hints are given.
===============================================================================
Section Topic
------- -----
Introduction
Alternatives, exhortation to clean out one's old files
File compression, (compress and uncompress commands)
Zcat, a temporary uncompress
Tar, tape archiver command
Using tar, compress and uuencode together
Summary and quick reference
Introduction
------------
Quite often you need to conserve disk space. Rather than delete files, you may
wish to consider alternatives that allow you to either back up files on floppies
or cut down on space usage using compression.
To see how much disk space you are using, try
du -s ~
"du" is the disk usage summary command, and ~ stands for your home directory.
This will print out the number of disk blocks required for all your files. Here
each disk block is 1K, or 1024 bytes. (Some systems count 512-byte blocks
instead; there, du -sk reports in 1K units.) So if you see that you are using
68 blocks, you know that your files total 69632 bytes or less. You could find
the exact count with ls -l. The total may be less because UNIX allocates whole
disk blocks to files, so a 10-byte file takes as many disk blocks as a
1000-byte file, namely 1 block.
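The difference between a file's exact size and the space charged for it can be checked directly. The following is a minimal sketch, assuming a scratch directory from mktemp and a made-up file name (tiny); the usage figure it prints depends on the block size of the file system it runs on, so it may not be exactly 1.

```shell
# Compare a file's exact byte count with the disk space charged for it.
workdir=$(mktemp -d)                      # scratch directory, nothing real is touched
printf '0123456789' > "$workdir/tiny"     # a 10-byte file (hypothetical name)

bytes=$(wc -c < "$workdir/tiny" | tr -d ' ')          # exact size in bytes
kblocks=$(du -k "$workdir/tiny" | awk '{print $1}')   # space charged, in 1K units

echo "exact size: $bytes bytes"
echo "disk usage: $kblocks K"
rm -rf "$workdir"
```

For real use, replace the scratch file with one of your own files, or run du -sk ~ for the whole home directory.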
Another UNIX command to use is "df" which tells what percentage of
each file system is used up currently. A file system is a section
of disk blocks on a device that is considered to be one unit. A real
disk can have multiple file systems on it, which are set up by the
system administrator. If you are a student, your home directory
will be in /mnt, so look for /mnt in the right hand column. Next to
it will be a percentage, for example 74% on Wed. Sept. 8. This means that 74%
of all available blocks are currently allocated and used by users.
When this number goes over 90%, the system administrator starts
sending out mail!
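To look at just the file system that holds your own files, df can be given a directory name. This is a small sketch, assuming your home directory is in $HOME; the -k option asks for 1K units, and the percentage column is labelled "Capacity" or "Use%" depending on the system.

```shell
# Show usage for the file system that holds your home directory.
target=$HOME
[ -d "$target" ] || target=.      # fall back to the current directory
report=$(df -k "$target")
echo "$report"
```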
Alternatives
------------
The first thing to do is to consider if there are files that you don't
really need anymore. These would include executables or .o files.
The .o files really aren't needed once you have the final executable.
Use
ls -lR ~ | less
to get a long listing of ALL your files in all your subdirectories, piped
into "less" for convenience. (You can always use the shell escape from
inside "less" to execute a "rm" command.)
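If the long listing is too much to read through, find can pick out just the .o files. The sketch below builds a small scratch tree (the directory and file names are made up for illustration) and only lists what it finds; it deletes nothing except its own scratch directory.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/proj/sub"
touch "$workdir/proj/main.c" "$workdir/proj/main.o" "$workdir/proj/sub/util.o"

# List every .o file under the directory, with sizes, without deleting anything.
find "$workdir" -type f -name '*.o' -exec ls -l {} \;

count=$(find "$workdir" -type f -name '*.o' | wc -l | tr -d ' ')
echo "$count object files found"
rm -rf "$workdir"
```

To search your own account, put ~ in place of "$workdir" and leave out the final rm.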
A second alternative is to copy your files onto floppy diskettes, either
in Mac form or MS-DOS form. Use ftp for this (see the ftp tutorial.)
Then you can delete the files from your UNIX account, and if you need
them later, just ftp copies back.
Compression
-----------
File compression refers to the technique of using an alternate encoding
for the data in the file, one that takes fewer bits than the way the file
is currently stored. Textual data takes up a lot more room than it
should. For example, every letter is stored using 8 bits. But if your
text consists of nothing but lower case letters, blanks and periods, then
you could use 5 bits because 26+2=28, and you can encode the numbers 0 to
27 using 5 bits. Actually, you could also use 4 other characters, since
2 to the fifth power is 32. So throw in the question and exclamation
marks and two others! If you were able to use this method, then a 1000
byte file would go from using 8,000 bits down to 5,000 bits, a savings of
37.5%. That is, your new file would be only 62.5% the size of your old file.
Other compression techniques are often used, and in fact, the compress
command available on most UNIX systems, or available as public domain
software, uses a technique that often saves almost 70%! That is, the new
file is often about 30% the size of the original file. You could effectively
store about 3 times as much data using this method!
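The 5-bit arithmetic above can be worked through with shell integer arithmetic. This is only an illustration of the calculation, not of how compress works; the variable names are chosen for this sketch.

```shell
chars=1000                      # characters in the example file
alphabet=28                     # 26 letters + blank + period

# Find the smallest number of bits whose code count covers the alphabet.
bits=0
codes=1
while [ "$codes" -lt "$alphabet" ]; do
    codes=$((codes * 2))
    bits=$((bits + 1))
done

old_bits=$((chars * 8))
new_bits=$((chars * bits))
spare=$((codes - alphabet))                   # unused codes left over
permille=$((new_bits * 1000 / old_bits))      # new size in tenths of a percent

echo "bits per character: $bits (spare codes: $spare)"
echo "old size: $old_bits bits, new size: $new_bits bits"
echo "new file is $((permille / 10)).$((permille % 10))% of the original"
```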
You can compress any file you have, except one that is already compressed
and a few other types of files. But all source code files and all data
files can be compressed. The standard compress program actually erases
the old file and creates a new one with the same name, only ending in .Z
For example, suppose that you have a huge data file called experiment17
and you want to compress it.
% compress experiment17
Now if you do "ls", you will see experiment17.Z. This is the compressed
form of the file. DO NOT PRINT OR VI IT! It is unreadable by humans.
Now of course compression is no good unless you can recover the original
file exactly, and so you can. Just type "uncompress". You don't need
the .Z extension.
% uncompress experiment17
The old experiment17.Z is erased.
If you try to compress a compressed file, nothing happens. No new files
are created, and no old ones are changed.
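To convince yourself that nothing is lost, you can compress a file, uncompress it, and compare the result with a copy. The sketch below assumes a scratch directory and a generated sample file standing in for experiment17; it skips the demonstration if compress is not installed, which is the case on some newer systems.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a repetitive sample file (a stand-in for experiment17).
i=0
while [ "$i" -lt 2000 ]; do
    echo "sample measurement line $i"
    i=$((i + 1))
done > experiment17
cp experiment17 reference.copy          # kept only to verify the round trip

if command -v compress >/dev/null 2>&1 && command -v uncompress >/dev/null 2>&1; then
    before=$(wc -c < experiment17 | tr -d ' ')
    compress experiment17               # replaces experiment17 with experiment17.Z
    after=$(wc -c < experiment17.Z | tr -d ' ')
    echo "before: $before bytes, after: $after bytes"
    uncompress experiment17             # restores experiment17, removes the .Z file
    if cmp -s experiment17 reference.copy; then roundtrip=ok; else roundtrip=differs; fi
else
    roundtrip=skipped
    echo "compress/uncompress not installed; nothing to demonstrate"
fi
echo "round trip: $roundtrip"
```

How much the file shrinks depends entirely on its contents; very repetitive data like this sample compresses far better than typical files.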
Zcat
----
Sometimes we want to store huge files in compressed format but we need to
do things with them, like search them or use them for input to a program.
We could uncompress the whole file and then run the grep command or the
program that uses our file as data. But this takes up a lot of space.
Zcat to the rescue!
Zcat is a filter program that decompresses the file onto standard output,
so it can be piped to some other program for temporary use. The compressed
file itself is left as it is. The advantage is that we don't
have to use up extra disk space to access our file.
For example, if we want to print a compressed file, we could zcat it to the
printer:
% zcat experiment17.Z | lpr -Pqms
Or if we wanted to use it in a program:
% zcat experiment17.Z | a.out
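Searching works the same way. The following sketch, assuming a generated sample file and a made-up search pattern, counts matching lines straight from the compressed file; it is skipped if compress or zcat is missing.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate sample data (a stand-in for a real experiment file).
i=0
while [ "$i" -lt 500 ]; do
    echo "run $i result $((i * 3))"
    i=$((i + 1))
done > experiment17

if command -v compress >/dev/null 2>&1 && command -v zcat >/dev/null 2>&1; then
    compress experiment17
    # Count matching lines without writing an uncompressed copy to disk.
    matches=$(zcat experiment17.Z | grep -c 'result 30$')
    echo "matching lines: $matches"
    ls                                  # only experiment17.Z is on disk
else
    matches=skipped
fi
```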
Tar
---
Tar stands for "tape archiver" but its use has expanded and it is now used
as a general purpose archiver program. An archiver is a program that manages
a bunch of files that are stored together either to save space or to keep
everything in one real file so that it can be transmitted easily. Think of
an archiver as a kind of file librarian and the physical file is the library
with a bunch of logical files in it. For those acquainted with the MS-DOS
world, ZIP is a common archiver.
Often, a user creates an archive that contains all the files in a particular
directory. This archive is then compressed as one file and it is stored or
transmitted over the network. Let's examine how this is done.
First, let's explain the syntax of tar. Following the command name tar is
a set of one letter commands that say what to do, followed by several
filenames. Usually the one letter command group ends with "f", followed by a
blank, followed by the name of the file that functions as the archive file.
Suppose that you have a directory called EXPERIMENT with lots of data files
and program files in it. There may even be other subdirectories within
EXPERIMENT, and there may be executables, even compressed files in it.
To bundle all this together into an archive, do
% tar cf EXPERIMENT.tar EXPERIMENT
The "c" stands for "create" a new archive file. The directory is of course
EXPERIMENT, and the new file is called EXPERIMENT.tar. It is a very good
idea to name this archive file so that you can remember what is in it. I
usually use the name of the old directory, appended with ".tar" so that I
know it is a tar archive file.
Once you do this, you have a new file called EXPERIMENT.tar, probably a
very large file. The EXPERIMENT directory is not changed or hurt in any
way, although later you may want to delete it. (Usually you don't create an
archive file unless you want to delete the originals.) To delete an entire
directory and everything in it, do
% rm -rf EXPERIMENT
To see what files are in the archive, use the "t" command:
% tar tf EXPERIMENT.tar
This lists the files and all the subdirectories. If you want to see the
full information, like ls -l gives you, do
% tar tvf EXPERIMENT.tar
instead. The "v" letter stands for "verbose" and it can be used with most
other commands to give you lots more information.
Now you can have any number of tar files. There is no limit. But beware of
using too much disk space because every file in the directory is copied
into the tar file.
To recover the directory from the archive file and restore everything to
exactly the way it was, use "x":
% tar xf EXPERIMENT.tar
This "extracts" everything from EXPERIMENT.tar and rebuilds the directories.
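Sometimes you only want to extract one file or just one subdirectory. To do
this, put the name of the file or subdirectory on the tar command, written
exactly as "tar tf" lists it (including the leading directory name):
% tar xf EXPERIMENT.tar EXPERIMENT/data17
This just gets file "data17" and puts it back as EXPERIMENT/data17.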
Note that tar by itself does not compress anything. It only bundles files
together, which is useful in its own right.
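The create, list and extract commands can be tried safely in a scratch directory. This is a minimal sketch with made-up file contents; note that a member is named on the command line the same way the listing shows it.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir -p EXPERIMENT/sub
echo "seventeen" > EXPERIMENT/data17
echo "notes"     > EXPERIMENT/sub/notes.txt

tar cf EXPERIMENT.tar EXPERIMENT          # c = create, f = archive file name follows
listing=$(tar tf EXPERIMENT.tar)          # t = table of contents
echo "$listing"

rm -rf EXPERIMENT                         # safe here: the archive holds a copy
tar xf EXPERIMENT.tar EXPERIMENT/data17   # extract one member by its listed name
restored=$(cat EXPERIMENT/data17)
echo "restored: $restored"
```

Only data17 comes back; the sub directory stays in the archive until you extract it as well.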
Tar, compress, uuencode
-----------------------
Lots of data is stored on computer systems as compressed tar files. In
other words, someone has bundled together a directory of files into one
archive, and then compressed it. If there are compressed files already in
the directory, this process will not hurt them. You can have any kind of
files inside that directory, even other tar files, even other compressed
tar files.
Here's a sequence of steps one might use to archive and compress the
EXPERIMENT directory:
% tar cf EXPERIMENT.tar EXPERIMENT
% compress EXPERIMENT.tar
% rm -rf EXPERIMENT
Now all that you have is the single file EXPERIMENT.tar.Z. Don't lose it!
You can put it on a floppy disk (using ftp) or you could mail it. However,
tar files and compressed files contain binary codes which are unreadable
and may screw up the mail system.
There is a program that can create a readable file from an unreadable one
so that it can be sent via email. It is called uuencode, Unix to Unix
encode, because it was first used in primitive networks of UNIX systems.
The command is easy to use but you have to give a file name twice on the
command line: first the file to read, then the name the file should get
when it is decoded. Here's how you could encode EXPERIMENT.tar.Z:
% uuencode EXPERIMENT.tar.Z EXPERIMENT.tar.Z > EXP.encoded
The output is sent to stdout so you have to redirect it. In our example
above we redirected it into a new file called EXP.encoded. You can vi
this file, or email it. What you see won't make any sense, but it does
consist only of printable ASCII characters.
At the other end you can decode it by doing
% uudecode EXP.encoded
It will automatically make a new file EXPERIMENT.tar.Z.
A uuencoded file is more than a third larger than the original, so it is not
a good way to store a file like this for a long period of time. It is
really only meant as a way of sending a file through email.
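The size increase and the role of the two names can be seen in a short experiment. This sketch assumes a scratch directory and made-up names (sample.bin, restored.bin); it is skipped if uuencode is not installed, as on some newer systems where it comes in a separate package.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 3000 bytes of binary data standing in for a compressed tar file.
dd if=/dev/urandom of=sample.bin bs=1000 count=3 2>/dev/null

if command -v uuencode >/dev/null 2>&1 && command -v uudecode >/dev/null 2>&1; then
    # First name: file to read. Second name: what the decoded file will be called.
    uuencode sample.bin restored.bin > sample.encoded
    orig_size=$(wc -c < sample.bin | tr -d ' ')
    enc_size=$(wc -c < sample.encoded | tr -d ' ')
    echo "original: $orig_size bytes, encoded: $enc_size bytes"
    uudecode sample.encoded               # writes restored.bin
    if cmp -s sample.bin restored.bin; then result=ok; else result=differs; fi
else
    result=skipped
fi
echo "round trip: $result"
```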
If you have a file whose name does not give away that it is compressed
or tar'ed, use the UNIX file command:
% file mystery
and it will tell you if this file (mystery in this case) is a tar file or
a compressed file.
Remember to use zcat to save on space. For example, if you get a
compressed tar file and you want to get its table of contents (i.e. its
file listing), do
% zcat EXPERIMENT.tar.Z | tar tf - | less
The - is used to signify that the tar file is really coming from standard
input. The less command is simply there to keep the information from
scrolling off the screen before you can read it.
Summary & Quick Reference
-------------------------
If you have a directory of files and you wish to archive and compress that
directory in order to save space, put yourself in the directory immediately
above the one you wish to archive. Let's call the directory you wish to
archive TARGET. Then make an archive file using "tar". Compress that
archive file and you're done.
% tar cf TARGET.tar TARGET
% compress TARGET.tar
The new file will be TARGET.tar.Z. Normally, you would then remove the TARGET
directory and all its contents:
% rm -rf TARGET
Optionally you could uuencode the file if you wanted to send it via email to
somebody:
% uuencode TARGET.tar.Z TARGET.tar.Z > TARGET.encoded
....at the other end.....
If you receive a uuencoded file, just do
% uudecode TARGET.encoded
This makes a new file: TARGET.tar.Z. Next, decompress and finally untar the
file:
% uncompress TARGET.tar <--these 2 commands are "step X"
% tar xf TARGET.tar
The new directory TARGET will exist. The old archive file still exists and
uses up space, so normally you would delete that:
% rm TARGET.tar
As an alternative, if you are really cramped for space, don't fully uncompress
the file before untarring it, but rather use zcat in a pipeline:
% zcat TARGET.tar.Z | tar xf -
This command replaces "step X" above.
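Since rm -rf cannot be undone, a cautious variation is to remove the directory only after the archive has been checked. The sketch below wraps the steps in a shell function; pack_dir is a hypothetical helper name, not a standard command, and the demonstration runs in a scratch directory and is skipped if compress or zcat is missing.

```shell
# pack_dir DIR: archive and compress DIR, removing it only if every step worked.
# (pack_dir is a made-up name for this sketch.)
pack_dir() {
    dir=$1
    tar cf "$dir.tar" "$dir" || return 1
    tar tf "$dir.tar" >/dev/null || return 1    # archive must list cleanly
    compress "$dir.tar" || return 1
    [ -s "$dir.tar.Z" ] || return 1             # compressed file must exist
    rm -rf "$dir"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir TARGET
i=0
while [ "$i" -lt 200 ]; do
    echo "line $i of data"
    i=$((i + 1))
done > TARGET/data

if command -v compress >/dev/null 2>&1 && command -v zcat >/dev/null 2>&1; then
    # Pack, then unpack again through the zcat pipeline.
    pack_dir TARGET && zcat TARGET.tar.Z | tar xf -
    if [ -f TARGET/data ] && [ -f TARGET.tar.Z ]; then status=ok; else status=failed; fi
else
    status=skipped
fi
echo "status: $status"
```

If any step fails, the function stops before the rm, so the original directory is still there.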