Advanced File Concepts
=============================================================================
Overview: Files are very important in UNIX and there are many commands that
deal with them. In this tutorial we discuss links, hard and
soft, filesystems, and the important sort command.
=============================================================================
Section  Topic
-------  -----
1        Giving a file more than one name using "ln"
2        Hard versus soft links and filesystems
3        Sorting files
Giving a file more than one name using ln
-----------------------------------------
Sometimes you want what amounts to a second copy of a file without actually
duplicating its contents.
That is, you would like a file to either have two different names, or to
appear in two different directories, yet you do not want to waste disk space
by having two separate, complete copies of the file. UNIX provides a command
to allow this.
In UNIX, a "link" is a pointer to a group of blocks on disk that contain
data. Directories are then just files that contain a list of names and links.
When a program or a command is given a filename, it looks up the name in the
directory, gets the associated link, and then instructs the file system to
retrieve the blocks from disk. Making a "ghost copy" of a file, then, merely
involves creating another entry in a directory, using the same link.
To make a link to a file in the same directory, one would do
% ln anyfile zfile
Here "anyfile" is the name of the existing file, and "zfile" will be the name of the
ghost copy. Now, any command that is given "anyfile" or "zfile" will behave
identically, since these are just synonyms or aliases for the same group of
disk blocks. This is also dangerous, since any change to anyfile is reflected
in the contents of zfile immediately; they are the same file.
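As a minimal sketch of the point above, the following commands link zfile to
anyfile and show that text appended through one name is visible through the
other. The scratch directory (made with mktemp -d, which we assume is
available) and the file contents are invented for illustration.

```shell
# Sketch only: the scratch directory and file contents are illustrative.
dir=$(mktemp -d)               # private scratch directory
cd "$dir" || exit 1
echo "first line" > anyfile
ln anyfile zfile               # second name for the same disk blocks
echo "second line" >> zfile    # append through the new name
via_old=$(cat anyfile)         # read back through the old name
printf '%s\n' "$via_old"       # both lines appear
cd / && rm -rf "$dir"          # clean up
```

Had we used cp instead of ln, the appended line would have shown up in only
one of the two files.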
Links are specified by large numbers, called i-node numbers. These i-node
numbers are used in another table by UNIX to locate the actual disk blocks.
To find out what the i-node numbers are, use the -i option of "ls":
% ls -i
85325 a.out
91053 anyfile
193503 beta-test
91053 zfile
Notice that anyfile and zfile have the same i-node number, making them the
same file. Within one filesystem, comparing i-node numbers is the definitive
way to discover whether two names actually refer to the same file on disk.
We would say that the file whose i-node is 91053 has two links to it.
The terminology is a bit vague, since we have been used to thinking of files
as entities that are uniquely identifiable by their names.
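The comparison can also be done by a script. The sketch below, which assumes
mktemp -d and awk are available and uses made-up file names, pulls the i-node
number out of "ls -i" for a hard link and for an ordinary copy and compares
them.

```shell
# Sketch only: file names are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
echo data > anyfile
ln anyfile zfile               # hard link: shares the i-node
cp anyfile copy                # real copy: gets an i-node of its own
inode_any=$(ls -i anyfile | awk '{print $1}')
inode_z=$(ls -i zfile | awk '{print $1}')
inode_copy=$(ls -i copy | awk '{print $1}')
if [ "$inode_any" = "$inode_z" ]; then same=yes; else same=no; fi
echo "anyfile and zfile are the same file: $same"
echo "i-node of anyfile: $inode_any, i-node of copy: $inode_copy"
cd / && rm -rf "$dir"
```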
When a file is removed (using the rm command), its disk blocks are freed
up and put into a pool for UNIX to reuse. However, if you remove a file that
has more than one link to it, the file remains on disk, although the name
vanishes from the directory. UNIX keeps a count of how many links each file
has; each ln command increases that by 1 and each rm decreases it by 1.
When the link count goes to 0, UNIX then marks all the disk blocks as freed
and ready to recycle.
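You can watch the link count change. In the long listing produced by "ls -l",
the second column is the number of links. The sketch below (scratch directory
and file names are assumptions for illustration) records that column before
and after ln and rm.

```shell
# Sketch only: file names are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
echo data > anyfile
count_start=$(ls -l anyfile | awk '{print $2}')     # one name so far
ln anyfile zfile
count_linked=$(ls -l anyfile | awk '{print $2}')    # ln added a link
rm anyfile                                          # removes one name only
survivor=$(cat zfile)                               # data is still there
count_after_rm=$(ls -l zfile | awk '{print $2}')
echo "links: $count_start, then $count_linked, then $count_after_rm"
cd / && rm -rf "$dir"
```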
A more typical use of ln is to insert the name of a file into more than
one directory. Suppose that we have a file called "C-tutorial" in our
home directory, and we want to have an identical copy in /usr/local/doc.
Thus we could do the following:
% ln C-tutorial /usr/local/doc
(This presupposes that we have write privilege to /usr/local/doc, or that we
are acting as superuser.)
After the above command, there will still be just one set of disk blocks
with the information, but there will be entries in two different directories.
The names of the file are:
~/C-tutorial
/usr/local/doc/C-tutorial
where ~ is our home directory.
The ln command has almost the same syntax as the "cp" command, which makes
sense since they are so similar. However, their options are different.
You can also link many files at once. Here's an example of linking all
the tutorial files in a distant directory into our current directory:
% ln /usr/local/doc/TUTORIALS/* .
The dot (.) refers to our current directory, which may or may not be our
home directory.
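Here is a small, self-contained version of the same idea. The directory
layout and file names below are invented, and both directories are created
under one scratch directory so that they are certain to be on the same
filesystem.

```shell
# Sketch only: directory and file names are illustrative.
dir=$(mktemp -d)
mkdir "$dir/TUTORIALS" "$dir/work"
for name in intro files shells; do
    echo "tutorial on $name" > "$dir/TUTORIALS/$name"
done
cd "$dir/work" || exit 1
ln "$dir"/TUTORIALS/* .        # one new directory entry per source file
linked=$(ls | wc -l | tr -d ' ')
echo "linked $linked files"
ls -i . "$dir/TUTORIALS"       # matching names show matching i-node numbers
cd / && rm -rf "$dir"
```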
Hard versus Soft Links and filesystems
--------------------------------------
The links that we have been making above are called "hard links". In
UNIX, the disks are divided into a number of groups of blocks called
"filesystems". One physical disk drive may have just one filesystem, or
it may have many. The way to display the various filesystems on the computer
is by using the "df" command:
% df
Filesystem   kbytes    used   avail  capacity  Mounted on
/dev/sd0a     30767    9415   18276     34%    /
/dev/sd0g    492527  326179  117096     74%    /usr
/dev/sd0h    327327   33497  261098     11%    /var
/dev/sd1g    492527  361980   81295     82%    /usr1
/dev/sd1h    327327  212765   81830     72%    /export
/dev/sd2g    916334  644009  180692     78%    /usr2
/dev/sd4g    613252  491115   60812     89%    /mnt2
/dev/sd5g    613252  489808   62119     89%    /mnt
/dev/sd6g    613252  434519  117408     79%    /mnt1
Notice that each filesystem resides on a specific disk, whose name is
given by /dev/... and that it has a fixed size in kilobytes. This cannot
be changed except by root and only after rebooting UNIX, so it is not often
done. The df command also reports how many kilobytes are in use and what
percentage of each filesystem is in use. When this reaches 100%, users cannot
write new files to that filesystem, which may be catastrophic. (A kilobyte is
1024 bytes.)
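If you only care about the filesystem that holds one particular directory,
you can give df a path. The sketch below assumes a df that accepts the -k
(kilobyte units) and -P (one line per filesystem) options, and a mount point
name without blanks; the column positions used by awk depend on that layout.

```shell
# Sketch only: relies on the -P output layout of df.
df -kP .                                            # filesystem holding "."
capacity=$(df -kP . | awk 'NR == 2 { print $5 }')   # fifth column: capacity
mount_point=$(df -kP . | awk 'NR == 2 { print $6 }')
echo "current directory is on $mount_point, $capacity full"
```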
Filesystems in UNIX are sort of like drives in MS-DOS. For example,
there is usually the A: drive, and the B: drive, and perhaps even C: and
D:. Each drive is independent to a degree, with its own set of files and
i-node tables. This is done primarily for fault tolerance; in case one
of the disk drives goes bad, then not all the files are inaccessible.
The problem in UNIX is that i-node numbering starts over in each filesystem,
so it is impossible to hard link a file that resides in one filesystem into
another: the same i-node number may already be assigned to a different file
in the other filesystem, and the wrong disk blocks would be accessed.
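UNIX, beginning with Berkeley UNIX, solved this by introducing soft
links (also called symbolic links), which link by name rather than by i-node
number. To make a soft link, use the -s option. Give the full path of the
existing file, because a relative name is looked up relative to the directory
that holds the link, not the directory you typed the command in:
% ln -s ~/myfile /usr/local/doc/C-tutorial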
If you try to make a hard link, by omitting the -s, and the files are
on different filesystems, UNIX will tell you so and refuse to make the
link.
The way that UNIX makes a soft link is to create a small file with
the name of the actual file in it. Thus, after the above ln command is
executed, there is a file called /usr/local/doc/C-tutorial that holds nothing
but the path name of the original file. This uses a little extra space on
disk, however.
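The sketch below shows a soft link at work, and also one consequence of
linking by name: if the original file is removed, the link is left pointing
at nothing. The "home" and "doc" directories here are stand-ins created in a
scratch directory, not the real /usr/local/doc.

```shell
# Sketch only: directory and file names are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
mkdir home doc
echo "C notes" > home/myfile
ln -s "$dir/home/myfile" doc/C-tutorial    # the link stores this path name
through_link=$(cat doc/C-tutorial)         # reading follows the link
if [ -L doc/C-tutorial ]; then is_link=yes; else is_link=no; fi
ls -l doc/C-tutorial                       # listing ends in "-> <path name>"
rm home/myfile                             # remove the original file
if [ -e doc/C-tutorial ]; then dangling=no; else dangling=yes; fi
echo "is a soft link: $is_link, dangling after rm: $dangling"
cd / && rm -rf "$dir"
```

Compare this with a hard link, where removing one name leaves the data
reachable through the other.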
You can also link a directory by making a soft link. In fact, UNIX
does not allow you to make a hard link to a directory. For details on this
see the lesson on directories.
Sorting a file
--------------
Files often contain datasets, which are organized data often representing
measurements instead of documents with text. One of the most common things
that you need to do is to sort those dataset files, and there is a UNIX
command to do this for you. The UNIX sort command uses a very fast algorithm
(quicksort in some implementations, merge sort in others) and it provides
many options, so it is preferable to use it rather than your own sort
program, especially if the file you wish to sort is huge.
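Sort takes a filename and writes the new sorted file to stdout, so you
must use redirection if you wish to capture it in a file. You cannot
redirect the output back onto the input file (sort mydata > mydata), because
the shell empties mydata before sort gets to read it; sort is a UNIX filter
command. You can always redirect to a new file and then rename it: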
% sort mydata > mydata.sorted
% mv mydata.sorted mydata
Sort will also accept more than one file name at a time. It pretends that you
gave it one very long file:
% sort UnSorted1 UnSorted2 > newsorted
There are many options to sort; -o lets you specify an output file name.
% sort -o newsorted UnSorted1 # output file
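Unlike redirection, -o may name the input file itself, because sort reads
all of its input before it opens the output file. A minimal sketch, with
invented file contents and LC_ALL=C set so that the ordering is plain ASCII:

```shell
# Sketch only: file contents are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'pear\napple\nfig\n' > mydata
LC_ALL=C sort -o mydata mydata     # output file is the same as the input
in_place=$(cat mydata)
printf '%s\n' "$in_place"
cd / && rm -rf "$dir"
```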
Sort can also be used to merge two already sorted files. Use the -m
option:
% sort -m sorted1 sorted2 > newsorted
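Here is the merge on two tiny files that are each already in order. The file
contents are invented, and LC_ALL=C is set so that the ordering is plain
ASCII.

```shell
# Sketch only: file contents are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'a\nc\ne\n' > sorted1       # already sorted
printf 'b\nd\nf\n' > sorted2       # already sorted
merged=$(LC_ALL=C sort -m sorted1 sorted2 | paste -s -d ' ' -)
echo "$merged"                     # lines from both files, interleaved
cd / && rm -rf "$dir"
```

Merging is cheaper than a full sort, but -m assumes the inputs really are
sorted; it does not check them.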
At this point, you may be wondering how sort knows what part of the
file to sort on. A dataset file usually consists of a number of identically
formatted lines, sometimes called records, with each line consisting of one
or more fields. A field usually represents one item of information, and sort
imagines that any sequence of non-blank characters between blanks (or at the
beginning of the line and the end of the line) is one field. Here's a simple
dataset file:
Mark 37 Buffalo Valentine
Kathy 27 Greeley Ainsworth
Sally 56 Lincoln Valentine
Doran 32 Greeley Berkeley
Anthony 1 Greeley Greeley
Of course the kind of data that this describes is up to the owner of the data,
and is of no real concern to us. Suffice it to say that the first field of
each line contains names of people, the second field their age, the third
field the town where they currently live, and the last field the town where they
were born.
Sort normally sorts based on the entire line. That is, the entire line
is the key. However, you can ask it to sort on one or more fields. These
are specified by using + and - followed by field numbers. Here's how we
would sort the above on their names, the first field:
% sort +0 -1 people
while the following sorts by age.
% sort +1 -2 somefile
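The fields are numbered beginning at 0, so the two numbers tell which field
to start at and which one to stop before. +0 -1 means start at field 0 and
use all fields up to but not including 1. Current versions of sort write the
same keys as -k 1,1 and -k 2,2, with fields numbered from 1, and some of them
no longer accept the + and - form.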
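The sketch below runs both sorts on the sample dataset. It uses the -k
notation (fields numbered from 1) rather than the + and - form, since some
current versions of sort accept only -k; -k 1,1 selects the same key as
+0 -1. The file name "people", the trailing n (numeric) on the age key, and
the use of cut and paste to show just the names are choices made for this
illustration.

```shell
# Sketch only: uses -k, the newer spelling of the +/- key notation.
dir=$(mktemp -d)
cd "$dir" || exit 1
cat > people <<'EOF'
Mark 37 Buffalo Valentine
Kathy 27 Greeley Ainsworth
Sally 56 Lincoln Valentine
Doran 32 Greeley Berkeley
Anthony 1 Greeley Greeley
EOF
# Key is field 1 only (same key as +0 -1).
by_name=$(LC_ALL=C sort -k 1,1 people | cut -d' ' -f1 | paste -s -d ' ' -)
# Key is field 2 only, compared as a number.
by_age=$(LC_ALL=C sort -k 2,2n people | cut -d' ' -f1 | paste -s -d ' ' -)
echo "by name: $by_name"
echo "by age:  $by_age"
cd / && rm -rf "$dir"
```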
You can also reverse the order, sorting from largest to smallest.
Here's how you would sort in reverse order from the 5th field to the end of
the line (if you omit the - number, then the sort key is the starting field
to the end of the line.)
% sort -r +4 somefile
Another option, -c, allows you to check if the file is already sorted
without changing it or producing any output file:
% sort -c somefile
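The answer from -c comes back as the exit status, zero if the file is in
order and nonzero if it is not, which makes it convenient in scripts. A
minimal sketch with two invented files:

```shell
# Sketch only: file names and contents are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'a\nb\nc\n' > ordered
printf 'b\na\nc\n' > jumbled
if LC_ALL=C sort -c ordered 2>/dev/null; then ordered_ok=yes; else ordered_ok=no; fi
if LC_ALL=C sort -c jumbled 2>/dev/null; then jumbled_ok=yes; else jumbled_ok=no; fi
echo "ordered is sorted: $ordered_ok"
echo "jumbled is sorted: $jumbled_ok"
cd / && rm -rf "$dir"
```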
Sometimes you need to treat the characters in a field as numerals, rather
than as ASCII characters. -n is the way to get sort to treat the data as
numeric:
% sort -n +3 somefile
For example, "-45" as a character string would be considered less than "-55"
because character 4 in ASCII is less than character 5, whereas numerically
-45 is greater than -55.
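To see the difference, sort the same few numbers both ways. The values below
are invented, and LC_ALL=C is set so that the character comparison is plain
ASCII.

```shell
# Sketch only: the numbers are illustrative.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' -45 -55 10 9 > numbers
as_text=$(LC_ALL=C sort numbers | paste -s -d ' ' -)       # character order
as_number=$(LC_ALL=C sort -n numbers | paste -s -d ' ' -)  # numeric order
echo "as text:    $as_text"
echo "as numbers: $as_number"
cd / && rm -rf "$dir"
```

As text, "10" also lands before "9", because the character 1 is less than
the character 9.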
Not all files have fields that are separated by blanks. -t is the way
to specify that some other character is to be used as a field separator:
% sort -t: +2 -5 somefile
The colon is very commonly used in UNIX to separate fields, as in the password
file.
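As a sketch, here is a sort of a small file laid out like the password file,
ordered numerically by its third field. The account lines are made up, and
the key is written with -k (fields numbered from 1) because some current
versions of sort accept only that form.

```shell
# Sketch only: the account entries are invented.
dir=$(mktemp -d)
cd "$dir" || exit 1
cat > accounts <<'EOF'
mark:x:1003:100:Mark:/home/mark:/bin/sh
kathy:x:1001:100:Kathy:/home/kathy:/bin/sh
doran:x:1002:100:Doran:/home/doran:/bin/sh
EOF
# Colon separates fields; the key is field 3 only, compared as a number.
by_uid=$(LC_ALL=C sort -t: -k 3,3n accounts | cut -d: -f1 | paste -s -d ' ' -)
echo "$by_uid"
cd / && rm -rf "$dir"
```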
There are many other options to sort, which you can learn about in the
man page.