More about files in UNIX

============================================================================= 
Overview:  Continuing to learn about files, we talk about protections, how to
           make files public, searching files for strings, comparing 2 files
           to see if they are the same, and so forth.  Then we conclude with
           a discussion of filter commands and input/output redirection, and
           give several examples of common filters.
=============================================================================

Section                            Topics
-------                            ------

   Special file names
   Wildcard Characters
   Privileges and protection
   Using chmod to give or revoke permissions
   Searching using grep
   Comparing files
   Spelling
   Filter commands


Special file names
------------------
 
     Sometimes we want to have some files which do not show up when we do "ls".
UNIX gives us a way to do this by using a file name that starts with a period
such as .secret.  This can be very handy when you want to keep backup copies
or other auxiliary files around, but don't want them cluttering the output of
"ls".  You can still list these files by using the -a option:

     % ls -a

The -a means all files, even those beginning with a period.

     Directories' names can also begin with a period, making them ordinarily
invisible when doing an "ls".

     Filenames can contain other special characters, including dashes, numbers,
asterisks and blanks.  Names containing characters that mean something to the
shell, such as asterisks and blanks, present problems unless the name is
enclosed in double quotes.
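
     The following is a minimal sketch of both ideas.  It assumes a scratch
directory made with "mktemp -d" (any empty directory will do), and the file
names "visible" and "my notes" are made up for the illustration.

```shell
# Work in a scratch directory so none of your own files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# "my notes" contains a blank, so it is quoted to keep it one name.
touch .secret visible "my notes"

plain=$(ls)       # names starting with a period are left out
all=$(ls -a)      # -a also lists .secret (plus . and ..)

echo "ls shows:";    echo "$plain"
echo "ls -a shows:"; echo "$all"

# Without the quotes, rm would look for two files, "my" and "notes".
rm "my notes"
```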


Wildcard Characters
-------------------

     Often you will want to do the same command to every file in a directory,
or every file that begins with S or every file that ends in ".c".  UNIX provides
some shortcuts to enable you to do this.  

     The asterisk character is used to match file names.  For example, to apply
wc to every file that ends in .c, you could do

     % wc *.c

The asterisk stands for 0 or more characters, so "*.c" matches any file name in
the current working directory whose name ends in ".c".  
 
     To remove every file in the current directory, do

     % rm *

However, the asterisk does not match a filename that starts with a period,
which is a safety precaution to keep you from deleting the parent directory
(whose name is "..").

     The asterisk is one of several wildcard characters in UNIX, so named
because they are like jokers in card games, which can be used for any card
you need.  Other wildcards are discussed in the "Advanced UNIX -- Shell"
tutorial.

     Wildcards are actually handled by the shell, not by UNIX itself nor by
the individual commands.  Thus, when you type in a command with a wildcard,
such as

     % wc *.c

the shell first forms a list of all file names in the current working
directory.  Then it selects from this list those names that end in .c, and
finally it sends this list to the wc command.  Thus, wc sees something like:

     % wc main.c sub1.c z.c

if these were the names of the three .c files in the current directory.
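
     One way to watch the shell do this is to hand the wildcard to "echo",
which does no matching of its own and simply prints the arguments it was
given.  This is only a sketch: the scratch directory and the extra files
"notes.txt" and ".hidden.c" are assumptions added for the demonstration.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch main.c sub1.c z.c notes.txt .hidden.c

# The shell replaces *.c with the matching names before echo ever runs.
expanded=$(echo *.c)
echo "$expanded"          # prints: main.c sub1.c z.c

# Inside quotes the shell leaves the asterisk alone, so echo sees it as is.
quoted=$(echo '*.c')
echo "$quoted"            # prints: *.c
```

Notice that ".hidden.c" is not in the list, since the asterisk does not match
a leading period.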


Privileges and protection
-------------------------

     Every computer system needs a way of protecting users but at the same
time allowing them to share work and resources.  UNIX has a scheme that allows
users to set protections on their files to either open them up to or close them
off from the rest of the UNIX users.  There is also a way to allow a subset of
users, called a group, access to files without also allowing the whole UNIX 
community access.

     In UNIX there are three classifications of users, with their abbreviations:

    user    u   --  the owner of the file
    group   g   --  a group that the file and its owner are in
    other   o   --  also called "the world", everyone on the system

The owner of a file can change privileges on a file for these classes of users 
by using the chmod command. The privileges are, along with their abbreviations:

    read    r   --  file can be read (copied, listed, compiled, etc)
    write   w   --  file can be written, i.e. modified
    execute x   --  file can be executed, i.e. run as a command
 
     The privileges of a file are displayed in a rather cryptic form.  They are
shown as a nine-letter string, divided into three groups.  The first three
letters stand for the user's privileges, while the next three are for the
group.  The last three letters specify the privileges for all other users.
Within each letter slot, there can only be r, w, x, or a dash, which means
no privilege.  For example:
 
     rwxr-xr--
 
In the above string of permissions, the owner (user) can read, write or execute
the file.  The group can read or execute, but not write.  The world (others) can
only read the file.   Here's another way to mentally break down this to 
discover who can do what:

     rwx   r-x   r--         <-- read, write, execute
      u     g     o          <-- user, group, others

If a file had all privileges turned on, it would look like
 
     rwxrwxrwx
 
Note that if any one of these is turned off, it becomes a dash.  In any one
letter slot, there can only be either a dash or the letter shown above.  For
example, the following is NOT legal:
 
     rwwxxx---     (you will never see this)


Using chmod to give or revoke permissions
-----------------------------------------

     First off, every file and every directory has an owner, which you can see 
if you do "ls -l".  On most systems only the superuser can change the ownership
of files -- you can't give a file away to somebody, although you can make it
public so they can copy it into a new file of their own.

     To make a file public so that others can access it, or to make it private,
use the chmod command.  The protection levels are sometimes called the "mode" of
a file, so "chmod" stands for "change mode".

     The chmod command specifies the class of user followed by a plus or minus,
to turn on or turn off the corresponding privilege, followed by the privilege.
For example, to let the owner, group and world read the file "syllabus", the 
following would work:
 
     chmod ugo+r syllabus
 
     As mentioned above, the letter u stands for "user" or owner. The letter g 
stands for "group" and the letter o stands for "others".

     To make sure that the world cannot write the file, do
 
     chmod o-w syllabus
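
     Here is a short sketch that runs the two commands above and then looks
at the resulting nine-letter string.  It assumes a scratch directory, and it
first puts the file into a known state with "u=rw,go=", an "=" form of chmod
that this lesson does not otherwise cover.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch syllabus

chmod u=rw,go= syllabus     # known starting point: rw-------

chmod ugo+r syllabus        # owner, group and world may now read
chmod o-w syllabus          # make sure the world cannot write

# Characters 2 through 10 of "ls -l" are the nine permission letters.
mode=$(ls -l syllabus | cut -c2-10)
echo "$mode"                # prints: rw-r--r--
```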
 
     If you specify that somebody may read one of your files, all the
directories in which that file is nested must be accessible as well.  That is,
all the directories in the full pathname must have at least "x" (execute)
privilege in order for the user to get to the file; for a directory, "x" means
permission to pass through it on the way to a file.  However, just because a
directory is public doesn't mean that the files within it are public.  Some
of them might be, and some might not be.

     There is another form of the chmod command that uses octal numbers to
encode the permissions.  Since most novice UNIX users are just confused by
this, we will not mention it further, although you might run across it.  The
man pages explain it in detail for the curious.


Searching using grep
--------------------

     Many new UNIX users seem either befuddled, irritated or amused by the
choice of names for UNIX commands.  While users should realize that command
names are really just arbitrary symbols, they should also know that UNIX
was developed by people who were not professional typists, working on very
slow, clunky teletypes that printed 10 characters per second.  Most commands
also have some vague rationale for their names.  Grep is one of them.

     A better name for grep might be "search-for-strings-in-file" but who
wants to type this a hundred times a day?  "grep" stands for "global regular
expression print", after the g/re/p command of the "ed" editor, which means
that it uses regular expressions (patterns) to search for character strings
in files and then displays the lines that contain them.  Grep is used a great
deal by UNIX users.

     The simplest use of grep is to ask it to find a string in a file.  For
example, the following command asks UNIX to search file "syllabus" for the
string "final".  If found, each line that contains that string will be
printed:

     % grep final syllabus

If "final" appears 100 times, then all 100 of these lines are printed.  If the 
word appears more than once on a single line, that line is printed only once.
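
     A small experiment shows this.  The contents of "syllabus" below are
invented, and the -c option (which counts matching lines instead of printing
them) is an extra that this lesson does not otherwise discuss.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
cat > syllabus <<'EOF'
Week 1: introduction
Week 8: midterm
Week 15: final review, then the final exam
Week 16: final grades posted
EOF

# Prints the two lines that contain "final".
matches=$(grep final syllabus)
echo "$matches"

# "final" occurs three times but on only two lines, so the count is 2.
count=$(grep -c final syllabus)
echo "$count"
```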

     There are many options to grep; a few of the most important will be
presented here.  One of these is -l which prints out only the file name if the
string is found.  Thus, to have UNIX merely tell us if "final" appears in
file syllabus, do:

     % grep -l final syllabus

If it does appear at least once, UNIX just prints out the name of the file,
syllabus.  If not, then UNIX prints nothing.

     Quite often you might want to find the file where a particular string
appeared.  Suppose you are in a directory that contains your latest C program,
but you can't remember where that darn function make_monster() is.  So you use
grep to search for that string:

     % grep -l make_monster *

The asterisk, again, means "all files in this directory".  UNIX will print out
the names of the files that contain make_monster.  There might be many if this
function is called many times, so you might want to tell UNIX to search for a
longer string.  In C, the declaration of a function is preceded by the return
type.  Suppose that you know make_monster() has a return type of void.  Then
you could search for "void make_monster", but because there is a blank in this
character string, you will have to surround it on the grep command with double
quotes:

     % grep -l "void make_monster" *
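
     The sketch below tries both searches on three tiny made-up files (the
names monster.c, main.c and notes.txt are assumptions for the example).

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'void make_monster(void)\n{\n}\n' > monster.c
printf 'int main(void)\n{\n    make_monster();\n    return 0;\n}\n' > main.c
printf 'nothing relevant here\n' > notes.txt

# Every file that mentions the function at all: main.c and monster.c.
mentions=$(grep -l make_monster *)
echo "$mentions"

# The longer, quoted string narrows it down to the file with the definition.
definition=$(grep -l "void make_monster" *)
echo "$definition"          # prints: monster.c
```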

     Going back to the single file example, if you know that the string is
definitely found in a particular file, such as "final" in "syllabus", you could
just enter "vi" and search for the word final.  Another thing that grep will do
is to print line numbers for you, which might be very useful:

     % grep -n final syllabus

If it says that final appears on line 62, then you could start vi on line 62
of file syllabus:

     % vi +62 syllabus

Notice that the starting line number option in vi uses a plus rather than the
usual minus, as a few UNIX options do.
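
     As a sketch of how the two steps fit together, the following picks the
line number out of the "grep -n" output.  The file contents are invented, and
the use of "head" and "cut" to isolate the number is just one possible way to
do it.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Week 1: introduction\nWeek 8: midterm\nWeek 15: final exam\n' > syllabus

# -n puts the line number and a colon in front of each matching line.
hits=$(grep -n final syllabus)
echo "$hits"                # prints: 3:Week 15: final exam

# Keep the first match and cut off everything after the first colon.
line=$(printf '%s\n' "$hits" | head -n 1 | cut -d: -f1)
echo "$line"                # prints: 3

# At a terminal you could now type:   vi +3 syllabus
```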

     Sometimes you want to find out if a particular character string DOES NOT
occur in a file.  Suppose that you are looking at a dataset in a file and
notice that every line is supposed to begin with the character X.  To make
sure that no line violates this rule, you could reverse the sense of the
grep search by using -v:

     % grep -v X datafile

This will print out any line that does NOT have an X in it.  (Actually, this
does not quite test the rule, because a line passes if it has an X anywhere in
it, not just at the beginning.  Ways to restrict the match to the beginning of
the line are reserved for later lessons.)
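
     The difference can be seen on a small made-up datafile.  As a preview of
that later material, the second command uses the pattern '^X', where the caret
ties the match to the start of the line; treat it as a sketch rather than a
full explanation.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'X 10\nX 20\nY 30\nY has an X later\n' > datafile

# Lines with no X anywhere: misses the last line, which breaks the rule.
no_x=$(grep -v X datafile)
echo "$no_x"                # prints: Y 30

# Lines that do not START with X: finds both offenders.
bad_start=$(grep -v '^X' datafile)
echo "$bad_start"
```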

     The last option of grep which we show herein is -i which ignores the
case of the letters.  Suppose that you wanted to find out if UNIX appeared
in a file, but it might be spelled UNIX or UNIx or any combination:

     % grep -i unix somefile

This would tell you if UNIX ever appears in this file, regardless of case.
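
     A quick sketch with invented file contents shows the effect.  The
"|| true" is there only because grep reports "nothing found" through its exit
status, which some scripts treat as an error.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Unix is old\nI like UNIX\nnothing here\n' > somefile

# Lower case only: no line matches, so nothing is printed.
exact=$(grep unix somefile || true)

# Ignoring case: both spellings are found.
any_case=$(grep -i unix somefile)
echo "$any_case"
```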


Comparing files
---------------

     A very handy command is "cmp" which compares two files.  Quite often you
will have two files and you know that they should have the same contents, for
example if they are outputs from a program or backup copies of a file.  To
see if the contents are EXACTLY the same, use "cmp":

     % cmp file1 file2

In typical UNIX tradition, if they are the same, nothing will be printed.
But if these two files differ in even one character, cmp will tell you so,
and in which line and character the first difference occurs.
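
     The following sketch compares a file with an exact copy and then with a
copy in which one letter has been changed.  The name "file3" and the contents
are assumptions; the exact wording of cmp's message varies a little between
systems, so only its general shape is described.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'alpha\nbeta\ngamma\n' > file1
cp file1 file2
printf 'alpha\nbetA\ngamma\n' > file3

# Identical files: cmp prints nothing at all.
same=$(cmp file1 file2)

# One changed letter: cmp names the position of the first difference,
# which here is on line 2.  cmp also sets a failing exit status when
# files differ, hence the "|| true".
differ=$(cmp file1 file3 || true)
echo "$differ"
```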

     A handier command to use, if you want to actually see the differences, is
"diff" which prints out the differences in the files.  It lists the lines
that would have to be deleted from, added to, or changed in the first file to
make it identical to the second.  Interpreting the output of diff takes some
practice and experimentation.  Our best advice is to play around with it.

     % diff file1 file2
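
     Here is one small experiment to start with, using two invented files that
differ by one line each.  In the traditional output format, lines marked with
"<" come from the first file and lines marked with ">" from the second; some
systems print a different format by default.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'alpha\nbeta\ngamma\n'  > file1
printf 'alpha\ngamma\ndelta\n' > file2

# diff sets a failing exit status when the files differ, hence "|| true".
changes=$(diff file1 file2 || true)

# "beta" is only in file1 and "delta" is only in file2, so both show up.
echo "$changes"
```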

     Comm is another file comparison command in UNIX.  However, it is meant to
be used with files that are sorted.  It prints out three columns, showing lines
that occur only in the first file, lines occurring only in the second file,
and those occurring in both.  This might be handy if you had two sorted lists 
of words or numbers and wanted to find the differences in the lists.
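
     A sketch with two short sorted lists follows.  The file names and words
are made up, and the -1, -2 and -3 options, which suppress the corresponding
column, go slightly beyond what this lesson covers.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'apple\nbanana\ncherry\n' > list1
printf 'banana\ncherry\ndate\n'  > list2

comm list1 list2                 # all three columns at once

only1=$(comm -23 list1 list2)    # hide columns 2 and 3: only in list1
only2=$(comm -13 list1 list2)    # hide columns 1 and 3: only in list2
both=$(comm -12 list1 list2)     # hide columns 1 and 2: in both lists

echo "only in list1: $only1"     # apple
echo "only in list2: $only2"     # date
```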


Spelling
--------

     One of the great ironies of the modern computerized world is that, even
though there are wonderful spelling checkers in almost every computer system
and word processor, people STILL turn in documents with spelling errors!
(Please don't squawk too loudly if you find any spelling errors in these
lessons!  They were spell checked, believe it or not!)

     UNIX has a rather crude spelling checker, but one that nevertheless works
and can be used effectively.  Here's how to get a sorted list of all the 
potentially misspelled words in a file:

     % spell myfile

This just prints out a list of words that it thinks are misspelled, one per
line on your screen.  It doesn't tell you where they were or how many times
they occurred, or even what possible corrections might be.  However, a 
reasonably intelligent person can still use this list to search the file and
find the words and fix them.

     Many users have their own dictionaries which contain words or names that
they use often but which are not in the standard dictionaries.  Spell lets
you specify such a custom dictionary in the form of a file containing words,
one per line and in alphabetical order.  Give the name of this file on the
spell command with a plus sign:

     % spell +mine paper2

Any word in "paper2" that is listed in "mine" will not be reported as
misspelled, even if it is not in the UNIX dictionary.

     There is a primitive way of finding correct spellings in UNIX, using
the "look" command.  This command takes in a character string and tries to
find it in its dictionary.  If it finds it, it prints it out, otherwise you
won't see anything.  For example:

     % look harrass

will print nothing since "harrass" is misspelled.  If you feel harassed by
this system, then give look a shorter prefix and it will print out all words
that begin with that prefix.  For example:

     % look har

UNIX prints out all the words it can find that begin with har, and you will
see "harass" among them.

     Once you discover the misspelled words, use "vi" to edit the file and
search for the misspelling by using the "/" command in "vi".

     As primitive as this system is, it nevertheless works well enough that
no student should turn in a paper that is not checked for spelling!


Filter commands
---------------

     Sometimes you will get a file that contains some printable text inter-
spersed with unprintable characters.  For example, if you download a Microsoft
Word document from your Macintosh to UNIX, there will be "garbage" in the file
because Word puts numerous formatting commands in the file using nonprintable
characters.  There is a way to squish them out:

     % strings filex

This command is a filter command because it acts like a sieve or filter, taking
in data from one file, performing some transformation on it, and spewing out
the result.  Most of these filters, like "strings", spew forth their result
to the terminal screen, which is called "stdout" (standard output).  However,
this can be sent to a new file by using the redirect symbol:

     % strings filex > filex.2

The arrow can be thought of as pointing to where the output should go, in this
example a new file called "filex.2".  You CANNOT use the same file name for
both the input and the output, however, because the shell empties the output
file before the command gets a chance to read it.  If you want to get rid of
the old "filex" and rename filex.2 as filex, you must do those steps explicitly.

     Strings does not actually change its input file.  It merely creates a new
data stream.  Thus, the old copy of the file "filex" is safe.

     Another filter that is commonly used is "tr" which translates characters
in a file.  Here's the way to translate lower case letters in a file to
upper case so that the entire file will be upper case:

     % tr a-z A-Z < file1 > file2

Two ranges of characters are given, both the same length.  The first is "a to z"
specified by a-z.  The replacement character set is "A to Z", specified by A-Z.
Since the character sets are both 26 letters long, there is a one-to-one
matching.

     Notice the use of the < sign.  This is another form of redirection and it
tells the tr command to take its input from the file "file1" rather than from
"stdin" (standard input), which is the keyboard.  Again, the inconsistency of
UNIX shows up.  Some commands allow you to specify a filename on the command
line, like "strings".  Other commands, such as "tr", do not allow this, but
rather they imagine that all input is coming from stdin, and if you want to
draw the data from a file, you must use the redirection symbol.
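
     The sketch below runs the upper-casing command on a one-line file and
then repeats the mistake warned about earlier, using one name for both input
and output.  The file "scratch" and the sample text are assumptions.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'hello, world\n' > file1

# Input comes from file1, output goes to file2; file1 is left alone.
tr a-z A-Z < file1 > file2
upper=$(cat file2)
echo "$upper"               # prints: HELLO, WORLD

# Same name on both sides: the shell empties "scratch" before tr reads it,
# so tr sees no input and the data is gone.
cp file1 scratch
tr a-z A-Z < scratch > scratch
if [ -s scratch ]; then echo "scratch has data"; else echo "scratch is now empty"; fi
```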

     Here's a neat example of using "tr" to encrypt a file using a simple
method called "Caesar's cipher", so named because it was actually used by the
Romans to keep military commands private.  Suppose that you have a file that
you wish to encrypt by shifting all the letters down the alphabet by 1, and
suppose that there is only upper case in the file.  Thus, the word CAB would
become DBC.  Here's the tr command to do this:

     % tr A-Z B-ZA < file1 > file1.encrypted

See the two character sets; both are 26 characters long.  The second set uses
both a range and an explicit letter:  B-ZA, which is shorthand for

     BCDEFGHIJKLMNOPQRSTUVWXYZA

To decrypt you can apply the reverse transformation:

     % tr B-ZA A-Z < file1.encrypted > newfile

"newfile" should be identical to file1.

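     The round trip can be checked with "cmp" from earlier in this lesson.
This is a sketch on an invented two-line, upper-case-only file.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'CAB\nATTACK AT DAWN\n' > file1

tr A-Z B-ZA < file1 > file1.encrypted       # shift each letter by 1
tr B-ZA A-Z < file1.encrypted > newfile     # shift each letter back

first=$(head -n 1 file1.encrypted)
echo "$first"               # prints: DBC

# cmp prints nothing when the decrypted copy matches the original.
cmp file1 newfile && echo "round trip OK"
```
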
     Filters are common in UNIX and are often used with redirection.  In all 
cases, filters do not actually change their input files, but rather create a 
new data stream that can be either viewed directly on the screen or redirected 
into a new file.  Another common use of filters is to hook several of them 
together in order to create a pipeline.  Data flows through multiple filters,
each one doing one particular thing to the data stream.  These pipelines are 
also discussed in the shell lesson.
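
     As a small taste of a pipeline, the sketch below feeds the output of
"tr" straight into "grep" using the "|" symbol, with no intermediate file.
The file contents are invented, and the details of "|" are left to the shell
lesson.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'the final exam\nweek one\nFinal grades\n' > file1

# tr upper-cases everything; grep then keeps the lines containing FINAL.
shouted=$(tr a-z A-Z < file1 | grep FINAL)
echo "$shouted"
```

Neither filter changes file1; the two lines that come out exist only in the
data stream.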

     One more common filter will be shown, which is "sed" or "stream editor".
"sed" reads all the lines from a file and applies a simple editing command to
each line.  It is actually quite powerful and can be used in many different
ways -- so many, in fact, that there are entire books devoted to it.

     In this lesson, we will look at only one simple use -- to search for and
replace every occurrence of a character string in a file.  Though this can be
done by using the "vi" editor, using sed enables this to happen in a pipeline
or a shell script, and does not require any interactive commands.  Here's the
format explained by an example:

     % sed 's/Unix/UNIX(TM)/g' file1 > file1.changed

The specific editing command follows the sed command, and it often uses
special characters so it must be enclosed in quotes, as shown above.
"s" means substitute, and the "g" at the end means "global" which says
"do this command to all occurrences in every line, not just the first
occurrence".  The old string, called the target, is "Unix", and its
replacement is "UNIX(TM)".  The slashes delimit or mark off these strings
from each other.

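     To see what the "g" buys you, this sketch runs the substitution with and
without it on a made-up file whose first line contains "Unix" twice.  The
output name "file1.first" is an assumption.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Unix and Unix again\nno match here\n' > file1

sed 's/Unix/UNIX(TM)/g' file1 > file1.changed   # every occurrence
sed 's/Unix/UNIX(TM)/'  file1 > file1.first     # first occurrence per line

with_g=$(head -n 1 file1.changed)
without_g=$(head -n 1 file1.first)
echo "$with_g"              # prints: UNIX(TM) and UNIX(TM) again
echo "$without_g"           # prints: UNIX(TM) and Unix again
```
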
     If you want to use the slash as a data character, there are several ways
to do it, the simplest being to change the delimiter to something else, like
a colon.  The following is equivalent to the above:

     % sed 's:Unix:UNIX(TM):g' file1 > file1.changed

Notice again that "file1" is not changed by this command, but a whole new
file called "file1.changed" is created.  The name, file1.changed, is totally
arbitrary, but it is often useful to give it a meaningful name so that you do
not lose track of what is in each file.
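
     The colon delimiter earns its keep when the strings themselves contain
slashes, as path names do.  The sketch below uses an invented file "paths" and
invented directory names.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '/usr/bin/vi\n/usr/bin/grep\n' > paths

# With colons as delimiters, the slashes are ordinary data characters.
sed 's:/usr/bin:/usr/local/bin:g' paths > paths.changed

first=$(head -n 1 paths.changed)
echo "$first"               # prints: /usr/local/bin/vi
```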