PEP(1)                          PRELIMINARY                          PEP(1)

NAME
       pep - a file detergent

SYNOPSIS
       pep [-a] [-b[x][0][1][7][+list][-list]] [-c[size]] [-d[number]]
           [-e[0|1|2]] [-f commandfile] [-g[intable][:outtable]] [-h[l]]
           [-k file] [-l[n][size]] [-n] [-o[b]] [-p[a]] [-r[l][m][t]]
           [-s[size]] [-t[size]] [-u[[inchar ...][,*]|-|+][:outchar ...]]
           [-v] [-z] [filename ...]

DESCRIPTION
       Pep is a filter program to "clean" files. It is named after a
       popular Norwegian detergent.

       Pep may be used to remove control characters, strip parity bits,
       interpret ANSI escape sequences, compress tabulation, extract
       strings and convert character sets.

       Pep is a filter. Its default operation is to read from standard
       input (the keyboard) and write to standard output (the terminal).
       You may also specify the names of one or more files as the last
       arguments on the command line. Pep accepts ambiguous filename
       arguments, and input may be redirected. Output is sent to standard
       output unless you use the -o option.

       To get a brief summary of the command line syntax and all the
       options, specify the -h option. Just type the command:

              pep -h

       followed by the RETURN key. Note that typing just pep will not
       give you this summary. The command:

              pep

       will start pep as a filter, and it will just echo back whatever
       you type until you type the end-of-file character (usually CTRL-D
       or CTRL-Z).

       When pep is running as a filter, it reads from standard input and
       writes to standard output. In this state, pep is much less verbose
       than usual. It will still print error messages, but very little
       else. Note that while:

              pep < foobar.in > foobar.out
              pep -ob foobar.txt

       will do more or less the same job, the first will do it quietly,
       in the tradition of Unix filters; the second will print the
       copyright notice, a detailed list of the things it will do, and
       finally a list and line count of all the files it processes as it
       plods along.
       If you want to check what pep actually intends to do to your file
       before it does it, you may make it pause with the -p option. For
       example:

              pep -p foobar.txt

       will make pep stop after displaying a list of the conversions it
       will apply to the file. The user is prompted and may choose to
       proceed (by hitting the RETURN key) or to abort the program
       without doing anything (by hitting CTRL-C).

       What pep will do to a file depends on which conversion functions
       are selected. Specific conversion functions are selected by
       specifying one or more options on the command line. Most of the
       options may be combined with other options, but a few are
       mutually exclusive. If the user specifies invalid options or
       option arguments, pep will abort with an error message and return
       an error exit code on operating systems that support exit codes.

OPTIONS
       -a     Write out information about pep.

       -b [x] [0] [1] [7] [+list] [-list]
              Remove (or expand) "non-printing" (binary) characters.

              At the outset, pep considers all characters "printing",
              but on its own, the -b option will restrict the range of
              printing characters to the ranges 32-126 and 160-255, plus
              TAB (9), LF (10), FF (12) and CR (13), i.e. the ISO 8859
              character set.

              If the -b option is followed by the modifier x, the
              symbols that would normally be removed are instead
              expanded to hexadecimal character codes between angle
              brackets.

              The other modifiers specify the range of characters to
              remove or expand.

              The modifier 0 is shorthand for considering all standard
              control characters (0-31) and 128 as printing characters.
              The modifier 1 is likewise shorthand for considering all
              characters in the range 128-159 as printing characters.
              (So -b01 will result in all characters 0-255 being
              considered printing characters, and nothing will be
              removed or expanded. This nullifies the option.)

              The modifier 7 will restrict the set of printing
              characters to the original 7-bit character set known as
              US-ASCII or ISO 646 (32-126) plus TAB (9), LF (10), FF (12)
              and CR (13).
              To achieve even finer control, you may use the modifier +
              followed by a list of characters to include, or the
              modifier - followed by a list of characters to exclude.
              The list of modifiers is read from left to right, so the
              table of modifiers is accumulated and later entries
              override earlier ones in case of conflict.

              When combined with the -s option, this list determines
              what the program thinks of as a "printable character" for
              the purposes of computing consecutive strings of printable
              characters.

       -c [size]
              Compress space into tabulation, i.e. insert TAB characters
              when replacing a run of two or more SPACE characters would
              produce a smaller output file. The default tabulation size
              is 8, but you may specify any other tabulation size with
              the optional numeric argument.

              This function is the opposite of the function invoked with
              the -t option.

       -d [number]
              Double-space the output (i.e. insert an extra, empty
              line). If the optional number is given, it specifies the
              number of lines to insert.

              A numeric argument of 0 (zero) has a special meaning. In
              this case, empty lines (i.e. lines without graphic
              characters) are discarded.

       -e [0 | 1 | 2]
              Interpret ANSI screen control sequences (also known as
              ANSI ESCAPE sequences). This function makes pep emulate
              cursor positioning and other functions of an ANSI
              terminal.

              Pep will complain about "strange" (i.e. implementation
              dependent) use of ANSI escape sequences.

              Pep will normally save a screen image to the output file
              when one of two events occurs: 1) when the screen is full
              and scrolls up; or 2) just before a screen image is erased
              with the "erase screen" ANSI screen control sequence.

              In some cases important fields on the screen will be
              overwritten or erased. There is no good solution to this
              problem, but pep gives the user some opportunity to guard
              against overwriting and erasure. This is done by
              specifying an additional numeric argument to the -e option.
              This number indicates the level of protection and is
              interpreted as follows:

              0   No protection: fields may be erased and overwritten
                  (this is the default).
              1   Sequences that erase fields are ignored.
              2   Sequences that erase or overwrite fields are ignored.

       -f [commandfile]
              Read stream editor commands from a file. The name of the
              file must be appended as the argument to this option. For
              the file format, see the file sample.ed in beta/lib/pep/ed
              in the distribution.

              If the name of the file is omitted, pep will write out a
              list of the directories it searches for these files.

       -g [intable] [: outtable]
              To convert between charsets, you need to specify the names
              of two files containing descriptions of the input and
              output character sets. Files covering Macintosh, IBM-PC
              (CP 850), ISO-8859-1 and ISO-646-60 are included with the
              pep distribution. To create your own charset description
              file, see the comments in the file sample in
              beta/lib/pep/cs.

              If the name of the file is omitted, pep will write out a
              list of the directories it searches for these files.

       -h [l]
              Write a brief summary of pep options, and exit. The
              optional argument l displays a longer help file.

       -k [file]
              Instead of typing the options on a command line, you can
              have pep read them from a file. The file pep26.cf is
              included with the distribution, and gives the same effect
              as the default settings of pep ver. 2.6. It is located in
              the directory beta/lib/pep/cf/.

       -l [n] [size]
              Split long lines into lines of the maximum length given by
              the size argument. This option will also make sure that
              there is at least one blank line between paragraphs,
              unless the optional argument n is specified. If size is
              not specified, a default value of 72 characters is used.

       -n     When the source file contains words that are underscored
              by using the backspace control character in combination
              with the underscore character, the underscores and
              backspaces are removed, leaving the plain text.
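As an illustration of the effect described for -n, here is a minimal Python sketch. It is not taken from pep's source: the function name strip_underscoring is hypothetical, and it assumes that the underscore may be written either before or after the backspace (both forms were produced by overstriking printers and formatters).

```python
import re

# Hypothetical helper, not part of pep: remove backspace-style underscoring.
# Assumption: an underscored character appears either as "_\bX" or "X\b_".
_UNDERSCORE_PAIR = re.compile(r"_\x08|\x08_")


def strip_underscoring(text: str) -> str:
    """Return text with every underscore/backspace pair removed."""
    return _UNDERSCORE_PAIR.sub("", text)


if __name__ == "__main__":
    sample = "_\bp_\be_\bp is a filter"
    print(strip_underscoring(sample))
```

Underscores that are not paired with a backspace are left alone, so identifiers such as file_name pass through unchanged.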
       -o [b]
              Pep will usually write the result of conversions to
              standard output (stdout). This option instead instructs
              pep to replace each named input file with a file
              containing the result of filtering the file through pep.

              If the option is augmented with the argument b (i.e. -ob),
              pep will create a backup copy of the original input file
              in a file with the extension .BAK. If you just specify -o,
              the original file is deleted.

       -p [a]
              Write out a brief description of the conversion functions
              that will be activated by the current set of options, and
              pause. The user may review the list of conversion
              functions and abort (by hitting CTRL-C) if they do not
              have the intended effect. If you just want to see spelled
              out which conversion functions a particular combination
              of switches and/or configuration file selects, the
              optional argument a (i.e. -pa) will abort the run after
              displaying the list of conversions on the screen.

       -r [l] [m] [t]
              Remove leading, multiple and/or trailing spaces. This
              option implicitly activates the -u option.

       -s [size]
              Find strings in extremely "noisy" files.

              Pep's concept of a string is a sequence of "printable"
              characters of a certain length. The default minimum length
              of this sequence is 4, but the user may change this by
              supplying an optional numeric argument that becomes the
              minimum length of the sequence.

              When you use the -s option, runs of consecutive printable
              characters shorter than the minimum length are discarded,
              and the remaining strings are separated from each other
              with a newline.

              Note that by default, pep considers all characters
              "printable". This means that by itself, this option will
              do nothing. By using the -b option you can restrict the
              range of printable characters to something sensible.

       -t [size]
              Expand tabulation, replacing each TAB character with a
              suitable number of spaces. The default tabulation size is
              8, but the optional numeric argument size may be used to
              set tabulation to any desired size.
              This function is the opposite of the function invoked with
              the -c option.

       -u [ [\inchar ...] [,*] | + | - ] [ : \outchar ... ]
              If you just specify -u without any arguments, pep will try
              to guess what constitutes individual lines in the input
              file, and terminate each of those with the canonical line
              terminator (the standard way to terminate a text line) on
              the platform pep is running on. This means that lines will
              be terminated with a CR/LF pair on a Microsoft system, LF
              on a Unix system, and CR on a Macintosh computer.

              Alternatively, you can override this by specifying
              explicitly which symbol(s) in the input stream signify an
              end-of-line. You can use the ordinary shorthand notation,
              e.g.:

                     \r      carriage return (CR)
                     \n      newline (LF)
                     \s      record separator (RS, ASCII \x1E)
                     \r\n    carriage return followed by a newline
                             (CR/LF)

              You may also give the numeric code for the character in
              decimal, octal or hex: \10, \012, \x0A and \n are just
              alternate notations for the newline character.

              You may specify several alternative sequences of
              characters that all may signify line endings, separated by
              commas.

              You can also use the symbol +, which in this context means
              that pep will use its current notion of a "printing"
              character, and treat everything that is not a "printing"
              character as a line ending. If the file is noisy, there
              will be a lot of empty lines in the output. An alternate
              form of this option is the symbol -, which is the same as
              +, but discards empty lines.

              After the colon, you can specify which character or
              characters shall be used to terminate each line in the
              output file.

              It is useful to understand how the -u option works.
              Basically, pep accumulates characters on a line, looking
              for a sequence that may signify an end of line. When such
              a sequence is detected, it is removed from the line
              buffer, and the buffer is then flushed to the output with
              the canonical line ending appended.
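The accumulate-and-flush behaviour described for -u can be sketched in Python as follows. This is only an illustration under stated assumptions, not pep's implementation: the function name normalize_line_endings and its parameters are hypothetical, the list of terminators must be given explicitly (pep's guessing heuristic is not modelled), and longer terminators are tried first so that CR/LF is not counted as two line endings.

```python
import os
import re


def normalize_line_endings(text, terminators=("\r\n", "\r", "\n"), out=os.linesep):
    """Replace every input terminator with the single output terminator.

    Hypothetical helper: `terminators` plays the role of the inchar list
    and `out` the role of the outchar argument described above.
    """
    # Try longer sequences first so "\r\n" is matched as one line ending.
    ordered = sorted(terminators, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    # Splitting removes the detected terminators from the line buffers.
    pieces = pattern.split(text)

    # The last piece is whatever followed the final terminator; it is
    # empty when the input ended with a terminator.
    tail = pieces.pop()

    # Flush each complete line with the chosen line ending appended.
    return "".join(piece + out for piece in pieces) + tail


if __name__ == "__main__":
    mixed = "one\r\ntwo\rthree\n"
    print(repr(normalize_line_endings(mixed, out="\n")))
```

An unterminated final line is passed through without a terminator in this sketch; whether pep does the same is not stated in this manual.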
       -v     Normally, pep will terminate each line with a line
              terminator. Some typesetting programs and word processors,
              however, require that no hard line terminator is present
              within a paragraph, and that only paragraphs are hard
              terminated. If you want to export a file to such a
              typesetting program or word processor, you may use this
              option to instruct pep to terminate paragraphs only.

              See the note in the BUGS section below about the treatment
              of "end-of-line" and "end-of-paragraph".

       -z     Zero the eighth bit (a.k.a. the parity bit) on all
              characters in the file.

ENVIRONMENT
       Pep uses the environment variable PEP when it searches for files
       for the -f, -g and -k options. Below are some examples of how to
       set this in some operating systems:

              set PEP=C:\MISC\LIB               (MS-DOS)
              setenv PEP /home/george/lib       (Unix, csh)

       The command to set this environment variable should usually be
       part of the command file that is read during login (this may be
       named AUTOEXEC.BAT, LOGIN.COM, .profile or .login depending upon
       your choice of operating system).

       Pep also uses HOME (to locate the user's home directory), and
       PAGER (to find the program to page the help file).

DIAGNOSTICS
       If you specify an option that pep does not recognize, pep will
       write a summary of usage and abort. Other errors on the command
       line will result in pep writing an error message before aborting.
       On operating systems that support exit codes, pep will return an
       exit code upon termination.

       If pep is interpreting ANSI escape sequences and notices
       syntactic or semantic errors in the way they are used, a warning
       is printed on the screen, prefixed with the string "ansi:". This
       means that it is also possible to use pep to check whether
       programs use ANSI sequences in a portable way.
FILES
       Pep searches for the files used by the -f, -g and -k options in
       the following order: first the current directory, then the
       directory pointed to by the PEP environment variable, then a
       subdirectory named pep below the user's home directory, and
       finally the directory beta/lib/pep.

       The standard configuration, translation and editor files are
       stored in three subdirectories below beta/lib/pep:

       beta/lib/pep/cf
              standard configuration files (used with the -k option)

       beta/lib/pep/cs
              standard charset files (used with the -g option)

       beta/lib/pep/ed
              standard editor files (used with the -f option)

AUTHOR
       Copyright (c) 1987-2004 Gisle Hannemyr.

       This software is released under the GNU General Public License.
       See the file COPYING.TXT for details.

       Pep may be freely distributed and copied, as long as this file is
       included in the distribution and these statements about
       authorship and copyright are not altered or removed.

       Money, bug reports, improvements, comments, suggestions and
       praise are welcome. Please send to:

              Snail:  Gisle Hannemyr, Hegermannsgt. 13c, NO-0478 Oslo,
                      Norway
              Email:  gisle@hannemyr.no
              URL:    http://hannemyr.com/enjoy/pep.html
              FAQ:    http://hannemyr.com/faq/pepfaq.html

ACKNOWLEDGMENTS
       Several people have contributed character tables, ideas and/or
       bug reports. Thanks to: Inge Arnesen, Knut Borge, Dag Asheim, Ola
       Garstad, Ottar Grimstad, Bjorn Larsen, Nils-Eivind Naas, Knut
       Omang, Tor Sjowall, Geir-Harald Strand, Jens-Henrik Sorensen and
       Bjorn Asle Valde. My apologies if anyone is forgotten.

SEE ALSO
       sed(1), strings(1), tr(1).

BUGS
       Pep uses a rather simplistic heuristic to identify the end of a
       paragraph: it bluntly assumes that paragraphs are separated by
       blank lines.

       Pep only knows the ANSI sequences implemented in the standard
       MS-DOS console driver ANSI.SYS.

       There cannot be a space character between an option and the
       option's argument (e.g. you'll have to use "-kfoobar.cf", not
       "-k foobar.cf").

       Pep will only filter "regular" files.
       It will skip directories, sockets and "special" files.

       Links are the GOTOs of file systems. If you run a hard linked
       file through pep using the -o option, the link will not be
       preserved. Pep will just skip soft linked files.

       Pep's default notion of whitespace is that it is a consecutive
       sequence of \x20 (space), \x09 (tab), \x0A (linefeed) and \x0D
       (carriage return) in any order. Pep does not consider ISO-8859-1
       \xA0 (non-breaking space) as whitespace (i.e. non-breaking space
       is treated as a "printing character").

Version 3.00B                   2004 April 24                        PEP(1)