dwww Home | Manual pages | Find package

extract++(1)                General Commands Manual                extract++(1)

NAME
       extract++ - SWISH++ text extractor

SYNOPSIS
       extract++ [ options ] directory...  file...

DESCRIPTION
       extract++  is the SWISH++ text extractor, a utility to extract what text
       there is from a (mostly) binary file (similar to the strings(1) command)
       prior to indexing.  Original files are untouched.

       Text is extracted from the specified files and files  in  the  specified
       directories;  text from files in subdirectories of specified directories
       is also extracted by default (unless the -r, --no-recurse, -f, or --fil-
       ter option or the RecurseSubdirs or ExtractFilter variable is given).

       Ordinarily, text is extracted from files either only if  their  filename
       matches  one  of the patterns in the set specified with either the -e or
       --pattern option or the IncludeFile variable (unless standard  input  is
       used;  see next paragraph) or is not among the set specified with either
       the -E or --no-pattern option or the ExcludeFile variable.

       If there is a single filename of `-', the list of directories and  files
       to extract is instead taken from standard input (one per line).  In this
       case,  filename  patterns  of files to extract need not be specified ex-
       plicitly: all files, regardless of whether they match a pattern  (unless
       they  are  among  the set not to extract specified with either the -E or
       --no-pattern option or the ExcludeFile variable), are  extracted,  i.e.,
       extract++  assumes  you know what you're doing when specifying filenames
       in this manner.

       Ordinarily, the text extracted from a file is written to another file in
       the same directory having the same filename but with the ``.txt'' exten-
       sion appended by default, e.g., ``foo.doc'' becomes ``foo.doc.txt''  af-
       ter  extraction.   (See  also  the  -x  or --extension option or the Ex-
       tractExtension variable.)  However, extraction is not performed  if  the
       extracted text file exists.

       If  either  the  -f  or --filter option or the ExtractFilter variable is
       given, then only a single file specified on  the  command  line  is  ex-
       tracted  to  standard  output.   In this case, filename patterns are not
       used and the existence of an extracted text file is irrelevant.

   Filters
       Via the FilterFile configuration file variable, files having  particular
       patterns  can  be  filtered  prior  to extraction.  (See the examples in
       swish++.conf(5).)

   Character Mapping and Word Determination
       extract++ performs the same character mapping, character entity  conver-
       sions, and word determination heuristics used by index++(1) but also ad-
       ditionally:

       1.  Considers all PostScript Level 2 operators that are not also English
           words  to  be  stop words.  Such words in a file usually indicate an
           encapsulated PostScript (EPS) file and such should not be indexed.

       2.  Looks specifically for encapsulated PostScript  (EPS)  data  between
           everything  between  one  of %%BeginSetup, %%BoundingBox, %%Creator,
           %%EndComments, or %%Title and %%Trailer and discards it.

       3.  Discards strings of ASCII hex data Word_Hex_Min_Size  characters  or
           longer, e.g., ``7F454C46.''  (Default is 5.)

   Motivation
       extract++  was  developed  to be able to index non-text files in propri-
       etary formats such as Microsoft Office documents.  There are a couple of
       reasons why the functionality of extract++ isn't simply built  into  in-
       dex++(1):

       1.  Users  who do not need to index such documents shouldn't have to pay
           the performance penalty for doing the extra  checks  for  PostScript
           and hex data.

       2.  While index++(1) can uncompress files on the fly using filters also,
           uncompressing  them  every  time indexing is performed is excessive.
           Text extraction, on the other hand, is done only once per  file;  if
           the  file  is  updated, the text-extracted version should be deleted
           and recreated.

OPTIONS
       Options begin with either a `-' for short options or a ``--''  for  long
       options.   Either a `-' or ``--'' by itself explicitly ends the options;
       however, the difference is that `-' is returned as the first  non-option
       whereas  ``--''  is skipped entirely.  Long option names may be abbrevi-
       ated so long as the abbreviation is unambiguous.

       For a short option that takes an argument, the argument is either  taken
       to  be  the remaining characters of the same option, if any, or, if not,
       is taken from the next option unless said option begins with a `-'.

       Short options that take no arguments can be grouped (but the last option
       in the group can take an argument), e.g., -lrv4 is equivalent to  -l  -r
       -v4.

       For  a  long option that takes an argument, the argument is either taken
       to be the characters after a `=', if any, or, if not, is taken from  the
       next option unless said option begins with a `-'.

       -?
       --help            Print the usage (``help'') message and exit.

       -cc
       --config-file=c   The  name  of the configuration file, c, to use.  (De-
                         fault is swish++.conf in the  current  directory.)   A
                         configuration  file is not required: if none is speci-
                         fied and the default does not  exist,  none  is  used;
                         however,  if  one  is specified and it does not exist,
                         then this is an error.

       -ep[,p...]
       --pattern=p[,p...]
                         A filename pattern (or set of  patterns  separated  by
                         commas),  p,  of  files to extract text from.  Case is
                         significant.  Multiple -e or --pattern options may  be
                         specified.

       -Ep[,p...]
       --no-pattern=p[,p...]
                         A filename pattern or patterns, p, of files not to ex-
                         tract text from.  Case is significant.  Multiple -E or
                         --no-pattern options may be specified.

       -f
       --filter          Extract a single file to standard output and exit.

       -l
       --follow-links    Follow  symbolic links during extraction.  The default
                         is not to follow them.  (This option is not  available
                         under  Microsoft Windows since it doesn't support sym-
                         bolic links.)

       -r
       --no-recurse      Do not recursively extract the  files  in  subdirecto-
                         ries,  that  is:  when a directory is encountered, all
                         the files in that directory are extracted (modulo  the
                         filename patterns specified via the -e, --pattern, -E,
                         or --no-pattern options or the IncludeFile or Exclude-
                         File variables) but subdirectories encountered are ig-
                         nored  and  therefore  the files contained in them are
                         not extracted.  (This option is most useful when spec-
                         ifying the directories and files to extract via  stan-
                         dard  input.)   The default is to extract the files in
                         subdirectories recursively.

       -sf
       --stop-file=f     The name of a file, f, containing the  set  stop-words
                         to  use  instead of the built-in set.  Whitespace, in-
                         cluding blank lines, and characters  starting  with  #
                         and  continuing  to the end of the line (comments) are
                         ignored.

       -S
       --dump-stop       Dump the built-in set of stop-words to standard output
                         and exit.

       -vc
       --verbosity=v     The verbosity level, v, for printing additional infor-
                         mation to standard output during indexing.   The  ver-
                         bosity levels, 0-4, are:

                         0   No output is generated (except for errors).
                         1   Only  run  statistics  (elapsed  time,  number  of
                             files, word count) are printed.
                         2   Directories are printed as extraction progresses.
                         3   Directories and files are  printed  with  a  word-
                             count for each file.
                         4   Same  as  3 but also prints all files that are not
                             extracted and why.

       -V
       --version         Print the version number of SWISH++ and exit.

       -xe
       --extension=e     The extension to append to  filenames  during  extrac-
                         tion.   (It  can be specified with or without the dot;
                         default is txt.)

CONFIGURATION FILE
       The following variables can be set in a configuration  file.   Variables
       and command-line options can be mixed.

            ExcludeFile       Same as -E or --no-pattern
            ExtractExtension  Same as -x or --extension
            ExtractFilter     Same as -f or --filter
            FilterAttachment  (See FILTERS in swish++.conf(5).)
            FilterFile        (See FILTERS in swish++.conf(5).)
            FollowLinks       Same as -l or --follow-links
            IncludeFile       Same as -e or --pattern
            RecurseSubdirs    Same as -r or --no-recurse
            StopWordFile      Same as -s or --stop-file
            Verbosity         Same as -v or --verbosity

EXAMPLES
   Extraction
       To extract text from all Microsoft Office files on a web server:

            cd /home/www/htdocs
            extract++ -v3 -e '*.doc' -e '*.ppt' -e '*.xls' .

   Filters
       (See the examples in swish++.conf(5).)

EXIT STATUS
       Exits with one of the values given below:

            0    Success.
            1    Error in configuration file.
            2    Error in command-line options.
            20   File to extract does not exist.
            30   Unable to read stop-word file.

CAVEATS
       1.  Text extraction is not perfect, nor can be.

       2.  As  with  index++(1), the word-determination heuristics employed are
           heavily geared for English.  Using SWISH++ as-is to extract files in
           non-English languages is not recommended.

FILES
       swish++.conf      default configuration file name

SEE ALSO
       index++(1), search++(1), strings(1), swish++.conf(5), glob(7)

       Adobe Systems Incorporated.  PostScript Language Reference  Manual,  2nd
       ed.  Addison-Wesley, Reading, MA.  pp. 346-359.

       International  Standards  Organization.   ``ISO/IEC  9945-2: Information
       Technology -- Portable Operating System Interface  (POSIX)  --  Part  2:
       Shell and Utilities,'' 1993.

AUTHOR
       Paul J. Lucas <pauljlucas@mac.com>

SWISH++                         November 1, 2002                   extract++(1)

Generated by dwww version 1.16 on Tue Dec 16 04:16:41 CET 2025.