dwww Home | Manual pages | Find package

Parser(3pm)           User Contributed Perl Documentation           Parser(3pm)

NAME
       XML::Parser - A perl module for parsing XML documents

SYNOPSIS
         use XML::Parser;

         $p1 = XML::Parser->new(Style => 'Debug');
         $p1->parsefile('REC-xml-19980210.xml');
         $p1->parse('<foo id="me">Hello World</foo>');

         # Alternative
         $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                            End   => \&handle_end,
                                            Char  => \&handle_char});
         $p2->parse($socket);

         # Another alternative
         $p3 = XML::Parser->new(ErrorContext => 2);

         $p3->setHandlers(Char    => \&text,
                          Default => \&other);

         open(my $fh, 'xmlgenerator |');
         $p3->parse($fh, ProtocolEncoding => 'ISO-8859-1');
         close($fh);

         $p3->parsefile('junk.xml', ErrorContext => 3);

DESCRIPTION
       This module provides ways to parse XML documents. It is built on top of
       XML::Parser::Expat, which is a lower level interface to James Clark's
       expat library. Each call to one of the parsing methods creates a new
       instance of XML::Parser::Expat which is then used to parse the document.
       Expat options may be provided when the XML::Parser object is created.
       These options are then passed on to the Expat object on each parse call.
       They can also be given as extra arguments to the parse methods, in which
       case they override options given at XML::Parser creation time.

       The behavior of the parser is controlled either by "STYLES" and/or
       "HANDLERS" options, or by "setHandlers" method. These all provide
       mechanisms for XML::Parser to set the handlers needed by
       XML::Parser::Expat.  If neither "Style" nor "Handlers" are specified,
       then parsing just checks the document for being well-formed.

       When underlying handlers get called, they receive as their first
       parameter the Expat object, not the Parser object.

METHODS
       new This is a class method, the constructor for XML::Parser. Options are
           passed as keyword value pairs. Recognized options are:

           •   Style

               This  option  provides  an  easy  way to create a given style of
               parser. The  built  in  styles  are:  "Debug",  "Subs",  "Tree",
               "Objects",  and  "Stream".  These  are  all  defined in separate
               packages under "XML::Parser::Style::*", and you can find further
               documentation for each style both below, and in those packages.

               Custom styles can be provided by  giving  a  full  package  name
               containing at least one '::'. This package should then have subs
               defined  for  each  handler  it  wishes  to  have installed. See
               "STYLES" below for a discussion of each built in style.

           •   Handlers

               When  provided,  this  option  should  be  an   anonymous   hash
               containing  as  keys  the  type  of  handler and as values a sub
               reference to handle that type of event.  All  the  handlers  get
               passed  as  their  1st  parameter  the instance of expat that is
               parsing the document. Further details on handlers can  be  found
               in  "HANDLERS". Any handler set here overrides the corresponding
               handler set with the Style option.

           •   Pkg

               Some styles will refer to subs defined in this package.  If  not
               provided,   it   defaults   to  the  package  which  called  the
               constructor.

           •   ErrorContext

               This is an Expat option. When this option is defined, errors are
               reported in context. The value should be the number of lines  to
               show on either side of the line in which the error occurred.

           •   ProtocolEncoding

               This  is  an Expat option. This sets the protocol encoding name.
               It defaults  to  none.  The  built-in  encodings  are:  "UTF-8",
               "ISO-8859-1",  "UTF-16",  and "US-ASCII". Other encodings may be
               used if they have encoding maps in one of the directories in the
               @Encoding_Path list. Check "ENCODINGS" for more  information  on
               encoding  maps.  Setting  the  protocol  encoding  overrides any
               encoding in the XML declaration.

           •   Namespaces

               This is an Expat option. If this is set to a  true  value,  then
               namespace  processing is done during the parse. See "Namespaces"
               in  XML::Parser::Expat  for  further  discussion  of   namespace
               processing.

           •   NoExpand

               This is an Expat option. Normally, the parser will try to expand
               references  to  entities defined in the internal subset. If this
               option is set to a true value, and a  default  handler  is  also
               set,  then  the  default  handler  will be called when an entity
               reference is seen in text. This  has  no  effect  if  a  default
               handler  has  not  been  registered, and it has no effect on the
               expansion of entity references inside attribute values.

           •   Stream_Delimiter

               This is an Expat option. It takes  a  string  value.  When  this
               string  is  found  alone  on a line while parsing from a stream,
               then the parse is ended as  if  it  saw  an  end  of  file.  The
               intended  use  is  with  a  stream  of  xml  documents in a MIME
               multipart format. The  string  should  not  contain  a  trailing
               newline.

           •   ParseParamEnt

               This  is  an  Expat option. Unless standalone is set to "yes" in
               the XML declaration, setting this to a  true  value  allows  the
               external DTD to be read, and parameter entities to be parsed and
               expanded.

           •   NoLWP

               This  option  has  no  effect  if  the ExternEnt or ExternEntFin
               handlers are directly set. Otherwise, if true, it forces the use
               of a file based external entity handler.

           •   Non_Expat_Options

               If provided, this should be an anonymous  hash  whose  keys  are
               options  that  shouldn't be passed to Expat. This should only be
               of concern to those subclassing XML::Parser.

       setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
           This  method  registers  handlers  for  various  parser  events.  It
           overrides  any  previous  handlers  registered  through the Style or
           Handler  options  or  through  earlier  calls  to  setHandlers.   By
           providing  a  false  or undefined value as the handler, the existing
           handler can be unset.

           This method returns a list of type, handler pairs  corresponding  to
           the  input.  The  handlers returned are the ones that were in effect
           prior to the call.

           See a description of the handler types in "HANDLERS".

       parse(SOURCE [, OPT => OPT_VALUE [...]])
           The SOURCE parameter should either be a string containing the  whole
           XML  document,  or  it  should  be  an  open IO::Handle. Constructor
           options to  XML::Parser::Expat  given  as  keyword-value  pairs  may
           follow  the  SOURCE  parameter.  These  override, for this call, any
           options or attributes passed through from the XML::Parser instance.

           A die call is thrown if a parse  error  occurs.  Otherwise  it  will
           return  1  or whatever is returned from the Final handler, if one is
           installed.  In other words, what parse may  return  depends  on  the
           style.

       parsestring
           This is just an alias for parse for backwards compatibility.

       parsefile(FILE [, OPT => OPT_VALUE [...]])
           Open  FILE  for  reading,  then call parse with the open handle. The
           file is closed no matter  how  parse  returns.  Returns  what  parse
           returns.

       parse_start([ OPT => OPT_VALUE [...]])
           Create   and   return   a   new  instance  of  XML::Parser::ExpatNB.
           Constructor options may be provided. If an  init  handler  has  been
           provided,   it  is  called  before  returning  the  ExpatNB  object.
           Documents are parsed by making incremental calls to  the  parse_more
           method  of  this  object, which takes a string. A single call to the
           parse_done  method  of  this  object,  which  takes  no   arguments,
           indicates that the document is finished.

           If  there  is  a  final  handler  installed,  it  is executed by the
           parse_done method before returning and the parse_done method returns
           whatever is returned by the final handler.

HANDLERS
       Expat is an event based parser. As the parser recognizes  parts  of  the
       document  (say  the  start  or  end  tag  for  an XML element), then any
       handlers registered for that type of an event are called  with  suitable
       parameters.   All  handlers receive an instance of XML::Parser::Expat as
       their  first  argument.  See  "METHODS"  in  XML::Parser::Expat  for   a
       discussion of the methods that can be called on this object.

   Init                (Expat)
       This is called just before the parsing of the document starts.

   Final                (Expat)
       This  is  called  just after parsing has finished, but only if no errors
       occurred during the parse. Parse returns what this returns.

   Start                (Expat, Element [, Attr, Val [,...]])
       This event is generated when an XML start tag is recognized. Element  is
       the  name of the XML element type that is opened with the start tag. The
       Attr & Val pairs are generated for each attribute in the start tag.

   End                (Expat, Element)
       This event is generated when an XML end tag is recognized. Note that  an
       XML empty tag (<foo/>) generates both a start and an end event.

   Char                (Expat, String)
       This  event  is  generated when non-markup is recognized. The non-markup
       sequence of characters is in String. A  single  non-markup  sequence  of
       characters  may  generate  multiple  calls to this handler. Whatever the
       encoding of the string in the original document, this is  given  to  the
       handler in UTF-8.

   Proc                (Expat, Target, Data)
       This event is generated when a processing instruction is recognized.

   Comment                (Expat, Data)
       This event is generated when a comment is recognized.

   CdataStart        (Expat)
       This is called at the start of a CDATA section.

   CdataEnd                (Expat)
       This is called at the end of a CDATA section.

   Default                (Expat, String)
       This  is called for any characters that don't have a registered handler.
       This includes both characters that are  part  of  markup  for  which  no
       events  are  generated  (markup  declarations) and characters that could
       generate events, but for which no handler has been registered.

       Whatever the encoding in the original document, the string  is  returned
       to the handler in UTF-8.

   Unparsed                (Expat, Entity, Base, Sysid, Pubid, Notation)
       This  is  called  for a declaration of an unparsed entity. Entity is the
       name of the entity. Base is the base to be used for resolving a relative
       URI.  Sysid is the system id. Pubid is the public id.  Notation  is  the
       notation name. Base and Pubid may be undefined.

   Notation                (Expat, Notation, Base, Sysid, Pubid)
       This  is  called for a declaration of notation. Notation is the notation
       name.  Base is the base to be used for resolving a relative  URI.  Sysid
       is the system id. Pubid is the public id. Base, Sysid, and Pubid may all
       be undefined.

   ExternEnt        (Expat, Base, Sysid, Pubid)
       This  is  called when an external entity is referenced. Base is the base
       to be used for resolving a relative URI. Sysid is the system  id.  Pubid
       is the public id. Base, and Pubid may be undefined.

       This  handler  should  either  return  a  string,  which  represents the
       contents of the external entity, or return an open filehandle  that  can
       be  read to obtain the contents of the external entity, or return undef,
       which indicates the external entity couldn't be found and will  generate
       a parse error.

       If  an open filehandle is returned, it must be returned as either a glob
       (*FOO) or as a reference to a glob (e.g. an instance of IO::Handle).

       A default handler is installed for this event. The  default  handler  is
       XML::Parser::lwp_ext_ent_handler  unless  the  NoLWP option was provided
       with a true value, otherwise  XML::Parser::file_ext_ent_handler  is  the
       default handler for external entities. Even without the NoLWP option, if
       the URI or LWP modules are missing, the file based handler ends up being
       used after giving a warning on the first external entity reference.

       The  LWP  external  entity  handler  will  use  proxies  defined  in the
       environment (http_proxy, ftp_proxy, etc.).

       Please note that the LWP external entity handler reads the entire entity
       into a string and  returns  it,  where  as  the  file  handler  opens  a
       filehandle.

       Also  note  that  the  file external entity handler will likely choke on
       absolute URIs or file names that don't fit the conventions of the  local
       operating system.

       The  expat  base  method  can  be  used  to  set a basename for relative
       pathnames. If no basename is given, or  if  the  basename  is  itself  a
       relative name, then it is relative to the current working directory.

   ExternEntFin        (Expat)
       This  is called after parsing an external entity. It's not called unless
       an ExternEnt handler is also set. There is a default  handler  installed
       that pairs with the default ExternEnt handler.

       If  you're  going to install your own ExternEnt handler, then you should
       set (or unset) this handler too.

   Entity                (Expat, Name, Val, Sysid, Pubid, Ndata, IsParam)
       This is called when an entity is declared. For  internal  entities,  the
       Val  parameter will contain the value and the remaining three parameters
       will be undefined. For external entities,  the  Val  parameter  will  be
       undefined,  the  Sysid  parameter  will  have  the  system id, the Pubid
       parameter will have the public  id  if  it  was  provided  (it  will  be
       undefined  otherwise), the Ndata parameter will contain the notation for
       unparsed entities. If this is a parameter entity declaration,  then  the
       IsParam parameter is true.

       Note  that  this handler and the Unparsed handler above overlap. If both
       are set, then this handler will not be called for unparsed entities.

   Element                (Expat, Name, Model)
       The element handler is called when an element declaration is found. Name
       is  the  element  name,  and  Model  is  the   content   model   as   an
       XML::Parser::Content  object. See "XML::Parser::ContentModel Methods" in
       XML::Parser::Expat for methods available for this class.

   Attlist                (Expat, Elname, Attname, Type, Default, Fixed)
       This handler is called for each attribute in an ATTLIST declaration.  So
       an ATTLIST  declaration  that  has  multiple  attributes  will  generate
       multiple  calls to this handler. The Elname parameter is the name of the
       element with which  the  attribute  is  being  associated.  The  Attname
       parameter  is  the  name  of  the attribute. Type is the attribute type,
       given as a string. Default is the default value, which  will  either  be
       "#REQUIRED",  "#IMPLIED"  or  a  quoted string (i.e. the returned string
       will begin and end with a quote character).  If Fixed is true, then this
       is a fixed attribute.

   Doctype                (Expat, Name, Sysid, Pubid, Internal)
       This handler is called for DOCTYPE declarations. Name  is  the  document
       type  name.  Sysid  is  the  system  id  of the document type, if it was
       provided, otherwise it's undefined.  Pubid  is  the  public  id  of  the
       document  type,  which  will  be  undefined  if  no public id was given.
       Internal is the internal subset, given as a  string.  If  there  was  no
       internal  subset,  it  will  be  undefined.  Internal  will  contain all
       whitespace, comments, processing instructions, and declarations seen  in
       the  internal subset. The declarations will be there whether or not they
       have been processed by another handler  (except  for  unparsed  entities
       processed  by  the  Unparsed  handler). However, comments and processing
       instructions  will  not  appear  if  they've  been  processed  by  their
       respective handlers.

   * DoctypeFin                (Parser)
       This  handler  is  called  after  parsing of the DOCTYPE declaration has
       finished, including any internal or external DTD declarations.

   XMLDecl                (Expat, Version, Encoding, Standalone)
       This handler is  called  for  xml  declarations.  Version  is  a  string
       containing  the  version.  Encoding  is  either undefined or contains an
       encoding string.  Standalone will be either true, false, or undefined if
       the standalone attribute is yes, no, or not made respectively.

STYLES
   Debug
       This just prints out the document in outline form.  Nothing  special  is
       returned by parse.

   Subs
       Each time an element starts, a sub by that name in the package specified
       by  the  Pkg  option  is  called with the same parameters that the Start
       handler gets called with.

       Each time an element ends,  a  sub  with  that  name  appended  with  an
       underscore  ("_"),  is  called  with  the  same  parameters that the End
       handler gets called with.

       Nothing special is returned by parse.

   Tree
       Parse will return a parse tree for the document. Each node in  the  tree
       takes the form of a tag, content pair. Text nodes are represented with a
       pseudo-tag  of  "0"  and the string that is their content. For elements,
       the content is an array reference. The first item  in  the  array  is  a
       (possibly  empty) hash reference containing attributes. The remainder of
       the array is a sequence of tag-content pairs representing the content of
       the element.

       So for example the result of parsing:

         <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

       would be:

                    Tag   Content
         ==================================================================
         [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
                     bar, [         {}, 0, "Howdy",  ref, [{}]],
                       0, "do"
               ]
         ]

       The root document "foo", has 3  children:  a  "head"  element,  a  "bar"
       element  and  the  text  "do". After the empty attribute hash, these are
       represented in it's contents by 3 tag-content pairs.

   Objects
       This is similar to the Tree style, except that a hash object is  created
       for  each  element.  The corresponding object will be in the class whose
       name is created by appending "::" and the element name  to  the  package
       set  with  the  Pkg  option. Non-markup text will be in the ::Characters
       class. The contents of the corresponding object will be in an  anonymous
       array that is the value of the Kids property for that object.

   Stream
       This  style  also  uses  the  Pkg package. If none of the subs that this
       style looks for is there, then the effect of parsing with this style  is
       to   print  a  canonical  copy  of  the  document  without  comments  or
       declarations.  All the subs receive as their  1st  parameter  the  Expat
       instance for the document they're parsing.

       It looks for the following routines:

       •   StartDocument

           Called at the start of the parse .

       •   StartTag

           Called  for  every  start tag with a second parameter of the element
           type. The $_ variable will contain a copy of  the  tag  and  the  %_
           variable will contain attribute values supplied for that element.

       •   EndTag

           Called  for  every  end  tag  with a second parameter of the element
           type. The $_ variable will contain a copy of the end tag.

       •   Text

           Called just before start or end  tags  with  accumulated  non-markup
           text in the $_ variable.

       •   PI

           Called  for  processing instructions. The $_ variable will contain a
           copy of the PI and the target and data  are  sent  as  2nd  and  3rd
           parameters respectively.

       •   EndDocument

           Called at conclusion of the parse.

ENCODINGS
       XML  documents  may  be  encoded in character sets other than Unicode as
       long as they may be mapped into the Unicode  character  set.  Expat  has
       further  restrictions  on  encodings. Read the xmlparse.h header file in
       the expat distribution to see details on these restrictions.

       Expat has built-in encodings for: "UTF-8", "ISO-8859-1",  "UTF-16",  and
       "US-ASCII".  Encodings  are  set  either  through  the  XML  declaration
       encoding attribute or through the ProtocolEncoding option to XML::Parser
       or XML::Parser::Expat.

       For encodings  other  than  the  built-ins,  expat  calls  the  function
       load_encoding in the Expat package with the encoding name. This function
       looks  for  a  file in the path list @XML::Parser::Expat::Encoding_Path,
       that matches the lower-cased name with a '.enc' extension. The first one
       it finds, it loads.

       If you wish to build your own encoding maps, check out the XML::Encoding
       module from CPAN.

AUTHORS
       Larry Wall <larry@wall.org> wrote version 1.0.

       Clark Cooper <coopercc@netheaven.com> picked up support, changed the API
       for this version (2.x), provided documentation, and added some  standard
       package features.

       Matt Sergeant <matt@sergeant.org> is now maintaining XML::Parser

perl v5.40.0                       2024-10-15                       Parser(3pm)

Generated by dwww version 1.16 on Tue Dec 16 05:48:28 CET 2025.