dwww Home | Manual pages | Find package

Unicod...CString(3pm) User Contributed Perl Documentation Unicod...CString(3pm)

NAME
       Unicode::GCString - String as Sequence of UAX #29 Grapheme Clusters

SYNOPSIS
           use Unicode::GCString;
           $gcstring = Unicode::GCString->new($string);

DESCRIPTION
       Unicode::GCString treats Unicode string as a sequence of extended
       grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].

       Grapheme cluster is a sequence of Unicode character(s) that consists of
       one grapheme base and optional grapheme extender and/or “prependcharacter.  It is close in that people consider as character.

   Public Interface
       Constructors

       new (STRING, [KEY => VALUE, ...])
       new (STRING, [LINEBREAK])
           Constructor.   Create new grapheme cluster string (Unicode::GCString
           object) from Unicode string STRING.

           About   optional   KEY   =>   VALUE   pairs   see    "Options"    in
           Unicode::LineBreak.    On  second  form,  Unicode::LineBreak  object
           LINEBREAK controls breaking features.

           Note: The first form was introduced by release 2012.10.

       copy
           Copy constructor.  Create a copy of grapheme cluster  string.   Next
           position of new string is set at beginning.

       Sizes

       chars
           Instance  method.   Returns  number  of  Unicode characters grapheme
           cluster string includes, i.e. length as Unicode string.

       columns
           Instance method.   Returns  total  number  of  columns  of  grapheme
           clusters  defined  by built-in character database.  For more details
           see "DESCRIPTION" in Unicode::LineBreak.

       length
           Instance method.  Returns number of grapheme clusters  contained  in
           grapheme cluster string.

       Operations as String

       as_string
       """OBJECT"""
           Instance  method.  Convert grapheme cluster string to Unicode string
           explicitly.

       cmp (STRING)
       STRING "cmp" STRING
           Instance method.  Compare strings.  There are no oddities.   One  of
           each STRING may be Unicode string.

       concat (STRING)
       STRING "." STRING
           Instance  method.   Concatenate  STRINGs.  One of each STRING may be
           Unicode string.  Note that number  of  columns  (see  columns())  or
           grapheme  clusters  (see length()) of resulting string is not always
           equal to sum of both strings.  Next position of new string  is  that
           set on the left value.

       join ([STRING, ...])
           Instance  method.   Join  STRINGs inserting grapheme cluster string.
           Any of STRINGs may be Unicode string.

       substr (OFFSET, [LENGTH, [REPLACEMENT]])
           Instance method.  Returns  substring  of  grapheme  cluster  string.
           OFFSET and LENGTH are based on grapheme clusters.  If REPLACEMENT is
           specified,  substring is replaced by it.  REPLACEMENT may be Unicode
           string.

           Note:  This  method  cannot  return  the  lvalue,  unlike   built-in
           substr().

       Operations as Sequence of Grapheme Clusters

       as_array
       "@{"OBJECT"}"
       as_arrayref
           Instance  method.   Convert  grapheme  cluster string to an array of
           grapheme clusters.

       eos Instance method.  Test if current position is  at  end  of  grapheme
           cluster string.

       item ([OFFSET])
           Instance method.  Returns OFFSET-th grapheme cluster.  If OFFSET was
           not specified, returns next grapheme cluster.

       next
       "<"OBJECT">"
           Instance  method,  iterative.   Returns  next  grapheme  cluster and
           increment next position.

       pos ([OFFSET])
           Instance method.  If optional OFFSET is specified, set next position
           by it.  Returns next position of grapheme cluster string.

       Miscelaneous

       lbc Instance    method.     Returns    Line    Breaking    Class    (See
           Unicode::LineBreak)   of  the  first  character  of  first  grapheme
           cluster.

       lbcext
           Instance    method.     Returns    Line    Breaking    Class    (See
           Unicode::LineBreak)  of  the last grapheme extender of last grapheme
           cluster.  If there are no grapheme extenders or  its  class  is  CM,
           value of last grapheme base will be returned.

CAVEATS
       •   The  grapheme  cluster  should not be referred to as "grapheme" even
           though Larry does.

       •   On Perl around 5.10.1, implicit  conversion  from  Unicode::GCString
           object  to  Unicode  string sometimes let "utf8_mg_pos_cache_update"
           cache be confused.

           For example, instead of doing

               $sub = substr($gcstring, $i, $j);

           do

               $sub = substr("$gcstring", $i, $j);

               $sub = substr($gcstring->as_string, $i, $j);

       •   This module implements default algorithm  for  determining  grapheme
           cluster boundaries.  Tailoring mechanism has not been supported yet.

VERSION
       Consult $VERSION variable.

   Incompatible Changes
       Release 2013.10
           •   The  new() method can take non-Unicode string argument.  In this
               case it will be decoded by iso-8859-1 (Latin 1)  character  set.
               That  method  of  former  releases  would  die  with non-Unicode
               inputs.

SEE ALSO
       [UAX #29] Mark Davis (ed.) (2009-2013).   Unicode  Standard  Annex  #29:
       Unicode         Text        Segmentation,        Revisions        15-23.
       <http://www.unicode.org/reports/tr29/>.

AUTHOR
       Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>

COPYRIGHT
       Copyright (C) 2009-2013 Hatuka*nezumi - IKEDA Soji.

       This program is free software; you can redistribute it and/or modify  it
       under the same terms as Perl itself.

perl v5.40.0                       2025-01-23             Unicod...CString(3pm)

Generated by dwww version 1.16 on Tue Dec 16 07:02:07 CET 2025.