mawk-arrays(7) - Man pages

dwww Home | Manual pages | Find package
MAWK-ARRAYS(7)                   Miscellaneous                   MAWK-ARRAYS(7)

NAME
       mawk-arrays - design notes for mawk's array implementation

SYNOPSIS
       This  is  the  documentation  for the mawk implementation of awk arrays.
       Arrays in awk are associations of strings to  awk  scalar  values.   The
       mawk  implementation  stores  the associations in hash tables.  The hash
       table scheme was influenced by and is similar to the design presented in
       Griswold and Townsend, The Design and Implementation of Dynamic  Hashing
       Sets  and Tables in Icon, Software Practice and Experience, 23, 351-367,
       1993.

DATA STRUCTURES
   Array Structure
       The type ARRAY is a pointer to a struct array.  The size  field  is  the
       number  of  elements  in the table.  The meaning of the other fields de-
       pends on the type field.

       There are three types of arrays and these are distinguished by the  type
       field in the structure.  The types are:

       AY_NULL
            The  array  is  empty and the size field is always zero.  The other
            fields have no meaning.

       AY_SPLIT
            The array was created by the AWK built-in split.  The return  value
            from  split is stored in the size field.  The ptr field points at a
            vector of CELLs.  The number of CELLs is the limit  field.   It  is
            always   true   that   size ≤ limit.    The   address  of  A[i]  is
            (CELL*)A->ptr+i-1 for 1≤ i ≤ size.  The hmask field has no meaning.

       Hash Table
            The array is a hash table.  If the AY_STR bit in the type field  is
            set,  then the table is keyed on strings.  If the AY_INT bit in the
            type field is set, then the table is keyed on integers.  Both  bits
            can  be set, and then the two keys are consistent, i.e., look up of
            A[-14] and A["-14"] will return identical  CELL  pointers  although
            the  look  up  methods  will  be different.  In this case, the size
            field is the number of hash nodes in the table.  When insertion  of
            a  new element would cause size to exceed limit, the table grows by
            doubling   the   number   of   hash   chains.     The    invariant,
            (hmask+1)max_ave_list_length=limit       is       always      true.
            Max_ave_list_length is a tunable constant.

   Hash Tables
       The hash tables are linked lists of nodes, called ANODEs.  The number of
       lists is hmask+1 which is always a power of two.  The ptr  field  points
       at  a  vector  of  list heads.  Since there are potentially two types of
       lists, integer lists and strings lists, each list head is  a  structure,
       DUAL_LINK.

       The  string  lists  are chains connected by slinks and the integer lists
       are chains connected by ilinks.  We sometimes refer to  these  lists  as
       slists  and ilists, respectively.  The elements on the lists are ANODEs.
       The fields of an ANODE are:

       slink The link field for slists.  ilink The link field for ilists.  sval
       If non-null, then sval is a pointer to a string key.  For a given table,
       if the AY_STR bit is set then every ANODE has a non-null sval field  and
       conversely, if AY_STR is not set, then every sval field is null.

       hval The hash value of sval.  This field has no meaning if sval is null.

       ival  The integer key.  The field has no meaning if set to the constant,
       NOT_AN_IVALUE.  If the AY_STR bit is off, then every ANODE will  have  a
       valid  ival  field.  If the AY_STR bit is on, then the ival field may or
       may not be valid.

       cell The data field in the hash table.  \ndhitems

       So the value of A[expr is stored in the cell field, and if  expr  is  an
       integer, then expr is stored in ival, else it is stored in sval.

ARRAY OPERATIONS
       The functions that operate on arrays are,

       CELL* array_find(ARRAY A, CELL *cp, int create_flag)
            returns  a  pointer  to  A[expr]  where cp is a pointer to the CELL
            holding expr.  If the create_flag is on and expr is not an  element
            of A, then the element is created with value null.

       void array_delete(ARRAY A, CELL *cp)
            removes  an element A[expr from the array A.  cp points at the CELL
            holding expr.

       void array_load(ARRAY A, size_t cnt)
            builds a split array.  The values A[1..cnt] are moved into  A  from
            an  anonymous  buffer with transfer_to_array() which is declared in
            split.h.

       void array_clear(ARRAY A) removes all elements of A.
            The type of A is then AY_NULL.

       STRING** array_loop_vector(ARRAY A, size_t *sizep)
            returns a pointer to a linear vector that  holds  all  the  strings
            that  are indices of A.  The size of the the vector is returned in-
            directly in *sizep.  If A->size≡0, a null pointer is returned.

       CELL* array_cat(CELL *sp, int cnt)
            concatenates the elements of sp[1-cnt..0], with each element  sepa-
            rated by SUBSEP, to compute an array index.  For example, on a ref-
            erence to A[i,j], array_cat computes i ○ SUBSEP ○ j where ○ denotes
            concatenation.

   Array Find
       Any reference to A[expr] creates a call to array_find(A,cp,CREATE) where
       cp points at the cell holding expr.  The test, expr in A, creates a call
       to array_find(A,cp,NO_CREATE).

       Array_find is a hash-table lookup function that handles two cases:

       1.   If  *cp is numeric and integer valued, then lookup by integer value
            using find_by_ival.  If *cp is numeric,  but  not  integer  valued,
            then convert to string with sprintf(CONVFMT,...) and go to case~2.

       2.   If  *cp  is  string  valued,  then  lookup  by  string  value using
            find_by_sval.  \ndlist

       To test whether cp->dval is integer, we convert to the  nearest  integer
       by rounding towards zero (done by do_to_I) and then cast back to double.
       If we get the same number we started with, then cp->dval is integer val-
       ued.

       When we get to the function find_by_ival, the search has been reduced to
       lookup in a hash table by integer value.

       When  a  search by integer value fails, we have to check by string value
       to correctly handle the case insertion by A["123"] and later  search  as
       A[123].   This  string search is necessary if and only if the AY_STR bit
       is set.  An important point is that all ANODEs get created with a  valid
       sval  if AY_STR is set, because then creation of new nodes always occurs
       in a call to find_by_sval.

       Searching by string value is easier because AWK arrays are really string
       associations.  If the array does not have the AY_STR bit  set,  then  we
       have  to  convert  the  array to a dual hash table with strings which is
       done by the function add_string_associations.

       One Int value is reserved to show that the ival field is invalid.   This
       works because d_to_I returns a value in [-Max_Int, Max_Int].

       On  entry to add_string_associations, we know that the AY_STR bit is not
       set.  We convert to a dual hash table, then walk all the  integer  lists
       and put each ANODE on a string list.

   Array Delete
       The  execution  of  the statement, delete A[expr], creates a call to ar-
       ray_delete(ARRAY A, CELL *cp).  Depending on the type of *cp,  the  call
       is  routed  to  find_by_sval  or  find_by_ival.  Each of these functions
       leaves its return value on the front of an slist or ilist, respectively,
       and then it is deleted from the front  of  the  list.   The  case  where
       A[expr  is on two lists, e.g., A[12] and A["12"] is checked by examining
       the sval and ival fields of the returned ANODE*.

       Even though we found a node by searching an ilist it might also be on an
       slist and vice-versa.

       When the size of a hash table drops below a certain value, it  might  be
       profitable  to  shrink  the hash table.  Currently we don't do this, be-
       cause our guess is that it would be a waste of time for most AWK  appli-
       cations.   However, we do convert an array to AY_NULL when the size goes
       to zero which would resize a large hash table that had  been  completely
       cleared by successive deletions.

   Building an Array with Split
       A  simple  operation is to create an array with the AWK primitive split.
       The code that performs split puts the pieces  in  an  anonymous  buffer.
       array_load(A, cnt) moves the cnt elements from the anonymous buffer into
       A.  This is the only way an array of type AY_SPLIT is created.

       If  the array A is a split array and big enough then we reuse it, other-
       wise we need to allocate a new split array.  When we allocate a block of
       CELLs for a split array, we round up to a multiple of 4.

   Array Clear
       The function array_clear(ARRAY A) converts A to type AY_NULL  and  frees
       all storage used by A except for the struct array itself.  This function
       gets called in three contexts:

       (1)  when an array local to a user function goes out of scope,

       (2)  execution of the AWK statement, delete A and

       (3)  when an existing changes type or size from split().

   Constructor and Conversions
       Arrays  are  always created as empty arrays of type AY_NULL.  Global ar-
       rays are never destroyed although they can go empty or have  their  type
       change by conversion.  The only constructor function is a macro.

       Hash  tables  only  get  constructed by conversion.  This happens in two
       ways.  The function make_empty_table converts an  empty  array  of  type
       AY_NULL  to  an empty hash table.  The number of lists in the table is a
       power of 2 determined by the constant STARTING_HMASK.  The limit size of
       the table is determined by the constant MAX_AVE_LIST_LENGTH which is the
       largest average size of the hash lists that we are willing  to  tolerate
       before enlarging the table.  When A->size exceeds A->limit, the hash ta-
       ble grows in size by doubling the number of lists.  A->limit is then re-
       set to MAX_AVE_LIST_LENGTH times A->hmask+1.

       The  other  way  a  hash table gets constructed is when a split array is
       converted to a hash table of type AY_INT.

       To determine the size of the table, we set the initial  size  to  START-
       ING_HMASK+1 and then double the size until A->size ≤ A->limit.

   Doubling the Size of a Hash Table
       The whole point of making the table size a power of two is to facilitate
       resizing  the table.  If the table size is 2**(n) and h is the hash key,
       then h mod 2**(n) is the hash chain index which can be  calculated  with
       bit-wise and, h & (2**(n-1)).  When the table size doubles, the new bit-
       mask  has  one  more bit turned on.  Elements of an old hash chain whose
       hash value have this bit turned on get moved to a new  chain.   Elements
       with  this  bit turned off stay on the same chain.  On average only half
       the old chain moves to the new chain.   If  the  old  chain  is  at  ta-
       ble[i], 0 ≤ i < 2**(n), then the elements that move, all move to the new
       chain at table[i + 2**(n)].

       As  we  walk an old string list with pointer p, the expression p->hval &
       new_hmask takes one of  two  values.   If  it  is  equal  to  p->hval  &
       old_hmask  (which equals i), then the node stays otherwise it gets moved
       to a new string list at j.  The new string list preserves order so  that
       the  positions  of the move-to-the-front heuristic are preserved.  Nodes
       moving to the new list  are  appended  at  pointer  tail.   The  ANODEs,
       dummy0~and  dummy1, are sentinels that remove special handling of bound-
       ary conditions.

       The doubling of the integer lists is exactly the same except that  slink
       is replaced by ilink and hval is replaced by ival.

   Array Loops
       Our mechanism for dealing with execution of the statement,

              for (i in A) { statements }

       is simple.  We allocate a vector of STRING* of size, A->size.  Each ele-
       ment  of  the vector is a string key for~A.  Note that if the AY_STR bit
       of A is not set, then A has to be converted to a string hash table,  be-
       cause the index i walks string indices.

       To  execute  the  loop, the only state that needs to be saved is the ad-
       dress of i and an index into the vector of string keys.   Since  nothing
       about  A is saved as state, the user program can do anything to A inside
       the body of the loop, even delete A, and the loop still  works.   Essen-
       tially,  we  have  traded data space (the string vector) in exchange for
       implementation simplicity.  On a 32-bit system, each ANODE is 36  bytes,
       so the extra memory needed for the array loop is 11 more than the memory
       consumed  by  the  ANODEs of the array.  Note that the large size of the
       ANODEs is indicative of our whole design which pays data space for inte-
       ger lookup speed and algorithm simplicity.

       The only aspect of array loops that occurs in array.c is construction of
       the string vector.  The rest of the implementation is in the  file  exe-
       cute.c.

       As we walk over the hash table ANODEs, putting each sval in ret, we need
       to  increment each reference count.  The user of the return value is re-
       sponsible for these new reference counts.

   Concatenating Array Indices
       In AWK, an array expression A[i,j] is equivalent to the  expression  A[i
       SUBSEP j], i.e., the index is the concatenation of the three elements i,
       SUBSEP  and  j.  This is performed by the function array_cat.  On entry,
       sp points at the top of a stack of CELLs.  Cnt cells are popped off  the
       stack  and  concatenated  together separated by SUBSEP and the result is
       pushed back on the  stack.   On  entry,  the  first  multi-index  is  in
       sp[1-cnt]  and  the last is in sp[0].  The return value is the new stack
       top.  (The stack is the run-time evaluation stack.  This  operation  re-
       ally  has nothing to do with array structure, so logically this code be-
       longs in execute.c, but remains here for historical reasons.)

       We make a copy of SUBSEP which we can cast to  string  in  the  unlikely
       event the user has assigned a number to SUBSEP.

       Set  sp  and  top so the cells to concatenate are inclusively between sp
       and top.

       The total_len is the sum of the lengths of the cnt strings and the cnt-1
       copies of subsep.

       The return value is sp and it is already set correctly.  We just need to
       free the strings and set the contents of sp.

Version 1.3.4                      2024-01-23                    MAWK-ARRAYS(7)
Generated by dwww version 1.16 on Wed Feb 4 08:56:41 CET 2026.