Field References

In Substrait, all fields are addressed by position. Field names are used only at the edge of a plan, to name fields for the outside world. Each operation returns a simple or compound data type, and later operations can refer to data within that output using field references. The kind of reference you use depends on the type of the value you are navigating.

Reference Type	Properties	Applicable Type	Return Type
Struct Field	Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan.	struct	Type of field referenced
Array Value	Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping).	list	Element type of list
Array Slice	Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length.	list	Same type as original list
Map Key	A map value that is matched exactly against available map keys and returned.	map	Value type of map
Map KeyExpression	A wildcard string, written in a simplified form of regular expressions, that is matched against the map keys. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.]	map	List of map value type
Masked Complex Expression	An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object.	any	any
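
The offset rules in the Array Value row can be sketched in Python. This is a minimal illustration, not Substrait's implementation: the helper name `list_element` is hypothetical, and Python's `None` stands in for a null value.

```python
from typing import Any, Optional, Sequence


def list_element(values: Sequence[Any], offset: int) -> Optional[Any]:
    """Hypothetical helper mirroring the Array Value rules in the table."""
    n = len(values)
    # A negative offset counts back from the end (-1 is the last element).
    index = offset + n if offset < 0 else offset
    # Overflow in either direction yields null (None here); offsets never wrap.
    if index < 0 or index >= n:
        return None
    return values[index]
```

With a three-element list, offsets 0 to 2 and -3 to -1 address elements, and every other offset yields null rather than wrapping around.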

Compound References

References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

Field references are in the same order they are defined in their schema. For example, let’s consider the following schema:

column a:
  struct<
    b: list<
      struct<
        c: map<string, 
          struct<
            x: i32>>>>>

If we want to represent the SQL expression:

a.b[2].c['my_map_key'].x

We need to declare the nested field reference as the following sequence:

Struct field reference a
Struct field b
List offset 2
Struct field c
Map key my_map_key
Struct field x

Or more formally in Protobuf Text, we get:

selection {
  direct_reference {
    struct_field {
      field: 0 # .a
      child {
        struct_field {
          field: 0 # .b
          child {
            list_element {
              offset: 2
              child {
                struct_field {
                  field: 0 # .c
                  child {
                    map_key {
                      map_key {
                        string: "my_map_key" # ['my_map_key']
                      }
                      child {
                        struct_field {
                          field: 0 # .x
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  root_reference { }
}
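
To make the traversal concrete, the following Python sketch walks the same sequence of segments over an in-memory value. It is illustrative only: the tuple-based segment encoding, the `resolve` function, and the sample data (including the value 42) are assumptions made for this example, not part of Substrait. Structs are modeled as positional tuples because field names are not used inside a plan.

```python
from typing import Any, List, Tuple

# Hypothetical encoding of one reference segment: (kind, argument).
Segment = Tuple[str, Any]


def resolve(value: Any, segments: List[Segment]) -> Any:
    """Apply each segment in order, starting from the root record."""
    for kind, arg in segments:
        if kind == "struct_field" and isinstance(value, tuple):
            value = value[arg]   # ordinal position within a struct
        elif kind == "list_element" and isinstance(value, list):
            value = value[arg]   # zero-based list offset
        elif kind == "map_key" and isinstance(value, dict):
            value = value[arg]   # exact match against a map key
        else:
            raise ValueError(f"cannot apply {kind} to {type(value).__name__}")
    return value


# Sample record shaped like the schema above; column a sits at position 0.
ROW = (
    (                                   # a: struct<b>
        [                               # b: list<struct<c>>
            ({},),
            ({},),
            ({"my_map_key": (42,)},),   # c: map<string, struct<x>>
        ],
    ),
)

# a.b[2].c['my_map_key'].x expressed as a sequence of segments.
PATH: List[Segment] = [
    ("struct_field", 0),          # .a
    ("struct_field", 0),          # .b
    ("list_element", 2),          # [2]
    ("struct_field", 0),          # .c
    ("map_key", "my_map_key"),    # ['my_map_key']
    ("struct_field", 0),          # .x
]

result = resolve(ROW, PATH)
```

Each segment consumes the value produced by the previous one, which matches the nesting of `child` messages in the Protobuf text form.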

Validation

References must validate against the schema of the record being referenced. A reference that does not validate is expected to produce an error.
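
One way to picture this check is a function that walks the reference through the schema and either returns the resulting type or fails. The sketch below is an assumption-laden illustration, not Substrait's validator: the type encoding (strings for primitives, tuples for compound types) and the name `reference_type` are hypothetical.

```python
from typing import Any, List, Tuple

# Hypothetical type encoding: a primitive is a string such as "i32";
# compound types are ("struct", [field types]), ("list", element type),
# or ("map", key type, value type).


def reference_type(schema: Any, segments: List[Tuple[str, Any]]) -> Any:
    """Return the type a reference resolves to, or raise ValueError."""
    current = schema
    for kind, arg in segments:
        tag = current[0] if isinstance(current, tuple) else current
        if kind == "struct_field" and tag == "struct":
            fields = current[1]
            # An ordinal outside the field range makes the plan invalid.
            if not 0 <= arg < len(fields):
                raise ValueError(f"struct ordinal {arg} out of range")
            current = fields[arg]
        elif kind == "list_element" and tag == "list":
            current = current[1]   # element type of the list
        elif kind == "map_key" and tag == "map":
            current = current[2]   # value type of the map
        else:
            raise ValueError(f"{kind} cannot be applied to {tag}")
    return current


# The schema from the compound reference example, wrapped in a root struct
# whose only field is column a.
ROOT = ("struct", [
    ("struct", [                                       # a
        ("list", ("struct", [                          # b
            ("map", "string", ("struct", ["i32"])),    # c -> struct<x: i32>
        ])),
    ]),
])

PATH = [
    ("struct_field", 0), ("struct_field", 0), ("list_element", 2),
    ("struct_field", 0), ("map_key", "my_map_key"), ("struct_field", 0),
]

resolved = reference_type(ROOT, PATH)
```

Here `resolved` is the type of `x`. Applying a struct ordinal outside the field range, or a segment kind that does not match the current type, raises `ValueError`.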

Masked Complex Expression

A masked complex expression selects a portion of a complex record. It allows a user to specify which parts of the complex object to consume. Consider the following schema. Structs are written here as lists of fields, as they are generally in Substrait, because field names are not used internally:

struct:
  - struct:
    - integer
    - list:
      struct:
        - i32
        - string
        - string
    - i32
  - i16
  - i32
  - i64

Given this schema, you could declare a mask of fields to include in pseudocode, such as:

0:[0,1:[..5:[0,2]]],2,3

OR

- 0:
  - 0
  - 1:
    ..5:
      - 0
      - 2
- 2
- 3

This mask states that we would like to include fields 0, 2, and 3 at the top level. Within field 0, we want to include subfields 0 and 1. For subfield 0.1, we want to include at most the first 5 records in the array, and only fields 0 and 2 of the struct within that array. The resulting schema would be:

struct:
  - struct:
    - integer
    - list:
      struct: 
        - i32
        - string
  - i32
  - i64
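
The schema transformation above can be reproduced with a short Python sketch. The encoding is hypothetical and chosen only for illustration: a struct mask is a dict from field ordinal to a child mask (`None` keeps the field whole), and a list mask is a tuple `("first", count, child_mask)`. Unwrapping is not modeled here.

```python
from typing import Any

# Hypothetical type encoding: strings for primitives, ("struct", [fields])
# and ("list", element) for compound types.


def mask_type(type_: Any, mask: Any) -> Any:
    """Return the schema that remains after applying a mask."""
    if mask is None:
        return type_                       # keep this subtree unchanged
    tag = type_[0] if isinstance(type_, tuple) else type_
    if tag == "struct":
        fields = type_[1]
        # Keep only the selected ordinals, in schema order.
        return ("struct", [mask_type(fields[i], sub)
                           for i, sub in sorted(mask.items())])
    if tag == "list":
        _kind, _count, child = mask        # the count limits rows, not the type
        return ("list", mask_type(type_[1], child))
    raise ValueError(f"cannot apply a mask to {tag}")


SCHEMA = ("struct", [
    ("struct", [
        "integer",
        ("list", ("struct", ["i32", "string", "string"])),
        "i32",
    ]),
    "i16",
    "i32",
    "i64",
])

# 0:[0,1:[..5:[0,2]]],2,3
MASK = {
    0: {0: None, 1: ("first", 5, {0: None, 2: None})},
    2: None,
    3: None,
}

masked = mask_type(SCHEMA, MASK)
```

The element limit of 5 affects how many list entries are returned but leaves the list's type untouched, which is why it does not appear in the resulting schema.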

Unwrapping Behavior

By default, when only a single field is selected from a struct, that struct is removed. Likewise, when only a single element is selected from a list, the list is removed. A user can also configure the mask to avoid unwrapping in these cases. [TBD how we express this in the serialization formats.]

Discussion Points

Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)