Skip to content

Field References

In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.

Field references can originate from different root types:

  • RootReference: References the incoming record from the relation
  • OuterReference: References outer query records in correlated subqueries, supporting either offset-based (steps_out) or id-based (rel_reference) resolution (see Outer References)
  • Expression: References the result of evaluating an expression
  • LambdaParameterReference: References lambda parameters within lambda body expressions (see Lambda Expressions)
Reference Type Properties Type Applicability Type return
Struct Field Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. struct Type of field referenced
Array Value Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). list type of list
Array Slice Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. list Same type as original list
Map Key A map value that is matched exactly against available map keys and returned. map Value type of map
Map KeyExpression A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] map List of map value type
Masked Complex Expression An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. any any

Compound References

References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

Field references are in the same order they are defined in their schema. For example, let’s consider the following schema:

column a:
  struct<
    b: list<
      struct<
        c: map<string, 
          struct<
            x: i32>>>>>

If we want to represent the SQL expression:

a.b[2].c['my_map_key'].x

We will need to declare the nested field such that:

Struct field reference a
Struct field b
List offset 2
Struct field c
Map key my_map_key
Struct field x

Or more formally in Protobuf Text, we get:

selection {
  direct_reference {
    struct_field {
      field: 0 # .a
      child {
        struct_field {
          field: 0 # .b
          child {
            list_element {
              offset: 2
              child {
                struct_field {
                  field: 0 # .c
                  child {
                    map_key {
                      map_key {
                        string: "my_map_key" # ['my_map_key']
                      }
                      child {
                        struct_field {
                          field: 0 # .x
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  root_reference { }
}

Validation

References must validate against the schema of the record being referenced. If not, an error is expected.

Masked Complex Expression

A masked complex expression is used to do a subselection of a portion of a complex record. It allows a user to specify the portion of the complex object to consume. Imagine you have a schema of (note that structs are lists of fields here, as they are in general in Substrait as field names are not used internally in Substrait):

struct:
  - struct:
    - integer
    - list:
      struct:
        - i32
        - string
        - string
     - i32
  - i16
  - i32
  - i64

Given this schema, you could declare a mask of fields to include in pseudocode, such as:

0:[0,1:[..5:[0,2]]],2,3

OR

0:
  - 0
  - 1:
    ..5:
      -0
      -2
2
3

This mask states that we would like to include fields 0 2 and 3 at the top-level. Within field 0, we want to include subfields 0 and 1. For subfield 0.1, we want to include up to only the first 5 records in the array and only includes fields 0 and 2 within the struct within that array. The resulting schema would be:

struct:
  - struct:
    - integer
    - list:
      struct: 
        - i32
        - string
  - i32
  - i64

Unwrapping Behavior

By default, when only a single field is selected from a struct, that struct is removed. When only a single element is removed from a list, the list is removed. A user can also configure the mask to avoid unwrapping in these cases. [TBD how we express this in the serialization formats.]

Discussion Points
  • Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)

Outer References

Outer references allow expressions inside a subquery to access records from an enclosing relation. The OuterReference root type supports two mutually exclusive resolution strategies:

steps_out (offset-based)

steps_out resolves the reference by counting subquery boundaries upward (steps_out >= 1). This works correctly whenever the plan is a tree, i.e., when each relation has exactly one parent, the path to the binding relation can be uniquely determined via steps_out.

rel_reference (id-based)

rel_reference resolves the reference by referring to the binding relation via its plan-wide unique RelCommon.rel_anchor. The rel_anchor on the referenced relation must be set (>= 1) and unique across all relations in the plan.

Coexistence rules

Exactly one of steps_out or rel_reference must be set on each OuterReference. A single plan may contain outer references using different strategies (e.g., some using steps_out and others using rel_reference), as long as every individual reference is unambiguous. However, if any shared relation (via ReferenceRel) contains an unresolved outer reference, that reference must use rel_reference.

When to use rel_reference

rel_reference must be used instead of steps_out when an outer reference appears inside a relation shared via ReferenceRel and the shared relation can be reached through multiple paths with different subquery depths, making steps_out ambiguous. In this case, the same outer reference could require different steps_out values depending on which path is followed.

For example, consider a plan with two nested scalar subqueries that share a common relation x. The outer reference to tableA.a lives inside x, which is reached via paths of different depth:

PlanRel.relations[0].rel:  # let's call it 'x'
FilterRel(a > outer_ref(steps_out=1, tableA.a)) # steps_out 1 or 2?
  └── ReadRel(tableB)

PlanRel.relations[1].root:
ProjectRel # Correct binding for tableA.a for the outer reference tableA.a in x.
├── ReadRel(tableA)
└── Subquery.Scalar # Subquery (1)
    └── SetRel(MINUS_PRIMARY)
        ├── ProjectRel
        |   └── Subquery.Scalar # Subquery (2)
        │       └── ReferenceRel(0) # Here steps_out=1 binds incorrectly, because tableA.a is actually two subquery boundaries out.
        └── ReferenceRel(0) # Here steps_out=1 binds correctly, because tableA.a is one subquery boundary out.

The same shared relation x contains a single stored steps_out=1 outer reference, but that value is only correct for one of its uses. The other use would need steps_out=2, so offset-based resolution is ambiguous.

With rel_reference, both reference rels can unambiguously refer to the correct binding.

PlanRel.relations[0].rel:  # let's call it 'x'
FilterRel(a > outer_ref(rel_reference=7, tableA.a))
  └── ReadRel(tableB)

PlanRel.relations[1].root:
ProjectRel [rel_anchor=7] # Correct binding for tableA.a for the outer reference tableA.a in x.
├── ReadRel(tableA)
└── Subquery.Scalar # Subquery (1)
    └── SetRel(MINUS_PRIMARY)
        ├── ProjectRel
        |   └── Subquery.Scalar # Subquery (2)
        │       └── ReferenceRel(0) # Reference 1: rel_reference = 7
        └── ReferenceRel(0) # Reference 2: rel_reference = 7