Skip to content

Extensions

In many cases, the existing objects in Substrait will be sufficient to accomplish a particular use case. However, it is sometimes helpful to create a new data type, scalar function signature or some other custom representation within a system. For that, Substrait provides a number of extension points.

Simple Extensions

Some kinds of primitives are so frequently extended that Substrait defines a standard YAML format that describes how the extended functionality can be interpreted. This allows different projects/systems to use the YAML definition as a specification so that interoperability isn’t constrained to the base Substrait specification. The main types of extensions that are defined in this manner include the following:

  • Data types
  • Type variations
  • Scalar Functions
  • Aggregate Functions
  • Window Functions
  • Table Functions

To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required urn field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the urn: prefix), they will be referred to as extension URNs for clarity.

This extension URN uses the format extension:<OWNER>:<ID>, where:

  • OWNER represents the organization or entity providing the extension and should follow reverse domain name convention (e.g., io.substrait, com.example, org.apache.arrow) to prevent name collisions
  • ID is the specific identifier for the extension (e.g., functions_arithmetic, custom_types)

The YAML file is constructed according to the YAML Schema. Each definition in the file corresponds to the YAML-based serialization of the relevant data structure. If a user only wants to extend one of these types of objects (e.g. types), a developer does not have to provide definitions for the other extension points.

A Substrait plan can reference one or more YAML files via their extension URN. In the places where these entities are referenced, they will be referenced using an extension URN + name reference. Each extension entity (type, type variation, or function) is assigned an anchor value, which is a non-negative integer starting from 0. The anchor value 0 is valid and can be used to reference extension entities, but prefer non-zero values for ergonomics. The name scheme per type works as follows:

Category Naming scheme
Type The name as defined on the type object.
Type Variation The name as defined on the type variation object.
Function Signature A function signature as described below.

Referencing User-Defined Types

Within YAML extension files, user-defined types must be referenced with the u! prefix followed by the type name (e.g., u!point) in function arguments and return types:

# User-defined type example: a point type with two scalar functions
urn: extension:example:point_type
types:
  - name: "point"

scalar_functions:
  - name: "lat"
    impls:
      - args:
        - name: p
          value: u!point
        return: fp64
  - name: "lon"
    impls:
      - args:
        - name: p
          value: u!point
        return: fp64

Built-in types like i32, string, or list (as in list<fp64>) do not use any prefix.

A YAML file can also reference types and type variations defined in another YAML file. To do this, it must declare the extension it depends on using a key-value pair in the dependencies key, where the value is the extension URN, and the key is a valid identifier that can then be used as an identifier-safe alias for the extension URN. This alias can then be used as a .-separated namespace prefix wherever a type class or type variation name is expected. Note that user-defined types still require the u! prefix when referenced via namespace aliases (e.g., ext.u!point).

Grammar

The grammar for referencing user-defined types is (in ABNF):

udt-reference = [dependency-alias "."] "u!" type-name

For example, if the extension with extension URN extension:io.substrait:extension_types defines a user-defined type called point, a different YAML file can use the type in a function declaration as follows:

urn: extension:example:distance_functions
dependencies:
  ext: extension:io.substrait:extension_types
scalar_functions:
- name: distance
  description: The distance between two points.
  impls:
  - args:
    - name: a
      value: ext.u!point
    - name: b
      value: ext.u!point
    return: f64

Here, the choice for the name ext is arbitrary, as long as it does not conflict with anything else in the YAML file.

Function Signature

A YAML file may contain one or more functions with the same name, each with one or more implementations (impls). A specific function implementation within a YAML file can be identified using a Function Signature which consists of two components:

  • Function Name: the name of the function
  • Argument Signature: the short type names of each argument joined with underscores

These are combined with a colon separator.

The resulting function signatures look like: <function_name>:<short_arg_type0>_<short_arg_type1>_..._<short_arg_typeN>

Grammar

The formal grammar for function signatures (in ABNF):

function-signature = function-name ":" argument-signature
argument-signature = short-arg-type *("_" short-arg-type)

Argument types (short_arg_type) are encoded using the Type Short Names given below.

Variadic Functions

For variadic functions, the variadic argument is included once in the argument signature.

Uniqueness Constraint

A function signature uniquely identifies a function implementation within a single YAML file. As such, every function implementation within a YAML must have a distinct function signature in order for references to the implementation to remain unambiguous. A YAML file in which this is not the case is invalid.

Type Short Names

Argument Type Signature Name
Required Enumeration req
i8 i8
i16 i16
i32 i32
i64 i64
fp32 fp32
fp64 fp64
string str
binary vbin
boolean bool
timestamp ts
timestamp_tz tstz
date date
time time
interval_year iyear
interval_day iday
interval_compound icompound
uuid uuid
fixedchar<N> fchar
varchar<N> vchar
fixedbinary<N> fbin
decimal<P,S> dec
precision_time<P> pt
precision_timestamp<P> pts
precision_timestamp_tz<P> ptstz
struct<T1,T2,…,TN> struct
list<T> list
map<K,V> map
func<T->R>, func<(T1,…,TN)->R> func
any[\d]? any
user-defined type <name> u!<name>

Examples

Function Signature Function Name
add(optional enumeration, i8, i8) => i8 add:i8_i8
avg(fp32) => fp32 avg:fp32
extract(required enumeration, timestamp) => i64 extract:req_ts
sum(any1) => any1 sum:any
concat(str...) => str concat:str
transform(list<any1>, func<any1 -> any2>) => list<any2> transform:list_func

Any Types

# Example showing the 'any' type - arguments can be of any type
urn: extension:example:any_type
scalar_functions:
- name: foo
  impls:
  - args:
    - name: a
      value: any
    - name: b
      value: any
    return: int64

The any type indicates that the argument can take any possible type. In the foo function above, arguments a and b can be of any type, even different ones in the same function invocation.

# Example showing the 'any1' type - arguments must be of the same type
urn: extension:example:any1_type
scalar_functions:
- name: bar
  impls:
  - args:
    - name: a
      value: any1
    - name: b
      value: any1
    return: int64
The any[\d] types (i.e. any1, any2, …, any9) impose an additional restriction. Within a single function invocation, all any types with same numeric suffix must be of the same type. In the bar function above, arguments a and b can have any type as long as both types are the same.

Extension Metadata

Extensibility is a core principle of Substrait. To ensure that the extension mechanism itself remains extensible, extension files support an optional metadata field that can contain arbitrary data created by the extension author. If you find that the standard YAML schema lacks a field you need, the metadata field provides a forward-compatible way to add it without waiting for schema changes.

This field is available at multiple levels to provide flexibility:

  • Top-level: Metadata about the extension file itself
  • Type definitions: Metadata about custom types
  • Functions: Metadata about functions (scalar, aggregate, and window functions)

Example:

# Example showing the metadata field at multiple levels
urn: extension:io.substrait:metadata_examples
metadata:
  version: 2.0
  maintainer: example-team
types:
  - name: point
    structure:
      latitude: i32
      longitude: i32
    metadata:
      coordinate_system: "WGS84"
scalar_functions:
  - name: "not_equal"
    impls:
      - args:
          - value: any1
            name: x
          - value: any1
            name: y
        return: boolean
    metadata:
      performance_hint: "vectorized"
      cost_estimate: 1

Consumers of extension files are not required to understand or validate metadata fields.

Advanced Extensions

Advanced extensions provide a way to embed custom functionality that goes beyond the standard YAML-based simple extensions. Unlike simple extensions, advanced extensions allow arbitrary, custom schemas. In the Protocol Buffers implementation, the google.protobuf.Any type is used to embed arbitrary extension data directly into Substrait messages.

How Advanced Extensions Work

Advanced extensions come in several main forms, discussed below:

  1. Embedded extensions: These use the AdvancedExtension message for adding custom data to existing Substrait messages
  2. Custom read/write types: For defining new ways to read from or write to data sources
  3. Custom relation types: For defining entirely new relational operations

Embedded Extensions via AdvancedExtension

The simplest forms of advanced extensions use the AdvancedExtension message, which contains two types of extensions:

    message AdvancedExtension {
      // An optimization is helpful information that don't influence semantics. May
      // be ignored by a consumer.
      repeated google.protobuf.Any optimization = 1;

      // An enhancement alter semantics. Cannot be ignored by a consumer.
      google.protobuf.Any enhancement = 2;

    }

Enhancements vs Optimizations

  • Use optimizations for performance hints that don’t change semantics and can be safely ignored.
  • Use enhancements for semantic changes that must be understood by consumers or the plan cannot be executed correctly.

Optimizations

  • Provide hints to improve performance but don’t change the meaning of operations
  • Can be safely ignored by consumers that don’t understand them
  • Multiple optimizations can be attached to a single message
  • Examples: memory usage hints, preferred algorithms, caching strategies

Enhancements

  • Modify the semantic behavior of operations
  • Must be understood by consumers, or else the plan cannot be executed correctly
  • Only one enhancement per message
  • Examples: specialized join conditions (e.g. fuzzy matching, geospatial)

Enhancement Constraints

Semantic-changing extensions shouldn’t change the core characteristics of the underlying relation. For example, they should avoid changing the default direct output field ordering or the number of fields output. If one needs to change one of these behaviors, one should define a new relation as described in Custom Relations.

Where AdvancedExtension Messages Can Be Used

The AdvancedExtension message can be attached to various parts of a Substrait plan:

Location Usage
Plan Global extensions affecting the entire plan
RelCommon Extensions for any relational operator
Relations (e.g. ProjectRel) Extensions for a specific relation type
Hints Extensions within optimization hints
ReadRel.NamedTable Custom metadata to named table references
ReadRel.LocalFiles Custom metadata to local file sources
WriteRel.NamedObjectWrite Custom metadata to write targets
DdlRel.NamedObjectWrite Custom metadata to DDL targets

Custom Read and Write Types

The second form of advanced extensions allows you to define extension data sources and destinations:

Extension Type Description Examples
ReadRel.ExtensionTable Define new table source types APIs, specialized formats
WriteRel.ExtensionObject Define new write destination types APIs, specialized formats
DdlRel.ExtensionObject Define new DDL destination types Catalogs, schema registries

Consider Core Specification First

Before implementing custom read/write types as extensions, consider checking with the Substrait community. If your scenario turns out to be common enough, it may be more appropriate to add it directly to the specification rather than as an extension.

Custom Relations

The third form of advanced extensions provides entirely new relational operations via dedicated extension relation types. These allow you to define custom relations while maintaining proper integration with the type system:

Relation Type Description Examples
ExtensionLeafRel Custom relations with no inputs Custom table sources
ExtensionSingleRel Custom relations with one input Custom relational transformations
ExtensionMultiRel Custom relations with multiple inputs Custom joins

These extension relations are first-class relation types in Substrait and can be used anywhere a standard relation would be used.

When to Use What

Custom relations are the most flexible option, but also the least interoperable. Prefer enhancements to existing relations when they can express your use case, since this preserves existing patterns and compatibility. As a general rule, choose the least powerful extension mechanism that solves the problem.