
Specification

Status

The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed, all deprecated fields will be eliminated and version 1.0 will be released.

Components (Complete)

| Section | Description |
| --- | --- |
| Simple Types | A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound). |
| Compound Types | Expression of types that go beyond simple scalar values. Key concepts here include configurable types, such as fixed-length and numeric types, as well as compound types such as structs, maps, lists, etc. |
| Type Variations | Physical variations to base types. |
| User Defined Types | Extensions that can be defined for specific IR producers/consumers. |
| Field References | Expressions to identify which portions of a record should be operated on. |
| Scalar Functions | Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc. |
| Scalar Function List | A list of well-known canonical functions in YAML format. |
| Specialized Record Expressions | Specialized expression types that are more naturally expressed outside the function paradigm. Examples include if/then/else and switch statements. |
| Aggregate Functions | Functions that are expressed in aggregation operations. Examples include SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value. |
| Window Functions | Functions that relate a record to a set of encompassing records. Examples in SQL include RANK, NTILE, etc. |
| User Defined Functions | Reusable named functions that are built beyond the core specification. Implementations are typically registered through external means (drop a file in a directory, send a special command with the implementation, etc.). |
| Embedded Functions | Function implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations. |
| Relation Basics | Basic concepts around relational algebra, record emit and properties. |
| Logical Relations | Common relational operations used in compute plans including project, join, aggregation, etc. |
| Text Serialization | A human-producible and consumable representation of the plan specification. |
| Binary Serialization | A high-performance and compact binary representation of the plan specification. |
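
The aggregate and window function descriptions above differ mainly in output shape: an aggregate collapses many records into one value, while a window function returns one value per input record, computed relative to the surrounding set. The sketch below illustrates that distinction in plain Python. The function names `sum_aggregate` and `rank_window` are hypothetical and are not part of the specification; the ranking follows SQL-style ascending RANK semantics as an assumption.

```python
from typing import List


def sum_aggregate(values: List[int]) -> int:
    """Aggregate: many records in, a single value out (like SQL SUM)."""
    total = 0
    for v in values:
        total += v
    return total


def rank_window(values: List[int]) -> List[int]:
    """Window: one output per input record, relative to the whole set.

    Assumed semantics: SQL-style ascending RANK, where ties share a rank
    and the following rank is skipped.
    """
    ordered = sorted(values)
    # The first position of a value in sorted order, plus one, is its rank.
    return [ordered.index(v) + 1 for v in values]


scores = [30, 10, 30, 20]
total = sum_aggregate(scores)   # one value for the whole input
ranks = rank_window(scores)     # one value per input record
```

Note that `rank_window` preserves the input order and length, whereas `sum_aggregate` discards both.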

Components (Designed but not Implemented)

| Section | Description |
| --- | --- |
| Table Functions | Functions that convert one or more values from an input record into 0..N output records. Examples include operations such as explode, pos-explode, etc. |
| User Defined Relations | Installed and reusable relational operations customized to a particular platform. |
| Embedded Relations | Relational operations where plans contain the “machine code” to directly execute the necessary operations. |
| Physical Relations | Specific execution sub-variations of common relational operations, where a single logical operation has multiple unique physical variants. Examples include hash join, merge join, nested loop join, etc. |
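
To make the "0..N output records" behavior of table functions concrete, here is a minimal sketch of explode and pos-explode over dictionary records. The helper names, the `Record` alias, and the `pos` output column are assumptions chosen for illustration; they do not reflect how the specification encodes these operations.

```python
from typing import Any, Dict, Iterator

Record = Dict[str, Any]  # hypothetical in-memory record representation


def explode(record: Record, column: str) -> Iterator[Record]:
    """Emit one output record per element of the list stored in `column`.

    An empty list yields zero output records.
    """
    for item in record[column]:
        out = dict(record)   # copy the remaining fields unchanged
        out[column] = item   # replace the list with a single element
        yield out


def pos_explode(record: Record, column: str) -> Iterator[Record]:
    """Like explode, but also emit the element's position (assumed name: 'pos')."""
    for pos, item in enumerate(record[column]):
        out = dict(record)
        out[column] = item
        out["pos"] = pos
        yield out


rows = list(explode({"id": 1, "tags": ["a", "b"]}, "tags"))
empty = list(explode({"id": 2, "tags": []}, "tags"))
positioned = list(pos_explode({"id": 1, "tags": ["a", "b"]}, "tags"))
```

A single input record can therefore produce several output records, or none at all when the list is empty.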