Specification

Status

The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed, all deprecated fields will be eliminated and version 1.0 will be released.

Components (Complete)

Section	Description
Simple Types	A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound).
Compound Types	Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc.
Type Variations	Physical variations to base types.
User Defined Types	Extensions that can be defined for specific IR producers/consumers.
Field References	Expressions to identify which portions of a record should be operated on.
Scalar Functions	Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc.
Scalar Function List	A list of well-known canonical functions in YAML format.
Specialized Record Expressions	Specialized expression types that are more naturally expressed outside the function paradigm. Examples include items such as if/then/else and switch statements.
Aggregate Functions	Functions that are expressed in aggregation operations. Examples include things such as SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value.
Window Functions	Functions that relate a record to a set of encompassing records. Examples in SQL include RANK, NTILE, etc.
User Defined Functions	Reusable named functions that are built beyond the core specification. Implementations are typically registered through external means (dropping a file in a directory, sending a special command with the implementation, etc.).
Embedded Functions	Function implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations.
Relation Basics	Basic concepts around relational algebra, record emit and properties.
Logical Relations	Common relational operations used in compute plans including project, join, aggregation, etc.
Text Serialization	A human producible & consumable representation of the plan specification.
Binary Serialization	A high performance & compact binary representation of the plan specification.
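
The difference between the Aggregate Functions and Window Functions entries above can be illustrated with a minimal sketch. It uses plain Python rather than any plan format, and the record layout, field names, and helper names (`aggregate_sum`, `window_rank`) are hypothetical, chosen only for illustration. An aggregate collapses many records into one value per group, while a window function keeps every record and relates it to the records around it.

```python
from collections import defaultdict

# Hypothetical input records; the field names are illustrative assumptions.
records = [
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 30},
    {"dept": "b", "salary": 20},
]


def aggregate_sum(rows, key, value):
    """Collapse many records into a single value per group (SUM-like)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def window_rank(rows, partition_key, order_key):
    """Keep every record and attach its rank within its partition (RANK-like).

    Ranks are assigned in descending order of order_key; ties share a rank.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    out = []
    for part in partitions.values():
        ordered = sorted(part, key=lambda r: r[order_key], reverse=True)
        rank = 0
        prev = object()  # sentinel that never equals a real value
        for i, row in enumerate(ordered, start=1):
            if row[order_key] != prev:
                rank = i
                prev = row[order_key]
            out.append({**row, "rank": rank})
    return out


sums = aggregate_sum(records, "dept", "salary")
ranked = window_rank(records, "dept", "salary")
```

Here `sums` holds one entry per department, whereas `ranked` has the same number of records as the input, each extended with a `rank` field.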

Components (Designed but not Implemented)

Section	Description
Table Functions	Functions that convert one or more values from an input record into 0..N output records. Examples include operations such as explode, pos-explode, etc.
User Defined Relations	Installed and reusable relational operations customized to a particular platform.
Embedded Relations	Relational operations where plans contain the “machine code” to directly execute the necessary operations.
Physical Relations	Specific execution sub-variations of common relational operations, for cases where a single logical operation has multiple unique physical variants. Examples include hash join, merge join, nested loop join, etc.
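
As a rough illustration of the Table Functions entry above, the following sketch shows how a single input record can yield 0..N output records. It is plain Python, not part of the specification; the function names, the `pos` output field, and the record layout are assumptions made for the example.

```python
def explode(record, list_field):
    """Yield one output record per element of record[list_field].

    An empty list yields no records, which is the "0" in 0..N.
    """
    for item in record[list_field]:
        out = {k: v for k, v in record.items() if k != list_field}
        out[list_field] = item
        yield out


def pos_explode(record, list_field):
    """Like explode, but also emit each element's position.

    The name of the position field ("pos") is an assumption.
    """
    for pos, item in enumerate(record[list_field]):
        out = {k: v for k, v in record.items() if k != list_field}
        out["pos"] = pos
        out[list_field] = item
        yield out


# Hypothetical input record.
row = {"id": 1, "tags": ["x", "y"]}
exploded = list(explode(row, "tags"))
positioned = list(pos_explode(row, "tags"))
empty = list(explode({"id": 2, "tags": []}, "tags"))
```

A record with two list elements produces two output records, and a record with an empty list produces none.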