Show HN: Skyline – A Go Library for Efficient Multi-Dimensional Skyline Queries

3 months ago 1

A Go library for skyline queries — multi-dimensional optimization to find the set of Pareto-optimal points from a dataset.

Supports both static and dynamic skyline computation with multiple algorithms.

Skyline queries identify the subset of points in a multi-dimensional dataset that are not dominated by any other point.
A point A dominates B if A is as good or better than B in all dimensions and strictly better in at least one dimension.

This is useful in scenarios like:

Finding products that are best across multiple criteria (price, performance, battery life)
Multi-criteria decision making in finance, logistics, recommendation systems, etc.

The result, called the skyline set, represents the Pareto-optimal front of your data.

What does this library do?

Compute skyline points from static datasets using multiple algorithms (Block Nested Loop, Divide & Conquer, SkyTree)
Support dynamic updates: insert, delete, and update points incrementally without recomputing from scratch
Allow flexible dimension selection and preference (minimize/maximize per dimension)
Provide a simple, idiomatic Go API for both static and dynamic skyline queries

go get github.com/gkoos/skyline

type Point map[string]float64 type Preference map[string]Order type Order int const ( Min Order = iota Max )

func Skyline(points []Point, dims []string, prefs Preference, algo string) ([]Point, error)

Computes the skyline from a static dataset.

points: input points
dims: dimensions to consider
prefs: preferences per dimension (Min or Max)
algo: algorithm to use ("bnl", "dnc", "skytree")

package main import ( "fmt" "github.com/yourname/skyline" ) func main() { points := []skyline.Point{ {"price": 400, "battery": 10}, {"price": 500, "battery": 12}, {"price": 300, "battery": 9}, {"price": 450, "battery": 11}, } prefs := skyline.Preference{ "price": skyline.Min, "battery": skyline.Max, } // Static skyline result, err := skyline.Skyline(points, []string{"price", "battery"}, prefs, "dnc") if err != nil { panic(err) } fmt.Println("Static skyline:", result) // Dynamic skyline engine, err := skyline.DynamicSkyline(points, []string{"price", "battery"}, prefs, "dnc") if err != nil { panic(err) } newPoint := skyline.Point{"price": 420, "battery": 15} engine.Insert(newPoint) fmt.Println("After insert:", engine.Skyline()) updatedPoint := skyline.Point{"price": 460, "battery": 14} engine.Update(newPoint, updatedPoint) fmt.Println("After update:", engine.Skyline()) engine.Delete(updatedPoint) fmt.Println("After delete:", engine.Skyline()) }

Simple, intuitive algorithm
Compares each point with all others to find dominating points
Works well for small datasets and supports incremental updates easily
In dynamic mode, we always use this algorithm

Recursively divides data into smaller subsets, computes skylines, and merges results
More efficient than BNL for larger datasets
Static algorithm, dynamic extension is complex

Advanced algorithm using tree structures to prune comparisons
Scales well with high-dimensional and large datasets
Designed primarily for static datasets

The SkyTree implementation in this library includes several advanced optimizations for performance and scalability:

Advanced Pivot Selection: Uses median or custom pivot selection to improve partitioning and pruning efficiency
Parallelization: SkyTree uses parallelism in two main phases:
- Parallel Recursion: When the number of partitions (regions) exceeds a threshold, recursive calls for each partition are executed in parallel using goroutines. This allows the algorithm to process different branches of the tree concurrently, greatly speeding up computation on multicore systems.
- Parallel Merge: After recursion, the partial skylines from each partition are merged in parallel using a pairwise, multi-stage approach. At each stage, pairs of skylines are merged concurrently, reducing the total merge time to log₂(N) stages for N partitions.
- Worker Pool: Both recursion and merge parallelism are managed by a configurable worker pool, which limits the number of concurrent goroutines to avoid oversubscription and maximize CPU efficiency. The pool size defaults to the number of available CPU cores, but can be tuned for your workload.
Dominance Caching: Caches dominance checks between points to avoid redundant computations, reducing overall work
Slice Reuse: Minimizes memory allocations by reusing slices in recursive calls and helpers
Custom Deduplication: Uses a fast custom join for point keys, improving deduplication speed for large datasets
Configurable Recursion Depth: Allows limiting recursion depth to prevent stack overflow and excessive computation; falls back to BNL if the limit is reached
Small Partition BNL Switch: If the number of points in a partition is lsmall, SkyTree will use the Block Nested Loop (BNL) algorithm for that partition instead of recursing further. This optimization avoids SkyTree's overhead on small datasets, where BNL is typically faster, and can significantly improve performance for workloads with many small partitions. The threshold is tunable; see the configuration section for details.

These optimizations make SkyTree suitable for very large and high-dimensional datasets, balancing speed, memory usage, and accuracy.

Skyline algorithms can be fine-tuned using configuration options to optimize performance and scalability for different dataset sizes and characteristics. Tuning these options allows you to balance speed, memory usage, and accuracy, especially for large or high-dimensional data. Proper configuration is essential to avoid bottlenecks, excessive memory consumption, or incomplete results, and lets you adapt the algorithms to your specific workload and hardware.

No configuration options. BNL is simple and always compares all points; best for small datasets or incremental updates.

Threshold: Minimum number of points in a partition before switching to BNL. Lower values increase recursion, higher values use BNL more often. Tune for your dataset size.
BatchSize: Number of points processed together in each batch. Larger batches can improve cache locality and throughput, but may use more memory.

PivotSelector: Function to choose the pivot point for partitioning. The default is median selection, but you can provide a custom function for domain-specific optimization.
ParallelThreshold: Minimum number of partitions before enabling parallel processing. Lower values increase parallelism, higher values reduce goroutine overhead.
MaxRecursionDepth: Maximum allowed recursion depth. If exceeded, SkyTree falls back to BNL for the remaining data. Prevents stack overflow and excessive computation for very large or complex datasets.
BNLSwitchThreshold: If the number of points in a partition is less than or equal to this threshold, SkyTree will use the Block Nested Loop (BNL) algorithm for that partition instead of recursing further. This improves performance by avoiding SkyTree's overhead on small datasets, where BNL is typically faster. The default is 32, but you can tune this value for your workload and hardware. Lower values reduce BNL usage; higher values make SkyTree switch to BNL more often for small partitions.
WorkerPoolSize: Controls the maximum number of goroutines (workers) used for parallel recursion and merging in SkyTree. Setting this to 0 (the default) will use the number of available CPU cores on your system, which is usually optimal for most workloads. You can set a specific positive value to limit CPU usage or experiment with different levels of parallelism. Increasing this value may improve performance on large, partitionable datasets, but setting it too high can cause oversubscription and reduce efficiency. For most users, leaving it at 0 is recommended.

Refer to the code and examples for how to set these options in your application.

To run all unit tests for algorithms and utilities:

This will execute all correctness and edge case tests for the skyline algorithms, domination logic, and dynamic updates. Tests cover small, large, and high-dimensional datasets, as well as pathological cases.

Benchmarking Skyline Algorithms

To run performance benchmarks and compare algorithms:

go test -bench . ./internal/algorithms

This command runs repeatable benchmarks for all implemented skyline algorithms (BNL, D&C, SkyTree) on a variety of datasets, including small, large, high-dimensional, clustered, and pathological cases.

Algorithm selection guidance:

Block Nested Loop (BNL): Best for small datasets or when incremental updates are needed. Simple, but slow for large or high-dimensional data.
Divide & Conquer (DNC): Generally fastest for large, diverse datasets with a moderate skyline size. Uses more memory, but scales well unless most points are dominated.
SkyTree: Optimized for very large, high-dimensional datasets with many dominated (clustered) points and a small skyline. SkyTree is much faster and more memory-efficient than DNC when the dataset is highly clustered and the skyline is small. For datasets where most points are Pareto-optimal (large skyline), DNC may be faster.

Summary:

Use BNL for small or dynamic datasets.
Use DNC for large, diverse datasets with a moderate skyline.
Use SkyTree for large, high-dimensional, clustered datasets with a small skyline.
Benchmark your own data to select the best algorithm for your use case.