module Lapper

Overview

This module provides a simple data-structure for fast interval searches.

Features

Example query:

      0  1  2  3  4  5  6  7  8  9  10 11
(0,10]X  X  X  X  X  X  X  X  X  X
(2,5]       X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(3,8]          X  X  X  X  X
(5,9]                X  X  X  X
(8,11]                        X  X  X

Query: (8, 11]
Answer: ((0,10], (5,9], (8,11])

Most interaction with this shard will be through the Lapper class. The main methods are Lapper#find and Lapper#seek.

The overlap function for this assumes a zero based genomic coordinate system. So [start, stop) is not inclusive of the stop position for neither queries nor the Intervals.

Lapper does not use an interval tree, instead, it operates on the assumtion that most intervals are of similar length; or, more exactly, that the longest interval in the set is not long compred to the average distance between intervals.

For cases where this holds true (as it often does with genomic data), we can sort by start and use binary search on the starts, accounting for the length of the longest interval. The advantage of this approach is simplicity of implementation and speed. In realistic tests queries returning the overlapping intervals are 1000 times faster than brute force and queries that merely check for the overlaps are > 5000 times faster.

Examples

require "lapper"

# Create some fake data
data = (0..100).step(by: 20).map { |x| Lapper::Interval(Int32).new(x, x + 10, 0) }.to_a

# Create the lapper
lapper = Lapper::Lapper(Int32).new(data)

# Demo `find`
lapper = Lapper::Lapper(Int32).new(data)
lapper.find(5, 11).size == 2

# Demo `seek` - calculate overlap between queries and the found intervals
sum = 0
cursor = 0
(0..10).step(by: 3).each do |i|
  sum += lapper.seek(i, i + 2).map { |iv| Math.min(i + 2, iv.stop) - Math.max(i, iv.start) }.sum
end

Defined in:

lapper.cr

Constant Summary

VERSION = "1.3.0"