module Lapper
Overview
This module provides a simple data-structure for fast interval searches.
Features
- Extremely fast on most genomic datasets.
- Extremly fast on in order queries.
Example query:
0 1 2 3 4 5 6 7 8 9 10 11
(0,10]X X X X X X X X X X
(2,5] X X X
(3,8] X X X X X
(3,8] X X X X X
(3,8] X X X X X
(3,8] X X X X X
(5,9] X X X X
(8,11] X X X
Query: (8, 11]
Answer: ((0,10], (5,9], (8,11])
Most interaction with this shard will be through the Lapper
class.
The main methods are Lapper#find
and Lapper#seek
.
The overlap function for this assumes a zero based genomic coordinate system. So
[start, stop) is not inclusive of the stop position for neither queries nor the
Intervals
.
Lapper does not use an interval tree, instead, it operates on the assumtion that most intervals are of similar length; or, more exactly, that the longest interval in the set is not long compred to the average distance between intervals.
For cases where this holds true (as it often does with genomic data), we can sort by start and use binary search on the starts, accounting for the length of the longest interval. The advantage of this approach is simplicity of implementation and speed. In realistic tests queries returning the overlapping intervals are 1000 times faster than brute force and queries that merely check for the overlaps are > 5000 times faster.
Examples
require "lapper"
# Create some fake data
data = (0..100).step(by: 20).map { |x| Lapper::Interval(Int32).new(x, x + 10, 0) }.to_a
# Create the lapper
lapper = Lapper::Lapper(Int32).new(data)
# Demo `find`
lapper = Lapper::Lapper(Int32).new(data)
lapper.find(5, 11).size == 2
# Demo `seek` - calculate overlap between queries and the found intervals
sum = 0
cursor = 0
(0..10).step(by: 3).each do |i|
sum += lapper.seek(i, i + 2).map { |iv| Math.min(i + 2, iv.stop) - Math.max(i, iv.start) }.sum
end
Defined in:
lapper.crConstant Summary
-
VERSION =
"1.3.0"