During our day-to-day work, we as developers often have to deal with big data or, sometimes, data of unknown size. We parse big files and import what we need, we scrape data from various websites in search of something, or we work with big datasets from our own database to transform them and do something with the result.

To do so efficiently, we have many options to fall back on. Ruby, like most languages, gives us plenty of tools for such situations. When working with files, we can read them line by line, or we can divide the data into smaller chunks with self-written methods and blocks we pass around. But there is another option as well, one that isn't used as often: lazy enumeration.

What is Lazy Enumeration?

Lazy enumeration means that an expression is only evaluated when we actually work with its result. Some languages, like Haskell, are lazy by default, and many others provide a way to evaluate expressions lazily.

With lazy enumeration, it is possible to create a pipeline of transformations that is only processed when needed, and when it is, the whole pipeline runs for one item at a time. A small code example will make this clear:

data = ["one", "two", "three"]
data2 = ["four", "five", "six"]
pipeline = data
  .lazy
  .map { |item| puts "item: #{item}"; item.reverse }
  .take_while { |item| puts "item: #{item}"; item.length < 6 }
  .zip(data2)

p pipeline.class
p pipeline.to_a

When you run this snippet, you will notice that the puts inside the map and the take_while blocks are called one after the other for each item. This means each item flows through the whole pipeline before the next item in the array is processed. Conversely, if you remove the call to the lazy method, each block prints all items to the terminal before the pipeline proceeds to its next step, so the steps are processed sequentially for all items in the array.

Let’s see this ourselves:

Enumerator::Lazy
item: one
item: eno
item: two
item: owt
item: three
item: eerht
[["eno", "four"], ["owt", "five"], ["eerht", "six"]]
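For comparison, here is an eager version of the same transformations, with the lazy call removed. Note how each block now prints all items before the next step runs:

```ruby
data = ["one", "two", "three"]
data2 = ["four", "five", "six"]

# Without lazy, each step runs for the whole array before the next one starts.
result = data
  .map { |item| puts "item: #{item}"; item.reverse }
  .take_while { |item| puts "item: #{item}"; item.length < 6 }
  .zip(data2)

p result
```

This prints item: one, item: two, item: three first, then item: eno, item: owt, item: eerht, and finally the same zipped array as before.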

So, how do we use this to our advantage when we process a lot of data?

Lazy IO reading

When we open a file and access its lines, we in fact get an enumerator. With a call to the lazy method, we can now process them in a more memory-efficient way, like so:

File.open("very-big-file.txt") do |f|
  # Get the first 3 lines containing the word "error"
  f.each_line.lazy.select { |line| line.match(/error/i) }.first(3)
end

The call to each_line generates an Enumerator, which we can then convert into an Enumerator::Lazy. This is especially useful when you are using a pipeline of functions to get at the lines you want. It works for all IO objects that have an each_* operation, so reading from a socket works the same way.
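As a self-contained sketch (using a StringIO to stand in for a file or socket), the same pattern stops reading as soon as it has found enough matches:

```ruby
require 'stringio'

# A StringIO standing in for a file or socket; odd-numbered lines contain "ERROR".
io = StringIO.new((1..100).map { |i| i.odd? ? "ERROR #{i}" : "ok #{i}" }.join("\n"))

# Only as many lines as needed are read and inspected.
errors = io.each_line.lazy
           .select { |line| line.match(/error/i) }
           .map(&:chomp)
           .first(3)

p errors  # => ["ERROR 1", "ERROR 3", "ERROR 5"]
```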

But should we use lazy enumeration by default now? Let's see how the two approaches compare performance-wise:

Lies, more lies, and benchmarks

In many languages that are lazy by default, the performance penalties for being lazy are practically nonexistent. For Ruby, though, this is a little different. Here is a small benchmark that lazily evaluates all the items in a big array:

require 'benchmark/ips'

$x = Enumerator.produce([0, 1]) { |b1, b2| [b1 - b2, b2 + b1] }.take(50_000).to_a

Benchmark.ips do |b|
  b.time = 2
  b.warmup = 1

  b.report("lazy: ") { $x.lazy.map { _1.reverse }.take(50_000).to_a }
  b.report("eager: ") { $x.map { _1.reverse } }

  b.compare!
end


Here are the results for my computer:

Warming up --------------------------------------
             lazy:      6.000  i/100ms
            eager:     24.000  i/100ms
Calculating -------------------------------------
             lazy:      75.752  (± 4.0%) i/s -    156.000  in   2.063805s
            eager:     254.646  (± 6.7%) i/s -    528.000  in   2.083519s

Comparison:
            eager: :      254.6 i/s
             lazy: :       75.8 i/s - 3.36x  (± 0.00) slower

So, the lazy evaluation is more than three times slower than the eager solution. But, what happens if we want to work only on a few items? Let’s adjust the benchmark a little bit and find out:

$x = Enumerator.produce([0, 1]) { |b1, b2| [b1 - b2, b2 + b1] }.take(50_000).to_a

Benchmark.ips do |b|
  b.time = 2
  b.warmup = 1

  b.report("lazy: ") { $x.lazy.map { _1.reverse }.first(5) }
  b.report("eager: ") { $x.map { _1.reverse }.first(5) }

  b.compare!
end


This changes a lot! Here is the output:

Warming up --------------------------------------
             lazy:     23.021k i/100ms
            eager:     22.000  i/100ms
Calculating -------------------------------------
             lazy:     206.116k (±26.3%) i/s -    368.336k in   2.095998s
            eager:     229.351  (± 7.4%) i/s -    462.000  in   2.026127s

Comparison:
             lazy: :   206115.6 i/s
            eager: :      229.4 i/s - 898.69x  (± 0.00) slower

Now, lazy evaluation really pays off! The lazy version is several orders of magnitude faster than the eager solution because it stops iterating once it has found all the items it is looking for, while the eager solution reverses all the entries in the array and only then takes the first five elements.

Benchmark Verdict

This clearly shows that lazy evaluation is comparatively slow when we are working with many, or even most, items in a dataset. But when we are working on a small subset of the data, it can be faster by a huge amount! Of course, this isn't always true: if we are searching for an entry in our dataset and it happens to be the last one, the lazy version will be slower. So we have to be mindful of when to use it.
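A minimal sketch of that worst case: when the only match sits at the very end of the data, the lazy pipeline still has to walk every item, and it pays its per-item overhead on top:

```ruby
data = Array.new(100_000) { |i| i }

# The match is the last element, so laziness cannot short-circuit here:
# both versions inspect all 100,000 items, and the lazy one adds overhead.
lazy_hit  = data.lazy.select { |n| n == 99_999 }.first
eager_hit = data.select { |n| n == 99_999 }.first

p lazy_hit   # => 99999
p eager_hit  # => 99999
```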

Word of warning

Also, we have to be aware of which methods are available on Enumerator::Lazy. For example, #reduce and #each_with_object have no lazy counterpart, because we can not know in advance what type their end result would be; calling them forces the pipeline to be evaluated. If we want to include such a method in our pipeline and continue lazily afterwards, we have to call the lazy method again. Here is a link to the documentation for all the available methods.

Conclusion
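As a small sketch of this pitfall: lazy-preserving methods such as map and take return another Enumerator::Lazy, while a method like reduce immediately consumes the pipeline, so it must only be called once the pipeline is bounded (on an infinite source it would never return):

```ruby
squares = (1..Float::INFINITY).lazy.map { |n| n * n }

# map and take are lazy-preserving: we still have an Enumerator::Lazy here.
p squares.take(3).class  # => Enumerator::Lazy

# reduce is not lazy-preserving: it consumes the (now bounded) pipeline.
p squares.take(3).reduce(:+)  # => 14 (1 + 4 + 9)
```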

Lazy evaluation in Ruby is a great feature when working with a subset of a big dataset. But whether it really pays off for your use case has to be evaluated separately every time.