Even Faster Unnest
Ying Su, Masha Basmanova, Orri Erling
Unnest is a common operation in Facebook’s daily Presto workload. It converts an ARRAY, MAP, or ROW into a flat relation. Its original implementation always performed a deep copy and was very inefficient. In Unnest Operator Performance Enhancement with Dictionary Blocks, the author improved the Unnest operator by up to 10x in CPU and elapsed time by using DictionaryBlock where possible. We went one step further and improved it by another 5-10x.
In this post, we will refer to the PrestoSQL implementation as the "baseline". The JMH benchmark results for different cases are shown below ("before" is the baseline implementation, and "after" is our new implementation):
The optimized Unnest implementation is available in Presto 0.235 and onward and is enabled by default. The JMH benchmark was also expanded to have better coverage and can be found in BenchmarkUnnestOperator.java.
In Facebook's production workload, we observed over a 6x CPU reduction for this operator. The following chart shows the pairwise comparison for all Unnest operators in one of our test runs. Every dot below the diagonal is a win. About 25% of the operators show over a 5x CPU reduction, and some show over a 20x reduction.
The histogram of the after-vs-before CPU ratio shows that most operators have a ratio below 1, meaning they became more efficient.
The following chart shows that the operator's share of total CPU dropped from about 2% to 0.1% after the rollout on May 28. The amount of CPU, though small, still matters at Facebook's scale.
What is the Unnest Operator?
When users have data structured as ARRAY, MAP, or ROW, they sometimes need to flatten it so that the nested values can be treated as top-level columns and sent to downstream operators for easier arithmetic or aggregation processing. An example query is as follows:
SELECT
zoo, animal
FROM (
VALUES
('OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS x (zoo, animals)
CROSS JOIN UNNEST(animals) AS t (animal);
In this example, the zoo column, which is a VARCHAR column, is replicated for each unnested row and is thus called the "replicated column"; the animals column, which is an ARRAY(VARCHAR) column, is unnested into a VARCHAR column named animal and is called an "unnest column":
| zoo | animal |
| --------------- | ------ |
| OaklandZoo | dog |
| OaklandZoo | cat |
| OaklandZoo | tiger |
| SanFranciscoZoo | dog |
| SanFranciscoZoo | cat |
In Presto, relational data is represented internally as a series of Pages. Pages are composed of Blocks, one for each column. For replicated columns, the Unnest operator needs to create new Blocks in which each original row's value is repeated. The baseline implementation achieved this by creating a DictionaryBlock on top of the original Block, thus avoiding expensive deep copies. The same was done for the unnest columns when there was no need to insert nulls: DictionaryBlocks were created, and the array offsets were translated into indices of the underlying block.
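To make this concrete, here is a minimal, self-contained sketch (ours, not the operator's code; the package names follow recent prestodb releases and may differ in older ones) showing how the replicated zoo column from the example above can be expressed as a DictionaryBlock over the original two-row block:
import com.facebook.presto.common.block.Block;
import com.facebook.presto.common.block.BlockBuilder;
import com.facebook.presto.common.block.DictionaryBlock;

import static com.facebook.presto.common.type.VarcharType.VARCHAR;

public class ReplicateExample
{
    public static void main(String[] args)
    {
        // Original two-row block: ["OaklandZoo", "SanFranciscoZoo"]
        BlockBuilder builder = VARCHAR.createBlockBuilder(null, 2);
        VARCHAR.writeString(builder, "OaklandZoo");
        VARCHAR.writeString(builder, "SanFranciscoZoo");
        Block zoos = builder.build();

        // Row 0 unnests into 3 rows and row 1 into 2, so the replicated
        // column is just the index mapping [0, 0, 0, 1, 1].
        int[] ids = {0, 0, 0, 1, 1};
        Block replicated = new DictionaryBlock(ids.length, zoos, ids);

        // Five logical rows, but the underlying strings are never copied.
        System.out.println(replicated.getPositionCount()); // prints 5
    }
}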
What happens when there are multiple unnest columns and their cardinalities do not match? Take a look at this example:
SELECT
zoo,
animal,
employee
FROM (
VALUES
('OaklandZoo', ARRAY['dog', 'cat', 'tiger'], ARRAY['Alice', 'Bob']),
('SanFranciscoZoo', ARRAY['dog', 'cat'], ARRAY['Clara', 'Danny'])
) AS x (zoo, animals, employees)
CROSS JOIN UNNEST(animals, employees) AS t (animal, employee);
The result is:
| zoo | animal | employee |
| --------------- | ------ | -------- |
| OaklandZoo | dog | Alice |
| OaklandZoo | cat | Bob |
| OaklandZoo | tiger | NULL |
| SanFranciscoZoo | dog | Clara |
| SanFranciscoZoo | cat | Danny |
Note that a NULL is inserted in the last column of the third row. In this case, can we still create a DictionaryBlock on the original block? The original nested block does not contain a NULL, so it's impossible to find an index that points to a NULL value and build a DictionaryBlock easily. The baseline implementation simply falls back to deep copying for this case. In the sections below, we explain how we tackled this problem.
Opportunities for Improvements? Yes!
The baseline implementation did a number of things well. It avoided deep copies by using DictionaryBlock, which improved on the original implementation by an order of magnitude for cases without NULL insertions. Can we improve it further? The answer is yes.
Process data column-by-column, not row-by-row
If you read our blog post 5 design choices—and 1 weird trick — to get 2x efficiency gains in Presto repartitioning, you may remember the design choice "Process data column by column, not row by row". The same applies to UnnestOperator. The baseline implementation constructs the output blocks in a row-by-row manner:
for each incoming row
    for each replicated column
        append the repeated values by adding ids to the DictionaryBlock
    for each unnest column
        append the unnested values by adding ids to the DictionaryBlock if possible;
        otherwise append the value to the BlockBuilder
    if there is an ordinality column
        append the ordinality to the BlockBuilder
A key principle of vectorized execution is to use tight loops, which allow the JVM to vectorize the compiled code and the CPU to parallelize the work as much as possible. Processing the data column by column makes this possible. An example of such a loop, building a replicated block, looks like this:
public Block buildOutputBlock(int[] maxEntries, int offset, int length, int totalEntries)
{
    // source (the original Block) and sourcePosition (the next top-level row
    // to replicate) are fields of the enclosing builder, not shown here.
    int[] ids = new int[totalEntries];
    int fromPosition = 0;
    for (int i = 0; i < length; i++) {
        int toPosition = fromPosition + maxEntries[offset + i];
        // repeat the current source row maxEntries[offset + i] times
        Arrays.fill(ids, fromPosition, toPosition, sourcePosition++);
        fromPosition = toPosition;
    }
    return new DictionaryBlock(totalEntries, source, ids);
}
This simple change improved the performance of the NO-NULL case by at least 3x as shown by the JMH benchmarks.
Computation of maxEntries array and batch size
To make the above loop possible, we pre-compute the cardinalities of each row for all unnest columns in the operator and use them to compute the per-row maximum cardinality (we call this the maxEntries array). This way, memory is accessed sequentially and the computation can be reused. These arrays add very little memory overhead, since they only cover top-level rows, of which there are always fewer than 1024 per block.
We then compute the batch size, i.e., the number of top-level rows that can fit into the next output page. Right now we limit the nested row count for each batch to 1000, with a minimum batch size of 1. This means the unnested row count for a block might exceed 1000: for example, if a single top-level row contains 10000 array entries, we still have to output the whole row, which translates into 10000 unnested rows.
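Here is a minimal sketch of these two computations (illustrative only, not the actual operator code; the names computeMaxEntries, computeBatchSize, and MAX_ROWS_PER_BATCH are ours, and we use ColumnarArray as a stand-in for the unnest columns):
import com.facebook.presto.common.block.ColumnarArray;

import java.util.List;

class BatchSizing
{
    private static final int MAX_ROWS_PER_BATCH = 1000;

    // maxEntries[i] = max cardinality across all unnest columns for top-level row i
    static int[] computeMaxEntries(List<ColumnarArray> unnestColumns, int positionCount)
    {
        int[] maxEntries = new int[positionCount];
        for (ColumnarArray column : unnestColumns) {
            for (int i = 0; i < positionCount; i++) {
                maxEntries[i] = Math.max(maxEntries[i], column.getLength(i));
            }
        }
        return maxEntries;
    }

    // Count how many top-level rows, starting at 'start', fit within
    // MAX_ROWS_PER_BATCH unnested rows; always take at least one row, so a
    // single huge array can push a batch past the limit.
    static int computeBatchSize(int[] maxEntries, int start)
    {
        int batchSize = 0;
        int outputRowCount = 0;
        for (int i = start; i < maxEntries.length; i++) {
            if (batchSize > 0 && outputRowCount + maxEntries[i] > MAX_ROWS_PER_BATCH) {
                break;
            }
            outputRowCount += maxEntries[i];
            batchSize++;
        }
        return Math.max(batchSize, 1);
    }
}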
Bonus: No need to create DictionaryBlock at all for some cases!
There are several cases where we can tell that no NULLs need to be inserted:
- When there is only one unnest column (except the ARRAY of ROWs case).
- When there are multiple unnest columns, but their cardinalities are the same for the current batch.
- For a single ARRAY of ROWs unnest column, when the row block doesn't contain any nulls.
For these cases, the result can simply be acquired by using getRegion() to get a view on top of the original leaf block:
public Block buildOutputBlockWithoutNulls(int outputPositionCount)
{
    // getRegion returns a view over the next outputPositionCount positions
    // of the source block; no data is copied
    Block output = source.getRegion(sourcePosition, outputPositionCount);
    sourcePosition += outputPositionCount;
    return output;
}
This doesn’t copy the data, nor does it construct the ids mapping and the DictionaryBlock. Nice!
A better way to copy Blocks
How do we handle the cases where NULLs need to be inserted? We can formulate the problem as a simple question: given a Block without NULLs, how do we insert a NULL efficiently?
There are two ways of doing this:
- Use a special id, e.g. -1, to identify nulls in the DictionaryBlock.
- Copy the blocks and insert a NULL.
Currently, DictionaryBlock doesn’t support the logic in 1). To make it happen, we would need to change all getId() and getIds() calls to handle the special values. While the number of call sites is manageable (about 50), the performance implications would need to be studied carefully.
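For illustration, here is a hypothetical sketch (this is not how DictionaryBlock works today) of the kind of special-casing every read path would need if -1 marked a null:
// Hypothetical: DictionaryBlock does NOT support this today.
boolean isNull(int position)
{
    int id = getId(position);
    // a sentinel id of -1 would mean "null", bypassing the dictionary lookup
    return id == -1 || dictionary.isNull(id);
}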
The baseline implementation took the second approach, using PageBuilder and BlockBuilder to copy the values for this case. However, JMH benchmarks show that they are not very efficient for this purpose. To compare the cost of copying blocks using PageBuilder and BlockBuilder versus doing bulk copies in tight loops, we measured the following two implementations:
- Using PageBuilder and BlockBuilder
@Benchmark
public void copyBlockByAppend(BenchmarkData data)
{
    LongArrayBlockBuilder longArrayBlockBuilder = new LongArrayBlockBuilder(null, POSITIONS_PER_PAGE);
    for (int i = 0; i < BLOCK_COUNT; i++) {
        Block block = data.blocks.get(i);
        int positionCount = block.getPositionCount();
        // append one value at a time through the type interface
        for (int j = 0; j < positionCount; j++) {
            BIGINT.appendTo(block, j, longArrayBlockBuilder);
        }
        Block outputBlock = longArrayBlockBuilder.build();
    }
}
- Using bulk copy
@Benchmark
public void copyBlockByLoop(BenchmarkData data)
{
    for (int i = 0; i < BLOCK_COUNT; i++) {
        Block block = data.blocks.get(i);
        int positionCount = block.getPositionCount();
        long[] values = new long[positionCount];
        // bulk copy the values in a tight loop
        for (int j = 0; j < positionCount; j++) {
            values[j] = block.getLong(j);
        }
        // copyIsNulls (helper not shown) collects the per-position null flags into a boolean[]
        boolean[] valueIsNull = copyIsNulls(block);
        Block outputBlock = new LongArrayBlock(positionCount, Optional.of(valueIsNull), values);
    }
}
Even though the second code snippet copies through the Block interface, it was over 4x faster than the first one:
Copying blocks this way thus gives us acceptable performance. To keep the code clean and well encapsulated, we added a new method, appendNull(), to the Block interface. For all primitive types like BOOLEAN, SMALLINT, INTEGER, BIGINT, etc., we create a new Block with all the positions copied and a NULL appended at the end of the block. For VariableWidthBlock, ArrayBlock, MapBlock, and RowBlock, we can get hold of the underlying elements block or the raw slice and then construct the offsets array, so there is no need to deep copy the nested structure. To do this, we need to convert the ArrayBlock, MapBlock, and RowBlock into the corresponding ColumnarXXX objects, but the cost of this conversion is well worth it.
Further reading
- Here is the original issue explaining the design: https://github.com/prestodb/presto/issues/13751
- Here is the main pull request that made it all happen: https://github.com/prestodb/presto/pull/13746