Complete Table Scan: A Quantitative Assessment

July 23, 2019

In the previous article, we looked at the abstract problem statement and the possibilities inherent in scanning tables. In this piece, we quantify the upside with Presto by looking at a number of queries and explaining the findings.

The initial impulse motivating this work is the observation that table scan is by far the #1 operator in the Presto workloads I have seen. It accounts for a little over half of all Presto CPU, with repartitioning a distant second at around 1/10 of the total. The other half of the motivation is ready opportunity: Presto in its pre-Aria state does almost none of the things that are common in table scan.

Everything You Always Wanted To Do in Table Scan

June 29, 2019

Orri Erling

Orri Erling, Maria Basmanova, Ying Su, Timothy Meehan, Elon Azoulay

Table scan, on the face of it, sounds trivial and boring. What’s there in just reading a long bunch of records from first to last? Aren’t indexing and other kinds of physical design more interesting?

As data has gotten bigger, the columnar table scan has only gotten more prominent. The columnar scan is a fairly safe baseline operation: The cost of writing data is low, the cost of reading it is predictable.

Another factor that makes the table scan the main operation is the omnipresent denormalization in data warehouses. This only goes further as a result of the ubiquitous use of lists, maps, and other non-first-normal-form data.

The aim of this series of articles is to lay out the full theory and practice of table scan with all angles covered. We will see that this is mostly a matter of common sense and the systematic application of a few principles: do not do extra work, and do whatever work you do in bulk. Many systems, like Google’s BigQuery, do some subset of the optimizations outlined here. Doing all of them is, however, far from universal in the big data world, so there is a point in laying this all out and making a model implementation on top of Presto. Here we are talking about the ORC format, but the same things apply equally to Parquet or to JSON shredded into columns.

Introducing the Presto blog

June 28, 2019

Orri Erling

Presto is a key piece of data infrastructure at many companies. The community has many ongoing projects for taking it to new levels of performance and functionality, along with unique experience and insight into the challenges of scale.

We are opening this blog as an informal channel for discussing our work as well as technology trends and issues that affect the big data and data warehouse world at large. Our development continues to take place on GitHub and can thus be followed by everybody. Here we seek to have a channel that is more concise and interesting to a broader readership than GitHub issues and code comments would be.

We have current projects like Aria Presto for doubling CPU efficiency and Presto Unlimited for enabling fault-tolerant execution of very large queries. We are running one of the world’s largest data warehouses and thus have a unique perspective on platform technologies (e.g. C++ vs. Java), data analytics usage patterns, the integration of machine learning and databases, data center infrastructure for supporting these, and much more. Some of the big questions we are facing have to do with optimizing infrastructure at scale and designing the future of interoperable file formats and metadata. Today we are running ORC on Presto and Spark, and system-specific file formats for diverse online systems. We are constantly navigating the strait between universality and specialization and keep looking for ways to generalize while advancing functionality and performance.

The Presto user and developer community involves many of the world’s leading technology players. There is exciting work in progress around Presto at many of these companies, and we look forward to tracking it here too. Articles from the Presto world are welcome. Stay tuned for everything Presto.
