PrestoCon and Growing Industry Consortium - Intel and Upsolver Join Presto Foundation

November 20, 2020

Presto Foundation joined the Linux Foundation over a year ago, and has been focused on growing the Presto open source project and community. We encourage industry involvement with an open charter, clear guiding principles, and community-oriented goals. We recently hosted PrestoCon 2020, our first annual community conference, which was widely attended and well represented by Presto community members. We also warmly welcome Intel and Upsolver who recently joined the Presto Foundation.

Presto Enables Internal Log Data Analysis at Drift

October 29, 2020

Arun Venkateswaran

I’m a Senior Software Engineer in the data group at Drift, a conversational marketing platform that is used for qualifying leads faster, automatically booking meetings and connecting customers to the right business solutions more efficiently. I’ve used Presto quite a bit throughout my career, and I want to first give readers a quick overview of how Presto has enabled my team at Drift to quickly and cost-effectively analyze distributed logs at scale. Then I will share how we used and benefited from Presto at Vistaprint, where I worked previously.

Even Faster Unnest

August 20, 2020

Ying Su

Ying Su, Masha Basmanova, Orri Erling

Unnest is a common operation in Facebook’s daily Presto workload. It converts an ARRAY, MAP, or ROW into a flat relation. Its original implementation used deep copy all the time and was very inefficient. In Unnest Operator Performance Enhancement with Dictionary Blocks, the author improved the Unnest operator by up to 10x in CPU and elapsed times by using DictionaryBlock when possible. We went one step further and improved it for another 5-10x.

Getting Started with PrestoDB and Aria Scan Optimizations

August 14, 2020

Adam Shook

This article was originally published by Adam on June 15th, 2020 over at his blog at datacatessen.com.

PrestoDB recently released a set of experimental features under their Aria project in order to increase table scan performance of data stored in ORC files via the Hive Connector. In this post, we'll check out these new features at a very basic level using a test environment of PrestoDB on Docker. To find out more about the Aria features, you can check out the Facebook Engineering blog post which was published June 2019.

Building a high-performance platform on AWS to support real-time gaming services using Presto and Alluxio

August 6, 2020

Teng Wang

Authors: Teng Wang, Du Li, Yu Jin and Sundeep Narravula

Electronic Arts (EA) is a leading company in the gaming industry, providing dozens of games to serve billions of users worldwide each year. Making near real-time decisions for EA’s online services is critical for our business. This blog describes a data platform on AWS based on Presto and Alluxio to support online services with instantaneous response within the gaming industry.

PrestoDB and Apache Hudi

August 4, 2020

Bhavani Sudha Saktheeswaran

Co-author: Brandon Scheller

Apache Hudi is a fast growing data lake storage system that helps organizations build and manage petabyte-scale data lakes. Hudi brings stream style processing to batch-like big data by introducing primitives such as upserts, deletes and incremental queries. These features help surface faster, fresher data on a unified serving layer. Hudi tables can be stored on the Hadoop Distributed File System (HDFS) or cloud stores and integrates well with popular query engines such as Presto, Apache Hive, Apache Spark and Apache Impala. Given Hudi pioneered a new model that moved beyond just writing files to a more managed storage layer that interops with all major query engines, there were interesting learnings on how integration points evolved.

In this blog we are going to discuss how the Presto-Hudi integration has evolved over time and also discuss upcoming file listing and query planning improvements to Presto-Hudi queries.

Running Presto in a Hybrid Cloud Architecture

July 17, 2020

Adit Madan

Migrating SQL workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.

This crucial bottleneck severely limits performance of any workload since a significant portion of its time is spent transferring the requested data between networks that could be residing in geographically disparate locations. As a result, most companies copy their data into a cloud environment and maintain that duplicate data, also known as Lift and Shift. Companies with compliance and data sovereignty requirements may even prevent organizations from copying data into the cloud. This approach is not scalable and requires introducing a lot of manual effort to achieve reasonable results. This article introduces Alluxio to serve as a data orchestration layer to help serve data to Presto efficiently, as opposed to either directly querying the distant HDFS cluster or manually providing a localized copy of the data to Presto in a cloud cluster.

Data Lake Analytics: Alibaba's Federated Cloud Strategy

June 30, 2020

George Wang

Presto is known to be a high-performance, distributed SQL query engine for Big Data. It offers large-scale data analytics with multiple connectors for accessing various data sources. This capability enables the Presto users to further extend some features to build a large-scale data federation service on cloud.

Alibaba Data Lake Analytics embraces Presto’s federated query engine capability and has accumulated a number of successful business use cases that signify the power of Presto's analytics capability.

Improving Presto Latencies with Alluxio Data Caching

June 16, 2020

Rohit Jain

Facebook: Rohit Jain, James Sun, Ke Wang, Shixuan Fan, Biswapesh Chattopadhyay, Baldeep Hira

Alluxio: Bin Fan, Calvin Jia, Haoyuan Li

The Facebook Presto team has been collaborating with Alluxio on an open source data caching solution for Presto. This is required for multiple Facebook use-cases to improve query latency for queries that scan data from remote sources such as HDFS. We have observed significant improvements in query latencies and IO scans in our experiments.

Spatial Joins 1: Local Spatial Joins

May 7, 2020

James Gill

A common type of spatial query involves relating one table of geometric objects (e.g., a table population_centers with columns population, latitude, longitude) with another such table (e.g., a table counties with columns county_name, boundary_wkt), such as calculating for each county the population sum of all population centers contained within it. These kinds of calculations are called spatial joins. While doing it for a single row each from population_centers and counties is manageable, doing it efficiently for two large tables is challenging. In this post, we'll talk about the machinery that Presto has built to make these queries blazingly fast.

← Prev Next →