Presto – part 2: Data Analysis and Machine Learning in SQL
Posted 1 year ago by Milos Svana
In an earlier article, we introduced you to Presto, an open-source, distributed SQL query engine. It lets you use SQL to query data from many types of storage solutions. This includes data sources that don’t support SQL natively; for example, a JSON file stored on AWS S3. What’s even more interesting, is that Presto lets you perform joins across multiple databases based on different technologies.
In our view, Presto is a great tool for organizations that store their data in multiple data sources, and want to analyze this data without creating a data warehouse or a data lake.
To help you with these data analysis tasks, Presto offers many advanced features that are not available in most other data storage solutions. In this article, we want to explore these features. Using a dataset of Airbnb listings in Berlin, we will first show you some advanced mathematical and statistical functions useful for initial data exploration. Then, we will train a simple machine learning model for predicting the price of individual listings. Directly in SQL!
DISCLAIMER: Presto’s machine learning features shown in this article are still under development and not recommended for commercial use.
Our data: Airbnb listings in Berlin
Before we start with the analysis, it’s a good idea to briefly describe the data we are using. We chose the Berlin Airbnb Data dataset from Kaggle. To show that Presto can indeed work with data stored across multiple databases, we split the dataset into several parts, each stored in a different database. The contents of the neighborhoods.csv file are stored in a MySQL table. We stored the data from listings_summary.csv and calendar_summary.csv in PostgreSQL. Finally, we did a bit of preprocessing and one-hot encoded the amenities column from the listings_summary.csv file. The results are stored as a MongoDB collection. The same solution was chosen for the data inside reviews_summary.csv.
After we had the data inplace, we defined proper Presto catalogs, as shown in the previous article. Using a series of SHOW statements, we could then check the basic structure of our data, for example:
presto> SHOW catalogs; Catalog ------------ ... mongodb postgresql mysql ... presto> SHOW TABLES FROM mongodb.presto; Table ------------------ amenities_onehot reviews presto> SHOW TABLES FROM postgresql.public; Table ------------------ calendar_summary listings
Advanced aggregation functions
Presto offers many functions useful for basic statistical analysis of your data. Features like this are often missing in relational databases, such as MySQL or PostgreSQL.
One of the most common tools in data analysis is histograms. Presto implements two functions to provide all the data required by a charting library or an application to draw a histogram. The first of these two functions is HISTOGRAM(column). It counts the occurrences of different values of a categorical variable. We can, for example, see how different property types are distributed across different neighborhoods:
presto> SELECT property_type, CAST(HISTOGRAM(neighbourhood) AS JSON) AS histogram FROM postgresql.public.listings GROUP BY property_type;
The second function, NUMERIC_HISTOGRAM(buckets, column), is the traditional histogram many of you know from statistics. It lets you approximate the distribution of a numeric variable. Let’s check how the price of listings is distributed for different property types:
presto> SELECT property_type, CAST(NUMERIC_HISTOGRAM(15, price) AS JSON) AS histogram FROM postgresql.public.listings GROUP BY property_type;
Another incredibly useful concept for analyzing relationships between two variables is correlation (usually calculated as a Pearson correlation coefficient). In Presto, the corresponding function is called CORR(). Since we are interested in the prediction of price, let’s calculate the correlation between price and the number of bedrooms, and also between price and the overall review score:
presto> SELECT CORR(price, bedrooms) as corr_price_bedrooms, CORR(price, review_scores_rating) as corr_price_review FROM postgresql.public.listings; corr_price_bedrooms | corr_price_review ---------------------+---------------------- 0.2960270209009118 | 0.017288321444897812
We see that there is almost no correlation between the price of a listing and its review score (the correlation coefficient is close to 0). On the other hand, the correlation between the price and number of bedrooms is slightly positive. This suggests that more bedrooms means higher prices. This can be a useful feature for our machine learning model.
Finally, let’s have a look at standard deviation. While a histogram helps us approximate the distribution of prices graphically, standard deviation numerically describes how “wide” the distribution is. We can examine how wide the price range is for different property types. In this case, we will use the population standard deviation, as we assume that our dataset covers almost all Airbnb listings in Berlin at a given time:
presto> SELECT property_type, STDDEV_POP(price) AS price_std_dev FROM postgresql.public.listings GROUP BY property_type; property_type | price_std_dev ------------------------+-------------------- ... House | 76.23012722319825 Apartment | 91.76427627084819 ...
Focusing on houses and apartments, we see that the range of prices should be wider for the latter.
This was just a very short selection of the vast amount of useful functions offered by Presto. Many more can be found in Presto’s official documentation. We definitely recommend checking out mathematical and aggregate functions.
Training a machine learning model in PrestoDB
Although still in its very early stages, not recommended for commercial use, and not that well documented, Presto also lets you train simple machine learning models on your data. In our case, we will attempt to predict the listing price using various features.
Let’s start with something very simple and train a model that uses 3 features or independent variables to predict the price of a listing. These features are the number of beds, the number of of people the owner can accommodate, and the number of bathrooms:
presto> SELECT learn_regressor( price, features(accomodates, bathrooms, beds)) AS model FROM postgresql.public.listings;
This query trains the model (an SVM regressor) on all available listings. We don’t split the data into a train set as a test set for evaluation. The learn_regressor() function responsible for training the model takes 2 arguments. The first one is the variable we want to predict – price. The second argument then defines the features that should be used for prediction. The features() function is a simple helper that takes a list of columns and transforms it into a format required by Presto. As of now, this function is limited to 10 columns.
The simple query above trains the model, but it doesn’t use it to make any predictions. Let’s fix that, and use the model we train to predict the price of a new previously unseen listing:
presto> SELECT regress(features(2,1,2), model) AS result FROM ( SELECT learn_regressor( price, features(accomodates, bathrooms, beds)) AS model FROM postgresql.public.listings ); result ------------------- 48.18489887154577
In the query, we trained a regression model just as before, but then we followed up with calling the regress() function to predict the price of a listing that accommodates 2 people, has one bathroom, and 2 separate beds. The predicted price is a little over $48.
Even though the features() function is limited to 10 columns, we can train models with a larger number of features if we make the transformation to a proper format ourselves. The learn_regressor() function requires the features to be a map, where keys can be simple integers, and the values are columns used as features. Let’s train a model that uses a few more features. We extend the list by involving one-hot encoded amenities stored in MongoDB. This data is simple. The value is 1 if a given listing provides the amenity, or 0, if it doesn’t:
presto -> SELECT learn_regressor( p.price, map(sequence(1, 21), ARRAY[ p.accomodates, p.bathrooms, p.beds, p.guests_included, o.internet, o.game_console, o.wide_doorway, o.paid_parking_off_premises, o.stove, o.dishwasher, o.self_check_in, o.gym, o.private_entrance, o.crib, o.patio_or_balcony, o.coffee_maker, o.microwave, o.dryer, o.long_term_stays_allowed, o.iron, o.cable_tv])) AS model FROM postgresql.public.listings p JOIN mongodb.presto.amenities_onehot o ON o.id = p.id;
One of the most important features related to machine learning that is still missing at the time of publication is model permanence. After we train a model, it’s not possible to store it for later use. The model has to be trained from scratch every time we want to make a prediction. As of now, Presto is therefore not the best solution if you need to use your ML model in real time.
Evaluation of the regression model
If we want to test the performance of a machine learning model, it is a good idea to split our dataset into at least 2 parts: train and test datasets. The train dataset is used to train the model. The test dataset then lets us evaluate its performance on previously unseen data. Making predictions based on data the model wasn’t trained on is the purpose of machine learning.
To split our listings into these 2 groups, we add a new column to the listings table. We call this column test_data, and it contains a simple binary flag that, if set to 1, indicates that a given listing is a part of the test dataset. If the flag is set to 0, the listing will be used for training.
presto> ALTER TABLE postgresql.public.listings ADD COLUMN test_data int SET DEFAULT 0;
We then split our data using a very simple rule of thumb. If the listing ID can be divided by 11, we will assign it to the test dataset. Otherwise, it falls into the train dataset:
presto> UPDATE listings SET test_data = 1 WHERE id % 11 = 0;
There are multiple evaluation metrics we can use to determine the quality of our model. As of now, Presto offers many evaluation functions for classifiers but their regression counterparts are missing. If we know the formula for calculating a given metric, we can work around this issue by implementing the calculation we need by ourselves.
We can, for example, easily calculate the Mean Absolute Error, which, considering all listings in the test dataset, simply tells us the average difference between the price predicted by our model and its real value:
presto> SELECT SUM( ABS(o.price - regress(features(o.accomodates, o.bathrooms, o.beds), model))) /COUNT(o.price) AS MAE FROM ( SELECT learn_regressor( p.price, features(p.accomodates, p.bathrooms, p.beds)) AS model FROM postgresql.public.listings p WHERE p.test_data = 0) CROSS JOIN postgresql.public.listings o WHERE o.test_data = 1; MAE ------------------- 21.68637564179335
Our simple model with just 3 features on average deviates from the real value by about $21.70.
We can even calculate the coefficient of determination, a metric used very often in econometrics:
presto> SELECT 1 - SUM( POW(o.price - regress(features(o.accomodates, o.bathrooms, o.beds), model),2)) /SUM(POW(o.price - mean_price,2)) AS R2 FROM ( SELECT learn_regressor( p.price, features(p.accomodates, p.bathrooms, p.beds)) AS model FROM postgresql.public.listings p WHERE p.test_data = 0) CROSS JOIN postgresql.public.listings o WHERE o.test_data = 1; R2 -------------------- 0.3271195352185564
The resulting value tells us that our model explains approximately 33% of the variance in the prices of listings.
Since machine learning features in Presto are still under development, no official documentation is provided for the functions we just showed you. To fill this gap, we decided to create a short document describing the ML functions we used in our experiments, as well as some others that you might find useful.
The option to perform advanced statistical analysis, and even train basic machine learning models directly in SQL is an interesting one. In line with what we described as the main strength of Presto in the previous article, these features let you quickly explore your data without moving it around, or creating additional Jupyter Notebooks or Python/R scripts. This can save a lot of time during your initial experiments.
At the same time, we still advise caution when using machine learning functions. As we mentioned several times, they are still under development and not recommended for commercial use. The main limitation is the inability to store your models after they are trained. The second important issue is that Presto can currently handle only small, SVM-based models. Yet, all these features give you a sneak peak on what you can expect from Presto in the future.